Computational aspects of DNA mixture analysis

Therese Graversen, Steffen Lauritzen
(Submitted on 18 Jul 2013)

Statistical analysis of DNA mixtures is known to pose computational challenges due to the enormous state space of possible DNA profiles. We propose a Bayesian network representation for genotypes, allowing computations to be performed locally involving only a few alleles at each step. In addition, we describe a general method for computing the expectation of a product of discrete random variables using auxiliary variables and probability propagation in a Bayesian network, which in combination with the genotype network allows efficient computation of the likelihood function and various other quantities relevant to the inference. Lastly, we introduce a set of diagnostic tools for assessing the adequacy of the model for describing a particular dataset.

Integrating sequencing datasets to form highly confident SNP and indel genotype calls for a whole human genome

Justin M. Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, Marc Salit
(Submitted on 17 Jul 2013)

Clinical adoption of human genome sequencing requires methods with known accuracy of genotype calls at millions or billions of positions across a genome. Previous work showing discordance amongst sequencing methods and algorithms has made clear the need for a highly accurate set of genotypes across a whole genome that could be used as a benchmark. We present methods we used to make highly confident SNP, indel, and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. To minimize bias towards any sequencing method, we integrate 9 whole genome and 3 exome datasets from 5 different sequencing platforms (Illumina, Complete Genomics, SOLiD, 454, and Ion Torrent), 7 mappers, and 3 variant callers. The resulting genotype calls are highly sensitive and specific, and allow performance assessment of more difficult variants than typically investigated using microarrays as a benchmark. Regions for which no confident genotype call could be made are identified as uncertain, and classified into different reasons for uncertainty (e.g. low coverage, mapping/alignment bias, etc.). As a community resource, we have integrated our highly confident genotype calls into the GCAT website for interactive assessment of false positive and negative rates of different datasets and bioinformatics methods using our highly confident calls. Application of the concepts of our integration process may be interesting beyond whole genome sequencing, for other measurement problems with large datasets from multiple methods, where none of the methods is a Reference Method that can be relied upon as highly sensitive and specific.

QuorUM: an error corrector for Illumina reads

Guillaume Marçais, James A. Yorke, Aleksey Zimin
(Submitted on 12 Jul 2013)

Motivation: Illumina sequencing data can provide high coverage of a genome by relatively short (100 bp to 150 bp) reads at a low cost. Our goal is to produce trimmed and error-corrected reads to improve genome assemblies. Our error correction procedure aims at producing a set of error-corrected reads that (1) minimizes the number of distinct false k-mers in the set of reads, i.e. k-mers that are not present in the genome, and (2) maximizes the number of distinct true k-mers, i.e. k-mers that are present in the genome. Because coverage of a genome by Illumina reads varies greatly from point to point, we cannot simply eliminate k-mers that occur rarely.
Results: Our software, called QuorUM, provides reasonably accurate correction and is suitable for large data sets (1 billion bases checked and corrected per day per core).
Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at this http URL
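
The two objectives stated in the Motivation can be made concrete with a small sketch. It is not QuorUM's algorithm; it only shows how the two k-mer counts could be scored when a reference genome is known (for example on simulated reads). The function names `kmers` and `kmer_objectives` are hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_objectives(reads, genome, k):
    """Count distinct true and false k-mers in a read set.

    A k-mer is 'true' if it occurs in the genome and 'false' otherwise.
    Assumes the genome is known, which only holds in benchmarking.
    """
    genome_kmers = kmers(genome, k)
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    n_true = len(read_kmers & genome_kmers)   # objective (2): maximize
    n_false = len(read_kmers - genome_kmers)  # objective (1): minimize
    return n_true, n_false


genome = "ACGTACGGA"
raw_reads = ["ACGTAC", "ACGTTC"]        # second read has one substitution
corrected_reads = ["ACGTAC", "ACGTAC"]  # the same reads after an ideal correction

before = kmer_objectives(raw_reads, genome, 3)
after = kmer_objectives(corrected_reads, genome, 3)
```

In this toy case a single substitution creates two distinct false 3-mers, and correcting it removes both without losing any true 3-mer.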

SInC: An accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data

Swetansu Pattnaik, Saurabh Gupta, Arjun A Rao, Binay Panda
(Submitted on 12 Jul 2013)

We report the SInC (SNV, Indel and CNV) simulator and read generator, an open-source tool capable of simulating biological variants while taking into account a platform-specific error model. SInC can simulate and generate single- and paired-end reads with a user-defined insert size, with high efficiency compared to other existing tools. Owing to its multi-threaded read generation, SInC has a low time footprint. SInC is currently optimised to work in a limited infrastructure setup and can efficiently exploit the commonly used quad-core desktop architecture to simulate short sequence reads with deep coverage for large genomes. SInC can be downloaded from this https URL

kruX: Matrix-based non-parametric eQTL discovery

Jianlong Qi, Hassan Foroughi Asl, Johan Bjorkegren, Tom Michoel
(Submitted on 12 Jul 2013)

The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of marker-trait combinations one-by-one can become computationally prohibitive. We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to simultaneously calculate the Kruskal-Wallis test statistic for several millions of marker-trait combinations at once. kruX is more than 3,000 times faster than computing associations one-by-one on a typical human dataset.
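
The matrix formulation can be illustrated with a minimal NumPy sketch. This is not the kruX code; it assumes complete data, no ties in the expression values (so no tie correction is applied), and genotypes coded as integers 0, 1, 2. The function name `kruskal_wallis_matrix` is hypothetical. The statistic for each pair is H = 12 / (N (N + 1)) * sum_g R_g^2 / n_g - 3 (N + 1), where R_g is the rank sum in genotype group g, and all rank sums come from one matrix product per genotype group.

```python
import numpy as np


def kruskal_wallis_matrix(expr, geno, n_groups=3):
    """Kruskal-Wallis H for every marker-trait pair at once.

    expr: (traits x samples) array of expression values, assumed tie-free.
    geno: (markers x samples) integer array with values in 0..n_groups-1.
    Returns a (markers x traits) array of H statistics.
    """
    n_traits, n = expr.shape
    # Rank each trait across samples (1..n); double argsort is valid without ties.
    ranks = expr.argsort(axis=1).argsort(axis=1) + 1.0
    total = np.zeros((geno.shape[0], n_traits))
    for g in range(n_groups):
        indicator = (geno == g).astype(float)         # markers x samples
        n_g = indicator.sum(axis=1, keepdims=True)    # group sizes, markers x 1
        rank_sums = indicator @ ranks.T               # markers x traits
        with np.errstate(divide="ignore", invalid="ignore"):
            # Empty genotype groups contribute nothing to the statistic.
            total += np.where(n_g > 0, rank_sums ** 2 / n_g, 0.0)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


expr = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])  # one trait, six samples
geno = np.array([[0, 0, 1, 1, 2, 2],               # marker with three groups
                 [0, 1, 0, 1, 0, 1]])              # marker with two groups
h = kruskal_wallis_matrix(expr, geno)
```

The loop runs only over the genotype groups, so the cost is dominated by a few dense matrix products rather than by one test per marker-trait pair.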

Cloudbreak: Accurate and Scalable Genomic Structural Variation Detection in the Cloud with MapReduce

Christopher W. Whelan, Jeffrey Tyner, Alberto L’Abbate, Clelia Tiziana Storlazzi, Lucia Carbone, Kemal Sönmez
(Submitted on 9 Jul 2013)

The detection of genomic structural variations (SV) remains a difficult challenge in analyzing sequencing data, and the growing size and number of sequenced genomes have rendered SV detection a bona fide big data problem. MapReduce is a proven, scalable solution for distributed computing on huge data sets. We describe a conceptual framework for SV detection algorithms in MapReduce based on computing local genomic features, and use it to develop a deletion and insertion detection algorithm, Cloudbreak. On simulated and real data sets, Cloudbreak achieves accuracy improvements over popular SV detection algorithms, and genotypes variants from diploid samples. It provides dramatically shorter runtimes and the ability to scale to big data volumes on large compute clusters. Cloudbreak includes tools to set up and configure MapReduce (Hadoop) clusters on cloud services, enabling on-demand cluster computing. Our implementation and source code are available at this http URL

A hierarchical network heuristic for solving the orientation problem in genome assembly

Karl R. B. Schmitt, Aleksey V. Zimin, Guillaume Marçais, James A. Yorke, Michelle Girvan
(Submitted on 1 Jul 2013)

In the past several years, the problem of genome assembly has received considerable attention from both biologists and computer scientists. An important component of current assembly methods is the scaffolding process. This process involves building ordered and oriented linear collections of contigs (continuous overlapping sequence reads) called scaffolds and relies on the use of mate pair data. A mate pair is a set of two reads that are sequenced from the ends of a single fragment of DNA, and therefore have opposite mutual orientations. When two reads of a mate pair are placed into two different contigs, one can infer the mutual orientation of these contigs. While several orientation algorithms exist as part of assembly programs, all encounter challenges while solving the orientation problem due to errors from mis-assemblies in contigs or errors in read placements. In this paper we present an algorithm based on hierarchical clustering that independently solves the orientation problem and is robust to errors. We show that our algorithm can correctly solve the orientation problem for both faux (generated) assembly data and real assembly data for R. sphaeroides bacteria. We demonstrate that our algorithm is stable to both changes in the initial orientations as well as noise in the data, making it advantageous compared to traditional approaches.
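
To make the orientation problem concrete, here is a naive baseline sketch. It is not the hierarchical clustering algorithm of the paper: it simply propagates orientations breadth-first along mate-pair links and then reports the links the result violates. The input format (tuples `(a, b, same)`, where `same` says whether the evidence puts contigs `a` and `b` in the same orientation) and the function name `propagate_orientations` are assumptions made for illustration.

```python
from collections import defaultdict, deque


def propagate_orientations(n_contigs, links):
    """Assign +1/-1 orientations to contigs by breadth-first propagation.

    links: list of (a, b, same) tuples derived from mate pairs.
    Returns (orientations, violated_links). Unlike an error-robust method,
    the first link seen for a contig decides its orientation.
    """
    adjacency = defaultdict(list)
    for a, b, same in links:
        sign = 1 if same else -1
        adjacency[a].append((b, sign))
        adjacency[b].append((a, sign))

    orient = [0] * n_contigs  # 0 means "not yet assigned"
    for start in range(n_contigs):
        if orient[start] != 0:
            continue
        orient[start] = 1  # arbitrary choice for each connected component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, sign in adjacency[u]:
                if orient[v] == 0:
                    orient[v] = orient[u] * sign
                    queue.append(v)

    # Links whose evidence disagrees with the final assignment.
    violated = [(a, b) for a, b, same in links
                if orient[a] * orient[b] != (1 if same else -1)]
    return orient, violated


clean_links = [(0, 1, False), (1, 2, True), (0, 2, False)]
orient, violated = propagate_orientations(4, clean_links)

# One erroneous link contradicting the others.
noisy_links = clean_links + [(0, 2, True)]
orient_noisy, violated_noisy = propagate_orientations(3, noisy_links)
```

With consistent links the propagation recovers an orientation that satisfies every link. With a contradictory link it can only flag the violation, and which links end up violated depends on traversal order, which is the kind of sensitivity to errors the abstract describes.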

A new DNA alignment method based on inverted index

Wang Liang, Zhao KaiYong
(Submitted on 30 Jun 2013)

This paper presents a novel DNA sequence alignment method based on the inverted index. Most large-scale information retrieval systems use the inverted index as their basic data structure, but its application to DNA sequence alignment has not yet been reported. This paper discusses such applications. Three main problems are detailed in turn: DNA segmenting, long DNA query search, and DNA search ranking (algorithm and evaluation method). This research presents a new avenue for building more effective DNA alignment methods.
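
As a rough illustration of the idea, the sketch below builds an inverted index over DNA "words" and ranks candidate hits. The segmentation (fixed-length overlapping k-mers) and the ranking (number of query k-mers agreeing on the same alignment offset) are assumptions chosen for simplicity, not the schemes developed in the paper. The names `build_index` and `search` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(sequences, k):
    """Map every k-mer to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def search(index, query, k):
    """Rank (sequence id, offset) candidates by the number of supporting k-mers.

    offset = position in the target minus position in the query, so k-mers
    from one ungapped match all vote for the same offset.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for seq_id, pos in index.get(query[qpos:qpos + k], ()):
            votes[(seq_id, pos - qpos)] += 1
    return votes.most_common()


sequences = {"s1": "TTACGTACGGATT", "s2": "GGGCCCAAATTT"}
index = build_index(sequences, 4)
hits = search(index, "ACGTACGG", 4)
best_hit = hits[0]
```

Here the query occurs in `s1` starting at position 2, so the top-ranked candidate is `("s1", 2)` with all five query 4-mers voting for it. A repeated 4-mer also produces a weaker, spurious candidate, which shows why a ranking step is needed at all.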

Bound to succeed: Transcription factor binding site prediction and its contribution to understanding virulence and environmental adaptation in bacterial plant pathogens

Surya Saha, Magdalen Lindeberg
(Submitted on 26 Jun 2013)

Bacterial plant pathogens rely on a battalion of transcription factors to fine-tune their response to changing environmental conditions and marshal the genetic resources required for successful pathogenesis. Prediction of transcription factor binding sites represents an important tool for elucidating regulatory networks, and has been conducted in multiple genera of plant pathogenic bacteria for the purpose of better understanding mechanisms of survival and pathogenesis. The major categories of transcription factor binding sites that have been characterized are reviewed here with emphasis on in silico methods used for site identification and challenges therein, their applicability to different types of sequence datasets, and insights into mechanisms of virulence and survival that have been gained through binding site mapping. An improved strategy for establishing E value cutoffs when using existing models to screen uncharacterized genomes is also discussed.

Efficient Two-Stage Group Testing Algorithms for Genetic Screening

Michael Huber
(Submitted on 19 Jun 2013)

Efficient two-stage group testing algorithms that are particularly suited for rapid and less-expensive DNA library screening and other large scale biological group testing efforts are investigated in this paper. The main focus is on novel combinatorial constructions in order to minimize the number of individual tests at the second stage of a two-stage disjunctive testing procedure. Building on recent work by Levenshtein (2003) and Tonchev (2008), several new infinite classes of such combinatorial designs are presented.
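
The two-stage disjunctive procedure itself can be sketched in a few lines. The pooling design used below (rows and columns of a 3 x 3 grid) is only a toy example and not one of the combinatorial designs constructed in the paper. The function name `two_stage_screen` and the `is_positive` oracle, which stands in for a laboratory test, are hypothetical.

```python
def two_stage_screen(pools, is_positive):
    """Two-stage disjunctive group testing.

    Stage 1 tests every pool; a pool is positive if it contains at least one
    positive item. Any item appearing in a negative pool is cleared.
    Stage 2 tests the remaining candidates individually.
    Returns (positives, number of stage-1 tests, number of stage-2 tests).
    """
    items = set().union(*pools)

    # Stage 1: pooled tests, which can all be run in parallel.
    cleared = set()
    for pool in pools:
        pool_is_positive = any(is_positive(item) for item in pool)
        if not pool_is_positive:
            cleared |= pool

    # Stage 2: individual tests on items that no negative pool cleared.
    candidates = items - cleared
    positives = {item for item in candidates if is_positive(item)}
    return positives, len(pools), len(candidates)


# Nine clones arranged in a 3 x 3 grid, pooled by row and by column.
rows = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]
cols = [{0, 3, 6}, {1, 4, 7}, {2, 5, 8}]
truly_positive = {4}

result = two_stage_screen(rows + cols, lambda item: item in truly_positive)
```

With a single positive clone, six pooled tests leave one candidate, so seven tests replace nine individual ones. The number of stage-2 tests is the quantity the abstract says the designs aim to minimize.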