Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads

Laurent Gautier, Ole Lund
(Submitted on 6 Jun 2013)

Cheap high-throughput DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples.
We propose a novel general approach to the analysis of sequencing data in which the reference genome does not have to be specified. Using a distributed architecture we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data, and the hints can be used for more computationally-demanding work.
Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references known to the server. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment.
To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients, one of them running in a web browser, in order to demonstrate that gigabytes of raw sequencing reads of unknown origin could be identified without the need to transfer a very large volume of data, and on modestly powered computing devices.
Web access is available at this http URL. The source code for a Python command-line client, a server, and supplementary data is available at this http URL.
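
The client-side step described above (sending only a sample of the unidentified reads) can be sketched as follows. This is a minimal illustration, not the authors' client: the helper names `iter_fastq` and `sample_reads`, the sample size, and the use of reservoir sampling are all assumptions, and the request to the server is left out because the abstract does not describe its interface.

```python
import io
import random


def iter_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def sample_reads(records, k, seed=None):
    """Reservoir-sample k records uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                reservoir[j] = rec
    return reservoir


# Hypothetical input: 1000 synthetic reads standing in for a large FASTQ file.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(1000))
subset = sample_reads(iter_fastq(io.StringIO(fastq_text)), k=10, seed=0)
payload = [seq for _, seq, _ in subset]  # the only data that would leave the client
```

Reservoir sampling reads the file once and keeps only `k` records in memory, which suits a modestly powered client; any other uniform sampling scheme would serve the same purpose.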

SPATA: A Seeding and Patching Algorithm for Hybrid Transcriptome Assembly

Tin Chi Nguyen, Zhiyu Zhao, Dongxiao Zhu
(Submitted on 6 Jun 2013)

Transcriptome assembly from RNA-Seq reads is an active area of bioinformatics research. The ever-declining cost and the increasing depth of RNA-Seq have provided unprecedented opportunities to better identify expressed transcripts. However, the nonlinear transcript structures and the ultra-high throughput of RNA-Seq reads pose significant algorithmic and computational challenges to the existing transcriptome assembly approaches, either reference-guided or de novo. While reference-guided approaches offer good sensitivity, they rely on alignment results of the splice-aware aligners and are thus unsuitable for species with incomplete reference genomes. In contrast, de novo approaches do not depend on the reference genome but face a computationally daunting task derived from the complexity of the graph built for the whole transcriptome. In response to these challenges, we present a hybrid approach to exploit an incomplete reference genome without relying on splice-aware aligners. We have designed a split-and-align procedure to efficiently localize the reads to individual genomic loci, which is followed by an accurate de novo assembly to assemble reads falling into each locus. Using extensive simulation data, we demonstrate high accuracy and precision in transcriptome reconstruction in comparison with selected transcriptome assembly tools. Our method is implemented in assemblySAM, a GUI software freely available at this http URL.

Hide and seek: placing and finding an optimal tree for thousands of homoplasy-rich sequences

Dietrich Radel, Andreas Sand, Mike Steel
(Submitted on 6 Jun 2013)

Finding optimal evolutionary trees from sequence data is typically an intractable problem, and there is usually no way of knowing how close to optimal the best tree from some search truly is. The problem would seem to be particularly acute when we have many taxa and when the data has high levels of homoplasy, in which the individual characters require many changes to fit on the best tree. However, a recent mathematical result has provided a precise tool to generate a small number of high-homoplasy characters for any given tree, so that this tree is provably the optimal tree under the maximum parsimony criterion. This provides, for the first time, a rigorous way to test tree search algorithms on homoplasy-rich data, where we know in advance what the 'best' tree is. In this short note we consider just one search program (TNT) but show that it is able to locate the globally optimal tree correctly for 32,768 taxa, even though the characters in the dataset require, on average, 1148 state-changes each to fit on this tree, and the number of characters is only 57.
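
The number of state changes a character needs to fit on a tree, the quantity counted above, can be computed with Fitch's small-parsimony algorithm. The sketch below is illustrative only: the function name, the nested-tuple tree encoding, and the toy data are assumptions, and it is not the procedure used by TNT or by the authors to construct their characters.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   nested 2-tuples with leaf names (strings) at the tips.
    states: dict mapping each leaf name to its observed state.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common
        changes += 1  # disjoint child sets force one change at this node
        return left | right

    visit(tree)
    return changes


# Toy example (hypothetical data): five taxa, one binary character.
tree = ((("A", "B"), "C"), ("D", "E"))
character = {"A": 0, "B": 1, "C": 0, "D": 1, "E": 1}
score = fitch_changes(tree, character)
```

A binary character needs at least one change on any tree, so a score of two on this toy tree means one extra, homoplastic change. The maximum parsimony score of a tree is the sum of such counts over all characters.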

Density behavior of spatial birth-and-death stochastic evolution of mutating genotypes under selection rates

Dmitri Finkelshtein, Yuri Kondratiev, Oleksandr Kutoviy, Stanislav Molchanov, Elena Zhizhina
(Submitted on 5 Jun 2013)

We consider birth-and-death stochastic evolution of genotypes with different lengths. The genotypes may mutate, which produces a stochastic change of lengths according to a free diffusion law. The birth and death rates are length-dependent, which corresponds to a selection effect. We study the asymptotic behavior of the density for an infinite collection of genotypes. The cases of space-homogeneous and space-heterogeneous densities are considered.

Coalescence, genetic diversity and adaptation in sexual populations

Richard A. Neher, Taylor A. Kessinger, Boris I. Shraiman
(Submitted on 5 Jun 2013)

In diverse sexual populations, selection operates neither on the whole genome, which is repeatedly taken apart and reassembled by recombination, nor on individual alleles, which are tightly linked to the chromosomal neighborhood. Those tightly linked alleles affect each other's dynamics, which reduces the efficiency of selection and distorts patterns of genetic diversity. Inference of evolutionary history from diversity shaped by linked selection requires an understanding of these patterns. Here, we reexamine this problem in the light of recent progress in coalescent theory of rapidly adapting asexual populations. We present a simple but powerful scaling analysis identifying the unit of selection as the genomic “linkage block” with characteristic length \xi_b, which is determined in a self-consistent manner by the condition that the rate of recombination within the block is comparable to the fitness differences between different alleles of the block. We find that an asexual model with strength of selection tuned to that of the linkage block provides an excellent description of genetic diversity and the site frequency spectra when compared to computer simulations of population dynamics. This correspondence holds for the entire spectrum of strength of selection. When fitness differentials arise from the collective contribution of numerous weakly selected polymorphisms, the rate of adaptation increases as the square root of the recombination rate. The linkage block approximation thus provides a simple but powerful tool for understanding interference and collective behavior of dense weakly selected loci.
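
The self-consistency condition for \xi_b can be written out as a scaling sketch. The notation below is assumed for illustration and does not come from the abstract: \rho is the crossover rate per unit length, L the genome length, \sigma^2 the total fitness variance (taken here to be spread uniformly along the genome), and \sigma_b the spread in fitness among alleles of one block. Numerical and logarithmic factors are ignored, so the paper's own expressions may differ.

```latex
% Assumed notation, not from the abstract: \rho, L, \sigma^2, \sigma_b as defined above.
\begin{align}
  \sigma_b^2 &\approx \sigma^2 \, \frac{\xi_b}{L}, \\
  \rho \, \xi_b &\approx \sigma_b
  \quad\Longrightarrow\quad
  \xi_b \approx \frac{\sigma^2}{\rho^2 L}.
\end{align}
```

Under these assumptions the block shrinks as recombination becomes more frequent and grows with the total fitness variance.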

Interference limits resolution of selection pressures from linked neutral diversity

Benjamin H. Good, Aleksandra M. Walczak, Richard A. Neher, Michael M. Desai
(Submitted on 5 Jun 2013)

Pervasive natural selection can strongly influence observed patterns of genetic variation, but these effects remain poorly understood when multiple selected variants segregate in nearby regions of the genome. Classical population genetics fails to account for interference between linked mutations, which grows increasingly severe as the density of selected polymorphisms increases. Here, we describe a simple limit that emerges when interference is common, in which the fitness effects of individual mutations play a relatively minor role. Instead, molecular evolution is determined by the variance in fitness within the population, defined over an effectively asexual segment of the genome (a “linkage block”). We exploit this insensitivity in a new “coarse-grained” coalescent framework, which approximates the effects of many weakly selected mutations with a smaller number of strongly selected mutations with the same variance in fitness. This approximation generates accurate and efficient predictions for the genetic diversity that cannot be summarized by a simple reduction in effective population size. However, these results suggest a fundamental limit on our ability to resolve individual selection pressures from contemporary sequence data alone, since a wide range of parameters yield nearly identical patterns of sequence variability.

Reconstructing the Population Genetic History of the Caribbean

Andres Moreno-Estrada, Simon Gravel, Fouad Zakharia, Jacob L. McCauley, Jake K. Byrnes, Christopher R. Gignoux, Patricia A. Ortiz-Tello, Ricardo J. Martinez, Dale J. Hedges, Richard W. Morris, Celeste Eng, Karla Sandoval, Suehelay Acevedo-Acevedo, Juan Carlos Martinez-Cruzado, Paul J. Norman, Zulay Layrisse, Peter Parham, Esteban Gonzalez Burchard, Michael L. Cuccaro, Eden R. Martin, Carlos D. Bustamante
(Submitted on 3 Jun 2013)

The Caribbean basin is home to some of the most complex interactions in recent history among previously diverged human populations. Here, by making use of genome-wide SNP array data, we characterize ancestral components of Caribbean populations on a sub-continental level and unveil fine-scale patterns of population structure distinguishing insular from mainland Caribbean populations as well as from other Hispanic/Latino groups. We provide genetic evidence for an inland South American origin of the Native American component in island populations and for extensive pre-Columbian gene flow across the Caribbean basin. The Caribbean-derived European component shows significant differentiation from parental Iberian populations, presumably as a result of founder effects during the colonization of the New World. Based on demographic models, we reconstruct the complex population history of the Caribbean since the onset of continental admixture. We find that insular populations are best modeled as mixtures absorbing two pulses of African migrants, coinciding with early and maximum activity stages of the transatlantic slave trade. These two pulses appear to have originated in different regions within West Africa, imprinting two distinguishable signatures in present-day Afro-Caribbean genomes and shedding light on the genetic impact of the dynamics occurring during the slave trade in the Caribbean.

biobambam: tools for read pair collation based algorithms on BAM files

German Tischler, Steven Leonard
(Submitted on 4 Jun 2013)

Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order paired reads are usually separated in the file, which complicates other applications, such as duplicate marking or conversion to the FastQ format, that require access to the full information of the pairs. In this paper we introduce biobambam, an API for efficient BAM file reading that supports the efficient collation of alignments by read name without a complete resorting of the input file, together with some tools based on this API that perform tasks like marking duplicate reads and conversion to the FastQ format. Compared with previous approaches to problems involving the collation of alignments by read name, such as the BAM to FastQ or duplicate marking utilities in the Picard suite, biobambam can often perform an equivalent task more efficiently in terms of required main memory and run-time.
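
The idea of collating mates by read name in a single pass, without resorting the file, can be illustrated with a generic hash-table sketch. This is not biobambam's API or implementation, which the abstract does not detail: the function name, the tuple layout of the input stream, and the toy coordinates are assumptions, and a real tool would also need to bound the memory used for reads still waiting for their mate.

```python
def collate_pairs(alignments):
    """Pair up mates from a stream in arbitrary (e.g. coordinate) order.

    alignments: iterable of (read_name, is_first_mate, payload) tuples.
    Returns (pairs, orphans): pairs is a list of (name, first, second);
    orphans maps the names of reads whose mate never appeared to their record.
    """
    pending = {}  # reads seen so far whose mate has not arrived yet
    pairs = []
    for name, is_first, payload in alignments:
        if name not in pending:
            pending[name] = (is_first, payload)
            continue
        _, mate_payload = pending.pop(name)
        if is_first:
            pairs.append((name, payload, mate_payload))
        else:
            pairs.append((name, mate_payload, payload))
    return pairs, pending


# Hypothetical coordinate-sorted stream: mates of r1 and r2 are interleaved.
stream = [
    ("r1", True, "chr1:100"),
    ("r2", True, "chr1:150"),
    ("r1", False, "chr1:300"),
    ("r3", True, "chr1:320"),
    ("r2", False, "chr1:400"),
]
pairs, orphans = collate_pairs(stream)
```

Only reads whose mate has not yet been seen are held in memory, so the working set depends on how far apart mates lie in the file rather than on the file size.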

Populations in statistical genetic modelling and inference

Daniel John Lawson
(Submitted on 4 Jun 2013)

What is a population? This review considers how a population may be defined in terms of understanding the structure of the underlying genetics of the individuals involved. The main approach is to consider statistically identifiable groups of randomly mating individuals, which is well defined in theory for any type of (sexual) organism. We discuss generative models using drift, admixture and spatial structure, and the ancestral recombination graph. These are contrasted with statistical models for inference, principal component analysis and other 'non-parametric' methods. The relationships between these approaches are explored with both simulated and real-data examples. The state-of-the-art practical software tools are discussed and contrasted. We conclude that populations are a useful theoretical construct that can be well defined in theory and often approximately exist in practice.

Genetic Complexity in a Drosophila Model of Diabetes-Associated Misfolded Human Proinsulin

Soo-Young Park, Michael Z. Ludwig, Natalia A. Tamarina, Bin Z. He, Sarah H. Carl, Desiree A. Dickerson, Levi Barse, Bharath Arun, Calvin Williams, Cecelia M. Miles, Louis H. Philipson, Donald F. Steiner, Graeme I. Bell, Martin Kreitman
(Submitted on 31 May 2013)

Here we use Drosophila melanogaster to create a genetic model of human permanent neonatal diabetes mellitus and present experimental results describing dimensions of this complexity. The approach involves the transgenic expression of a misfolded mutant of human preproinsulin, hINSC96Y, which is a cause of the disease. When expressed in fly imaginal discs, hINSC96Y causes a reduction of adult structures, including the eye, wing and notum. Eye imaginal discs exhibit defects in both the structure and arrangement of ommatidia. In the wing, expression of hINSC96Y leads to ectopic expression of veins and mechano-sensory organs, indicating disruption of wild type signaling processes regulating cell fates. These readily measurable disease phenotypes are sensitive to temperature, gene dose and sex. Mutant (but not wild type) proinsulin expression in the eye imaginal disc induces IRE1-mediated Xbp1 alternative splicing, a signal for endoplasmic reticulum stress response activation, and produces global change in gene expression. Mutant hINS transgene tester strains, when crossed to stocks from the Drosophila Genetic Reference Panel, produce F1 adults with a continuous range of disease phenotypes and large broad-sense heritability. Surprisingly, the severity of mutant hINS-induced disease in the eye is not correlated with that in the notum in these crosses, nor with eye reduction phenotypes caused by the expression of two dominant eye mutants acting in two different eye development pathways, Drop (Dr) or Lobe (L), when crossed into the same genetic backgrounds. The tissue specificity of genetic variability for mutant hINS-induced disease thus has its own distinct signature. The genetic dominance of disease-specific phenotypic variability makes this approach amenable to genome-wide association study (GWAS) in a simple F1 screen of natural variation.