Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data

Richard W Lusk
(Submitted on 30 Jan 2014)

Background: Trace quantities of contaminating DNA are widespread in the laboratory environment, but their presence has received little attention in the context of high throughput sequencing. This issue is highlighted by recent works that have rested controversial claims upon sequencing data that appear to support the presence of unexpected exogenous species.
Results: I used reads that preferentially aligned to alternate genomes to infer the distribution of potential contaminant species in a set of independent sequencing experiments. I confirmed that dilute samples are more exposed to contaminating DNA, and, focusing on four single-cell sequencing experiments, found that these contaminants appear to originate from a wide diversity of clades. Although negative control libraries prepared from “blank” samples recovered the highest-frequency contaminants, low-frequency contaminants, which appeared to make heterogeneous contributions to samples prepared in parallel within a single experiment, were not well controlled for. I used these results to show that, despite heavy replication and plausible controls, contamination can explain all of the observations used to support a recent claim that complete genes pass from food to human blood.
Conclusions: Contamination must be considered a potential source of signals of exogenous species in sequencing data, even if these signals are replicated in independent experiments, vary across conditions, or indicate a species which seems a priori unlikely to contaminate. Negative control libraries processed in parallel are essential to control for contaminant DNAs, but their limited ability to recover low-frequency contaminants must be recognized.


Impact of RNA degradation on measurements of gene expression

Irene Gallego Romero, Athma A. Pai, Jenny Tung, Yoav Gilad

The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. It is unclear whether transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of degradation can be normalized, or whether different transcripts are degraded at different rates, potentially biasing measurements of expression levels. Yet, low quality samples are at times the sole means of addressing specific questions – e.g., samples collected in the course of fieldwork. We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of RNA Integrity Number (RIN) values (a quality metric commonly used to assess RNA quality). We observed widespread effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library complexity in more degraded samples. While standard normalizations failed to account for the effects of degradation, we found that a simple linear model that controls for the effects of RIN can correct for the majority of these effects. We conclude that in instances where RIN and the effect of interest are not associated, this approach can help recover biologically meaningful signals in data from degraded RNA samples.
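The correction described here is a per-gene linear model with RIN as a covariate. A minimal sketch of that idea — regress each gene's expression on RIN and keep the residuals plus the gene mean — might look like the following; the matrix layout and function name are assumptions for illustration, not the paper's code:

```python
import numpy as np

def correct_for_rin(expr, rin):
    """Remove the linear effect of RIN from a genes x samples
    expression matrix by regressing each gene on RIN and keeping
    the residuals plus the gene mean (illustrative sketch).

    expr : (n_genes, n_samples) array of (log) expression values
    rin  : (n_samples,) array of RNA Integrity Numbers
    """
    rin = np.asarray(rin, dtype=float)
    X = np.column_stack([np.ones_like(rin), rin])   # intercept + RIN
    # One least-squares fit covers all genes at once: expr.T ~ X @ beta
    beta, *_ = np.linalg.lstsq(X, expr.T, rcond=None)
    fitted = (X @ beta).T                           # (n_genes, n_samples)
    residuals = expr - fitted
    return residuals + expr.mean(axis=1, keepdims=True)
```

In practice the model would be fit jointly with the covariates of interest (e.g. tissue or treatment), which is why the approach only applies when RIN is not confounded with the effect being studied.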

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Gad Abraham, Michael Inouye

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, both to detect population structure and to identify potential outliers. However, the size of SNP datasets has increased immensely in recent years, and PCA of large datasets has become time consuming. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers accuracy identical to existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and completed PCA of 150,000 individuals in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not scale adequately. This approach will also help scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
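flashpca itself is a compiled tool; the family of randomized algorithms it draws on can be sketched in a few lines. The version below is a generic randomized range finder with power iterations (a standard construction, not flashpca's actual implementation), applied to a column-standardized genotype matrix:

```python
import numpy as np

def randomized_pca(G, k, n_oversample=10, n_iter=4, seed=0):
    """Top-k PC scores of a samples x SNPs genotype matrix via a
    randomized range finder (illustrative sketch of the general
    technique): standardize SNP columns, sketch the range of the
    matrix with a Gaussian test matrix, then solve a small SVD.
    """
    rng = np.random.default_rng(seed)
    X = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)  # standardize SNPs
    n, p = X.shape
    Omega = rng.standard_normal((p, k + n_oversample))
    Y = X @ Omega                              # sketch of range(X)
    for _ in range(n_iter):                    # power iterations sharpen spectrum
        Y = X @ (X.T @ Y)
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis of the sketch
    B = Q.T @ X                                # small (k+l) x p matrix
    U_small, S, _ = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k] * S[:k]                    # principal component scores
```

The cost is dominated by a handful of matrix products with X rather than an exact SVD or eigen-decomposition, which is what makes this family of methods scale to hundreds of thousands of individuals.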

Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans

Rebekah L. Rogers, Julie M. Cridland, Ling Shao, Tina T. Hu, Peter Andolfatto, Kevin R. Thornton
(Submitted on 28 Jan 2014)

We used whole genome paired-end Illumina sequence data to identify tandem duplications in 20 isofemale lines of D. yakuba and 20 isofemale lines of D. simulans, and performed genome-wide validation with PacBio long molecule sequencing. We identify 1,415 tandem duplications segregating in D. yakuba and 975 in D. simulans, indicating greater variation in D. yakuba. Additionally, we observe high rates of secondary deletions at duplicated sites, with 8% of duplicated sites in D. simulans and 17% in D. yakuba modified by deletions. These secondary deletions are consistent with the large loop mismatch repair system acting to remove polymorphic tandem duplications, resulting in rapid dynamics of gain and loss in duplicated alleles and a richer substrate of genetic novelty than previously reported. Most duplications are present in only single strains, suggesting that deleterious impacts are common. However, we do observe signals consistent with adaptive evolution: D. simulans shows an excess of whole gene duplications and an excess of high frequency variants on the X chromosome, consistent with adaptive evolution through duplications on the D. simulans X. We identify 79 chimeric genes in D. yakuba and 38 in D. simulans, as well as 143 cases of recruited non-coding sequence in D. yakuba and 96 in D. simulans, in agreement with rates of chimeric gene origination in D. melanogaster. Together, these results suggest that tandem duplications often produce complex variation beyond whole gene duplications, offering a rich substrate of standing variation that is likely to contribute both to detrimental phenotypes and disease and to adaptive evolutionary change.

Footprints of ancient balanced polymorphisms in genetic variation data

Ziyue Gao, Molly Przeworski, Guy Sella
(Submitted on 29 Jan 2014)

When long-lived, balancing selection can lead to trans-species polymorphisms that are shared by two or more species identical by descent. In this case, the gene genealogies at the selected sites cluster by allele instead of by species and, because of linkage, nearby neutral sites also have unusual genealogies. Although it is clear that this scenario should lead to discernible footprints in genetic variation data, notably the presence of additional neutral polymorphisms shared between species and the absence of fixed differences, the effects remain poorly characterized. We focus on the case of a single site under long-lived balancing selection and derive approximations for summaries of the data that are sensitive to a trans-species polymorphism: the length of the segment that carries most of the signals, the expected number of shared neutral SNPs within the segment and the patterns of allelic associations among them. Coalescent simulations of ancient balancing selection confirm the accuracy of our approximations. We further show that for humans and chimpanzees, and more generally for pairs of species with low genetic diversity levels, the patterns of genetic variation on which we focus are highly unlikely to be generated by neutral recurrent mutations, so these statistics are specific as well as sensitive. We discuss the implications of our results for the design and interpretation of genome scans for ancient balancing selection in apes and other taxa.

Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees

Daniel L. Rabosky
(Submitted on 26 Jan 2014)

A number of methods have been developed to infer differential rates of species diversification through time and among clades using time-calibrated phylogenetic trees. However, we lack a general framework that can delineate and quantify heterogeneous mixtures of dynamic processes within single phylogenies. I developed a method that can identify arbitrary numbers of time-varying diversification processes on phylogenies without specifying their locations in advance. The method uses reversible-jump Markov Chain Monte Carlo to move between model subspaces that vary in the number of distinct diversification regimes. The model assumes that changes in evolutionary regimes occur across the branches of phylogenetic trees under a compound Poisson process and explicitly accounts for rate variation through time and among lineages. Using simulated datasets, I demonstrate that the method can be used to quantify complex mixtures of time-dependent, diversity-dependent, and constant-rate diversification processes. I compared the performance of the method to the MEDUSA model of rate variation among lineages. As an empirical example, I analyzed the history of speciation and extinction during the radiation of modern whales. The method described here will greatly facilitate the exploration of macroevolutionary dynamics across large phylogenetic trees, which may have been shaped by heterogeneous mixtures of distinct evolutionary processes.
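The compound Poisson assumption — regime shifts arriving at a constant rate per unit of branch length — is easy to illustrate with a small simulation. This is a sketch of the prior process only, not the reversible-jump sampler; the branch-map data structure and rate value are invented for illustration:

```python
import numpy as np

def place_rate_shifts(branch_lengths, shift_rate, seed=0):
    """Simulate regime-shift locations under a compound Poisson process
    on a phylogeny: the number of shifts is Poisson with mean
    shift_rate * total tree length, and each shift falls on a branch
    with probability proportional to that branch's length.

    branch_lengths : dict mapping branch id -> branch length
    Returns a list of (branch id, position along branch) pairs.
    """
    rng = np.random.default_rng(seed)
    branches = list(branch_lengths)
    lengths = np.array([branch_lengths[b] for b in branches], dtype=float)
    total = lengths.sum()
    n_shifts = rng.poisson(shift_rate * total)
    probs = lengths / total
    events = []
    for _ in range(n_shifts):
        i = rng.choice(len(branches), p=probs)  # longer branches are hit more often
        events.append((branches[i], rng.uniform(0.0, lengths[i])))
    return events
```

A reversible-jump sampler would then propose adding, removing, or moving such events, with each event carrying its own diversification parameters, so that the number of distinct regimes is itself sampled.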

Estimate of Within Population Incremental Selection Through Branch Imbalance in Lineage Trees

Gilad Liberman, Jennifer Benichou, Lea Tsaban, Yaakov Maman, Jacob Glanville, Yoram Louzoun

Incremental selection within a population, defined as a limited fitness change following a mutation, is an important aspect of many evolutionary processes and can significantly affect a large number of mutations throughout the genome. Strongly advantageous or deleterious mutations are detected through the fixation of mutations in the population, using the ratio of synonymous to non-synonymous mutations in sequences. There are currently no precise methods to estimate incremental selection occurring over limited periods. Here we provide such a method for the first time and show its precision and its applicability to the genomic analysis of selection. A special case of evolution is rapid, short-term micro-evolution, where organisms are under constant adaptation, as occurs for example in viruses infecting a new host, B cells mutating during germinal center reactions, or mitochondria evolving within a given host. The proposed method is a novel mixed lineage tree/sequence based method to detect within-population selection, as defined by the effect of mutations on the average number of offspring. Specifically, we propose to measure the log of the ratio between the number of leaves in lineage tree branches following synonymous and non-synonymous mutations. This method does not require a baseline model and is practically unaffected by sampling biases. To show its wide applicability, we apply it to multiple cases of micro-evolution, and show that it can distinguish genes from inter-genic regions using the selection rate and detect selection pressures in viral proteins and in the immune response to pathogens.
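The statistic itself is simple to state in code. The sketch below takes a lineage tree as a child map, with the mutation class ('S' or 'N') recorded on the edge leading into each mutated node, and returns the log ratio of mean leaf counts; the tree encoding and the orientation of the ratio are one reading of the abstract, not the authors' implementation:

```python
import math

def count_leaves(tree, node):
    """Number of leaves in the subtree rooted at `node`.
    `tree` maps a node to its list of children; leaves have no entry."""
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(count_leaves(tree, c) for c in children)

def branch_imbalance(tree, mutations):
    """Log of the ratio between the mean number of leaves below branches
    following synonymous ('S') vs non-synonymous ('N') mutations.
    `mutations` maps a node to the class of the mutation on the edge
    leading into it. Under this reading, positive values mean branches
    following synonymous mutations leave more descendants.
    """
    leaves = {'S': [], 'N': []}
    for node, cls in mutations.items():
        leaves[cls].append(count_leaves(tree, node))
    mean_s = sum(leaves['S']) / len(leaves['S'])
    mean_n = sum(leaves['N']) / len(leaves['N'])
    return math.log(mean_s / mean_n)
```

Because the statistic compares descendant counts within one observed tree, it needs no baseline substitution model, which matches the claim above that sampling biases have little effect.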



Xilin Deng, Philippe Henry

Identifying the genetic and ecological basis of adaptation is of immense importance in evolutionary biology. In this study, we applied a panel of 58 biallelic single nucleotide polymorphisms (SNPs) to the economically and culturally important salmonid Oncorhynchus keta. Samples included 4164 individuals from 43 populations ranging from Coastal Western Alaska to southern British Columbia and northern Washington. Seven outlier loci showing signatures of natural selection were identified using two independent approaches: one based on outlier detection and the other on environmental correlations. Two candidate SNP loci, Oke_RFC2-168 and Oke_MARCKS-362, showed evidence of divergent selection and significant environmental correlations, particularly with the number of frost-free days (NFFD). The associations found between environmental variables and outlier loci indicate that these environmental variables could be major driving forces of allele frequency divergence at the candidate loci. NFFD, in particular, may play an important adaptive role in shaping genetic variation in O. keta. Correlations between divergent selection and local environmental variables will help shed light on processes of natural selection and molecular adaptation to local environmental conditions.

On the representation of de Bruijn graphs

Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson, Paul Medvedev
(Submitted on 21 Jan 2014)

The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitation of these types of approaches. We further design and implement a general data structure (DBGFM) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of DBGFM, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use DBGFM.
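The idea behind frequency-based minimizers can be shown in a short sketch: order m-mers by how often they occur in the reads, and pick the least-frequent m-mer of each k-mer as its minimizer instead of the lexicographically smallest one. The ordering details below (ties broken lexicographically) are one plausible formulation, not necessarily DBGFM's exact scheme:

```python
from collections import Counter

def mmer_counts(reads, m):
    """Frequency of every m-mer across the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - m + 1):
            counts[read[i:i + m]] += 1
    return counts

def frequency_minimizer(kmer, m, counts):
    """Minimizer of `kmer` under a frequency-based ordering: the
    least-frequent m-mer it contains, with lexicographic order
    breaking ties (illustrative formulation)."""
    mmers = [kmer[i:i + m] for i in range(len(kmer) - m + 1)]
    return min(mmers, key=lambda s: (counts[s], s))
```

Consecutive k-mers tend to share a minimizer, so reads can be partitioned wherever the minimizer changes; preferring rare m-mers avoids the huge partitions that very frequent m-mers (e.g. poly-A runs) create under lexicographic ordering, which helps keep the memory footprint of path enumeration small.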

Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly repetitive transposable elements

Rajiv C McCoy, Ryan W Taylor, Timothy A Blauwkamp, Joanna L Kelley, Michael Kertesz, Dmitry Pushkarev, Dmitri A Petrov, Anna-Sophie Fiston-Lavier

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, mostly due to the presence of repeats, which cannot be reconstructed unambiguously with short read data alone. One class of repeats, called transposable elements (TEs), is particularly problematic due to high sequence identity, high copy number, and a capacity to induce complex genomic rearrangements. Despite their importance to genome function and evolution, most current de novo assembly approaches cannot resolve TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 2-15 Kbp with an extremely low error rate (0.05%). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain yw;cn,bw,sp) achieving an NG50 contig size of 77.9 Kbp and covering 97.2% of the current reference genome (including heterochromatin). TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recover and accurately place 80.4% of annotated transposable elements with perfect identity to the current reference genome. As TEs are complex and highly repetitive features that are ubiquitous in genomes across the tree of life, TruSeq synthetic long-read technology offers a powerful and inexpensive approach to drastically improve de novo assemblies of whole genomes.