Hot RAD: A Tool for Analysis of Next-Gen RAD Tag Data

Posted on November 30, 2015 by schraib

Lauren A. Assour, Nicholas LaRosa, Scott J. Emrich

Restriction site Associated DNA (RAD) tagging (also known as RAD-seq, etc.) is an emerging method for analyzing an organism’s genome without completely sequencing it. This can be applied to a non-model organism without a reference genome, though this creates the problem of how to begin data analysis on unmapped and unannotated reads. Our program, Hot RAD, presents a straightforward and easy-to-use method to take raw Illumina data that has been RAD tagged and produce consensus contigs or sequence stacks using a distributed framework, creating a basis on which to begin analyzing an organism’s DNA. The GUI (graphical user interface) element of our tool makes it easy for those not familiar with the command line to take raw sequence files and produce usable data in a timely manner.

Calculating the Unrooted Subtree Prune-and-Regraft Distance

Posted on November 30, 2015 by schraib

Chris Whidden, Frederick A. Matsen IV

The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be computed relatively efficiently between rooted trees using fixed-parameter-tractable maximum agreement forest (MAF) algorithms, no MAF formulation is known for the unrooted case. Correspondingly, previous algorithms are unable to compute unrooted SPR distances larger than 7.
In this paper, we substantially advance understanding of and computational algorithms for the unrooted SPR distance. First we identify four properties of minimal SPR paths, each of which suggests that no MAF formulation exists in the unrooted case. We then prove the 2008 conjecture of Hickey et al. that chain reduction preserves the unrooted SPR distance. This reduces the problem to a linear size problem kernel, substantially improving on the previous best quadratic size kernel. Then we introduce a new lower bound on the unrooted SPR distance called the replug distance that is amenable to MAF methods, and give an efficient fixed-parameter algorithm for calculating it. Finally, we develop a “progressive A*” search algorithm using multiple heuristics, including the TBR and replug distances, to exactly compute the unrooted SPR distance. Our algorithm is nearly two orders of magnitude faster than previous methods on small trees, and allows computation of unrooted SPR distances as large as 14 on trees with 50 leaves.

MetaScope – Fast and accurate identification of microbes in metagenomic sequencing data

Posted on November 30, 2015 by schraib

Benjamin Buchfink, Daniel H. Huson, Chao Xie

MetaScope is a fast and accurate tool for analyzing (host-associated) metagenome datasets. Sequence alignment of reads against the host genome (if requested) and against microbial GenBank is performed using a new DNA aligner called SASS. The output of SASS is then processed to assign all microbial reads to taxa and genes, using a new weighted version of the lowest common ancestor (LCA) algorithm. MetaScope is the winner of the 2013 DTRA software challenge entitled “Identify Organisms from a Stream of DNA Sequences”.
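
To make the LCA idea concrete, here is a minimal sketch of the plain, unweighted LCA step: a read that aligns to several taxa is assigned to their lowest common ancestor. This is not MetaScope's weighted variant, whose details the abstract does not give. The toy taxonomy and the function names `lineage` and `lca` are assumptions made for illustration.

```python
# Hypothetical toy taxonomy (child -> parent); illustrative only.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(taxa):
    """Return the lowest common ancestor of a non-empty collection of taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon))
    # Walking up one lineage, the first node shared by all taxa is the lowest.
    for node in lineage(taxa[0]):
        if node in common:
            return node
    raise ValueError("taxa share no ancestor")


# A read hitting two genera of the same family is placed at the family level.
assignment = lca(["Escherichia coli", "Salmonella enterica"])
```

A weighted scheme would presumably let some hits count more than others before settling on a node, but how MetaScope does this is not stated here.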

Author Post: Natural selection reduces linked neutral divergence between distantly related species

Posted on November 30, 2015 by schraib

This is a guest post by Tanya Phung on her recent preprint Natural selection reduces linked neutral divergence between distantly related species

Our recent paper on natural selection reducing divergence between distantly related species has generated interesting discussions. I started this project just a little over a year ago as a rotation student in Kirk Lohmueller’s lab at UCLA. I am now a full-time member of Kirk’s group and a second-year Ph.D. student in the Bioinformatics program.

This project began when, in 2011, Kirk published a paper that documented signatures of natural selection affecting genetic variation at neutral sites across the human genome (Lohmueller et al., 2011). In that paper, among other things, he found a positive correlation between human-chimp divergence and recombination. This correlation is indicative of selection at linked neutral sites affecting divergence, mutagenic recombination, or possibly biased gene conversion. Based on the results of forward simulations, he concluded that background selection can drive much of this correlation. After publishing the paper, Kirk looked at divergence between humans and more distantly related species. To his surprise, he also observed a positive correlation between human-mouse neutral divergence and recombination. This signal was unexpected. Birky and Walsh (1988) had already shown that selection does not affect substitution at linked neutral sites, and Kirk was carefully filtering out sites that are thought to be under direct effects of selection. Consequently, if selection was driving the correlation, it would have to act through patterns of polymorphism in the human-mouse ancestor, which existed long ago. Thus, he thought there shouldn’t be any remaining signal. Kirk did not have time to follow up on this finding until a few years later, when I showed up as a rotation student in his group in the Fall of 2014. While he suggested three different ideas as potential rotation projects, investigating how natural selection has affected divergence stood out to me in particular. As I read his 2011 paper and followed the references within, I was intrigued by conflicting reports in the literature about whether divergence showed a correlation with recombination and about the mechanism for this potential correlation. Therefore, I set out to investigate this problem.

By the end of the rotation, I replicated what Kirk found earlier: a positive correlation between recombination and divergence in both closely and distantly related species. Then, using a coalescent simulation approach, I showed that simulations incorporating background selection in the ancestral population could recapitulate the correlation between neutral divergence and recombination observed in the empirical data.

My results indicated that natural selection could affect neutral divergence even between distantly related species. We were ready to prepare a manuscript. At the time, a few studies were coming out reporting the importance of biased gene conversion. We did a bit more thinking about how biased gene conversion could affect our empirical correlation between neutral divergence and recombination. We decided to control for its potential effect by filtering out weak-to-strong mutations (where an A or a T mutates to a C or a G), since these are the sites it could have affected. Filtering out weak-to-strong differences did not significantly affect the correlation between human-chimp neutral divergence and recombination. But to our surprise, the correlation between human-mouse neutral divergence and recombination all but vanished with our most stringent filtering. This means that much of that correlation could be driven by biased gene conversion. We thought that if background selection has affected human-mouse divergence, the signal ought to be stronger in regions near genes. When we partitioned the genome into regions near genes and far from genes, the positive correlation between human-mouse divergence and recombination was restored in regions near genes (albeit more weakly than before filtering sites that could have undergone biased gene conversion).
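
As a rough illustration of the weak-to-strong filter described above, the following sketch drops every substitution in which an A or a T has changed to a C or a G. It assumes each substitution is already polarized into an ancestral and a derived base. The tuple layout and the function names are hypothetical and are not taken from the preprint's pipeline.

```python
WEAK = {"A", "T"}    # bases forming two hydrogen bonds
STRONG = {"C", "G"}  # bases forming three hydrogen bonds


def is_weak_to_strong(ancestral, derived):
    """True if an A or a T has mutated to a C or a G."""
    return ancestral.upper() in WEAK and derived.upper() in STRONG


def filter_weak_to_strong(substitutions):
    """Keep only substitutions that are not weak-to-strong.

    Each substitution is a (position, ancestral_base, derived_base) tuple;
    this layout is an assumption made for the example.
    """
    return [
        (pos, anc, der)
        for pos, anc, der in substitutions
        if not is_weak_to_strong(anc, der)
    ]


substitutions = [(10, "A", "G"), (20, "C", "T"), (30, "T", "A"), (40, "t", "c")]
kept = filter_weak_to_strong(substitutions)
```

Divergence would then be recomputed from the retained substitutions only. A stricter filter could also exclude other substitution classes, which this sketch does not attempt.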

We realized that recombination rates are transient and have probably changed throughout the course of evolution. In fact, changing recombination rates could be obscuring the correlation between recombination and divergence after removing the confounding effects of biased gene conversion. So, we wanted to look for other signatures of how natural selection reduced neutral divergence even between distantly related species. This led us to investigate the relationship between divergence and functional content (amount of coding bases and conserved non-coding sequence in each window), and between divergence and measures of background selection represented by B-values estimated in McVicker et al. (B-values measure the strength of background selection in that region of the genome; see McVicker et al., 2009). In all pairs of species considered, we found a negative correlation between neutral divergence and functional content. This means that windows that have more functional sites tend to have less divergence at the nearby putatively neutral sites. We also found a positive correlation between neutral divergence and B-values, suggesting that regions of the genome that are under greater background selection within primates are also under greater background selection in the human-mouse ancestor. Both these analyses provide empirical evidence that natural selection has reduced neutral divergence in both recently and distantly related species.

Conventional wisdom holds that ancestral polymorphism does not affect divergence when considering species with long split times (such as human and mouse). The rationale is that the split time has been long enough for many new mutations to accumulate post-split, and any signal in the ancestral population would be diluted. While our empirical and simulation results clearly indicated otherwise, we wanted to gain some theoretical intuition on why we were still seeing these correlations. This is when Christian Huber, a post-doc who joined the lab recently from Vienna, joined in. Using a two-locus model, he showed that background selection can have a strong influence on the variation in divergence between genomic regions, even when the contribution of ancestral polymorphism to total divergence is vanishingly small. The key condition is a reasonably large ancestral population size.

Now we have empirical, theoretical, and simulation results which strongly argue that background selection contributes to reducing divergence at linked neutral sites. Our results question the commonly held notion that ancestral polymorphism does not measurably affect divergence in distantly related species. Further, our results indicate the importance of background selection in shaping genetic variation across the genome. Many currently popular methods to infer demographic parameters from whole genomes (e.g. PSMC, G-PhoCS) do not take background selection into account. Our work suggests that because background selection has a large effect on the variance in coalescent times across the genome, incorporating its effects into estimates of demographic parameters should yield more accurate results.

Summary

When I started working on this project as a rotation student, I had no idea that it would turn out to address a controversy and challenge a commonly held notion in population genetics. As I transitioned from an experimental microbiologist to a population geneticist, this project has given me many opportunities to learn important concepts and theories in the field. This paper not only opens opportunities to revise methods in the field but also gives me the foundation to continue working on understanding evolutionary forces that influence genetic variation across the genome.

Birky, C.W., and Walsh, J.B. (1988). Effects of linkage on rates of molecular evolution. Proc. Natl. Acad. Sci. U. S. A. 85, 6414–6418.

Lohmueller, K.E., Albrechtsen, A., Li, Y., Kim, S.Y., Korneliussen, T., Vinckenbosch, N., Tian, G., Huerta-Sanchez, E., Feder, A.F., Grarup, N., et al. (2011). Natural Selection Affects Multiple Aspects of Genetic Variation at Putatively Neutral Sites across the Human Genome. PLoS Genet 7, e1002326.

McVicker, G., Gordon, D., Davis, C., and Green, P. (2009). Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet. 5, e1000471.

Direct estimate of the spontaneous mutation rate uncovers the effects of drift and recombination in the Chlamydomonas reinhardtii plastid genome

Posted on November 18, 2015 by schraib

Rob W Ness, Susanne A Kraemer, Nick Colegrave, Peter D Keightley

bioRxiv doi: http://dx.doi.org/10.1101/031898

Plastids perform crucial cellular functions, including photosynthesis, across a wide variety of eukaryotes. Since endosymbiosis, plastids have maintained independent genomes that now display a wide diversity of gene content, genome structure, gene regulation mechanisms, and transmission modes. The evolution of plastid genomes depends on an input of de novo mutation, but our knowledge of mutation in the plastid is limited to indirect inference from patterns of DNA divergence between species. Here, we use a mutation accumulation experiment, where selection acting on mutations is rendered ineffective, combined with whole-plastid genome sequencing to directly characterize de novo mutation in Chlamydomonas reinhardtii. We show that the mutation rates of the plastid and nuclear genomes are similar, but that the base spectra of mutations differ significantly. We integrate our measure of the mutation rate with a population genomic dataset of 20 individuals, and show that the plastid genome is subject to substantially stronger genetic drift than the nuclear genome. We also show that high levels of linkage disequilibrium in the plastid genome are not due to restricted recombination, but are instead a consequence of increased genetic drift. One likely explanation for increased drift in the plastid genome is that there are stronger effects of genetic hitchhiking. The presence of recombination in the plastid is consistent with laboratory studies in C. reinhardtii and demonstrates that although the plastid genome is thought to be uniparentally inherited, it recombines in nature at a rate similar to the nuclear genome.

A statistical approach to genome size evolution: Observations and explanations

Posted on November 18, 2015 by schraib

Dirson Jian Li

bioRxiv doi: http://dx.doi.org/10.1101/031963

Genome size evolution is a fundamental problem in molecular evolution. Statistical analysis of genome sizes brings new insight into the evolution of genome size. Although the variation of genome sizes is complicated, the analysis indicates that genome size evolution can be explained more clearly at the taxon level than at the species level. I find that the genome size distribution for species in a taxon fits a log-normal distribution. I also find a relationship between the phylogeny of life and the statistical features of genome size distributions among taxa, and I observe different statistical features of genome size distributions between animal taxa and plant taxa. A log-normal stochastic process model is developed to simulate genome size evolution. The simulation results on the log-normal distributions of genome sizes and their statistical features agree with the observations.

Kaiju: Fast and sensitive taxonomic classification for metagenomics

Posted on November 18, 2015 by schraib

Peter Menzel, Kim Lee Ng, Anders Krogh

bioRxiv doi: http://dx.doi.org/10.1101/031229

The constantly decreasing cost and increasing output of current sequencing technologies enable large-scale metagenomic studies of microbial communities from diverse habitats. Therefore, fast and accurate methods for taxonomic classification are needed, which can operate on increasingly larger datasets and reference databases. Recently, several fast metagenomic classifiers have been developed, which are based on comparison of genomic k-mers. However, nucleotide comparison using a fixed k-mer length often lacks the sensitivity to overcome the evolutionary distance between sampled species and genomes in the reference database. Here, we present the novel metagenome classifier Kaiju for fast assignment of reads to taxa. Kaiju finds maximum exact matches on the protein level using the Burrows-Wheeler transform, and can optionally allow amino acid substitutions in the search using a greedy heuristic. We show in a genome exclusion study that Kaiju can classify more reads with higher sensitivity and similar precision compared to fast k-mer based classifiers, especially in genera that are underrepresented in reference databases. We also demonstrate that Kaiju classifies more than twice as many reads in ten real metagenomes compared to programs based on genomic k-mers. Kaiju can process up to millions of reads per minute, and its memory footprint is below 5 GB of RAM, allowing the analysis on a standard PC. The program is available under the GPL3 license at: github.com/bioinformatics-centre/kaiju
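
For readers unfamiliar with searching via the Burrows-Wheeler transform, here is a toy sketch of exact pattern matching by backward search over a protein string. It is not Kaiju's implementation: it only finds occurrences of a whole pattern rather than maximum exact matches, it builds the index naively, and the names `bwt_index` and `find_exact` as well as the example sequence are invented for illustration.

```python
def bwt_index(text):
    """Build a toy BWT-based index for text (which must not contain '$')."""
    text += "$"
    # Suffix array by naive sorting; acceptable for short examples only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text that sort strictly before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, first, occ


def find_exact(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    # Extend the match one character at a time, from the pattern's end.
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("MKVLAAGKVL")  # made-up amino acid sequence
hits = find_exact("KVL", index)
```

Each backward step narrows the range of suffixes that begin with the current pattern suffix, so the search cost depends on the pattern length rather than on the size of the indexed text.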

The State of Software in Evolutionary Biology

Posted on November 18, 2015 by schraib

Diego Darriba, Tomas Flouri, Alexandros Stamatakis

bioRxiv doi: http://dx.doi.org/10.1101/031930

With Next Generation Sequencing (NGS) data coming of age and being routinely used, evolutionary biology is transforming into a data-driven science. As a consequence, researchers have to rely on a growing number of increasingly complex software tools. All widely used tools in our field have grown considerably, in terms of the number of features as well as lines of code. In addition, analysis pipelines now include substantially more components than 5-10 years ago. A topic that has received little attention in this context is the code quality of widely used tools. Unfortunately, the majority of users tend to blindly trust software and the results it produces. To this end, we assessed the code quality of 15 highly cited tools (e.g., MrBayes, MAFFT, SweepFinder, etc.) from the broader area of evolutionary biology that are used in current data analysis pipelines. We also discuss little-known problems associated with floating-point arithmetic for representing real numbers on computer systems. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools, and also list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we also discuss journal and science policy as well as funding issues that need to be addressed for improving software quality as well as ensuring support for developing new and maintaining existing software. Our intention is to raise the awareness of the community regarding software quality issues and to emphasize the substantial lack of funding for scientific software development.
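
The abstract does not say which floating-point problems the authors have in mind. As an assumed illustration, here are two pitfalls that commonly affect likelihood-based software: rounding error in equality tests, and underflow when many small probabilities are multiplied. The per-site likelihood values are made up.

```python
import math

# Rounding: 0.1, 0.2 and 0.3 have no exact binary representation,
# so an exact equality test fails where a tolerance-based one succeeds.
naive_equal = (0.1 + 0.2 == 0.3)
tolerant_equal = math.isclose(0.1 + 0.2, 0.3)

# Underflow: multiplying many small per-site likelihoods collapses to 0.0,
# while summing their logarithms stays well within range.
site_likelihoods = [1e-5] * 100  # made-up values
product = math.prod(site_likelihoods)
log_likelihood = sum(math.log(p) for p in site_likelihoods)
```

Here `naive_equal` is `False`, `tolerant_equal` is `True`, and `product` is exactly `0.0`, whereas `log_likelihood` remains a finite negative number. This is one reason such tools typically work with log-likelihoods.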

Testing Rare-Variant Association without Calling Genotypes Allows for Systematic Differences in Sequencing between Cases and Controls

Posted on November 18, 2015 by schraib

Yi-Juan Hu, Peizhou Liao, Henry Richard Johnston, Andrew Allen, Glen Satten

bioRxiv doi: http://dx.doi.org/10.1101/032037

Next-generation sequencing of DNA provides an unprecedented opportunity to discover rare genetic variants associated with complex diseases and traits. However, when testing the association between rare variants and traits of interest, the current practice of first calling underlying genotypes and then treating the called values as known is prone to false positive findings, especially when genotyping errors are systematically different between cases and controls. This happens whenever cases and controls are sequenced at different depths or on different platforms. In this article, we provide a likelihood-based approach to testing rare variant associations that directly models sequencing reads without calling genotypes. We consider the (weighted) burden test statistic, which is the (weighted) sum of the score statistic for assessing effects of individual variants on the trait of interest. Because variant locations are unknown, we develop a simple, computationally efficient screening algorithm to estimate the loci that are variants. Because our burden statistic may not have mean zero after screening, we develop a novel bootstrap procedure for assessing the significance of the burden statistic. We demonstrate through extensive simulation studies that the proposed tests are robust to a wide range of differential sequencing qualities between cases and controls, and are at least as powerful as the standard genotype calling approach when the latter controls type I error. An application to the UK10K data reveals novel rare variants in gene BTBD18 associated with childhood onset obesity. The relevant software is freely available.
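
The burden statistic described in the abstract can be written out as follows. The symbols are assumptions chosen for illustration and are not the authors' notation: $U_j$ is the score statistic for variant $j$, $w_j$ is its weight, and $m$ is the number of loci retained after screening.

```latex
T = \sum_{j=1}^{m} w_j \, U_j
```

Setting every $w_j = 1$ would give the unweighted burden test. How the weights are chosen, and how $U_j$ is computed from sequencing reads without calling genotypes, is left to the paper.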

Spatial selection and local adaptation jointly shape life-history evolution during range expansion

Posted on November 18, 2015 by schraib

Katrien Van Petegem, Jeroen Boeye, Robby Stoks, Dries Bonte

bioRxiv doi: http://dx.doi.org/10.1101/031922

In the context of climate change and species invasions, range shifts increasingly gain attention because the rates at which they occur in the Anthropocene induce fast shifts in biological assemblages. During such range shifts, species experience multiple selection pressures. Especially for poleward expansions, a straightforward interpretation of the observed evolutionary dynamics is hampered by the joint action of evolutionary processes related to spatial selection and to adaptation towards local climatic conditions. To disentangle the effects of these two processes, we integrated stochastic modeling and empirical approaches, using the spider mite Tetranychus urticae as a model species. We demonstrate considerable latitudinal quantitative genetic divergence in life-history traits in T. urticae, which was shaped by both spatial selection and local adaptation. The former mainly affected dispersal behavior, while development was mainly shaped by adaptation to the local climate. Divergence in life-history traits in species shifting their range poleward can consequently be jointly determined by fast local adaptation to the environmental gradient and contemporary evolutionary dynamics resulting from spatial selection. The integration of modeling with common garden experiments provides a powerful tool to study the contribution of these two evolutionary processes to life-history evolution during range expansion.

Haldane's Sieve

Discussing preprints in population and evolutionary genetics

Monthly Archives: November 2015
