Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees

Daniel L. Rabosky
(Submitted on 26 Jan 2014)

A number of methods have been developed to infer differential rates of species diversification through time and among clades using time-calibrated phylogenetic trees. However, we lack a general framework that can delineate and quantify heterogeneous mixtures of dynamic processes within single phylogenies. I developed a method that can identify arbitrary numbers of time-varying diversification processes on phylogenies without specifying their locations in advance. The method uses reversible-jump Markov Chain Monte Carlo to move between model subspaces that vary in the number of distinct diversification regimes. The model assumes that changes in evolutionary regimes occur across the branches of phylogenetic trees under a compound Poisson process and explicitly accounts for rate variation through time and among lineages. Using simulated datasets, I demonstrate that the method can be used to quantify complex mixtures of time-dependent, diversity-dependent, and constant-rate diversification processes. I compared the performance of the method to the MEDUSA model of rate variation among lineages. As an empirical example, I analyzed the history of speciation and extinction during the radiation of modern whales. The method described here will greatly facilitate the exploration of macroevolutionary dynamics across large phylogenetic trees, which may have been shaped by heterogeneous mixtures of distinct evolutionary processes.

SINGLE NUCLEOTIDE POLYMORPHISMS SHED LIGHT ON CORRELATIONS BETWEEN ENVIRONMENTAL VARIABLES AND ADAPTIVE GENETIC DIVERGENCE AMONG POPULATIONS IN ONCORHYNCHUS KETA

Xilin Deng, Philippe Henry

Identifying the genetic and ecological basis of adaptation is of immense importance in evolutionary biology. In our study, we applied a panel of 58 biallelic single nucleotide polymorphisms (SNPs) for the economically and culturally important salmonid Oncorhynchus keta. Samples included 4164 individuals from 43 populations ranging from Coastal Western Alaska to southern British Columbia and northern Washington. Signatures of natural selection were detected by identifying seven outlier loci using two independent approaches: one based on outlier detection and another based on environmental correlations. Evidence of divergent selection at two candidate SNP loci, Oke_RFC2-168 and Oke_MARCKS-362, indicates significant environmental correlations, particularly with the number of frost-free days (NFFD). Important associations found between environmental variables and outlier loci indicate that those environmental variables could be the major driving forces of allele frequency divergence at the candidate loci. NFFD, in particular, may play an important adaptive role in shaping genetic variation in O. keta. Correlations between divergent selection and local environmental variables will help shed light on processes of natural selection and molecular adaptation to local environmental conditions.

On the representation of de Bruijn graphs

Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson, Paul Medvedev
(Submitted on 21 Jan 2014)

The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitation of these types of approaches. We further design and implement a general data structure (DBGFM) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of DBGFM, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use DBGFM.
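
For readers unfamiliar with the structure, the following is a minimal sketch of a node-centric de Bruijn graph held as a plain set of k-mers, where neighbours are found purely by membership queries. It is not DBGFM or any structure from the paper; the function names are hypothetical, reverse complements are ignored, and a hash set like this is exactly the memory-hungry baseline that compact representations try to improve on.

```python
from typing import Iterable, List, Set


def build_kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer occurring in the reads (the graph's nodes)."""
    kmers: Set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmers: Set[str], node: str) -> List[str]:
    """Out-neighbours of a node: k-mers whose (k-1)-prefix equals the node's (k-1)-suffix."""
    suffix = node[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmers]


def predecessors(kmers: Set[str], node: str) -> List[str]:
    """In-neighbours of a node: k-mers whose (k-1)-suffix equals the node's (k-1)-prefix."""
    prefix = node[:-1]
    return [base + prefix for base in "ACGT" if base + prefix in kmers]


if __name__ == "__main__":
    graph = build_kmer_set(["ACGTAC"], k=3)
    print(sorted(graph))             # nodes of the graph
    print(successors(graph, "ACG"))  # navigation by membership queries only
```

Because edges are implied by set membership rather than stored, the memory cost is dominated by how the k-mer set itself is represented.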

Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly repetitive transposable elements

Rajiv C McCoy, Ryan W Taylor, Timothy A Blauwkamp, Joanna L Kelley, Michael Kertesz, Dmitry Pushkarev, Dmitri A Petrov, Anna-Sophie Fiston-Lavier

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, mostly due to the presence of repeats, which cannot be reconstructed unambiguously with short read data alone. One class of repeats, called transposable elements (TEs), is particularly problematic due to high sequence identity, high copy number, and a capacity to induce complex genomic rearrangements. Despite their importance to genome function and evolution, most current de novo assembly approaches cannot resolve TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 2-15 Kbp with an extremely low error rate (0.05%). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain yw;cn,bw,sp) achieving an NG50 contig size of 77.9 Kbp and covering 97.2% of the current reference genome (including heterochromatin). TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recover and accurately place 80.4% of annotated transposable elements with perfect identity to the current reference genome. As TEs are complex and highly repetitive features that are ubiquitous in genomes across the tree of life, TruSeq synthetic long-read technology offers a powerful and inexpensive approach to drastically improve de novo assemblies of whole genomes.

The evolution of moment generating functions for the Wright Fisher model of population genetics

Tat Dat Tran, Julian Hofrichter, Juergen Jost
(Submitted on 21 Jan 2014)

We derive and apply a partial differential equation for the moment generating function of the Wright-Fisher model of population genetics.

A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data

David Coil, Guillaume Jospin, Aaron E. Darling
(Submitted on 21 Jan 2014)

Motivation: Open-source bacterial genome assembly remains inaccessible to many biologists due to its complexity. Few software solutions exist that are capable of automating all steps in the process of de novo genome assembly from Illumina data.
Results: A5-miseq can produce high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5-miseq does this by automating the process of adapter trimming, quality filtering, error correction, contig and scaffold generation, and detection of misassemblies. Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, uses read pairing information during contig generation, and includes several improvements to read trimming. Together these changes result in substantially improved assemblies that recover a more complete set of reference genes than previous methods.
Availability: A5-miseq is licensed under the GPL open source license. Source code and precompiled binaries for Mac OS X 10.6+ and Linux 2.6.15+ are available from this http URL

Coalescence 2.0: a multiple branching of recent theoretical developments and their applications

Aurelien Tellier, Christophe Lemaire
(Submitted on 21 Jan 2014)

Population genetics theory has laid the foundations for genomics analyses including the recent burst in genome scans for selection and statistical inference of past demographic events in many prokaryote, animal and plant species. Identifying SNPs under natural selection and underpinning species adaptation relies on disentangling the respective contribution of random processes (mutation, drift, migration) from that of selection on nucleotide variability. Most theory and statistical tests have been developed using the Kingman coalescent theory based on the Wright-Fisher population model. However, these theoretical models rely on biological and life-history assumptions which may be violated in many prokaryote, fungal, animal or plant species. Recent theoretical developments of the so-called multiple merger coalescent models are reviewed here (Λ-coalescent, beta-coalescent, Bolthausen-Sznitman coalescent, Ξ-coalescent). We explain how these new models take into account various pervasive ecological and biological characteristics, life history traits or life cycles which were not accounted for in previous theories, such as 1) the skew in offspring production typical of marine species, 2) fast adapting microparasites (virus, bacteria and fungi) exhibiting large variation in population sizes during epidemics, 3) the peculiar life cycles of fungi and bacteria alternating sexual and asexual cycles, and 4) the high rates of extinction-recolonization in spatially structured populations. We finally discuss the relevance of multiple merger models for the detection of SNPs under selection in these species and for population genomics with very large sample sizes, and suggest that the conclusions of previous population genetics studies may need to be re-examined.

The life cycle of Drosophila orphan genes

Nicola Palmieri, Carolin Kosiol, Christian Schlötterer
(Submitted on 20 Jan 2014)

Orphans are genes restricted to a single phylogenetic lineage and emerge at high rates. While this predicts an accumulation of genes, the gene number has remained remarkably constant through evolution. This paradox has not yet been resolved. Because orphan genes have been mainly analyzed over long evolutionary time scales, orphan loss has remained unexplored. Here we study the patterns of orphan turnover among close relatives in the Drosophila obscura group. We show that orphans are not only emerging at a high rate, but that they are also rapidly lost. Interestingly, recently emerged orphans are more likely to be lost than older ones. Furthermore, highly expressed orphans with a strong male-bias are more likely to be retained. Since both lost and retained orphans show similar evolutionary signatures of functional conservation, we propose that orphan loss is not driven by high rates of sequence evolution, but reflects lineage specific functional requirements.

Author post: Genome Sequencing Highlights the Dynamic Early History of Dogs

This guest post is by Ilan Gronau on Freedman et al. Genome Sequencing Highlights the Dynamic Early History of Dogs. This preprint was one of the most popular on Haldane’s Sieve last year, and was published yesterday in PLoS Genetics.

Earlier this week, our paper entitled “Genome Sequencing Highlights the Dynamic Early History of Dogs” was published in PLoS Genetics. This paper explores central questions having to do with the origin of domestic canines by sequencing and analyzing the complete genomes of six individuals from carefully selected dog and wolf lineages. We posted an earlier version of this manuscript on ArXiv, and received quite a lot of comments and questions regarding the methodology we employed for demography inference (many through Haldane’s Sieve). This feedback exposed an increasing interest in demography inference methods that utilize small numbers of complete individual genomes, but it also highlighted the need to examine the strengths and weaknesses of the different methods we employed, and how best to combine them to obtain a unified and robust picture of demographic history. We discuss some of these issues in the revised version published this week, but I thought some of the points were worth spelling out a bit more explicitly, which is the purpose of this post. I’d like to thank Adam Siepel, John Novembre, and Adam Freedman for sharing their insights through the process of writing this post.

The Methods

Our study takes advantage of three recently developed demography inference methods: the Pairwise Sequential Markovian Coalescent (PSMC; Li and Durbin, 2011), the D statistic, or as it is more commonly referred to, the ABBA/BABA test (Durand et al., 2011), and the Generalized Phylogenetic Coalescent Sampler (G-PhoCS; Gronau et al., 2011). All three methods base their inferences on the genealogical relationships among a relatively small number of individuals, taking advantage of the wealth of genealogical information encoded in individual genomes due to genetic recombination. PSMC, for instance, makes use of the information on changes in coalescent times between the two chromosome copies within a single individual to infer ancestral population sizes. The ABBA/BABA test makes use of asymmetries in genealogies spanning four chromosomes to detect post-divergence gene flow. G-PhoCS jointly considers all individuals using a multi-population coalescent-based demographic model, which includes population divergence times, changes in ancestral population size, and gene flow. A major advantage of G-PhoCS is that it produces a single detailed image of demographic history, inferred using a unified probabilistic model for all individuals. However, in the interest of computational tractability, the method relies on several simplifying modeling assumptions not required by the other methods. In addition, by being constrained to subsets of individuals, the PSMC and ABBA/BABA approaches are free to specialize in capturing particular aspects of demographic history (ancestral Ne and admixture, respectively), which G-PhoCS treats more coarsely. Thus, we found all three methods to be complementary and their combination to be particularly informative about the demographic history of these wild and domestic canids.
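
To make the ABBA/BABA idea concrete, here is a minimal sketch of the D statistic computed from site patterns on four chromosomes ordered (P1, P2, P3, outgroup). It assumes one haploid sequence per population and biallelic sites, and the function name and input layout are illustrative assumptions rather than the code used in the study. It also omits the block-jackknife step normally used to assess significance.

```python
from typing import Iterable, Tuple

Site = Tuple[str, str, str, str]  # alleles at one site for (P1, P2, P3, outgroup)


def d_statistic(sites: Iterable[Site]) -> float:
    """Return D = (nABBA - nBABA) / (nABBA + nBABA).

    The outgroup allele is treated as ancestral ("A"); the other allele is
    derived ("B"). Only sites where P3 carries the derived allele and exactly
    one of P1/P2 shares it are informative.
    """
    abba = 0
    baba = 0
    for p1, p2, p3, outgroup in sites:
        if p3 == outgroup:
            continue  # P3 is ancestral here, so the site cannot be ABBA or BABA
        if p1 == outgroup and p2 == p3:
            abba += 1  # P2 shares the derived allele with P3
        elif p2 == outgroup and p1 == p3:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba)


if __name__ == "__main__":
    toy_sites = (
        [("A", "G", "G", "A")] * 6     # ABBA
        + [("G", "A", "G", "A")] * 4   # BABA
        + [("A", "A", "G", "A")] * 10  # uninformative
    )
    print(d_statistic(toy_sites))
```

Without gene flow the two patterns are expected to be equally frequent, so D should be near zero; an excess of ABBA (positive D) points to allele sharing between P2 and P3, and an excess of BABA to sharing between P1 and P3.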

Inference of Ancestral Population Sizes

Both PSMC and G-PhoCS provide information about ancestral population sizes (Ne). PSMC is specifically designed for this task, and provides a high-resolution trace of changes in ancestral Ne by separately analyzing each diploid genome. Using PSMC, we could detect sharp declines in Ne for wolves and dogs without making any assumptions regarding the canid phylogeny. However, we found that the traces of ancestral Ne inferred by PSMC should be interpreted with care; simulations we conducted showed that the gradual reduction in Ne inferred by PSMC was also consistent with a more recent severe population bottleneck. Eventually, the coarser model inferred by G-PhoCS, in which shortly after the divergence of dogs and wolves the two ancestral populations suffered severe bottlenecks, ended up fitting the data better than the model implied by the PSMC traces. Thus in our case we found the phylogenetic context to be quite useful for dating the major changes in canid population size. Ideally, it should be possible to directly infer PSMC-style traces along the branches of the population phylogeny (alongside inference of divergence times), but this would involve a fairly major methodological undertaking.

Detection of Admixture and Post Divergence Gene Flow

One of the central findings in our study was that gene flow, particularly between dogs and wolves, played a prominent role in the history of canids. Indeed, we found that several previous claims about dog origins in the Middle East and East Asia were likely influenced by ancient gene flow from wolves to dogs in these regions. The ABBA/BABA test was an obvious choice for a method to detect admixture. It is a fairly simple method, sensitive to even low amounts of gene flow, and robust to assumptions about the demographic history of the populations being tested. By applying this test separately to all sample quartets that include the jackal outgroup, we were able to obtain a good set of candidate ancestral admixture events between dogs and wolves. However, interpreting some of these signals and combining them into a single unified hypothesis was not straightforward, especially since we found signatures for multiple ancestral admixture events among overlapping pairs of populations (e.g., Basenji-Israeli wolf and Boxer-Israeli wolf). G-PhoCS is better suited to deal with this more complex scenario of gene flow, because it jointly analyzes all samples and can consider multiple migration bands in a single analysis. We exploited this feature to find strong evidence of gene flow between wolves and jackals and to show that the signal found for Boxer-Israeli wolf admixture in the ABBA/BABA test was a result of ancestral gene flow from Basenji to Israeli wolf. Still, coming up with a scenario of ancestral gene flow that best fit the data required developing a fairly complex framework for model comparison that involved a combination of multiple separate G-PhoCS runs and comparison with simulated data (see below).

Model Comparison

Addressing subtle questions having to do with the origin of dogs and post-divergence gene flow with wolves required the ability to compare alternative hypotheses for dog domestication in terms of their fit to the data. We did this by considering a collection of plausible topologies for the population phylogeny augmented with different sets of migration bands, and using G-PhoCS to infer demographic parameters for each case. This provided us with a complete demographic model we could associate with each alternative hypothesis and then use to simulate data representing that hypothesis. The hypotheses were then assessed by comparing the simulated data with the real data. While this approach does not constitute a formal model-testing method, it did allow us to explore the space of plausible models in a systematic way and show that the data support a model with a single origin for dogs, and that this origin was ancient and similarly distant from all sampled wolf populations.

Future Development

This study allowed us to closely examine recently developed methods for demography inference and ways of combining them to obtain a unified and robust inference of demographic history. While the different methods used in our study were all shown to be quite powerful, particularly when combined, there is obvious room for improvement. In my view, the most promising developments in this field will come from methods (such as G-PhoCS) that capture all major aspects of the demographic history (divergence times, ancestral population sizes and post-divergence gene flow) in a single analysis. The great advantage of such methods is that they provide a framework for rigorous hypothesis testing and model comparison. In principle, the fully Bayesian nature of G-PhoCS enables this quite naturally through the use of Bayes factors for comparison of different sets of model assumptions. A Bayes factor is the ratio of the marginal likelihoods of the data under two models, which equals the ratio of their posterior probabilities when the models are given equal prior weight. Throughout this study, we experimented with various simple ways of estimating Bayes factors based on the data likelihoods of the MCMC samples generated by G-PhoCS, but we were not able to robustly capture the differences in likelihoods of genealogies sampled for the different hypotheses. Solving this important problem will require additional work, but it is definitely within reach. Another important set of extensions involves using richer models that rely on weaker sets of assumptions. This includes modeling recombination, gradual changes in ancestral population sizes, and more realistic models for gene flow. Progress is being made in these directions as well (see, for example, our recent work on ancestral recombination graph inference), and there is much room for optimism that the next generation of demography inference methods, coupled with emerging genomic data sets, will allow researchers an unprecedented capability to investigate the demographic history of additional species.
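
As an illustration of why simple likelihood-based Bayes factor estimates can be fragile, the sketch below implements the harmonic mean estimator of the marginal likelihood from per-sample log-likelihoods of an MCMC run. The post does not say which estimators were tried, so the choice of the harmonic mean estimator here is an assumption, and the function names are hypothetical. It is shown only because it is a widely known estimator of this kind.

```python
import math
from typing import Sequence


def log_marginal_harmonic_mean(log_likelihoods: Sequence[float]) -> float:
    """Estimate log p(data | model) as -log(mean(exp(-logL_i))).

    The logL_i are data log-likelihoods evaluated at posterior MCMC samples.
    A log-sum-exp shift keeps the computation numerically stable.
    """
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("need at least one MCMC sample")
    negated = [-ll for ll in log_likelihoods]
    shift = max(negated)
    log_sum = shift + math.log(sum(math.exp(x - shift) for x in negated))
    return -(log_sum - math.log(n))


def log_bayes_factor(log_liks_a: Sequence[float], log_liks_b: Sequence[float]) -> float:
    """Log Bayes factor of model A over model B from two separate MCMC runs."""
    return log_marginal_harmonic_mean(log_liks_a) - log_marginal_harmonic_mean(log_liks_b)


if __name__ == "__main__":
    run_a = [-1000.0, -1001.5, -999.8, -1000.7]
    run_b = [-1003.2, -1002.9, -1004.1, -1003.5]
    print(log_bayes_factor(run_a, run_b))
    # A single poorly fitting sample dominates the estimate:
    print(log_marginal_harmonic_mean(run_a + [-1100.0]))
```

The estimate is driven almost entirely by the lowest-likelihood samples in each run, which is one reason estimators of this type can have very high variance and fail to separate competing hypotheses reliably.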

A C++ template library for efficient forward-time population genetic simulation of large populations

Kevin R. Thornton
(Submitted on 15 Jan 2014)

fwdpp is a C++ library of routines intended to facilitate the development of forward-time simulations under arbitrary mutation and fitness models. The library design provides a combination of speed, low memory overhead, and modeling flexibility not currently available from other forward simulation tools. The library is particularly useful when the simulation of large populations is required, as programs implemented using the library are much more efficient than other available forward simulation programs.
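
For context, the sketch below shows what a forward-time simulation does in its simplest form: a haploid Wright-Fisher population tracked generation by generation at a single biallelic locus. It is written in Python for brevity, does not use or imitate the fwdpp API, and all names and parameters are hypothetical. Libraries such as fwdpp exist precisely because naive per-individual loops like this one scale poorly to large populations.

```python
import random
from typing import List, Optional


def wright_fisher_trajectory(
    pop_size: int,
    generations: int,
    init_freq: float,
    selection: float = 0.0,
    seed: Optional[int] = None,
) -> List[float]:
    """Simulate the derived-allele frequency forward in time.

    Each generation, selection reweights the allele frequency (the derived
    allele has relative fitness 1 + selection), then pop_size offspring are
    drawn independently, which is binomial sampling and models genetic drift.
    """
    if selection <= -1.0:
        raise ValueError("selection must be greater than -1")
    rng = random.Random(seed)
    freq = init_freq
    trajectory = [freq]
    for _ in range(generations):
        weighted = freq * (1.0 + selection)
        p_derived = weighted / (weighted + (1.0 - freq))
        derived_count = sum(1 for _ in range(pop_size) if rng.random() < p_derived)
        freq = derived_count / pop_size
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    traj = wright_fisher_trajectory(pop_size=200, generations=50, init_freq=0.1, seed=1)
    print(traj[-1])
```

The cost of this loop grows with pop_size times generations, so simulating large populations in this style quickly becomes impractical.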