Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees

Daniel L. Rabosky
(Submitted on 26 Jan 2014)

A number of methods have been developed to infer differential rates of species diversification through time and among clades using time-calibrated phylogenetic trees. However, we lack a general framework that can delineate and quantify heterogeneous mixtures of dynamic processes within single phylogenies. I developed a method that can identify arbitrary numbers of time-varying diversification processes on phylogenies without specifying their locations in advance. The method uses reversible-jump Markov Chain Monte Carlo to move between model subspaces that vary in the number of distinct diversification regimes. The model assumes that changes in evolutionary regimes occur across the branches of phylogenetic trees under a compound Poisson process and explicitly accounts for rate variation through time and among lineages. Using simulated datasets, I demonstrate that the method can be used to quantify complex mixtures of time-dependent, diversity-dependent, and constant-rate diversification processes. I compared the performance of the method to the MEDUSA model of rate variation among lineages. As an empirical example, I analyzed the history of speciation and extinction during the radiation of modern whales. The method described here will greatly facilitate the exploration of macroevolutionary dynamics across large phylogenetic trees, which may have been shaped by heterogeneous mixtures of distinct evolutionary processes.

Estimate of Within Population Incremental Selection Through Branch Imbalance in Lineage Trees

Gilad Liberman, Jennifer Benichou, Lea Tsaban, Yaakov Maman, Jacob Glanville, Yoram Louzoun

Incremental selection within a population, defined as a limited fitness change following a mutation, is an important aspect of many evolutionary processes and can significantly affect a large number of mutations through the genome. Strongly advantageous or deleterious mutations are detected through the fixation of mutations in the population, using the ratio of synonymous to non-synonymous mutations in sequences. There are currently no precise methods to estimate incremental selection occurring over limited periods. Here we provide such a detailed method for the first time and show its precision and its applicability to the genomic analysis of selection. A special case of evolution is rapid, short-term micro-evolution, where organisms are under constant adaptation, occurring for example in viruses infecting a new host, B cells mutating during a germinal center reaction, or mitochondria evolving within a given host. The proposed method is a novel mixed lineage tree/sequence-based method to detect within-population selection as defined by the effect of mutations on the average number of offspring. Specifically, we propose to measure the log of the ratio between the number of leaves in lineage tree branches following synonymous and non-synonymous mutations. This method does not require a baseline model and is practically unaffected by sampling biases. To show the wide applicability of this method, we apply it to multiple cases of micro-evolution and show that it can detect genes and inter-genic regions using the selection rate and detect selection pressures in viral proteins and in the immune response to pathogens.
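
The leaf-count statistic described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the `"S"`/`"NS"` labels, the use of mean leaf counts, and the direction of the ratio (non-synonymous over synonymous) are all assumptions made here for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Mutation type on the branch leading to this node (hypothetical labels):
    # "S" = synonymous, "NS" = non-synonymous, None = no mutation recorded.
    mutation: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def count_leaves(node: Node) -> int:
    """Number of leaves in the subtree rooted at `node`."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


def log_leaf_ratio(root: Node) -> float:
    """Log of (mean leaves below NS branches) / (mean leaves below S branches).

    The ratio direction and the averaging are assumptions of this sketch.
    """
    totals = {"S": [0, 0], "NS": [0, 0]}  # [sum of leaf counts, branch count]
    stack = [root]
    while stack:
        node = stack.pop()
        if node.mutation in totals:
            totals[node.mutation][0] += count_leaves(node)
            totals[node.mutation][1] += 1
        stack.extend(node.children)
    if totals["S"][1] == 0 or totals["NS"][1] == 0:
        raise ValueError("need at least one S branch and one NS branch")
    mean_s = totals["S"][0] / totals["S"][1]
    mean_ns = totals["NS"][0] / totals["NS"][1]
    return math.log(mean_ns / mean_s)
```

Under this convention, a positive value means that branches following non-synonymous mutations leave more descendants on average than branches following synonymous ones.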

Single nucleotide polymorphisms shed light on correlations between environmental variables and adaptive genetic divergence among populations in Oncorhynchus keta

Xilin Deng, Philippe Henry

Identifying the genetic and ecological basis of adaptation is of immense importance in evolutionary biology. In our study, we applied a panel of 58 biallelic single nucleotide polymorphisms (SNPs) for the economically and culturally important salmonid Oncorhynchus keta. Samples included 4164 individuals from 43 populations ranging from Coastal Western Alaska to southern British Columbia and northern Washington. Signatures of natural selection were detected by identifying seven outlier loci using two independent approaches: one based on outlier detection and another based on environmental correlations. Evidence of divergent selection at two candidate SNP loci, Oke_RFC2-168 and Oke_MARCKS-362, indicates significant environmental correlations, particularly with the number of frost-free days (NFFD). Important associations found between environmental variables and outlier loci indicate that those environmental variables could be the major driving forces of allele frequency divergence at the candidate loci. NFFD, in particular, may play an important adaptive role in shaping genetic variation in O. keta. Correlations between divergent selection and local environmental variables will help shed light on processes of natural selection and molecular adaptation to local environmental conditions.

On the representation of de Bruijn graphs

Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson, Paul Medvedev
(Submitted on 21 Jan 2014)

The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitation of these types of approaches. We further design and implement a general data structure (DBGFM) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of DBGFM, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use DBGFM.
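
The abstract does not define frequency-based minimizers, so the following Python sketch shows only one plausible reading: the minimizer of a k-mer is its least frequent l-mer in the dataset, with ties broken lexicographically, and k-mers are bucketed by that minimizer. The function names are hypothetical, reverse complements are ignored, and nothing here reflects the DBGFM code itself.

```python
from collections import Counter, defaultdict


def substrings(seq, size):
    """All substrings of length `size` in `seq`, in order."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def lmer_counts(reads, l):
    """Count every l-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(substrings(read, l))
    return counts


def frequency_minimizer(kmer, l, counts):
    """Least frequent l-mer inside `kmer`; ties broken lexicographically (assumed rule)."""
    return min(substrings(kmer, l), key=lambda m: (counts[m], m))


def partition_by_minimizer(reads, k, l):
    """Group the distinct k-mers of the reads by their frequency-based minimizer."""
    counts = lmer_counts(reads, l)
    buckets = defaultdict(set)
    for read in reads:
        for kmer in substrings(read, k):
            buckets[frequency_minimizer(kmer, l, counts)].add(kmer)
    return dict(buckets)
```

Preferring rare l-mers tends to spread k-mers over more, smaller buckets than a purely lexicographic order, which is the usual motivation for processing a graph one partition at a time in limited memory.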

Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly repetitive transposable elements

Rajiv C McCoy, Ryan W Taylor, Timothy A Blauwkamp, Joanna L Kelley, Michael Kertesz, Dmitry Pushkarev, Dmitri A Petrov, Anna-Sophie Fiston-Lavier

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, mostly due to the presence of repeats, which cannot be reconstructed unambiguously with short read data alone. One class of repeats, called transposable elements (TEs), is particularly problematic due to high sequence identity, high copy number, and a capacity to induce complex genomic rearrangements. Despite their importance to genome function and evolution, most current de novo assembly approaches cannot resolve TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 2-15 Kbp with an extremely low error rate (0.05%). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain yw;cn,bw,sp) achieving an NG50 contig size of 77.9 Kbp and covering 97.2% of the current reference genome (including heterochromatin). TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recover and accurately place 80.4% of annotated transposable elements with perfect identity to the current reference genome. As TEs are complex and highly repetitive features that are ubiquitous in genomes across the tree of life, TruSeq synthetic long-read technology offers a powerful and inexpensive approach to drastically improve de novo assemblies of whole genomes.
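
For readers unfamiliar with the NG50 statistic quoted above, here is a short sketch of its usual definition: the length of the contig at which the longest contigs, taken in decreasing order, first span at least half of the genome size (N50 uses the total assembly size instead). The function name and the choice to return `None` for an assembly that is too small are assumptions of this example.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 contig length, or None if the contigs span under half the genome.

    `genome_size` is the reference or estimated genome length supplied by the caller.
    """
    target = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None
```

Because the threshold is tied to the genome size rather than to the assembly itself, NG50 values can be compared across assemblies of the same genome even when their total lengths differ.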

The evolution of moment generating functions for the Wright Fisher model of population genetics

Tat Dat Tran, Julian Hofrichter, Juergen Jost
(Submitted on 21 Jan 2014)

We derive and apply a partial differential equation for the moment generating function of the Wright-Fisher model of population genetics.
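
The paper's partial differential equation is not reproduced in the abstract, so the sketch below only illustrates the object being studied. It propagates the exact distribution of the neutral two-allele Wright-Fisher chain on n gene copies and evaluates the moment generating function E[exp(s X_t)] of the allele frequency X_t. The restriction to the neutral two-allele case and all function names are assumptions of this example.

```python
import math


def wf_step(dist, n):
    """Advance the allele-count distribution by one generation of binomial resampling."""
    new = [0.0] * (n + 1)
    for i, p_i in enumerate(dist):
        if p_i == 0.0:
            continue
        x = i / n  # current allele frequency
        for j in range(n + 1):
            new[j] += p_i * math.comb(n, j) * x**j * (1 - x) ** (n - j)
    return new


def wf_distribution(n, i0, generations):
    """Distribution of the allele count after `generations`, starting from i0 copies."""
    dist = [0.0] * (n + 1)
    dist[i0] = 1.0
    for _ in range(generations):
        dist = wf_step(dist, n)
    return dist


def mgf(dist, n, s):
    """Moment generating function E[exp(s * X)] of the allele frequency X = count / n."""
    return sum(p * math.exp(s * j / n) for j, p in enumerate(dist))
```

Two sanity checks follow from standard results for this chain: the mean frequency stays at its initial value, and the expected heterozygosity E[X(1 - X)] shrinks by a factor of (1 - 1/n) per generation.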

A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data

David Coil, Guillaume Jospin, Aaron E. Darling
(Submitted on 21 Jan 2014)

Motivation: Open-source bacterial genome assembly remains inaccessible to many biologists due to its complexity. Few software solutions exist that are capable of automating all steps in the process of de novo genome assembly from Illumina data.
Results: A5-miseq can produce high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5-miseq does this by automating the process of adapter trimming, quality filtering, error correction, contig and scaffold generation, and detection of misassemblies. Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, uses read pairing information during contig generation, and includes several improvements to read trimming. Together these changes result in substantially improved assemblies that recover a more complete set of reference genes than previous methods.
Availability: A5-miseq is licensed under the GPL open source license. Source code and precompiled binaries for Mac OS X 10.6+ and Linux 2.6.15+ are available from this http URL

Coalescence 2.0: a multiple branching of recent theoretical developments and their applications

Aurelien Tellier, Christophe Lemaire
(Submitted on 21 Jan 2014)

Population genetics theory has laid the foundations for genomics analyses, including the recent burst in genome scans for selection and statistical inference of past demographic events in many prokaryote, animal and plant species. Identifying SNPs under natural selection and underpinning species adaptation relies on disentangling the respective contribution of random processes (mutation, drift, migration) from that of selection on nucleotide variability. Most theory and statistical tests have been developed using the Kingman coalescent theory based on the Wright-Fisher population model. However, these theoretical models rely on biological and life-history assumptions which may be violated in many prokaryote, fungal, animal or plant species. Recent theoretical developments of the so-called multiple merger coalescent models are reviewed here (Λ-coalescent, beta-coalescent, Bolthausen-Sznitman, Ξ-coalescent). We explain how these new models take into account various pervasive ecological and biological characteristics, life history traits or life cycles which were not accounted for in previous theories, such as 1) the skew in offspring production typical of marine species, 2) fast adapting microparasites (virus, bacteria and fungi) exhibiting large variation in population sizes during epidemics, 3) the peculiar life cycles of fungi and bacteria alternating sexual and asexual cycles, and 4) the high rates of extinction-recolonization in spatially structured populations. We finally discuss the relevance of multiple merger models for the detection of SNPs under selection in these species and for population genomics of very large sample sizes, and suggest that the conclusions of previous population genetics studies may need to be re-examined.
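
As a concrete instance of a multiple merger model, the Beta(2 - α, α)-coalescent has a closed-form rate at which any particular group of k out of b lineages merges: λ(b, k) = B(k - α, b - k + α) / B(2 - α, α), where B is the beta function. The sketch below evaluates it in Python. The function names are hypothetical, and α is restricted to the open interval (0, 2), where α = 1 gives the Bolthausen-Sznitman coalescent.

```python
import math


def log_beta(x, y):
    """Logarithm of the beta function B(x, y), computed via log-gamma for stability."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_coalescent_rate(b, k, alpha):
    """Rate at which one specific set of k lineages out of b merges, Beta(2-alpha, alpha) case."""
    if not 0 < alpha < 2:
        raise ValueError("alpha must lie in (0, 2)")
    if not 2 <= k <= b:
        raise ValueError("need 2 <= k <= b")
    return math.exp(log_beta(k - alpha, b - k + alpha) - log_beta(2 - alpha, alpha))


def total_merger_rates(b, alpha):
    """Total rate of k-mergers among b lineages, for each k, counting all C(b, k) subsets."""
    return {k: math.comb(b, k) * beta_coalescent_rate(b, k, alpha) for k in range(2, b + 1)}
```

As α approaches 2 the rates for k > 2 vanish and only pairwise mergers remain, which recovers Kingman-like behavior; smaller α puts more weight on large simultaneous mergers, the pattern associated with strongly skewed offspring distributions.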

The life cycle of Drosophila orphan genes

Nicola Palmieri, Carolin Kosiol, Christian Schlötterer
(Submitted on 20 Jan 2014)

Orphans are genes restricted to a single phylogenetic lineage and emerge at high rates. While this predicts an accumulation of genes, the gene number has remained remarkably constant through evolution. This paradox has not yet been resolved. Because orphan genes have been mainly analyzed over long evolutionary time scales, orphan loss has remained unexplored. Here we study the patterns of orphan turnover among close relatives in the Drosophila obscura group. We show that orphans are not only emerging at a high rate, but that they are also rapidly lost. Interestingly, recently emerged orphans are more likely to be lost than older ones. Furthermore, highly expressed orphans with a strong male-bias are more likely to be retained. Since both lost and retained orphans show similar evolutionary signatures of functional conservation, we propose that orphan loss is not driven by high rates of sequence evolution, but reflects lineage specific functional requirements.

Demography and the age of rare variants

Iain Mathieson, Gil McVean
(Submitted on 16 Jan 2014)

Recently, large whole-genome sequencing projects have provided access to much of the rare variation in human populations. This variation is highly informative about population structure and recent demography. In this paper, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how this information can detect and quantify historical relationships between populations. We investigate the distribution of the age of f2 variants in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous variation across populations. The median age of f2 variants shared within continents is 50 to 160 generations for Europe and Asia, and 170 to 320 generations for Africa. Variants shared between continents are much older, with median ages ranging from 320 to 670 generations between Europe and Asia, and 1,000 to 2,400 generations between African and non-African populations. The distribution of the ages of variants shared across populations is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the signature of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.
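
The abstract does not define f2 variants. The sketch below assumes the common reading that they are variants carried by exactly two haplotypes in the sample (doubletons), and tallies which pairs of populations share them. It covers only this first bookkeeping step, not the age estimation from haplotype sharing; the 0/1 haplotype encoding, the population labels, and the function name are hypothetical.

```python
from collections import Counter


def f2_sharing(haplotypes, populations):
    """Count f2 variants by the (sorted) pair of populations of their two carriers.

    haplotypes: list of equal-length 0/1 lists, one per sampled haplotype
                (1 = carries the variant allele at that site).
    populations: population label for each haplotype, in the same order.
    """
    n_sites = len(haplotypes[0])
    sharing = Counter()
    for site in range(n_sites):
        carriers = [i for i, hap in enumerate(haplotypes) if hap[site] == 1]
        if len(carriers) == 2:  # assumed definition of an f2 variant
            a, b = carriers
            pair = tuple(sorted((populations[a], populations[b])))
            sharing[pair] += 1
    return sharing
```

A table of such counts by population pair is the kind of within-continent versus between-continent comparison that the reported median ages summarize.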