Markov mutation models on Yule trees: pairwise species comparisons

Willem H. Mulder, Forrest W. Crawford
Subjects: Populations and Evolution (q-bio.PE)

Efforts to reconstruct phylogenetic trees and understand evolutionary processes depend fundamentally on stochastic models of speciation and mutation. The simplest continuous-time model for speciation in phylogenetic trees is the Yule process, in which new species are “born” from existing lineages at a constant rate. Recent work has illuminated some of the structural properties of Yule trees, but it remains mostly unknown how these properties affect sequence and trait patterns observed at the tips of the phylogenetic tree. Understanding the interplay between speciation and mutation under simple models of evolution is essential for deriving valid phylogenetic inference methods and gives insight into the optimal design of phylogenetic studies. In this work, we derive the probability distribution of interspecies covariance under Brownian motion and Ornstein-Uhlenbeck processes on a Yule tree. We compute the probability distribution of the number of mutations shared between two randomly chosen taxa in a Yule tree under several mutation models. These results suggest summary measures of phylogenetic information content, illuminate the correlation between site patterns in sequences or traits of related organisms, and provide heuristics for experimental design and reconstruction of phylogenetic trees.
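As a rough illustration of the quantity studied here (not the authors' derivation), the following Python sketch simulates a Yule tree and measures the time two tips share before their lineages diverge. Under Brownian motion with rate sigma^2, the covariance between two tip values is sigma^2 times that shared time. The function names, the choice to start from a single lineage at time 0, and the path encoding are all assumptions made for this example.

```python
import random


def simulate_yule_paths(n_tips, birth_rate, rng):
    """Grow a Yule tree from one lineage until it has n_tips tips.

    Each tip is returned as its list of (split_time, side) steps from the
    root; the second return value is the time of the last split.
    """
    lineages = [[]]  # one starting lineage with an empty history (assumption)
    t = 0.0
    while len(lineages) < n_tips:
        # With k lineages, the waiting time to the next split is Exp(k * rate).
        t += rng.expovariate(birth_rate * len(lineages))
        parent = lineages.pop(rng.randrange(len(lineages)))
        lineages.append(parent + [(t, 0)])
        lineages.append(parent + [(t, 1)])
    return lineages, t


def shared_time(path_a, path_b):
    """Time from the origin to the split where two tips diverge."""
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            # Same split time, different side: this is the divergence point.
            return step_a[0]
    raise ValueError("paths belong to the same tip")


def bm_covariance(shared, sigma2):
    """Brownian-motion covariance between two tips with the given shared time."""
    return sigma2 * shared


if __name__ == "__main__":
    rng = random.Random(42)
    paths, height = simulate_yule_paths(50, 1.0, rng)
    tip_a, tip_b = rng.sample(paths, 2)
    s = shared_time(tip_a, tip_b)
    print("tree height:", height)
    print("shared time:", s, "BM covariance:", bm_covariance(s, 1.0))
```

Repeating the draw over many simulated trees gives an empirical distribution of the shared time, which could be compared against analytical results of the kind the abstract describes.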

Phylogenetic tree shapes resolve disease transmission patterns

Jennifer Gardy, Caroline Colijn

Whole genome sequencing is becoming popular as a tool for understanding outbreaks of communicable diseases, with phylogenetic trees being used to identify individual transmission events or to characterize outbreak-level overall transmission dynamics. Existing methods to infer transmission dynamics from sequence data rely on well-characterised infectious periods, epidemiological and clinical meta-data which may not always be available, and typically require computationally intensive analysis focussing on the branch lengths in phylogenetic trees. We sought to determine whether the topological structures of phylogenetic trees contain signatures of the overall transmission patterns underlying an outbreak. Here we use simulated outbreaks to train and then test computational classifiers. We test the method on data from two real-world outbreaks. We find that different transmission patterns result in quantitatively different phylogenetic tree shapes. We describe five topological features that summarize a phylogeny’s structure and find that computational classifiers based on these are capable of predicting an outbreak’s transmission dynamics. The method is robust to variations in the transmission parameters and network types, and recapitulates known epidemiology of previously characterized real-world outbreaks. We conclude that there are simple structural properties of phylogenetic trees which, when combined, can distinguish communicable disease outbreaks with a super-spreader, homogeneous transmission, and chains of transmission. This is possible using genome data alone, and can be done during an outbreak. We discuss the implications for management of outbreaks.
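The abstract does not name its five topological features. To give a sense of what a branch-length-free summary of tree shape looks like, the sketch below computes two standard statistics, the Colless imbalance and the number of cherries, on a tree written as nested tuples. These are illustrative choices and may not match the features the authors used; the tree encoding and function names are assumptions.

```python
def tip_count(tree):
    """Number of tips below a node; any non-tuple value is treated as a tip."""
    if not isinstance(tree, tuple):
        return 1
    return sum(tip_count(child) for child in tree)


def colless_imbalance(tree):
    """Sum over internal nodes of |left tips - right tips| (binary trees only)."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    here = abs(tip_count(left) - tip_count(right))
    return here + colless_imbalance(left) + colless_imbalance(right)


def cherry_count(tree):
    """Number of internal nodes whose two children are both tips."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    if not isinstance(left, tuple) and not isinstance(right, tuple):
        return 1
    return cherry_count(left) + cherry_count(right)


if __name__ == "__main__":
    ladder = ((("a", "b"), "c"), "d")      # maximally unbalanced shape
    balanced = (("a", "b"), ("c", "d"))    # fully balanced shape
    for name, tree in [("ladder", ladder), ("balanced", balanced)]:
        print(name, colless_imbalance(tree), cherry_count(tree))
```

Statistics like these depend only on topology, so they can be computed from a tree without well-calibrated branch lengths and then passed to a classifier as features.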

Reassortment between influenza B lineages and the emergence of a co-adapted PB1-PB2-HA gene complex

Gytis Dudas, Trevor Bedford, Samantha Lycett, Andrew Rambaut
Comments: 33 pages, 21 figures
Subjects: Populations and Evolution (q-bio.PE)

Influenza B viruses are increasingly being recognized as major contributors to morbidity attributed to seasonal influenza. Currently circulating influenza B isolates are known to belong to two antigenically distinct lineages referred to as B/Victoria and B/Yamagata. Frequent exchange of genomic segments of these two lineages has been noted in the past, but the observed patterns of reassortment have not been formalized in detail. We investigate inter-lineage reassortments by comparing phylogenetic trees across genomic segments. Our analyses indicate that of the 8 segments of influenza B viruses only PB1, PB2 and HA segments maintained separate Victoria and Yamagata lineages and that currently circulating strains possess PB1, PB2 and HA segments derived entirely from one or the other lineage; other segments have repeatedly reassorted between lineages thereby reducing genetic diversity. We argue that this difference between segments is due to selection against reassortant viruses with mixed lineage PB1, PB2 and HA segments. Given sufficient time and continued recruitment to the reassortment-isolated PB1-PB2-HA gene complex, we expect influenza B viruses to eventually undergo sympatric speciation.

Local description of phylogenetic group-based models

Marta Casanellas, Jesús Fernández-Sánchez, Mateusz Michałek
(Submitted on 27 Feb 2014)

Motivated by phylogenetics, our aim is to obtain a system of equations that define a phylogenetic variety on an open set containing the biologically meaningful points. In this paper we consider phylogenetic varieties defined via group-based models. For any finite abelian group G, we provide an explicit construction of codim X phylogenetic invariants (polynomial equations) of degree at most |G| that define the variety X on a Zariski open set U. The set U contains all biologically meaningful points when G is the group of the Kimura 3-parameter model. In particular, our main result confirms a conjecture by the third author and, on the set U, a couple of conjectures by Bernd Sturmfels and Seth Sullivant.

Implications of uniformly distributed, empirically informed priors for phylogeographical model selection: A reply to Hickerson et al

Jamie R. Oaks, Charles W. Linkem, Jeet Sukumaran
(Submitted on 26 Feb 2014)

Biogeographers often seek to explain speciation in terms of geographical phenomena. Establishing that a set of population splitting events occurred at the same time can be a persuasive argument that a set of taxa were affected by the same geographic events. Huang et al. (2011) introduced an approximate Bayesian approach (implemented in the software msBayes) to estimate the probabilities of models in which multiple sets of taxa diverge simultaneously. Oaks et al. (2013) used this model-choice framework to study 22 pairs of vertebrates distributed across the Philippines; they also studied the behavior of the approach using simulations. Oaks et al. (2013) found the model was very sensitive to the prior and had low power to detect variation in divergence times. This was not surprising in light of a rich statistical literature showing the marginal likelihood of a model is sensitive to vague priors. Because this sensitivity to prior assumptions affects the crucial insights a researcher who employs msBayes seeks to gain, Oaks et al. (2013) recommended users of the approach carefully assess the robustness of their conclusions to different priors. According to Hickerson et al. (2014), the lack of robustness was due to broad priors leading to inadequate numbers of simulations. They proposed a model-averaging approach using narrow, empirically informed uniform priors. Here, we demonstrate their approach is dangerous in the sense that the empirically derived priors often exclude the true values of the parameters. We question the value of adopting an empirical-Bayesian stance for this problem, because it can mislead model posterior probabilities. The robust approach of conducting analyses under a variety of priors can reveal sensitivity and communicate assumptions underlying inference. Furthermore, simulations provide insight into the temporal resolution of the method and guide interpretation of results.

An Improved Approximate-Bayesian Model-choice Method for Estimating Shared Evolutionary History

Jamie R. Oaks
(Submitted on 25 Feb 2014)

To understand the processes that generate biodiversity, it is important to account for large-scale processes that affect the evolutionary history of groups of co-distributed populations of organisms. Such events predict temporally clustered divergence times, a pattern that can be estimated using genetic data from co-distributed species. I introduce a new approximate-Bayesian method for comparative phylogeographical model-choice that estimates the temporal distribution of divergences across taxa from multi-locus DNA sequence data. The model is an extension of that implemented in msBayes. By reparameterizing the model, introducing more flexible priors on demographic and divergence-time parameters, and implementing a non-parametric Dirichlet-process prior over divergence models, I improved the robustness, accuracy, and power of the method for estimating shared evolutionary history across taxa. The results demonstrate the improved performance of the new method is due to (1) more appropriate priors on divergence-time and demographic parameters that avoid prohibitively small marginal likelihoods for models with more divergence events, and (2) the Dirichlet-process providing a flexible prior on divergence histories that does not strongly disfavor models with intermediate numbers of divergence events. The new method yields more robust estimates of posterior uncertainty, and thus greatly reduces the tendency of the model to incorrectly estimate biogeographically interesting models with strong support.
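To illustrate what a Dirichlet-process prior over divergence models means in practice, the sketch below draws a partition of taxa into shared divergence events using the Chinese restaurant process, the standard sequential construction of a Dirichlet-process partition. This is a generic sketch, not the implementation described in the abstract; the function name, the concentration value, and the number of taxa are assumptions chosen for the example.

```python
import random


def sample_crp_partition(n_items, concentration, rng):
    """Draw one partition of n_items from a Chinese restaurant process.

    Returns (assignments, cluster_sizes), where assignments[i] is the index
    of the divergence event that item i is assigned to.
    """
    cluster_sizes = []
    assignments = []
    for _ in range(n_items):
        # Join existing event k with weight size_k, or open a new event
        # with weight equal to the concentration parameter.
        weights = cluster_sizes + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments, cluster_sizes


if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical setting: 10 taxon pairs, concentration 1.5.
    assignments, sizes = sample_crp_partition(10, 1.5, rng)
    print("event index per taxon pair:", assignments)
    print("number of divergence events:", len(sizes))
```

Larger concentration values place more prior mass on partitions with many distinct divergence events, while smaller values favor a few shared events, so the parameter controls how the prior spreads over models with different numbers of events.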

Tracing evolutionary links between species

Mike Steel
(Submitted on 16 Feb 2014)

The idea that all life on earth traces back to a common beginning dates back at least to Charles Darwin’s Origin of Species. Ever since, biologists have tried to piece together parts of this ‘tree of life’ based on what we can observe today: fossils, and the evolutionary signal that is present in the genomes and phenotypes of different organisms. Mathematics has played a key role in helping transform genetic data into phylogenetic (evolutionary) trees and networks. Here, I will explain some of the central concepts and basic results in phylogenetics, which benefit from several branches of mathematics, including combinatorics, probability and algebra.

Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees

Daniel L. Rabosky
(Submitted on 26 Jan 2014)

A number of methods have been developed to infer differential rates of species diversification through time and among clades using time-calibrated phylogenetic trees. However, we lack a general framework that can delineate and quantify heterogeneous mixtures of dynamic processes within single phylogenies. I developed a method that can identify arbitrary numbers of time-varying diversification processes on phylogenies without specifying their locations in advance. The method uses reversible-jump Markov Chain Monte Carlo to move between model subspaces that vary in the number of distinct diversification regimes. The model assumes that changes in evolutionary regimes occur across the branches of phylogenetic trees under a compound Poisson process and explicitly accounts for rate variation through time and among lineages. Using simulated datasets, I demonstrate that the method can be used to quantify complex mixtures of time-dependent, diversity-dependent, and constant-rate diversification processes. I compared the performance of the method to the MEDUSA model of rate variation among lineages. As an empirical example, I analyzed the history of speciation and extinction during the radiation of modern whales. The method described here will greatly facilitate the exploration of macroevolutionary dynamics across large phylogenetic trees, which may have been shaped by heterogeneous mixtures of distinct evolutionary processes.

Sequence Capture Versus Restriction Site Associated DNA Sequencing for Phylogeography

Michael G. Harvey, Brian Tilston Smith, Travis C. Glenn, Brant C. Faircloth, Robb T. Brumfield
(Submitted on 22 Dec 2013)

Genomic datasets generated with massively parallel sequencing methods have the potential to propel systematics in new and exciting directions, but selecting appropriate markers and methods is not straightforward. We applied two approaches with particular promise for systematics, restriction site associated DNA sequencing (RAD-Seq) and sequence capture (Seq-cap) of ultraconserved elements (UCEs), to the same set of samples from a non-model, Neotropical bird. We found that both RAD-Seq and Seq-cap produced genomic datasets containing thousands of loci and SNPs and that the inferred population assignments and species trees were concordant between datasets. However, model-based estimates of demographic parameters differed between datasets, particularly when we estimated the parameters using a method based on allele frequency spectra. The differences we observed may result from differences in assembly, alignment, and filtering of sequence data between methods, and our findings suggest that caution is warranted when using allele frequencies to estimate parameters from low-coverage sequencing data. We further explored the differences between methods using simulated Seq-cap- and RAD-Seq-like datasets. Analyses of simulated data suggest that increasing the number of loci from 500 to 5000 increased phylogenetic concordance factors and the accuracy and precision of demographic parameter estimates, but increasing the number of loci past 5000 resulted in minimal gains. Increasing locus length from 64 bp to 500 bp improved phylogenetic concordance factors and minimal gains were observed with loci longer than 500 bp, but locus length did not influence the accuracy and precision of demographic parameter estimates. We discuss our results relative to the diversity of data collection methods available, and we provide advice for harnessing next-generation sequencing for systematics research.

piBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios

Filip Bielejec, Philippe Lemey, Luiz Max Carvalho, Guy Baele, Andrew Rambaut, Marc A. Suchard
Comments: 13 pages, 2 figures, 1 table
Subjects: Populations and Evolution (q-bio.PE)

Background: Simulated nucleotide or amino acid sequences are frequently used to assess the performance of phylogenetic reconstruction methods. BEAST, a Bayesian statistical framework that focuses on reconstructing time-calibrated molecular evolutionary processes, supports a wide array of evolutionary models, but lacked matching machinery for simulation of character evolution along phylogenies.
Results: We present a flexible Monte Carlo simulation tool, called piBUSS, that employs the BEAGLE high performance library for phylogenetic computations within BEAST to rapidly generate large sequence alignments under complex evolutionary models. piBUSS sports a user-friendly graphical user interface (GUI) that allows combining a rich array of models across an arbitrary number of partitions. A command-line interface mirrors the options available through the GUI and facilitates scripting in large-scale simulation studies. Analogous to BEAST model and analysis setup, more advanced simulation options are supported through an extensible markup language (XML) specification, which in addition to generating sequence output, also allows users to combine simulation and analysis in a single BEAST run.
Conclusions: piBUSS offers a unique combination of flexibility and ease-of-use for sequence simulation under realistic evolutionary scenarios. Through different interfaces, piBUSS supports simulation studies ranging from modest endeavors for illustrative purposes to complex and large-scale assessments of evolutionary inference procedures. The software aims at implementing new models and data types that are continuously being developed as part of BEAST/BEAGLE.