Probabilistic Graphical Model Representation in Phylogenetics
Sebastian Höhna, Tracy A. Heath, Bastien Boussau, Michael J. Landis, Fredrik Ronquist, John P. Huelsenbeck
(Submitted on 9 Dec 2013)
Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (1) reproducibility of an analysis, (2) model development and (3) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and non-specialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies.
Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well suited to teaching statistical models, facilitating communication among phylogeneticists, and developing generic software for simulation and statistical inference.
Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis-Hastings or Gibbs sampling of the posterior distribution.
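The factorization idea can be sketched in a few lines. The example below assumes a toy model that is not taken from the paper: a rate parameter `lam` with an Exponential(1) prior, and observations that are conditionally independent Exponential(`lam`) draws. The function names are hypothetical. Each node of the graph contributes one conditional density, so the joint log density is a sum with one term per node.

```python
import math


def log_exponential_pdf(x, rate):
    """Log density of an Exponential(rate) distribution at x >= 0."""
    return math.log(rate) - rate * x


def joint_log_density(lam, xs, prior_rate=1.0):
    """Joint log density of a two-layer toy graphical model.

    Graph (hypothetical, for illustration only):
        prior_rate -> lam -> x_i   for each observation i
    The joint factorizes as p(lam) * prod_i p(x_i | lam).
    """
    total = log_exponential_pdf(lam, prior_rate)  # node: lam
    for x in xs:  # plate over the observations
        total += log_exponential_pdf(x, lam)  # node: x_i given lam
    return total


observations = [0.5, 1.5, 1.0]
print(joint_log_density(2.0, observations))
```

The loop over observations plays the role of a plate: the same conditional density is repeated once per replicated node.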
The time-dependent reconstructed evolutionary process with a key-role for mass-extinction events
(Submitted on 9 Dec 2013)
The homogeneous reconstructed evolutionary process is a birth-death process without observed extinct lineages. Each species evolves independently with the same diversification rates (speciation rate λ(t) and extinction rate μ(t)) that may change over time. The process is commonly applied to model species diversification where the data are reconstructed phylogenies, e.g., trees reconstructed from present-day molecular data, and used to infer diversification rates.
In the present paper I develop the general probability density of a reconstructed tree under any time-dependent birth-death process. I elaborate on how to adapt this probability density if conditioned on survival of one or two initial lineages, or having sampled n species, and show how to transform between the probability density of a reconstructed tree and the probability density of the speciation times.
I demonstrate the use of the general time-dependent probability density functions by deriving the probability density of a reconstructed tree under a birth-death-shift model with explicit mass-extinction events. I enrich this compendium by providing and discussing several special cases, including: the pure birth process, the pure death process, the birth-death process and the critical branching process. Thus, I provide here most of the commonly used birth-death models in a unified framework (e.g., same condition and same data) with common notation.
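Conditioning on survival is easiest to see in the constant-rate special case. The sketch below is not the paper's general time-dependent density. It uses only the textbook result for a constant-rate birth-death process: the probability that a single lineage has at least one descendant after time t. The function name is hypothetical. The pure birth, pure death, and critical branching cases all fall out of the same expression.

```python
import math


def survival_probability(birth, death, t):
    """P(one lineage leaves at least one descendant after time t)
    under a constant-rate birth-death process.

    Special cases:
      death == 0      (pure birth)      -> 1
      birth == 0      (pure death)      -> exp(-death * t)
      birth == death  (critical branch) -> 1 / (1 + birth * t)
    """
    if birth < 0 or death < 0 or t < 0:
        raise ValueError("rates and time must be non-negative")
    if birth == death:
        # Limit of the general formula as the net rate goes to zero.
        return 1.0 / (1.0 + birth * t)
    net = birth - death  # net diversification rate
    return net / (birth - death * math.exp(-net * t))


for label, lam, mu in [("pure birth", 1.0, 0.0),
                       ("pure death", 0.0, 0.5),
                       ("birth-death", 1.0, 0.5),
                       ("critical", 1.0, 1.0)]:
    print(label, survival_probability(lam, mu, 2.0))
```

Dividing an unconditioned tree density by this probability is the usual way to condition on survival of a single initial lineage in the constant-rate case.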
Species Delimitation using Genome-Wide SNP Data
Adam Leache, Matthew Fujita, Vladimir Minin, Remco Bouckaert
The multi-species coalescent has provided important progress for evolutionary inferences, including increasing the statistical rigor and objectivity of comparisons among competing species delimitation models. However, Bayesian species delimitation methods typically require brute force integration over gene trees via Markov chain Monte Carlo (MCMC), which introduces a large computational burden and precludes their application to genomic-scale data. Here we combine a recently introduced dynamic programming algorithm for estimating species trees that bypasses MCMC integration over gene trees with sophisticated methods for estimating marginal likelihoods, needed for Bayesian model selection, to provide a rigorous and computationally tractable technique for genome-wide species delimitation. We provide a critical yet simple correction that brings the likelihoods of different species trees, and more importantly their corresponding marginal likelihoods, to the same common denominator, which enables direct and accurate comparisons of competing species delimitation models using Bayes factors. We test this approach, which we call Bayes factor delimitation (*with genomic data; BFD*), using common species delimitation scenarios with computer simulations. Varying the number of loci and the number of samples suggests that the approach can distinguish the true model even with few loci and limited samples per species. Misspecification of the prior for population size θ has little impact on support for the true model. We apply the approach to West African forest geckos (Hemidactylus fasciatus complex) using genome-wide SNP data. This new Bayesian method for species delimitation builds on a growing trend for objective species delimitation methods with explicit model assumptions that are easily tested.
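Once log marginal likelihoods are available for each delimitation model, the comparison step is simple arithmetic. The sketch below ranks models and reports 2 ln(BF) of the best model against each of the others, a commonly used scale for Bayes factors. The model names and log marginal likelihood values are invented for illustration and are not results from the paper. The function name is hypothetical.

```python
def rank_by_bayes_factor(log_marginal_likelihoods):
    """Rank models by log marginal likelihood.

    Returns a list of (model_name, two_ln_bf) pairs, best model first,
    where two_ln_bf = 2 * (ln ML_best - ln ML_model). The best model
    therefore has a value of 0, and larger values mean weaker support.
    """
    ranked = sorted(log_marginal_likelihoods.items(),
                    key=lambda item: item[1], reverse=True)
    best_log_ml = ranked[0][1]
    return [(name, 2.0 * (best_log_ml - log_ml)) for name, log_ml in ranked]


# Made-up values, for illustration only.
hypothetical_models = {
    "lump_all": -1250.0,
    "two_species": -1215.5,
    "three_species": -1201.0,
}

for name, two_ln_bf in rank_by_bayes_factor(hypothetical_models):
    print(f"{name}: 2 ln BF vs best = {two_ln_bf:.1f}")
```

The comparison is only meaningful when all marginal likelihoods are computed on the same data and share the same normalization, which is the purpose of the correction described in the abstract.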
Generation of high-resolution a priori Y-chromosome phylogenies using “next-generation” sequencing data
Gregory R Magoon, Raymond H Banks, Christian Rottensteiner, Bonnie E Schrack, Vincent O Tilroe, Andrew J Grierson
An approach for generating high-resolution a priori maximum parsimony Y-chromosome (chrY) phylogenies based on SNP and small INDEL variant data from massively-parallel short-read (next-generation) sequencing data is described; the tree-generation methodology produces annotations localizing mutations to individual branches of the tree, along with indications of mutation placement uncertainty in cases for which “no-calls” (through lack of mapped reads or otherwise) at a particular site preclude a precise placement of the mutation. The approach leverages careful variant site filtering and a novel iterative reweighting procedure to generate high-accuracy trees while considering variants in regions of chrY that had previously been excluded from analyses based on short-read sequencing data. It is argued that the proposed approach is also superior to previous region-based filtering approaches in that it adapts to the quality of the underlying data and will automatically allow the scope of sites considered to expand as the underlying data quality improves (e.g. through longer read lengths). Key related issues, including calling of genotypes for the hemizygous chrY, reliability of variant results, read mismappings and “heterozygous” genotype calls, and the mutational stability of different variants are discussed and taken into account. The methodology is demonstrated through application to a dataset consisting of 1292 male samples from diverse populations and haplogroups, with the majority coming from low-coverage sequencing by the 1000 Genomes Project. Application of the tree-generation approach to these data produces a tree involving over 120,000 chrY variant sites (about 45,000 sites if singletons are excluded). The utility of this approach in refining the Y-chromosome phylogenetic tree is demonstrated by examining results for several haplogroups.
The results indicate a number of new branches on the Y-chromosome phylogenetic tree, many of them subdividing known branches, but also including some that inform the presence of additional levels along the trunk of the tree. Finally, opportunities for extensions of this phylogenetic analysis approach to other types of genetic data are examined.
The inference of gene trees with species trees
Gergely J. Szöllosi, Eric Tannier, Vincent Daubin, Bastien Boussau
(Submitted on 4 Nov 2013)
Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.
Fighting network space: it is time for an SQL-type language to filter phylogenetic networks
Steven Kelk, Simone Linz, David A. Morrison
(Submitted on 25 Oct 2013)
The search space of rooted phylogenetic trees is vast and a major research focus of recent decades has been the development of algorithms to effectively navigate this space. However, this space is tiny when compared with the space of rooted phylogenetic networks, and navigating this enlarged space remains a poorly understood problem. This, and the difficulty of biologically interpreting such networks, obstructs adoption of networks as tools for modelling reticulation. Here, we argue that the superimposition of biologically motivated constraints, via an SQL-style language, can both stimulate use of network software by biologists and potentially significantly prune the search space.
The Fossilized Birth-Death Process: A Coherent Model of Fossil Calibration for Divergence Time Estimation
Tracy A. Heath, John P. Huelsenbeck, Tanja Stadler
(Submitted on 10 Oct 2013)
Time-calibrated species phylogenies are critical for addressing a wide range of questions in evolutionary biology, such as those that elucidate historical biogeography or uncover patterns of coevolution and diversification. Because molecular sequence data are not informative on absolute time, external data, most commonly fossil age estimates, are required to calibrate estimates of species divergence dates. For Bayesian divergence-time methods, the common practice for calibration using fossil information involves placing arbitrarily chosen parametric distributions on internal nodes, often disregarding most of the information in the fossil record. We introduce the ‘fossilized birth-death’ (FBD) process, a model for calibrating divergence-time estimates in a Bayesian framework, explicitly acknowledging that extant species and fossils are part of the same macroevolutionary process. Under this model, absolute node age estimates are calibrated by a single diversification model and arbitrary calibration densities are not necessary. Moreover, the FBD model allows for inclusion of all available fossils. We performed analyses of simulated data and show that node-age estimation under the FBD model results in robust and accurate estimates of species divergence times with realistic measures of statistical uncertainty, overcoming major limitations of standard divergence time estimation methods. We then used this model to estimate the speciation times for a dataset composed of all living bears, indicating that the genus Ursus diversified in the late Miocene to mid Pliocene.
Neighbor Joining Plus – algorithm for phylogenetic tree reconstruction with proper nodes assignment
Piotr Plonski, Jan P. Radomski
(Submitted on 8 Oct 2013)
Most major algorithms for phylogenetic tree reconstruction assume that sequences in the analyzed set either do not have any offspring, or that parent sequences can mutate into at most two descendants. The graph resulting from such assumptions therefore forms a binary tree, with all the nodes labeled as leaves. However, these constraints are unduly restrictive, as there are numerous data sets with multiple offspring of the same ancestors. Here we propose a solution to analyze and visualize such sets in a more intuitive manner. The method reconstructs a phylogenetic tree by assigning the sequences with offspring as internal nodes, and the sequences without offspring as leaf nodes. In the resulting tree there is no constraint on the number of adjacent nodes, which means that the solution tree need not be binary. The subsequent derivation of evolutionary pathways and pairwise mutations is then algorithmically straightforward, with each edge's length corresponding directly to the number of mutations. Other tree reconstruction algorithms can be extended in the proposed manner, to also give unbiased topologies.
Identical inferences about correlated evolution arise from ancestral state reconstruction and independent contrasts
Michael G. Elliot
(Submitted on 30 Sep 2013)
Inferences about the evolution of continuous traits based on reconstruction of ancestral states have often been considered more error-prone than analysis of independent contrasts. Here we show that both methods in fact yield identical estimators for the correlation coefficient and regression gradient of correlated traits, indicating that reconstructed ancestral states are a valid source of information about correlated evolution. We show that the independent contrast associated with a pair of sibling nodes on a phylogenetic tree can be expressed in terms of the maximum likelihood ancestral state function at those nodes and their common parent. This expression gives rise to novel formulae for independent contrasts for any model of evolution admitting of a local likelihood function. We thus derive new formulae for independent contrasts applicable to traits evolving under directional drift, and use simulated data to show that these directional contrasts provide better estimates of evolutionary model parameters than standard independent contrasts, when traits in fact evolve with a directional tendency.
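For readers unfamiliar with the standard method, the sketch below computes standardized independent contrasts using Felsenstein's (1985) pruning recursion under Brownian motion. It does not implement the paper's new directional-drift formulae. The nested-tuple tree encoding and the function name are assumptions made for this example. The node value computed at each internal node is the local weighted average used by the recursion, not a full maximum likelihood ancestral state.

```python
import math


def independent_contrasts(node):
    """Standardized independent contrasts under Brownian motion.

    Tree encoding (an assumption of this sketch):
      tip:      (trait_value, branch_length)
      internal: (left_child, right_child, branch_length)

    Returns (node_value, adjusted_branch_length, contrasts), where
    contrasts lists one standardized contrast per internal node in
    post-order.
    """
    if len(node) == 2:  # tip
        value, length = node
        return value, length, []

    left, right, length = node
    x1, v1, c1 = independent_contrasts(left)
    x2, v2, c2 = independent_contrasts(right)

    # Contrast between the two children, scaled by its standard deviation.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Weighted average of the children, weights inversely proportional
    # to branch length.
    value = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    # Lengthen the branch above this node to account for the
    # uncertainty in the averaged value.
    adjusted = length + v1 * v2 / (v1 + v2)
    return value, adjusted, c1 + c2 + [contrast]


# Tree ((A:1, B:1):1, C:2) with trait values A=1, B=3, C=5.
tree = (((1.0, 1.0), (3.0, 1.0), 1.0), (5.0, 2.0), 0.0)
root_value, _, contrasts = independent_contrasts(tree)
print(root_value, contrasts)
```

Running the same recursion on two traits and regressing one set of contrasts on the other through the origin gives the usual contrast-based estimate of the evolutionary regression slope.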
Inferring Heterogeneous Evolutionary Processes Through Time: from sequence substitution to phylogeography
Filip Bielejec, Philippe Lemey, Guy Baele, Andrew Rambaut, Marc A Suchard
(Submitted on 12 Sep 2013)
Molecular phylogenetic and phylogeographic reconstructions generally assume time-homogeneous substitution processes. Motivated by computational convenience, this assumption sacrifices biological realism and offers little opportunity to uncover the temporal dynamics in evolutionary histories. Here, we extend and generalize an evolutionary approach that relaxes the time-homogeneous process assumption by allowing the specification of different infinitesimal substitution rate matrices across different time intervals, called epochs, along the evolutionary history. We focus on an epoch model implementation in a Bayesian inference framework that offers great modeling flexibility in drawing inference about any discrete data type characterized as a continuous-time Markov chain, including phylogeographic traits. To alleviate the computational burden that the additional temporal heterogeneity imposes, we adopt a massively parallel approach that achieves both fine- and coarse-grain parallelization of the computations across branches that accommodate epoch transitions, making extensive use of graphics processing units. Through synthetic examples, we assess model performance in recovering evolutionary parameters from data generated according to different evolutionary scenarios that comprise different numbers of epochs for both nucleotide and codon substitution processes. We illustrate the usefulness of our inference framework in two different applications to empirical data sets: the selection dynamics on within-host HIV populations throughout infection and the seasonality of global influenza circulation. In both cases, our epoch model captures key features of temporal heterogeneity that remained difficult to test using ad hoc procedures.
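The core computation behind an epoch model can be illustrated with a two-state chain. When a branch crosses epoch boundaries, its transition probability matrix is the ordered product of the per-epoch matrices, each computed from that epoch's own rate matrix. The sketch below uses the closed-form solution for a two-state continuous-time Markov chain so that it needs no external libraries. The function names and rate values are hypothetical, and this is not the authors' GPU implementation.

```python
import math


def two_state_transition_matrix(a, b, t):
    """P(t) = exp(Q t) for Q = [[-a, a], [b, -b]], with a + b > 0.

    a is the 0 -> 1 rate, b is the 1 -> 0 rate.
    """
    total = a + b
    decay = math.exp(-total * t)
    return [
        [(b + a * decay) / total, (a - a * decay) / total],
        [(b - b * decay) / total, (a + b * decay) / total],
    ]


def matmul2(x, y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def branch_transition_matrix(epochs):
    """Transition matrix for a branch that spans several epochs.

    epochs: list of (a, b, duration) tuples, ordered from the parent
    end of the branch to the child end. Rows index the parent state.
    """
    result = [[1.0, 0.0], [0.0, 1.0]]  # identity
    for a, b, duration in epochs:
        result = matmul2(result, two_state_transition_matrix(a, b, duration))
    return result


# A branch of length 1.0 crossing one epoch boundary at 0.4.
heterogeneous = branch_transition_matrix([(0.3, 0.7, 0.4), (2.0, 0.5, 0.6)])
for row in heterogeneous:
    print(row)
```

When both epochs share the same rates, the product reduces to the ordinary time-homogeneous matrix for the full branch length, which is a convenient correctness check.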