A comparative study of SVDquartets and other coalescent-based species tree estimation methods

Jed Chou, Ashu Gupta, Shashank Yaduvanshi, Ruth Davidson, Mike Nute, Siavash Mirarab, Tandy Warnow
doi: http://dx.doi.org/10.1101/022855

Background: Species tree estimation is challenging in the presence of incomplete lineage sorting (ILS), which can make gene trees different from the species tree. Because ILS is expected to occur and the standard concatenation approach can return incorrect trees with high support in the presence of ILS, “coalescent-based” summary methods (which first estimate gene trees and then combine gene trees into a species tree) have been developed that have theoretical guarantees of robustness to arbitrarily high amounts of ILS. Some studies have suggested that summary methods should only be used on “c-genes” (i.e., recombination-free loci) that can be extremely short (sometimes fewer than 100 sites). However, gene trees estimated on short alignments can have high estimation error, and summary methods tend to have high error on short c-genes. To address this problem, Chifman and Kubatko introduced SVDquartets, a new coalescent-based method. SVDquartets takes multi-locus unlinked single-site data, infers the quartet trees for all subsets of four species, and then combines the set of quartet trees into a species tree using a quartet amalgamation heuristic. Yet, the accuracy of SVDquartets relative to leading coalescent-based methods has not been assessed. Results: We compared SVDquartets to two leading coalescent-based methods (ASTRAL-2 and NJst), and to concatenation using maximum likelihood. We used a collection of simulated datasets, varying ILS levels, numbers of taxa, and number of sites per locus. Although SVDquartets was sometimes more accurate than ASTRAL-2 and NJst, most often the best results were obtained using ASTRAL-2, even on the shortest gene sequence alignments we explored (with only 10 sites per locus). Finally, concatenation was the most accurate of all methods under low ILS conditions. Conclusions: ASTRAL-2 generally had the best accuracy under higher ILS conditions, and concatenation had the best accuracy under the lowest ILS conditions. However, SVDquartets was competitive with the best methods under conditions with low ILS and small numbers of sites per locus. The good performance under many conditions of ASTRAL-2 in comparison to SVDquartets is surprising given the known vulnerability of ASTRAL-2 and similar methods to short gene sequences.
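
As a rough illustration of the quartet step mentioned in the abstract, the Python sketch below enumerates every subset of four species and lists the three possible unrooted topologies for each. It does not implement the SVD-based scoring that SVDquartets uses to choose among those topologies, nor the amalgamation heuristic; the function name `quartet_topologies` and the species labels are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topologies(species):
    """Map each 4-species subset to its three possible unrooted topologies.

    Each topology is written as a split, i.e. a pair of pairs:
    (("A", "B"), ("C", "D")) stands for the quartet tree AB|CD.
    """
    result = {}
    for a, b, c, d in combinations(sorted(species), 4):
        result[(a, b, c, d)] = [
            ((a, b), (c, d)),
            ((a, c), (b, d)),
            ((a, d), (b, c)),
        ]
    return result


species = ["A", "B", "C", "D", "E"]  # hypothetical labels, not from the study
quartets = quartet_topologies(species)

# The number of quartets grows as "n choose 4", which is why quartet-based
# methods often sample quartets when the number of species is large.
print(len(quartets), comb(len(species), 4))
```

A quartet-based method then has to pick one of the three splits per subset; that choice is where the site-pattern data enter.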

Multi Loci Phylogenetic Analysis with Gene Tree Clustering

Ruriko Yoshida, Kenji Fukumizu
(Submitted on 26 Jun 2015)

Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display precisely matched topologies. The main reason for such phylogenetic incongruence is reticulated evolutionary history of most species due to meiotic sexual recombination in eukaryotes, or horizontal transfers of genetic material in prokaryotes. Nevertheless, most genes should display topologically related phylogenies, and should group into one or more (for genetic hybrids) clusters in the “tree space.” In this paper we propose to apply the normalized-cut (Ncut) clustering algorithm to the set of gene trees with the geodesic distance between trees over the Billera-Holmes-Vogtmann (BHV) tree space. We first show by simulated data sets that the Ncut algorithm accurately clusters the set of gene trees given a species tree under the coalescent process, and show that the Ncut algorithm works better on the gene trees reconstructed via the neighbor-joining method than on those reconstructed via the maximum likelihood estimator under the evolutionary models. Moreover, we apply the methods to a genome-wide data set (1290 genes encoding 690,838 amino acid residues) on coelacanths, lungfishes, and tetrapods. The result suggests that there are two clusters in the data set. Finally we reconstruct the consensus trees from these two clusters; the consensus tree constructed from one cluster has the tree topology that coelacanths are most closely related to the tetrapods, and the consensus tree from the other includes an irresolvable trichotomy over the coelacanth, lungfish, and tetrapod lineages, suggesting divergence within a very short time interval.

Detecting adaptive evolution in phylogenetic comparative analysis using the Ornstein-Uhlenbeck model

Clayton E. Cressler, Marguerite A. Butler, Aaron A. King
(Submitted on 25 Jun 2015)

Phylogenetic comparative analysis is an approach to inferring evolutionary process from a combination of phylogenetic and phenotypic data. The last few years have seen increasingly sophisticated models employed in the evaluation of more and more detailed evolutionary hypotheses, including adaptive hypotheses with multiple selective optima and hypotheses with rate variation within and across lineages. The statistical performance of these sophisticated models has received relatively little systematic attention, however. We conducted an extensive simulation study to quantify the statistical properties of a class of models toward the simpler end of the spectrum that model phenotypic evolution using Ornstein-Uhlenbeck processes. We focused on identifying where, how, and why these methods break down so that users can apply them with greater understanding of their strengths and weaknesses. Our analysis identifies three key determinants of performance: a discriminability ratio, a signal-to-noise ratio, and the number of taxa sampled. Interestingly, we find that model-selection power can be high even in regions that were previously thought to be difficult, such as when tree size is small. On the other hand, we find that model parameters are in many circumstances difficult to estimate accurately, indicating a relative paucity of information in the data relative to these parameters. Nevertheless, we note that accurate model selection is often possible when parameters are only weakly identified. Our results have implications for more sophisticated methods inasmuch as the latter are generalizations of the case we study.
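
For readers unfamiliar with the Ornstein-Uhlenbeck process, here is a minimal Python sketch that simulates a trait along a single lineage using the process's exact transition distribution. It is not the authors' simulation code and it ignores the phylogeny entirely; the function name `simulate_ou` and the parameter names (`theta` for the optimum, `alpha` for the strength of pull toward it, `sigma` for the noise intensity) are assumptions chosen for illustration.

```python
import math
import random


def simulate_ou(x0, theta, alpha, sigma, dt, steps, rng=None):
    """Simulate an OU path with the exact one-step transition.

    Over a time step dt the trait is Gaussian with
      mean = theta + (x - theta) * exp(-alpha * dt)
      var  = sigma^2 * (1 - exp(-2 * alpha * dt)) / (2 * alpha)
    Requires alpha > 0.
    """
    if rng is None:
        rng = random.Random()
    decay = math.exp(-alpha * dt)
    sd = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * alpha))
    path = [x0]
    for _ in range(steps):
        mean = theta + (path[-1] - theta) * decay
        path.append(rng.gauss(mean, sd))
    return path


# Hypothetical parameter values: the trait starts at 0 and is pulled toward 1.
path = simulate_ou(x0=0.0, theta=1.0, alpha=0.5, sigma=0.2, dt=0.1,
                   steps=200, rng=random.Random(1))
print(len(path))
```

With `sigma` set to 0 the path relaxes deterministically toward `theta`, which is a quick way to check the mean update.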

TESS: Bayesian inference of lineage diversification rates from (incompletely sampled) molecular phylogenies in R

Sebastian Höhna, Michael R. May, Brian R. Moore
doi: http://dx.doi.org/10.1101/021238

Many fundamental questions in evolutionary biology entail estimating rates of lineage diversification (speciation–extinction). We develop a flexible Bayesian framework for specifying an effectively infinite array of diversification models—where rates are constant, vary continuously, or change episodically through time—and implement numerical methods to estimate parameters of these models from molecular phylogenies, even when species sampling is incomplete. Additionally we provide robust methods for comparing the relative and absolute fit of competing branching-process models to a given tree, thereby providing rigorous tests of biological hypotheses regarding patterns and processes of lineage diversification.

Folding and unfolding phylogenetic trees and networks

Katharina T. Huber, Vincent Moulton, Mike Steel, Taoyang Wu
(Submitted on 14 Jun 2015)

Phylogenetic networks are rooted, labelled directed acyclic graphs which are commonly used to represent reticulate evolution. There is a close relationship between phylogenetic networks and multi-labelled trees (MUL-trees). Indeed, any phylogenetic network N can be ‘unfolded’ to obtain a MUL-tree U(N) and, conversely, a MUL-tree T can in certain circumstances be ‘folded’ to obtain a phylogenetic network F(T) that exhibits T. In this paper, we study properties of the operations U and F in more detail. In particular, we introduce the class of stable networks, phylogenetic networks N for which F(U(N)) is isomorphic to N, characterise such networks, and show that they are related to the well-known class of tree-sibling networks. We also explore how the concept of displaying a tree in a network N can be related to displaying the tree in the MUL-tree U(N). To do this, we develop a phylogenetic analogue of graph fibrations. This allows us to view U(N) as the analogue of the universal cover of a digraph, and to establish a close connection between displaying trees in U(N) and reconciling phylogenetic trees with networks.

bModelTest: Bayesian site model selection for nucleotide data

Remco Bouckaert
doi: http://dx.doi.org/10.1101/020792

bModelTest allows for a Bayesian approach to inferring a site model for phylogenetic analysis. It is based on transdimensional MCMC proposals that allow switching between substitution models, whether gamma rate heterogeneity is used and whether a proportion of the sites is invariant. The model can be used with the set of reversible models on nucleotides, but we also introduce other sets of substitution models, and show how to use these sets of models. With the method, the site model can be inferred during the MCMC analysis and does not need to be pre-determined, as is now often the case in practice, by likelihood-based methods.

Sequence capture of ultraconserved elements from bird museum specimens

John McCormack, Whitney L.E. Tsai, Brant C Faircloth
doi: http://dx.doi.org/10.1101/020271

New DNA sequencing technologies are allowing researchers to explore the genomes of the millions of natural history specimens collected prior to the molecular era. Yet, we know little about how well specific next-generation sequencing (NGS) techniques work with the degraded DNA typically extracted from museum specimens. Here, we use one type of NGS approach, sequence capture of ultraconserved elements (UCEs), to collect data from bird museum specimens as old as 120 years. We targeted approximately 5,000 UCE loci in 27 Western Scrub-Jays (Aphelocoma californica) representing three evolutionary lineages, and we collected an average of 3,749 UCE loci containing 4,460 single nucleotide polymorphisms (SNPs). Despite older specimens producing fewer and shorter loci in general, we collected thousands of markers from even the oldest specimens. More sequencing reads per individual helped to boost the number of UCE loci we recovered from older specimens, but more sequencing was not as successful at increasing the length of loci. We detected contamination in some samples and determined contamination was more prevalent in older samples that were subject to less sequencing. For the phylogeny generated from concatenated UCE loci, contamination led to incorrect placement of some individuals. In contrast, a species tree constructed from SNPs called within UCE loci correctly placed individuals into three monophyletic groups, perhaps because of the stricter analytical procedures we used for SNP calling. This study and other recent studies on the genomics of museum specimens have profound implications for natural history collections, where millions of older specimens should now be considered genomic resources.

The effect of non-reversibility on inferring rooted phylogenies

S. Cherlin, T. M. W. Nye, R. J. Boys, S. E. Heaps, T. A. Williams, T. M. Embley
(Submitted on 29 May 2015)

Most phylogenetic models assume that the evolutionary process is stationary and reversible. As a result, the root of the tree cannot be inferred as part of the analysis because the likelihood of the data does not depend on the position of the root. Yet defining the root of a phylogenetic tree is a key component of phylogenetic inference because it provides a point of reference for polarising ancestor/descendant relationships and therefore interpreting the tree. In this paper we investigate the effect of relaxing the reversibility assumption and allowing the position of the root to be another unknown quantity in the model. We propose two hierarchical models which are centred on a reversible model but perturbed to allow non-reversibility. The models differ in the degree of structure imposed on the perturbations. The analysis is performed in the Bayesian framework using Markov chain Monte Carlo methods. We illustrate the performance of the two non-reversible models in analyses of simulated datasets using two types of topological priors. We then apply the models to a real biological dataset, the radiation of polyploid yeasts, for which there is a robust biological opinion about the root position. Finally we apply the models to a second biological dataset for which the rooted tree is controversial: the ribosomal tree of life. We compare the two non-reversible models and conclude that both are useful in inferring the position of the root from real biological datasets.

A Bayesian Approach for Detecting Mass-Extinction Events When Rates of Lineage Diversification Vary

Michael R. May, Sebastian Höhna, Brian R. Moore
doi: http://dx.doi.org/10.1101/020149

The paleontological record chronicles numerous episodes of mass extinction that severely culled the Tree of Life. Biologists have long sought to assess the extent to which these events may have impacted particular groups. We present a novel method for detecting mass-extinction events from phylogenies estimated from molecular sequence data. We develop our approach in a Bayesian statistical framework, which enables us to harness prior information on the frequency and magnitude of mass-extinction events. The approach is based on an episodic stochastic-branching process model in which rates of speciation and extinction are constant between rate-shift events. We model three types of events: (1) instantaneous tree-wide shifts in speciation rate; (2) instantaneous tree-wide shifts in extinction rate; and (3) instantaneous tree-wide mass-extinction events. Each of the events is described by a separate compound Poisson process (CPP) model, where the waiting times between each event are exponentially distributed with event-specific rate parameters. The magnitude of each event is drawn from an event-type specific prior distribution. Parameters of the model are then estimated using a reversible-jump Markov chain Monte Carlo (rjMCMC) algorithm. We demonstrate via simulation that this method has substantial power to detect the number of mass-extinction events and provides unbiased estimates of the timing of mass-extinction events, while exhibiting an appropriate (i.e., below 5%) false discovery rate even in the case of background diversification rate variation. Finally, we provide an empirical application of this approach to conifers, which reveals that this group has experienced two major episodes of mass extinction. This new approach, the CPP on Mass Extinction Times (CoMET) model, provides an effective tool for identifying mass-extinction events from molecular phylogenies, even when the history of those groups includes more prosaic temporal variation in diversification rate.
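
The event-time part of the prior described above can be sketched in a few lines of Python: for each event type, waiting times between events are drawn from an exponential distribution with that type's rate. This is only an illustration of the sampling idea, not the CoMET implementation; the function names, the rate values, and the 100-unit time span are all hypothetical, and the event magnitudes (which the abstract says are drawn from event-type specific priors) are left out.

```python
import random


def sample_event_times(rate, duration, rng):
    """Draw event times in (0, duration) with exponential waiting times."""
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def sample_event_history(rates, duration, seed=0):
    """Sample one set of event times per event type, each from its own process."""
    rng = random.Random(seed)
    return {kind: sample_event_times(rate, duration, rng)
            for kind, rate in rates.items()}


# Hypothetical rates (expected events per unit time) for the three event types.
rates = {
    "speciation_shift": 0.02,
    "extinction_shift": 0.02,
    "mass_extinction": 0.01,
}
events = sample_event_history(rates, duration=100.0)
for kind, times in events.items():
    print(kind, [round(t, 1) for t in times])
```

Because each event type has its own rate, the expected number of events of a given type over the time span is simply that rate multiplied by the duration.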

On the equivalence of Maximum Parsimony and Maximum Likelihood on phylogenetic networks

Mareike Fischer, Parisa Bazargani
(Submitted on 26 May 2015)

Phylogenetic inference aims at reconstructing the evolutionary relationships of different species given some data (e.g. DNA, RNA or proteins). Traditionally, the relationships between species were assumed to be treelike, so the most frequently used phylogenetic inference methods, such as Maximum Parsimony or Maximum Likelihood, were originally introduced to reconstruct phylogenetic trees. However, it is well known that some evolutionary events like hybridization or horizontal gene transfer cannot be represented by a tree but rather require a phylogenetic network. Therefore, current research seeks to adapt tree inference methods to networks. In the present paper, we analyze Maximum Parsimony and Maximum Likelihood on networks for various network definitions which have recently been introduced, and we investigate the well-known Tuffley and Steel equivalence result concerning these methods under the setting of a phylogenetic network.