Multi Loci Phylogenetic Analysis with Gene Tree Clustering

Ruriko Yoshida, Kenji Fukumizu
(Submitted on 26 Jun 2015)

Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display precisely matched topologies. The main reason for such phylogenetic incongruence is the reticulated evolutionary history of most species due to meiotic sexual recombination in eukaryotes, or horizontal transfers of genetic material in prokaryotes. Nevertheless, most genes should display topologically related phylogenies, and should group into one or more (for genetic hybrids) clusters in the “tree space.” In this paper we propose to apply the normalized-cut (Ncut) clustering algorithm to the set of gene trees with the geodesic distance between trees over the Billera-Holmes-Vogtmann (BHV) tree space. We first show by simulated data sets that the Ncut algorithm accurately clusters the set of gene trees given a species tree under the coalescent process, and show that the Ncut algorithm works better on the gene trees reconstructed via the neighbor-joining method than on those reconstructed via the maximum likelihood estimator under the evolutionary models. Moreover, we apply the methods to a genome-wide data set (1290 genes encoding 690,838 amino acid residues) on coelacanths, lungfishes, and tetrapods. The result suggests that there are two clusters in the data set. Finally we reconstruct the consensus trees from these two clusters; the consensus tree constructed from one cluster has the topology in which coelacanths are most closely related to the tetrapods, and the consensus tree from the other includes an irresolvable trichotomy over the coelacanth, lungfish, and tetrapod lineages, suggesting divergence within a very short time interval.
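
To make the normalized-cut objective concrete, here is a minimal Python sketch. It is not the authors' implementation: the distance matrix is invented toy data standing in for BHV geodesic distances, the Gaussian kernel and its bandwidth `sigma` are assumptions, and the exhaustive search over bipartitions replaces the spectral relaxation that Ncut implementations normally use (it is only feasible for a handful of trees).

```python
import math
from itertools import combinations

def affinity(dist, sigma):
    """Turn pairwise distances into similarities with a Gaussian kernel (assumed choice)."""
    n = len(dist)
    return [[math.exp(-(dist[i][j] ** 2) / (2 * sigma ** 2)) for j in range(n)]
            for i in range(n)]

def ncut_value(w, part_a):
    """Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)."""
    n = len(w)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(w[i][j] for i in a for j in b)
    assoc_a = sum(w[i][j] for i in a for j in range(n))
    assoc_b = sum(w[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b

def best_bipartition(w):
    """Exhaustively search all bipartitions for the smallest Ncut value."""
    n = len(w)
    best = None
    for size in range(1, n // 2 + 1):
        for part in combinations(range(n), size):
            value = ncut_value(w, part)
            if best is None or value < best[0]:
                best = (value, set(part))
    return best

# Hypothetical distances between six gene trees: trees 0-2 are close to each
# other, trees 3-5 are close to each other, and the two groups are far apart.
dist = [[0.0 if i == j else (1.0 if (i < 3) == (j < 3) else 5.0)
         for j in range(6)] for i in range(6)]
w = affinity(dist, sigma=2.0)
value, cluster = best_bipartition(w)
print(cluster, value)
```

On this toy input the search separates trees 0-2 from trees 3-5, which is the kind of grouping the abstract describes for gene trees in tree space.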

Detecting adaptive evolution in phylogenetic comparative analysis using the Ornstein-Uhlenbeck model

Clayton E. Cressler, Marguerite A. Butler, Aaron A. King
(Submitted on 25 Jun 2015)

Phylogenetic comparative analysis is an approach to inferring evolutionary process from a combination of phylogenetic and phenotypic data. The last few years have seen increasingly sophisticated models employed in the evaluation of more and more detailed evolutionary hypotheses, including adaptive hypotheses with multiple selective optima and hypotheses with rate variation within and across lineages. The statistical performance of these sophisticated models has received relatively little systematic attention, however. We conducted an extensive simulation study to quantify the statistical properties of a class of models toward the simpler end of the spectrum that model phenotypic evolution using Ornstein-Uhlenbeck processes. We focused on identifying where, how, and why these methods break down so that users can apply them with greater understanding of their strengths and weaknesses. Our analysis identifies three key determinants of performance: a discriminability ratio, a signal-to-noise ratio, and the number of taxa sampled. Interestingly, we find that model-selection power can be high even in regions that were previously thought to be difficult, such as when tree size is small. On the other hand, we find that model parameters are in many circumstances difficult to estimate accurately, indicating a relative paucity of information in the data relative to these parameters. Nevertheless, we note that accurate model selection is often possible when parameters are only weakly identified. Our results have implications for more sophisticated methods inasmuch as the latter are generalizations of the case we study.
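
For readers unfamiliar with the Ornstein-Uhlenbeck process, the sketch below computes the mean and variance of a trait after evolving along a single branch. It assumes the standard parameterisation dX = alpha (theta - X) dt + sigma dW with alpha > 0; the function name and parameter names are illustrative and are not taken from the paper or from any package.

```python
import math

def ou_transition(x0, theta, alpha, sigma, t):
    """Mean and variance of an OU trait at time t, starting from value x0.

    theta: selective optimum, alpha: strength of selection (must be > 0),
    sigma: intensity of random fluctuations, t: elapsed time on the branch.
    """
    decay = math.exp(-alpha * t)
    mean = theta + (x0 - theta) * decay
    variance = sigma ** 2 / (2 * alpha) * (1 - decay ** 2)
    return mean, variance

# Short branch: the trait is still near its starting value.
print(ou_transition(x0=0.0, theta=1.0, alpha=2.0, sigma=1.0, t=0.1))
# Long branch: the mean approaches theta and the variance approaches sigma^2 / (2 alpha).
print(ou_transition(x0=0.0, theta=1.0, alpha=2.0, sigma=1.0, t=50.0))
```

The long-branch limit shows why information about the starting value and the path taken erodes over time, which is one intuition for why individual parameters can be hard to estimate.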

TESS: Bayesian inference of lineage diversification rates from (incompletely sampled) molecular phylogenies in R

Sebastian Höhna, Michael R. May, Brian R. Moore
doi: http://dx.doi.org/10.1101/021238

Many fundamental questions in evolutionary biology entail estimating rates of lineage diversification (speciation–extinction). We develop a flexible Bayesian framework for specifying an effectively infinite array of diversification models—where rates are constant, vary continuously, or change episodically through time—and implement numerical methods to estimate parameters of these models from molecular phylogenies, even when species sampling is incomplete. Additionally we provide robust methods for comparing the relative and absolute fit of competing branching-process models to a given tree, thereby providing rigorous tests of biological hypotheses regarding patterns and processes of lineage diversification.
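
As a small illustration of what rates that "change episodically through time" imply, the Python sketch below computes the expected number of lineages under a birth-death process with piecewise-constant rates, using E[N(t)] = N0 exp(integral of (speciation - extinction)). This is not the TESS API and not its inference method; the function name and the episode format are hypothetical.

```python
import math

def expected_lineages(n0, episodes):
    """Expected lineage count after a sequence of constant-rate episodes.

    episodes: list of (duration, speciation_rate, extinction_rate) tuples,
    applied in order. The expectation is not conditioned on survival.
    """
    log_growth = sum(duration * (lam - mu) for duration, lam, mu in episodes)
    return n0 * math.exp(log_growth)

# Hypothetical history: a burst of diversification followed by stasis.
history = [(2.0, 1.0, 0.5), (1.0, 0.2, 0.2)]
print(expected_lineages(1, history))
```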

Folding and unfolding phylogenetic trees and networks

Katharina T. Huber, Vincent Moulton, Mike Steel, Taoyang Wu
(Submitted on 14 Jun 2015)

Phylogenetic networks are rooted, labelled directed acyclic graphs which are commonly used to represent reticulate evolution. There is a close relationship between phylogenetic networks and multi-labelled trees (MUL-trees). Indeed, any phylogenetic network N can be ‘unfolded’ to obtain a MUL-tree U(N) and, conversely, a MUL-tree T can in certain circumstances be ‘folded’ to obtain a phylogenetic network F(T) that exhibits T. In this paper, we study properties of the operations U and F in more detail. In particular, we introduce the class of stable networks, phylogenetic networks N for which F(U(N)) is isomorphic to N, characterise such networks, and show that they are related to the well-known class of tree-sibling networks. We also explore how the concept of displaying a tree in a network N can be related to displaying the tree in the MUL-tree U(N). To do this, we develop a phylogenetic analogue of graph fibrations. This allows us to view U(N) as the analogue of the universal cover of a digraph, and to establish a close connection between displaying trees in U(N) and reconciling phylogenetic trees with networks.
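
The unfolding operation can be sketched in a few lines of Python. The sketch assumes a network stored as a dictionary from each vertex to its list of children (a hypothetical representation, not the paper's notation) and treats U(N) as the tree of directed paths from the root, so every subnetwork below a reticulation is copied once per parent. Suppressing vertices with a single child is an assumed convention here.

```python
from collections import Counter

def unfold(network, vertex):
    """Return the unfolding below `vertex` as nested tuples of leaf labels."""
    children = network.get(vertex, [])
    if not children:
        return vertex  # a leaf keeps its label
    subtrees = tuple(unfold(network, child) for child in children)
    # Assumed convention: suppress vertices that have a single child.
    return subtrees[0] if len(subtrees) == 1 else subtrees

def leaf_labels(tree):
    """Count how often each label occurs among the leaves of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return Counter([tree])
    total = Counter()
    for subtree in tree:
        total += leaf_labels(subtree)
    return total

# Toy network: the reticulation vertex "h" has two parents, "a" and "b".
network = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"], "h": ["z"]}
mul_tree = unfold(network, "r")
print(mul_tree)
print(leaf_labels(mul_tree))
```

In this toy case the leaf below the reticulation appears twice in the result, which is what makes the unfolded tree multi-labelled.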

bModelTest: Bayesian site model selection for nucleotide data

Remco Bouckaert
doi: http://dx.doi.org/10.1101/020792

bModelTest allows for a Bayesian approach to inferring a site model for phylogenetic analysis. It is based on transdimensional MCMC proposals that allow switching between substitution models, and between including or excluding gamma rate heterogeneity and a proportion of invariant sites. The model can be used with the set of reversible models on nucleotides, but we also introduce other sets of substitution models, and show how to use these sets of models. With the method, the site model can be inferred during the MCMC analysis and does not need to be pre-determined, as is now often the case in practice, by likelihood-based methods.

Sequence capture of ultraconserved elements from bird museum specimens

John McCormack, Whitney L.E. Tsai, Brant C Faircloth
doi: http://dx.doi.org/10.1101/020271

New DNA sequencing technologies are allowing researchers to explore the genomes of the millions of natural history specimens collected prior to the molecular era. Yet, we know little about how well specific next-generation sequencing (NGS) techniques work with the degraded DNA typically extracted from museum specimens. Here, we use one type of NGS approach, sequence capture of ultraconserved elements (UCEs), to collect data from bird museum specimens as old as 120 years. We targeted approximately 5,000 UCE loci in 27 Western Scrub-Jays (Aphelocoma californica) representing three evolutionary lineages, and we collected an average of 3,749 UCE loci containing 4,460 single nucleotide polymorphisms (SNPs). Despite older specimens producing fewer and shorter loci in general, we collected thousands of markers from even the oldest specimens. More sequencing reads per individual helped to boost the number of UCE loci we recovered from older specimens, but more sequencing was not as successful at increasing the length of loci. We detected contamination in some samples and determined contamination was more prevalent in older samples that were subject to less sequencing. For the phylogeny generated from concatenated UCE loci, contamination led to incorrect placement of some individuals. In contrast, a species tree constructed from SNPs called within UCE loci correctly placed individuals into three monophyletic groups, perhaps because of the stricter analytical procedures we used for SNP calling. This study and other recent studies on the genomics of museum specimens have profound implications for natural history collections, where millions of older specimens should now be considered genomic resources.

The effect of non-reversibility on inferring rooted phylogenies

S. Cherlin, T. M. W. Nye, R. J. Boys, S. E. Heaps, T. A. Williams, T. M. Embley
(Submitted on 29 May 2015)

Most phylogenetic models assume that the evolutionary process is stationary and reversible. As a result, the root of the tree cannot be inferred as part of the analysis because the likelihood of the data does not depend on the position of the root. Yet defining the root of a phylogenetic tree is a key component of phylogenetic inference because it provides a point of reference for polarising ancestor/descendant relationships and therefore interpreting the tree. In this paper we investigate the effect of relaxing the reversibility assumption and allowing the position of the root to be another unknown quantity in the model. We propose two hierarchical models which are centred on a reversible model but perturbed to allow non-reversibility. The models differ in the degree of structure imposed on the perturbations. The analysis is performed in the Bayesian framework using Markov chain Monte Carlo methods. We illustrate the performance of the two non-reversible models in analyses of simulated datasets using two types of topological priors. We then apply the models to a real biological dataset, the radiation of polyploid yeasts, for which there is a robust biological opinion about the root position. Finally we apply the models to a second biological dataset for which the rooted tree is controversial: the ribosomal tree of life. We compare the two non-reversible models and conclude that both are useful in inferring the position of the root from real biological datasets.
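
The reversibility assumption mentioned at the start of the abstract corresponds to the detailed-balance condition pi_i Q_ij = pi_j Q_ji on the rate matrix Q. The Python sketch below checks that condition for a toy reversible matrix and for a perturbed copy. It illustrates the general idea of perturbing a reversible model only; it is not the authors' hierarchical models, and the frequencies, exchangeabilities, and perturbation size are made-up values.

```python
def build_reversible(pi, s):
    """Build a rate matrix with Q_ij = s_ij * pi_j for i != j (s symmetric)."""
    n = len(pi)
    q = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q

def is_reversible(q, pi, tol=1e-12):
    """Check detailed balance pi_i * Q_ij == pi_j * Q_ji for every pair i < j."""
    n = len(q)
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))

# Hypothetical nucleotide frequencies and symmetric exchangeabilities.
pi = [0.1, 0.2, 0.3, 0.4]
s = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
q = build_reversible(pi, s)

# Perturb a single rate (and its diagonal) so that detailed balance fails.
q_perturbed = [row[:] for row in q]
q_perturbed[0][1] += 0.05
q_perturbed[0][0] -= 0.05

print(is_reversible(q, pi), is_reversible(q_perturbed, pi))
```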