Majority rule has transition ratio 4 on Yule trees under a 2-state symmetric model

Elchanan Mossel, Mike Steel
(Submitted on 10 Apr 2014)

Inferring the ancestral state at the root of a phylogenetic tree from states observed at the leaves is a problem arising in evolutionary biology. The simplest technique, majority rule, estimates the root state by the most frequently occurring state at the leaves. Alternative methods, such as maximum parsimony, explicitly take the tree structure into account. Since either method can outperform the other on particular trees, it is useful to consider the accuracy of the methods on trees generated under some evolutionary null model, such as a Yule pure-birth model. In this short note, we answer a recently posed question concerning the performance of majority rule on Yule trees under a symmetric 2-state Markovian substitution model of character state change. We show that majority rule is accurate precisely when the ratio of the birth (speciation) rate of the Yule process to the substitution rate exceeds the value 4. By contrast, maximum parsimony has been shown to be accurate only when this ratio is at least 6. Our proof relies on a second moment calculation, coupling, and a novel application of a reflection principle.
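
As a rough illustration of the setting described in this abstract, the following Python sketch grows a Yule tree for a fixed time, lets a 2-state character flip along the branches, and applies majority rule to the leaf states. The function names, the random tie-breaking, and the convention that the substitution rate is the rate of flipping to the other state are assumptions made for this sketch, not details taken from the paper. It is not meant to reproduce the threshold of 4, which may depend on the paper's own parameterisation.

```python
import random


def simulate_leaf_states(birth_rate, flip_rate, total_time, rng, root_state=0):
    """Grow a Yule tree for total_time and return the states at its leaves.

    Each lineage splits at rate birth_rate and flips its binary state at
    rate flip_rate (hypothetical parameter names for this sketch).
    """
    total_rate = birth_rate + flip_rate
    leaves = []
    stack = [(root_state, total_time)]  # (state, time left until the present)
    while stack:
        state, remaining = stack.pop()
        wait = rng.expovariate(total_rate)  # time to the next event on this lineage
        if wait >= remaining:
            leaves.append(state)  # lineage reaches the present unchanged
            continue
        remaining -= wait
        if rng.random() < birth_rate / total_rate:
            # Speciation: both daughter lineages inherit the current state.
            stack.append((state, remaining))
            stack.append((state, remaining))
        else:
            # Substitution: the state flips (symmetric 2-state model).
            stack.append((1 - state, remaining))
    return leaves


def majority_rule(leaf_states, rng):
    """Return the most frequent leaf state; ties are broken by a coin flip."""
    ones = sum(leaf_states)
    zeros = len(leaf_states) - ones
    if ones == zeros:
        return rng.randrange(2)
    return 1 if ones > zeros else 0


def estimate_accuracy(birth_rate, flip_rate, total_time, n_trees, seed=0):
    """Fraction of simulated trees on which majority rule recovers root state 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trees):
        leaves = simulate_leaf_states(birth_rate, flip_rate, total_time, rng)
        hits += majority_rule(leaves, rng) == 0
    return hits / n_trees
```

Calling `estimate_accuracy` for several values of `birth_rate / flip_rate` and increasing `total_time` gives an empirical feel for how accuracy changes with the ratio. Tree size grows exponentially in `birth_rate * total_time`, so the time horizon should be kept modest.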

Estimating Phylogeny from microRNA Data: A Critical Appraisal

Robert Thomson, David Plachetzki, Luke Mahler, Brian Moore

As progress toward a highly resolved tree of life continues to expose nodes that resist resolution, interest in new sources of phylogenetic information that are informative for these most difficult relationships continues to increase. One such potential source of information, the presence and absence of microRNA families, has been vigorously promoted as an ideal phylogenetic marker and has been recently deployed to resolve several long-standing phylogenetic questions. Understanding the utility of such markers for phylogenetic inference hinges on developing a better understanding of how such markers behave under suitable evolutionary models, as well as how they perform in real inference scenarios. However, as yet, no study has rigorously characterized the statistical behavior or utility of these markers. Here we examine the behavior and performance of microRNA presence/absence data under a variety of evolutionary models and reexamine datasets from several previous studies. We find that highly heterogeneous rates of microRNA gain and loss, pervasive secondary loss, and sampling error collectively render microRNA-based inference of phylogeny difficult, and fundamentally alter the conclusions for four of the five studies that we re-examine. Our results indicate that miRNA data have far less phylogenetic utility in resolving the tree of life than is currently recognized and we urge ample caution in their interpretation.

Bias and measurement error in comparative analyses: a case study with the Ornstein Uhlenbeck model

Gavin Huw Thomas, Natalie Cooper, Chris Venditti, Andrew Meade, Robert P Freckleton

Phylogenetic comparative methods are increasingly used to give new insight into the patterns, causes and consequences of trait variation among species. The foundation of these methods is a suite of models that attempt to capture evolutionary patterns by extending the Brownian constant variance model. However, the parameters of these models have been hypothesised to be biased and to behave in a statistically predictable way only asymptotically, as datasets become large. This does not seem to be widely appreciated. We show that a commonly used model in evolutionary biology (the Ornstein-Uhlenbeck model) is biased over a wide range of conditions. Many studies fitting this model use datasets that are small and prone to substantial biases. Our results suggest that simulating fitted models and comparing with empirical results is critical when fitting OU and other extensions of the Brownian model.

PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species

Paulo Bandiera-Paiva, Marcelo R.S. Briones
(Submitted on 2 Apr 2014)

The Phylogenetic Genome Annotator (PGA) is a computer program that enables real-time comparison of ‘gene trees’ versus ‘species trees’ obtained from predicted open reading frames of whole genome data. The gene phylogenies are inferred from the predicted proteins of each individual genome, whereas the species phylogenies are inferred from rDNA data. The correlated protein domains, defined by PFAM, are then displayed side-by-side with a phylogeny of the corresponding species. The statistical support of gene clusters (branches) is given by the quartet puzzling method. This analysis readily discriminates paralogs from orthologs, enabling the identification of proteins that originated by gene duplication and the prediction of possible functional divergence in groups of similar sequences.

Phylogenetic Stochastic Mapping without Matrix Exponentiation

Jan Irvahn, Vladimir N. Minin

Phylogenetic stochastic mapping is a method for reconstructing the history of trait changes on a phylogenetic tree relating species/organisms carrying the trait. State-of-the-art methods assume that the trait evolves according to a continuous-time Markov chain (CTMC) and work well for small state spaces. The computations slow down considerably for larger state spaces (e.g. space of codons), because current methodology relies on exponentiating CTMC infinitesimal rate matrices — an operation whose computational complexity grows as the size of the CTMC state space cubed. In this work, we introduce a new approach, based on a CTMC technique called uniformization, that does not use matrix exponentiation for phylogenetic stochastic mapping. Our method is based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the distribution of trait histories conditional on the trait data observed at the tips of the tree. The computational complexity of our MCMC method grows as the size of the CTMC state space squared. Moreover, in contrast to competing matrix exponentiation methods, if the rate matrix is sparse, we can leverage this sparsity and increase the computational efficiency of our algorithm further. Using simulated data, we illustrate advantages of our MCMC algorithm and investigate how large the state space needs to be for our method to outperform matrix exponentiation approaches. We show that even on the moderately large state space of codons our MCMC method can be significantly faster than currently used matrix exponentiation methods.
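
For readers unfamiliar with uniformization, the sketch below shows the standard identity behind it: for a rate matrix Q and any rate mu at least as large as the fastest exit rate, exp(Qt) equals a Poisson(mu t)-weighted sum of powers of the stochastic matrix B = I + Q/mu. This is only an illustration of that identity, not the authors' MCMC algorithm, and it still multiplies full matrices, so it does not show the quadratic scaling claimed in the abstract. The function names, the tolerance, and the term cap are assumptions made for the example.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def uniformized_transition_matrix(Q, t, tol=1e-12, max_terms=10000):
    """Approximate exp(Q * t) for a CTMC rate matrix Q by uniformization.

    Suitable for moderate mu * t only: exp(-mu * t) underflows when
    mu * t is very large.
    """
    n = len(Q)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    mu = max(-Q[i][i] for i in range(n))  # uniformization rate
    if mu == 0.0:
        return identity  # no transitions are possible
    # B is the transition matrix of the discrete-time chain subordinated
    # to a Poisson process of rate mu.
    B = [[identity[i][j] + Q[i][j] / mu for j in range(n)] for i in range(n)]
    weight = math.exp(-mu * t)  # Poisson probability of zero events
    cumulative = weight
    power = identity  # B ** 0
    P = [[weight * power[i][j] for j in range(n)] for i in range(n)]
    for k in range(1, max_terms):
        if 1.0 - cumulative <= tol:
            break  # remaining Poisson mass is negligible
        power = mat_mul(power, B)
        weight *= mu * t / k  # Poisson probability of k events
        cumulative += weight
        for i in range(n):
            for j in range(n):
                P[i][j] += weight * power[i][j]
    return P
```

For the symmetric 2-state chain with unit flip rate, `uniformized_transition_matrix([[-1.0, 1.0], [1.0, -1.0]], 0.7)` can be checked against the closed form, in which the probability of being in the starting state at time t is (1 + exp(-2t)) / 2.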

Markov mutation models on Yule trees: pairwise species comparisons

Willem H. Mulder, Forrest W. Crawford
Subjects: Populations and Evolution (q-bio.PE)

Efforts to reconstruct phylogenetic trees and understand evolutionary processes depend fundamentally on stochastic models of speciation and mutation. The simplest continuous-time model for speciation in phylogenetic trees is the Yule process, in which new species are “born” from existing lineages at a constant rate. Recent work has illuminated some of the structural properties of Yule trees, but it remains mostly unknown how these properties affect sequence and trait patterns observed at the tips of the phylogenetic tree. Understanding the interplay between speciation and mutation under simple models of evolution is essential for deriving valid phylogenetic inference methods and gives insight into the optimal design of phylogenetic studies. In this work, we derive the probability distribution of interspecies covariance under Brownian motion and Ornstein-Uhlenbeck processes on a Yule tree. We compute the probability distribution of the number of mutations shared between two randomly chosen taxa in a Yule tree under several mutation models. These results suggest summary measures of phylogenetic information content, illuminate the correlation between site patterns in sequences or traits of related organisms, and provide heuristics for experimental design and reconstruction of phylogenetic trees.

Phylogenetic tree shapes resolve disease transmission patterns

Jennifer Gardy, Caroline Colijn

Whole genome sequencing is becoming popular as a tool for understanding outbreaks of communicable diseases, with phylogenetic trees being used to identify individual transmission events or to characterize outbreak-level overall transmission dynamics. Existing methods to infer transmission dynamics from sequence data rely on well-characterised infectious periods, epidemiological and clinical meta-data which may not always be available, and typically require computationally intensive analysis focussing on the branch lengths in phylogenetic trees. We sought to determine whether the topological structures of phylogenetic trees contain signatures of the overall transmission patterns underlying an outbreak. Here we use simulated outbreaks to train and then test computational classifiers. We test the method on data from two real-world outbreaks. We find that different transmission patterns result in quantitatively different phylogenetic tree shapes. We describe five topological features that summarize a phylogeny’s structure and find that computational classifiers based on these are capable of predicting an outbreak’s transmission dynamics. The method is robust to variations in the transmission parameters and network types, and recapitulates known epidemiology of previously characterized real-world outbreaks. We conclude that there are simple structural properties of phylogenetic trees which, when combined, can distinguish communicable disease outbreaks with a super-spreader, homogeneous transmission, and chains of transmission. This is possible using genome data alone, and can be done during an outbreak. We discuss the implications for management of outbreaks.

Reassortment between influenza B lineages and the emergence of a co-adapted PB1-PB2-HA gene complex

Gytis Dudas, Trevor Bedford, Samantha Lycett, Andrew Rambaut
Comments: 33 pages, 21 figures
Subjects: Populations and Evolution (q-bio.PE)

Influenza B viruses are increasingly being recognized as major contributors to morbidity attributed to seasonal influenza. Currently circulating influenza B isolates are known to belong to two antigenically distinct lineages referred to as B/Victoria and B/Yamagata. Frequent exchange of genomic segments of these two lineages has been noted in the past, but the observed patterns of reassortment have not been formalized in detail. We investigate inter-lineage reassortments by comparing phylogenetic trees across genomic segments. Our analyses indicate that of the 8 segments of influenza B viruses only PB1, PB2 and HA segments maintained separate Victoria and Yamagata lineages and that currently circulating strains possess PB1, PB2 and HA segments derived entirely from one or the other lineage; other segments have repeatedly reassorted between lineages thereby reducing genetic diversity. We argue that this difference between segments is due to selection against reassortant viruses with mixed lineage PB1, PB2 and HA segments. Given sufficient time and continued recruitment to the reassortment-isolated PB1-PB2-HA gene complex, we expect influenza B viruses to eventually undergo sympatric speciation.

Local description of phylogenetic group-based models

Marta Casanellas, Jesús Fernández-Sánchez, Mateusz Michałek
(Submitted on 27 Feb 2014)

Motivated by phylogenetics, our aim is to obtain a system of equations that define a phylogenetic variety on an open set containing the biologically meaningful points. In this paper we consider phylogenetic varieties defined via group-based models. For any finite abelian group G, we provide an explicit construction of codim X phylogenetic invariants (polynomial equations) of degree at most |G| that define the variety X on a Zariski open set U. The set U contains all biologically meaningful points when G is the group of the Kimura 3-parameter model. In particular, our main result confirms a conjecture by the third author and, on the set U, a couple of conjectures by Bernd Sturmfels and Seth Sullivant.

Implications of uniformly distributed, empirically informed priors for phylogeographical model selection: A reply to Hickerson et al

Jamie R. Oaks, Charles W. Linkem, Jeet Sukumaran
(Submitted on 26 Feb 2014)

Biogeographers often seek to explain speciation in terms of geographical phenomena. Establishing that a set of population splitting events occurred at the same time can be a persuasive argument that a set of taxa were affected by the same geographic events. Huang et al. (2011) introduced an approximate Bayesian approach (implemented in the software msBayes) to estimate the probabilities of models in which multiple sets of taxa diverge simultaneously. Oaks et al. (2013) used this model-choice framework to study 22 pairs of vertebrates distributed across the Philippines; they also studied the behavior of the approach using simulations. Oaks et al. (2013) found the model was very sensitive to the prior and had low power to detect variation in divergence times. This was not surprising in light of a rich statistical literature showing the marginal likelihood of a model is sensitive to vague priors. Because this sensitivity to prior assumptions affects the crucial insights a researcher who employs msBayes seeks to gain, Oaks et al. (2013) recommended that users of the approach carefully assess the robustness of their conclusions to different priors. According to Hickerson et al. (2014), the lack of robustness was due to broad priors leading to inadequate numbers of simulations. They proposed a model-averaging approach using narrow, empirically informed uniform priors. Here, we demonstrate that their approach is dangerous in the sense that the empirically derived priors often exclude the true values of the parameters. We question the value of adopting an empirical-Bayesian stance for this problem, because it can produce misleading model posterior probabilities. The robust approach of conducting analyses under a variety of priors can reveal sensitivity and communicate assumptions underlying inference. Furthermore, simulations provide insight into the temporal resolution of the method and guide interpretation of results.