Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading
Sebastien Roch, Mike Steel
(Submitted on 6 Sep 2014)
The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multi-species coalescent with a standard site substitution model, maximum likelihood estimation on sequence data that has been concatenated across genes, performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree, is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).
An algorithm for constructing principal geodesics in phylogenetic treespace
Tom M. W. Nye
(Submitted on 2 Sep 2014)
Most phylogenetic analyses result in a sample of trees, but summarizing and visualizing these samples can be challenging. Consensus trees often provide limited information about a sample, and so methods such as consensus networks, clustering and multidimensional scaling have been developed and applied to tree samples. This paper describes a stochastic algorithm for constructing a principal geodesic or line through treespace which is analogous to the first principal component in standard Principal Components Analysis. A principal geodesic summarizes the most variable features of a sample of trees, in terms of both tree topology and branch lengths, and it can be visualized as an animation of smoothly changing trees. The algorithm performs a stochastic search through parameter space for a geodesic which minimises the sum of squared projected distances of the data points. This procedure aims to identify the globally optimal principal geodesic, though convergence to locally optimal geodesics is possible. The methodology is illustrated by constructing principal geodesics for experimental and simulated data sets, demonstrating the insight into samples of trees that can be gained and how the method improves on a previously published approach. A Java package called GeoPhytter for constructing and visualising principal geodesics is freely available from http://www.ncl.ac.uk/~ntmwn/geophytter.
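The stochastic search described in this abstract can be illustrated with a Euclidean analogue. The sketch below is not the treespace algorithm and is not taken from GeoPhytter: it assumes ordinary points in R^d, fixes the line to pass through the centroid, and uses a simple accept-if-better random perturbation of the direction. The function names, step size and number of steps are all illustrative assumptions.

```python
import math
import random

def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sum_sq_residuals(points, centre, unit):
    """Sum of squared distances from each point to the line centre + t * unit."""
    total = 0.0
    for p in points:
        diff = [a - c for a, c in zip(p, centre)]
        t = sum(x * u for x, u in zip(diff, unit))  # projection coordinate
        total += sum((x - t * u) ** 2 for x, u in zip(diff, unit))
    return total

def stochastic_principal_line(points, steps=2000, step_size=0.1, seed=0):
    """Random-perturbation search for the line through the centroid that
    minimises the sum of squared projected distances (Euclidean analogue only)."""
    rng = random.Random(seed)
    dim = len(points[0])
    centre = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    direction = normalise([rng.gauss(0.0, 1.0) for _ in range(dim)])
    best = sum_sq_residuals(points, centre, direction)
    for _ in range(steps):
        # Propose a nearby direction and keep it only if the fit improves.
        proposal = normalise([d + rng.gauss(0.0, step_size) for d in direction])
        score = sum_sq_residuals(points, centre, proposal)
        if score < best:
            direction, best = proposal, score
    return centre, direction, best
```

In Euclidean space the result can be checked against the first principal component. In treespace no such closed form is available, which is presumably why a search procedure is needed and why only local optimality can be guaranteed.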
A Consistent Estimator of the Evolutionary Rate
Krzysztof Bartoszek, Serik Sagitov
(Submitted on 21 Aug 2014)
We consider a branching particle system where particles reproduce according to the pure birth Yule process with birth rate L, conditioned on the observed number of particles being equal to n. Particles are assumed to move independently on the real line according to Brownian motion with local variance s^2. In this paper we treat the n particles as a sample of related species. The spatial Brownian motion of a particle describes the development of a trait value of interest (e.g. log-body-size). We propose an unbiased estimator R_n^2 of the evolutionary rate r^2 = s^2/L. The estimator R_n^2 is proportional to the sample variance S_n^2 computed from the n trait values. We find an approximate formula for the standard error of R_n^2 based on a neat asymptotic relation for the variance of S_n^2.
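The setting of this abstract can be explored with a small simulation. The sketch below makes simplifying assumptions that are not from the paper: the tree is grown forward from a single lineage until n lineages exist, one further exponential waiting period is added while n lineages are present, and the ancestral trait value is 0. The abstract does not give the constant that turns the sample variance into the estimator, so the code stops at the sample variance. All function and parameter names are hypothetical.

```python
import random
import statistics

def simulate_tip_traits(n, birth_rate, local_variance, seed=None):
    """Simulate trait values at n tips of a pure-birth tree with Brownian traits.

    Assumption: with k lineages present, the waiting time to the next event is
    exponential with rate k * birth_rate, including the final period with n lineages.
    """
    rng = random.Random(seed)
    traits = [0.0]  # assumed ancestral trait value
    while True:
        k = len(traits)
        dt = rng.expovariate(k * birth_rate)
        sd = (local_variance * dt) ** 0.5
        # Every lineage evolves independently over the interval dt.
        traits = [x + rng.gauss(0.0, sd) for x in traits]
        if k == n:
            return traits
        # A uniformly chosen lineage splits; the daughter inherits its trait.
        parent = rng.randrange(k)
        traits.append(traits[parent])

def tip_sample_variance(n, birth_rate, local_variance, seed=None):
    """Sample variance of the n simulated tip traits."""
    return statistics.variance(simulate_tip_traits(n, birth_rate, local_variance, seed))
```

Repeating `tip_sample_variance` over many seeds gives an empirical picture of how variable the sample variance is for a given n, which is the quantity the standard-error formula in the abstract addresses.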
A Distance Method to Reconstruct Species Trees In the Presence of Gene Flow
Lingfei Cui, Laura Kubatko
One of the central tasks in evolutionary biology is to reconstruct the evolutionary relationships among species from sequence data, particularly from multilocus data. In the last ten years, many methods have been proposed to use the variance in the gene histories to estimate species trees by explicitly modeling deep coalescence. However, gene flow, another process that may produce gene history variance, has been less studied. In this paper, we propose a simple yet innovative method for species tree estimation in the presence of gene flow. Our method, called STEST (Species Tree Estimation from Speciation Times), constructs species tree estimates from pairwise speciation time or species divergence time estimates. By using methods that estimate speciation times in the presence of gene flow (for example, M1 (Yang 2010) or SIM3s (Zhu and Yang 2012)), STEST is able to estimate species trees from data subject to gene flow. We develop two methods, called STEST (M1) and STEST (SIM3s), for this purpose. Additionally, we consider the method STEST (M0), which instead uses the M0 method (Yang 2002), a coalescent-based method that does not assume gene flow, to estimate speciation times. It is therefore devised to estimate species trees in the absence of gene flow. Our simulation studies show that STEST (M0) outperforms STEST (M1), STEST (SIM3s) and STEM in terms of estimation accuracy and outperforms *BEAST in terms of running time when the degree of gene flow is small. STEST (M1) outperforms STEST (M0), STEST (SIM3s), STEM and *BEAST in terms of estimation accuracy when the degree of gene flow is large. An empirical data set analyzed by these methods gives species tree estimates that are consistent with the previous results.
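The abstract says that the tree is built from pairwise divergence time estimates but does not say how. One plausible way to do this, shown here purely as an assumption, is average-linkage agglomerative clustering in the style of UPGMA. The function name, the input layout and the toy divergence times below are all made up for illustration and are not taken from STEST.

```python
from itertools import combinations

def tree_from_divergence_times(species, times):
    """Build a Newick topology by repeatedly joining the two clusters with the
    smallest average pairwise divergence time (UPGMA-style; an assumed procedure).

    times maps frozenset({a, b}) to the estimated divergence time of a and b.
    """
    # Each cluster is (newick_label, tuple_of_member_species).
    clusters = [(s, (s,)) for s in species]

    def average_time(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(times[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average_time(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append((f"({c1[0]},{c2[0]})", c1[1] + c2[1]))
    return clusters[0][0] + ";"

# Toy, made-up divergence times for four hypothetical species.
toy_times = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"C", "D"}): 2.0,
    frozenset({"A", "C"}): 5.0,
    frozenset({"A", "D"}): 5.0,
    frozenset({"B", "C"}): 5.0,
    frozenset({"B", "D"}): 5.0,
}
toy_tree = tree_from_divergence_times(["A", "B", "C", "D"], toy_times)
```

With these toy values the result is `((A,B),(C,D));`. Under this reading, the choice of M0, M1 or SIM3s would only change how the entries of `times` are estimated, not the clustering step.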
A codon model of nucleotide substitution with selection on synonymous codon usage
Laura Kubatko, Premal Shah, Radu Herbei, Michael Gilchrist
The quality of phylogenetic inference made from protein-coding genes depends, in part, on the realism with which the codon substitution process is modeled. Here we propose a new mechanistic model that combines the standard M0 substitution model of Yang (1997) with a simplified model from Gilchrist (2007) that includes selection on synonymous substitutions as a function of codon-specific nonsense error rates. We tested the newly proposed model by applying it to 104 protein-coding genes in brewer’s yeast, and compared the fit of the new model to the standard M0 model and to the mutation-selection model of Yang and Nielsen (2008) using the AIC. Our new model provided a significantly better fit in approximately 85% of the cases considered for the basic M0 model and in approximately 25% of the cases for the M0 model with estimated codon frequencies, but only in a few cases when the mutation-selection model was considered. However, our model includes a parameter that can be interpreted as a measure of the rate of protein production, and the estimates of this parameter were highly correlated with an independent measure of protein production for the yeast genes considered here. Finally, we found that in some cases the new model led to the preference of a different phylogeny for a subset of the genes considered, indicating that substitution model choice may have an impact on the estimated phylogeny.
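The model comparison in this abstract relies on the AIC, defined as 2k - 2 ln L for a model with k free parameters and maximised likelihood L, where the model with the lowest value is preferred. A minimal sketch of a per-gene comparison follows. The model names, log-likelihoods and parameter counts are hypothetical placeholders, not values from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def preferred_model(fits):
    """Return the name of the model with the lowest AIC.

    fits maps a model name to (maximised log-likelihood, number of free parameters).
    """
    return min(fits, key=lambda name: aic(*fits[name]))

# Hypothetical fits for a single gene; the numbers are illustrative only.
example_fits = {
    "M0": (-1000.0, 10),
    "new_model": (-990.0, 12),
}
best = preferred_model(example_fits)
```

Here the extra two parameters of `new_model` cost 4 AIC units but the log-likelihood gain of 10 is worth 20, so `best` is `"new_model"`. Applying `preferred_model` gene by gene and counting the wins is one way to arrive at percentages like those quoted in the abstract.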
A Statistical Test for Clades in Phylogenies
Thurston H. Y. Dang, Elchanan Mossel
(Submitted on 29 Jul 2014)
We investigated testing the likelihood of a phylogenetic tree by comparison to its subtree pruning and regrafting (SPR) neighbors, with or without re-optimizing branch lengths. This is inspired by aspects of Bayesian significance tests, and the use of SPRs for heuristically finding maximum likelihood trees. Through a number of simulations with the Jukes-Cantor model on various topologies, it is observed that the SPR tests are informative, and reasonably fast compared to searching for the maximum likelihood tree. This suggests that the SPR tests would be a useful addition to the suite of existing statistical tests, for identifying potential inaccuracies of inferred topologies.
Statistical and conceptual challenges in the comparative analysis of principal components
Josef C Uyeda, Daniel S. Caetano, Matthew W Pennell
Quantitative geneticists long ago recognized the value of studying evolution in a multivariate framework (Pearson, 1903). Due to linkage, pleiotropy, coordinated selection and mutational covariance, the evolutionary response in any phenotypic trait can only be properly understood in the context of other traits (Lande, 1979; Lynch and Walsh, 1998). This is of course also well-appreciated by comparative biologists. However, unlike in quantitative genetics, most of the statistical and conceptual tools for analyzing phylogenetic comparative data (recently reviewed in Pennell and Harmon, 2013) are designed for analyzing a single trait (but see, for example, Revell and Harmon, 2008; Revell and Harrison, 2008; Hohenlohe and Arnold, 2008; Revell and Collar, 2009; Schmitz and Motani, 2011; Adams, 2014b). Indeed, even classical approaches for testing for correlated evolution between two traits (e.g., Felsenstein, 1985; Grafen, 1989; Harvey and Pagel, 1991) are not actually multivariate, as each trait is assumed to have evolved under a process that is independent of the state of the other (Hansen and Orzack, 2005; Hansen and Bartoszek, 2012). As a result of these limitations, researchers with multivariate datasets are often faced with a choice: analyze each trait as if it were independent, or else decompose the dataset into statistically independent sets of traits, such that each set can be analyzed with univariate methods.