Estimating the temporal and spatial extent of gene flow among sympatric lizard populations (genus Sceloporus) in the southern Mexican highlands
Jared A Grummer, Martha L. Calderón, Adrián Nieto Montes-de Oca, Eric N Smith, Fausto Méndez-de la Cruz, Adam Leaché
Interspecific gene flow is pervasive throughout the tree of life. Although detecting gene flow between populations has been facilitated by new analytical approaches, determining the timing and geography of hybridization has remained difficult, particularly for historical gene flow. A geographically explicit phylogenetic approach is needed to determine the ancestral population overlap. In this study, we performed population genetic analyses, species delimitation, simulations, and a recently developed approach of species tree diffusion to infer the phylogeographic history, timing and geographic extent of gene flow in lizards of the Sceloporus spinosus group. The two species in this group, S. spinosus and S. horridus, are distributed in eastern and western portions of Mexico, respectively, but populations of these species are sympatric in the southern Mexican highlands. We generated data consisting of three mitochondrial genes and eight nuclear loci for 148 and 68 individuals, respectively. We delimited six lineages in this group, but found strong evidence of mito-nuclear discordance in sympatric populations of S. spinosus and S. horridus owing to mitochondrial introgression. We used coalescent simulations to differentiate ancestral gene flow from secondary contact, but found mixed support for these two models. Bayesian phylogeography indicated more than 60% range overlap between ancestral S. spinosus and S. horridus populations since the time of their divergence. Isolation-migration analyses, however, revealed near-zero levels of gene flow between these ancestral populations. Results from both simulations and empirical data indicate that, despite a long history of sympatry between these two species, gene flow in this group has occurred only recently.
A Composite Genome Approach to Identify Phylogenetically Informative Data from Next-Generation Sequencing
Rachel S. Schwartz, Kelly Harkins, Anne C. Stone, Reed A. Cartwright
(Submitted on 16 May 2013 (v1), last revised 12 Nov 2014 (this version, v3))
We have developed a novel method to rapidly obtain homologous genomic data for phylogenetics directly from next-generation sequencing reads without the use of a reference genome. This software, called SISRS, avoids the time-consuming steps of de novo whole genome assembly, genome-genome alignment, and annotation. In simulations, SISRS is able to identify large numbers of loci containing variable sites with phylogenetic signal. For genomic data from apes, SISRS identified thousands of variable sites, from which we produced an accurate phylogeny. Finally, we used SISRS to identify phylogenetic markers that we used to estimate the phylogeny of placental mammals. We recovered phylogenies from multiple datasets that were consistent with previous conflicting estimates of the relationships among mammals. SISRS is open source and freely available at this https URL
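To make the idea of "variable sites with phylogenetic signal" concrete, the sketch below applies the standard parsimony-informative criterion to a toy alignment: a site counts if at least two different bases each occur in at least two taxa. This is an illustration only. The function names, taxon labels and sequences are hypothetical, and the abstract does not say that SISRS filters sites this way.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two distinct bases each occur in at least two taxa."""
    # Ignore gaps and ambiguity codes such as '-' or 'N'.
    counts = Counter(base for base in column.upper() if base in "ACGT")
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return 0-based indices of parsimony-informative columns.

    `alignment` maps a taxon name to an aligned sequence; all sequences
    are assumed to have the same length.
    """
    rows = list(alignment.values())
    length = len(rows[0])
    return [
        i for i in range(length)
        if is_parsimony_informative("".join(row[i] for row in rows))
    ]


# Hypothetical toy alignment (not data from the paper).
toy = {
    "taxon1": "ACGTA",
    "taxon2": "ACGAA",
    "taxon3": "ATGAC",
    "taxon4": "ATGTC",
}
sites = informative_sites(toy)
```

Columns 0 and 2 are invariant in the toy data, so only columns 1, 3 and 4 pass the filter.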
Impacts of terraces on phylogenetic inference
Michael J Sanderson, Michelle M. McMahon, Alexandros Stamatakis, Derrick J. Zwickl, Mike Steel
Comments: 50 pages, 9 figures
Subjects: Populations and Evolution (q-bio.PE)
Terraces are potentially large sets of trees with precisely the same likelihood or parsimony score, which can be induced by missing sequences in partitioned multi-locus phylogenetic data matrices. The set of trees on a terrace can be characterized by enumeration algorithms or consensus methods that exploit the pattern of partial taxon coverage in the data, independent of the sequence data themselves. Terraces add ambiguity and complexity to phylogenetic inference, particularly in settings where inference is already challenging: data sets with many taxa and relatively few loci. In this paper we present five new findings about terraces and their impacts on phylogenetic inference. First, we clarify assumptions about model parameters that are necessary for the existence of terraces. Second, we explore the dependence of terrace size on partitioning scheme and indicate how to find the partitioning scheme associated with the largest terrace containing a given tree. Third, we highlight the impact of terraces on bootstrap estimates of confidence limits in clades, and characterize the surprising result that the bootstrap proportion for a clade can be entirely determined by the frequency of bipartitions on a terrace, with some bipartitions receiving high support even when incorrect. Fourth, we dissect some effects of prior distributions of edge lengths on the computed posterior probabilities of clades on terraces, to understand an example in which long edges “attract” each other in Bayesian inference. Fifth, we show that even if data are not partitioned, patterns of missing data studied in the terrace problem can lead to instances of apparent statistical inconsistency when even a small element of heterotachy is introduced to the model generating the sequence data. Finally, we discuss strategies for remediation of some of these problems.
Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading
Sebastien Roch, Mike Steel
(Submitted on 6 Sep 2014)
The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multi-species coalescent with a standard site substitution model, maximum likelihood estimation on sequence data that has been concatenated across genes and performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).
An algorithm for constructing principal geodesics in phylogenetic treespace
Tom M. W. Nye
(Submitted on 2 Sep 2014)
Most phylogenetic analyses result in a sample of trees, but summarizing and visualizing these samples can be challenging. Consensus trees often provide limited information about a sample, and so methods such as consensus networks, clustering and multidimensional scaling have been developed and applied to tree samples. This paper describes a stochastic algorithm for constructing a principal geodesic or line through treespace which is analogous to the first principal component in standard Principal Components Analysis. A principal geodesic summarizes the most variable features of a sample of trees, in terms of both tree topology and branch lengths, and it can be visualized as an animation of smoothly changing trees. The algorithm performs a stochastic search through parameter space for a geodesic which minimises the sum of squared projected distances of the data points. This procedure aims to identify the globally optimal principal geodesic, though convergence to locally optimal geodesics is possible. The methodology is illustrated by constructing principal geodesics for experimental and simulated data sets, demonstrating the insight into samples of trees that can be gained and how the method improves on a previously published approach. A Java package called GeoPhytter for constructing and visualising principal geodesics is freely available from http://www.ncl.ac.uk/~ntmwn/geophytter.
A Consistent Estimator of the Evolutionary Rate
Krzysztof Bartoszek, Serik Sagitov
(Submitted on 21 Aug 2014)
We consider a branching particle system where particles reproduce according to the pure birth Yule process with birth rate $L$, conditioned on the observed number of particles being equal to $n$. Particles are assumed to move independently on the real line according to Brownian motion with local variance $s^2$. In this paper we treat the $n$ particles as a sample of related species. The spatial Brownian motion of a particle describes the development of a trait value of interest (e.g. log-body-size). We propose an unbiased estimator $R_n^2$ of the evolutionary rate $r^2 = s^2/L$. The estimator $R_n^2$ is proportional to the sample variance $S_n^2$ computed from $n$ trait values. We find an approximate formula for the standard error of $R_n^2$ based on a neat asymptotic relation for the variance of $S_n^2$.
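The model in this abstract can be explored with a small simulation. The sketch below grows a Yule tree to n tips, runs Brownian motion along the branches, and computes the sample variance of the tip trait values. It makes two assumptions that the abstract does not state: conditioning on n tips is imitated by letting the period with k lineages last an exponential time with rate k times the birth rate, including the final period with n lineages; and the constant that turns the sample variance into the proposed estimator is left out, because the abstract does not give it. All function and parameter names are hypothetical.

```python
import random
import statistics


def simulate_tip_traits(n, birth_rate, local_variance, rng):
    """Simulate trait values at the n tips of a Yule tree under Brownian motion.

    Assumption: while k lineages exist, the waiting time is Exp(k * birth_rate),
    and the process is observed at the end of the period with n lineages.
    """
    traits = [0.0]  # a single ancestral lineage with trait value 0
    while True:
        k = len(traits)
        wait = rng.expovariate(k * birth_rate)
        sd = (local_variance * wait) ** 0.5
        # Every lineage accumulates an independent Brownian increment.
        traits = [x + rng.gauss(0.0, sd) for x in traits]
        if k == n:
            return traits
        # A uniformly chosen lineage splits; the daughter inherits its trait.
        parent = rng.randrange(k)
        traits.append(traits[parent])


rng = random.Random(0)
tips = simulate_tip_traits(n=20, birth_rate=1.0, local_variance=0.5, rng=rng)
s2_n = statistics.variance(tips)  # sample variance of the n trait values
```

Waiting times scale as one over the birth rate and Brownian increments scale with the local variance times elapsed time, so the expected sample variance scales with the ratio of local variance to birth rate. That scaling is what makes a rescaled sample variance a natural candidate estimator of the evolutionary rate.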
A Distance Method to Reconstruct Species Trees In the Presence of Gene Flow
Lingfei Cui, Laura Kubatko
One of the central tasks in evolutionary biology is to reconstruct the evolutionary relationships among species from sequence data, particularly from multilocus data. In the last ten years, many methods have been proposed to use the variance in the gene histories to estimate species trees by explicitly modeling deep coalescence. However, gene flow, another process that may produce gene history variance, has been less studied. In this paper, we propose a simple yet innovative method for species tree estimation in the presence of gene flow. Our method, called STEST (Species Tree Estimation from Speciation Times), constructs species tree estimates from pairwise speciation time or species divergence time estimates. By using methods that estimate speciation times in the presence of gene flow (for example, M1 (Yang 2010) or SIM3s (Zhu and Yang 2012)), STEST is able to estimate species trees from data subject to gene flow. We develop two methods, called STEST (M1) and STEST (SIM3s), for this purpose. Additionally, we consider the method STEST (M0), which instead uses the M0 method (Yang 2002), a coalescent-based method that does not assume gene flow, to estimate speciation times. It is therefore devised to estimate species trees in the absence of gene flow. Our simulation studies show that STEST (M0) outperforms STEST (M1), STEST (SIM3s) and STEM in terms of estimation accuracy and outperforms *BEAST in terms of running time when the degree of gene flow is small. STEST (M1) outperforms STEST (M0), STEST (SIM3s), STEM and *BEAST in terms of estimation accuracy when the degree of gene flow is large. An empirical data set analyzed by these methods gives species tree estimates that are consistent with previous results.
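The abstract does not say how STEST turns pairwise speciation time estimates into a tree. As an illustration of the general idea only, the sketch below applies average-linkage clustering (UPGMA), a standard distance method swapped in here as an assumption, to a matrix of pairwise divergence times and returns the topology as a Newick string. The species labels and time values are hypothetical.

```python
from itertools import combinations


def upgma_topology(names, pair_times):
    """Cluster species by average pairwise divergence time (UPGMA).

    `pair_times` maps frozenset({a, b}) to an estimated divergence time.
    Returns a Newick string with topology only (no branch lengths).
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of species
    dist = {frozenset(p): pair_times[frozenset(p)] for p in combinations(names, 2)}
    while len(sizes) > 1:
        # Join the two clusters with the smallest average divergence time.
        a, b = min(combinations(sorted(sizes), 2), key=lambda p: dist[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        for c in sizes:
            # Size-weighted average of the distances from a and b to c.
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes)) + ";"


# Hypothetical pairwise divergence time estimates for four species.
times = {
    frozenset(("A", "B")): 1.0,
    frozenset(("C", "D")): 2.0,
    frozenset(("A", "C")): 5.0,
    frozenset(("A", "D")): 5.0,
    frozenset(("B", "C")): 5.0,
    frozenset(("B", "D")): 5.0,
}
tree = upgma_topology(["A", "B", "C", "D"], times)
```

In this toy input, A and B diverged most recently and are joined first, then C and D, and the two pairs are joined at the root. In a pipeline of this kind, the choice of estimator used to fill in the pairwise times (with or without a gene flow model) would determine how robust the resulting tree is to gene flow.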