Evaluating strategies of phylogenetic analyses by the coherence of their results

Blaise Li
(Submitted on 5 Jul 2013)

I propose an approach to identify, among several strategies of phylogenetic analysis, those producing the most accurate results. This approach is based on the hypothesis that the more a result is reproduced from independent data, the more it reflects the historical signal common to the analysed data. Under this hypothesis, the capacity of an analytical strategy to extract historical signal should correlate positively with the coherence of the obtained results. I apply this approach to a series of analyses on empirical data, basing the coherence measure on the Robinson-Foulds distances between the obtained trees. To a first approximation, the analytical strategies most suitable for the data produce the most coherent results. However, risks of false positives and false negatives are identified, which are difficult to rule out.
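As a rough illustration of a coherence measure built on Robinson-Foulds distances, the sketch below counts the non-trivial splits that two trees do not share and averages that count over all pairs of trees. The nested-tuple tree encoding, the function names, and the use of a plain mean are assumptions made for this example; the abstract does not specify how the paper aggregates the distances.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial splits of the tree viewed as unrooted.

    Each split is stored as the side that does NOT contain an arbitrary
    reference taxon, so the same split always gets the same key.
    """
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if ref in side:
            side = all_taxa - side
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in exactly one tree."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))

def mean_pairwise_rf(trees):
    """Hypothetical coherence score: lower mean distance = more coherent results."""
    pairs = list(combinations(trees, 2))
    return sum(rf_distance(a, b) for a, b in pairs) / len(pairs)

# Toy trees on five taxa, standing in for trees inferred from independent data.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2), mean_pairwise_rf([t1, t1, t2]))
```

Under this toy score, a strategy whose trees from independent data sets have a lower mean pairwise distance would count as more coherent.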

Systematic identification of gene families for use as markers for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups

Dongying Wu, Guillaume Jospin, Jonathan A. Eisen
(Submitted on 2 Jul 2013)

Given the astonishing rate at which genomic and metagenomic sequence data sets are accumulating, there are many reasons to constrain the data analyses. One approach to such constrained analyses is to focus on select subsets of gene families that are particularly well suited for the tasks at hand. Such gene families have generally been referred to as marker genes. We are particularly interested in identifying and using such marker genes for phylogenetic and phylogeny-driven ecological studies of microbes and their communities. We therefore refer to these as PhyEco (for phylogenetic and phylogenetic ecology) markers. The dual use of these PhyEco markers means that we needed to develop and apply a set of somewhat novel criteria for identification of the best candidates for such markers. The criteria we focused on included universality across the taxa of interest, ability to be used to produce robust phylogenetic trees that reflect as much as possible the evolution of the species from which the genes come, and low variation in copy number across taxa. We describe here an automated protocol for identifying potential PhyEco markers from a set of complete genome sequences. The protocol combines rapid searching, clustering and phylogenetic tree building algorithms to generate protein families that meet the criteria listed above. We report here the identification of PhyEco markers for different taxonomic levels including 40 for all bacteria and archaea, 114 for all bacteria, and many more for some of the individual phyla of bacteria. This new list of PhyEco markers should allow much more detailed automated phylogenetic and phylogenetic ecology analyses of these groups than previously possible.

Predicting the loss of phylogenetic diversity under non-stationary diversification models

Amaury Lambert, Mike Steel
(Submitted on 12 Jun 2013)

For many taxa, the current high rates of extinction are likely to result in a significant loss of biodiversity. The evolutionary heritage of biodiversity is frequently quantified by a measure called phylogenetic diversity (PD). We predict the loss of PD under a wide class of phylogenetic tree models, where speciation rates and extinction rates may be time-dependent, and assuming independent random species extinctions at the present. We study the loss of PD when $K$ contemporary species are selected uniformly at random from the $N$ extant species as the surviving taxa, while the remaining $N-K$ become extinct. We consider two models of species sampling, the so-called field of bullets model, where each species independently survives the extinction event at the present with probability $p$, and a model for which the number of surviving species is fixed.
We provide explicit formulae for the expected remaining PD in both models, conditional on $N=n$, conditional on $K=k$, or conditional on both events. When $N=n$ is fixed, we show the convergence to an explicit deterministic limit of the ratio of new to initial PD, as $n\to\infty$, both under the field of bullets model, and when $K=k_n$ is fixed and depends on $n$ in such a way that $k_n/n$ converges to $p$. We also prove the convergence of this ratio as $T\to\infty$ in the supercritical, time-homogeneous case, where $N$ simultaneously goes to $\infty$, thereby strengthening previous results of Mooers et al. (2012).
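For a single fixed tree, the field of bullets model gives a simple expectation that can be sketched in a few lines. Assuming PD is measured on the rooted tree (the total length of all branches with at least one surviving descendant), a branch of length $\ell$ with $m$ descendant leaves is retained with probability $1-(1-p)^m$. The tree encoding and function name below are assumptions for illustration, and this is not the paper's result, which concerns expectations over random tree models rather than one given tree.

```python
def expected_pd(node, p):
    """Expected rooted PD after a field-of-bullets extinction event.

    `node` is (branch_length, content), where content is a leaf name (str)
    or a list of child nodes. Returns (number_of_leaves, expected_pd) for
    the subtree, including the branch above `node`.
    """
    length, content = node
    if isinstance(content, str):
        n_leaves, below = 1, 0.0
    else:
        n_leaves, below = 0, 0.0
        for child in content:
            n, pd = expected_pd(child, p)
            n_leaves += n
            below += pd
    # The branch survives if at least one of its descendant leaves survives.
    survive = 1.0 - (1.0 - p) ** n_leaves
    return n_leaves, below + length * survive

# Toy tree: ((a:1, b:1):1, c:2) with a zero-length root branch.
toy_tree = (0.0, [(1.0, [(1.0, "a"), (1.0, "b")]), (2.0, "c")])
n, pd_half = expected_pd(toy_tree, 0.5)
print(n, pd_half)
```

With $p=1$ the function returns the full PD of the tree, so the ratio of expected remaining PD to initial PD can be read off by dividing the two values.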

Efficient Exploration of the Space of Reconciled Gene Trees

Gergely J. Szöllősi, Wojciech Rosikiewicz, Bastien Boussau, Eric Tannier, Vincent Daubin
(Submitted on 10 Jun 2013)

Gene trees record the combination of gene level events, such as duplication, transfer and loss, and species level events, such as speciation and extinction. Gene tree-species tree reconciliation methods model these processes by drawing gene trees into the species tree using a series of gene and species level events. The reconstruction of gene trees based on sequence alone almost always involves choosing between statistically equivalent or weakly distinguishable relationships that could be much better resolved based on a putative species tree. To exploit this potential for accurate reconstruction of gene trees the space of reconciled gene trees must be explored according to a joint model of sequence evolution and gene tree-species tree reconciliation.
Here we present amalgamated likelihood estimation (ALE), a probabilistic approach to exhaustively explore all reconciled gene trees that can be amalgamated as a combination of clades observed in a sample of trees. We implement ALE in the context of a reconciliation model, which allows for the duplication, transfer and loss of genes. We use ALE to efficiently approximate the sum of the joint likelihood over amalgamations and to find the reconciled gene tree that maximizes the joint likelihood.
We demonstrate using simulations that gene trees reconstructed using the joint likelihood are substantially more accurate than those reconstructed using sequence alone. Using realistic topologies, branch lengths and alignment sizes, we demonstrate that ALE produces more accurate gene trees even if the model of sequence evolution is greatly simplified. Finally, examining 1099 gene families from 36 cyanobacterial genomes we find that joint likelihood-based inference results in a striking reduction in apparent phylogenetic discord, with 24%, 59% and 46% reductions in the mean numbers of duplications, transfers and losses, respectively.

Hide and seek: placing and finding an optimal tree for thousands of homoplasy-rich sequences

Dietrich Radel, Andreas Sand, Mike Steel
(Submitted on 6 Jun 2013)

Finding optimal evolutionary trees from sequence data is typically an intractable problem, and there is usually no way of knowing how close to optimal the best tree from some search truly is. The problem would seem to be particularly acute when we have many taxa and when the data have high levels of homoplasy, in which the individual characters require many changes to fit on the best tree. However, a recent mathematical result has provided a precise tool to generate a small number of high-homoplasy characters for any given tree, so that this tree is provably the optimal tree under the maximum parsimony criterion. This provides, for the first time, a rigorous way to test tree search algorithms on homoplasy-rich data, where we know in advance what the ‘best’ tree is. In this short note we consider just one search program (TNT) but show that it is able to locate the globally optimal tree correctly for 32,768 taxa, even though the characters in the dataset require, on average, 1148 state-changes each to fit on this tree, and the number of characters is only 57.
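To make "the number of state-changes a character requires to fit on a tree" concrete, the sketch below counts changes with Fitch's small-parsimony algorithm on a binary tree. The nested-tuple encoding and the function name are assumptions for this example; it does not reproduce the construction of high-homoplasy characters or the TNT searches described in the note.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a binary tree.

    `tree` is a nested tuple of leaf names; `states` maps leaf name -> state.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        # Disjoint state sets: at least one change is needed on this node's
        # child branches.
        changes += 1
        return left | right

    visit(tree)
    return changes

toy_tree = (("A", "B"), ("C", "D"))
clean = {"A": 0, "B": 0, "C": 1, "D": 1}        # fits with a single change
homoplastic = {"A": 0, "B": 1, "C": 0, "D": 1}  # needs an extra change
print(fitch_changes(toy_tree, clean), fitch_changes(toy_tree, homoplastic))
```

A two-state character needs at least one change on any tree, so the second character, which needs two, shows homoplasy on this tree. The parsimony score of a tree is the sum of these counts over all characters.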

Our paper: The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

This guest post is by Detlef Weigel (@WeigelWorld) and Hernán A. Burbano on their arXived paper [with coauthors] Yoshida et al. The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine. arXived here and in press at eLife [to appear here].

This paper is the result of a great collaboration between a lab that specializes in ancient DNA (that of Johannes Krause from the University of Tübingen), an expert in pathogen systematics (the group of Marco Thines from the Senckenberg Museum and Goethe University in Frankfurt), two pathogen genomics labs (those of Sophien Kamoun from the Sainsbury Laboratory in Norwich and Frank Martin from the USDA in California), and our evolutionary genomics group at the Max Planck Institute in Tübingen (Hernán A. Burbano and Detlef Weigel).

Phytophthora infestans made history when it destroyed large parts of the European potato crop, beginning in 1845. Potato has its origin in the Andes, in the Southeast of modern Peru and Northwest of Bolivia, while the center of diversity of P. infestans is several thousand kilometers further north, in Mexico’s Toluca Valley. There, other Phytophthora species live on a broad range of host plants. At some point in its history, evolutionary events associated with repeat-driven genome expansion [1,2] endowed P. infestans with the genetic arsenal required to infect potato. The pathogen was introduced to Europe in 1845 via infected potato tubers from the United States, where potato blight had made its first appearance in 1843. In the ensuing European blight epidemic, Ireland was hit especially hard, because the virtual absence of independent farmers and a restrictive customs policy conspired with the disease caused by P. infestans, potato blight, to have disproportionately devastating effects. The Great Famine that struck Ireland was a decisive event in both European and American history. One million Irish died of starvation, and at least another million left the country – most of them to the USA.

This part of P. infestans history has been clear, but the relationship of the strain(s) that caused the nineteenth century epidemic to modern strains has been controversial. Before a range of genetically quite distinct P. infestans strains made their debut throughout the world some 40 years ago, the global population outside Mexico was dominated by a single strain, called US-1. Because of its prevalence, US-1 was long thought to have been the cause of the fatal outbreak in the nineteenth century. From the analysis of a single SNP in the mitochondrial genome, it was, however, concluded in 2001 that the nineteenth century strains were more closely related to the modern strains that prevail today [3].

In our new paper, we resolve this paradoxical view: While the historical pathogen strain, which we call HERB-1, indeed differs at this one position from US-1, which has a derived allele, HERB-1 is far more closely related to US-1 than to other modern strains. Molecular clock analyses show that both strains probably separated from each other only a few years before the major European outbreak. HERB-1 seems to have dominated the global population without many genetic changes, and only in the twentieth century, after new potato varieties were introduced, was HERB-1 replaced by US-1 as the most successful P. infestans strain. We do not know for sure why HERB-1 was replaced, but we noted that the modern strains tend to be polyploid, while HERB-1 was diploid. We speculate that the increased genetic diversity in polyploid lineages was important for the success of US-1 (and other modern strains).

Our conclusions are based on Illumina sequencing of 11 herbarium samples of infected potato and tomato leaves collected in Ireland, the UK, Continental Europe and North America and preserved in the herbaria of the Botanical State Collection Munich and the Kew Gardens in London. Both herbaria placed a great deal of confidence in our abilities and were very generous in providing the dried plants. The degree of DNA preservation in the herbarium samples was impressive, much higher than in other examples of ancient DNA, and the majority of recovered DNA was from the host plant, with some samples having in addition over 20% pathogen DNA. In contrast to recent studies of historic human pathogens, no target DNA enrichment was required. We compared the historic samples with modern strains from Europe, Africa and North and South America as well as two closely related Phytophthora species. Due to the 150-year-long period over which the individual samples had been collected, we were able to estimate with great confidence when the various P. infestans strains had emerged during evolutionary time. Here, too, we found connections with historic events: the first contact between Europeans and Americans in Mexico falls exactly into the time window in which the genetic diversity of P. infestans experienced a remarkable increase. Presumably, the social upheaval following the arrival of the Europeans somehow led to a spread of the pathogen at the beginning of the sixteenth century, which in turn accelerated its evolution.

The historical HERB-1 type is so far not known from modern collections, but we now have many diagnostic markers with which we can type the hundreds of modern isolates to determine whether a reservoir of HERB-1 still exists somewhere. In addition, our work highlights that herbaria constitute a rich, so far untapped source for investigating real-time evolution.

Detlef Weigel, weigel@weigelworld.org

Hernán A. Burbano, hernan.burbano@tuebingen.mpg.de

Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany

1. Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393-398.

2. Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, et al. (2010) Genome evolution following host jumps in the Irish potato famine pathogen lineage. Science 330: 1540-1543.

3. Ristaino JB, Groves CT, Parra GR (2001) PCR amplification of the Irish potato famine pathogen from historic specimens. Nature 411: 695-697.

The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

Kentaro Yoshida, Verena J. Schuenemann, Liliana M. Cano, Marina Pais, Bagdevi Mishra, Rahul Sharma, Christa Lanz, Frank N. Martin, Sophien Kamoun, Johannes Krause, Marco Thines, Detlef Weigel, Hernán A. Burbano
(Submitted on 17 May 2013)

Phytophthora infestans, the cause of potato late blight, is infamous for having triggered the Irish Great Famine in the 1840s. Until the late 1970s, P. infestans diversity outside of its Mexican center of origin was low, and one scenario held that a single strain, US-1, had dominated the global population for 150 years; this was later challenged based on DNA analysis of historical herbarium specimens. We have compared the genomes of 11 herbarium and 15 modern strains. We conclude that the nineteenth century epidemic was caused by a unique genotype, HERB-1, that persisted for over 50 years. HERB-1 is distinct from all examined modern strains, but it is a close relative of US-1, which replaced it outside of Mexico in the twentieth century. We propose that HERB-1 and US-1 emerged from a metapopulation that was established in the early 1800s outside of the species’ center of diversity.

SARS-CoV originated from bats in 1998 and may still exist in humans

Ailin Tao, Yuyi Huang, Peilu Li, Jun Liu, Nanshan Zhong, Chiyu Zhang
(Submitted on 13 May 2013)

SARS-CoV is believed to originate from civets and was thought to have been eliminated as a threat after the 2003 outbreak. Here, we show that human SARS-CoV (huSARS-CoV) originated directly from bats, rather than civets, by a cross-species jump in 1991, and formed a human-adapted strain in 1998. Since then huSARS-CoV has evolved further into highly virulent strains with genotype T and a 29-nt deletion mutation, and weakly virulent strains with genotype C but without the 29-nt deletion. The former can cause pneumonia in humans and could be the major causative pathogen of the SARS outbreak, whereas the latter might not cause pneumonia in humans, but evolved the ability to co-utilize civet ACE2 as an entry receptor, leading to interspecies transmission between humans and civets. Three crucial time points – 1991, for the cross-species jump from bats to humans; 1998, for the formation of the human-adapted SARS-CoV; and 2003, when there was an outbreak of SARS in humans – were found to associate with anomalously low annual precipitation and high temperatures in Guangdong. Anti-SARS-CoV sero-positivity was detected in 20% of all the samples tested from Guangzhou children who were born after 2005, suggesting that weakly virulent huSARS-CoVs might still exist in humans. These existing but undetected SARS-CoVs have a large potential to evolve into highly virulent strains when favorable climate conditions occur, highlighting a potential risk for the reemergence of SARS.

SICLE: A high-throughput tool for extracting evolutionary relationships from phylogenetic trees

Dan DeBlasio, Jennifer Wisecaver
(Submitted on 22 Mar 2013)

We present the phylogeny analysis software SICLE (Sister Clade Extractor), an easy-to-use, adaptable, and high-throughput tool to describe the nearest neighbors to a node of interest in a phylogenetic tree as well as the support value for the relationship. With SICLE it is possible to summarize the phylogenetic information produced by automated phylogenetic pipelines to rapidly identify and quantify the possible evolutionary relationships that merit further investigation. The program is a simple command line utility and is easy to adapt and implement in any phylogenetic pipeline. As a test case, we applied this new tool to published gene phylogenies to identify potential instances of horizontal gene transfer in Salinibacter ruber.
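The general idea of reporting the sister clade of a node together with the support for that relationship can be sketched as follows. This is not SICLE's code or interface, which the abstract does not describe; the tree encoding, the function names, and the taxon labels are all hypothetical.

```python
def leaf_names(node):
    """List the leaf names under a node; internal nodes are (support, children)."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [name for child in children for name in leaf_names(child)]

def sister_of(tree, target):
    """Return (sorted sister leaf names, parent support) for leaf `target`.

    Returns None when `target` is not in the tree.
    """
    if isinstance(tree, str):
        return None
    support, children = tree
    for child in children:
        if child == target:
            sisters = [name
                       for other in children if other is not child
                       for name in leaf_names(other)]
            return sorted(sisters), support
    for child in children:
        found = sister_of(child, target)
        if found is not None:
            return found
    return None

# Hypothetical gene tree: a query gene grouping with two archaeal sequences
# at support 95, inside a larger clade with one bacterial sequence.
gene_tree = (100, [(95, ["query_gene", (80, ["arch1", "arch2"])]), "bact1"])
print(sister_of(gene_tree, "query_gene"))
```

Running such a query over many gene trees and tallying which taxa appear as sisters is one way to flag candidate horizontal transfers for closer inspection.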

Inferring ancestral states without assuming neutrality or gradualism using a stable model of continuous character evolution

Michael G. Elliot, Arne O. Mooers
(Submitted on 20 Feb 2013)

The value of a continuous character evolving on a phylogenetic tree is commonly modelled as the location of a particle moving under one-dimensional Brownian motion with constant rate. The Brownian motion model is best suited to characters evolving under neutral drift or tracking an optimum that drifts neutrally. We present a generalization of the Brownian motion model which relaxes assumptions of neutrality and gradualism by considering increments to evolving characters to be drawn from a heavy-tailed stable distribution (of which the normal distribution is a specialized form). We describe Markov chain Monte Carlo methods for fitting the model to biological data paying special attention to ancestral state reconstruction, and study the performance of the model in comparison with a selection of existing comparative methods, using both simulated data and a database of body mass in 1,679 mammalian species. We discuss hypothesis testing and model selection. The new model is well suited to a stochastic process with a volatile rate of change in which biological characters undergo a mixture of neutral drift and occasional evolutionary events of large magnitude.
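To see what heavy-tailed stable increments look like next to Brownian motion, the sketch below draws symmetric stable variates with the Chambers-Mallows-Stuck transform and sums them along a path of branches. It is only a forward simulation under assumed names and parameters (`alpha`, `scale`, the branch lengths), not the Markov chain Monte Carlo fitting procedure of the paper. With `alpha = 2` the increments are normal (variance 2 at unit scale), and with `alpha = 1` they are Cauchy.

```python
import math
import random

def symmetric_stable(alpha, u, w):
    """Chambers-Mallows-Stuck transform for a standard symmetric stable variate.

    `u` should be uniform on (-pi/2, pi/2) and `w` exponential with mean 1.
    """
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

def stable_increment(alpha, scale, branch_length, rng):
    """Increment of a symmetric stable process over one branch."""
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = 0.0
    while w == 0.0:  # guard against a zero draw before dividing by w
        w = rng.expovariate(1.0)
    # For a stable process the scale grows as branch_length ** (1 / alpha).
    return scale * branch_length ** (1.0 / alpha) * symmetric_stable(alpha, u, w)

def simulate_tip(root_value, branch_lengths, alpha, scale, rng):
    """Character value at a tip after evolving along a path of branches."""
    value = root_value
    for t in branch_lengths:
        value += stable_increment(alpha, scale, t, rng)
    return value

rng = random.Random(1)
path = [0.5, 1.0, 0.25]  # hypothetical root-to-tip branch lengths
print(simulate_tip(0.0, path, alpha=2.0, scale=1.0, rng=rng))
print(simulate_tip(0.0, path, alpha=1.5, scale=1.0, rng=rng))
```

Lowering `alpha` below 2 makes occasional very large jumps more likely, which is the behaviour the abstract describes as a mixture of neutral drift and rare events of large magnitude.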