A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

M. Cyrus Maher, Ryan D. Hernandez
(Submitted on 9 Sep 2013)

Ortholog detection (OD) is a critical step for comparative genomic analysis of protein-coding sequences. There is a range of methods available for OD. However, relative performance varies by application, stymying attempts to identify a single best method. In this paper, we present a novel tool, MOSAIC, which is capable of integrating the entire swath of OD methods. We analyze the results of applying MOSAIC over four methodologically diverse OD methods. Relative to component and competing methods, we demonstrate large gains in the number of detected orthologs while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.

Predicting the ancestral character changes in a tree is typically easier than predicting the root state

Olivier Gascuel, Mike Steel
(Submitted on 4 Sep 2013)

Predicting the ancestral sequences of a group of homologous sequences related by a phylogenetic tree has been the subject of many studies, and numerous methods have been proposed for this purpose. Theoretical results are available that show that when the mutation rate becomes too large, reconstructing the ancestral state at the tree root is no longer feasible. Here, we also study the reconstruction of the ancestral changes that occurred along the tree edges. We show that, depending on the tree and branch length distribution, reconstructing these changes (i.e. reconstructing the ancestral state of all internal nodes in the tree) may be easier or harder than reconstructing the ancestral root state. However, results from information theory indicate that for the standard Yule tree, the task of reconstructing internal node states remains feasible, even for very high substitution rates. Moreover, computer simulations demonstrate that for more complex trees and scenarios, this result still holds. For a large variety of counting, parsimony-based and likelihood-based methods, the predictive accuracy of a randomly selected internal node in the tree is indeed much higher than the accuracy of the same method when applied to the tree root. Moreover, parsimony- and likelihood-based methods appear to be remarkably robust to sampling bias and model mis-specification.

TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees

Dongying Wu, Ladan Doroud, Jonathan A. Eisen
(Submitted on 28 Aug 2013)

Our current understanding of the taxonomic and phylogenetic diversity of cellular organisms, especially the bacteria and archaea, is mostly based upon studies of sequences of the small-subunit rRNAs (ssu-rRNAs). To address the limitations of ssu-rRNA as a phylogenetic marker, such as copy number variation among organisms and complications introduced by horizontal gene transfer, convergent evolution, or evolution rate variations, we have identified protein-coding gene families as alternative Phylogenetic and Phylogenetic Ecology markers (PhyEco). Current nucleotide sequence similarity based Operational Taxonomic Unit (OTU) classification methods are not readily applicable to amino acid sequences of PhyEco markers. We report here the development of TreeOTU, a phylogenetic tree structure based OTU classification method that takes into account differences in rates of evolution between taxa and between genes. OTU sets built by TreeOTU are more faithful to phylogenetic tree structures than sequence clustering (non-phylogenetic) methods for ssu-rRNAs. OTUs built from phylogenetic trees of protein-coding PhyEco markers are comparable to our current taxonomic classification at different levels. With the included OTU comparison tools, TreeOTU is robust in phylogenetic referencing with different phylogenetic markers and trees.

Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales

Brian Tilston Smith, Michael G. Harvey, Brant C. Faircloth, Travis C. Glenn, Robb T. Brumfield
(Submitted on 24 Aug 2013)

Comparative genetic studies of non-model organisms are transforming rapidly due to major advances in sequencing technology. A limiting factor in these studies has been the identification and screening of orthologous loci across an evolutionarily distant set of taxa. Here, we evaluate the efficacy of genomic markers targeting ultraconserved DNA elements (UCEs) for analyses at shallow evolutionary timescales. Using sequence capture and massively parallel sequencing to generate UCE data for five co-distributed Neotropical rainforest bird species, we recovered 776-1,516 UCE loci across the five species. Across species, 53-77 percent of the loci were polymorphic, containing between 2.0 and 3.2 variable sites per polymorphic locus, on average. We performed species tree construction, coalescent modeling, and species delimitation, and we found that the five co-distributed species exhibited discordant phylogeographic histories. We also found that species trees and divergence times estimated from UCEs were similar to those obtained from mtDNA. The species that inhabit the understory had older divergence times across barriers, contained a higher number of cryptic species, and exhibited larger effective population sizes relative to species inhabiting the canopy. Because orthologous UCEs can be obtained from a wide array of taxa, are polymorphic at shallow evolutionary time scales, and can be generated rapidly at low cost, they are effective genetic markers for studies investigating evolutionary patterns and processes at shallow time scales.

A network approach to analyzing highly recombinant malaria parasite genes

Daniel B. Larremore, Aaron Clauset, Caroline O. Buckee
(Submitted on 23 Aug 2013)

The var genes of the human malaria parasite Plasmodium falciparum present a challenge to population geneticists due to their extreme diversity, which is generated by high rates of recombination. These genes encode a primary antigen protein called PfEMP1, which is expressed on the surface of infected red blood cells and elicits protective immune responses. Var gene sequences are characterized by pronounced mosaicism, precluding the use of traditional phylogenetic tools that require bifurcating tree-like evolutionary relationships. We present a new method that identifies highly variable regions (HVRs), and then maps each HVR to a complex network in which each sequence is a node and two nodes are linked if they share an exact match of significant length. Here, networks of var genes that recombine freely are expected to have a uniformly random structure, but constraints on recombination will produce network communities that we identify using a stochastic block model. We validate this method on synthetic data, showing that it correctly recovers populations of constrained recombination, before applying it to the Duffy Binding Like-α (DBLα) domain of var genes. We find nine HVRs whose network communities map in distinctive ways to known DBLα classifications and clinical phenotypes. We show that the recombinational constraints of some HVRs are correlated, while others are independent. These findings suggest that this micromodular structuring facilitates independent evolutionary trajectories of neighboring mosaic regions, allowing the parasite to retain protein function while generating enormous sequence diversity. Our approach therefore offers a rigorous method for analyzing evolutionary constraints in var genes, and is also flexible enough to be easily applied more generally to any highly recombinant sequences.
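
The network construction described in this abstract (sequences as nodes, an edge when two sequences share an exact match of sufficient length) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the fixed `min_len` threshold standing in for "significant length", and the toy sequences are all assumptions made for illustration, and the stochastic block model step is not covered.

```python
from itertools import combinations


def shared_block(a: str, b: str, min_len: int) -> bool:
    """Return True if a and b share an exact substring of at least min_len characters.

    Two strings share a substring of length >= min_len exactly when they
    share one of length == min_len, so only that length is checked.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    kmers = {a[i:i + min_len] for i in range(len(a) - min_len + 1)}
    return any(b[j:j + min_len] in kmers for j in range(len(b) - min_len + 1))


def build_network(seqs: dict[str, str], min_len: int) -> set[frozenset[str]]:
    """Build an undirected edge set over sequence names.

    min_len is a hypothetical stand-in for the "significant length"
    criterion; the abstract does not say how that length is chosen.
    """
    edges = set()
    for u, v in combinations(sorted(seqs), 2):
        if shared_block(seqs[u], seqs[v], min_len):
            edges.add(frozenset((u, v)))
    return edges


# Toy sequences (made up): s1 and s2 share the block "DEFGH", s3 shares nothing.
toy = {"s1": "ACDEFGHIK", "s2": "WWDEFGHWW", "s3": "MNPQRSTVY"}
network = build_network(toy, min_len=5)
```

Community detection on the resulting edge set would then be a separate step.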

Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR model

Denise Kühnert, Tanja Stadler, Timothy G. Vaughan, Alexei J. Drummond
(Submitted on 23 Aug 2013)

Evolution of RNA viruses such as HIV, Hepatitis C and Influenza virus occurs so rapidly that the viruses’ genomes contain information on past ecological dynamics. The interaction of ecological and evolutionary processes demands their joint analysis. Here we adapt a birth-death-sampling model, which allows for serially sampled data and rate changes over time, to estimate epidemiological parameters of the underlying population dynamics in terms of a compartmental susceptible-infected-removed (SIR) model. Our proposed approach results in a phylodynamic method that enables the joint estimation of epidemiological parameters and phylogenetic history. In contrast to standard coalescent process approaches, this method provides separate information on incidence and prevalence of infections. Detailed information on the interaction of host population dynamics and evolutionary history can inform decisions on how to contain or entirely avoid disease outbreaks.
We apply our Birth-Death SIR method (BDSIR) to five human immunodeficiency virus type 1 clusters sampled in the United Kingdom (UK) between 1999 and 2003. The estimated basic reproduction ratio ranges from 1.9 to 3.2 among the clusters. Our results imply that these local epidemics arose from introduction of infected individuals into populations of between 900 and 3000 effectively susceptible individuals, albeit with wide margins of uncertainty. All clusters show a decline in the growth rate of the local epidemic in the middle or end of the 1990s. The effective reproduction ratio of cluster 1 drops below one around 1994, with the local epidemic having almost run its course by the end of the sampled period. For the other four clusters the effective reproduction ratio also decreases over time, but stays above 1. The method is implemented as a BEAST2 package.
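
To illustrate why the effective reproduction ratio falls as susceptibles are depleted, here is a minimal Python sketch of a deterministic closed SIR model integrated with Euler steps. It is not the stochastic birth-death SIR model of the paper and not the BEAST2 package: the function name, the relation R_e(t) = R_0 · S(t)/N for a closed population, and all parameter values are assumptions chosen for illustration.

```python
def simulate_sir(r0: float, gamma: float, n: float, i0: float,
                 dt: float, steps: int) -> list[float]:
    """Integrate a deterministic closed SIR model with Euler steps.

    Returns the effective reproduction ratio R_e = r0 * S / n at each step.
    r0    : basic reproduction ratio (beta / gamma)
    gamma : removal rate per unit time
    n     : total (effectively susceptible) population size
    i0    : initial number of infected individuals
    """
    beta = r0 * gamma            # transmission rate implied by r0
    s, i = n - i0, i0
    re_series = []
    for _ in range(steps):
        re_series.append(r0 * s / n)
        new_infections = beta * s * i / n * dt
        new_removals = gamma * i * dt
        s -= new_infections
        i += new_infections - new_removals
    return re_series


# Illustrative values only (not estimates from the paper).
re_series = simulate_sir(r0=2.5, gamma=0.5, n=1000.0, i0=1.0, dt=0.01, steps=5000)
```

In this toy run the ratio starts just under the basic reproduction ratio and declines monotonically, crossing one once enough of the population has been infected.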

Macro-evolutionary models and coalescent point processes: The shape and probability of reconstructed phylogenies

Amaury Lambert, Tanja Stadler
(Submitted on 6 Aug 2013)

Forward-time models of diversification (i.e., speciation and extinction) produce phylogenetic trees that grow “vertically” as time goes by. Pruning the extinct lineages out of such trees leads to natural models for reconstructed trees (i.e., phylogenies of extant species). Alternatively, reconstructed trees can be modelled by coalescent point processes (CPP), where trees grow “horizontally” by the sequential addition of vertical edges. Each new edge starts at some random speciation time and ends at the present time; speciation times are drawn from the same distribution independently. CPP lead to extremely fast computation of tree likelihoods and simulation of reconstructed trees. Their topology always follows the uniform distribution on ranked tree shapes (URT). We characterize which forward-time models lead to URT reconstructed trees and among these, which lead to CPP reconstructed trees. We show that for any “asymmetric” diversification model in which speciation rates only depend on time and extinction rates only depend on time and on a non-heritable trait (e.g., age), the reconstructed tree is CPP, even if extant species are incompletely sampled. If rates additionally depend on the number of species, the reconstructed tree is (only) URT (but not CPP). We characterize the common distribution of speciation times in the CPP description, and discuss incomplete species sampling as well as three special model cases in detail: 1) extinction rate does not depend on a trait; 2) rates do not depend on time; 3) mass extinctions may happen additionally at certain points in the past.
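
The "horizontal" growth of a coalescent point process can be sketched in a few lines of Python: each new tip is attached by an independently drawn node depth, and the divergence time between two tips is the largest depth lying between them. This is a minimal sketch under stated assumptions: the function names are hypothetical, the number of tips is fixed in advance, and the exponential depth distribution is only a placeholder, since the actual distribution depends on the diversification model.

```python
import random
from typing import Callable


def sample_cpp_depths(n_tips: int,
                      draw_depth: Callable[[random.Random], float],
                      rng: random.Random) -> list[float]:
    """Draw the n_tips - 1 i.i.d. node depths that define a CPP tree.

    depths[k] is the depth of the node separating tip k from tip k + 1.
    """
    return [draw_depth(rng) for _ in range(n_tips - 1)]


def divergence_time(depths: list[float], i: int, j: int) -> float:
    """Divergence time between tips i and j: the maximum depth between them."""
    if i == j:
        return 0.0
    lo, hi = sorted((i, j))
    return max(depths[lo:hi])


# Placeholder depth distribution (an assumption, not a model from the paper).
rng = random.Random(0)
depths = sample_cpp_depths(5, lambda r: r.expovariate(1.0), rng)
```

Because only i.i.d. draws and running maxima are involved, simulating a reconstructed tree this way takes time linear in the number of tips.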

Characterizing Compatibility and Agreement of Unrooted Trees via Cuts in Graphs

Sudheer Vakati, David Fernández-Baca
(Submitted on 30 Jul 2013)

Deciding whether there is a single tree, a supertree, that summarizes the evolutionary information in a collection of unrooted trees is a fundamental problem in phylogenetics. We consider two versions of this question: agreement and compatibility. In the first, the supertree is required to reflect precisely the relationships among the species exhibited by the input trees. In the second, the supertree can be more refined than the input trees.
Tree compatibility can be characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. Alternatively, it can be characterized as a chordal graph sandwich problem in a structure known as the edge label intersection graph. Here, we show that the latter characterization yields a natural characterization of compatibility in terms of minimal cuts in the display graph, which is closely related to compatibility of splits. We then derive a characterization for agreement.
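
For background on the compatibility of splits mentioned above, the following Python sketch checks the classical pairwise condition for two splits of the same taxon set: A|B and C|D are compatible when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. It does not implement the display-graph or minimal-cut characterization of the paper, and it assumes both splits cover the same taxa, which is not the general supertree setting. The function name is hypothetical.

```python
Split = tuple[frozenset[str], frozenset[str]]


def splits_compatible(split1: Split, split2: Split) -> bool:
    """Return True if two splits of the same taxon set are compatible.

    Compatible means at least one of the four pairwise intersections
    between the sides of the two splits is empty.
    """
    a, b = split1
    c, d = split2
    return any(not (x & y) for x in (a, b) for y in (c, d))


# Toy splits on the taxon set {a, b, c, d}.
ab_cd = (frozenset("ab"), frozenset("cd"))
ac_bd = (frozenset("ac"), frozenset("bd"))
a_bcd = (frozenset("a"), frozenset("bcd"))
```

Here `ab_cd` and `ac_bd` conflict, while `ab_cd` and `a_bcd` can be displayed by a single tree.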

Agalma: an automated phylogenomics workflow

Casey W. Dunn, Mark Howison, Felipe Zapata
(Submitted on 24 Jul 2013)

In the past decade, transcriptome data have become an important component of many phylogenetic studies. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies. We present Agalma, an automated tool that conducts phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, and enables rich HTML reports for all stages of the analysis. Agalma includes a test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at this https URL. Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended.

An Arrow-type result for inferring a species tree from gene trees

Mike Steel
(Submitted on 19 Jul 2013)

The reconstruction of a central tendency 'species tree' from a large number of conflicting gene trees is a central problem in systematic biology. Moreover, it becomes particularly problematic when taxon coverage is patchy, so that not all taxa are present in every gene tree. Here, we list four desirable properties that a method for estimating a species tree from gene trees should have. We show that while these can be achieved when taxon coverage is complete (by the Adams consensus method), they cannot all be satisfied in the more general setting of partial taxon coverage.