Maximum Likelihood Estimation and Phylogenetic Tree based Backward Elimination for reconstructing Viral Haplotypes in a Population

Raunaq Malhotra, Steven Wu, Allen Rodrigo, Mary Poss, Raj Acharya
(Submitted on 14 Feb 2015)

A viral population can contain a large and diverse collection of viral haplotypes, which play important roles in maintaining the viral population. We present an algorithm for reconstructing viral haplotypes in a population from paired-end Next Generation Sequencing (NGS) data. We propose a novel polynomial-time, dynamic-programming-based approximation algorithm for generating top paths through each node in a De Bruijn graph constructed from the paired-end NGS data. We also propose two novel formulations for obtaining an optimal set of viral haplotypes for the population using the paths generated by the approximation algorithm. The first formulation obtains a maximum likelihood estimate of the viral population given the observed paired-end reads. The second formulation obtains a minimal set of viral haplotypes retaining the phylogenetic information in the population. We evaluate our algorithm on simulated datasets that vary in mutation rate and genome length of the viral haplotypes. The results of our method are compared to other methods for viral haplotype estimation. While all the methods overestimate the number of viral haplotypes in a population, the two proposed optimality formulations correctly estimate the exact sequence of all the haplotypes in most datasets, and recover the overall diversity of the population in all datasets. The haplotypes recovered by popular methods are biased toward the reference sequence used for mapping of reads, while the proposed formulations are reference-free and retain the overall diversity in the population.
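For readers unfamiliar with the data structure the abstract builds on, here is a minimal sketch of constructing a De Bruijn graph from reads. It covers only the graph construction, not the authors' path-generation algorithm or their handling of paired-end information; the function name `build_de_bruijn`, the toy reads, and the choice of k are illustrative assumptions.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a De Bruijn graph from a list of reads.

    Nodes are (k-1)-mers. Each k-mer in a read adds an edge from its
    prefix (k-1)-mer to its suffix (k-1)-mer; the edge value counts how
    many times that k-mer was observed across all reads.
    (Hypothetical helper, not taken from the paper.)
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {node: dict(successors) for node, successors in graph.items()}


# Toy example: two overlapping reads (assumed data, for illustration only).
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Edge counts of this kind are one natural source of weights for ranking paths through the graph, although the abstract does not say how the authors weight or score paths.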

Selection constrains phenotypic evolution in a functionally important plant trait

Christopher D Muir
doi: http://dx.doi.org/10.1101/015172

A long-standing idea is that the macroevolutionary adaptive landscape, a 'map' of phenotype to fitness, constrains evolution because certain phenotypes are fit, while others are universally unfit. Such constraints should be evident in traits that, across many species, cluster around particular modal values, with few intermediates between modes. Here, I compile a new global database of 599 species from 94 plant families showing that stomatal ratio, an important functional trait affecting photosynthesis, is multimodal, hinting at distinct peaks in the adaptive landscape. The dataset confirms that most plants have all their stomata on the lower leaf surface (hypostomy), but shows for the first time that species with roughly half their stomata on each leaf surface (amphistomy) form a distinct mode in the trait distribution. Based on a new evolutionary process model, this multimodal pattern is unlikely without constraint. Further, multimodality has evolved repeatedly across disparate families, evincing long-term constraint on the adaptive landscape. A simple cost-benefit model of stomatal ratio demonstrates that selection alone is sufficient to generate an adaptive landscape with multiple peaks. Finally, phylogenetic comparative methods indicate that life history evolution drives shifts between peaks. This implies that the adaptive benefit conferred by amphistomy (increased photosynthesis) is most important in plants with fast life histories, challenging existing ideas that amphistomy is an adaptation to thick leaves and open habitats. I conclude that peaks in the adaptive landscape have been constrained by selection over much of land plant evolution, leading to predictable, repeatable patterns of evolution.

Locating a Tree in a Phylogenetic Network in Quadratic Time

Philippe Gambette, Andreas D. M. Gunawan, Anthony Labarre, Stéphane Vialette, Louxin Zhang
(Submitted on 11 Feb 2015)

A fundamental problem in the study of phylogenetic networks is to determine whether or not a given phylogenetic network contains a given phylogenetic tree. We develop a quadratic-time algorithm for this problem for binary nearly-stable phylogenetic networks. We also show that the number of reticulations in a reticulation-visible or nearly-stable phylogenetic network is bounded from above by a function linear in the number of taxa.

Bayesian priors for tree calibration: Evaluating two new approaches based on fossil intervals

Ryan W Norris, Cory L Strope, David M McCandlish, Arlin Stoltzfus
doi: http://dx.doi.org/10.1101/014340

Background: Studies of diversification and trait evolution increasingly rely on combining molecular sequences and fossil dates to infer time-calibrated phylogenetic trees. Available calibration software provides many options for the shape of the prior probability distribution of ages at a node to be calibrated, but the question of how to assign a Bayesian prior from limited fossil data remains open. Results: We introduce two new methods for generating priors based upon (1) the interval between the two oldest fossils in a clade, i.e., the penultimate gap (PenG), and (2) the ghost lineage length (GLin), defined as the difference between the oldest fossils for each of two sister lineages. We show that PenG and GLin/2 are point estimates of the interval between the oldest fossil and the true age for the node. Furthermore, given either of these quantities, we derive a principled prior distribution for the true age. This prior is log-logistic, and can be implemented approximately in existing software. Using simulated data, we test these new methods against some other approaches. Conclusions: When implemented as approaches for assigning Bayesian priors, the PenG and GLin methods increase the accuracy of inferred divergence times, showing considerably more precision than the other methods tested, without significantly greater bias. When implemented as approaches to post-hoc scaling of a tree by linear regression, the PenG and GLin methods exhibit less bias than other methods tested. The new methods are simple to use and can be applied to a variety of studies that call for calibrated trees.
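The two quantities defined in the abstract are simple enough to compute by hand, and a small sketch may make the definitions concrete. The code below assumes fossil ages are given in millions of years before present; the function names and example ages are hypothetical, and treating the older of the two sister-lineage fossils as "the oldest fossil" in the GLin case is an interpretive assumption. It computes point estimates only and does not implement the log-logistic prior the authors derive.

```python
def penultimate_gap(fossil_ages):
    """PenG: interval between the two oldest fossils in a clade."""
    if len(fossil_ages) < 2:
        raise ValueError("PenG needs at least two fossil ages")
    oldest, second_oldest = sorted(fossil_ages, reverse=True)[:2]
    return oldest - second_oldest


def ghost_lineage_length(oldest_a, oldest_b):
    """GLin: difference between the oldest fossils of two sister lineages."""
    return abs(oldest_a - oldest_b)


def node_age_from_peng(fossil_ages):
    """Point estimate of node age: oldest fossil plus PenG."""
    return max(fossil_ages) + penultimate_gap(fossil_ages)


def node_age_from_glin(oldest_a, oldest_b):
    """Point estimate of node age: older sister fossil plus GLin / 2.

    Using the older of the two fossils as the baseline is an assumption
    made for this sketch.
    """
    return max(oldest_a, oldest_b) + ghost_lineage_length(oldest_a, oldest_b) / 2


# Illustrative ages in Myr before present (made-up numbers).
clade_fossils = [50.0, 42.0, 30.0]
print(penultimate_gap(clade_fossils))      # gap between 50.0 and 42.0
print(node_age_from_peng(clade_fossils))   # 50.0 plus that gap
print(node_age_from_glin(50.0, 44.0))      # 50.0 plus half of the 6.0 Myr ghost lineage
```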

Estimating phylogenetic trees from genome-scale data

Liang Liu, Zhenxiang Xi, Shaoyuan Wu, Charles Davis, Scott V. Edwards
Comments: 39 pages, 3 figures
Subjects: Populations and Evolution (q-bio.PE)

As researchers collect increasingly large molecular data sets to reconstruct the Tree of Life, the heterogeneity of signals in the genomes of diverse organisms poses challenges for traditional phylogenetic analysis. A class of phylogenetic methods known as “species tree methods” has been proposed to directly address one important source of gene tree heterogeneity, namely the incomplete lineage sorting or deep coalescence that occurs when evolving lineages radiate rapidly, resulting in a diversity of gene trees from a single underlying species tree. Although such methods are gaining in popularity, they are being adopted with caution in some quarters, in part because of an increasing number of examples of strong phylogenetic conflict between concatenation or supermatrix methods and species tree methods. Here we review theory and empirical examples that help clarify these conflicts. Thinking of concatenation as a special case of the more general model provided by the multispecies coalescent can help explain a number of differences in the behavior of the two methods on phylogenomic data sets. Recent work suggests that species tree methods are more robust than concatenation approaches to some of the classic challenges of phylogenetic analysis, including rapidly evolving sites in DNA sequences, base compositional heterogeneity and long branch attraction. We show that approaches such as binning, designed to augment the signal in species tree analyses, can distort the distribution of gene trees and are inconsistent. Computationally efficient species tree methods that incorporate biological realism are key to phylogenetic analysis of whole genome data.

Geographic range size is predicted by plant mating system

Dena Grossenbacher, Ryan Briscoe Runquist, Emma Goldberg, Yaniv Brandvain
doi: http://dx.doi.org/10.1101/013417

Species ranges vary enormously, and even closest relatives may differ in range size by several orders of magnitude. With data from hundreds of species spanning 20 genera and generic sections, we show that plant species that autonomously reproduce via self-pollination consistently have larger geographic ranges than their close relatives that generally require two parents for reproduction. Further analyses strongly implicate autonomous fertilization in causing this relationship, as it is not driven by traits such as polyploidy or annual life history whose evolution is sometimes correlated with the transition to autonomous self-fertilization. Furthermore, we find that selfers occur at higher maximum latitudes and that disparity in range size between selfers and outcrossers increases with time since their separation. Together, these results show that autonomous reproduction – a critical biological trait that eliminates mate limitation and thus potentially increases the probability of establishment – increases range size.

DensiTree 2: Seeing Trees Through the Forest

Remco Bouckaert, Joseph Heled
doi: http://dx.doi.org/10.1101/012401

Motivation: Phylogenetic analyses such as Bayesian MCMC or bootstrapping result in a collection of trees. Trees are discrete objects, and it is generally difficult to get a mental grip on a distribution over trees. Visualisation tools like DensiTree can give good intuition on tree distributions. DensiTree works by drawing all trees in the set transparently, thus highlighting areas where the trees in the set agree. In this way, both uncertainty in clade heights and uncertainty in topology can be visualised. In our experience, a vanilla DensiTree can turn out to be misleading in that it shows too much uncertainty due to wrongly ordering taxa or due to unlucky placement of internal nodes. Results: DensiTree is extended to allow visualisation of meta-data associated with branches, such as population size and evolutionary rates. Furthermore, geographic locations of taxa can be shown on a map, making it easy to check visually whether there is some geographic pattern in a phylogeny. Taxa orderings have a large impact on the layout of the tree set, and advances have been made in finding better orderings, resulting in significantly more informative visualisations. We also explored various methods for positioning internal nodes, which can improve the quality of the image. Together, these advances make it easier to comprehend distributions over trees. Availability: DensiTree is freely available from http://compevol.auckland.ac.nz/software/.