The incorporation of ecological processes into models of trait evolution is important for understanding past drivers of evolutionary change. Species interactions have long been thought to be key drivers of trait evolution, yet models for comparative data that account for interactions between species are lacking. One of the challenges is that such models are intractable and difficult to express analytically. Here we present a phylogenetic model of trait evolution that incorporates interspecific competition. Competition is modelled as a tendency of sympatric species to evolve towards distinct niches, producing trait overdispersion and high phylogenetic signal. The model predicts elevated trait variance across species and a slowdown in evolutionary rate both across the clade and within each branch. It also predicts a reduction in correlation between otherwise correlated traits. We used an Approximate Bayesian Computation (ABC) approach to estimate model parameters. Using simulations, we tested the power of the model to detect deviations from Brownian trait evolution and found reasonable power to detect competition in sufficiently large (20+ species) trees. We applied the model to the evolution of bill morphology in Darwin’s finches and found evidence that competition affects the evolution of bill length.
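As a rough, hypothetical illustration of the kind of process the abstract describes (this is not the authors' model or code; the repulsion term and parameter names are illustrative assumptions), a Brownian trait walk with a push away from sympatric competitors might look like this:

```python
# Minimal sketch of Brownian trait evolution with competitive repulsion.
# NOT the authors' implementation; sigma, alpha and the sign-based repulsion
# term are illustrative assumptions.
import numpy as np

def simulate_competition(n_lineages=5, n_steps=1000, dt=0.01,
                         sigma=1.0, alpha=0.5, seed=None):
    """Each lineage follows Brownian motion plus a push away from the
    trait values of the other (sympatric) lineages."""
    rng = np.random.default_rng(seed)
    traits = np.zeros(n_lineages)
    for _ in range(n_steps):
        diffs = traits[:, None] - traits[None, :]        # pairwise trait differences
        repulsion = alpha * np.sign(diffs).mean(axis=1)  # move away from competitors
        noise = rng.normal(0.0, sigma * np.sqrt(dt), n_lineages)
        traits = traits + repulsion * dt + noise
    return traits

# With alpha = 0 this reduces to ordinary Brownian motion; increasing alpha
# overdisperses the trait values relative to the Brownian expectation.
print(simulate_competition(seed=1))
```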
A New Statistical Framework for Genetic Pleiotropic Analysis of High Dimensional Phenotype Data
Panpan Wang, Mohammad Rahman, Li Jin, Momiao Xiong
(Submitted on 3 Dec 2015)
Widely used methods for genetic pleiotropic analysis of multiple phenotypes are typically designed to examine the relationship between common variants and a small number of phenotypes; they are not suited to high dimensional phenotype and high dimensional genotype (next-generation sequencing) data. To overcome these limitations, we develop sparse structural equation models (SEMs) as a general framework for a new paradigm of genetic analysis of multiple phenotypes. To incorporate both common and rare variants into the analysis, we extend traditional multivariate SEMs to sparse functional SEMs. To deal with high dimensional phenotype and genotype data, we employ functional data analysis and the alternating direction method of multipliers (ADMM) to reduce data dimension and improve computational efficiency. Using large scale simulations, we show that the proposed methods have higher power to detect true causal genetic pleiotropic structure than existing methods. Simulations also demonstrate that gene-based pleiotropic analysis has higher power than single variant-based pleiotropic analysis. Applied to exome sequence data from the NHLBI Exome Sequencing Project (ESP) with 11 phenotypes, the proposed method identifies a network of 137 genes connected to the 11 phenotypes by 341 edges. Among them, 114 genes showed pleiotropic genetic effects, and 45 genes have been reported in the literature to be associated with phenotypes in the analysis or with other cardiovascular disease (CVD) related phenotypes.
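For readers unfamiliar with ADMM, the sketch below shows the alternating-direction idea on a plain lasso problem; it is a generic illustration of the optimisation technique named in the abstract, not the authors' sparse functional SEM estimator.

```python
# Generic ADMM illustration on a lasso problem:
#   minimise 0.5*||A x - b||^2 + lam*||x||_1.
# NOT the authors' sparse functional SEM method.
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, n_iter=200):
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))       # factor once, reuse each iteration
    for _ in range(n_iter):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # smooth quadratic subproblem
        z = soft_threshold(x + u, lam / rho)               # sparsity-inducing subproblem
        u = u + x - z                                      # scaled dual update
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
beta = np.zeros(20); beta[:3] = 2.0                        # sparse ground truth
b = A @ beta + 0.1 * rng.normal(size=50)
print(np.round(admm_lasso(A, b, lam=2.0), 2))
```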
Bayesian non-parametric inference for Λ-coalescents: consistency and a parametric method
Jere Koskela, Paul A. Jenkins, Dario Spanò
(Submitted on 3 Dec 2015)
We investigate Bayesian non-parametric inference for Λ-coalescent processes parametrised by probability measures on the unit interval, and provide an implementable, provably consistent MCMC inference algorithm. We give verifiable criteria on the prior for posterior consistency when observations form a time series, and prove that any non-trivial prior is inconsistent when all observations are contemporaneous. We then show that the likelihood given a data set of size n∈ℕ is constant across Λ-measures whose leading n−2 moments agree, and focus on inferring truncated sequences of moments. We provide a large class of functionals which can be extremised using finite computation given a credibility region of posterior truncated moment sequences, and a pseudo-marginal Metropolis-Hastings algorithm for sampling the posterior. Finally, we compare the efficiency of the exact and noisy pseudo-marginal algorithms with and without delayed acceptance acceleration using a simulation study.
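The pseudo-marginal Metropolis-Hastings idea mentioned in the abstract can be summarised in a few lines: replace the exact likelihood with a non-negative unbiased estimate and the chain still targets the correct posterior. The sketch below is a generic illustration of that idea, not the preprint's algorithm or model; the user-supplied `log_lik_estimate` and `log_prior` functions are assumptions.

```python
# Generic pseudo-marginal Metropolis-Hastings sketch (not the preprint's sampler).
import numpy as np

def pseudo_marginal_mh(log_lik_estimate, log_prior, theta0,
                       n_iter=5000, step=0.1, seed=None):
    """log_lik_estimate(theta, rng) must return the log of a non-negative,
    unbiased estimate of the likelihood at theta."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    ll = log_lik_estimate(theta, rng)
    samples = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        proposal = theta + step * rng.normal(size=theta.size)
        ll_prop = log_lik_estimate(proposal, rng)
        log_accept = (ll_prop + log_prior(proposal)) - (ll + log_prior(theta))
        if np.log(rng.uniform()) < log_accept:
            theta, ll = proposal, ll_prop   # the noisy estimate is kept with the state
        samples[i] = theta
    return samples
```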
Efficient recycled algorithms for quantitative trait models on phylogenies
Gordon Hiscott, Colin Fox, Matthew Parry, David Bryant
(Submitted on 2 Dec 2015)
We present an efficient and flexible method for computing likelihoods of phenotypic traits on a phylogeny. The method does not resort to Monte Carlo computation but instead blends Felsenstein’s discrete character pruning algorithm with methods for numerical quadrature. It is not limited to Gaussian models and adapts readily to model uncertainty in the observed trait values. We demonstrate the framework by developing efficient algorithms for likelihood calculation and ancestral state reconstruction under Wright’s threshold model, applying our methods to a dataset of trait data for extrafloral nectaries (EFNs) across a phylogeny of 839 Fabales species.
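As a toy illustration of the pruning-plus-quadrature idea (not the paper's algorithm; the tree encoding, grid, and Brownian liability transition here are assumptions), partial likelihoods of the latent liability can be held on a grid at each node and branch integrals evaluated numerically:

```python
# Toy sketch of blending Felsenstein pruning with numerical quadrature under a
# threshold model. Not the paper's algorithm.
import numpy as np
from scipy.stats import norm

GRID = np.linspace(-6, 6, 201)
W = np.full(GRID.size, GRID[1] - GRID[0])   # rectangle-rule quadrature weights
W[0] = W[-1] = W[0] / 2                     # trapezoid correction at the ends

def branch_kernel(t, sigma2=1.0):
    """K[i, j] = density of child liability GRID[j] given parent liability GRID[i]."""
    return norm.pdf(GRID[None, :], loc=GRID[:, None], scale=np.sqrt(sigma2 * t))

def partial_likelihood(node):
    """P(observed tip states below `node` | node's liability = GRID)."""
    if "state" in node:                      # tip: trait is 1 iff liability > 0
        return (GRID > 0).astype(float) if node["state"] == 1 else (GRID <= 0).astype(float)
    L = np.ones_like(GRID)
    for child, t in node["children"]:
        L = L * (branch_kernel(t) @ (W * partial_likelihood(child)))  # numerical branch integral
    return L

# Two-tip example: (A:1.0, B:1.0); root liability ~ N(0, 1).
tree = {"children": [({"state": 1}, 1.0), ({"state": 0}, 1.0)]}
root_prior = norm.pdf(GRID, 0.0, 1.0)
print(np.sum(W * root_prior * partial_likelihood(tree)))   # likelihood of the data
```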
Author post: Efficient coalescent simulation and genealogical analysis for large sample sizes
This guest post is by Jerome Kelleher, on the preprint by Kelleher, Etheridge, and McVean titled “Efficient coalescent simulation and genealogical analysis for large sample sizes”, available here from bioRxiv.
In this post we summarise the main results of our recent bioRxiv preprint. We’ve left out a lot of important details here, but hopefully this summary will be enough to convince you that it’s worth reading the paper!
Coalescent simulation is a fundamental tool in modern population genetics, and a large number of packages exist to simulate various aspects of the model. The basic algorithm to simulate the coalescent with recombination was defined by Hudson in 1983, who also published the classical ms simulation program in 2002. Programs such as ms that are based on Hudson’s algorithm perform poorly for longer sequences, making simulation of chromosome-sized regions under the influence of recombination impossible. The Sequentially Markov Coalescent (SMC) approximates the coalescent with recombination by assuming that each marginal genealogy depends only on its predecessor, making simulation much more efficient. However, the SMC can be a poor approximation when long-range linkage information is important, and current simulators do not scale well with sample size. Population-scale sequencing projects currently under way mean there is an urgent need for accurate simulations of hundreds of thousands of genomes.
We present a new formulation of Hudson’s simulation algorithm that solves these issues, making chromosome-scale simulation of the exact coalescent with recombination for hundreds of thousands of samples possible for the first time. Our approach begins by defining the genealogies that we are constructing in terms of integer vectors of a specific form, which we refer to as “sparse trees”. We generate recombination and common ancestor events in the same manner as the classical methods, but our approach to constructing marginal genealogies is quite different. When a coalescence within a marginal tree occurs, we store a tuple consisting of the left and right coordinates of the overlapping region, the parent and child nodes (which are integers), and the time at which the event occurred. We refer to these tuples as “coalescence records”, and they provide sufficient information to recover all genealogies after the simulation has completed. We implemented these ideas in a simulator called msprime, which we compared with the state of the art. For a fixed sample size of 1000 and increasing sequence length (with human-like recombination parameters), we found that msprime is much faster than comparable exact simulators and, surprisingly, is competitive with approximate SMC simulators. Even more surprisingly, we found that for a fixed sequence length of 50 megabases and increasing sample size, msprime was much faster than any existing simulator for large samples.
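A bare-bones sketch of what such a record might look like in Python (field names are illustrative, not msprime's actual storage schema):

```python
# Hypothetical sketch of a coalescence record as described above.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CoalescenceRecord:
    left: float                 # left coordinate of the region this event covers
    right: float                # right coordinate (applies to trees on [left, right))
    parent: int                 # integer id of the parent node
    children: Tuple[int, ...]   # integer ids of the child nodes joined here
    time: float                 # time of the common ancestor event

# Node 4 becomes the parent of samples 0 and 1 over [0, 10) at time 0.3, and so on.
records = [
    CoalescenceRecord(0.0, 10.0, 4, (0, 1), 0.3),
    CoalescenceRecord(0.0, 10.0, 5, (2, 3), 0.5),
    CoalescenceRecord(0.0, 10.0, 6, (4, 5), 1.2),
]
```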
Storing the output of the simulations as coalescence records has major advantages. Because parent-child relationships shared by adjacent trees are stored only once, the correlation structure of the sequence of genealogies is explicit and the storage requirements minimised. To illustrate this, we ran a simulation of a 100 megabase chromosome with a roughly human recombination rate for a sample of 100,000 individuals. This simulation ran in about 6 minutes on a single CPU thread and used around 850MB of RAM. The resulting coalescence records required 88MB using msprime’s native HDF5 based storage format. Storing the same genealogies in Newick format requires around 3.5TB.
Highly compressed representations of data usually come at the cost of increased access time. In contrast, we can retrieve complete genealogical information from coalescence records many times faster than is possible using existing Newick-based methods. We provide a detailed listing of an algorithm to sequentially recover the marginal genealogies from a set of coalescence records and show that this algorithm requires constant time to transition between adjacent trees. This algorithm has been implemented as part of msprime’s Python API, and required around 3 seconds to iterate over all 1.1 million trees generated by the simulation above. We compared this performance to several popular tree processing libraries, and found that the fastest would require an estimated 38 days to parse the same set of trees in Newick format. Thus, in this example, by using msprime’s storage format and API we can store the same set of trees using around forty thousand times less space and parse them around a million times more quickly than Newick based methods.
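A rough sketch of that left-to-right sweep, reusing the hypothetical CoalescenceRecord class and `records` list from above (the algorithm listed in the paper and implemented in msprime is more careful, e.g. about the time-ordering of record applications):

```python
# Sweep across the sequence, applying records as their intervals begin and
# removing them as they end, so moving between adjacent trees only touches
# the records that actually change. Illustrative sketch, not msprime's code.
def iterate_trees(records, sequence_length):
    ins = sorted(records, key=lambda r: r.left)    # records waiting to be applied
    rem = sorted(records, key=lambda r: r.right)   # records waiting to be removed
    parent = {}                                    # child node id -> parent node id
    i = j = 0
    x = 0.0
    while x < sequence_length:
        while j < len(rem) and rem[j].right == x:  # drop records ending at x
            for c in rem[j].children:
                del parent[c]
            j += 1
        while i < len(ins) and ins[i].left == x:   # apply records starting at x
            for c in ins[i].children:
                parent[c] = ins[i].parent
            i += 1
        right = sequence_length                    # next breakpoint to the right of x
        if i < len(ins):
            right = min(right, ins[i].left)
        if j < len(rem):
            right = min(right, rem[j].right)
        yield (x, right), parent                   # `parent` is live; copy it if you keep it
        x = right

for interval, tree in iterate_trees(records, 10.0):
    print(interval, tree)
```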
We can also store mutational information in a natural and efficient way. If we have an infinite sites mutation that occurs on the parent branch of a particular node at a particular position on the sequence, then we simply store this (node, position) pair. This leads to a very compact representation of the combined genealogical and mutational state of a sample. We simulated 1.2 million infinite sites mutations on top of the genealogies generated earlier, which resulted in a 102MB HDF5 file containing the coalescence records and mutations. In contrast, the corresponding text haplotypes consumed 113GB of storage space. Associating mutations directly with tree nodes also allows us to perform some important calculations efficiently. We describe an efficient algorithm to count the total number of leaf nodes from a particular set below each node in the tree as we iterate over the sequence. This algorithm allows us to (for example) calculate allele frequencies within specific subsets in constant time. Many other applications of these techniques are possible.
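As a simple illustration of the leaf-counting idea (the paper's algorithm updates counts incrementally as records are applied and removed; walking up from every tracked sample, as below, is simpler but slower), one can count tracked samples beneath each node of a single marginal tree and read off allele frequencies directly from a (node, position) mutation:

```python
# Count tracked samples beneath each node of one marginal tree, then use the
# counts to get allele frequencies for mutations stored as (node, position)
# pairs. Illustrative sketch, not msprime's algorithm.
from collections import defaultdict

def tracked_leaf_counts(parent, tracked_samples):
    counts = defaultdict(int)
    for u in tracked_samples:
        counts[u] += 1
        while u in parent:            # climb to the root, crediting each ancestor
            u = parent[u]
            counts[u] += 1
    return counts

tree = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}   # parent map for the example tree above
counts = tracked_leaf_counts(tree, tracked_samples=[0, 1, 2, 3])
mutation = (4, 3.7)                            # hypothetical (node, position) pair
print(counts[mutation[0]] / 4)                 # frequency of the mutation: 0.5
```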
The availability of faster and more accurate simulations may lead to interesting new applications, and so we conclude by discussing some potential applications of our work. Of particular interest is the possibility of inferring a set of coalescence records from real biological data, obtaining a compressed representation that can be efficiently processed. This is a very interesting and promising direction for future research.