Parametric Inference using Persistence Diagrams: A Case Study in Population Genetics

Kevin Emmett, Daniel Rosenbloom, Pablo Camara, Raul Rabadan
(Submitted on 18 Jun 2014)

Persistent homology computes topological invariants from point cloud data. Recent work has focused on developing statistical methods for data analysis in this framework. We show that, in certain models, parametric inference can be performed using statistics defined on the computed invariants. We develop this idea with a model from population genetics, the coalescent with recombination. We apply our model to an influenza dataset, identifying two scales of topological structure which have a distinct biological interpretation.

Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes

Julia Chifman, Laura Kubatko
(Submitted on 18 Jun 2014)

The inference of the evolutionary history of a collection of organisms is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to significant challenges in the inference of these phylogenetic relationships. Among these challenges is the inference of the evolutionary history of a collection of species based on sequence information from several distinct genes sampled throughout the genome. It is widely accepted that each individual gene has its own phylogeny, which may not agree with the species tree. Many possible causes of this gene tree incongruence are known. The best studied is incomplete lineage sorting, which is commonly modeled by the coalescent process. Numerous methods based on the coalescent process have been proposed for estimation of the phylogenetic species tree given multi-locus DNA sequence data. However, use of these methods assumes that the phylogenetic species tree can be identified from DNA sequence data at the leaves of the tree, although this has not been formally established. We prove that the unrooted topology of the n-leaf phylogenetic species tree is generically identifiable given observed data at the leaves of the tree that are assumed to have arisen from the coalescent process with time-reversible substitution.

The overdue promise of short tandem repeat variation for heritability

Maximilian Press, Keisha D. Carlson, Christine Queitsch

Short tandem repeat (STR) variation has been proposed as a major explanatory factor in the heritability of complex traits in humans and model organisms. However, we still struggle to incorporate STR variation into genotype-phenotype maps. Here, we review the promise of STRs in contributing to complex trait heritability, and highlight the challenges that STRs pose due to their repetitive nature. We argue that STR variants are more likely than single nucleotide variants to have epistatic interactions, reiterate the need for targeted assays to accurately genotype STRs, and call for more appropriate statistical methods in detecting STR-phenotype associations. Lastly, somatic STR variation within individuals may serve as a read-out of disease susceptibility, and is thus potentially a valuable covariate for future association studies.

Error correction and assembly complexity of single molecule sequencing reads

Hayan Lee, James Gurtowski, Shinjae Yoo, Shoshana Marcus, W. Richard McCombie, Michael Schatz

Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100 Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.

Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration

Alexandra Gavryushkina, David Welch, Tanja Stadler, Alexei Drummond
(Submitted on 18 Jun 2014)

Phylogenetic analyses which include fossils or molecular sequences that are sampled through time require models that allow one sample to be a direct ancestor of another sample. As previously available phylogenetic inference tools assume that all samples are tips, they do not allow for this possibility. We have developed and implemented a Bayesian Markov Chain Monte Carlo (MCMC) algorithm to infer what we call sampled ancestor trees, that is, trees in which sampled individuals can be direct ancestors of other sampled individuals. We use a family of birth-death models where individuals may remain in the tree process after sampling; in particular, we extend the birth-death skyline model [Stadler et al., 2013] to sampled ancestor trees. This method allows the detection of sampled ancestors as well as estimation of the probability that an individual will be removed from the process when it is sampled. We show that sampled ancestor birth-death models where all samples come from different time points are non-identifiable and thus require one parameter to be known in order to infer other parameters. We apply this method to epidemiological data, where the possibility of sampled ancestors enables us to identify individuals that infected other individuals after being sampled and to infer fundamental epidemiological parameters. We also apply the method to infer divergence times and diversification rates when fossils are included among the species samples, so that fossilisation events are modelled as a part of the tree branching process. Such modelling has many advantages, as argued in the literature. The sampler is available as an open-source BEAST2 package (this https URL ancestors/).

Nanopore Sequencing of the phi X 174 genome

Andrew H. Laszlo, Ian M. Derrington, Brian C. Ross, Henry Brinkerhoff, Andrew Adey, Ian C. Nova, Jonathan M. Craig, Kyle W. Langford, Jenny Mae Samson, Riza Daza, Kenji Doering, Jay Shendure, Jens H. Gundlach
(Submitted on 17 Jun 2014)

Nanopore sequencing of DNA is a single-molecule technique that may achieve long reads, low cost, and high speed with minimal sample preparation and instrumentation. Here, we build on recent progress with respect to nanopore resolution and DNA control to interpret the procession of ion current levels observed during the translocation of DNA through the pore MspA. As approximately four nucleotides affect the ion current of each level, we measured the ion current corresponding to all 256 four-nucleotide combinations (quadromers). This quadromer map is highly predictive of ion current levels of previously unmeasured sequences derived from the bacteriophage phi X 174 genome. Furthermore, we show nanopore sequencing reads of phi X 174 up to 4,500 bases in length that can be unambiguously aligned to the phi X 174 reference genome, and demonstrate proof-of-concept utility with respect to hybrid genome assembly and polymorphism detection. All methods and data are made fully available.
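The quadromer idea in the abstract above can be illustrated with a small sketch. It only shows the bookkeeping: there are 4^4 = 256 four-nucleotide combinations, and a sequence can be turned into a series of predicted levels by sliding a four-base window along it. The function names and the `toy_map` values are hypothetical placeholders, not the authors' code or measured ion currents.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def all_quadromers():
    """Return every four-nucleotide combination (4**4 = 256 of them)."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=4)]


def predict_levels(sequence, quadromer_map):
    """Slide a 4-nt window along `sequence`, looking up one level per window."""
    return [quadromer_map[sequence[i:i + 4]] for i in range(len(sequence) - 3)]


# Placeholder map: the values are arbitrary indices, NOT measured currents.
toy_map = {q: float(i) for i, q in enumerate(all_quadromers())}

# An arbitrary 12-base example sequence yields 12 - 3 = 9 windows.
levels = predict_levels("GAGTTTTATCGC", toy_map)
```

A real quadromer map would replace `toy_map` with one measured current value per quadromer; the lookup step would stay the same under that assumption.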

Assessing phenotypic correlation through the multivariate phylogenetic latent liability model

Gabriela B. Cybis, Janet S. Sinsheimer, Trevor Bedford, Alison E. Mather, Philippe Lemey, Marc A. Suchard
(Submitted on 15 Jun 2014)

Understanding which phenotypic traits are consistently correlated throughout evolution is a highly pertinent problem in modern evolutionary biology. Here, we propose a multivariate phylogenetic latent liability model for assessing the correlation between multiple types of data, while simultaneously controlling for their unknown shared evolutionary history informed through molecular sequences. The latent formulation enables us to consider in a single model combinations of continuous traits, discrete binary traits, and discrete traits with multiple ordered and unordered states. Previous approaches have entertained a single data type generally along a fixed history, precluding estimation of correlation between traits and ignoring uncertainty in the history. We implement our model in a Bayesian phylogenetic framework, and discuss inference techniques for hypothesis testing. Finally, we showcase the method through applications to columbine flower morphology, antibiotic resistance in Salmonella, and epitope evolution in influenza.

Identifying the Genetic Basis of Functional Protein Evolution Using Reconstructed Ancestors

Victor Hanson-Smith, Christopher Baker, Alexander Johnson
(Submitted on 11 Jun 2014)

A central challenge in the study of protein evolution is the identification of historic amino acid sequence changes responsible for creating novel functions observed in present-day proteins. To address this problem, we developed a new method to identify and rank amino acid mutations in ancestral protein sequences according to their function-shifting potential. Our approach scans the changes between two reconstructed ancestral sequences in order to find (1) sites with sequence changes that significantly deviate from our model-based probabilistic expectations, (2) sites that demonstrate extreme changes in mutual information, and (3) sites with extreme gains or losses of information content. By taking the overlaps of these statistical signals, the method accurately identifies cryptic evolutionary patterns that are often not obvious when examining only the conservation of modern-day protein sequences. We validated this method with a training set of previously-discovered function-shifting mutations in three essential protein families in animals and fungi, whose evolutionary histories were the prior subject of systematic molecular biological investigation. Our method identified the known function-shifting mutations in the training set with a very low rate of false positive discovery. Further, our approach significantly outperformed other methods that use variability in evolutionary rates to detect functional loci. The accuracy of our approach indicates it could be a useful tool for generating specific testable hypotheses regarding the acquisition of new functions across a wide range of protein families.
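To make signal (3) above concrete, here is a minimal sketch of one common way to quantify per-site information content: the maximum entropy of a 20-letter alphabet minus the Shannon entropy of the site's amino acid distribution. The abstract does not state which formula the authors use, so this definition, the function names, and the example probabilities are all assumptions for illustration only.

```python
import math


def information_content(probs, alphabet_size=20):
    """Bits of information at a site: log2(alphabet_size) minus Shannon entropy.

    `probs` maps amino acid letters to probabilities summing to 1.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def information_change(ancestor_a, ancestor_b):
    """Gain (positive) or loss (negative) of information from A to B."""
    return information_content(ancestor_b) - information_content(ancestor_a)


# Hypothetical posterior distributions at one site in two ancestors.
older = {"L": 1.0}               # site fully determined
younger = {"L": 0.5, "V": 0.5}   # site split between two residues
delta = information_change(older, younger)  # one bit of information is lost
```

Sites would then be ranked by the magnitude of `delta`, with the extreme gains and losses flagged for follow-up.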

Accounting for biases in riboprofiling data indicates a major role for proline and not positive amino acids in stalling translation

Carlo G. Artieri, Hunter B. Fraser

The recent advent of ribosome profiling (sequencing of short ribosome-bound fragments of mRNA) has offered an unprecedented opportunity to interrogate the sequence features responsible for modulating translational rates. Nevertheless, numerous analyses of the first riboprofiling dataset have produced equivocal and often incompatible results. Here we analyze three independent yeast riboprofiling data sets, including two with much higher coverage than previously available, and find that all three show substantial technical sequence biases that confound interpretations of ribosomal occupancy. After accounting for these biases, we find no effect of previously implicated factors on ribosomal pausing. Rather, we find that incorporation of proline, whose unique side-chain stalls peptide synthesis in vitro, also slows the ribosome in vivo. We also reanalyze a recent method that reported positively charged amino acids as the major determinant of ribosomal stalling and demonstrate that its assumptions lead to false signals of stalling in low-coverage data. Our results suggest that any analysis of riboprofiling data should account for sequencing biases and sparse coverage. To this end, we establish a robust methodology that enables analysis of ribosome profiling data without prior assumptions regarding which positions spanned by the ribosome cause stalling.

The rugged adaptive landscape of an emerging plant RNA virus

Jasna Lalic, Santiago F. Elena

RNA viruses are the main source of emerging infectious diseases owing to the evolutionary potential bestowed by their fast replication, large population sizes and high mutation and recombination rates. However, an equally important parameter, which is usually neglected, is the topography of the fitness landscape, that is, how many fitness maxima exist and how well connected they are, which determines the number of accessible evolutionary pathways. To address this question, we have reconstructed the fitness landscape describing the adaptation of Tobacco etch potyvirus to its new host, Arabidopsis thaliana. Fitness was measured for most of the genotypes in the landscape, showing the existence of peaks and holes. We found prevailing epistatic effects between mutations, with cases of reciprocal sign epistasis being common at later stages. Our results therefore suggest that the landscape was rugged and holey, with several local fitness peaks and a very limited number of potential neutral paths. The viral genotype fixed at the end of the evolutionary process was not on the global fitness optimum but stuck on a suboptimal peak.
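The notion of counting fitness maxima can be sketched on a toy landscape. The example below treats genotypes as tuples of 0/1 (mutation absent or present) and calls a genotype a local peak when it is fitter than every measured single-mutation neighbor. The encoding, the function names, and the fitness values are hypothetical and chosen only to show reciprocal sign epistasis producing two peaks; they are not data from the study.

```python
def neighbors(genotype):
    """Yield the genotypes one mutation away (genotype is a tuple of 0/1)."""
    for i in range(len(genotype)):
        yield genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]


def local_peaks(fitness):
    """Return genotypes fitter than all of their measured neighbors."""
    return [
        g for g, w in fitness.items()
        if all(w > fitness[n] for n in neighbors(g) if n in fitness)
    ]


# Toy two-locus landscape: each single mutant is less fit than the wild
# type, yet the double mutant is the fittest (reciprocal sign epistasis).
toy_fitness = {(0, 0): 1.0, (1, 0): 0.8, (0, 1): 0.7, (1, 1): 1.2}

peaks = sorted(local_peaks(toy_fitness))
global_optimum = max(toy_fitness, key=toy_fitness.get)
```

In this toy case a population starting at `(0, 0)` sits on a local peak and cannot reach `global_optimum` through single beneficial mutations, which is the kind of trapping the abstract describes.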