Learning quantitative sequence-function relationships from high-throughput biological data

Gurinder S Atwal, Justin B Kinney
doi: http://dx.doi.org/10.1101/020172

Understanding the transcriptional regulatory code, as well as other types of information encoded within biomolecular sequences, will require learning biophysical models of sequence-function relationships from high-throughput data. Controlling and characterizing the noise in such experiments, however, is notoriously difficult. The unpredictability of such noise creates problems for standard likelihood-based methods in statistical learning, which require that the quantitative form of experimental noise be known precisely. However, when this unpredictability is properly accounted for, important theoretical aspects of statistical learning which remain hidden in standard treatments are revealed. Specifically, one finds a close relationship between the standard inference method, based on likelihood, and an alternative inference method based on mutual information. Here we review and extend this relationship. We also describe its implications for learning sequence-function relationships from real biological data. Finally, we detail an idealized experiment in which these results can be demonstrated analytically.
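
To make the contrast concrete, here is a minimal Python sketch (not the authors' code; the weight-matrix model and the skewed noise are toy assumptions) comparing a Gaussian likelihood objective, which presumes a specific noise model, with a histogram-based mutual-information objective, which does not:

```python
# Minimal sketch (not the authors' code): contrast a likelihood objective,
# which assumes a specific noise model, with a mutual-information objective,
# which does not, for a toy sequence-function model.
import numpy as np

rng = np.random.default_rng(0)

def model_prediction(seqs, weights):
    """Additive (weight-matrix) model: score = sum of per-position weights."""
    return np.array([sum(weights[i][b] for i, b in enumerate(s)) for s in seqs])

def gaussian_log_likelihood(pred, measured, sigma=1.0):
    """Standard likelihood objective; only meaningful if noise really is Gaussian."""
    return -0.5 * np.sum((measured - pred) ** 2) / sigma**2 - len(pred) * np.log(sigma)

def mutual_information(pred, measured, bins=20):
    """Plug-in MI estimate from a 2D histogram; needs no assumed noise model."""
    joint, _, _ = np.histogram2d(pred, measured, bins=bins)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1, keepdims=True), p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

# Toy data: random 10-mers, a "true" weight matrix, and non-Gaussian noise.
bases = "ACGT"
true_w = [{b: rng.normal() for b in bases} for _ in range(10)]
seqs = ["".join(rng.choice(list(bases), 10)) for _ in range(5000)]
pred = model_prediction(seqs, true_w)
measured = pred + rng.exponential(1.0, size=len(pred))  # skewed, "unpredictable" noise

print("Gaussian log-likelihood:", gaussian_log_likelihood(pred, measured))
print("Mutual information (bits):", mutual_information(pred, measured))
```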

Most viewed on Haldane’s Sieve: May 2015

The most viewed posts on Haldane’s Sieve in May 2015 were:

Optimizing error correction of RNAseq reads

Matthew D MacManes
doi: http://dx.doi.org/10.1101/020123

Motivation: The correction of sequencing errors contained in Illumina reads derived from genomic DNA is a common pre-processing step in many de novo genome assembly pipelines, and has been shown to improve the quality of resultant assemblies. In contrast, the correction of errors in transcriptome sequence data is much less common, but can potentially yield similar improvements in mapping and assembly quality. This manuscript evaluates the ability of several popular read-correction tools to correct sequence errors common in transcriptome-derived Illumina reads. Results: I evaluated the efficacy of correction of transcriptome-derived sequencing reads using several metrics across a variety of sequencing depths. This evaluation demonstrates a complex relationship between the quality of the correction, depth of sequencing, and hardware availability, which results in variable recommendations depending on the goals of the experiment, tolerance for false positives, and depth of coverage. Overall, read error correction is an important step in read quality control, and should become a standard part of analytical pipelines. Availability: Results are non-deterministically repeatable using AMI:ami-3dae4956 (MacManes EC 2015) and the Makefile available here: https://goo.gl/oVIuE0
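
A minimal sketch of the kind of per-base bookkeeping such an evaluation relies on (illustrative only; the manuscript's actual pipeline is in the linked Makefile): given the true sequence a read was simulated from, count errors fixed, errors missed, and new errors introduced by a correction tool.

```python
# Minimal sketch (not the manuscript's pipeline) of scoring a read-correction
# tool: compare each raw and corrected read against the known true sequence.
def score_correction(true_seq, raw_read, corrected_read):
    """Return (errors_fixed, errors_missed, new_errors_introduced)."""
    assert len(true_seq) == len(raw_read) == len(corrected_read)
    fixed = missed = introduced = 0
    for t, r, c in zip(true_seq, raw_read, corrected_read):
        if r != t and c == t:
            fixed += 1          # true positive: an error was corrected
        elif r != t and c != t:
            missed += 1         # false negative: error left (or mis-corrected)
        elif r == t and c != t:
            introduced += 1     # false positive: a correct base was changed
    return fixed, missed, introduced

# Toy example:
true_seq  = "ACGTACGTAC"
raw_read  = "ACGTTCGTAC"   # one sequencing error at position 4
corrected = "ACGTACGAAC"   # error fixed, but a new error introduced at position 7
print(score_correction(true_seq, raw_read, corrected))  # (1, 0, 1)
```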

Mixed Models for Meta-Analysis and Sequencing

Brendan Bulik-Sullivan
doi: http://dx.doi.org/10.1101/020115

Mixed models are an effective statistical method for increasing power and avoiding confounding in genetic association studies. Existing mixed model methods have been designed for “pooled” studies where all individual-level genotype and phenotype data are simultaneously visible to a single analyst. Many studies follow a “meta-analysis” design, wherein a large number of independent cohorts share only summary statistics with a central meta-analysis group, and no one person can view individual-level data for more than a small fraction of the total sample. When using linear regression for GWAS, there is no difference in power between pooled studies and meta-analyses (Lin and Zeng 2010); however, we show that when using mixed models, standard meta-analysis is much less powerful than mixed model association on a pooled study of equal size. We describe a method that allows meta-analyses to capture almost all of the power available to mixed model association on a pooled study without sharing individual-level genotype data. The added computational cost and analytical complexity of this method is minimal, but the increase in power can be large: based on the predictive performance of polygenic scoring reported in Wood et al. (2014) and Locke et al. (2015), we estimate that the next height and BMI studies could see increases in effective sample size of approximately 15% and 8%, respectively. Last, we describe how a related technique can be used to increase power in sequencing, targeted sequencing and exome array studies. Note that these techniques are presently only applicable to randomly ascertained studies and will sometimes result in loss of power in ascertained case/control studies. We are developing similar methods for case/control studies, but this is more complicated.
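
For readers unfamiliar with the underlying machinery, the following is a minimal sketch of single-SNP mixed-model association on pooled data, the baseline the abstract compares against. It is not the paper's method; the variance components are assumed known here, whereas real tools estimate them (e.g., by REML).

```python
# Minimal sketch (not the paper's method) of the standard single-SNP mixed
# model: y = x*beta + g + e, with g ~ N(0, sg2*K) for a genetic relationship
# matrix K. Variance components sg2 and se2 are assumed known for simplicity.
import numpy as np

def mixed_model_assoc(y, x, K, sg2, se2):
    """Generalized least squares for one SNP under V = sg2*K + se2*I."""
    n = len(y)
    V = sg2 * K + se2 * np.eye(n)
    Vinv = np.linalg.inv(V)            # fine for a sketch; use solves at scale
    xtVx = x @ Vinv @ x
    beta = (x @ Vinv @ y) / xtVx
    se = np.sqrt(1.0 / xtVx)
    return beta, beta / se             # effect estimate and z-score

# Toy data with confounding relatedness
rng = np.random.default_rng(1)
n = 200
G = rng.normal(size=(n, 500))
K = G @ G.T / 500                      # empirical relationship matrix
x = rng.binomial(2, 0.3, size=n).astype(float)
y = 0.2 * x + rng.multivariate_normal(np.zeros(n), 0.5 * K + 0.5 * np.eye(n))
print(mixed_model_assoc(y, x - x.mean(), K, 0.5, 0.5))
```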

An integrative statistical model for inferring strain admixture within clinical Plasmodium falciparum isolates

John D. O’Brien, Zamin Iqbal, Lucas Amenga-Etego
(Submitted on 29 May 2015)

Since the arrival of genetic typing methods in the late 1960s, researchers have puzzled over the clinical consequence of observed strain mixtures within clinical isolates of Plasmodium falciparum. We present a new statistical model that infers the number of strains present and the amount of admixture with the local population (panmixia) using whole-genome sequence data. The model provides a rigorous statistical approach to inferring these quantities as well as the proportions of the strains within each sample. Applied to 168 samples of whole-genome sequence data from northern Ghana, the model provides a significantly improved fit over models implementing simpler approaches to mixture for a large majority (129/168) of samples. We discuss the possible uses of this model as a window into within-host selection for clinical and epidemiological studies and outline possible means for experimental validation.
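
The core likelihood idea can be sketched in a few lines (an illustration only, not the authors' full model, which also handles admixture with the local population and unknown haplotypes): with k strains at proportions w, the expected within-sample alternate-allele frequency at each biallelic site is a mixture of the strains' alleles, and read counts are modeled as binomial draws.

```python
# A minimal sketch (not the authors' model) of a binomial mixture likelihood
# for strain proportions within a single clinical isolate.
import numpy as np
from scipy.stats import binom

def mixture_log_likelihood(alt_counts, depths, haplotypes, w, eps=1e-3):
    """
    alt_counts, depths : per-site alternate-allele counts and coverages
    haplotypes         : (k strains x n sites) 0/1 matrix
    w                  : strain proportions, summing to 1
    eps                : small error rate to keep frequencies off 0 and 1
    """
    freqs = w @ haplotypes                       # expected alt frequency per site
    freqs = np.clip(freqs, eps, 1 - eps)
    return binom.logpmf(alt_counts, depths, freqs).sum()

# Toy example: two strains mixed 70/30 across five sites
haps = np.array([[1, 0, 1, 1, 0],
                 [0, 0, 1, 0, 1]])
w_true = np.array([0.7, 0.3])
depths = np.full(5, 100)
rng = np.random.default_rng(2)
alt = rng.binomial(depths, np.clip(w_true @ haps, 1e-3, 1 - 1e-3))
for w1 in (0.5, 0.7, 0.9):                       # crude grid over the mixing proportion
    print(w1, mixture_log_likelihood(alt, depths, haps, np.array([w1, 1 - w1])))
```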

The effect of non-reversibility on inferring rooted phylogenies

S. Cherlin, T. M. W. Nye, R. J. Boys, S. E. Heaps, T. A. Williams, T. M. Embley
(Submitted on 29 May 2015)

Most phylogenetic models assume that the evolutionary process is stationary and reversible. As a result, the root of the tree cannot be inferred as part of the analysis because the likelihood of the data does not depend on the position of the root. Yet defining the root of a phylogenetic tree is a key component of phylogenetic inference because it provides a point of reference for polarising ancestor/descendant relationships and therefore interpreting the tree. In this paper we investigate the effect of relaxing the reversibility assumption and allowing the position of the root to be another unknown quantity in the model. We propose two hierarchical models which are centred on a reversible model but perturbed to allow non-reversibility. The models differ in the degree of structure imposed on the perturbations. The analysis is performed in the Bayesian framework using Markov chain Monte Carlo methods. We illustrate the performance of the two non-reversible models in analyses of simulated datasets using two types of topological priors. We then apply the models to a real biological dataset, the radiation of polyploid yeasts, for which there is a robust biological opinion about the root position. Finally we apply the models to a second biological dataset for which the rooted tree is controversial: the ribosomal tree of life. We compare the two non-reversible models and conclude that both are useful in inferring the position of the root from real biological datasets.
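
A small numerical illustration of why reversibility hides the root (a sketch, not the paper's hierarchical models): for a two-taxon tree of total branch length T, slide the root along the branch and compute the probability of the tip states. Under a reversible rate matrix the likelihood is flat in the root position; under a non-reversible one it generally is not.

```python
# Sketch: root placement and the likelihood under reversible vs non-reversible
# substitution models, for a two-taxon tree with tip states (i, j).
import numpy as np
from scipy.linalg import expm

def tip_pair_likelihood(Q, pi, i, j, t, T):
    """Root at distance t from tip 1 and T - t from tip 2."""
    P1, P2 = expm(Q * t), expm(Q * (T - t))
    return sum(pi[k] * P1[k, i] * P2[k, j] for k in range(len(pi)))

# Reversible (Jukes-Cantor) rate matrix with uniform stationary distribution
Q_rev = np.full((4, 4), 1.0 / 3) - np.eye(4) * (1 + 1.0 / 3)
pi_rev = np.full(4, 0.25)

# A non-reversible matrix: perturb the off-diagonal rates, re-fix the diagonal
rng = np.random.default_rng(3)
Q_nr = Q_rev + rng.uniform(0, 0.5, (4, 4)) * (1 - np.eye(4))
np.fill_diagonal(Q_nr, 0)
np.fill_diagonal(Q_nr, -Q_nr.sum(axis=1))
# Stationary distribution of the non-reversible chain (left null vector of Q)
w, v = np.linalg.eig(Q_nr.T)
pi_nr = np.real(v[:, np.argmin(np.abs(w))])
pi_nr /= pi_nr.sum()

T = 1.0
for t in (0.1, 0.5, 0.9):
    print(t,
          round(tip_pair_likelihood(Q_rev, pi_rev, 0, 2, t, T), 6),   # constant in t
          round(tip_pair_likelihood(Q_nr, pi_nr, 0, 2, t, T), 6))     # not constant in general
```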

A Bayesian Approach for Detecting Mass-Extinction Events When Rates of Lineage Diversification Vary

Michael R. May, Sebastian Höhna, Brian R. Moore
doi: http://dx.doi.org/10.1101/020149

The paleontological record chronicles numerous episodes of mass extinction that severely culled the Tree of Life. Biologists have long sought to assess the extent to which these events may have impacted particular groups. We present a novel method for detecting mass-extinction events from phylogenies estimated from molecular sequence data. We develop our approach in a Bayesian statistical framework, which enables us to harness prior information on the frequency and magnitude of mass-extinction events. The approach is based on an episodic stochastic-branching process model in which rates of speciation and extinction are constant between rate-shift events. We model three types of events: (1) instantaneous tree-wide shifts in speciation rate; (2) instantaneous tree-wide shifts in extinction rate; and (3) instantaneous tree-wide mass-extinction events. Each type of event is described by a separate compound Poisson process (CPP) model, where the waiting times between events are exponentially distributed with event-specific rate parameters. The magnitude of each event is drawn from an event-type-specific prior distribution. Parameters of the model are then estimated using a reversible-jump Markov chain Monte Carlo (rjMCMC) algorithm. We demonstrate via simulation that this method has substantial power to detect the number of mass-extinction events, provides unbiased estimates of their timing, and exhibits an appropriate (i.e., below 5%) false discovery rate even in the case of background diversification-rate variation. Finally, we provide an empirical application of this approach to conifers, which reveals that this group has experienced two major episodes of mass extinction. This new approach, the CPP on Mass Extinction Times (CoMET) model, provides an effective tool for identifying mass-extinction events from molecular phylogenies, even when the history of those groups includes more prosaic temporal variation in diversification rate.
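
As an illustration of the compound Poisson process prior described above, here is a short sketch that simulates the three event types. The specific prior families (lognormal rate multipliers, Beta survival probabilities) and rate values are assumptions for illustration, not necessarily those used in CoMET.

```python
# Illustrative sketch of a compound Poisson process prior on diversification
# events: exponential waiting times between events, with event-type-specific
# rates and magnitude distributions.
import numpy as np

rng = np.random.default_rng(4)

def simulate_cpp(rate, magnitude_sampler, t_max):
    """Event times on [0, t_max] with exponential(rate) waits, plus magnitudes."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t > t_max:
            break
        times.append((round(t, 2), round(magnitude_sampler(), 3)))
    return times

t_max = 100.0  # tree age (arbitrary units for the sketch)
speciation_shifts = simulate_cpp(0.02, lambda: rng.lognormal(0.0, 0.5), t_max)
extinction_shifts = simulate_cpp(0.02, lambda: rng.lognormal(0.0, 0.5), t_max)
mass_extinctions  = simulate_cpp(0.01, lambda: rng.beta(2.0, 8.0), t_max)

print("speciation-rate shifts:", speciation_shifts)
print("extinction-rate shifts:", extinction_shifts)
print("mass extinctions (time, survival prob):", mass_extinctions)
```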

Chromosomal rearrangements as barriers to genetic homogenization between archaic and modern humans

Rebekah L. Rogers
(Submitted on 26 May 2015)

Chromosomal rearrangements, which shuffle DNA across the genome, are an important source of divergence across taxa that can modify gene expression and function. Using a paired-end read approach with Illumina sequence data for archaic humans, I identify changes in genome structure that occurred recently in human evolution. Hundreds of rearrangements indicate genomic trafficking between the sex chromosomes and autosomes, raising the possibility of sex-specific changes. Additionally, genes adjacent to genome structure changes in Neanderthals are associated with testis-specific expression, consistent with evolutionary theory that new genes commonly form with expression in the testes. I identify one case of new-gene creation through transposition from the Y chromosome to chromosome 10 that combines the 5′ end of the testis-specific gene Fank1 with previously untranscribed sequence. This new transcript experienced copy number expansion in archaic genomes, indicating rapid genomic change. Finally, loci containing genome structure changes show diminished rates of introgression from Neanderthals into modern humans, consistent with the hypothesis that rearrangements serve as barriers to gene flow during hybridization. Together, these results suggest that this previously unidentified source of genomic variation has important biological consequences in human evolution.
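
The basic paired-end signal behind such scans can be sketched as follows (a simplified illustration, not the author's pipeline; the BAM path is hypothetical, and a real analysis would also use insert-size and orientation anomalies and require support from multiple independent pairs):

```python
# Sketch: flag candidate interchromosomal rearrangements from read pairs
# whose two mates map to different chromosomes.
from collections import Counter
import pysam

def interchromosomal_pairs(bam_path, min_mapq=30):
    pairs = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if (read.is_paired and not read.is_unmapped
                    and not read.mate_is_unmapped
                    and read.mapping_quality >= min_mapq
                    and read.reference_name != read.next_reference_name):
                key = tuple(sorted((read.reference_name, read.next_reference_name)))
                pairs[key] += 1
    return pairs

# Chromosome pairs supported by many discordant read pairs are candidates
counts = interchromosomal_pairs("neanderthal_sample.bam")   # hypothetical file
for (chrom_a, chrom_b), n in counts.most_common(10):
    if n >= 5:
        print(chrom_a, chrom_b, n)
```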

A Unified Architecture of Transcriptional Regulatory Elements

Robin Andersson, Albin Sandelin, Charles G Danko
doi: http://dx.doi.org/10.1101/019844

Gene expression is precisely controlled in time and space through the integration of signals that act at gene promoters and gene-distal enhancers. Classically, promoters and enhancers are considered separate classes of regulatory elements, often distinguished by histone modifications. However, recent studies have revealed broad similarities between enhancers and promoters, blurring the distinction: active enhancers often initiate transcription, and some gene promoters have the potential to enhance the transcriptional output of other promoters. Here, we propose a model in which promoters and enhancers are considered a single class of functional element, with a unified architecture for transcription initiation. The context of interacting regulatory elements and surrounding sequences determines the local transcriptional output as well as the enhancer and promoter activities of individual elements.

Large-scale Machine Learning for Metagenomics Sequence Classification

Kévin Vervier (CBIO), Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras, Jean-Philippe Vert (CBIO)
(Submitted on 26 May 2015)

Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Due to the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions. In this work, we investigate the potential of modern, large-scale machine learning implementations for taxonomic assignment of next-generation sequencing reads based on their k-mer profiles. We show that machine learning-based compositional approaches benefit from increasing the number of fragments sampled from reference genomes to tune their parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning these models involves training a machine learning model on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting models are competitive in terms of accuracy with well-established alignment tools for problems involving a small to moderate number of candidate species, and for reasonable amounts of sequencing errors. We show, however, that compositional approaches are still limited in their ability to deal with problems involving a greater number of species, and are more sensitive to sequencing errors. We finally confirm that compositional approaches achieve faster prediction times, with a speedup of 3 to 15 times relative to the BWA-MEM short-read mapper, depending on the number of candidate species and the level of sequencing noise.
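
A toy version of the compositional approach can be written in a few lines of Python with feature hashing and a linear classifier (a sketch only: the made-up genomes, k = 4, and fragment counts are illustrative and far from the ~10^8 fragments and k of about 12 discussed in the paper, and this is not the authors' implementation):

```python
# Sketch of a compositional classifier: represent each read by its k-mer
# profile via feature hashing and train a linear model to predict the source.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(5)
# Two toy "genomes" with different base compositions so the task is separable
genomes = {"speciesA": "".join(rng.choice(list("ACGT"), 5000, p=[0.2, 0.3, 0.3, 0.2])),
           "speciesB": "".join(rng.choice(list("ACGT"), 5000, p=[0.3, 0.2, 0.2, 0.3]))}

def sample_fragments(genome, n, length=200):
    starts = rng.integers(0, len(genome) - length, size=n)
    return [genome[s:s + length] for s in starts]

reads, labels = [], []
for name, g in genomes.items():
    frags = sample_fragments(g, 500)
    reads += frags
    labels += [name] * len(frags)

# k-mer profile via character 4-grams hashed into a fixed-size sparse vector
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(4, 4), n_features=2**16)
X = vectorizer.transform(reads)
clf = SGDClassifier().fit(X, labels)

query = sample_fragments(genomes["speciesA"], 5)
print(clf.predict(vectorizer.transform(query)))
```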