The next 20 years of genome research

Michael Schatz
doi: http://dx.doi.org/10.1101/020289

The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than one billion genomes, bringing with it even deeper insights into human genetics as well as the genetics of millions of other species on the planet. This great potential, though, will only be realized through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements in biotechnology. In this perspective, we aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world.

Learning quantitative sequence-function relationships from high-throughput biological data

Gurinder S Atwal, Justin B Kinney
doi: http://dx.doi.org/10.1101/020172

Understanding the transcriptional regulatory code, as well as other types of information encoded within biomolecular sequences, will require learning biophysical models of sequence-function relationships from high-throughput data. Controlling and characterizing the noise in such experiments, however, is notoriously difficult. The unpredictability of such noise creates problems for standard likelihood-based methods in statistical learning, which require that the quantitative form of experimental noise be known precisely. However, when this unpredictability is properly accounted for, important theoretical aspects of statistical learning which remain hidden in standard treatments are revealed. Specifically, one finds a close relationship between the standard inference method, based on likelihood, and an alternative inference method based on mutual information. Here we review and extend this relationship. We also describe its implications for learning sequence-function relationships from real biological data. Finally, we detail an idealized experiment in which these results can be demonstrated analytically.
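
As a rough illustration of why mutual information is attractive in this setting (it is invariant to unknown monotonic distortions of the measurements and does not require an explicit noise model), the minimal Python sketch below estimates mutual information between hypothetical model predictions and noisy readouts from a binned joint distribution. The function name, toy data, and noise process are assumptions for illustration, not the authors' method.

    import numpy as np

    def mutual_information(x, y, bins=20):
        """Plug-in estimate of mutual information (in bits) from a 2D histogram."""
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = joint / joint.sum()
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

    # Toy example: predictions passed through an unknown nonlinearity plus noise.
    rng = np.random.default_rng(0)
    prediction = rng.normal(size=5000)
    measurement = np.tanh(prediction) + rng.normal(scale=0.5, size=5000)
    print(mutual_information(prediction, measurement))

Plug-in estimates like this are biased at small sample sizes, which is part of why careful treatment of noise and estimation matters in practice.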

Optimizing error correction of RNAseq reads

Matthew D MacManes
doi: http://dx.doi.org/10.1101/020123

Motivation: The correction of sequencing errors contained in Illumina reads derived from genomic DNA is a common pre-processing step in many de novo genome assembly pipelines, and has been shown to improve the quality of resultant assemblies. In contrast, the correction of errors in transcriptome sequence data is much less common, but can potentially yield similar improvements in mapping and assembly quality. This manuscript evaluates the ability of several popular read-correction tools to correct sequence errors common in transcriptome-derived Illumina reads. Results: I evaluated the efficacy of correction of transcriptome-derived sequencing reads using several metrics across a variety of sequencing depths. This evaluation demonstrates a complex relationship between the quality of the correction, depth of sequencing, and hardware availability, which results in variable recommendations depending on the goals of the experiment, tolerance for false positives, and depth of coverage. Overall, read error correction is an important step in read quality control, and should become a standard part of analytical pipelines. Availability: Results are non-deterministically repeatable using AMI:ami-3dae4956 (MacManes EC 2015) and the Makefile available here: https://goo.gl/oVIuE0
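
For readers who want to reproduce this kind of evaluation on simulated data, the hedged Python sketch below shows one common way to score a corrector when the true sequence is known: classify each base of a read as a true positive (error fixed), false positive (correct base changed), or false negative (error left). The function and example reads are hypothetical and not taken from the manuscript's Makefile.

    def correction_outcomes(raw, corrected, truth):
        """Count per-base outcomes for equal-length raw/corrected/true sequences."""
        tp = fp = fn = tn = 0
        for r, c, t in zip(raw, corrected, truth):
            if r != t and c == t:
                tp += 1          # an error that was fixed
            elif r == t and c != t:
                fp += 1          # a correct base that was miscorrected
            elif r != t and c != t:
                fn += 1          # an error left (or changed incorrectly)
            else:
                tn += 1          # correct base left alone
        return tp, fp, fn, tn

    tp, fp, fn, tn = correction_outcomes("ACGTACGA", "ACGTACGT", "ACGTACGT")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    print(precision, recall)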

Mixed Models for Meta-Analysis and Sequencing

Brendan Bulik-Sullivan
doi: http://dx.doi.org/10.1101/020115

Mixed models are an effective statistical method for increasing power and avoiding confounding in genetic association studies. Existing mixed model methods have been designed for “pooled” studies where all individual-level genotype and phenotype data are simultaneously visible to a single analyst. Many studies follow a “meta-analysis” design, wherein a large number of independent cohorts share only summary statistics with a central meta-analysis group, and no one person can view individual-level data for more than a small fraction of the total sample. When using linear regression for GWAS, there is no difference in power between pooled studies and meta-analyses (Lin et al., 2010); however, we show that when using mixed models, standard meta-analysis is much less powerful than mixed model association on a pooled study of equal size. We describe a method that allows meta-analyses to capture almost all of the power available to mixed model association on a pooled study without sharing individual-level genotype data. The added computational cost and analytical complexity of this method are minimal, but the increase in power can be large: based on the predictive performance of polygenic scoring reported in Wood et al. (2014) and Locke et al. (2015), we estimate that the next height and BMI studies could see increases in effective sample size of approximately 15% and 8%, respectively. Last, we describe how a related technique can be used to increase power in sequencing, targeted sequencing and exome array studies. Note that these techniques are presently only applicable to randomly ascertained studies and will sometimes result in loss of power in ascertained case/control studies. We are developing similar methods for case/control studies, but this is more complicated.
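
For context, the standard fixed-effects, inverse-variance meta-analysis of per-cohort summary statistics (the linear-regression setting in which pooled and meta-analysis power coincide) looks roughly like the Python sketch below. The new mixed-model method described in the abstract is not reproduced here, and the cohort numbers are made up.

    import numpy as np
    from scipy.stats import norm

    def inverse_variance_meta(betas, ses):
        """Fixed-effects meta-analysis of per-cohort effect sizes and standard errors."""
        betas, ses = np.asarray(betas, float), np.asarray(ses, float)
        w = 1.0 / ses**2
        beta = np.sum(w * betas) / np.sum(w)   # precision-weighted mean effect
        se = np.sqrt(1.0 / np.sum(w))
        z = beta / se
        p = 2 * norm.sf(abs(z))
        return beta, se, p

    # Three hypothetical cohorts reporting the same SNP.
    print(inverse_variance_meta([0.12, 0.08, 0.15], [0.05, 0.04, 0.07]))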

An integrative statistical model for inferring strain admixture within clinical Plasmodium falciparum isolates

John D. O’Brien, Zamin Iqbal, Lucas Amenga-Etego
(Submitted on 29 May 2015)

Since the arrival of genetic typing methods in the late 1960s, researchers have puzzled over the clinical consequences of observed strain mixtures within clinical isolates of Plasmodium falciparum. We present a new statistical model that infers the number of strains present and the amount of admixture with the local population (panmixia) using whole-genome sequence data. The model provides a rigorous statistical approach to inferring these quantities as well as the proportions of the strains within each sample. Applied to 168 samples of whole-genome sequence data from northern Ghana, the model provides a significantly improved fit over models implementing simpler approaches to mixture for a large majority (129/168) of samples. We discuss the possible uses of this model as a window into within-host selection for clinical and epidemiological studies and outline possible means for experimental validation.
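
As a toy version of the "simpler approaches to mixture" that the model is compared against, the Python sketch below fits a single minor-strain proportion to within-sample ALT read counts by grid-searching a binomial likelihood, assuming each variant site is fixed for different alleles in the two strains. This is a deliberate simplification for illustration, not the authors' integrative model, and the counts are invented.

    import numpy as np
    from scipy.stats import binom

    def fit_two_strain_proportion(alt_counts, depths, grid=np.linspace(0.01, 0.5, 50)):
        """Grid-search MLE of the minor-strain proportion, assuming each site is
        fixed for the ALT allele in one strain and REF in the other."""
        alt = np.asarray(alt_counts)
        dp = np.asarray(depths)
        loglik = []
        for p in grid:
            # Each site's expected ALT fraction is either p or 1 - p, with equal prior weight.
            ll_site = np.logaddexp(binom.logpmf(alt, dp, p),
                                   binom.logpmf(alt, dp, 1 - p)) - np.log(2)
            loglik.append(ll_site.sum())
        return grid[int(np.argmax(loglik))]

    # Hypothetical ALT read counts and depths at four variant sites.
    print(fit_two_strain_proportion([3, 18, 2, 17], [20, 20, 20, 20]))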

An explicit Poisson-Kolmogorov-Smirnov test for the molecular clock in phylogenies

Fernando Marcon, Fernando Antoneli, Marcelo R. S. Briones
(Submitted on 21 May 2015)

Divergence date estimates are central to understanding evolutionary processes and depend, in the case of molecular phylogenies, on tests for the molecular clock. Tests for global and local clocks generally compare a clock-constrained tree with a non-clock tree (e.g. via the likelihood ratio test). These tests assess evolutionary rate homogeneity among taxa and usually employ the chi-square test for rejection/acceptance of the “clock-like” phylogeny. The paradox is that the molecular clock hypothesis, as proposed, is a Poisson process and therefore non-homogeneous. Here we propose a method for testing the molecular clock in phylogenies that is built upon the assumption of a Poisson stochastic process that accommodates rate heterogeneity, and is based on ensembles of trees inferred by the Bayesian method. The observed distribution of branch lengths (numbers of substitutions) is obtained from the ensemble of post-burn-in trees from the Bayesian search. The parameter λ of the expected Poisson distribution is given by the average branch length of this ensemble. The goodness-of-fit test is performed using a modified Kolmogorov-Smirnov test for Poisson distributions. The method introduced here uses a large number of statistically equivalent phylogenies to obtain the observed distribution. This circumvents problems of small sample size (lack of power and lack of information), because the power of the test is asymptotic to unity. Also, the observed distribution is very robust in the sense that for a sufficient number of trees (700) the empirical distribution stabilizes. Therefore, the estimated parameter λ, used to define the expected distribution, is essentially independent of sample size.
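
The basic recipe is concrete enough to sketch: pool branch lengths (substitution counts) from the post-burn-in trees, set λ to their mean, and compare the empirical distribution with Poisson(λ) via a Kolmogorov-Smirnov-type statistic. The Python sketch below calibrates the statistic with a parametric bootstrap (since λ is estimated and the distribution is discrete); the authors' modified test may differ in detail, and the data here are simulated.

    import numpy as np
    from scipy.stats import poisson

    def ks_statistic_poisson(counts, lam):
        """Sup distance between the empirical CDF of integer counts and Poisson(lam)."""
        counts = np.sort(np.asarray(counts))
        n = counts.size
        ecdf_hi = np.arange(1, n + 1) / n      # ECDF just after each observation
        ecdf_lo = np.arange(0, n) / n          # ECDF just before each observation
        cdf = poisson.cdf(counts, lam)
        return max(np.max(np.abs(ecdf_hi - cdf)), np.max(np.abs(ecdf_lo - cdf)))

    def poisson_ks_test(counts, n_boot=2000, seed=1):
        """Parametric-bootstrap p-value for a KS-type test against a fitted Poisson."""
        rng = np.random.default_rng(seed)
        counts = np.asarray(counts)
        lam = counts.mean()
        d_obs = ks_statistic_poisson(counts, lam)
        d_boot = []
        for _ in range(n_boot):
            sim = rng.poisson(lam, size=counts.size)
            d_boot.append(ks_statistic_poisson(sim, sim.mean()))
        p = (1 + np.sum(np.array(d_boot) >= d_obs)) / (n_boot + 1)
        return d_obs, p

    # Simulated substitution counts per branch, stand-ins for a Bayesian tree ensemble.
    branch_lengths = np.random.default_rng(0).poisson(4.2, size=700)
    print(poisson_ks_test(branch_lengths))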

A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data

Amanda J Lea, Susan C Albert, Jenny Tung, Xiang Zhou
doi: http://dx.doi.org/10.1101/019562

Identifying sources of variation in DNA methylation levels is important for understanding gene regulation. Recently, bisulfite sequencing has become a popular tool for estimating DNA methylation levels at base-pair resolution, and for investigating the major drivers of epigenetic variation. However, modeling bisulfite sequencing data presents several challenges. Methylation levels are estimated from proportional read counts, yet coverage can vary dramatically across sites and samples. Further, methylation levels are influenced by genetic variation, and controlling for genetic covariance (e.g., kinship or population structure) is crucial for avoiding potential false positives. To address these challenges, we combine a binomial mixed model with an efficient sampling-based algorithm (MACAU) for approximate parameter estimation and p-value computation. This framework allows us to account for both the over-dispersed, count-based nature of bisulfite sequencing data, as well as genetic relatedness among individuals. Furthermore, by leveraging the advantages of an auxiliary variable-based sampling algorithm and recent mixed model innovations, MACAU substantially reduces computational complexity and can thus be applied to large, genome-wide data sets. Using simulations and two real data sets (whole genome bisulfite sequencing (WGBS) data from Arabidopsis thaliana and reduced representation bisulfite sequencing (RRBS) data from baboons), we show that, compared to existing approaches, our method provides better calibrated test statistics in the presence of population structure. Further, it improves power to detect differentially methylated sites: in the RRBS data set, MACAU detected 1.6-fold more age-associated CpG sites than a beta-binomial model (the next best approach). Changes in these sites are consistent with known age-related shifts in DNA methylation levels, and are enriched near genes that are differentially expressed with age in the same population. Taken together, our results indicate that MACAU is an effective tool for analyzing bisulfite sequencing data, with particular salience to analyses of structured populations. MACAU is freely available at http://www.xzlab.org/software.html.
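
As a point of reference, a naive per-site analysis that ignores both relatedness and overdispersion (exactly the deficiencies MACAU addresses) can be written as a binomial GLM on methylated/total read counts, as in the hedged Python sketch below; the data are simulated and the code is not part of MACAU.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data for one CpG site: methylated and total read counts per
    # individual, plus a covariate of interest (here, age).
    rng = np.random.default_rng(2)
    n = 60
    age = rng.uniform(2, 25, size=n)
    total = rng.poisson(30, size=n) + 1
    true_p = 1 / (1 + np.exp(-(-0.5 + 0.05 * age)))
    methylated = rng.binomial(total, true_p)

    endog = np.column_stack([methylated, total - methylated])   # (successes, failures)
    exog = sm.add_constant(age)
    fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
    print(fit.params, fit.pvalues)   # slope on age tests for differential methylation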

Roary: Rapid large-scale prokaryote pan genome analysis

Andrew J Page, Carla A Cummins, Martin Hunt, Vanessa K Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Jacqueline A Keane, Julian Parkhill
doi: http://dx.doi.org/10.1101/019315

A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and dispensable accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising the accuracy of results. Using a single CPU, Roary can produce a pan genome from 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors.
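
Downstream of a pan-genome tool, the core/accessory split is easy to summarise from a gene presence/absence matrix; the Python sketch below uses a small hypothetical matrix and a 99% presence threshold rather than Roary's actual output format or definitions.

    import pandas as pd

    # Hypothetical presence/absence matrix: rows are gene clusters, columns are isolates,
    # entries are 1 if the gene is present in that isolate.
    pa = pd.DataFrame(
        {"isolate1": [1, 1, 1, 0], "isolate2": [1, 1, 0, 1], "isolate3": [1, 1, 0, 0]},
        index=["groupA", "groupB", "groupC", "groupD"],
    )
    freq = pa.mean(axis=1)
    core = freq[freq >= 0.99].index          # present in (nearly) all isolates
    accessory = freq[freq < 0.99].index      # the dispensable genome
    print(len(core), "core genes;", len(accessory), "accessory genes")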

Controlling False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

David M Rocke, Luyao Ruan, J. Jared Gossett, Blythe Durbin-Johnson, Sharon Aviran
doi: http://dx.doi.org/10.1101/018739

We review existing methods for the analysis of RNA-Seq data and place them in a common framework of a sequence of tasks that are usually part of the process. We show that many existing methods produce large numbers of false positives in cases where the null hypothesis is true by construction and where actual data from RNA-Seq studies are used, as opposed to simulations that make specific assumptions about the nature of the data. We show that some of the mathematical assumptions these methods make about the data are likely among the causes of the false positives, and we define a general structure that does not appear to be subject to these problems. The best performance was shown by limma-voom and by some simple methods composed of easily understandable steps.
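
The "null hypothesis true by construction" idea can be mimicked on any count matrix: split samples from a single condition into two arbitrary pseudo-groups, test every gene, and check that the fraction called significant matches the nominal level. The Python sketch below does this with simulated negative-binomial counts and a simple per-gene t-test on log counts; it illustrates the evaluation strategy, not the specific pipelines benchmarked in the paper.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n_genes, n_samples = 5000, 12
    counts = rng.negative_binomial(n=5, p=0.1, size=(n_genes, n_samples))  # one condition only
    logcpm = np.log2(counts + 1)

    # Arbitrary split of the single condition into two pseudo-groups of six samples.
    labels = rng.permutation([0] * 6 + [1] * 6)
    _, pvals = stats.ttest_ind(logcpm[:, labels == 0], logcpm[:, labels == 1], axis=1)
    print("fraction called at p < 0.05:", np.mean(pvals < 0.05))  # should be near 0.05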

Fine-mapping cellular QTLs with RASQUAL and ATAC-seq

Natsuhiko Kumasaka, Andrew Knights, Daniel Gaffney
doi: http://dx.doi.org/10.1101/018788

When cellular traits are measured using high-throughput DNA sequencing, quantitative trait loci (QTLs) manifest at two levels: population-level differences between individuals and allelic differences between cis-haplotypes within individuals. We present RASQUAL (Robust Allele Specific QUAntitation and quality controL), a novel statistical approach for association mapping that integrates genetic effects and robust modelling of biases in next-generation sequencing (NGS) data within a single, probabilistic framework. RASQUAL substantially improves causal variant localisation and sensitivity of association detection over existing methods in RNA-seq, DNaseI-seq and ChIP-seq data. We illustrate how RASQUAL can be used to maximise association detection by generating the first map of chromatin accessibility QTLs (caQTLs) in a European population using ATAC-seq. Despite a modest sample size, we identified 2,706 independent caQTLs (FDR 10%) and illustrate how RASQUAL’s improved causal variant localisation provides powerful information for fine-mapping disease-associated variants. We also map “multipeak” caQTLs, i.e. identical genetic associations found across multiple, independent open chromatin regions, and illustrate how genetic signals in ATAC-seq data can be used to link distal regulatory elements with gene promoters. Our results highlight how joint modelling of population and allele-specific genetic signals can improve functional interpretation of noncoding variation.
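
A deliberately stripped-down view of the allele-specific signal RASQUAL exploits is a binomial test for allelic imbalance at a heterozygous SNP within a feature, as in the Python sketch below (hypothetical read counts; requires scipy >= 1.7 for binomtest). RASQUAL itself jointly models this allelic signal with the population-level signal and with biases such as mapping errors, genotyping error and overdispersion.

    from scipy.stats import binomtest

    # At a heterozygous site, reads should cover both haplotypes roughly equally under
    # the null; a skewed reference/alternative split suggests a cis-regulatory effect.
    ref_reads, alt_reads = 72, 41            # hypothetical counts at one heterozygous site
    result = binomtest(ref_reads, ref_reads + alt_reads, p=0.5)
    print("allelic ratio:", ref_reads / (ref_reads + alt_reads), "p =", result.pvalue)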