Efficient Bayesian mixed model analysis increases association power in large cohorts

Po-Ru Loh, George Tucker, Brendan K Bulik-Sullivan, Bjarni J Vilhjalmsson, Hilary K Finucane, Daniel I Chasman, Paul M Ridker, Benjamin M Neale, Bonnie Berger, Nick Patterson, Alkes L Price
doi: http://dx.doi.org/10.1101/007799

Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require O(MN^2) running time (where N = #samples and M = #SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here, we present a far more efficient mixed model association method, BOLT-LMM, which requires only a small number of O(MN) iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for GWAS in large cohorts.
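The contrast between the infinitesimal prior and a non-infinitesimal mixture prior can be illustrated with a toy sketch. This is not BOLT-LMM's code: the point-normal mixture below stands in for its actual mixture-of-Gaussians prior, and all function names are illustrative.

```python
import random

def sample_effects_infinitesimal(m, sigma2, rng):
    """Infinitesimal model: every SNP effect is drawn from N(0, sigma2/m)."""
    sd = (sigma2 / m) ** 0.5
    return [rng.gauss(0.0, sd) for _ in range(m)]

def sample_effects_mixture(m, sigma2, p_causal, rng):
    """Non-infinitesimal sketch: only a fraction p_causal of SNPs is causal,
    each with correspondingly larger variance; the rest have zero effect."""
    sd = (sigma2 / (m * p_causal)) ** 0.5
    return [rng.gauss(0.0, sd) if rng.random() < p_causal else 0.0
            for _ in range(m)]

rng = random.Random(42)
infi = sample_effects_infinitesimal(10_000, 0.5, rng)
mix = sample_effects_mixture(10_000, 0.5, 0.01, rng)
# Under the mixture prior most effects are exactly zero, while the
# infinitesimal prior spreads the same variance over all SNPs.
frac_zero = sum(1 for b in mix if b == 0.0) / len(mix)
```

Modeling the sparse architecture directly is what allows the residuals used in the association test to be "cleaned" of large-effect loci, which is the source of the power gain the abstract describes.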

Genome-wide predictability of restriction sites across the eukaryotic tree of life

Santiago Herrera, Paula H. Reyes-Herrera, Timothy M. Shank
doi: http://dx.doi.org/10.1101/007781

High-throughput sequencing of reduced-representation libraries obtained through digestion with restriction enzymes, generally known as restriction-site associated DNA sequencing (RAD-seq), is now one of the most commonly used strategies to generate single nucleotide polymorphism data in eukaryotes. The choice of restriction enzyme is critical for the design of any RAD-seq study, as it determines the number of genetic markers that can be obtained for a given species and, ultimately, the success of a project. In this study we tested the hypothesis that genome composition, in terms of GC content and mono-, di- and trinucleotide compositions, can be used to predict the number of restriction sites for a given combination of restriction enzyme and genome. We performed systematic in silico genome-wide surveys of restriction sites across the eukaryotic tree of life and compared them with expectations generated from stochastic models based on genome composition, using the newly developed software pipeline PredRAD (https://github.com/phrh/PredRAD). Our analyses reveal that in most cases the trinucleotide genome composition model is the best predictor, while the GC content and mononucleotide models are the worst predictors, of the expected number of restriction sites in a eukaryotic genome. However, we argue that the predictability of restriction site frequencies in eukaryotic genomes needs to be treated on a case-by-case basis, because the phylogenetic position of the taxon of interest and the specific recognition sequence of the selected restriction enzyme are the main determining factors. The results from this study, and the software developed, will help guide the design of any study using RAD sequencing and related methods.
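As a sketch of the kind of comparison involved, the following (hypothetical code, not the PredRAD pipeline) counts occurrences of a palindromic recognition sequence (SbfI, CCTGCAGG) in a toy sequence and computes the expectation under the simplest composition model, the mononucleotide one:

```python
def count_sites(genome, motif):
    """Exact in-silico count of a restriction enzyme's recognition sites."""
    count, start = 0, 0
    while True:
        i = genome.find(motif, start)
        if i == -1:
            return count
        count += 1
        start = i + 1  # allow overlapping occurrences

def expected_sites_mononucleotide(genome, motif):
    """Expected count under a mononucleotide (base-composition) model:
    product of individual base frequencies, times the number of positions."""
    n = len(genome)
    freq = {b: genome.count(b) / n for b in "ACGT"}
    p = 1.0
    for b in motif:
        p *= freq[b]
    return p * (n - len(motif) + 1)

genome = "ACGT" * 2500 + "CCTGCAGG" * 3  # toy 10,024 bp sequence
motif = "CCTGCAGG"                        # SbfI recognition site
observed = count_sites(genome, motif)
expected = expected_sites_mononucleotide(genome, motif)
```

Higher-order (di- and trinucleotide) models replace the independent base frequencies with Markov-chain transition probabilities, which is why they better capture local sequence dependencies.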

Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

Seyhan Yazar, George EC Gooden, David A Mackey, Alex Hewitt
doi: http://dx.doi.org/10.1101/007724

A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (an E. coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E. coli and 53.5% (95% CI: 34.4-72.6) for the human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for the E. coli and human assemblies, respectively. Thus, GCE was found to outperform EMR in terms of both cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we provide ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.
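The headline percentages are relative cost differences. A minimal sketch of how such a figure is computed (the dollar amounts below are hypothetical illustrations, not the paper's raw costs):

```python
def percent_more_expensive(cost_a, cost_b):
    """How much more expensive A is than B, expressed in percent of B."""
    return (cost_a - cost_b) / cost_b * 100.0

# Hypothetical illustration: if an EMR run cost $35.74 and the
# equivalent GCE run cost $10.00, EMR would be 257.4% more expensive.
emr, gce = 35.74, 10.00
diff = percent_more_expensive(emr, gce)
```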

Calling genotypes from public RNA-sequencing data enables identification of genetic variants that affect gene-expression levels

Patrick Deelen, Daria Zhernakova, Mark de Haan, Marijke van der Sijde, Marc Jan Bonder, Juha Karjalainen, K. Joeri van der Velde, Kristin M. Abbott, Jingyuan Fu, Cisca Wijmenga, Richard J. Sinke, Morris A. Swertz, Lude Franke
doi: http://dx.doi.org/10.1101/007633

Given the increasing number of RNA-seq samples in the public domain, we studied to what extent expression quantitative trait loci (eQTLs) and allele-specific expression (ASE) can be identified in public RNA-seq data while also deriving the genotypes from the RNA-seq reads. 4,978 human RNA-seq runs, representing many different tissues and cell types, passed quality control. Even though these data originated from many different laboratories, samples reflecting the same cell type clustered together, suggesting that technical biases due to different sequencing protocols were limited. We derived genotypes from the RNA-seq reads and imputed non-coding variants. In a joint analysis of 1,262 samples, we identified cis-eQTL effects for 8,034 unique genes. Additionally, we observed strong ASE effects for 34 rare pathogenic variants, corroborating previously observed effects on the corresponding protein levels. Given the exponential growth of the number of publicly available RNA-seq samples, we expect this approach will become relevant for studying tissue-specific effects of rare pathogenic genetic variants.
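A common way to flag ASE at a heterozygous site is a binomial test on reference versus alternative read counts. A self-contained sketch (the read counts are invented for illustration, and the paper's actual ASE procedure may differ):

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads):
    """Two-sided binomial test: under the null of no allele-specific
    expression, ref and alt reads at a heterozygous site are equally
    likely, so the ref count follows Binomial(n, 0.5)."""
    n = ref_reads + alt_reads
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    obs = probs[ref_reads]
    # Sum probabilities of all outcomes at least as likely-extreme as observed.
    return min(sum(p for p in probs if p <= obs + 1e-12), 1.0)

balanced = binomial_two_sided_p(48, 52)  # consistent with no ASE
skewed = binomial_two_sided_p(90, 10)    # strong ASE signal
```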

Exploiting evolutionary non-commutativity to prevent the emergence of bacterial antibiotic resistance

Daniel Nichol, Peter Jeavons, Alexander G Fletcher, Robert A Bonomo, Philip K Maini, Jerome L Paul, Robert A Gatenby, Alexander RA Anderson, Jacob G Scott
doi: http://dx.doi.org/10.1101/007542

The increasing rate of antibiotic resistance and the slowing discovery of novel antibiotic treatments present a growing threat to public health. In the present study we develop a Markov chain model of evolution in asexually reproducing populations, which we use to illustrate that different selection pressures do not commute. We demonstrate that the emergence of resistant individuals can be both hindered and promoted by careful orderings of drug application. This suggests a new strategy in the war against antibiotic-therapy-resistant organisms: rational drug ordering to shepherd evolution through genotype space to states corresponding to greater sensitivity to antibiotic treatment. The model we present is an encoding of the 'Strong Selection Weak Mutation' model of evolution on fitness landscapes within a Markov chain, which associates the global properties of the fitness landscape with the algebraic properties of the Markov chain transition matrix. Through this association we derive results on the non-commutativity and irreversibility of natural selection.
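The non-commutativity itself is easy to see in a minimal two-genotype toy model (the transition probabilities below are invented for illustration; the paper's genotype spaces and fitness landscapes are far richer):

```python
def mat_mul(a, b):
    """Multiply two row-stochastic transition matrices (lists of rows)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apply_chain(dist, *mats):
    """Propagate a genotype distribution through drugs applied in order."""
    for m in mats:
        n = len(m)
        dist = [sum(dist[i] * m[i][j] for i in range(n)) for j in range(n)]
    return dist

# Toy 2-genotype model (state 0 = wild type, state 1 = resistant).
# Each matrix encodes one drug's selection pressure as per-step
# transition probabilities between genotypes.
drug_a = [[0.9, 0.1],
          [0.0, 1.0]]
drug_b = [[1.0, 0.0],
          [0.6, 0.4]]

start = [1.0, 0.0]                            # all wild type
a_then_b = apply_chain(start, drug_a, drug_b)  # resistant fraction 0.04
b_then_a = apply_chain(start, drug_b, drug_a)  # resistant fraction 0.10
```

Because the transition matrices do not commute, the two orderings end in different distributions over genotypes, with one ordering leaving a smaller resistant fraction, which is exactly the leverage the proposed drug-ordering strategy exploits.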

A comparison of control samples for ChIP-seq of histone modifications

Christoffer Flensburg, Sarah A Kinkel, Andrew Keniry, Marnie Blewitt, Alicia Oshlack
doi: http://dx.doi.org/10.1101/007609

The advent of high-throughput sequencing has allowed genome-wide profiling of histone modifications by chromatin immunoprecipitation (ChIP) followed by sequencing (ChIP-seq). In this assay the histone mark of interest is enriched through a chromatin pull-down assay using an antibody against the mark. Due to imperfect antibodies and other factors, many of the sequenced fragments do not originate from the histone mark of interest and are referred to as background reads. Background reads are not uniformly distributed, and therefore control samples are usually used to estimate the background distribution at any given genomic position. The Encyclopedia of DNA Elements (ENCODE) Consortium guidelines suggest sequencing a whole cell extract (WCE, or “input”) sample, or a mock ChIP reaction such as an IgG control, as a background sample. However, for a histone modification ChIP-seq investigation it is also possible to use a histone H3 (H3) pull-down to map the underlying distribution of histones. In this paper we generated data from a hematopoietic stem and progenitor cell population isolated from mouse foetal liver to compare WCE and H3 ChIP-seq as control samples. The quality of the control samples is estimated by comparison to pull-downs of histone modifications and to expression data. We find minor differences between WCE and H3 ChIP-seq, such as coverage in mitochondria and behaviour close to transcription start sites. Where the two controls differ, the H3 pull-down is generally more similar to the ChIP-seq of histone modifications. However, the differences between H3 and WCE have a negligible impact on the quality of a standard analysis.

Leveraging local identity-by-descent increases the power of case/control GWAS with related individuals

Joshua N. Sampson, Bill Wheeler, Peng Li, Jianxin Shi
(Submitted on 31 Jul 2014)

Large case/control genome-wide association studies (GWAS) often include groups of related individuals with known relationships. When testing for associations at a given locus, current methods incorporate only the familial relationships between individuals. Here, we introduce the chromosome-based Quasi-Likelihood Score (cQLS) statistic, which incorporates local identity-by-descent (IBD) to increase the power to detect associations. In studies robust to population stratification, such as those with case/control sibling pairs, simulations show that study power can be increased by over 50%. In our example, a GWAS examining late-onset Alzheimer’s disease, the p-values among the most strongly associated SNPs in the APOE gene tend to decrease, with the smallest falling from 1.23×10−8 to 7.70×10−9. Furthermore, as part of our simulations, we reevaluate our expectations about the use of families in GWAS. We show that, although adding only half as many unique chromosomes, genotyping affected siblings is more efficient than genotyping randomly ascertained cases. We also show that genotyping cases with a family history of disease will be less beneficial when searching for SNPs with smaller effect sizes.

Fast Genome-Wide QTL Association Mapping on Pedigree and Population Data

Hua Zhou, John Blangero, Thomas D Dyer, Kei-hang K Chan, Eric M Sobel, Kenneth Lange
(Submitted on 31 Jul 2014)

Since most current analysis software for genome-wide association studies (GWAS) accommodates only unrelated individuals, there is a need for efficient applications that can handle general pedigree data or mixtures of population and pedigree data. Even data sets thought to consist of only unrelated individuals may include cryptic relationships that can lead to false positives if not discovered and controlled for. In addition, family designs possess compelling advantages. They are better equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Pedigrees selected for extreme trait values often segregate a single gene with strong effect. Finally, many pedigrees are available as an important legacy from the era of linkage analysis. Unfortunately, pedigree likelihoods are notoriously hard to compute. In this paper we re-examine the computational bottlenecks and implement ultra-fast pedigree-based GWAS analysis. Kinship coefficients can either be based on explicitly provided pedigrees or automatically estimated from dense markers. Our strategy (a) works for random sample data, pedigree data, or a mix of both; (b) entails no loss of power; (c) allows for any number of covariate adjustments, including correction for population stratification; (d) allows for testing SNPs under additive, dominant, and recessive models; and (e) accommodates both univariate and multivariate quantitative traits. On a typical personal computer (6 CPU cores at 2.67 GHz), analyzing a univariate HDL (high-density lipoprotein) trait from the San Antonio Family Heart Study (935,392 SNPs on 1,357 individuals in 124 pedigrees) takes less than 2 minutes and 1.5 GB of memory. Complete multivariate QTL analysis of the three time points of the longitudinal HDL trait takes less than 5 minutes and 1.5 GB of memory.
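A minimal sketch of the marker-based empirical kinship estimate mentioned above: standardize each SNP column of the genotype matrix, then take K = ZZ^T / M. (Per-SNP sample standardization here is a simplification; implementations often standardize by sqrt(2p(1-p)) under Hardy-Weinberg assumptions, and the exact estimator in this paper may differ.)

```python
def empirical_kinship(genotypes):
    """Estimate a genetic relationship (kinship-like) matrix from dense
    markers: standardize each SNP column, then K = Z Z^T / M."""
    n = len(genotypes)      # individuals
    m = len(genotypes[0])   # SNPs
    z = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((g - mean) ** 2 for g in col) / n
        sd = var ** 0.5 or 1.0  # guard against monomorphic SNPs
        for i in range(n):
            z[i][j] = (genotypes[i][j] - mean) / sd
    return [[sum(z[i][k] * z[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]

# Toy 0/1/2 allele-dosage matrix: 4 individuals x 6 SNPs
# (individuals 0 and 1 are identical by construction).
G = [[0, 1, 2, 1, 0, 2],
     [0, 1, 2, 1, 0, 2],
     [2, 1, 0, 1, 2, 0],
     [1, 0, 1, 2, 1, 1]]
K = empirical_kinship(G)
```

The resulting matrix is symmetric, and identical genotype rows yield identical kinship entries, which is how cryptic relatedness surfaces even without a declared pedigree.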

Fast Genome-Wide QTL Analysis Using Mendel

Hua Zhou, Jin Zhou, Tao Hu, Eric M Sobel, Kenneth Lange
(Submitted on 31 Jul 2014)

Pedigree GWAS (Option 29) in the current version of the Mendel software is an optimized subroutine for performing large-scale genome-wide QTL analysis. This analysis (a) works for random sample data, pedigree data, or a mix of both; (b) is highly efficient in both run time and memory requirements; (c) accommodates both univariate and multivariate traits; (d) works for autosomal and X-linked loci; (e) correctly deals with missing data in traits, covariates, and genotypes; (f) allows for covariate adjustment and constraints among parameters; (g) uses either a theoretical or a SNP-based empirical kinship matrix for additive polygenic effects; (h) allows extra variance components such as dominant polygenic effects and household effects; (i) detects and reports outlier individuals and pedigrees; and (j) allows for robust estimation via the t-distribution. The current paper assesses these capabilities on the Genetic Analysis Workshop 19 (GAW19) sequencing data. We analyzed simulated and real phenotypes for both family and random sample data sets. For instance, when jointly testing the 8 longitudinally measured systolic blood pressure (SBP) and diastolic blood pressure (DBP) traits, it takes Mendel 78 minutes on a standard laptop computer to read, quality check, and analyze a data set with 849 individuals and 8.3 million SNPs. Genome-wide eQTL analysis of 20,643 expression traits on 641 individuals with 8.3 million SNPs takes 30 hours using 20 parallel runs on a cluster. Mendel is freely available at \url{this http URL}.

Fast Bayesian Feature Selection for High Dimensional Linear Regression in Genomics via the Ising Approximation

Charles K. Fisher, Pankaj Mehta
(Submitted on 30 Jul 2014)

Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach, the Bayesian Ising Approximation (BIA), to rapidly calculate posterior probabilities for feature relevance in L2-penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high-dimensional regression by analyzing a gene expression dataset with nearly 30,000 features.
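The mean-field step can be sketched generically. The code below is the standard damped mean-field Ising self-consistency iteration, m_i = tanh(h_i + sum_j J_ij m_j); the specific fields h and couplings J that the BIA derives from the regression design matrix are not reproduced here, and the toy values are invented for illustration.

```python
from math import tanh

def mean_field_magnetizations(h, J, n_iter=200, damping=0.5):
    """Damped fixed-point iteration of the mean-field Ising equations
    m_i = tanh(h_i + sum_j J[i][j] * m_j). In the BIA setting, the
    magnetizations map to posterior probabilities of feature relevance."""
    n = len(h)
    m = [0.0] * n
    for _ in range(n_iter):
        new = [tanh(h[i] + sum(J[i][j] * m[j] for j in range(n) if j != i))
               for i in range(n)]
        m = [damping * new[i] + (1.0 - damping) * m[i] for i in range(n)]
    return m

# Toy 3-feature example: feature 0 carries a strong positive field
# (favoring inclusion), feature 2 a strong negative one, and
# feature 1 is only weakly coupled to feature 0.
h = [1.5, 0.0, -1.5]
J = [[0.0, 0.3, 0.0],
     [0.3, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
m = mean_field_magnetizations(h, J)
```

Because each sweep is a simple matrix-vector-style update, tracing the magnetizations as the L2 penalty varies yields the fast feature selection path the abstract describes.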