Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

Claudia Giambartolomei (1), Damjan Vukcevic (2), Eric E. Schadt (3), Aroon D. Hingorani (1), Chris Wallace (4), Vincent Plagnol (1) ((1) University College London (UCL), London, UK, (2) Royal Children’s Hospital, Melbourne, Australia, (3) Mount Sinai School of Medicine, New York USA, (4) University of Cambridge, Cambridge, UK)
(Submitted on 17 May 2013)

Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, notably cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. A key feature of the method is the ability to derive its output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at this http URL). We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including > 100,000 individuals of European ancestry. Our co-localisation results are broadly consistent with the conclusion from the published meta-analysis. Combining all lipid biomarkers, our re-analysis supported 29 out of 38 reported co-localisation results with eQTLs. Two clearly discordant findings (IFT172, CPNE1), as well as multiple new co-localisation results, highlight the value of a formal systematic statistical test. Our findings provide information about the causal gene in associated intervals and have direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.

Meta-Analysis of Gene Level Association Tests

Dajiang J. Liu, Gina M. Peloso, Xiaowei Zhan, Oddgeir Holmen, Matthew Zawistowski, Shuang Feng, Majid Nikpay, Paul L. Auer, Anuj Goel, He Zhang, Ulrike Peters, Martin Farrall, Marju Orho-Melander, Charles Kooperberg, Ruth McPherson, Hugh Watkins, Cristen J. Willer, Kristian Hveem, Olle Melander, Sekar Kathiresan, Gonçalo R. Abecasis
(Submitted on 6 May 2013)

The vast majority of connections between complex disease and common genetic variants were identified through meta-analysis, a powerful approach that enables large sample sizes while protecting against common artifacts due to population structure, repeated small sample analyses, and/or limitations with sharing individual level data. As the focus of genetic association studies shifts to rare variants, genes and other functional units are becoming the unit of analysis. Here, we propose and evaluate new approaches for meta-analysis of rare variant association. We show that our approach retains useful features of single variant meta-analytic approaches and demonstrate its utility in a study of blood lipid levels in ~18,500 individuals genotyped with exome arrays.

XORRO: Rapid Paired-End Read Overlapper

Russell J. Dickson, Gregory B. Gloor
(Submitted on 16 Apr 2013)

Background: Computational analysis of next-generation sequencing data is outpaced by data generation in many cases. In one such case, paired-end reads can be produced from the Illumina sequencing method faster than they can be overlapped by downstream analysis. The advantages in read length and accuracy provided by overlapping paired-end reads demonstrate the necessity for software to efficiently solve this problem.
Results: XORRO is an extremely efficient paired-end read overlapping program. XORRO can overlap millions of short paired-end reads in a few minutes. It uses 64-bit registers with a two bit alphabet to represent sequences and does comparisons using low-level logical operations like XOR, AND, bitshifting and popcount.
Conclusions: As of the writing of this manuscript, XORRO provides the fastest solution to the paired-end read overlap problem. XORRO is available for download at: sourceforge.net/projects/xorro-overlap/
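
The bitwise comparison described in the Results can be illustrated with a minimal Python sketch. It is not XORRO's implementation: the function names (`pack`, `mismatches`, `best_overlap`) and the parameters `min_len` and `max_mismatch` are hypothetical, Python integers are arbitrary-precision rather than 64-bit registers, and the second read is assumed to be already reverse-complemented.

```python
# Two-bit alphabet: each base occupies two bits of an integer.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into an int, first base in the highest bits."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(a, b, length):
    """Count differing bases between two packed sequences of `length` bases."""
    diff = a ^ b  # XOR: a 2-bit group is non-zero where the bases differ
    low_mask = (4 ** length - 1) // 3  # binary 0101...01, one bit per base
    # Fold the high bit of each group onto its low bit, then keep low bits.
    collapsed = (diff | (diff >> 1)) & low_mask
    return bin(collapsed).count("1")  # popcount


def best_overlap(read1, read2_rc, min_len=4, max_mismatch=0):
    """Longest suffix of read1 matching a prefix of read2_rc (0 if none)."""
    a, b = pack(read1), pack(read2_rc)
    n2 = len(read2_rc)
    for k in range(min(len(read1), n2), min_len - 1, -1):
        suffix = a & ((1 << (2 * k)) - 1)  # last k bases of read1
        prefix = b >> (2 * (n2 - k))       # first k bases of read2_rc
        if mismatches(suffix, prefix, k) <= max_mismatch:
            return k
    return 0


if __name__ == "__main__":
    print(best_overlap("ACGTACGGTTCA", "GGTTCAAGCT"))
```

Each candidate overlap costs one XOR, one shift, one AND and one popcount regardless of its length, which is the property that makes this representation attractive on fixed-width registers.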

High-speed and accurate color-space short-read alignment with CUSHAW2

Yongchao Liu, Bernt Popp, Bertil Schmidt
(Submitted on 17 Apr 2013)

Summary: We present an extension of CUSHAW2 for fast and accurate alignments of SOLiD color-space short-reads. Our extension introduces a double-seeding approach to improve mapping sensitivity, by combining maximal exact match seeds and variable-length seeds derived from local alignments. We have compared the performance of CUSHAW2 to SHRiMP2 and BFAST by aligning both simulated and real color-space mate-paired reads to the human genome. The results show that CUSHAW2 achieves comparable or better alignment quality compared to SHRiMP2 and BFAST at an order-of-magnitude faster speed and significantly smaller peak resident memory size.
Availability: CUSHAW2 and all simulated datasets are available at this http URL
Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de

The Convergence of eQTL Mapping, Heritability Estimation and Polygenic Modeling: Emerging Spectrum of Risk Variation in Bipolar Disorder

Eric R. Gamazon, Hae Kyung Im, Chunyu Liu, Members of the Bipolar Disorder Genome Study (BiGS) Consortium, Dan L. Nicolae, Nancy J. Cox
(Submitted on 25 Mar 2013)

It is widely held that a substantial genetic component underlies Bipolar Disorder (BD) and other neuropsychiatric disease traits. Recent efforts have been aimed at understanding the genetic basis of disease susceptibility, with genome-wide association studies (GWAS) unveiling some promising associations. Nevertheless, the genetic etiology of BD remains elusive with a substantial proportion of the heritability – which has been estimated to be 80% based on twin and family studies – unaccounted for by the specific genetic variants identified by large-scale GWAS. Furthermore, functional understanding of associated loci generally lags discovery. Studies we report here provide considerable support to the claim that substantially more remains to be gained from GWAS on the genetic mechanisms underlying BD susceptibility, and that a large proportion of the variation in disease risk may be uncovered through integrative functional genomic approaches. We combine recent analytic advances in heritability estimation and polygenic modeling and leverage recent technological advances in the generation of -omics data to evaluate the nature and scale of the contribution of functional classes of genetic variation to a relatively intractable disorder. We identified cis eQTLs in cerebellum and parietal cortex that capture more than half of the total heritability attributable to SNPs interrogated through GWAS and showed that eQTL-based heritability estimation is highly tissue-dependent. Our findings show that a much greater resolution may be attained than has been reported thus far on the number of common loci that capture a substantial proportion of the heritability to disease risk and that the functional nature of contributory loci may be clarified en masse.

Comprehensive Detection of Genes Causing a Phenotype using Phenotype Sequencing and Pathway Analysis

Marc Harper, Luisa Gronenberg, James Liao, Christopher Lee
(Submitted on 3 Mar 2013)

Discovering all the genetic causes of a phenotype is an important goal in functional genomics. In this paper we combine an experimental design for multiple independent detections of the genetic causes of a phenotype, with a high-throughput sequencing analysis that maximizes sensitivity for comprehensively identifying them. Testing this approach on a set of 24 mutant strains generated for a metabolic phenotype with many known genetic causes, we show that this pathway-based phenotype sequencing analysis greatly improves sensitivity of detection compared with previous methods, and reveals a wide range of pathways that can cause this phenotype. We demonstrate our approach on a metabolic re-engineering phenotype, the PEP/OAA metabolic node in E. coli, which is crucial to a substantial number of metabolic pathways and under renewed interest for biofuel research. Out of 2157 mutations in these strains, pathway-phenoseq discriminated just five gene groups (12 genes) as statistically significant causes of the phenotype. Experimentally, these five gene groups, and the next two high-scoring pathway-phenoseq groups, either have a clear connection to the PEP metabolite level or offer an alternative path of producing oxaloacetate (OAA), and thus clearly explain the phenotype. These high-scoring gene groups also show strong evidence of positive selection pressure, compared with strictly neutral selection in the rest of the genome.

Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description

Marc Santolini, Thierry Mora, Vincent Hakim
(Submitted on 18 Feb 2013)

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair independently contributes to the transcription factor (TF) binding, despite mounting evidence of interdependence between base-pair positions. The recent availability of genome-wide data on TF-bound DNA regions offers the possibility to revisit this question in detail for TF binding in vivo. Here, we use available fly and mouse ChIPseq data, and show that the independent model generally does not reproduce the observed statistics of TFBS, generalizing previous observations. We further show that TFBS description and predictability can be systematically improved by taking into account pairwise correlations in the TFBS via the principle of maximum entropy. The resulting pairwise interaction model is formally equivalent to the disordered Potts models of statistical mechanics and it generalizes previous approaches to interdependent positions. Its structure allows for co-variation of two or more base pairs, as well as secondary motifs. Although models consisting of mixtures of PWMs also have this last feature, we show that pairwise interaction models outperform them. The significant pairwise interactions are found to be sparse and found dominantly between consecutive base pairs. Finally, the use of a pairwise interaction model for the identification of TFBSs is shown to give significantly different predictions than a model based on independent positions.
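
For readers unfamiliar with the independent-position baseline that the abstract argues against, the following is a minimal Python sketch of a log-odds PWM. It is an illustration only, not the authors' code: the helper names `build_pwm` and `score`, the pseudocount of 1, and the uniform background of 0.25 are all assumptions, and the pairwise interaction model itself is not implemented here.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned binding sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount to avoid log(0).
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2(counts[b] / total / background) for b in BASES}
        )
    return pwm


def score(pwm, seq):
    """Score a sequence: positions contribute independently and additively."""
    return sum(column[base] for column, base in zip(pwm, seq))


if __name__ == "__main__":
    sites = ["ACGT", "ACGT", "ACGA", "TCGT"]
    pwm = build_pwm(sites)
    print(score(pwm, "ACGT"), score(pwm, "TTTT"))
```

Because `score` is a plain sum over positions, it cannot express a preference for particular pairs of bases occurring together; a pairwise model would add a term for each interacting pair of positions on top of this sum.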

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

Simon Anders, Davis J. McCarthy, Yunshen Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, Mark D. Robinson
(Submitted on 15 Feb 2013)

RNA sequencing (RNA-seq) has been rapidly adopted for the multilayered profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations), while optionally adjusting for other systematic factors that affect the data collection process. There are a number of subtle yet critical aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, so there is a need for guidance on current best practices. This protocol presents a “state-of-the-art” computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and, in particular, two widely used tools, DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 hour, with computation time <1 day, even with modest resources.

Equitability, mutual information, and the maximal information coefficient

Justin B. Kinney, Gurinder S. Atwal
(Submitted on 31 Jan 2013)

Reshef et al. recently proposed a new statistical measure, the “maximal information coefficient” (MIC), for quantifying arbitrary dependencies between pairs of stochastic quantities. MIC is based on mutual information, a fundamental quantity in information theory that is widely understood to serve this need. MIC, however, is not an estimate of mutual information. Indeed, it was claimed that MIC possesses a desirable mathematical property called “equitability” that mutual information lacks. This was not proven; instead it was argued solely through the analysis of simulated data. Here we show that this claim, in fact, is incorrect. First we offer mathematical proof that no (non-trivial) dependence measure satisfies the definition of equitability proposed by Reshef et al. We then propose a self-consistent and more general definition of equitability that follows naturally from the Data Processing Inequality. Mutual information satisfies this new definition of equitability while MIC does not. Finally, we show that the simulation evidence offered by Reshef et al. was artifactual. We conclude that estimating mutual information is not only practical for many real-world applications, but also provides a natural solution to the problem of quantifying associations in large data sets.
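
As a small illustration of the quantities discussed above, the sketch below computes a plug-in estimate of mutual information for discrete samples and checks one instance of the Data Processing Inequality. It is a hedged example, not the estimator used by the authors: the function name `mutual_information` is hypothetical, and the plug-in estimate is only appropriate for discrete data with enough samples per cell (continuous data would need binning or a different estimator).

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    xs = [0, 1, 2, 3]
    ys = list(xs)               # Y is an exact copy of X
    zs = [y // 2 for y in ys]   # Z is a coarse-grained function of Y
    # Data Processing Inequality: processing Y into Z cannot add
    # information about X, so I(X;Z) <= I(X;Y).
    print(mutual_information(xs, ys), mutual_information(xs, zs))
```

In this toy case the copy preserves all of the information about X, while the coarse-graining step discards part of it, in line with the inequality.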