Realistic simulations reveal extensive sample-specificity of RNA-seq biases

Realistic simulations reveal extensive sample-specificity of RNA-seq biases
Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman
(Submitted on 14 Aug 2013)

In line with the importance of RNA-seq, the bioinformatics community has produced numerous data analysis tools incorporating methods to correct sample-specific biases. However, few advanced simulation tools exist to enable benchmarking of competing correction methods. We introduce the first framework to reproduce the properties of individual RNA-seq runs and, by applying it on several datasets, we demonstrate the importance of accounting for sample-specificity in realistic simulations.

Cell-cycle regulated transcription associates with DNA replication timing in yeast and human

Cell-cycle regulated transcription associates with DNA replication timing in yeast and human
Hunter B. Fraser
(Submitted on 8 Aug 2013)

Eukaryotic DNA replication follows a specific temporal program, with some genomic regions consistently replicating earlier than others, yet what determines this program is largely unknown. Highly transcribed regions have been observed to replicate in early S-phase in all plant and animal species studied to date, but this relationship is thought to be absent from both budding yeast and fission yeast. No association between cell-cycle regulated transcription and replication timing has been reported for any species. Here I show that in budding yeast, fission yeast, and human, the genes most highly transcribed during S-phase replicate early, whereas those repressed in S-phase replicate late. Transcription during other cell-cycle phases shows either the opposite correlation with replication timing, or no relation. The relationship is strongest near late-firing origins of replication, which is not consistent with a previously proposed model — that replication timing may affect transcription — and instead suggests a potential mechanism involving the recruitment of limiting replication initiation factors during S-phase. These results suggest that S-phase transcription may be an important determinant of DNA replication timing across eukaryotes, which may explain the well-established association between transcription and replication timing.

Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements

Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements
Weiwei Zhang, Tim D Spector, Panos Deloukas, Jordana T Bell, Barbara E Engelhardt
(Submitted on 9 Aug 2013)

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is important, but current approaches tackle average methylation within a genomic locus and are often limited to specific genomic regions. Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict CpG site methylation levels using as features neighboring CpG site methylation levels and genomic distance, and co-localization with coding regions, CGIs, and regulatory elements from the ENCODE project, among others. Our approach achieves 91% — 94% prediction accuracy of genome-wide methylation levels at single CpG site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs. Our classifier outperforms state-of-the-art methylation classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation status, CpG island status, co-localized DNase I hypersensitive sites, and specific transcription factor binding sites were found to be most predictive of methylation levels. Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict site-specific methylation levels that achieves the best DNA methylation predictive accuracy to date. Furthermore, our method identified genomic features that interact with DNA methylation, elucidating mechanisms involved in DNA methylation modification and regulation, and linking different epigenetic processes.

Bayesian genome assembly and assessment by Markov Chain Monte Carlo sampling

Bayesian genome assembly and assessment by Markov Chain Monte Carlo sampling
Mark Howison, Felipe Zapata, Erika J. Edwards, Casey W. Dunn
(Submitted on 6 Aug 2013)

Most genome assemblers provide a point estimates of the true genome sequences, chosen from among many alternative hypotheses that are supported by the data. We present a Markov Chain Monte Carlo approach to sequence assembly that instead generates a distribution of assembly hypotheses with quantified probabilities. This statistically explicit Bayesian approach to assembly allows the investigator to evaluate alternative assembly hypotheses in a unified framework and propagate uncertainty about genomes assembly to downstream analyses. We implement this approach in a prototype assembler and illustrate its application to the genome of the bacteriophage $\Phi$X174.

Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays

Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays
Heejung Shim, Matthew Stephens
(Submitted on 27 Jul 2013)

Understanding how genetic variants influence cellular-level processes is an important step towards understanding how they influence important organismal-level traits, or “phenotypes”, including human disease susceptibility. To this end scientists are undertaking large-scale genetic association studies that aim to identify genetic variants associated with molecular and cellular phenotypes, such as gene expression, transcription factor binding, or chromatin accessibility. These studies use high-throughput sequencing assays (e.g. RNA-seq, ChIP-seq, DNase-seq) to obtain high-resolution data on how the traits vary along the genome in each sample. However, typical association analyses fail to exploit these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes, or windows of fixed length. Here we develop and apply statistical methods that better exploit the high-resolution data. The key idea is to treat the sequence data as measuring an underlying “function” that varies along the genome, and then, building on wavelet-based methods for functional data analysis, test for association between genetic variants and the underlying function. Applying these methods to identify genetic variants associated with chromatin accessibility (dsQTLs) we find that they identify substantially more associations than a simpler window-based analysis, and in total we identify 772 novel dsQTLs not identified by the original analysis.

Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

Bayesian test for co-localisation between pairs of genetic association studies using summary statistics
Claudia Giambartolomei (1), Damjan Vukcevic (2), Eric E. Schadt (3), Aroon D. Hingorani (1), Chris Wallace (4), Vincent Plagnol (1) ((1) University College London (UCL), London, UK, (2) Royal Children’s Hospital, Melbourne, Australia, (3) Mount Sinai School of Medicine, New York USA, (4) University of Cambridge, Cambridge, UK)
(Submitted on 17 May 2013)

Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. A key feature of the method is the ability to derive the key output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at (this http URL). We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including > 100,000 individuals of European ancestry. Our co-localisation results are broadly consistent with the conclusion from the published meta-analysis. Combining all lipid biomarkers, our re-analysis supported 29 out of 38 reported co-localisation results with eQTLs. Two clearly discordant findings (IFT172, CPNE1), as well as multiple new co-localisation results, highlight the value of a formal systematic statistical test. Our findings provide information about the causal gene in associated intervals and have direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.

Meta-Analysis of Gene Level Association Tests

Meta-Analysis of Gene Level Association Tests
Dajiang J. Liu, Gina M. Peloso, Xiaowei Zhan, Oddgeir Holmen, Matthew Zawistowski, Shuang Feng, Majid Nikpay, Paul L. Auer, Anuj Goel, He Zhang, Ulrike Peters, Martin Farrall, Marju Orho-Melander, Charles Kooperberg, Ruth McPherson, Hugh Watkins, Cristen J. Willer, Kristian Hveem, Olle Melander, Sekar Kathiresan, Gonçalo R. Abecasis
(Submitted on 6 May 2013)

The vast majority of connections between complex disease and common genetic variants were identified through meta-analysis, a powerful approach that enables large samples sizes while protecting against common artifacts due to population structure, repeated small sample analyses, and/or limitations with sharing individual level data. As the focus of genetic association studies shifts to rare variants, genes and other functional units are becoming the unit of analysis. Here, we propose and evaluate new approaches for meta-analysis of rare variant association. We show that our approach retains useful features of single variant meta-analytic approaches and demonstrate its utility in a study of blood lipid levels in ~18,500 individuals genotyped with exome arrays.

XORRO: Rapid Paired-End Read Overlapper

XORRO: Rapid Paired-End Read Overlapper
Russell J. Dickson, Gregory B. Gloor
(Submitted on 16 Apr 2013)

Background: Computational analysis of next-generation sequencing data is outpaced by data generation in many cases. In one such case, paired-end reads can be produced from the Illumina sequencing method faster than they can be overlapped by downstream analysis. The advantages in read length and accuracy provided by overlapping paired-end reads demonstrates the necessity for software to efficiently solve this problem.
Results: XORRO is an extremely efficient paired-end read overlapping program. XORRO can overlap millions of short paired-end reads in a few minutes. It uses 64-bit registers with a two bit alphabet to represent sequences and does comparisons using low-level logical operations like XOR, AND, bitshifting and popcount.
Conclusions: As of the writing of this manuscript, XORRO provides the fastest solution to the paired-end read overlap problem. XORRO is available for download at: sourceforge.net/projects/xorro-overlap/

High-speed and accurate color-space short-read alignment with CUSHAW2

High-speed and accurate color-space short-read alignment with CUSHAW2
Yongchao Liu, Bernt Popp, Bertil Schmidt
(Submitted on 17 Apr 2013)

Summary: We present an extension of CUSHAW2 for fast and accurate alignments of SOLiD color-space short-reads. Our extension introduces a double-seeding approach to improve mapping sensitivity, by combining maximal exact match seeds and variable-length seeds derived from local alignments. We have compared the performance of CUSHAW2 to SHRiMP2 and BFAST by aligning both simulated and real color-space mate-paired reads to the human genome. The results show that CUSHAW2 achieves comparable or better alignment quality compared to SHRiMP2 and BFAST at an order-of-magnitude faster speed and significantly smaller peak resident memory size. Availability: CUSHAW2 and all simulated datasets are available at this http URL Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de

The Convergence of eQTL Mapping, Heritability Estimation and Polygenic Modeling: Emerging Spectrum of Risk Variation in Bipolar Disorder

The Convergence of eQTL Mapping, Heritability Estimation and Polygenic Modeling: Emerging Spectrum of Risk Variation in Bipolar Disorder
Eric R. Gamazon, Hae Kyung Im, Chunyu Liu, Members of the Bipolar Disorder Genome Study (BiGS) Consortium, Dan L. Nicolae, Nancy J. Cox
(Submitted on 25 Mar 2013)

It is widely held that a substantial genetic component underlies Bipolar Disorder (BD) and other neuropsychiatric disease traits. Recent efforts have been aimed at understanding the genetic basis of disease susceptibility, with genome-wide association studies (GWAS) unveiling some promising associations. Nevertheless, the genetic etiology of BD remains elusive with a substantial proportion of the heritability – which has been estimated to be 80% based on twin and family studies – unaccounted for by the specific genetic variants identified by large-scale GWAS. Furthermore, functional understanding of associated loci generally lags discovery. Studies we report here provide considerable support to the claim that substantially more remains to be gained from GWAS on the genetic mechanisms underlying BD susceptibility, and that a large proportion of the variation in disease risk may be uncovered through integrative functional genomic approaches. We combine recent analytic advances in heritability estimation and polygenic modeling and leverage recent technological advances in the generation of -omics data to evaluate the nature and scale of the contribution of functional classes of genetic variation to a relatively intractable disorder. We identified cis eQTLs in cerebellum and parietal cortex that capture more than half of the total heritability attributable to SNPs interrogated through GWAS and showed that eQTL-based heritability estimation is highly tissue-dependent. Our findings show that a much greater resolution may be attained than has been reported thus far on the number of common loci that capture a substantial proportion of the heritability to disease risk and that the functional nature of contributory loci may be clarified en masse.