MINI REVIEW: Statistical methods for detecting differentially methylated loci and regions

Mark D Robinson, Abdullah Kahraman, Charity W Law, Helen Lindsay, Malgorzata Nowicka, Lukas M Weber, Xiaobei Zhou
doi: http://dx.doi.org/10.1101/007120

DNA methylation, and specifically the reversible addition of methyl groups at CpG dinucleotides genome-wide, represents an important layer that is associated with the regulation of gene expression. In particular, aberrations in methylation status have been noted across a diverse set of pathological states, including cancer. With the rapid development and uptake of large-scale sequencing of short DNA fragments, there has been an explosion of data analytic methods for processing and discovering changes in DNA methylation across diverse data types. In this mini-review, we aim to condense many of the salient challenges, such as experimental design, statistical methods for differential methylation detection and critical considerations such as cell type composition and the potential confounding that can arise from batch effects, into a compact and accessible format. Our main interests, from a statistical perspective, include the practical use of empirical Bayes or hierarchical models, which have been shown to be immensely powerful and flexible in genomics, and the procedures by which false discoveries are controlled. Of course, there are many critical platform-specific data preprocessing aspects that we do not discuss here. In addition, we do not make formal performance comparisons of the methods, but rather describe the commonly used statistical models and many of the pertinent issues; we make some recommendations for further study.
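
The abstract refers to procedures for controlling false discoveries without naming one. As an illustration only, the Python sketch below shows the Benjamini-Hochberg step-up procedure, a widely used false discovery rate method; choosing it here is an assumption, not something stated in the abstract. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure:
    sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value falls under its step-up threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-locus p-values, e.g. one per CpG site tested.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.06, 0.074, 0.205, 0.212, 0.216]
example_calls = benjamini_hochberg(example_p, alpha=0.05)
```

In this toy input only the two smallest p-values pass their thresholds, so only those two loci would be called at a nominal 5% false discovery rate.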

Quantitative trait evolution with arbitrary mutational models

Joshua G. Schraiber, Michael J. Landis
doi: http://dx.doi.org/10.1101/008540

When models of quantitative genetic variation are built from population genetic first principles, several assumptions are often made. One of the most important assumptions is that traits are controlled by many genes of small effect. This leads to a prediction of a Gaussian trait distribution in the population, via the Central Limit Theorem. Since these biological assumptions are often unknown or untrue, we characterized how finite numbers of loci or large mutational effects can impact the sampling distribution of a quantitative trait. To do so, we developed a neutral coalescent-based framework, allowing us to experiment freely with the number of loci and the underlying mutational model. Through both analytical theory and simulation we found the normality assumption was highly sensitive to the details of the mutational process, with the greatest discrepancies arising when the number of loci was small or the mutational kernel was heavy-tailed. In particular, fat-tailed mutational kernels result in multimodal sampling distributions for any number of loci. Since selection models and robust neutral models may produce qualitatively similar sampling distributions, we advise extra caution should be taken when interpreting model-based results for poorly understood systems of quantitative traits.
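
The Central Limit Theorem argument in this abstract can be illustrated with a toy simulation. The Python sketch below is not the authors' coalescent-based framework: it assumes each individual's trait is a sum of independent per-locus effects and ignores shared genealogy entirely, so it cannot reproduce the multimodality result. It only shows that a sum of Gaussian effects stays close to normal while a sum of heavy-tailed (Cauchy) effects does not. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_traits(n_individuals, n_loci, draw_effect, rng):
    """Trait value = sum of independent effects, one per locus (toy assumption)."""
    return [sum(draw_effect(rng) for _ in range(n_loci))
            for _ in range(n_individuals)]


def excess_kurtosis(values):
    """Sample excess kurtosis; approximately 0 for a Gaussian sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def gaussian_effect(rng):
    # Light-tailed mutational kernel: standard normal effect size.
    return rng.gauss(0.0, 1.0)


def cauchy_effect(rng):
    # Heavy-tailed kernel: standard Cauchy draw via the inverse CDF.
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(2014)  # fixed seed so the run is repeatable
gaussian_traits = simulate_traits(5000, 10, gaussian_effect, rng)
cauchy_traits = simulate_traits(5000, 10, cauchy_effect, rng)

kurt_gaussian = excess_kurtosis(gaussian_traits)
kurt_cauchy = excess_kurtosis(cauchy_traits)
```

Under these assumptions `kurt_gaussian` should sit near zero while `kurt_cauchy` should be very large, because a sum of Cauchy draws is itself Cauchy and never converges to a Gaussian however many loci are added.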

Nuclear stability and transcriptional directionality separate functionally distinct RNA species

Robin Andersson, Peter Refsing Andersen, Eivind Valen, Leighton Core, Jette Bornholdt, Mette Boyd, Torben Heick Jensen, Albin Sandelin
doi: http://dx.doi.org/10.1101/005447

Mammalian genomes are pervasively transcribed, yielding a complex transcriptome with high variability in composition and cellular abundance. While recent efforts have identified thousands of new long non-coding (lnc) RNAs and demonstrated a complex transcriptional repertoire produced by protein-coding (pc) genes, limited progress has been made in distinguishing functional RNA from spurious transcription events. This is partly due to present RNA classification, which is typically based on technical rather than biochemical criteria. Here we devise a strategy to systematically categorize human RNAs by their sensitivity to the ribonucleolytic RNA exosome complex and by the nature of their transcription initiation. These measures are surprisingly effective at correctly classifying annotated transcripts, including lncRNAs of known function. The approach also identifies uncharacterized stable lncRNAs, hidden among a vast majority of unstable transcripts. The predictive power of the approach promises to streamline the functional analysis of known and novel RNAs.

Most viewed on Haldane’s Sieve: August 2014

The most viewed posts on Haldane’s Sieve this month were:

The Genetic Architecture of Gene Expression Levels in Wild Baboons

Jenny Tung, Xiang Zhou, Susan C Alberts, Matthew Stephens, Yoav Gilad
doi: http://dx.doi.org/10.1101/008490

Gene expression variation is well documented in human populations and its genetic architecture has been extensively explored. However, we still know little about the genetic architecture of gene expression variation in other species, particularly our closest living relatives, the nonhuman primates. To address this gap, we performed an RNA sequencing (RNA-seq)-based study of 63 wild baboons, members of the intensively studied Amboseli baboon population in Kenya. Our study design allowed us to measure gene expression levels and identify genetic variants using the same data set, enabling us to perform complementary mapping of putative cis-acting expression quantitative trait loci (eQTL) and measurements of allele-specific expression (ASE) levels. We discovered substantial evidence for genetic effects on gene expression levels in this population. Surprisingly, we found more power to detect individual eQTL in the baboons relative to a HapMap human data set of comparable size, probably as a result of greater genetic variation, enrichment of SNPs with high minor allele frequencies, and longer-range linkage disequilibrium in the baboons. eQTL were most likely to be identified for lineage-specific, rapidly evolving genes. Interestingly, genes with eQTL significantly overlapped between the baboon and human data sets, suggesting that some genes may tolerate more genetic perturbation than others, and that this property may be conserved across species. Finally, we used a Bayesian sparse linear mixed model to partition genetic, demographic, and early environmental contributions to variation in gene expression levels. We found a strong genetic contribution to gene expression levels for almost all genes, while individual demographic and environmental effects tended to be more modest. 
Together, our results establish the feasibility of eQTL mapping using RNA-seq data alone, and act as an important first step towards understanding the genetic architecture of gene expression variation in nonhuman primates.

Sampling through time and phylodynamic inference with coalescent and birth-death models

Erik M. Volz, Simon DW Frost
(Submitted on 28 Aug 2014)

Many population genetic models have been developed for the purpose of inferring population size and growth rates from random samples of genetic data. We examine two popular approaches to this problem, the coalescent and the birth-death-sampling model, in the context of estimating population size and birth rates in a population growing exponentially according to the birth-death branching process. For sequences sampled at a single time, we found the coalescent and the birth-death-sampling model gave virtually indistinguishable results in terms of the growth rates and fraction of the population sampled, even when sampling from a small population. For sequences sampled at multiple time points, we find that the birth-death model estimators are subject to large bias if the sampling process is misspecified. Since birth-death-sampling models incorporate a model of the sampling process, we show how much of the statistical power of birth-death-sampling models arises from the sequence of sample times and not from the genealogical tree. This motivates the development of a new coalescent estimator, which is augmented with a model of the known sampling process and is potentially more precise than the coalescent that does not use sample time information.

C. elegans harbors pervasive cryptic genetic variation for embryogenesis

Annalise Paaby, Amelia White, David Riccardi, Kristin Gunsalus, Fabio Piano, Matthew Rockman
doi: http://dx.doi.org/10.1101/008532

Conditionally functional mutations are an important class of natural genetic variation, yet little is known about their prevalence in natural populations or their contribution to disease risk. Here, we describe a vast reserve of cryptic genetic variation, alleles that are normally silent but which affect phenotype when the function of other genes is perturbed, in the gene networks of C. elegans embryogenesis. We find evidence that cryptic-effect loci are ubiquitous and segregate at intermediate frequencies in the wild. The cryptic alleles demonstrate low developmental pleiotropy, in that specific, rather than general, perturbations are required to reveal them. Our findings underscore the importance of genetic background in characterizing gene function and provide a model for the expression of conditionally functional effects that may be fundamental in basic mechanisms of trait evolution and the genetic basis of disease susceptibility.