Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression

Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression
Peter A. Combs, Michael B. Eisen
(Submitted on 19 Feb 2013)

Complex spatial and temporal patterns of gene expression underlie embryo differentiation, yet methods do not exist for the efficient genome-wide determination of spatial patterns of gene expression. In situ imaging of transcripts and proteins is the gold standard, but is difficult and time-consuming to apply to an entire genome, even when highly automated. Sequencing, in contrast, is fast and genome-wide, but generally applied to homogenized tissues, thereby discarding spatial information. At some point, these methods will converge, and we will be able to sequence RNAs in situ, simultaneously determining their identity and location. As a step along this path, we developed methods to cryosection individual blastoderm stage Drosophila melanogaster embryos along the anterior-posterior axis and sequence the mRNA isolated from each 60 μm slice. The spatial patterns of gene expression we infer closely match patterns determined by in situ hybridization and microscopy, where such data exist, and thus we conclude that we have generated the first genome-wide map of spatial patterns in the Drosophila embryo. We identify numerous genes with spatial patterns that have not yet been screened in the several ongoing systematic in situ based projects, the majority of which are localized to the posterior end of the embryo, likely in the pole cells. This simple experiment demonstrates the potential for combining careful anatomical dissection with high-throughput sequencing to obtain spatially resolved gene expression on a genome-wide scale.
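
To make the per-slice inference concrete, here is a minimal sketch, in Python, of how a slices-by-genes count matrix might be turned into spatial expression profiles along the anterior-posterior axis. This is my illustration, not the authors' pipeline; the normalization scheme and function names are assumptions.

```python
import numpy as np

def spatial_profiles(counts):
    """counts: (n_slices, n_genes) raw read counts, slices ordered
    anterior to posterior. Returns one spatial profile per gene."""
    # Normalize each slice for sequencing depth (counts per million).
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    # Normalize each gene so its profile sums to 1: a spatial density
    # along the A-P axis that is comparable across genes.
    totals = cpm.sum(axis=0, keepdims=True)
    return np.divide(cpm, totals, out=np.zeros_like(cpm), where=totals > 0)

# Example: 10 slices, a roughly uniform gene vs. a posterior-enriched one.
rng = np.random.default_rng(0)
counts = np.column_stack([
    rng.poisson(100, size=10),             # uniform along the axis
    rng.poisson(np.linspace(1, 200, 10)),  # enriched at the posterior
]).astype(float)
print(spatial_profiles(counts).round(2))
```

Profiles like these can then be compared against patterns digitized from in situ images, which is how the authors validate the approach.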

Most viewed on Haldane’s Sieve: January 2013

The most viewed preprints on Haldane’s Sieve in January 2013 were:

Equitability, mutual information, and the maximal information coefficient

Equitability, mutual information, and the maximal information coefficient
Justin B. Kinney, Gurinder S. Atwal
(Submitted on 31 Jan 2013)

Reshef et al. recently proposed a new statistical measure, the “maximal information coefficient” (MIC), for quantifying arbitrary dependencies between pairs of stochastic quantities. MIC is based on mutual information, a fundamental quantity in information theory that is widely understood to serve this need. MIC, however, is not an estimate of mutual information. Indeed, it was claimed that MIC possesses a desirable mathematical property called “equitability” that mutual information lacks. This was not proven; instead it was argued solely through the analysis of simulated data. Here we show that this claim, in fact, is incorrect. First we offer mathematical proof that no (non-trivial) dependence measure satisfies the definition of equitability proposed by Reshef et al. We then propose a self-consistent and more general definition of equitability that follows naturally from the Data Processing Inequality. Mutual information satisfies this new definition of equitability while MIC does not. Finally, we show that the simulation evidence offered by Reshef et al. was artifactual. We conclude that estimating mutual information is not only practical for many real-world applications, but also provides a natural solution to the problem of quantifying associations in large data sets.
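
For readers who want the baseline quantity in hand, below is a minimal plug-in estimator of mutual information from a 2D histogram. This is a generic textbook construction, not the estimator the authors analyze, and the bin count is an arbitrary choice.

```python
import numpy as np

def mutual_information(x, y, bins=20):
    """Plug-in mutual information estimate (in bits) from a 2D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                          # joint distribution estimate
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    nz = pxy > 0                              # avoid log(0) terms
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
print(mutual_information(x, x + rng.normal(size=5000)))  # dependent: MI > 0
print(mutual_information(x, rng.normal(size=5000)))      # independent: MI ~ 0
```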

Our paper: Unbiased statistical testing of shared genetic control for potentially related traits

This guest post is by Chris Wallace on the preprint Unbiased statistical testing of shared genetic control for potentially related traits, available from the arXiv here. This is a cross-post from her group’s blog.

We have a new paper on arXiv detailing some work on colocalisation analysis, a method to determine whether two traits share a common causal variant. This is of interest in autoimmune disease genetics, as the associated loci of many autoimmune diseases overlap [1], but, for some genes, it appears the causal variants are distinct. It is also relevant for integrating disease association and eQTL data, to understand whether association of a disease to a particular locus is mediated by a variant’s effect on expression of a specific gene, possibly in a specific tissue.

However, determining whether traits share a common causal variant, as opposed to distinct causal variants that are probably in some LD, is not straightforward. It is well established that regression coefficients are asymptotically unbiased. However, when a SNP has been selected because it is the most associated in a region, its coefficient tends to be biased away from the null, i.e. its effect is overestimated. Because SNPs need to be selected to describe the association in any region in order to do colocalisation analysis, and because the coefficient bias will differ between datasets, there could be a tendency to call truly colocalising traits distinct. In fact, applying a formal statistical test for colocalisation [2] in a naive manner could have a type 1 error rate around 10-20% for a nominal size of 5%. This of course suggests that our earlier analysis of type 1 diabetes and monocyte gene expression [3] needs to be revised, because it is likely we will have falsely rejected some genes which mediate the type 1 diabetes association in a region.
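
The selection bias is easy to reproduce. The toy simulation below is my illustration, not from the paper, and it ignores LD by drawing independent estimates; it picks the most associated of 50 SNP effect estimates in each of many regions and shows the picked estimate is inflated relative to the truth.

```python
import numpy as np

rng = np.random.default_rng(2)
true_beta, se, n_snps, n_regions = 0.10, 0.05, 50, 10_000
# Each region: 50 independent estimates around the same true effect.
betas = rng.normal(true_beta, se, size=(n_regions, n_snps))
top = betas[np.arange(n_regions), np.abs(betas).argmax(axis=1)]
print(f"true effect:                    {true_beta:.3f}")
print(f"mean estimate, random SNP:      {betas[:, 0].mean():.3f}")  # ~unbiased
print(f"mean estimate, most associated: {top.mean():.3f}")          # inflated
```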

In this paper, we demonstrate two methods to overcome the problem. One, possibly more attractive to frequentists, is to avoid the variable selection by performing the analysis on principal components which summarise the genetic variation in a region. There is an issue with how many components are required, and our simulations suggest enough components need to be selected to capture around 85% of the variation in a region. Obviously, this leads to a huge increase in degrees of freedom but, surprisingly, the power was not much worse compared to our favoured option of averaging p values over the variable selection using Bayesian Model Averaging. The idea of averaging p values is possibly anathema to Bayesians and frequentists alike, but these “posterior predictive p values” do have some history, having been introduced by Rubin in 1984 [4]. If you are prepared to mix Bayesian and frequentist theory sufficiently to average a p value over a posterior distribution (in this case, the posterior is over the SNPs which jointly summarise the association to both traits), it’s quite a nice idea. We used it before [3] as an alternative to taking a profile likelihood approach to dealing with a nuisance parameter, instead calculating p values conditional on the nuisance parameter and averaging over its posterior. In this paper, we show by simulation that it does a good job of maintaining type 1 error and tends to be more powerful than the principal components approach.
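
Here is a sketch of the principal components route, assuming the ~85% rule described above. The function and variable names are mine; a real analysis would go on to test the component scores against both traits rather than just return them.

```python
import numpy as np

def pcs_for_region(genotypes, target=0.85):
    """genotypes: (n_samples, n_snps) matrix. Returns scores on enough
    principal components to capture `target` of the regional variation."""
    X = genotypes - genotypes.mean(axis=0)        # center each SNP
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()         # variance fractions
    k = int(np.searchsorted(np.cumsum(explained), target)) + 1
    return X @ Vt[:k].T                           # scores on first k PCs

rng = np.random.default_rng(3)
G = rng.binomial(2, 0.3, size=(500, 40)).astype(float)
print(pcs_for_region(G).shape)  # (500, k), k chosen by the 85% rule
```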

There are many questions regarding integration of data from different GWAS that this paper doesn’t address: how to do this on a genomewide basis, for multiple traits, or when samples are not independent (GWAS which share a common set of controls, for example). Thus, it is a small step, but a useful contribution, I think, demonstrating a statistically sound method of investigating potentially shared causal variants in individual loci in detail. And while detailed investigation of individual loci may currently be less fashionable than genomewide analyses, those detailed analyses are crucial for fine-resolution analysis.

Unbiased statistical testing of shared genetic control for potentially related traits

Unbiased statistical testing of shared genetic control for potentially related traits
Chris Wallace
(Submitted on 23 Jan 2013)

Integration of data from genomewide single nucleotide polymorphism (SNP) association studies of different traits should allow researchers to disentangle the genetics of potentially related traits within individually associated regions. Methods have ranged from visual comparison of association p values for each trait to formal statistical colocalisation testing of individual regions, which requires selection of a set of SNPs summarizing the association in a region. We show that the SNP selection method greatly affects type 1 error rates, with all published studies to date having used SNP selection methods that result in substantially biased inference. The primary reasons are twofold: random variation in the presence of linkage disequilibrium means selected SNPs do not fully capture the association signal, and selecting SNPs on the basis of significance leads to biased effect size estimates.
We show that unbiased inference can be made either by avoiding variable selection and instead testing the most informative principal components or by integrating over variable selection using Bayesian model averaging. Application to data from Graves’ disease and Hashimoto’s thyroiditis reveals a common genetic signature across seven regions shared between the diseases, and indicates that for five out of six regions which have been significantly associated with one disease and not the other, the lack of evidence in one disease represents genuine absence of association rather than lack of power.

Thoughts on: Polygenic modeling with Bayesian sparse linear mixed models

[This post is a commentary by Alkes L. Price on “Polygenic modeling with Bayesian sparse linear mixed models” by Zhou, Carbonetto, and Stephens. The preprint is available on the arXiv here.]

Linear mixed models (LMMs) are widely used by geneticists, both for estimating the heritability explained by genotyped markers (h2g) and for phenotypic prediction (Best Linear Unbiased Prediction, BLUP); their application for computing association statistics is outside the focus of the current paper. LMMs assume that effect sizes are normally distributed, but this assumption may not hold in practice. Improved modeling of the distribution of effect sizes may lead to more precise estimates of h2g and more accurate phenotypic predictions.
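
As a reference point for what the paper generalizes, here is a minimal sketch of the standard LMM for h2g: the phenotype covariance is modeled as h2g*K + (1-h2g)*I for a genetic relationship matrix K, and h2g is estimated by maximizing the likelihood. This is a grid-search ML toy, not REML and not any paper's software.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, h2_true = 600, 1000, 0.5
Z = rng.binomial(2, 0.4, size=(n, m)).astype(float)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardized genotypes
K = Z @ Z.T / m                            # genetic relationship matrix

# Simulate a phenotype with normally distributed effect sizes.
beta = rng.normal(0, np.sqrt(h2_true / m), size=m)
y = Z @ beta + rng.normal(0, np.sqrt(1 - h2_true), size=n)
y = (y - y.mean()) / y.std()

# Rotate by the eigenvectors of K: the covariance becomes diagonal,
# so the log-likelihood of y ~ N(0, h2*K + (1-h2)*I) is cheap to evaluate.
vals, vecs = np.linalg.eigh(K)
yr = vecs.T @ y

def loglik(h2):
    v = h2 * vals + (1 - h2)   # per-eigenvalue variance
    return -0.5 * (np.log(v).sum() + (yr ** 2 / v).sum())

grid = np.linspace(0.01, 0.99, 99)
print("h2g estimate:", grid[np.argmax([loglik(h) for h in grid])])
```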

Previous work (nicely summarized by the authors in Table 1) has used various mixture distributions to model effect sizes. In the current paper, the authors advocate a mixture of two normal distributions (with independently parametrized variances), and provide a prior distribution for the hyper-parameters of this mixture distribution. This approach has the advantage of generalizing LMM, so that the method produces results similar to LMM when the effect sizes roughly follow a normal distribution. Posterior estimates of the hyper-parameters and effect sizes are obtained via MCMC.
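
Here is a sketch of drawing effect sizes from this kind of prior, assuming the common parametrization of a small background variance shared by all SNPs plus a larger variance for a sparse fraction of them; the parameter names are mine, not the paper's.

```python
import numpy as np

def sample_effects(m, pi, sigma_sparse, sigma_background, rng):
    """Draw m effect sizes from a two-component normal mixture:
    a fraction pi get the larger 'sparse' variance on top of the
    small 'background' variance shared by every SNP."""
    sparse = rng.random(m) < pi
    sd = np.where(sparse,
                  np.sqrt(sigma_sparse**2 + sigma_background**2),
                  sigma_background)
    return rng.normal(0.0, sd)

rng = np.random.default_rng(5)
b = sample_effects(10_000, pi=0.01, sigma_sparse=0.3,
                   sigma_background=0.01, rng=rng)
# pi -> 1, or a dominant background variance, recovers the plain LMM
# prior, which is why the method degrades gracefully to LMM behavior.
print(b.std(), np.abs(b).max())
```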

The authors show via simulations and application to real phenotypes (e.g. WTCCC) that the method performs as well as or better than other methods, both for estimating h2g and for predicting phenotype, under a range of genetic architectures. For diseases with large-effect loci (e.g. autoimmune diseases), results superior to LMM are obtained. When effect sizes are close to normally distributed, results are similar to LMM — and superior to a previous Bayesian method developed by the authors based on a mixture of normally distributed and zero effect sizes, with priors specifying a small mixing weight for non-zero effects.

Have methods for estimating h2g and building phenotypic predictions reached a stage of perfection that obviates the need for further research? The authors report a running time of 77 hours to analyze data from 3,925 individuals, so computational tractability on the much larger data sets of the future is a key area for possible improvement. I wonder whether it might be possible for a simpler method to achieve similar performance.

Alkes Price

Gene set bagging for estimating replicability of gene set analyses

Gene set bagging for estimating replicability of gene set analyses
Andrew E. Jaffe, John D. Storey, Hongkai Ji, Jeffrey T. Leek
(Submitted on 16 Jan 2013)

Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features for association with disease. We propose a new approach, called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate. This procedure can be thought of as bootstrapping gene-set analysis and can be used to determine which are the most reproducible gene sets. Results: Here we apply this approach to two common genomics applications: gene expression and DNA methylation. Even with state-of-the-art statistical ranking procedures, significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. Conclusions: We demonstrate that gene lists are not necessarily stable, and therefore additional steps like gene set bagging can improve biological inference of gene set analysis.
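
The procedure is straightforward to sketch. In the outline below (mine, not the authors' implementation), `gene_set_test` is a placeholder for any method that returns one p value per predefined gene set.

```python
import numpy as np

def bagging_replicability(data, labels, gene_set_test, n_boot=100, alpha=0.05):
    """data: (n_samples, n_genes); labels: per-sample group labels.
    Returns, per gene set, the fraction of bootstrap resamples in
    which that set is significant at level alpha."""
    rng = np.random.default_rng(0)
    n = data.shape[0]
    hits = None
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)               # resample samples
        pvals = gene_set_test(data[idx], labels[idx])  # one p per gene set
        sig = np.asarray(pvals) < alpha
        hits = sig.astype(int) if hits is None else hits + sig
    return hits / n_boot                               # replication frequency

# Demo with a dummy test that returns random p values for 5 gene sets.
data = np.random.default_rng(1).normal(size=(40, 100))
labels = np.repeat([0, 1], 20)
dummy_test = lambda d, l: np.random.uniform(size=5)
print(bagging_replicability(data, labels, dummy_test, n_boot=20))
```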

Evolution of molecular phenotypes under stabilizing selection

Evolution of molecular phenotypes under stabilizing selection
Armita Nourmohammad, Stephan Schiffels, Michael Laessig
(Submitted on 17 Jan 2013)

Molecular phenotypes are important links between genomic information and organismic functions, fitness, and evolution. Complex phenotypes, which are also called quantitative traits, often depend on multiple genomic loci. Their evolution builds on genome evolution in a complicated way, which involves selection, genetic drift, mutations and recombination. Here we develop a coarse-grained evolutionary statistics for phenotypes, which decouples from details of the underlying genotypes. We derive approximate evolution equations for the distribution of phenotype values within and across populations. This dynamics covers evolutionary processes at high and low recombination rates, that is, it applies to sexual and asexual populations. In a fitness landscape with a single optimal phenotype value, the phenotypic diversity within populations and the divergence between populations reach evolutionary equilibria, which describe stabilizing selection. We compute the equilibrium distributions of both quantities analytically and we show that the ratio of mean divergence and diversity depends on the strength of selection in a universal way: it is largely independent of the phenotype’s genomic encoding and of the recombination rate. This establishes a new method for the inference of selection on molecular phenotypes beyond the genome level. We discuss the implications of our findings for the predictability of evolutionary processes.
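
As a very rough illustration of the two equilibrium quantities (far simpler than the paper's theory, and asexual, whereas the theory covers recombination too), the toy Wright-Fisher simulation below evolves a polygenic trait under stabilizing selection in replicate populations and reports within-population diversity, between-population divergence, and their ratio.

```python
import numpy as np

rng = np.random.default_rng(6)
n_pop, n_ind, n_loci, gens = 20, 200, 50, 500
mut, sel = 0.005, 5.0   # per-locus mutation rate, selection strength

# Replicate populations of haploid binary genotypes (asexual copying).
pops = rng.integers(0, 2, size=(n_pop, n_ind, n_loci)).astype(float)
for _ in range(gens):
    z = pops.sum(axis=2)                                 # additive trait
    w = np.exp(-sel * ((z - n_loci / 2) / n_loci) ** 2)  # optimum at center
    for p in range(n_pop):
        parents = rng.choice(n_ind, size=n_ind, p=w[p] / w[p].sum())
        pops[p] = pops[p, parents]                       # selection + drift
        flips = rng.random((n_ind, n_loci)) < mut        # mutation
        pops[p][flips] = 1 - pops[p][flips]

z = pops.sum(axis=2)
diversity = z.var(axis=1).mean()    # mean within-population trait variance
divergence = z.mean(axis=1).var()   # variance of population mean traits
print(diversity, divergence, divergence / diversity)
```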

Efficient Identification of Equivalences in Dynamic Graphs and Pedigree Structures

Efficient Identification of Equivalences in Dynamic Graphs and Pedigree Structures
Hoyt Koepke, Elizabeth Thompson
(Submitted on 16 Jan 2013)

We propose a new framework for designing test and query functions for complex structures that vary across a given parameter such as genetic marker position. The operations we are interested in include equality testing, set operations, isolating unique states, duplication counting, or finding equivalence classes under identifiability constraints. A motivating application is locating equivalence classes in identity-by-descent (IBD) graphs, graph structures in pedigree analysis that change over genetic marker location. The nodes of these graphs are unlabeled and identified only by their connecting edges, a constraint easily handled by our approach. The general framework introduced is powerful enough to build a range of testing functions for IBD graphs, dynamic populations, and other structures using a minimal set of operations. The theoretical and algorithmic properties of our approach are analyzed and proved. Computational results on several simulations demonstrate the effectiveness of our approach.
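
One generic way to find equivalence classes among unlabeled nodes identified only by their edges is iterative refinement of neighborhood signatures (the one-dimensional Weisfeiler-Leman idea); the sketch below is my illustration of that generic technique, not the framework proposed in the paper.

```python
from collections import defaultdict

def equivalence_classes(adj):
    """adj: dict node -> iterable of neighbors (undirected graph).
    Groups nodes that are indistinguishable by connection structure."""
    color = {v: 0 for v in adj}           # start: all nodes indistinguishable
    for _ in range(len(adj)):
        # Signature = own color plus multiset of neighbor colors.
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v])))
               for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {v: relabel[sig[v]] for v in adj}
        if new == color:                  # refinement converged
            break
        color = new
    groups = defaultdict(list)
    for v, c in color.items():
        groups[c].append(v)
    return list(groups.values())

# Three structurally identical pendant nodes fall in the same class.
print(equivalence_classes({1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}))
```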

Mandated data archiving greatly improves access to research data

Mandated data archiving greatly improves access to research data
Timothy H. Vines, Rose L. Andrew, Dan G. Bock, Michelle T. Franklin, Kimberly J. Gilbert, Nolan C. Kane, Jean-Sébastien Moore, Brook T. Moyers, Sébastien Renaut, Diana J. Rennison, Thor Veen, Sam Yeaman
(Submitted on 16 Jan 2013)

The data underlying scientific papers should be accessible to researchers both now and in the future, but how best can we ensure that these data are available? Here we examine the effectiveness of four approaches to data archiving: no stated archiving policy, recommending (but not requiring) archiving, and two versions of mandating data deposition at acceptance. We control for differences between data types by trying to obtain data from papers that use a single, widespread population genetic analysis, STRUCTURE. At one extreme, we found that mandated data archiving policies that require the inclusion of a data availability statement in the manuscript improve the odds of finding the data online almost a thousand-fold compared to having no policy. However, archiving rates at journals with less stringent policies were only very slightly higher than those with no policy at all. We also assessed the effectiveness of asking for data directly from authors and obtained over half of the requested datasets, albeit with a delay of about 8 days and some disagreement with authors. Given the long-term benefits of data accessibility to the academic community, we believe that journal-based mandatory data archiving policies and mandatory data availability statements should be more widely adopted.