Our paper: Unbiased statistical testing of shared genetic control for potentially related traits

This guest post is by Chris Wallace on the preprint Unbiased statistical testing of shared genetic control for potentially related traits, available from the arXiv here. This is a cross-post from her group’s blog.

We have a new paper on arXiv detailing some work on colocalisation analysis, a method to determine whether two traits share a common causal variant. This is of interest in autoimmune disease genetics as the associated loci of so many autoimmune diseases overlap 1, but, for some genes, it appears the causal variants are distinct. It is also relevant for integrating disease association and eQTL data, to understand whether association of a disease to a particular locus is mediated by a variant’s effect on expression of a specific gene, possibly in a specific tissue.

However, determining whether traits share a common causal variant as opposed to distinct causal variants, probably in some LD, is not straightforward. It is well established that regression coefficients are aymptotically unbiased. However, when a SNP has been selected because it is the most associated in a region, then coefficients do then tend to be biased away from the null, ie their effect is overestimated. Because SNPs need to be selected to describe the association in any region in order to do colocalisation analysis, and because the coefficient bias will differ between datasets, there could be a tendancy to call truly colocalising traits as distinct. In fact, application of a formal statistical test for colocalisation 2 in a naive manner could have a type 1 error rate around 10-20% for a nominal size of 5%. This of course suggests that our earlier analysis of type 1 diabetes and monocyte gene expression 3 needs to be revised because it is likely we will have falsely rejected some genes which mediate the type 1 diabetes association in a region.

In this paper, we demonstrate two methods to overcome the problem. One, possibly more attractive to frequentists, is to avoid the variable selection by performing the analysis on principle components which summarise the genetic variation in a region. There is an issue with how many components are required, and our simulations suggest enough components need to be selected to capture around 85% of variation in a region. Obviously, this leads to a huge increase in degrees of freedom but, surprisingly, the power was not much worse compared to our favoured option of averaging p values over the variable selection using Bayesian Model Averaging. The idea of averaging p values is possibly anathema to Bayesians and frequentists alike, but these “posterior predictive p values” do have some history, having been introduced by Rubin in 1984 4. If you are prepared to mix Bayesian and frequentist theory sufficiently to average a p value over a posterior distribution (in this case, the posterior is of the SNPs which jointly summarise the association to both traits), it’s quite a nice idea. We used it before 3 as an alternative to taking a profile likelihood approach to dealing with a nuisance parameter, instead calculating p values conditional on the nuisance parameter, and averaging over its posterior. In this paper, we show by simulation that it does a good job of maintaining type 1 error and tends to be more powerful than the principle components approach.

There are many questions regarding integration of data from different GWAS that this paper doesn’t address: how to do this on a genomewide basis, for multiple traits, or when samples are not independent (GWAS which share a common set of controls, for example). Thus, it is a small step, but a useful contribution, I think, demonstrating a statistically sound method of investigating potentially shared causal variants in individual loci in detail. And while detailed investigation of individual loci may be currently be less fashionable than genomewide analyses, those detailed analyses are crucial for fine resolution analysis.

Unbiased statistical testing of shared genetic control for potentially related traits

Unbiased statistical testing of shared genetic control for potentially related traits
Chris Wallace
(Submitted on 23 Jan 2013)

Integration of data from genomewide single nucleotide polymorphism (SNP) association studies of different traits should allow researchers to disentangle the genetics of potentially related traits within individually associated regions. Methods have ranged from visual comparison of association $p$ values for each trait to formal statistical colocalisation testing of individual regions, which requires selection of a set of SNPs summarizing the association in a region. We show that the SNP selection method greatly affects type 1 error rates, with all published studies to date having used SNP selection methods that result in substantially biased inference. The primary reasons are twofold: random variation in the prescence of linkage disequilibrium means selected SNPs do not fully capture the association signal, and selecting SNPs on the basis of significance leads to biased effect size estimates.
We show that unbiased inference can be made either by avoiding variable selection and instead testing the most informative principal components or by integrating over variable selection using Bayesian model averaging. Application to data from Graves’ disease and Hashimoto’s thyroiditis reveals a common genetic signature across seven regions shared between the diseases, and indicates that for five out of six regions which have been significantly associated with one disease and not the other, the lack of evidence in one disease represents genuine absence of association rather than lack of power.

Natural selection. VI. Partitioning the information in fitness and characters by path analysis

Natural selection. VI. Partitioning the information in fitness and characters by path analysis
Steven A. Frank
(Submitted on 22 Jan 2013)

Three steps aid in the analysis of selection. First, describe phenotypes by their component causes. Components include genes, maternal effects, symbionts, and any other predictors of phenotype that are of interest. Second, describe fitness by its component causes, such as an individual’s phenotype, its neighbors’ phenotypes, resource availability, and so on. Third, put the predictors of phenotype and fitness into an exact equation for evolutionary change, providing a complete expression of selection and other evolutionary processes. The complete expression separates the distinct causal roles of the various hypothesized components of phenotypes and fitness. Traditionally, those components are given by the covariance, variance, and regression terms of evolutionary models. I show how to interpret those statistical expressions with respect to information theory. The resulting interpretation allows one to read the fundamental equations of selection and evolution as sentences that express how various causes lead to the accumulation of information by selection and the decay of information by other evolutionary processes. The interpretation in terms of information leads to a deeper understanding of selection and heritability, and a clearer sense of how to formulate causal hypotheses about evolutionary process. Kin selection appears as a particular type of causal analysis that partitions social effects into meaningful components.

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
Keith R. Bradnam (1), Joseph N. Fass (1), Anton Alexandrov (36), Paul Baranay (2), Michael Bechner (39), İnanç Birol (33), Sébastien Boisvert10, (11), Jarrod A. Chapman (20), Guillaume Chapuis (7,9), Rayan Chikhi (7,9), Hamidreza Chitsaz (6), Wen-Chi Chou (14,16), Jacques Corbeil (10,13), Cristian Del Fabbro (17), T. Roderick Docking (33), Richard Durbin (34), Dent Earl (40), Scott Emrich (3), Pavel Fedotov (36), Nuno A. Fonseca (30,35), Ganeshkumar Ganapathy (38), Richard A. Gibbs (32), Sante Gnerre (22), Élénie Godzaridis (11), Steve Goldstein (39), Matthias Haimel (30), Giles Hall (22), David Haussler (40), Joseph B. Hiatt (41), Isaac Y. Ho (20), Jason Howard (38), Martin Hunt (34), Shaun D. Jackman (33), David B Jaffe (22), Erich Jarvis (38), Huaiyang Jiang (32), et al. (55 additional authors not shown)
(Submitted on 23 Jan 2013)

Background – The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results – In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions – Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Thoughts on: Polygenic modeling with Bayesian sparse linear mixed models

[This post is a commentary by Alkes L. Price on “Polygenic modeling with Bayesian sparse linear mixed models” by Zhou, Carbonetto, and Stephens. The preprint is available on the arXiv here.]

Linear mixed models (LMM) are widely used by geneticists, both for estimating the heritability explained by genotyped markers (h2g) and for phenotypic prediction (Best Linear Unbiased Prediction, BLUP); their application for computing association statistics is outside the focus of the current paper. LMM assume that effects sizes are normally distributed, but this assumption may not hold in practice. Improved modeling of the distribution of effect sizes may lead to more precise estimates of h2g and more accurate phenotypic predictions.

Previous work (nicely summarized by the authors in Table 1) has used various mixture distributions to model effect sizes. In the current paper, the authors advocate a mixture of two normal distributions (with independently parametrized variances), and provide a prior distribution for the hyper-parameters of this mixture distribution. This approach has the advantage of generalizing LMM, so that the method produces results similar to LMM when the effect sizes roughly follow a normal distribution. Posterior estimates of the hyper-parameters and effect sizes are obtained via MCMC.

The authors show via simulations and application to real phenotypes (e.g. WTCCC) that the method performs as well or better than other methods, both for estimating h2g and for predicting phenotype, under a range of genetic architectures. For diseases with large-effect loci (e.g. autoimmune diseases), results superior to LMM are obtained. When effect sizes are close to normally distributed, results are similar to LMM — and superior to a previous Bayesian method developed by the authors based on a mixture of normally distributed and zero effect sizes, with priors specifying a small mixing weight for non-zero effects.

Have methods for estimating h2g and building phenotypic predictions reached a stage of perfection that obviates the need for further research? The authors report a running time of 77 hours to analyze data from 3,925 individuals, so computational tractability on the much larger data sets of the future is a key area for possible improvement. I wonder whether it might be possible for a simpler method to achieve similar performance.

Alkes Price

Comprehensive evaluation of differential expression analysis methods for RNA-seq data

Comprehensive evaluation of differential expression analysis methods for RNA-seq data
Franck Rapaport, Raya Khanin, Yupu Liang, Azra Krek, Paul Zumbo, Christopher E. Mason, Nicholas D. Socci, Doron Betel
(Submitted on 22 Jan 2013)

High-throughput sequencing of RNA transcripts (RNA-seq) has become the method of choice for detection of differential expression (DE). Concurrent with the growing popularity of this technology there has been a significant research effort devoted towards understanding the statistical properties of this data and the development of analysis methods. We report on a comprehensive evaluation of the commonly used DE methods using the SEQC benchmark data set. We evaluate a number of key features including: assessment of normalization, accuracy of DE detection, modeling of genes expressed in only one condition, and the impact of sequencing depth and number of replications on identifying DE genes. We find significant differences among the methods with no single method consistently outperforming the others. Furthermore, the performance of array-based approach is comparable to methods customized for RNA-seq data. Perhaps most importantly, our results demonstrate that increasing the number of replicate samples provides significantly more detection power than increased sequencing depth.

Transcript length mediates developmental timing of gene expression across Drosophila

Transcript length mediates developmental timing of gene expression across Drosophila
Carlo G. Artieri, Hunter B. Fraser
(Submitted on 18 Jan 2013)

The time required to transcribe genes with long primary transcripts may limit their ability to be expressed in cells with short mitotic cycles, a phenomenon termed intron delay. As such short cycles are a hallmark of the earliest stages of insect development, we used Drosophila developmental timecourse expression data to test whether intron delay affects gene expression genome-wide, and to determine its consequences for the evolution of gene structure. We find that long zygotically expressed, but not maternally deposited, genes show substantial delay in expression relative to their shorter counterparts and that this delay persists over a substantial portion of the ~24 hours of embryogenesis. Patterns of RNA-seq coverage from the 5′ and 3′ ends of transcripts show that this delay is consistent with their inability to terminate transcription, but not with transcriptional initiation-based regulatory control. Highly expressed zygotic genes are subject to purifying selection to maintain compact transcribed regions, allowing conservation of embryonic expression patterns across the Drosophila phylogeny. We propose that intron delay is an underappreciated physical mechanism affecting both patterns of expression as well as gene structure of many genes across Drosophila.

Separation of the largest eigenvalues in eigenanalysis of genotype data from discrete subpopulations

Separation of the largest eigenvalues in eigenanalysis of genotype data from discrete subpopulations
Katarzyna Bryc, Wlodek Bryc, Jack W. Silverstein
(Submitted on 18 Jan 2013)

We present a mathematical model, and the corresponding mathematical analysis, that justifies and quantifies the use of principal component analysis of biallelic genetic marker data for a set of individuals to detect the number of subpopulations represented in the data. We indicate that the power of the technique relies more on the number of individuals genotyped than on the number of markers.

Reproductive isolation between phylogeographic lineages scales with divergence

Reproductive isolation between phylogeographic lineages scales with divergence
Sonal Singhal, Craig Moritz
(Submitted on 17 Jan 2013)

Phylogeographic studies frequently reveal multiple morphologically-cryptic lineages within species. What is yet unclear is whether such lineages represent nascent species or evolutionary ephemera. To address this question, we compare five contact zones, each of which occurs between eco-morphologically cryptic lineages of rainforest skinks from the rainforests of the Australian Wet Tropics. Although the contacts likely formed concurrently in response to Holocene expansion from glacial refugia, we estimate that the divergence times (t) of the lineage-pairs range from 3.1 to 11.5 Myr. Multilocus analyses of the contact zones yielded estimates of reproductive isolation that are tightly correlated with divergence time and, for longer-diverged lineages (t > 5 Myr), substantial. These results show that phylogeographic splits of increasing depth can represent stages along the speciation continuum, even in the absence of overt change in ecologically relevant morphology.

Gene set bagging for estimating replicability of gene set analyses

Gene set bagging for estimating replicability of gene set analyses
Andrew E. Jaffe, John D. Storey, Hongkai Ji, Jeffrey T. Leek
(Submitted on 16 Jan 2013)

Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features for association with disease. We propose a new approach, called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate. This procedure can be thought of as bootstrapping gene-set analysis and can be used to determine which are the most reproducible gene sets. Results: Here we apply this approach to two common genomics applications: gene expression and DNA methylation. Even with state-of-the-art statistical ranking procedures, significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. Conclusions: We demonstrate that gene lists are not necessarily stable, and therefore additional steps like gene set bagging can improve biological inference of gene set analysis.