This guest post is by Xiaoquan (William) Wen, Francesca Luca, and Roger Pique-Regi on their preprint Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study, bioRxived here.
Our paper presents an integrative analysis framework for fine mapping and functional analysis of cis-eQTLs. In particular, we consider a setting where eQTL data are collected from multiple population groups. Although the details of our methods and analysis results are described in the manuscript, we'd like to take the opportunity here to discuss some of the main features and interesting findings of this work.
From the methodological perspective, the Bayesian inference framework that we present in this paper enables efficient multiple SNP analysis in the presence of multiple heterogeneous (population) groups. This framework is a natural extension of our previous work on heterogeneous genetic association data in single SNP analysis (Flutre et al 2013, Wen and Stephens 2014). The output from our multiple cis-eQTL analysis fully characterizes the uncertainty of eQTL calls, which becomes critical in the downstream functional analysis. This represents a significant advantage over the commonly applied conditional analysis approach, which is non-trivial to generalize when there are multiple heterogeneous subgroups. Taking advantage of these features, we further extend our analysis framework to incorporate functional genomic annotations and assess their levels of enrichment in association signals. Although we focus solely on eQTLs in this paper, our statistical methods are completely general and applicable in other contexts of genetic association analysis.
Applying this analysis framework, we re-analyzed the eQTL data from the GEUVADIS project (Lappalainen et al 2013), which consists of samples from five population groups. Importantly, a key motivation is to identify eQTL signals that are consistently present in all population groups. This analysis yields some interesting findings, which we highlight below:
Cross-population meta-analysis greatly improves the power of eQTL discovery. The power gain from integrating data across population groups is expected: with a combined sample size of ~400 across five population groups, we are able to identify 6,555 genes that harbor at least one cis-eQTL (which we refer to as “eGenes”) from 11,838 tested protein coding and lincRNA genes at the 5% FDR level; in comparison, the union set from the population-by-population analysis yields 3,447 eGenes.
Cross-population samples provide unique resources to fine map cis-eQTLs. We perform multiple SNP analysis for each identified eGene, and find that for a non-trivial proportion of genes (7% of all genes analyzed, or 14% of identified eGenes), two or more independent cis-eQTL signals can be confidently identified in the GEUVADIS data. In most of those cases, we are relatively certain about the existence of multiple eQTL signals, but cannot pinpoint the causal variants by fully resolving the LD. Nevertheless, we find that with cross-population samples, the population heterogeneity in local patterns of LD can be effectively leveraged to narrow down the genomic regions that harbor causal eQTLs, a phenomenon that we refer to as “LD filtering”. Using the GEUVADIS data, we are able to quantify the effect of LD filtering. More specifically, we select a set of genes that are identified with high confidence as harboring exactly one cis-eQTL, and construct credible regions for each eQTL signal based on both the population-by-population and the cross-population analyses. We find that for the majority of the genes tested (92% of 526 selected genes), the joint analysis yields a smaller credible region than the shortest credible region from the separate population analyses. The median reduction in region length from the separate analyses to the joint analysis is close to 50% in the set of genes examined.
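To make the idea of LD filtering concrete, here is a minimal toy sketch in Python. It is not the procedure used in the paper: it assumes exactly one causal SNP, a uniform prior over SNPs, and that evidence from the two populations can be combined by simply adding log Bayes factors. The function name `credible_region` and all numbers are hypothetical.

```python
import numpy as np

def credible_region(positions, log_bf, level=0.95):
    """Return (start, end) spanned by the top-ranked SNPs reaching `level` posterior mass.

    Assumes one causal SNP and a uniform prior, so each SNP's posterior
    probability is proportional to its Bayes factor.
    """
    positions = np.asarray(positions)
    log_bf = np.asarray(log_bf, dtype=float)
    post = np.exp(log_bf - log_bf.max())   # subtract the max for numerical stability
    post /= post.sum()
    order = np.argsort(-post)              # SNPs from most to least probable
    n_keep = int(np.searchsorted(np.cumsum(post[order]), level)) + 1
    kept = positions[order[:min(n_keep, len(order))]]
    return int(kept.min()), int(kept.max())

# Toy data: six SNPs; each population has a different LD block of
# indistinguishable SNPs, and the blocks overlap only at 300 and 400.
positions = [100, 200, 300, 400, 500, 600]
log_bf_pop_a = np.array([8.0, 8.0, 8.0, 8.0, 0.0, 0.0])
log_bf_pop_b = np.array([0.0, 0.0, 8.0, 8.0, 8.0, 8.0])

region_a = credible_region(positions, log_bf_pop_a)
region_b = credible_region(positions, log_bf_pop_b)
# Assumption: independent evidence across populations, so log Bayes factors add.
region_joint = credible_region(positions, log_bf_pop_a + log_bf_pop_b)
print(region_a, region_b, region_joint)
```

In this toy setting, each population alone cannot distinguish the SNPs within its own LD block, while the joint analysis concentrates the posterior mass on the SNPs shared by both blocks.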
On the other hand, there are cases where population specific LD patterns cause some SNPs to display a large degree of heterogeneity across populations in their estimated effect sizes from single SNP analysis. In some extreme cases, a SNP may appear to possess strong “population specific” effects. While we acknowledge that genuine population specific eQTLs are certainly interesting phenomena and very likely exist, we suggest interpreting highly heterogeneous eQTL signals from single SNP analysis with caution. In the paper, we demonstrate one such example where a set of SNPs in LD, when analyzed alone, appear to show strong but opposite effects on expression levels in European and African populations. The multiple SNP analysis yields a much more plausible alternative explanation: it identifies two independent eQTL signals in the region, and the “opposite effect” eQTLs tag one signal in the African population and the other signal in the European populations. This example, we believe, clearly demonstrates the necessity and benefit of multiple SNP analysis using cross-population samples.
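The mechanism behind such an apparent "opposite effect" can be reproduced in a small simulation. The sketch below is purely illustrative and uses ordinary least squares rather than the Bayesian model in the paper; the allele frequency (0.3), the 90% tagging rate, the effect sizes (+1 and -1), and the names `simulate_population` and `ols` are all assumptions made for this example.

```python
import numpy as np

def simulate_population(rng, n, tagged):
    """Simulate genotypes and expression for one population.

    c1 raises expression and c2 lowers it. The tag SNP has no effect of its
    own; it copies the `tagged` causal SNP in about 90% of individuals.
    """
    c1 = rng.binomial(2, 0.3, n).astype(float)
    c2 = rng.binomial(2, 0.3, n).astype(float)
    unlinked = rng.binomial(2, 0.3, n).astype(float)
    source = c1 if tagged == "c1" else c2
    tag = np.where(rng.random(n) < 0.9, source, unlinked)
    expr = c1 - c2 + rng.normal(0.0, 1.0, n)
    return c1, c2, tag, expr

def ols(columns, y):
    """Least-squares slopes (an intercept is fitted, then dropped from the result)."""
    X = np.column_stack([np.ones_like(y)] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

rng = np.random.default_rng(0)
pop_a = simulate_population(rng, 2000, tagged="c1")  # tag SNP is in LD with c1
pop_b = simulate_population(rng, 2000, tagged="c2")  # tag SNP is in LD with c2

# Single SNP analysis of the tag SNP, one population at a time.
slope_a = ols([pop_a[2]], pop_a[3])[0]
slope_b = ols([pop_b[2]], pop_b[3])[0]

# Multiple SNP analysis on the pooled samples.
c1, c2, tag, expr = (np.concatenate(pair) for pair in zip(pop_a, pop_b))
joint = ols([c1, c2, tag], expr)

print("single SNP slopes for tag:", slope_a, slope_b)
print("joint slopes for c1, c2, tag:", joint)
```

Analyzed alone, the tag SNP picks up the positive signal in one population and the negative signal in the other. Once both causal SNPs are in the model, its estimated effect shrinks toward zero.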
Genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs. This point is demonstrated by our functional analysis approach based on the fine mapping results of cis-eQTLs. In brief, we classify every cis-SNP into three mutually exclusive categories based on the computational predictions of the CENTIPEDE model: 1) SNPs strongly affecting TF binding; 2) SNPs residing in a DNase I footprint region but with little or no effect on TF binding; 3) all other SNPs, or baseline SNPs. We find that SNPs in the first category are 1.49 fold more likely than baseline SNPs to be eQTLs, and this enrichment is statistically highly significant (p-value = 4.93 × 10^-22). The SNPs in category 2 are also enriched, but with a much smaller fold change (1.15) and weaker statistical significance (p-value = 0.0035). Interestingly, this finding agrees with the results reported in our recent work (Moyerbrailean et al 2014), where other cellular and organismal phenotype QTLs are examined.
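As a rough illustration of why eQTL uncertainty matters for this kind of enrichment analysis, the sketch below weights each SNP by a posterior inclusion probability (PIP) instead of making hard eQTL calls. This is a simplified plug-in estimate assumed for illustration only, not the estimator used in the paper; the function `fold_enrichment`, the PIP values, and the category labels are made up.

```python
import numpy as np

def fold_enrichment(pip, category, baseline=3):
    """Ratio of mean PIP in each category to mean PIP in the baseline category."""
    pip = np.asarray(pip, dtype=float)
    category = np.asarray(category)
    base_rate = pip[category == baseline].mean()
    return {
        int(c): float(pip[category == c].mean() / base_rate)
        for c in np.unique(category)
        if c != baseline
    }

# Hypothetical PIPs for eight cis-SNPs and their annotation categories:
# 1 = strongly affects TF binding, 2 = in a footprint with little or no effect, 3 = baseline.
pip = [0.60, 0.30, 0.25, 0.20, 0.10, 0.20, 0.30, 0.20]
category = [1, 1, 2, 2, 3, 3, 3, 3]

enrichment = fold_enrichment(pip, category)
print(enrichment)
```

Because every SNP contributes in proportion to its PIP, SNPs that cannot be distinguished by LD share the credit for a signal rather than one of them being arbitrarily labeled as the eQTL.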
Overall, the ability of our method to disentangle multiple eQTL signals represents a significant step forward towards fully comprehending the complex mechanisms regulating gene expression. The natural interventions represented by genetic polymorphisms can be used in future studies to identify multiple functional regulatory elements for a gene. The computational methods used in this paper are implemented in the software packages FM-eQTL and eQTLBMA. Our analysis results are also available for browsing and downloading at this site.
1. T Flutre, X Wen, J Pritchard, M Stephens (2013). A statistical framework for joint eQTL analysis in multiple tissues. PLoS genetics 9 (5), e1003486
2. X Wen, M Stephens (2014). Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene–environment interactions. The Annals of Applied Statistics 8 (1), 176-203
3. T Lappalainen et al (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511
4. Moyerbrailean et al (2014). Are all genetic variants in DNase I sensitivity regions functional? bioRxiv, 007559