[This author post is by Peter Carbonetto on Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease, available from the arXiv here.]
I expect that most readers of this blog appreciate the impact that genome-wide association studies have had on our understanding of many common diseases. Still, I think it is important to reiterate a major appeal of genome-wide association studies: the analysis is conceptually straightforward to understand, even for people who have never had to suffer through a course on statistics or epidemiology. To find links between genetic loci and disease, the analysis consists of systematically searching across the genome for variants that show statistically significant correlation with susceptibility to disease. These correlations signal the presence of nearby genes—or perhaps DNA elements that regulate other genes—that are risk factors for disease.
Many readers of this blog will also appreciate how difficult it is to establish compelling evidence for disease-variant correlations, given the multifactorial nature of most common diseases. This difficulty motivates the search for more effective data-driven strategies for discovering the genetic factors underlying common diseases.
One strategy is to assess evidence for the accumulation, or “enrichment,” of disease-conferring mutations within known biological pathways. The intuition is that identifying the accumulation of small genetic effects acting in a common pathway is easier than mapping the individual genes within the pathway that contribute to disease susceptibility.
We asked whether identifying these enriched pathways can also give us useful feedback about the individual gene variants associated with disease. To answer this question, we developed a statistical method that adjusts the support for disease-variant associations to reflect enrichment of associations in a pathway. Our approach was to introduce an enrichment parameter that quantifies the increase in the probability that each variant in the pathway is associated with disease risk.
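To make the role of the enrichment parameter concrete, here is a minimal sketch in Python. It is not the model from the paper, which fits many variants jointly. It treats a single variant in isolation and assumes the prior log-odds of association take the form theta0 + theta * a, where a is 1 for variants inside the pathway and 0 otherwise. The function names and all numerical values (theta0, theta, and the Bayes factor) are hypothetical, chosen only for illustration.

```python
def prior_inclusion_prob(theta0, theta, in_pathway):
    """Prior probability that a variant is associated with disease.

    Assumed parameterization (illustrative only):
    log10(prior odds) = theta0 + theta * a, with a = 1 inside the pathway.
    """
    log10_odds = theta0 + (theta if in_pathway else 0.0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)


def posterior_inclusion_prob(prior_prob, bayes_factor):
    """Combine the prior with a single-variant Bayes factor: posterior odds = prior odds * BF."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1.0 + post_odds)


# Hypothetical values, not estimates from any study.
theta0 = -4.0   # genome-wide prior odds of about 1 in 10,000
theta = 2.0     # enrichment: 100-fold higher prior odds inside the pathway
bf = 1000.0     # Bayes factor for one variant (support from the data alone)

p_out = prior_inclusion_prob(theta0, theta, in_pathway=False)
p_in = prior_inclusion_prob(theta0, theta, in_pathway=True)
post_out = posterior_inclusion_prob(p_out, bf)
post_in = posterior_inclusion_prob(p_in, bf)

print(f"outside pathway: prior {p_out:.2e}, posterior {post_out:.3f}")
print(f"inside pathway:  prior {p_in:.2e}, posterior {post_in:.3f}")
```

Under these assumed numbers, the same strength of evidence from the data yields a much higher posterior probability of association for a variant inside the enriched pathway than for one outside it, which is the sense in which enrichment "adjusts the support" for individual variants.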
Is this a valid approach? To investigate, we applied our approach to data from the Wellcome Trust Crohn’s disease study from 2007. First, we identified a broad class of cytokine signaling genes that were enriched for genetic associations with Crohn’s disease. Next, by prioritizing variants in this pathway, we discovered candidates for association—including the STAT3 gene, the IBD5 locus, and the MHC class II genes—that were not identified in conventional analyses of the same data. These results help validate our approach, as these genetic associations have been independently confirmed in other studies and meta-analyses with much larger combined samples.
Several other important lessons emerged from our case study:
1. Interrogate as many pathways as possible. Because we collected over 3000 candidate pathways from several sources (Reactome, KEGG, BioCarta, BioCyc, etc.), many of the pathways highlighted in previous analyses of the same data were eclipsed by much stronger enrichment signals in our analysis.
2. Assess evidence for combinations of enriched pathways. Some pathways become interesting only after assessing enrichment of the pathway in combination with another pathway.
3. Account for the heterogeneity of effect sizes in Crohn’s disease. One of the assumptions we made in our analysis, mainly out of convenience, was that the additive effects on disease risk are normally distributed. While this assumption simplified this analysis, we suspect that a normal distribution does not adequately capture the smaller effect sizes in pathways, leading to a loss of power to detect enriched pathways.
At conferences, and around the lab, I’ve heard many complaints about pathway analysis (or gene set enrichment analysis) for genome-wide association studies. One complaint is that the results are difficult to interpret. Another common complaint is that the findings are sensitive to arbitrary significance thresholds. While we didn’t devote much space in the paper to a discussion of these issues, we believe that our approach offers a coherent solution to many of these problems.
Ultimately, we would like other researchers to use our methods to analyze data from their own genome-wide association studies. We tried to make our paper as accessible as possible, especially to biologists who are not well acquainted with Bayesian approaches, by carefully explaining how to interpret the Bayes factors and posterior statistics used in the analysis. We are working on releasing the full source code (in R and MATLAB) for all our methods, along with accompanying documentation.
Peter Carbonetto