Our paper: Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

[This author post is by Peter Carbonetto on Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease, available from the arXiv here.]

I expect that most readers of this blog appreciate the impact that genome-wide association studies have had on our understanding of many common diseases. Still, I think it is important to reiterate a major appeal of genome-wide association studies: the analysis is conceptually straightforward to understand, even for people who have never had to suffer through a course on statistics or epidemiology. To find links between genetic loci and disease, the analysis consists of systematically searching across the genome for variants that show statistically significant correlation with susceptibility to disease. These correlations signal the presence of nearby genes—or perhaps DNA elements that regulate other genes—that are risk factors for disease.

Many readers of this blog will also appreciate, due to the multifactorial nature of most common diseases, the difficulty of establishing compelling evidence for disease-variant correlations. Hence the search for more effective data-driven strategies for discovering genetic factors underlying common diseases.

One strategy is to assess evidence for the accumulation, or “enrichment,” of disease-conferring mutations within known biological pathways. The intuition is that identifying the accumulation of small genetic effects acting in a common pathway is easier than mapping the individual genes within the pathway that contribute to disease susceptibility.

We asked whether identifying these enriched pathways can also give us useful feedback about the individual gene variants associated with disease. To answer this question, we developed a statistical method that adjusts the support for disease-variant associations to reflect enrichment of associations in a pathway. Our approach was to introduce an enrichment parameter that quantifies the increase in the probability that each variant in the pathway is associated with disease risk.
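To make the idea of an enrichment parameter concrete, here is a minimal sketch rather than the implementation from the paper. It assumes the prior odds of association are expressed on a log10 scale, with a baseline term for every variant and an additive boost for variants inside the pathway. The function name `prior_inclusion_prob`, the parameter names `theta0` and `theta`, and the numeric values are all hypothetical and chosen only for illustration.

```python
def prior_inclusion_prob(theta0, theta, in_pathway):
    """Prior probability that a variant is associated with disease.

    theta0: baseline log10 prior odds of association (illustrative assumption)
    theta: enrichment parameter, the increase in log10 prior odds
           for variants assigned to the pathway (illustrative assumption)
    in_pathway: True if the variant is assigned to the pathway
    """
    log10_odds = theta0 + (theta if in_pathway else 0.0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)


# Hypothetical values: baseline odds of 1 in 10,000, and a
# 100-fold increase in odds for variants inside the pathway.
baseline = prior_inclusion_prob(-4.0, 2.0, in_pathway=False)
enriched = prior_inclusion_prob(-4.0, 2.0, in_pathway=True)
print(f"outside pathway: {baseline:.6f}")
print(f"inside pathway:  {enriched:.6f}")
```

With an enrichment parameter of zero the two probabilities coincide, which corresponds to a pathway that carries no extra information about which variants are associated.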

Is this a valid approach? To investigate, we applied our approach to data from the Wellcome Trust Crohn’s disease study from 2007. First, we identified a broad class of cytokine signaling genes that were enriched for genetic associations with Crohn’s disease. Next, by prioritizing variants in this pathway, we discovered candidates for association—including the STAT3 gene, the IBD5 locus, and the MHC class II genes—that were not identified in conventional analyses of the same data. These results help validate our approach, as these genetic associations have been independently confirmed in other studies and meta-analyses with much larger combined samples.

Several other important lessons emerged from our case study:

1. Interrogate as many pathways as possible. Because we collected over 3000 candidate pathways from several sources (Reactome, KEGG, BioCarta, BioCyc, etc.), many of the pathways highlighted in previous analyses of the same data were eclipsed by much stronger enrichment signals in our analysis.

2. Assess evidence for combinations of enriched pathways. Some pathways become interesting only after assessing enrichment of the pathway in combination with another pathway.

3. Account for the heterogeneity of effect sizes in Crohn’s disease. One of the assumptions we made in our analysis, mainly out of convenience, was that the additive effects on disease risk are normally distributed. While this assumption simplified this analysis, we suspect that a normal distribution does not adequately capture the smaller effect sizes in pathways, leading to a loss of power to detect enriched pathways.

At conferences, and around the lab, I’ve heard many complaints about pathway analysis (or gene set enrichment analysis) for genome-wide association studies. One complaint is that the results are difficult to interpret. Another common complaint is that the findings are sensitive to arbitrary significance thresholds. While we didn’t devote much space in the paper to a discussion of these issues, we believe that our approach offers a coherent solution to many of these problems.

Ultimately, we would like other researchers to use our methods to analyze data from their own genome-wide association studies. We tried to make our paper as accessible as possible, especially to biologists who are not well acquainted with Bayesian approaches, by carefully explaining how to interpret the Bayes factors and posterior statistics used in the analysis. We are working on releasing the full source code (in R and MATLAB) for all our methods, and accompanying documentation.

Peter Carbonetto


Our paper: A genetic variant near olfactory receptor genes influences cilantro preference

For our next guest post, Nick Eriksson (@nkeriks) writes about his arXived paper with other 23andMe folks: A genetic variant near olfactory receptor genes influences cilantro preference, arXived here.

First, a little background about research at 23andMe. We have over 150,000 genotyped customers, a large proportion of whom answer surveys online. We run GWAS on pretty much every trait you can think of (at least everything that is easily reported and possibly related to genetics). Around 2010, we started to ask a couple of questions about cilantro: if people like it, and if they perceive a soapy taste to it.

Fast forward a couple of years, and we have tens of thousands of people answering these questions. We start to see an interesting finding: one SNP significantly associated with both cilantro dislike and perceiving a soapy taste. Best of all, it was in a cluster of olfactory receptor genes.

The sense of smell is pretty cool. Humans have hundreds of olfactory receptor genes that encode G protein-coupled receptors. We perceive smells due to the binding of specific chemicals (“odorants”) to these receptors. There are maybe 1000 total olfactory receptors in various mammalian genomes, but it’s not totally clear which are pseudogenes. There has probably been some loss of these genes in humans as our sense of smell has become less critical. These genes appear in clusters in the genome, which makes it pretty hard for GWAS to pick out a specific gene. For example, in the first 23andMe paper, we identified a variant in a different cluster of olfactory receptors that affected whether you perceive a certain smell in your urine after eating asparagus. However, we still don’t know what the true functional variant in that region is.

Luckily, one of the olfactory receptors near our cilantro SNP turns out to be very well studied. It is known to bind to about 30 different aldehydes, including some of the chemicals that give cilantro its famous odor. So at the core this is a pretty simple paper. We found one significant association; it has as good of a functional story as you’ll see in nearly any GWAS. There are a couple of complications, however. First, we studied two related traits: soapy taste detection and cilantro dislike. They’re relatively correlated (r^2 about 0.33), and they are both associated with the same SNP. It looks like the association is stronger with soapy taste detection (and this trait seemed like it would be less influenced by environment than cilantro dislike), so we used soapy taste as the main phenotype.

The second complication is our heritability calculation. We saw about 9% heritability (tagged by the SNPs on our array). However, the confidence interval was pretty huge (-3% to 21%). Roughly, you could think of things falling into three heritability classes: high (height, celiac, type 1 diabetes), medium (type 2 diabetes, Crohn’s) and low (lung, colorectal, and maybe breast cancer). I think that’s about as accurate as the current heritability numbers can get. Our calculation puts cilantro soapy-taste detection into the low heritability group. There is the complication that this is only additive heritability tagged by common SNPs, so this phenotype could actually be very heritable, with most of the action coming from rare variants. But in my opinion, that’s doubtful.
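As a rough back-of-the-envelope check on how wide that interval is, the sketch below recovers the standard error it implies. This rests on an assumption the post does not state: that the interval is a symmetric 95% interval built from a normal approximation (estimate plus or minus 1.96 standard errors). The function name `implied_standard_error` is hypothetical.

```python
def implied_standard_error(lower, upper, z=1.96):
    """Standard error implied by a symmetric interval [lower, upper].

    Assumes the interval is estimate +/- z * SE. Using z = 1.96 for a
    95% interval is an assumption, not something stated in the post.
    """
    return (upper - lower) / (2.0 * z)


estimate = 0.09                # heritability estimate quoted in the post
lower, upper = -0.03, 0.21     # interval quoted in the post

se = implied_standard_error(lower, upper)
midpoint = (lower + upper) / 2.0
print(f"interval midpoint: {midpoint:.2f}")
print(f"implied standard error: {se:.3f}")
```

Under that assumption, the standard error is about two thirds the size of the estimate itself, which is why the interval includes zero.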

Coming out of mathematics, I’ve always posted my papers to preprint servers. Luckily, this fits in well with 23andMe’s mission of making research faster, more participatory, and more fun. We’ve published all our papers so far in open access journals and have posted a couple of them to Nature Precedings (before it shut down). I also write everything in LaTeX, so posting to the arXiv is a refreshing change (as compared to most biology journals where you have to undergo a conversion from LaTeX to Word that makes everything look terrible (a particular pet peeve of mine with PLOS journals, which I otherwise love)).

I’m very curious to see how posting to the arXiv will affect publicity. Our papers tend to get a fair bit of press. However, I don’t know how the press will deal with one opportunity to report on the paper now (when the results are fresh and novel, but published on a site reporters will mostly not know about) and then another opportunity when the paper gets “blessed” via peer review. Because most of our papers are relatively straightforward GWAS (and we have a lot of coauthors here who have read and written a huge number of such papers), I think getting the data out on a preprint server is particularly important. However, we really need a Genetics category in q-bio!

Feedback on the paper would be most welcome. I’d love to see a replication or a nice functional study to follow up, of course. I also think this is a good example for teaching people about genetics. A number of the issues that come up in this paper are a little tricky, but are good examples for understanding how difficult it is to predict something based on genetics. On the technical side, I’m most curious if there are methods that might give a nice way of analyzing these two correlated traits together. We’ve tried a few regression-based approaches for this sort of problem, but haven’t thought of anything entirely satisfactory.

Nick Eriksson

Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

Peter Carbonetto, Matthew Stephens
(Submitted on 21 Aug 2012)

Many common diseases are highly polygenic, modulated by a large number of genetic factors with small effects on susceptibility to disease. These small effects are difficult to map reliably in genetic association studies. To address this problem, researchers have developed methods that aggregate information over sets of related genes, such as biological pathways, to identify gene sets that are enriched for genetic variants associated with disease. However, these methods fail to answer a key question: which genes and genetic variants are associated with disease risk? We develop a method based on sparse multiple regression that simultaneously identifies enriched pathways, and prioritizes the variants within these pathways, to locate additional variants associated with disease susceptibility. A central feature of our approach is an estimate of the strength of enrichment, which yields a coherent way to prioritize variants in enriched pathways. We illustrate the benefits of our approach in a genome-wide association study of Crohn’s disease with ~440,000 genetic variants genotyped for ~4700 study subjects. We obtain strong support for enrichment of IL-12, IL-23 and other cytokine signaling pathways. Furthermore, prioritizing variants in these enriched pathways yields support for additional disease-association variants, all of which have been independently reported in other case-control studies for Crohn’s disease.