Bayesian test for co-localisation between pairs of genetic association studies using summary statistics
Claudia Giambartolomei (1), Damjan Vukcevic (2), Eric E. Schadt (3), Aroon D. Hingorani (1), Chris Wallace (4), Vincent Plagnol (1) ((1) University College London (UCL), London, UK, (2) Royal Children’s Hospital, Melbourne, Australia, (3) Mount Sinai School of Medicine, New York USA, (4) University of Cambridge, Cambridge, UK)
(Submitted on 17 May 2013)
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. A key feature of the method is the ability to derive the key output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at this http URL). We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including > 100,000 individuals of European ancestry. Our co-localisation results are broadly consistent with the conclusion from the published meta-analysis. Combining all lipid biomarkers, our re-analysis supported 29 out of 38 reported co-localisation results with eQTLs. Two clearly discordant findings (IFT172, CPNE1), as well as multiple new co-localisation results, highlight the value of a formal systematic statistical test. Our findings provide information about the causal gene in associated intervals and have direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
Interesting paper. The method seems solid and there’s a lot of demand for this kind of work. I wonder if the authors will provide an implementation that allows users to run this on their own eQTL data sets?
I was also slightly confused about the priors. “We assigned a prior of 10e-4 for p1 and p2, the probability that a SNP is associated with either of the two traits.” In modern eQTL analyses, a few percent of all variants are significant eQTLs, although the number of causal variants is on the order of the authors’ prior. eQTLs are much, much more common than GWAS associations, so shouldn’t the two priors be different?
The method is implemented in the R package coloc, a revised version of which will appear on CRAN in a few days. The current version can be found on GitHub at https://github.com/chr1swallace/coloc, and this method is in the function coloc.abf().
Vincent and Claudia may wish to comment in more detail on the priors, which are the prior probabilities of a causal association. Personally, I don’t think a causal eQTL for a specific gene is necessarily more common than a GWAS association. It is perhaps even less common if you are looking genome-wide, as many diseases are now known to have 50+ GWAS hits whilst most genes have only a limited number of eQTLs. Looking cis to a gene, perhaps the prior should be larger. Varying the priors would be interesting to explore, and Claudia did some work on varying p12 in the paper. Of course, priors can be individually specified in the software, so you are not restricted by our defaults.
Thank you for your comment, Tuuli. coloc v2.0 should be on CRAN now, and the function coloc.abf is the one that implements the new test.
Now, the question of the prior is always difficult. Here is a quick computation which may be used as a guide: taking the minimac reference files and scaling down to a 400 kb window around a gene (in cis, 200 kb on each side), I find an average of 2,000 variants (SNPs/indels). So a per-variant prior of 10^-4 gives 2,000 x 10^-4 = 0.2 expected causal variants per gene, which means that one out of 5 genes has a cis-eQTL, give or take. That seems about right to me.
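The back-of-envelope calculation above can be checked in a few lines. This is only an illustrative sketch: the variable names are made up here, the two input numbers are the ones quoted in the comment, and the "at least one causal variant" figure additionally assumes that variants are independent, which is a simplification and not something the authors state.

```python
# Inputs quoted in the comment above.
n_variants = 2000   # approximate number of variants in a 400 kb cis window
p_causal = 1e-4     # per-variant prior probability of a causal association

# Expected number of causal variants per gene under this prior.
expected_causal = n_variants * p_causal

# Probability that at least one variant in the window is causal,
# assuming independence between variants (illustrative assumption only).
p_at_least_one = 1 - (1 - p_causal) ** n_variants

print(f"expected causal variants per gene: {expected_causal:.2f}")
print(f"P(at least one causal variant):    {p_at_least_one:.3f}")
```

Both figures land close to the "one out of 5 genes" reading of the prior.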
The important thing to mention is that we applied the method to metaboChip data, that is, to regions already known to be involved in metabolic-type disorders. Hence, 10^-4 is also perhaps OK given the design. But if one were working with a true GWAS discovery-type design, I would probably make the prior for the biomarker/disease stricter, while keeping the eQTL prior as it is.
The prior really matters here. If a region is already known to be disease/biomarker associated, a single SNP with a p-value of 10^-4 for both the eQTL and the lipid trait, say, is quite convincing evidence for a shared causal variant. If you have not seen that region before, you probably should remain unconvinced. Our settings are probably OK for a metaboChip/immunoChip type setup, but the best we can do is let users experiment with these values if the experimental design is not the same.
Hopefully what I wrote makes sense!
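To make the point about priors concrete, here is a minimal Python sketch of how p1, p2 and p12 enter the posterior over the five hypotheses (H0: no association, H1/H2: association with one trait only, H3: two distinct causal variants, H4: one shared causal variant). It is not the coloc package's code. It assumes Wakefield's approximate Bayes factor computed from per-SNP z-scores and standard errors, an effect-size prior standard deviation of 0.15, and at most one causal variant per trait in the region. The function names, z-scores and standard errors are all invented for illustration.

```python
import math

def log_abf(z, se, prior_sd=0.15):
    """Wakefield's approximate log Bayes factor for one SNP (association vs null)."""
    v = se ** 2
    r = prior_sd ** 2 / (prior_sd ** 2 + v)
    return 0.5 * (math.log(1 - r) + r * z ** 2)

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def coloc_posteriors(l1, l2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log ABFs of two traits."""
    s1, s2 = logsumexp(l1), logsumexp(l2)
    s12 = logsumexp([a + b for a, b in zip(l1, l2)])
    # H3 sums ABF1_i * ABF2_j over i != j: (sum1 * sum2) minus the i == j terms.
    ratio = min(math.exp(s12 - s1 - s2), 1.0)
    log_h3 = s1 + s2 + math.log1p(-ratio) if ratio < 1.0 else -math.inf
    log_h = [
        0.0,                                    # H0
        math.log(p1) + s1,                      # H1
        math.log(p2) + s2,                      # H2
        math.log(p1) + math.log(p2) + log_h3,   # H3
        math.log(p12) + s12,                    # H4
    ]
    norm = logsumexp(log_h)
    return [math.exp(h - norm) for h in log_h]

# Hypothetical region of 5 SNPs; the third SNP carries the signal in both traits.
z_eqtl = [0.5, 1.0, 4.5, 1.2, -0.3]
z_lipid = [0.2, 0.8, 4.0, 1.0, 0.1]
l_eqtl = [log_abf(z, se=0.05) for z in z_eqtl]
l_lipid = [log_abf(z, se=0.10) for z in z_lipid]

# Default-style priors versus a stricter prior for the biomarker trait only.
pp_default = coloc_posteriors(l_eqtl, l_lipid)
pp_strict = coloc_posteriors(l_eqtl, l_lipid, p2=1e-5, p12=1e-6)

print("PP(H4), default priors:         ", round(pp_default[4], 3))
print("PP(H4), stricter biomarker prior:", round(pp_strict[4], 3))
```

With identical summary statistics, tightening the biomarker prior (and p12 along with it, since p12 should not exceed p2) lowers the posterior support for a shared causal variant. In a previously unseen region, the same data are less convincing.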