This guest post is by Olly Burren and Chris Wallace on their preprint, VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes, arXived here.
The idea for this paper came from reading a study by Liu et al. (http://www.sciencedirect.com/science/article/pii/S0002929710003125) and from the fact that summary p values from genome-wide association studies (GWAS) are increasingly becoming publicly available. In the field of human disease, GWAS have been very successful in isolating regions of the genome that confer disease susceptibility. The next step, however, is to understand mechanistically how variation in these loci gives rise to this susceptibility. There are a myriad of pre-existing methods for integrating genetic and genomic datasets; however, things are complicated by the high degree of linkage disequilibrium between SNPs, which causes substantial inflation in the variance of any test statistic. This inter-SNP correlation must be taken into account, classically by permuting case/control status and recomputing association, which requires access to raw genotyping data. Indeed, this approach was taken in our previously published method (see Heinig et al., http://www.nature.com/nature/journal/v467/n7314/full/nature09386.html), which uses a non-parametric test to compare the distribution of GWAS p values from two sets of SNPs (“test” and “control”).
As most researchers working with GWAS know, gaining access to raw genotyping data is often difficult, and it is then unclear how to include meta-analysis and imputed data. Liu et al. got around this with their method, VEGAS, by estimating the inter-SNP correlation from public datasets and sampling from a multivariate normal distribution to generate simulated p values, analogous to the permuted p values obtained by permuting phenotype status when raw data are available. VEGAS uses genotype data publicly available through the International HapMap project and aims to integrate GWAS results with trans eQTLs to identify causal disease genes.
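To make the simulation idea concrete, here is a minimal sketch of drawing correlated null p values from a multivariate normal distribution. It is an illustration under stated assumptions, not the VEGAS or VSEAMS code: the function names, the toy 3-SNP correlation matrix `ld` and the use of two-sided p values are all hypothetical choices made for the example.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def simulate_null_p_values(ld_matrix, n_sims, seed=0):
    """Draw n_sims vectors of null p values whose z scores follow N(0, ld_matrix)."""
    rng = random.Random(seed)
    lower = cholesky(ld_matrix)
    n = len(ld_matrix)
    replicates = []
    for _ in range(n_sims):
        # Independent standard normals, then correlate them with the Cholesky factor.
        indep = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(lower[i][k] * indep[k] for k in range(i + 1)) for i in range(n)]
        # Two-sided p value for each z score: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
        replicates.append([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return replicates


# Hypothetical correlation matrix: SNPs 0 and 1 are in strong LD, SNP 2 is nearly independent.
ld = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
null_p = simulate_null_p_values(ld, n_sims=2000)
```

Each row of `null_p` plays the role of one permutation replicate: the p values of SNPs in strong LD move together, as they would if case/control labels had been shuffled in the raw data. A real correlation matrix estimated from reference genotypes may not be positive definite, in which case it would need regularising before the Cholesky step.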
Our thought was that by combining our previously published method with the VEGAS approach, we could create a novel approach that would allow the integration of genetic information from GWAS with functional information from, for example, a set of microarray experiments, crucially without the need for raw genotype data. The rationale is that this would help to prioritise future mechanistic studies, which can be costly and time-consuming to conduct. We also upped the stakes and decided to use 1000 Genomes Project genotype data for our estimates, to allow application to dense-genotyping technologies. The result was a software pipeline that takes as input a gene set of interest, a matched ‘control’ set and a set of summary GWAS statistics, and computes an enrichment score.
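One way such an enrichment score could be computed is sketched below. This is an assumption-laden illustration rather than the pipeline's actual implementation: it takes the non-parametric test to be a Wilcoxon rank-sum statistic, calibrates it against simulated null replicates, and uses independent uniform p values as a stand-in for the LD-aware simulations described earlier in the post. The names `rank_sum` and `enrichment_z` and all of the toy data are hypothetical.

```python
import random
import statistics


def rank_sum(test_p, control_p):
    """Wilcoxon rank-sum statistic: sum of the ranks of test_p in the pooled sample.

    Tied values receive the average (mid) rank.
    """
    pooled = [(p, True) for p in test_p] + [(p, False) for p in control_p]
    pooled.sort(key=lambda pair: pair[0])
    total = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        total += midrank * sum(1 for k in range(i, j) if pooled[k][1])
        i = j
    return total


def enrichment_z(observed_p, null_replicates, test_idx, control_idx):
    """Standardise the observed rank sum against the simulated null distribution."""
    def statistic(p_values):
        return rank_sum([p_values[i] for i in test_idx],
                        [p_values[i] for i in control_idx])

    w_null = [statistic(rep) for rep in null_replicates]
    return (statistic(observed_p) - statistics.mean(w_null)) / statistics.stdev(w_null)


# Toy data: SNPs 0-9 belong to the test set, SNPs 10-19 to the control set.
test_idx = list(range(10))
control_idx = list(range(10, 20))
observed = [0.001 * (i + 1) for i in range(10)] + [0.1 + 0.08 * i for i in range(10)]

# Placeholder null: independent uniform p values (a real run would use LD-aware draws).
rng = random.Random(1)
null_replicates = [[rng.random() for _ in range(20)] for _ in range(500)]

z = enrichment_z(observed, null_replicates, test_idx, control_idx)
```

With this sign convention a strongly negative `z` means the test set carries smaller p values than the control set, that is, it is enriched for association signal. Estimating the null spread from simulations rather than from the textbook Wilcoxon variance is the point of the exercise, because correlation between SNPs inflates that variance.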
Note that this approach differs from the Bayesian model suggested by Pickrell (https://haldanessieve.org/2013/12/16/author-post-joint-analysis-of-functional-genomic-data-and-genome-wide-association-studies-of-18-human-traits), as it focuses on comparing broad regions rather than on more targeted genomic annotation. In that sense it is perhaps more akin to pathway analysis, although we do suggest that functionally defined gene sets, such as those found by knock-down experiments in cell lines, may be more productive than manually annotated pathways, whose completeness can vary considerably.
To illustrate the method, we applied it to a large GWAS meta-analysis of type 1 diabetes (8000 cases vs 8000 controls) and to an interesting dataset examining the effect on gene expression of knocking down a series of 59 transcription factors in a lymphoblastoid cell line (see Cusanovich et al., http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004226). We identified three transcription factors, IKZF3, BATF and ESRRA, whose putative targets are significantly enriched for variation associated with type 1 diabetes susceptibility. IKZF3 overlaps a known type 1 diabetes susceptibility region, whereas BATF and ESRRA overlap other autoimmune susceptibility regions, which supports our approach. Of course there are caveats in interpreting results derived from cell lines, but we think it’s promising that our top hit lies in a region already associated with type 1 diabetes susceptibility.
Once enrichment is detected, we use the quantities already computed to prioritise genes within the set with a simple technique. This generates a succinct list of the genes responsible for the enrichment detected at the global level. Cross-referenced with other information, these can either be informative in their own right or be used to inform future studies.
This study is also an example of the preprint process speeding up scientific discovery. We knew about the Cusanovich dataset because they released a preprint on arXiv, which was caught by Haldane’s Sieve (https://haldanessieve.org/2013/10/22/the-functional-consequences-of-variation-in-transcription-factor-binding/) in October 2013. One email, and the authors kindly shared their complete results. Had we waited for it to be published in PLoS Genetics in March 2014, we’d have been five months behind where we are.
The major benefit of the approach is that all of the datasets employed are in the public domain. Our hope is that either this or other methods in the same vein will help to bridge the gap between GWAS and disease mechanisms, ultimately fuelling the development of new therapeutics.