Are all genetic variants in DNase I sensitivity regions functional?

Gregory A Moyerbrailean, Chris T Harvey, Cynthia A Kalita, Xiaoquan Wen, Francesca Luca, Roger Pique-Regi

A detailed mechanistic understanding of the direct functional consequences of DNA variation on gene regulatory mechanism is critical for a complete understanding of complex trait genetics and evolution. Here, we present a novel approach that integrates sequence information and DNase I footprinting data to predict the impact of a sequence change on transcription factor binding. Applying this approach to 653 DNase-seq samples, we identified 3,831,862 regulatory variants predicted to affect active regulatory elements for a panel of 1,372 transcription factor motifs. Using QuASAR, we validated the non-coding variants predicted to be functional by examining allele-specific binding (ASB). Combining the predictive model and the ASB signal, we identified 3,217 binding variants within footprints that are significantly imbalanced (20% FDR). Even though most variants in DNase I hypersensitive regions may not be functional, we estimate that 56% of our annotated functional variants show actual evidence of ASB. To assess the effect these variants may have on complex phenotypes, we examined their association with complex traits using GWAS and observed that ASB-SNPs are enriched 1.22-fold for complex traits variants. Furthermore, we show that integrating footprint annotations into GWAS meta-study results improves identification of likely causal SNPs and provides a putative mechanism by which the phenotype is affected.

4 thoughts on “Are all genetic variants in DNase I sensitivity regions functional?”

Thanks to the authors for posting this preprint, looks pretty exciting. A few small comments/questions, mostly on the GWAS results:

1. On pg. 5, you all describe 653 publicly-available samples, but on pg. 6 you describe 316 samples used after filtering. Does this suggest that half of the samples have issues that led you to filter them out?

2. When you do the fgwas analysis, are you using the tissue information, or just the TF binding site information? Ie. when you see HNF4A binding sites enriched for LDL hits, are those HNF4A sites identified in the liver or any tissue?

3. The LDL example, where the low binding allele is associated with increased LDL, seems to imply that decreased expression of the relevant gene is associated with increased LDL. Do you know the relevant gene in the region, and if so, does this make sense?

4. In the height example, you write that the variant you describe (rs4519508) is not “genome-wide significant”. What is the P-value for this SNP?

5. Also, for this SNP, you should be able to get its effect size even if it’s imputed. I think you have the imputed Z-score and the SE of the effect size, so Z*SE = effect size.

6. I think the y-axis in Figure 5C and 5D are mislabeled–those don’t look like -log10(BFs), but maybe -log10(P) or log(BF). If you’re plotting the logBF column from fgwas, that’s log (base e) of the Bayes factor for association. Also, in the legend you call this the “prior association”, when I think it’s simply the Bayes factor.

Thanks again for posting, this seems like a really useful approach.

Reply ↓

Roger Pique (@rogerpique) on August 6, 2014 at 4:27 pm said:

Thank you very much for your comments Joe. Here, we will try to answer your questions:

Q1. On pg. 5, you all describe 653 publicly-available samples, but on pg. 6 you describe 316 samples used after filtering. Does this suggest that half of the samples have issues that led you to filter them out?

A1) Yes, a large number of ENCODE samples were discarded because they were not diploid, with large copy number alterations or loss of heterozygosity events. Some samples were also discarded because they represented pools of more than one individual (see Supplementary Section 4.2).

Q2. When you do the fgwas analysis, are you using the tissue information, or just the TF binding site information? Ie. when you see HNF4A binding sites enriched for LDL hits, are those HNF4A sites identified in the liver or any tissue?

A2) Only the TF motif information was used for the fgwas analysis. However, when a factor is tissue specific it shows footprints only in a subset of tissues. For example, footprints for HNF4 motif were detected in any of the 178 tissues where this TF is active. The overwhelming majority of these tissues and cell-lines (62%) are from an expected origin: stomach (13), small and large intestine (29), kidney (25), renal pelvis and cortex (39), and liver (5). However, we also found HNF4A footprints in fetal lung, placenta, trophoblasts and stem cells.

Q3. The LDL example, where the low binding allele is associated with increased LDL, seems to imply that decreased expression of the relevant gene is associated with increased LDL. Do you know the relevant gene in the region, and if so, does this make sense?

A3) The closest gene to rs532436 is ABO, in the previous GTEX release it was identified as an eQTL in whole blood for SLC2A6 (GLUT6). With the newest release, it seems to now have a nominal p-value of 5E-10 for both ABO, and also SLC2A6, but with opposite effects.
http://www.gtexportal.org/home/eqtls/calc?tissueName=Whole_Blood&geneId=SLC2A6&snpId=rs532436
http://www.gtexportal.org/home/eqtls/calc?tissueName=Whole_Blood&geneId=ABO&snpId=rs532436
We are not sure how far away we can take the interpretation of these eQTL signals without further experimental validation.

Q4. In the height example, you write that the variant you describe (rs4519508) is not “genome-wide significant”. What is the P-value for this SNP?

A4) The z-score is -4.31 which should correspond to a p-value of 8.1E-6.

Q5. Also, for this SNP, you should be able to get its effect size even if it’s imputed. I think you have the imputed Z-score and the SE of the effect size, so Z*SE = effect size.

A5) Yes, thank you for pointing this out. We would need the units of the standard error, but (Z= -4.31, Var= 2.6E-5) the effect would be -0.021 decrease on units of height. If the effect size is with respect the number of copies of the alternate allele, then a decrease in binding leads to a decrease in height.

Q6. I think the y-axis in Figure 5C and 5D are mislabeled–those don’t look like -log10(BFs), but maybe -log10(P) or log(BF). If you’re plotting the logBF column from fgwas, that’s log (base e) of the Bayes factor for association. Also, in the legend you call this the “prior association”, when I think it’s simply the Bayes factor.

A6) Thank you for pointing this out. Yes, these should have been labeled “ln(BFs)” using base “e”. We will revise this.

Thank you very much again for your comments. We hope the answers we provide address them and we will use them to revise the manuscript.

Reply ↓
- Joe Pickrell on August 6, 2014 at 4:49 pm said:
  
  Thanks Roger!

Pingback: Most viewed on Haldane’s Sieve: August 2014 | Haldane's Sieve

Haldane's Sieve

Discussing preprints in population and evolutionary genetics

Are all genetic variants in DNase I sensitivity regions functional?

4 thoughts on “Are all genetic variants in DNase I sensitivity regions functional?”

Leave a comment Cancel reply

Share this:

Related

4 thoughts on “Are all genetic variants in DNase I sensitivity regions functional?”

Leave a comment Cancel reply