Our paper “Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing” has been published in the journal Biostatistics, and the preprint on arXiv has been updated. The paper discusses linear mixed models (LMM) and the models commonly used for SNP set testing (of rare variants) in a unified Bayesian framework, where fixed and random effects are naturally represented by corresponding prior distributions. Based on this general Bayesian representation, we derive Bayes factors as our primary inference device and demonstrate their use in solving problems of hypothesis testing (e.g., single SNP and SNP set testing) and variable selection (e.g., multiple SNP analysis) in genetic association analysis. Here, we take the opportunity to summarize our main findings for a general audience.
Agreement with Frequentist Inference
We derive several analytic approximations of the Bayes factor from the unified Bayesian model and find that they are connected to commonly used frequentist test statistics, namely the Wald and score statistics in LMM and the variance component score statistic in SNP set testing.
In the case of LMM-based single SNP testing, we find that under a specific prior specification of genetic effects, the approximate Bayes factors become monotonic transformations of the Wald or score test statistics (and hence of their corresponding p-values) obtained from LMM. This connection is very similar to what Wakefield (2009) reported in the context of simple logistic regression. It should be noted that this specific prior specification (following Wakefield (2009), we call it the implicit p-value prior) essentially assumes a larger a priori effect for SNPs that are less informative (because of either a smaller sample size or a lower minor allele frequency). Although, from the Bayesian point of view, there seems to be no proper general justification for such prior assumptions, we find that the overall effect of the implicit p-value prior on the final inference is often negligible in practice, especially when the sample size is large (we demonstrate this point with numerical experiments in the paper).
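To make the monotonic relationship concrete, here is a minimal Python sketch based on the approximate Bayes factor in Wakefield's form, for a single effect estimate with a zero-mean normal prior. It is an illustration only and not the LMM derivation in our paper: the function names and the scale constant `k` are hypothetical, and the Bayes factor is oriented as alternative versus null. Setting the prior variance to `k` times the squared standard error plays the role of the implicit p-value prior, so the result depends on the data only through the z-score.

```python
import math


def wakefield_abf(beta_hat, se, prior_var):
    """Approximate Bayes factor (alternative vs. null) for one effect estimate.

    beta_hat: estimated effect, se: its standard error,
    prior_var: variance of the zero-mean normal prior on the true effect.
    """
    v = se ** 2
    z2 = (beta_hat / se) ** 2
    shrink = prior_var / (v + prior_var)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)


def abf_implicit_pvalue_prior(z, k=1.0):
    """Same Bayes factor when prior_var = k * se**2 (k is an assumed constant).

    The standard error cancels, leaving an increasing function of z**2.
    """
    shrink = k / (1.0 + k)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z * z * shrink)
```

Under this prior, ranking SNPs by Bayes factor gives the same order as ranking them by |z| (equivalently, by p-value). Note that a less informative SNP (larger `se`) is given a larger prior variance `k * se**2`, which is exactly the feature of the implicit p-value prior discussed above.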
For SNP set testing, we show that the variance component score statistic in the popular SKAT model (Wu et al., 2011) is also monotonically related to the approximate Bayes factor in our unified model if the prior effect size under the alternative scenario is assumed to be small. Interestingly, such a prior assumption represents a “local” alternative scenario, for which score tests are known to be most powerful.
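A short first-order argument shows why a small prior effect size leads to the score statistic. The sketch below is a simplification and not the derivation in our paper. It assumes a centered quantitative trait $y$ with no covariates, a known error variance $\sigma^2$, a kernel matrix $K$ (for SKAT, $K = G W G'$ with genotype matrix $G$ and diagonal weight matrix $W$), and a prior scale $\tau^2$ on the genetic effects. These symbols are introduced here for illustration only.

```latex
\begin{aligned}
H_0 &: y \sim N(0,\ \sigma^2 I), \qquad
H_1 : y \sim N(0,\ \sigma^2 I + \tau^2 K), \\
\log \mathrm{BF}
 &= -\tfrac{1}{2}\log\left| I + \tfrac{\tau^2}{\sigma^2} K \right|
    + \tfrac{1}{2}\, y' \left[ \tfrac{1}{\sigma^2} I
    - \left(\sigma^2 I + \tau^2 K\right)^{-1} \right] y \\
 &\approx \frac{\tau^2}{2\sigma^4}\left( Q - \sigma^2 \operatorname{tr}(K) \right),
 \qquad Q = y' K y, \quad \text{as } \tau^2 \to 0 .
\end{aligned}
```

For fixed $\tau^2$, $\sigma^2$ and $K$, the leading term is increasing in $Q$, which is the variance component score statistic in this simplified setting.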
The above connections are to be expected: after all, the frequentist models and the Bayesian representation share exactly the same likelihood functions. From the Bayesian point of view, the connections reveal the implicit prior assumptions in frequentist inference. These connections also provide a principled way to “translate” the relevant frequentist statistics/p-values into Bayes factors for fine mapping analysis using Bayesian hierarchical models, as demonstrated by Maller et al. (2012) and Pickrell (2014).
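As an illustration of this “translation”, the following Python sketch turns z-scores for the SNPs in one region into posterior probabilities of being the causal variant, in the spirit of Maller et al. (2012). It rests on assumptions not taken from our paper: exactly one causal SNP in the region, equal prior weight for every SNP, and the Wakefield-style Bayes factor with prior variance equal to a hypothetical constant `k` times the squared standard error. All function names are hypothetical.

```python
import math


def log_abf_from_z(z, k=1.0):
    """Log approximate Bayes factor (alternative vs. null) from a z-score."""
    shrink = k / (1.0 + k)
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink


def single_causal_posteriors(z_scores, k=1.0):
    """Posterior probability that each SNP is the single causal variant."""
    log_bfs = [log_abf_from_z(z, k) for z in z_scores]
    top = max(log_bfs)  # subtract the maximum to avoid overflow in exp
    weights = [math.exp(v - top) for v in log_bfs]
    total = sum(weights)
    return [w / total for w in weights]


def credible_set(posteriors, coverage=0.95):
    """Smallest set of SNP indices whose posterior mass reaches `coverage`."""
    order = sorted(range(len(posteriors)),
                   key=lambda i: posteriors[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += posteriors[i]
        if mass >= coverage:
            break
    return chosen
```

For example, `credible_set(single_causal_posteriors([1.0, 5.2, 4.9, -0.3]))` keeps the two SNPs with the strongest signals, since together they carry almost all of the posterior mass.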
Advantages of Bayesian Inference
Bayesian Model Averaging
Bayesian model averaging allows simultaneous modeling of multiple alternative scenarios that may not be nested or compatible with each other. One interesting example arises in rare variant SNP set testing, where two primary classes of competing models target complementary alternative scenarios: the burden model (which assumes that most rare causal variants in a SNP set are either consistently deleterious or consistently protective) and the SKAT model (which assumes that the rare causal variants in a SNP set can have bi-directional effects). In our Bayesian framework, we show that these two types of alternative models correspond to different prior specifications, and a Bayes factor jointly considering the two types of models can be computed trivially by model averaging. A frequentist approach, SKAT-O, proposed by Lee et al. (2012), achieves a similar goal by using a mixture kernel (or a mixture prior, from the Bayesian perspective). We discuss the subtle theoretical difference between Bayesian model averaging and the use of the SKAT-O prior, and we show by simulation that the two approaches have similar performance. Moreover, we find that Bayesian model averaging is more general and flexible. To illustrate this, we demonstrate a Bayesian SNP set testing example where three categories of alternative scenarios are jointly considered: in addition to the two aforementioned rare SNP association models, a common SNP association model is also included in the averaging. Such an application can be useful in eQTL studies for identifying genes harboring cis-eQTLs.
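The averaging step itself is simple: when every alternative model is compared against the same null, the averaged Bayes factor is the prior-weighted mean of the individual Bayes factors. The Python sketch below shows this on the log scale. The function name, the weights, and the example values are hypothetical and are not taken from the paper.

```python
import math


def averaged_log_bf(log_bfs, weights):
    """Log of the model-averaged Bayes factor, sum_k weights[k] * BF_k.

    log_bfs: log Bayes factors of each alternative model vs. a common null.
    weights: prior probabilities of the alternative models (must sum to 1).
    """
    if len(log_bfs) != len(weights):
        raise ValueError("log_bfs and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Work on the log scale and drop zero-weight models.
    terms = [math.log(w) + v for v, w in zip(log_bfs, weights) if w > 0]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical example: burden-type and SKAT-type alternatives, equal weights.
log_bf_avg = averaged_log_bf([math.log(2.0), math.log(8.0)], [0.5, 0.5])
```

Adding a third category of alternatives, such as a common SNP association model, only requires one more log Bayes factor and one more weight.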
Prior Specification for Genetic Effects
The explicit specification of the prior distributions on genetic effects for alternative models is seemingly a distinct feature of Bayesian inference. However, as we have shown, even the most commonly applied frequentist test statistics can be viewed as resulting from some implicit Bayesian priors. Therefore, it is only natural to regard the prior specification as an integral component of modeling alternative scenarios. Many authors have shown that it is effective to incorporate functional annotations into genetic association analysis through prior specifications. In addition, we show that in many practical settings, the desired priors can be adequately “learned” from the data with the help of Bayes factors.
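As one simple illustration of learning a prior from data with Bayes factors, the Python sketch below estimates the proportion of associated SNPs (or SNP sets) by an EM algorithm applied to a two-group mixture. This is a generic empirical Bayes device and not necessarily the procedure used in the paper. The function name and the default settings are hypothetical, and the Bayes factors are assumed to be strictly positive and oriented as alternative versus null.

```python
def estimate_assoc_proportion(bayes_factors, pi_init=0.5,
                              max_iter=500, tol=1e-10):
    """EM estimate of the prior probability that a unit is associated.

    Each unit's marginal likelihood is proportional to pi * BF + (1 - pi).
    """
    if not bayes_factors or any(bf <= 0 for bf in bayes_factors):
        raise ValueError("bayes_factors must be non-empty and positive")
    pi = pi_init
    for _ in range(max_iter):
        # E-step: posterior probability of association for each unit.
        posts = [pi * bf / (pi * bf + (1.0 - pi)) for bf in bayes_factors]
        # M-step: the new proportion is the average posterior probability.
        new_pi = sum(posts) / len(posts)
        if abs(new_pi - pi) < tol:
            return new_pi
        pi = new_pi
    return pi
```

The estimated proportion can then be used as the prior weight when converting each Bayes factor into a posterior probability of association.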
Multiple SNP Association Analysis
Building on the Bayes factor results, we demonstrate an example of multiple SNP fine mapping analysis via Bayesian variable selection in the context of LMM. The advantages of Bayesian variable selection and its comparison to the popular conditional analysis approach are discussed thoroughly in another recent paper of ours (Wen et al., 2014).
If single SNP/SNP set association testing is the endpoint of the analysis, the Bayesian and the commonly applied frequentist approaches yield similar results with very little practical difference. However, going beyond simple hypothesis testing in genetic association analysis, we believe that Bayesian approaches possess many unique advantages and are conceptually simple to apply in rather complicated practical settings.
The software/scripts and the simulated and real data sets used in the paper are publicly available on GitHub.
1. Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genetic Epidemiology, 33(1), 79-86.
2. Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1), 82-93.
3. Maller, J. B., McVean, G., Byrnes, J., Vukcevic, D., Palin, K., Su, Z., Wellcome Trust Case Control Consortium, et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature Genetics, 44(12), 1294-1301.
4. Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. The American Journal of Human Genetics, 94(4), 559-573.
5. Lee, S., Wu, M. C., & Lin, X. (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13(4), 762-775.
6. Wen, X., Luca, F., & Pique-Regi, R. (2014). Cross-population meta-analysis of eQTLs: fine mapping and functional study. bioRxiv, 008797.