[This post is a commentary by Alkes L. Price on “Polygenic modeling with Bayesian sparse linear mixed models” by Zhou, Carbonetto, and Stephens. The preprint is available on the arXiv here.]
Linear mixed models (LMM) are widely used by geneticists, both for estimating the heritability explained by genotyped markers (h2g) and for phenotypic prediction (Best Linear Unbiased Prediction, BLUP); their application to computing association statistics is outside the focus of the current paper. LMM assume that effect sizes are normally distributed, but this assumption may not hold in practice. Improved modeling of the distribution of effect sizes may lead to more precise estimates of h2g and more accurate phenotypic predictions.
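To make the normality assumption concrete, here is a minimal sketch of BLUP under i.i.d. normal effect sizes, which is equivalent to ridge regression on the markers. It is an illustration only, not code from the paper: the function name `blup_effects`, the simulated genotypes, and the assumption that h2g is already known (in practice it would be estimated, e.g. by REML) are all hypothetical choices made for this example.

```python
import numpy as np

def blup_effects(X, y, h2):
    """BLUP of per-marker effects assuming i.i.d. normal effect sizes.

    X  : (n, m) matrix of standardized genotypes (assumed input format)
    y  : (n,) centered phenotype with total variance roughly 1
    h2 : heritability explained by the markers, assumed known here
    """
    n, m = X.shape
    # Penalty implied by the variance ratio: m * sigma_e^2 / sigma_g^2
    lam = m * (1.0 - h2) / h2
    # Solve the n x n system (X X^T + lam I) alpha = y, then map back to markers
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    return X.T @ alpha

# Toy data simulated under the LMM assumption (normal effect sizes)
rng = np.random.default_rng(0)
n, m, h2 = 200, 500, 0.5
X = rng.standard_normal((n, m))
beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)
y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
y = y - y.mean()

beta_hat = blup_effects(X, y, h2)
y_pred = X @ beta_hat  # in-sample fitted values; real use would predict new individuals
```

Every marker is shrunk by the same penalty, which is the direct consequence of giving all effect sizes a single normal prior.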
Previous work (nicely summarized by the authors in Table 1) has used various mixture distributions to model effect sizes. In the current paper, the authors advocate a mixture of two normal distributions (with independently parametrized variances), and provide a prior distribution for the hyper-parameters of this mixture distribution. This approach has the advantage of generalizing LMM, so that the method produces results similar to LMM when the effect sizes roughly follow a normal distribution. Posterior estimates of the hyper-parameters and effect sizes are obtained via MCMC.
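As a rough illustration of the kind of effect-size distribution described above, the sketch below draws effects from a mixture of two zero-mean normals and compares its tail heaviness with a single normal. The parameter names (`pi`, `sigma_small`, `sigma_large`) and the values used are assumptions for this example, not the paper's notation or settings, and the code only simulates from the mixture; it does not perform the MCMC fitting.

```python
import numpy as np

def sample_effects(m, pi, sigma_small, sigma_large, rng):
    """Draw m effect sizes from a two-component normal mixture.

    With probability pi an effect comes from N(0, sigma_large^2),
    otherwise from N(0, sigma_small^2). Names are illustrative only.
    """
    is_large = rng.random(m) < pi
    sd = np.where(is_large, sigma_large, sigma_small)
    return rng.standard_normal(m) * sd, is_large

def excess_kurtosis(x):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(1)
m = 100_000

# A few large-effect markers on top of many small polygenic effects
sparse_effects, is_large = sample_effects(m, 0.01, 0.01, 0.1, rng)

# Equal variances collapse the mixture to the single normal assumed by LMM
lmm_like_effects, _ = sample_effects(m, 0.01, 0.01, 0.01, rng)

print(excess_kurtosis(sparse_effects), excess_kurtosis(lmm_like_effects))
```

When the two variances coincide, the mixture reduces to a single normal distribution, which is the sense in which the approach generalizes LMM.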
The authors show via simulations and application to real phenotypes (e.g. WTCCC) that the method performs as well as or better than other methods, both for estimating h2g and for predicting phenotype, under a range of genetic architectures. For diseases with large-effect loci (e.g. autoimmune diseases), results superior to LMM are obtained. When effect sizes are close to normally distributed, results are similar to LMM, and superior to a previous Bayesian method developed by the authors based on a mixture of normally distributed and zero effect sizes, with priors specifying a small mixing weight for non-zero effects.
Have methods for estimating h2g and building phenotypic predictions reached a stage of perfection that obviates the need for further research? The authors report a running time of 77 hours to analyze data from 3,925 individuals, so computational tractability on the much larger data sets of the future is a key area for possible improvement. I wonder whether it might be possible for a simpler method to achieve similar performance.