Thoughts on: Polygenic modeling with Bayesian sparse linear mixed models

[This post is a commentary by Alkes L. Price on “Polygenic modeling with Bayesian sparse linear mixed models” by Zhou, Carbonetto, and Stephens. The preprint is available on the arXiv here.]

Linear mixed models (LMM) are widely used by geneticists, both for estimating the heritability explained by genotyped markers (h2g) and for phenotypic prediction (Best Linear Unbiased Prediction, BLUP); their application to computing association statistics is outside the focus of the current paper. LMM assume that effect sizes are normally distributed, but this assumption may not hold in practice. Improved modeling of the distribution of effect sizes may lead to more precise estimates of h2g and more accurate phenotypic predictions.
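For readers who want the quantities above in symbols, here is one common way to write the standard LMM. The notation is an assumption of this sketch and is not taken from the paper: y is the phenotype vector, X is an n-by-M matrix of standardized genotypes, and σ_g² and σ_e² are the genetic and residual variance components.

```latex
% Standard LMM with normally distributed effect sizes (notation assumed, not the paper's)
y = X\beta + \varepsilon, \qquad
\beta_j \sim N\!\left(0, \tfrac{\sigma_g^2}{M}\right), \qquad
\varepsilon_i \sim N\!\left(0, \sigma_e^2\right)

% Heritability explained by genotyped markers
h^2_g = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}

% BLUP of the effect sizes (equivalent to ridge regression)
\hat{\beta} = \left(X^\top X + \lambda I\right)^{-1} X^\top y, \qquad
\lambda = \frac{M\,\sigma_e^2}{\sigma_g^2}
```

Under this parametrization, the normality assumption on β_j is exactly the part that the mixture models discussed next relax.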

Previous work (nicely summarized by the authors in Table 1) has used various mixture distributions to model effect sizes. In the current paper, the authors advocate a mixture of two normal distributions (with independently parametrized variances), and provide a prior distribution for the hyper-parameters of this mixture distribution. This approach has the advantage of generalizing LMM, so that the method produces results similar to LMM when the effect sizes roughly follow a normal distribution. Posterior estimates of the hyper-parameters and effect sizes are obtained via MCMC.
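To make the "generalizes LMM" point concrete, here is a minimal Python sketch that draws effect sizes from a two-component normal mixture. It is an illustration only, not the authors' implementation, and it does no inference (no priors, no MCMC). The names `sample_effects`, `pi`, `sd_small`, and `sd_large` are hypothetical, and the parametrization (a mixing weight plus two standard deviations) is an assumption chosen for readability.

```python
import random
import statistics


def sample_effects(n_markers, pi, sd_small, sd_large, seed=0):
    """Draw effect sizes from a mixture of two zero-mean normals.

    With probability `pi` a marker gets the large-variance component,
    otherwise the small-variance one. All names are hypothetical.
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_markers):
        sd = sd_large if rng.random() < pi else sd_small
        effects.append(rng.gauss(0.0, sd))
    return effects


def excess_kurtosis(xs):
    """Sample excess kurtosis: roughly 0 for normally distributed data."""
    mean = statistics.fmean(xs)
    var = statistics.fmean([(x - mean) ** 2 for x in xs])
    fourth = statistics.fmean([(x - mean) ** 4 for x in xs])
    return fourth / var ** 2 - 3.0


# pi = 0: every marker comes from one normal, i.e. the LMM assumption.
lmm_like = sample_effects(20000, pi=0.0, sd_small=0.01, sd_large=1.0)

# Small pi: a few large-effect loci on a polygenic background.
sparse_like = sample_effects(20000, pi=0.01, sd_small=0.01, sd_large=1.0)

print("excess kurtosis, LMM-like:", excess_kurtosis(lmm_like))
print("excess kurtosis, sparse-like:", excess_kurtosis(sparse_like))
```

Setting `pi` to 0 (or making the two standard deviations equal) collapses the mixture to a single normal, which is the sense in which the approach contains LMM as a special case. A small `pi` with a much larger `sd_large` gives a heavy-tailed distribution of the kind expected when a few loci have large effects.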

The authors show via simulations and application to real phenotypes (e.g. WTCCC) that the method performs as well as or better than other methods, both for estimating h2g and for predicting phenotype, under a range of genetic architectures. For diseases with large-effect loci (e.g. autoimmune diseases), results superior to LMM are obtained. When effect sizes are close to normally distributed, results are similar to LMM, and superior to a previous Bayesian method developed by the authors based on a mixture of normally distributed and zero effect sizes, with priors specifying a small mixing weight for non-zero effects.

Have methods for estimating h2g and building phenotypic predictions reached a stage of perfection that obviates the need for further research? The authors report a running time of 77 hours to analyze data from 3,925 individuals, so computational tractability on the much larger data sets of the future is a key area for possible improvement. I wonder whether it might be possible for a simpler method to achieve similar performance.

Alkes Price


3 thoughts on “Thoughts on: Polygenic modeling with Bayesian sparse linear mixed models”

  1. Thanks for your comments on our paper, Alkes.

    One of the main messages of the paper is that the accuracy of the heritability estimates can depend on the (sometimes implicit) assumptions made by the model, which I think you reiterated well in your comments.

    Regarding your concern about the computational cost of the “hybrid” model, there are two possible routes to improvement: (1) propose a simpler approach that achieves similar accuracy, or (2) propose a less computationally intensive algorithm for estimating posterior statistics. The short answer is that we can do much better than MCMC by employing variational inference techniques, which can often produce good estimates at a much lower computational expense. We chose not to use a variational approximation (described in a recent Bayesian Analysis paper) because we were more interested in comparing heritability estimates under different modeling assumptions than in how long it took to obtain them, and explaining the variational approximation would have detracted from the main point of the paper. Computational cost will surely become more important when applying the “hybrid” approach to larger data sets where MCMC inference is no longer feasible.

  2. Thanks for posting this paper.

    Also, thank you to the authors. This paper is beautifully written and, although very technical, very readable. It describes current efforts to map phenotypic variation to genotypes. It also emphasizes, in the face of what is usually an unknown underlying genetic architecture, the need for statistical methods to learn from the data. Several other statisticians, including Thomas Mailund (http://www.mailund.dk/), have mentioned this need for algorithms to learn and be adaptive to the underlying genetic architecture. It seems to be a welcome trend.

    Also appreciated is the rich list of references.

  3. Pingback: Most viewed on Haldane’s Sieve: January 2013 | Haldane's Sieve
