Probabilistic models of genetic variation in structured populations applied to global human studies

Wei Hao, Minsun Song, John D. Storey
(Submitted on 7 Dec 2013)

Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important, unsolved problem is how to formulate and estimate probabilistic models of observed genotypes that allow for complex population structure. We formulate two general probabilistic models, and we propose computationally efficient algorithms to estimate them. First, we show how principal component analysis (PCA) can be utilized to estimate a general model that includes the well-known Pritchard-Stephens-Donnelly mixed-membership model as a special case. Noting some drawbacks of this approach, we introduce a new “logistic factor analysis” (LFA) framework that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure. We demonstrate these advances on data from the human genome diversity panel and 1000 genomes project, where we are able to identify SNPs that are highly differentiated with respect to structure while making minimal modeling assumptions.
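To make the LFA idea in the abstract concrete, here is a minimal generative sketch, not the authors' estimation algorithm. It assumes the logit of each individual-specific allele frequency is a linear combination of latent factors, and that the two alleles of a genotype are drawn by binomial sampling. The function name `simulate_genotypes`, the variable names, and the toy numbers are all hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_genotypes(loadings, factors, seed=0):
    """Simulate genotypes under an assumed logistic latent-factor model.

    loadings[i][k]: hypothetical coefficient of latent factor k for SNP i.
    factors[k][j]:  hypothetical value of latent factor k for individual j.
    Returns (probs, genos), where probs[i][j] is the allele frequency
    for SNP i in individual j and genos[i][j] is a genotype in {0, 1, 2}.
    """
    rng = random.Random(seed)
    n_inds = len(factors[0])
    probs, genos = [], []
    for snp_loadings in loadings:
        p_row, g_row = [], []
        for j in range(n_inds):
            # logit(pi_ij) is linear in the latent factors of individual j
            logit = sum(a * h[j] for a, h in zip(snp_loadings, factors))
            pi_ij = 1.0 / (1.0 + math.exp(-logit))
            p_row.append(pi_ij)
            # assumption: the two alleles are independent Bernoulli(pi_ij) draws
            g_row.append(sum(rng.random() < pi_ij for _ in range(2)))
        probs.append(p_row)
        genos.append(g_row)
    return probs, genos


# Toy inputs (assumed): 2 SNPs, 3 individuals, an intercept row plus one factor.
loadings = [[0.0, 1.5], [-1.0, 0.0]]
factors = [[1.0, 1.0, 1.0], [-2.0, 0.0, 2.0]]
probs, genos = simulate_genotypes(loadings, factors)
```

In this toy setup the first SNP has a nonzero loading on the structure factor, so its allele frequency varies across individuals, while the second SNP depends only on the intercept and has the same frequency for everyone. A SNP of the first kind is what one would loosely call differentiated with respect to structure.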

2 thoughts on “Probabilistic models of genetic variation in structured populations applied to global human studies”

  1. Thanks for posting this to the arXiv. Just a couple quick thoughts, apologies for them not being polished.

    1. You might be interested in the Nicholson et al. model (http://onlinelibrary.wiley.com/doi/10.1111/1467-9868.00357/abstract) which I think reduces to the Balding-Nichols model in practice but starts from a very different underlying population genetic model.

    2. You write “it was shown that population genetic structure roughly corresponds to geographic characterizations of ancestry [26, 27]”. Reference 27 is Coop et al. (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000500); is this right? What result from that paper are you referring to? (Not that this is wrong, it’s just not obvious to me what you’re referencing here)

    3. I found the definition of “individual-specific allele frequencies” to be a bit confusing, i.e. “let \pi_{ij} = \pi_i (z_j) be the allele frequency for SNP i conditioned on the ancestry state of individual j”. I initially read this as conditioning on the ancestry at a given position in the genome, but you’re talking about something completely different. That is, if the two alleles at SNP i in individual j are drawn from a binomial distribution with parameter p_{ij}, you’re talking about E[p_{ij}] (which is some linear combination of things), right? Or something like that? I guess the idea of this being an individual-specific allele frequency makes sense after thinking about it a bit, but it doesn’t seem intuitive to me except when talking about it in terms of binomial sampling, which as written you don’t introduce until later.

    4. I think it’s worth pointing out (if this is true) that your statistical test for individual SNPs that show population differentiation is only a test for differentiation, not a test for selection. I don’t think this is in practice different from testing if Fst = 0? You might be interested in some work using patterns of differentiation to test for selection, for example Bhatia et al. (http://www.cell.com/AJHG/abstract/S0002-9297(11)00354-5) and Gunther and Coop (http://arxiv.org/abs/1209.3029).

  2. Pingback: Most viewed on Haldane’s Sieve: December 2013 | Haldane's Sieve
