*This guest post is by Michael Blum, Eric Bazin, and Nicolas Duforet-Frebourg on their preprint Genome scans for detecting footprints of local adaptation using a Bayesian factor model, available from the arXiv here.*

Finding genomic regions subject to local adaptation is a central part of population genomics, which is based on genotyping numerous molecular markers and looking for outlier loci. Most common approaches use measures of genetic differentiation such as Fst. There are many software implementing genome scans based on statistics related to Fst (BayeScan, DetSel, FDist2 , Lositan), and they contribute to the popularity of this approach in population genomics.

However, there are different statistical and computational problems that may arise with approaches based on Fst or related measures. The first problem arises because methods related to Fst assume the so-called F-model, which corresponds to a particular covariance structure for gene frequencies among populations (Bierne et al. 2013). When spatial structure departs from the assumption of the F-model, it can generate many false positives. A second potential problem concerns the computational burden of some Bayesian approaches, which can become an obstacle with large number of SNPs. The last problem is that individuals should be grouped into populations in advance whereas working at the scale of individuals is desirable because it avoids defining populations.

Using a Bayesian factor model, we address the three aforementioned problems. Factor models capture population structure by inferring latent variables called factors. Factor models have already been proposed to ascertain population structure (Engelhardt and Stephens 2010). Here we extend the framework of factor model in order to identify outlier loci in addition to the ascertainment of population structure. Our approach is not the first one to account for deviations to the assumptions of the F-model (Bonhomme et al. 2010, Günther and Coop 2013) but it does not require to define populations by contrast to the previous approaches. Using simulations, we show that factor model can achieve a 2-fold or more reduction of false discovery rate compared to the Fst-related approaches. We also analyze the HGDP human dataset to provide an example of how factor models can be used to detect local adaptation with a large number of SNPs. The Bayesian factor model is implemented in the PCAdapt software and we would be happy to answer to comments or questions regarding the software.

To explain why the factor model generates less false discoveries, we can introduce the notions of mechanistic and phenomenological models. Mechanistic models aim to mimic the biological processes that are thought to have given rise to the data whereas phenomenological models seek only to best describe the data using a statistical model. In the spectrum between mechanistic and phenomenological model, the F-model would stand close to mechanistic models whereas factor models would be closer to the phenomenological ones. Mechanistic models are appealing because they provide quantitative measures that can be related to biologically meaningful parameters. For instance, the parameters of the F-model measures genetic drift that can be related to migration rates, divergence times or population sizes. By contrast, phenomenological models work with mathematical abstractions such as latent factors that can be difficult to interpret biologically. The downside of mechanistic models is that violation of the modeling assumption can invalidate the proposed framework and generate many false discoveries in the context of selection scan. The F-model assumes a particular covariance matrix between populations which is found with star-like population trees for instance. However, more complex models of population structure can arise for various reasons including non-instantaneous divergence or isolation-by-distance, and they will violate the mechanistic assumptions and make phenomenological models preferable.

Michael Blum, Eric Bazin, and Nicolas Duforet-Frebourg

How is this related to Latent Factor Mixed Models?

Frichot, E., S. D. Schoville, G. Bouchard, and O. Francois. 2013. Testing for associations between loci and environmental gradients using latent factor mixed models. Mol Biol Evol 30:1687-1699.

Interesting question,

The difference is that Frichot et al. use environmental covariates (temperature, humidity,…) to look for signal of adaptation and consider as candidates the SNPs that are significantly correlated with the environmental covariates. By contrast, we do not use any covariate in addition to the SNP data. We look for SNPs atypically related with population structure as measured by the latent factors.

Try both!

P.S. E Frichot and N. Duforet-Frebourg are PhD students working in the same office.

From looking at your paper, this is how I interpret the method works (let me know if I’m wrong). K latent factors are estimated from the data, and each one explains a different aspect of population structure. Outliers are essentially evaluated for each K, and can be interpreted along the pattern of structure that K represents. It seems to be a cool approach.

From my understanding, LFMM loses power if K is specified too large. That appears to not be the case here, because outliers are specific to each K, correct?

It seems like it would be more appropriate to compare this method to the other methods that allow deviations from the F-model (FLK of Bonhomme et al and Bayenv2 of Gunther and Coop), rather than to Bayescan, which we know assumes the F-model. Also, some of the discussion of the F-model is confusing.

Choice of K

You go it right. Outliers SNP should be related to only one of the latent factors. By increasing K, you will look for other types of outliers. If by increasing K, the additional latent factors are pure noise, then we find that the method is quite robust and does not find outliers related to these new latent factors (Figure S2 or Figure 7, pbm of numbering in Arxiv, I should change that). The result of robustness has been obtained for simulated data.

For real data, the situation is more complex. When increasing K, we will typically find new latent factors that may truly capture something about population structure. The choice of K can be based on the research question instead of being based on a statistical criterion. For the HGDP analysis, we chose to restrict our analysis to K=4 because it corresponds to differentiation between continents. Increasing K would reveal interesting patterns of differentiation within continents, a scale we chose not to investigate here.

Comparison between methods

It is true that comparisons with many other methods would have been possible. However an important advantage of factor model is that you do not need to group individuals into populations. All other methods require populations, which can be OK with domestic animals for instance but more problematic with wild species for which isolation-by-distance is prevalent.

I would also be happy to know what is confusing about our discussion of the F-model.

Scratch “confusing,” I misread one sentence. But, I like the discussion on mechanistic and phenomenological models. Would you characterize FLK and Bayenv2 as phenomenological as well, or would they be considered “less phenomenological” because they require grouping individuals into populations?

One thing to clarify here is that Bayenv, and related methods really aren’t limited to populations. If you entered individual level data, then each “population” would consist of two alleles. The applications of these methods have usually focused on “populations” because genome-wide datasets have usually been collected in this way (in humans, e.g. HGDP). Most of these models take as their central principal that controlling for covariance of allele frequencies helps identify interesting loci (or equivalently lower false positive rate), and so could treat populations/individuals as their focal unit, and as long as their covariance matrix is sufficiently flexible they can also cope with both tree-like and isolation by distance like structures.

I think the main question here is in terms of the speed of fitting the model, and flexibility of the model fit, and whether the model simultaneously fits the alternative as well as the null model. I’m guessing that this model may well be beating previous approaches on some of these fronts.

Thanks Graham for the clarification.

Regarding Katie Lotterhos‘s question “Would you characterize FLK and Bayenv2 as phenomenological as well, or would they be considered “less phenomenological” because they require grouping individuals into populations?“, I think that the authors of the software would be the most capable people to provide an answer. In some sense, grouping individuals into populations (but see Graham’s reply) is somehow going into the direction of mechanistic models because there is an implicit assumption that individuals are clustered, which might be true or not depending on the data. In FLK, a tree of divergence between populations is required, which is also mechanistic.

You also ask if using the Dirichlet distribution makes the model mechanistic. In some sense, you are right because the Dirichlet is here to mimic frequencies (positive, sum to 1). In PCAdapt, we use simple Gaussian distributions and it works fine. However, using Gaussian distributions would not be the best approach for predicting allele frequencies for missing data because the prediction would not necessarily lye between 0 and 1. The advantage of using simple Gaussian distributions is that inference algorithms are faster (based on least squares).

Pingback: Most viewed on Haldane’s Sieve: February 2014 | Haldane's Sieve