This guest post is by Michael Blum, Eric Bazin, and Nicolas Duforet-Frebourg on their preprint Genome scans for detecting footprints of local adaptation using a Bayesian factor model, available from the arXiv here.
Finding genomic regions subject to local adaptation is a central part of population genomics, which is based on genotyping numerous molecular markers and looking for outlier loci. Most common approaches use measures of genetic differentiation such as Fst. There are many software implementing genome scans based on statistics related to Fst (BayeScan, DetSel, FDist2 , Lositan), and they contribute to the popularity of this approach in population genomics.
However, there are different statistical and computational problems that may arise with approaches based on Fst or related measures. The first problem arises because methods related to Fst assume the so-called F-model, which corresponds to a particular covariance structure for gene frequencies among populations (Bierne et al. 2013). When spatial structure departs from the assumption of the F-model, it can generate many false positives. A second potential problem concerns the computational burden of some Bayesian approaches, which can become an obstacle with large number of SNPs. The last problem is that individuals should be grouped into populations in advance whereas working at the scale of individuals is desirable because it avoids defining populations.
Using a Bayesian factor model, we address the three aforementioned problems. Factor models capture population structure by inferring latent variables called factors. Factor models have already been proposed to ascertain population structure (Engelhardt and Stephens 2010). Here we extend the framework of factor model in order to identify outlier loci in addition to the ascertainment of population structure. Our approach is not the first one to account for deviations to the assumptions of the F-model (Bonhomme et al. 2010, Günther and Coop 2013) but it does not require to define populations by contrast to the previous approaches. Using simulations, we show that factor model can achieve a 2-fold or more reduction of false discovery rate compared to the Fst-related approaches. We also analyze the HGDP human dataset to provide an example of how factor models can be used to detect local adaptation with a large number of SNPs. The Bayesian factor model is implemented in the PCAdapt software and we would be happy to answer to comments or questions regarding the software.
To explain why the factor model generates less false discoveries, we can introduce the notions of mechanistic and phenomenological models. Mechanistic models aim to mimic the biological processes that are thought to have given rise to the data whereas phenomenological models seek only to best describe the data using a statistical model. In the spectrum between mechanistic and phenomenological model, the F-model would stand close to mechanistic models whereas factor models would be closer to the phenomenological ones. Mechanistic models are appealing because they provide quantitative measures that can be related to biologically meaningful parameters. For instance, the parameters of the F-model measures genetic drift that can be related to migration rates, divergence times or population sizes. By contrast, phenomenological models work with mathematical abstractions such as latent factors that can be difficult to interpret biologically. The downside of mechanistic models is that violation of the modeling assumption can invalidate the proposed framework and generate many false discoveries in the context of selection scan. The F-model assumes a particular covariance matrix between populations which is found with star-like population trees for instance. However, more complex models of population structure can arise for various reasons including non-instantaneous divergence or isolation-by-distance, and they will violate the mechanistic assumptions and make phenomenological models preferable.
Michael Blum, Eric Bazin, and Nicolas Duforet-Frebourg