Generative Probabilistic Model for Detecting Selection on Dispersed Genomic Elements from Polymorphism and Divergence

Generative Probabilistic Model for Detecting Selection on Dispersed Genomic Elements from Polymorphism and Divergence
Ilan Gronau, Leonardo Arbiza, Adam Siepel
(Submitted on 29 Sep 2011 (v1), last revised 13 Aug 2012 (this version, v3))

We present a new probabilistic method for measuring the influence of natural selection on a collection of short elements scattered across a genome based on observed patterns of polymorphism and divergence. This is a challenging task for various reasons, including variation across loci in mutation rates and genealogical backgrounds, and the influence of demography on patterns of polymorphism. In addition, accounting for the combined effects of different modes of selection is known to be a serious challenge for tests of selection that use patterns of polymorphism and divergence. Our method addresses these challenges by contrasting patterns of polymorphism and divergence in the elements of interest with those in flanking neutral sites. While this general approach is common to several existing tests of selection, our method improves substantially on these methods by making use of a full generative probabilistic model, directly accommodating weak negative selection, allowing information from many short elements to be combined in a statistically rigorous manner, and integrating phylogenetic information from multiple outgroup species with genome-wide population genetic data. Our model is able to account for of weak negative, strong negative, and strong positive selection, by making a small set of simple assumptions on their separate effects on polymorphism and divergence. We implemented an expectation maximization algorithm for inference under this model and applied it to simulated and real data. Using simulations, we show that our inference procedure effectively disentangles the different modes of selection and provides accurate estimates of the parameters of interest that are robust to demography. We demonstrate an application of our methods to real data by analyzing several collections of human transcription factor binding sites identified using recently generated genome-wide ChIP-seq data.