Our paper: Inference of population splits and mixtures from genome-wide allele frequency data

[This author post is by Joe Pickrell (@joe_pickrell) on Inference of population splits and mixtures from genome-wide allele frequency data, available from arXiv here]

Early last year, I began working (with Jonathan Pritchard) on methods for using genetics to understand population history. As we describe in our preprint, our approach was to build a parameterized model of the patterns of correlation in allele frequencies across populations. This type of approach dates back to brilliant work on building population trees by Luca Cavalli-Sforza, AWF Edwards, and Joe Felsenstein from around 40 years ago. The key to our work is that, rather than representing history as a strictly bifurcating tree, we also allow “migration events” to model admixture between populations. The output from our model (called TreeMix, and available here) is something like that shown below.

A graph of human population history, allowing 10 migration events. Populations are colored according to geographic region.

We applied this method to both human and dog history and found a mix of known and novel historical results. Here I thought I’d speculate about a couple of the novel ones:

1. In the human data (see the graph above), one of the more surprising things to me was the arrow to the Cambodian population. The Cambodians appear to be an admixed population, with ~85% of their ancestry related to other southeast Asian populations (like the Dai) and ~15% of their ancestry from…it’s not totally clear. As you can see in the graph, the source of this admixture appears to be a population not particularly closely related to any other population in these data. So who was this population? A speculation is that this represents ancestry from a population related to the “Ancestral South Indian” population described by Reich et al. (2009), though other sources (e.g. Oceania) are plausible.

2. In the dog data (see Figures 5 and 6 in the preprint), the strongest signal is that the Basenji, a central African dog breed, appears to trace ~25% of its ancestry to admixture with wolves since domestication. This signal is somewhat surprising because there are no wolf populations currently living in Africa, which would seem to be a formidable barrier to admixture with an African dog breed. A hint as to what’s going on here is provided by vonHoldt et al. (2010), who show that the Basenji has an unusual amount of shared variation with wolves from the Middle East. One speculation, then, is that as the ancestors of the Basenji moved into Africa, they came into contact with Middle Eastern wolves and admixed with them.

Other suggestions for scenarios to explain these results are of course welcome. Overall, I’m hopeful that approaches like TreeMix will eventually supplant “standard” tree-building algorithms for situations in which gene flow is known to occur, though of course further development is necessary before this becomes reality.

Joe Pickrell

The genetic prehistory of southern Africa

Joseph K. Pickrell, Nick Patterson, Chiara Barbieri, Falko Berthold, Linda Gerlach, Mark Lipson, Po-Ru Loh, Tom Güldemann, Blesswell Kure, Sununguko Wata Mpoloka, Hirosi Nakagawa, Christfried Naumann, Joanna L. Mountain, Carlos D. Bustamante, Bonnie Berger, Brenna M. Henn, Mark Stoneking, David Reich, Brigitte Pakendorf
(Submitted on 23 Jul 2012)

The hunter-gatherer populations of southern and eastern Africa are known to harbor some of the most ancient human lineages, but their historical relationships are poorly understood. We report data from 22 populations analyzed at over half a million single nucleotide polymorphisms (SNPs), using a genome-wide array designed for studies of history. The southern Africans (here called Khoisan) fall into two groups, loosely corresponding to the northwestern and southeastern Kalahari, which we show separated within the last 30,000 years. All individuals derive at least a few percent of their genomes from admixture with non-Khoisan populations that began 1,200 years ago. In addition, the Hadza, an east African hunter-gatherer population that speaks a language with click consonants, derive about a quarter of their ancestry from admixture with a population related to the Khoisan, implying an ancient genetic link between southern and eastern Africa.

The geography of recent genetic ancestry across Europe

Peter Ralph, Graham Coop
(Submitted on 16 Jul 2012 (v1), last revised 19 Jul 2012 (this version, v2))

The recent genealogical history of human populations is a complex mosaic formed by individual migration, large-scale population movements, and other demographic events. Population genomics datasets can provide a window into this recent history, as rare traces of recent shared genetic ancestry are detectable due to long segments of shared genomic material. We make use of genomic data for 2,257 Europeans (the POPRES dataset) to conduct one of the first surveys of recent genealogical ancestry over the past three thousand years at a continental scale. We detected 1.9 million shared genomic segments, and used the lengths of these to infer the distribution of shared ancestors across time and geography. We find that a pair of modern Europeans living in neighboring populations share around 10-50 genetic common ancestors from the last 1500 years, and upwards of 500 genetic ancestors from the previous 1000 years. These numbers drop off exponentially with geographic distance, but since genetic ancestry is rare, individuals from opposite ends of Europe are still expected to share millions of common genealogical ancestors over the last 1000 years. There is substantial regional variation in the number of shared genetic ancestors: especially high numbers of common ancestors between many eastern populations likely date to the Slavic and/or Hunnic expansions, while much lower levels of common ancestry in the Italian and Iberian peninsulas may indicate weaker demographic effects of Germanic expansions into these areas and/or more stably structured populations. Recent shared ancestry in modern Europeans is ubiquitous, and clearly shows the impact of both small-scale migration and large historical events. Population genomic datasets have considerable power to uncover recent demographic history, and will allow a much fuller picture of the close genealogical kinship of individuals across the world.

An analytical comparison of coalescent-based multilocus methods: The three-taxon case

Sebastien Roch
(Submitted on 17 Jul 2012)

Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. A large number of methods have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. Our analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths.
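
For a sense of scale in the three-taxon setting, a standard result under the multispecies coalescent (not taken from this paper) is that a gene tree matches the species tree with probability 1 - (2/3)e^(-t), where t is the internal branch length in coalescent units. A minimal Python sketch follows; the function name is hypothetical.

```python
import math

def three_taxon_gene_tree_probs(t):
    """Return (P(concordant), P(each discordant topology)) for a three-taxon
    species tree whose internal branch has length t in coalescent units."""
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # The two sister lineages fail to coalesce within the internal branch
    # with probability exp(-t); after that, all three topologies are
    # equally likely, so each discordant one has probability exp(-t) / 3.
    p_discordant_each = math.exp(-t) / 3.0
    p_concordant = 1.0 - 2.0 * p_discordant_each
    return p_concordant, p_discordant_each
```

With t = 0 the three topologies are equally likely, and as t grows the concordant topology dominates, which is why short internal branches make ILS hard to resolve.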

A sequentially Markov conditional sampling distribution for structured populations with migration and recombination

Matthias Steinrücken, Joshua S. Paul, Yun S. Song
(Submitted on 25 Aug 2012)

Conditional sampling distributions (CSDs), sometimes referred to as copying models, underlie numerous practical tools in population genomic analyses. Though an important application that has received much attention is the inference of population structure, the explicit exchange of migrants at specified rates has not hitherto been incorporated into the CSD in a principled framework. Recently, in the case of a single panmictic population, a sequentially Markov CSD has been developed as an accurate, efficient approximation to a principled CSD derived from the diffusion process dual to the coalescent with recombination. In this paper, the sequentially Markov CSD framework is extended to incorporate subdivided population structure, thus providing an efficiently computable CSD that admits a genealogical interpretation related to the structured coalescent with migration and recombination. As a concrete application, it is demonstrated empirically that the CSD developed here can be employed to yield accurate estimation of a wide range of migration rates.

Inference of population splits and mixtures from genome-wide allele frequency data

Joseph K. Pickrell, Jonathan K. Pritchard
(Submitted on 11 Jun 2012)

Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In this model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication, and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and “ancient” Asian breeds. Software implementing the model described here, called TreeMix, is available at this http URL
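
To make the idea of a graph of ancestral populations with Gaussian drift more concrete, here is a rough Python sketch. It is not the TreeMix implementation. It assumes each edge adds independent Gaussian drift with a given variance, treats an admixed population as a weighted mixture of root-to-tip paths, and ignores sampling noise. The toy graph, edge names, and branch lengths are all made up for illustration.

```python
def expected_covariance(ancestry, branch_lengths):
    """Expected allele-frequency covariance (relative to the root) between
    populations under a simple Gaussian drift model.

    ancestry: {population: [(weight, [edge names from root to tip]), ...]}
    branch_lengths: {edge name: drift variance accumulated on that edge}
    """
    cov = {}
    for a in ancestry:
        for b in ancestry:
            total = 0.0
            for weight_a, path_a in ancestry[a]:
                for weight_b, path_b in ancestry[b]:
                    # Two paths covary through the drift on edges they share.
                    shared = set(path_a) & set(path_b)
                    total += weight_a * weight_b * sum(
                        branch_lengths[e] for e in shared
                    )
            cov[(a, b)] = total
    return cov

# Hypothetical toy graph: tree ((A, B), C) plus an admixed population M
# with 85% ancestry from the lineage leading to B and 15% from that of C.
branch_lengths = {"ab": 0.02, "a": 0.01, "b": 0.01, "c": 0.03, "m": 0.005}
ancestry = {
    "A": [(1.0, ["ab", "a"])],
    "B": [(1.0, ["ab", "b"])],
    "C": [(1.0, ["c"])],
    "M": [(0.85, ["ab", "b", "m"]), (0.15, ["c", "m"])],
}
cov = expected_covariance(ancestry, branch_lengths)
```

In this toy example A and C have zero covariance, as a tree would predict, but M covaries with both B and C. That kind of pattern cannot be produced by a bifurcating tree alone and is what a migration edge is meant to absorb.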

Detection of correlation between genotypes and environmental variables. A fast computational approach for genomewide studies

Gilles Guillot
(Submitted on 5 Jun 2012)

Genomic regions displaying outstanding correlation with some environmental variables are likely to be under selection, and this is the rationale for recent methods that identify selected loci and retrieve functional information about them. To be efficient, such methods need to be able to disentangle the potential effect of environmental variables from the confounding effect of population history. For the routine analysis of genomewide data-sets, one also needs fast inference and model selection algorithms. We describe a method based on an explicit spatial model that builds on the theoretical and computational framework developed by Rue et al. (2009) and Lindgren et al. (2011). The method allows one to quantify correlation between genotypes and environmental variables and to rank loci accordingly. It works for SNP and AFLP data obtained either at the individual or at the population level. We provide R scripts with detailed comments that can be used readily for the analysis of real data without specific prior knowledge of the R language.

Approximate Bayesian computation via empirical likelihood

K. L. Mengersen (QUT, Brisbane), P. Pudlo (Universite Montpellier 2), C. P. Robert (Universite Paris-Dauphine)
(Submitted on 25 May 2012)

Approximate Bayesian computation (ABC) has now become an essential tool for the analysis of complex stochastic models when the likelihood function is unavailable. The well-established statistical method of empirical likelihood however provides another route to such settings that bypasses simulations from the model and the choices of the ABC parameters (summary statistics, distance, tolerance), while being provably convergent in the number of observations. Furthermore, avoiding model simulations leads to significant time savings in complex models, such as those used in population genetics. The ABCel algorithm we develop in this paper also provides an evaluation of its own performance through an associated effective sample size. The method is illustrated using several examples, including estimation of standard and quantile distributions, and time series and population genetics models.
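
For readers unfamiliar with the three ABC choices the abstract mentions (summary statistics, distance, tolerance), the following is a minimal Python sketch of plain rejection ABC. It is not the ABCel algorithm from the paper. The model (a normal distribution with unknown mean and unit variance), the uniform prior, and all names are assumptions chosen only for illustration.

```python
import random
import statistics

def abc_rejection(observed, n_draws, tolerance, rng):
    """Plain rejection ABC for the mean of a Normal(theta, 1) model."""
    obs_summary = statistics.fmean(observed)        # summary statistic
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)              # draw from the prior
        simulated = [rng.gauss(theta, 1.0) for _ in observed]
        # Distance between simulated and observed summaries.
        distance = abs(statistics.fmean(simulated) - obs_summary)
        if distance < tolerance:                    # keep close simulations
            accepted.append(theta)
    return accepted

rng = random.Random(0)
observed = [rng.gauss(2.0, 1.0) for _ in range(50)]  # toy data, true mean 2.0
posterior_sample = abc_rejection(observed, n_draws=5000, tolerance=0.1, rng=rng)
```

Every draw needs a fresh simulation from the model, and most draws are rejected. That simulation cost, along with the three tuning choices, is what the empirical likelihood approach described above aims to avoid.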