Our paper: The genetic prehistory of southern Africa

[This author post is by Joe Pickrell (@joe_pickrell), Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf on The genetic prehistory of southern Africa, available from arXiv here]

The indigenous populations of southern Africa are phenotypically, linguistically, culturally, and genetically diverse. Although many groups speak Bantu languages (having arrived in the region during an expansion of Iron-Age agriculturalists), there are a number of populations who speak diverse non-Bantu languages with heavy use of click consonants. We refer to these populations as “Khoisan”. Most of the Khoisan populations are hunter-gatherers, but some are pastoralists; the extensive linguistic and cultural diversity of the Khoisan (who live in a relatively small region around the Kalahari semi-desert) is historically puzzling.

Two hunter-gatherer (or formerly hunter-gatherer) populations in East Africa, the Hadza and Sandawe, also speak languages that make use of click consonants. Linguists see little in common between the languages in southern Africa and Hadza, although Sandawe might be genealogically related to some of the Khoisan languages. Nevertheless, the shared use of click consonants and a foraging lifestyle led many to hypothesize that the southern African Khoisan populations are genetically related to the Hadza and Sandawe, which would imply that their ancestors were once considerably more widespread. This hypothesis has been controversial for decades.

Tree relating the Khoisan-like proportion of ancestry (shown in blue in the barplot) in Khoisan, Hadza, and Sandawe after accounting for non-Khoisan admixture.

In our study, we use genetic data to address the history of the diverse groups within southern Africa and their relationship to the Hadza and Sandawe. Specifically, we genotyped individuals from 16 Khoisan populations, 5 neighboring populations that speak Bantu languages, and the Hadza (the latter thanks to Brenna Henn, Joanna Mountain, and Carlos Bustamante) on a SNP array designed for studies of human history, in that the SNP ascertainment scheme is known and includes SNPs ascertained in the Khoisan. We then merged in Hadza and Sandawe samples from a recent paper by Joseph Lachance, Sarah Tishkoff and colleagues. The main conclusions are as follows:

  1. Within the southern African Khoisan, there are two genetic groups, which correspond roughly to populations in the northwest and southeast Kalahari semi-desert. Populations from these two groups have been labeled in the tree in this post (see also Figure 1B in the preprint). We estimate that these two groups diverged within the last 30,000 years. However, this date should be taken as an upper bound due to point #2 below.
  2. All southern African Khoisan groups are admixed with non-Khoisan populations. Even the most isolated Khoisan groups (i.e. the “San” from the HGDP, who are included in the “Ju|’hoan_North” group in our paper) show some evidence of admixture with agriculturalist and/or pastoralist groups. A subtle technical point is that this had not been previously noticed because methods that rely on correlations in allele frequencies are sometimes unable to detect admixture if all populations are admixed (this is related to Mr. Razib Khan’s post on why ADMIXTURE is not a test for admixture). To get around this, we developed new methods based on the decay of linkage disequilibrium.
  3. The Hadza and Sandawe trace part of their ancestry to admixture with a population related to the Khoisan. After accounting for admixture, we built a tree of “Khoisan-like” ancestry in the southern and eastern African populations (see the Figure above). The striking thing is that the Hadza and Sandawe fall with high confidence on the same branch as the Khoisan. This suggests that, prior to subsequent migrations of food-producing peoples over most of sub-Saharan Africa, populations related to the Khoisan were indeed spread continuously over a huge geographic range including Tanzania and southern Africa.

We’re excited about these results for a number of reasons. First of all, we’re now on our way towards understanding the history of the diverse Khoisan populations; for years these populations have been treated as genetically equivalent, but it’s clear that each population has its own complex history. Secondly, with the new statistical methods we’ve developed, we were able not only to show the varying amounts of admixture that have occurred at different times in southern African populations, but also to peel away these layers of admixture to learn about the relationships among Khoisan populations that existed thousands of years ago. Finally, we think that these results have important implications for work using genetics to understand the geographic origin of modern humans within Africa. Though both southern and eastern Africa have been proposed as potential origins, from the tree in this post, we see no genetic evidence in favor of either; from our point of view this question remains open.

Joe Pickrell, Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf

Thoughts on: The date of interbreeding between Neandertals and modern humans.

The following are my (Graham Coop, @graham_coop) brief thoughts on Sriram Sankararaman et al.’s arXived article: “The date of interbreeding between Neandertals and modern humans”. You can read the authors’ guest post here, along with comments by Sriram and others.

Overall it’s a great article, so I thought I’d spend some time talking about the interpretation of the results. Please feel free to comment; our main reason for doing these posts is to facilitate early discussion of preprints.

The authors’ analysis relies on measuring the correlation along the genome between alleles that may have been inherited from the putative admixture event [so-called admixture LD]. The idea is that if there was in fact no admixture and these alleles have just been inherited from the common ancestral population (>300kya), then these correlations should be very weak, as there has been plenty of time for recombination to break down the correlation between these markers. If there has been a single admixture event, the rate at which the correlation decays with the genetic distance between the markers is proportional to this admixture time [i.e. slower decay for a more recent event, as there has been less time for recombination]. These ideas for testing for admixture have been around in the literature for some time [e.g. Machado et al.]; it’s the application here, and at a genome-wide scale, that is novel.
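
To make the decay argument concrete, here is a minimal sketch (not the authors’ actual estimator). It assumes the textbook single-pulse model, in which the admixture correlation between markers separated by d Morgans decays as exp(-t·d), t generations after the event. The function names, the amplitude, and the 2,000-generation value are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

def admixture_ld_curve(d_morgans, t_generations, amplitude=1.0):
    """Expected admixture LD at genetic distance d (in Morgans),
    t generations after a single pulse of admixture (assumed model)."""
    return amplitude * np.exp(-t_generations * d_morgans)

def fit_admixture_time(d_morgans, ld_values):
    """Least-squares fit of log(LD) = log(A) - t * d.
    Returns the estimated time t (generations) and amplitude A."""
    slope, intercept = np.polyfit(d_morgans, np.log(ld_values), 1)
    return -slope, float(np.exp(intercept))

# Hypothetical, noise-free example: distances from 0.02 cM to 1 cM.
distances = np.linspace(0.0002, 0.01, 50)
true_t = 2000.0  # illustrative value only
curve = admixture_ld_curve(distances, true_t, amplitude=0.05)

est_t, est_a = fit_admixture_time(distances, curve)

# A more recent event (smaller t) leaves more correlation at any given distance.
recent = admixture_ld_curve(distances, 500.0, amplitude=0.05)
```

With real data the curve is noisy and the genetic map is uncertain, so a fit like this would only be a starting point. Note also that the fit dates the gene flow, whoever the source population was.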

As you can tell from the title and abstract of the paper, the authors find pretty robust evidence that this curve is decaying more slowly than we’d expect if there had been no gene flow, and estimate this “admixture time” to be 37k-86k years ago. However, as the authors are careful to note in their discussion, this is not a definitive answer to whether modern humans and Neandertals interbred, nor is this number a definite time of admixture. Obviously the biological implications of the admixture result will get a lot of discussion, so I thought I’d instead spend a moment on these caveats. [This post has run long, so I’ll only get to the 1st point in this post and perhaps return to write another post on this later].

Okay so did Neandertals actually mate with humans?

The difficulty [as briefly discussed by the authors] is that we cannot know for sure from this analysis that the time estimated is the time of gene flow from Neandertals, and not some [now extinct] population that is somewhat closer to Neandertals than any modern humans.

Consider the figure below. We would like to say that the cartoon history on the left is true, where gene flow has happened directly from Neandertals into some subset of humans. The difficulty is that the same decay curve could be generated by the scenario on the right, where gene flow has occurred from some other population that shares more of its population history with Neandertals than any current day human population does.

Why is this? Well, allele frequency change that occurred in the red branch [e.g. due to genetic drift] means that the frequencies in population X and Neandertals are correlated. This means that when we ask questions about correlations along the genome between alleles shared between Neandertals and humans, we are also asking questions about correlations along the genome between population X and modern humans. So under scenario B I think the rate of decay of the correlation calculated in the paper is a function only of the admixture time of population X with Europeans, and so there may have been no direct admixture from Neandertals into Eurasians*.

First things first: that doesn’t diminish how interesting the result is. If interpretation of the decay as a signal of admixture is correct, then it still means that Eurasians interbred with some ancient human population, which was closer to Neandertals than other modern humans. That seems pretty awesome, regardless of whether that population is Neandertals or some yet undetermined group.

At this point you are likely saying: well, we know that Neandertals existed as a [somewhat] separate population/species, so who is this population X you keep talking about, and where are their remains? Population X could easily be a subset of what we call Neandertals, in which case you’ve been reading this all for no reason [if you only want to know if we interbred with Neandertals]. However, my view is that in the next decade of ancient human population history things are going to get really interesting. We have already seen this from the Denisovan papers [1,2], and the work on ancient admixture in Africa (e.g. Hammer et al. 2011, Lachance et al. 2012). We will likely discover a bunch of cryptic, somewhat distinct ancient populations that we’ve previously [rightly] grouped into a relatively small number of labels based on their morphology and timing in the fossil record. We are not going to have names for many of these groups, but with large amounts of genomic data [ancient and modern] we are going to find all sorts of population structure. The question then becomes not an issue of naming these populations, but of understanding the divergence and population genetic relationships among them.

There’s a huge range of (likely more plausible) scenarios that are hybrids between A and B that I think would still give the same difficulties with interpretation. For example, ongoing low levels of gene flow from population X into the ancestral “population” of modern humans, consistent with us calling population X modern humans [see Figure below, **]. But all of the scenarios likely involve something pretty interesting happening in the past 100,000 years, with some form of contact between Eurasians and a somewhat diverged population.

As I say, the authors to their credit take the time in the discussion to point out this caveat. I thought some clarification of why this is the case would be helpful. The tools to address this problem more thoroughly are under development by some of the authors on this paper [Patterson et al 2012] and others [Lawson et al.]. So these tools along with more sequencing of ancient remains will help clarify all of this. It is an exciting time for human population genomics!

* I think I’m right in saying that the intercept of the curve with zero is the only thing that changes between Fig 1A and Fig 1B.

** Note that in the case shown in Figure 2, I think Sriram et al. are mostly dating the red arrow, not any of the earlier arrows. This is because they condition their subset of alleles to represent introgression into Europeans and to be at low frequency in Africa. We would likely not be able to date the deeper admixture arrow into the ancestor of Eurasians/Africans using the authors’ approach, as [I think] it relies on having a relatively non-admixed population to use as a control.

The genetic prehistory of southern Africa

Joseph K. Pickrell, Nick Patterson, Chiara Barbieri, Falko Berthold, Linda Gerlach, Mark Lipson, Po-Ru Loh, Tom Güldemann, Blesswell Kure, Sununguko Wata Mpoloka, Hirosi Nakagawa, Christfried Naumann, Joanna L. Mountain, Carlos D. Bustamante, Bonnie Berger, Brenna M. Henn, Mark Stoneking, David Reich, Brigitte Pakendorf
(Submitted on 23 Jul 2012)

The hunter-gatherer populations of southern and eastern Africa are known to harbor some of the most ancient human lineages, but their historical relationships are poorly understood. We report data from 22 populations analyzed at over half a million single nucleotide polymorphisms (SNPs), using a genome-wide array designed for studies of history. The southern Africans, here called Khoisan, fall into two groups, loosely corresponding to the northwestern and southeastern Kalahari, which we show separated within the last 30,000 years. All individuals derive at least a few percent of their genomes from admixture with non-Khoisan populations that began 1,200 years ago. In addition, the Hadza, an east African hunter-gatherer population that speaks a language with click consonants, derive about a quarter of their ancestry from admixture with a population related to the Khoisan, implying an ancient genetic link between southern and eastern Africa.

An analytical comparison of coalescent-based multilocus methods: The three-taxon case

Sebastien Roch
(Submitted on 17 Jul 2012)

Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. A large number of methods have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. Our analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths.
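
For readers new to the three-taxon setting, the following sketch may help. It uses the standard multispecies-coalescent result for a species tree ((A,B),C), which is not a formula taken from the abstract. With an internal branch of length T in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)e^{-T}, and each discordant topology has probability (1/3)e^{-T}. The function name and the example branch length are illustrative assumptions.

```python
import math

def three_taxon_gene_tree_probs(internal_branch):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    internal_branch: internal branch length in coalescent units
    (an assumed input; the abstract gives no numbers).
    """
    if internal_branch < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce on the internal branch.
    p_no_coal = math.exp(-internal_branch)
    # If they fail, all three topologies are equally likely in the root population.
    discordant = p_no_coal / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

probs = three_taxon_gene_tree_probs(0.5)  # illustrative branch length
star = three_taxon_gene_tree_probs(0.0)   # no internal branch: all topologies equal
```

The shorter the internal branch, the closer the three topologies get to equal frequency. This is why ILS makes species tree inference hard.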

Inference of population splits and mixtures from genome-wide allele frequency data

Joseph K. Pickrell, Jonathan K. Pritchard
(Submitted on 11 Jun 2012)

Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In this model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication, and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and “ancient” Asian breeds. Software implementing the model described here, called TreeMix, is available at this http URL
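
As a rough illustration of the idea in this abstract (a toy sketch, not the TreeMix implementation): under a Gaussian approximation to drift, the covariance in allele frequencies between two populations is proportional to the drift on the branches they share back to the root. A mixture then shows up as covariance that no single tree position explains. The tree, branch names, drift values, and mixture weights below are all hypothetical, and drift after the mixture event is ignored.

```python
import numpy as np

# Hypothetical tree ((A,B),C): each population maps branch name -> drift
# on its path from the root.
paths = {
    "A": {"root-AB": 0.02, "AB-A": 0.01},
    "B": {"root-AB": 0.02, "AB-B": 0.03},
    "C": {"root-C": 0.05},
}

def tree_covariance(paths):
    """Covariance implied by a tree: drift summed over shared branches."""
    names = sorted(paths)
    cov = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            shared = paths[a].keys() & paths[b].keys()
            cov[i, j] = sum(paths[a][edge] for edge in shared)
    return names, cov

def admixed_covariance(cov, weights):
    """Covariance of a mixed population with each tree population,
    plus its own variance, ignoring drift after the mixture."""
    weights = np.asarray(weights, dtype=float)
    return weights @ cov, float(weights @ cov @ weights)

names, cov = tree_covariance(paths)

# Hypothetical population drawing 75% of ancestry from A and 25% from C.
row, var = admixed_covariance(cov, [0.75, 0.0, 0.25])
```

In this toy tree A and B share no drift with C, so a population that covaries with both B and C cannot sit on any single branch of the tree. That mismatch is the kind of signal a migration edge is meant to absorb.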