Author post: Inferring human population size and separation history from multiple genome sequences

This guest post is by Stephan Schiffels (@stschiff) on his paper with Richard Durbin, “Inferring human population size and separation history from multiple genome sequences”, biorxived here.

In our paper, we study genome sequences to learn about human history and how human populations are related to each other. Remarkably, we only need a few individuals for this, because once we look sufficiently many generations into the past, every single genome contains fragments from a very large number of ancestors. This means that given only two genomes, say one individual from Africa and one individual from Europe, we typically find shared fragments from common ancestors (great great … great grandparents) from 2,000 or more generations ago. This trace of shared segments in our genomes can be detected and enables us to make inferences about human history.

A few years ago, Heng Li and Richard Durbin introduced the PSMC method, which estimates this shared common ancestry within a single diploid genome to infer population sizes. We have now introduced a major extension to this approach, called MSMC (Multiple Sequentially Markovian Coalescent), which is able to find and date traces of shared ancestry across multiple genome sequences. This is generally a hard problem because of the complex way in which sequences relate to each other through recombination and mutation (see an excellent blog post by Adam Siepel). In our method, we therefore chose to focus only on the pair of segments that coalesces first, i.e. the pair that shares the most recent common ancestor among all pairs. Because of ancestral recombinations, this pair changes along the sequences.

Consider again the example of an African and a European individual, each of them carrying two copies of a chromosome. In one part of their genomes, the most recent common ancestor of any two chromosomes may be shared between the two European chromosomes; in other parts it may be shared between the two African chromosomes; and in some cases it may actually be found across a European and an African chromosome. The relative frequency with which we observe each of the three cases, and the distribution of times to the most recent common ancestor, give information about when the separation happened, and how long it took for the ancestral people to part fully from each other. In the case of West-Africans and Europeans, we found that the two populations started to separate from each other (at least genetically) long before the known out-of-Africa emigration 50,000 years ago. And we see the same thing if we compare West-Africans to Asians or Americans instead of Europeans. We can also see clearly how ancestors of Native Americans separated from Asians around 20,000 years ago, which is consistent with, and precedes, the known first arrival of people in the New World around 15,000 years ago.
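
The three cases above can be made concrete with a small sketch. It is not MSMC itself, only an illustration under stated assumptions: with two chromosomes per individual there are six possible pairs, and in a single fully mixed population (no separation) each pair is equally likely to be the first to coalesce. The function names, the ratio used as a relative cross-coalescence rate (the across-population rate divided by the mean within-population rate) and the numbers in the usage lines are assumptions chosen for illustration, not estimates from the paper.

```python
from fractions import Fraction
from itertools import combinations


def first_pair_fractions(labels):
    """Fraction of haplotype pairs that fall within or across populations.

    Assumes exchangeable lineages (one fully mixed population), so every
    pair is equally likely to be the first to coalesce.
    """
    pairs = list(combinations(labels, 2))
    counts = {}
    for a, b in pairs:
        key = "within " + a if a == b else "across"
        counts[key] = counts.get(key, 0) + 1
    return {key: Fraction(n, len(pairs)) for key, n in counts.items()}


def relative_cross_coalescence(rate_within_1, rate_within_2, rate_across):
    """Across-population rate divided by the mean within-population rate.

    Values near 0 suggest fully separated populations; values near 1
    suggest one well-mixed population (an assumed summary statistic).
    """
    return 2.0 * rate_across / (rate_within_1 + rate_within_2)


# Two African (AFR) and two European (EUR) chromosomes.
fractions = first_pair_fractions(["AFR", "AFR", "EUR", "EUR"])
for case, frac in sorted(fractions.items()):
    print(case, frac)

# Hypothetical per-pair rates in one time interval (illustrative values only).
print(relative_cross_coalescence(1.0, 1.0, 0.5))
```

Without any separation, four of the six pairs are across-population pairs, so the first coalescence would be expected to fall across the two individuals two thirds of the time. Seeing it less often than that in a given time window is the kind of signal that points to separated populations.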

Our method can also estimate effective population size changes through time. One consequence of looking only for the first common ancestor is that we can now look into a much more recent past than was previously possible with similar methods, such as PSMC. For example, we can now see a deep bottleneck in Native American ancestors around 15,000 years ago, which fits with the separation and immigration history described above, and we can see recent expansions that are consistent with the spread of agriculture in Africa.
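
The gain in recent-time resolution can be illustrated with a standard result from coalescent theory, used here as a back-of-the-envelope sketch rather than as a description of the MSMC implementation. In a constant-size, randomly mating population of N diploid individuals, each pair of lineages coalesces at rate 1/(2N) per generation, so the first coalescence among M haplotypes happens after 2N divided by the number of pairs, M(M-1)/2, generations on average. The value N = 10,000 and the function name below are assumptions for illustration.

```python
from math import comb


def mean_first_coalescence(num_haplotypes, diploid_size):
    """Expected generations until the first coalescence among haplotypes.

    Assumes a constant-size, randomly mating population (Kingman coalescent).
    """
    num_pairs = comb(num_haplotypes, 2)
    return 2.0 * diploid_size / num_pairs


# Illustrative effective population size (an assumption, not a fitted value).
N = 10_000
for m in (2, 4, 8):
    print(m, round(mean_first_coalescence(m, N)))
```

Under these assumptions, going from two to eight haplotypes moves the typical first coalescence from about 20,000 to about 700 generations ago, which gives an intuition for why analysing more genomes opens up more recent history.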

We believe that MSMC is a useful tool for estimating population history from whole genome sequences. But more ideas and development are still needed to expand this approach to more genomes and to look into the past even more recently than 2,000 years ago, which is our current limit with MSMC. Closely related approaches are currently being developed by Yun Song, Thomas Mailund and others, which will complement MSMC. This is a great time to work in this field, given that many more high quality individual genome sequences are being generated, and in many cases from populations that we have not covered at all in our paper. All of this will help to greatly expand our knowledge of human population history.

7 thoughts on “Author post: Inferring human population size and separation history from multiple genome sequences”

  1. Thanks for this, really interesting paper. I’m wondering about how to interpret the results from the Maasai–the Pagani et al. paper you cite (and ours) estimates that they have ~20% of their ancestry that is most closely related to western Eurasian (rather than eastern Eurasian) populations, and that this gene flow happened within the last 3,000y (and definitely after the split of western and eastern Eurasia). If I’m reading Figure 4d correctly, you all don’t seem to see any cross-coalescence between Maasai and Tuscans/CEU until the common ancestor of east and west Eurasian populations something like 50,000 years ago.

    I guess I’m wondering if these results contradict each other (which might tell us something interesting), or if there’s some obvious technical explanation I’m missing. If you simulate admixture with these types of parameters do you know what msmc does?

    (Staring at your Table 5, I don’t see cross-coalescence rates for MKK and TSI, did you run this?)

    • Thanks, these are good questions. As you say, the cross coalescence rate reaches 50% around 50,000 years ago, but Figure 4a also shows that it already reaches 35% at 20,000 years ago, which is much more recent than expected under a clean African/Non-African split model. I think that the “flatter” decay observed between CEU/MKK compared to clean split simulations is consistent with post-split gene flow from CEU into MKK. Interpreting this generously, I think our results are qualitatively in agreement with your and Luca Pagani’s work. It’s just that the time point of gene flow is much older than what you find. I have talked to Luca about this as well and he also suggested running simulations in order to fully understand this, which I haven’t done so far. However, I have run MSMC on other samples with known recent admixture. I found that in those cases it predicts a non-zero cross-coalescence rate in the most recent time interval, which is what I would expect under recent admixture/gene flow.

      About MKK/TSI: I indeed did not run this pair. At the time I skipped some population pairs for efficiency. Now the implementation and the pipelines are much more streamlined and I should probably just run it now, in particular in light of the point you raise here.

      • Thanks, that helps my intuition.

        “I think that the ‘flatter’ decay observed between CEU/MKK compared to clean split simulations is consistent with post-split gene flow from CEU into MKK”

        I wonder if there’s a more direct test for this. E.g. what happens if you compare the MKK/TSI cross-coalescence with the MKK/JPT? If there’s no gene flow into MKK after the TSI/JPT split I’d expect these to be the same. You could even imagine doing this to compare the MKK/TSI cross-coalescence with the MKK/CEU (a priori I’d expect the former to be higher than the latter).

        I’ll have to think on this a bit, might play around with some simulations.

  2. Hi Stephan,

    I am wondering if you have thought about trying to reconcile your results with this recent paper:

    Bernard Secher et al., The history of the North African mitochondrial DNA haplogroup U6 gene flow into the African, Eurasian and American continents. BMC Evolutionary Biology 2014.

    Given the genomes you choose for your analysis in this paper:

    “We applied our model to the genomes from one, two and four individuals sampled from each of 9 extended HapMap populations [20]: YRI (Nigeria), MKK (Kenya), LWK (Kenya), CEU (Northern European ancestry), TSI (Italy), GIH (North Indian ancestry), CHB (China), JPT(Japan), MXL (Mexican ancestry admixed to European ancestry)(details in Supplementary Table 1).”

    . . . I’m wondering how you would be able to detect U6 gene flow, which, in Africa, the Secher et al. paper suggests is primarily in West Africa?

    Is YRI necessarily a good proxy for all of West Africa?

    Is sampling only YRI, MKK and LWK sufficient if we want to consider back/forward migrations into/out of Africa from Eurasia?

    Marnie Dunsmore

    • Hi Marnie, thanks for your comment and sorry for the long delay in replying to your questions.

      I admit that I am not very knowledgeable about mtDNA haplogroups, thanks for pointing me to this paper! I fully agree that our coverage of the African populations is very sparse. With only three populations we clearly do not capture the complexity of within-Africa migrations and gene flow. The only thing we see and write about is the difference between East- and West-African ancestry compared to out-of-Africa populations (the closer MKK-non-African separation compared to the YRI-non-African separation).

      Having said that, I think that mtDNA necessarily has a very limited resolution for inferring population genetic processes. It is only a single locus, and very much subject to genetic drift. We should be careful about taking mtDNA trees as a direct estimate of population trees. Future work, in particular with high coverage whole genome sequencing of African samples, will help reconstruct African population history in much greater detail, I believe.

      Stephan

