This guest post is by Peter Ralph and Graham Coop (@graham_coop) on their paper, The geography of recent genetic ancestry across Europe, arXived here.
In this paper we look at the genetic traces of very recent common ancestry between pairs of individuals from across Europe. We’ll likely write a few more accessible posts on this work when the paper is closer to publication, but for now (in the spirit of Haldane’s sieve) we offer a somewhat more technical post [the full details are in the paper].
We started this project wanting to estimate recent migration rates across continents like Europe: if we could learn how far away distant cousins are from each other, then, all else equal, we could estimate typical migration distances. This isn’t where we ended up (that’s another project we are working on), but the basic idea of looking at the geographic distribution of close relatives led us to some interesting places.
As most populations lack amazing pedigrees like the one worked on by Decode in Iceland [e.g. here], we can’t actually know the true relationships between the samples (other than a few obvious siblings and full cousins). However, long segments of chromosome shared (almost) identically by descent (IBD) between two people have probably been inherited from a recent common ancestor. The length of these IBD segments tells us something about how long ago the ancestor lived, since the older the ancestor, the more opportunity for recombination to whittle down the segment.
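To make the link between block length and ancestor age concrete, here is a minimal sketch under a textbook approximation that is an assumption here, not the model fitted in the paper: a block inherited by two people from a single common ancestor `generations` back has passed through `2 * generations` meioses, recombination falls as a Poisson process at one crossover per Morgan per meiosis, and chromosome ends are ignored. Under those assumptions a randomly chosen block has an exponentially distributed length with mean 100 / (2 * generations) cM. The function names are illustrative.

```python
from math import exp


def expected_block_length_cm(generations):
    """Mean length (cM) of a block shared via one ancestor `generations` back.

    Assumes 2 * generations meioses separate the two descendants and an
    exponential length distribution (an approximation, see the text above).
    """
    meioses = 2 * generations
    return 100.0 / meioses


def prob_block_longer_than(length_cm, generations):
    """Probability that such a block is longer than `length_cm` cM."""
    meioses = 2 * generations
    return exp(-meioses * length_cm / 100.0)


for g in (5, 25, 100):
    print(g, expected_block_length_cm(g), prob_block_longer_than(3.0, g))
```

Older ancestors give shorter expected blocks, and the chance of any single old block surviving above a few cM falls off quickly with age.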
This has been worked on by a bunch of different groups, but the historical inference has usually been applied to small or relatively isolated populations. To really push the boundaries of these approaches we used the European subset of the POPRES dataset, which consists of thousands of human individuals. This is currently one of the best genome-scale, geographically indexed datasets, and represents a huge outbred population where we’d expect patterns of variation to be (at least partly) due to continuous migration, rather than, say, recent mixing of diverged populations or bottlenecks. So, we ran BEAGLE on the dataset to find IBD segments, and got lots of wonderful signal — it turned out that most pairs of people in the European sample (around 75%) shared IBD segments that were megabases long (i.e. longer than 1 centi-Morgan, cM). After a bunch of power and false-positive simulations, we were convinced that most of those blocks of IBD had been inherited from single common ancestors.
You could think of our results in two pieces: first, doing descriptive statistics on the distribution of IBD abundances and lengths across geography; and second, doing some inference on this distribution to see what we can learn about when those common ancestors lived.
As we hoped, there was a nice relationship to geography – people nearer to each other typically shared more and longer IBD than people farther away, in a nice monotonic relationship. This convinced us that continuous, local migration had played an important role in shaping current patterns of relatedness across Europe. Geographic distance was definitely not the only factor – superimposed on top of this was distinctive regional variation. For example, one of the strongest signals we saw was that there are higher levels of IBD sharing in Eastern Europe. As you’ll see from the paper, after further work, we think this is a potential signal of the Slavic or Hunnic expansions.
There were also some surprises to us along the way, like people in the UK sharing more IBD with Irish than with other people in the UK, that turned out to make sense after thinking about rapidly growing populations with directional migration (although there are other explanations). We correlated the patterns we saw with historical events, but (as with most genomic studies of human history) there was a lot of uncertainty. Sure, the patterns we see are consistent with the story we told, but there could potentially be a lot of other explanations, especially given the complicated and often unknown demographic history of European populations. What if all the IBD we saw came from the Neolithic expansion rather than the last two or three thousand years? This turns out to be a bigger worry than you might think: it’s fairly unlikely that two people inherited a 3cM block from a single common ancestor from 6000 years ago, but if they have enough common ancestors from back then (e.g. a strong enough bottleneck), it turns out to be reasonably likely.
So, we did some coalescent theory to work out the relationship between numbers of shared ancestors back through time (closely related to coalescent rate) and the observed distribution of IBD block lengths. We could then invert this relationship to estimate from the observed distribution of IBD blocks the mean number of ancestors that pairs of people from different parts of Europe share with each other, as a function of time. Unfortunately, this turns out to have a lot of unavoidable uncertainty: the inversion problem is “ill-conditioned” (in other terms, the likelihood surface is ridged), meaning that there were a lot of different histories that gave the same IBD length distribution.
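One way to get a feel for the ill-conditioning is to compare the block-length profiles produced by ancestors of different ages. The sketch below assumes a simplified exponential kernel (block lengths from an ancestor `generation` back decay at rate `2 * generation` per Morgan); it is a stand-in for illustration, not the kernel or the error model used in the paper. If the profiles for two different ages are almost parallel as vectors, the data can barely tell them apart, and any inversion has to amplify noise to separate them.

```python
from math import exp, sqrt


def block_length_profile(generation, lengths_morgans):
    """Relative abundance of blocks at each length, for one ancestor age.

    Assumed kernel: exp(-2 * generation * x), with x in Morgans.
    """
    rate = 2 * generation
    return [exp(-rate * x) for x in lengths_morgans]


def cosine_similarity(u, v):
    """Cosine of the angle between two profiles (1.0 means indistinguishable shape)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / sqrt(sum(a * a for a in u) * sum(b * b for b in v))


# Block lengths from 2 cM to 20 cM in steps of 0.1 cM, expressed in Morgans.
lengths = [0.02 + 0.001 * i for i in range(181)]

profile_30 = block_length_profile(30, lengths)
profile_31 = block_length_profile(31, lengths)
profile_60 = block_length_profile(60, lengths)

print(cosine_similarity(profile_30, profile_31))  # adjacent generations
print(cosine_similarity(profile_30, profile_60))  # ages differing two-fold
```

Under this assumed kernel, even ages that differ by a factor of two produce strongly overlapping profiles, which is the sense in which many different histories can fit the same length distribution.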
Despite this, we could still rigorously learn a lot of good information; in particular, nearly all the IBD blocks we found did actually come from ancestors living during the last 3,000 years. Although we could only tie down the ages of the common ancestors to within a few hundred years, the major patterns can likely be tied to known historical events. There is quite a bit of uncertainty about the specific interpretations (it is still not straightforward to go from pairwise numbers of shared ancestors 1,500 years ago to conclusions about demographic events at the time), but used in conjunction with other sources of information, this approach has the promise to conclusively resolve some longstanding debates about recent history.
Finally, two addendums (addenda?) about the methods: The first is that we took an empirical approach to estimating the relationship between coalescent time distribution and observed IBD block length, by simulating a bunch (actually copying over blocks and re-running BEAGLE). We did this because BEAGLE is effectively a black box, for our purposes. This sort of approach is more common in experimental physics, where the empirical properties of detectors have to be worked out (and the problem of inferring the signal is known as “data unfolding”).
Second, we should emphasize that the uncertainty we came across in inferring dates is theoretically unavoidable, using IBD block length data. We think this is a common issue for many sorts of population genetics data — situations in which, even though we have a ton of data, getting specific, tightly constrained inferences requires making fairly strong assumptions (or equivalently, working in a specific set of parametric models). This has been highlighted in some cases [like this one], but more work is needed on this to ensure that we represent the inherent uncertainties in population genetics inferences correctly.
We’d love feedback from the popgen community about what aspects of the paper they’d like to see clarified/improved [obviously it is a pretty involved paper already, so concentrate on specific suggestions]. We have a tonne more ideas of how to improve this inference technology and extend it to other applications. But we’d love to hear your thoughts too.
I really enjoyed reading this paper. It is a serious attempt to use IBD to study ancestry and relationships among the people of Europe, and I think it is a valuable addition to the field. It is a useful application of IBD detection to population genetics, and is exciting in that it illustrates the potential of applying this relatively new technology to probe pop gen questions.
I had one major criticism, one more minor criticism, and a few questions that I wanted to raise in this forum.
My primary criticism had to do with the evaluation of BEAGLE and its use in addressing the authors’ questions. One of the key results and a claim made in the paper is that levels of IBD sharing vary by region, including higher levels of sharing in Eastern Europe and lower sharing of Italians to other groups. While these signals may be valid, due to the complexity of IBD and the model used by BEAGLE, I’m not certain that the current analysis fully supports these conclusions. The paper evaluates BEAGLE’s power and false positive characteristics in only one population (CEU), but the results and conclusions are drawn from many distinct populations. This is potentially problematic because BEAGLE’s IBD calling method is complex, and in particular, it uses a frequency threshold to determine whether to call pairs of identical segments as IBD. This means that segments that have a high frequency in a dataset may suffer a high false negative rate. This could differentially affect individual populations. For example in terms of the current study, it may be that, because of the Roman Empire, historically “Italian” segments are relatively ubiquitous throughout Europe. If so, these segments may be missed by BEAGLE, which would result in an apparently lower level of IBD sharing from Italians to other European groups.
I think it is very feasible to address this concern. Ideally, one would use trios from two or more European populations and show that the power and false positive rates are very similar among these groups. Short of this, and using publicly available data, one could use the HapMap CEU and YRI populations for this evaluation. Use of CEU and YRI for this purpose would be even more robust to potential population-specific biases in the IBD caller since CEU and YRI are so diverged. (Of course, the diverged nature of these populations, while more robust, is also potentially problematic, since differences between CEU and YRI may not be representative of potential inter-European differences in IBD calling characteristics.)
A more minor criticism, but one that is easily addressed, is the reliance on computationally phased data in order to plant IBD segments for the power evaluation. Computationally phased data, including even trio phased samples, will have switch errors. But more importantly, even if the trio phasing were perfect, these haplotypes are merged with other computationally phased haplotypes to generate genotypes. Computationally phased haplotypes are probably easier for BEAGLE to identify (since a phasing method already found them once). An alternative design that does not involve phasing is to take trio children’s diploid genotypes in a given region, and use these genotypes to replace the genotypes of some other sample. Then, one can include the trio parent as part of the dataset, and the planted diploid genotype is guaranteed to have at least one segment shared with that parent in the planted region.
Moving to the questions I have: the first relates to the use of segments with length < 2 cM. It is evident from the curves you generated that power drops off markedly for segments shorter than 2 cM. Do the results change significantly if you restrict to segments that are 2 cM or longer? Or why did you choose to include the shorter segments?
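The phasing-free planting design described above could be sketched roughly as follows. The data layout (a dictionary mapping sample IDs to lists of 0/1/2 allele counts), the sample names, and the function name are all hypothetical, chosen only to illustrate the idea; a real pipeline would work on genotype files and physical or genetic map positions.

```python
def plant_child_genotypes(genotypes, child_id, recipient_id, start, end):
    """Copy the child's diploid genotypes over marker indices start..end-1
    into the recipient, returning a new dataset and leaving the input intact.
    """
    planted = {sample: list(calls) for sample, calls in genotypes.items()}
    planted[recipient_id][start:end] = genotypes[child_id][start:end]
    return planted


# Toy genotypes (allele counts per marker); the child is Mendelian-consistent
# with the parent at every marker.
genotypes = {
    "child":     [0, 1, 2, 1, 0, 1],
    "parent":    [0, 1, 1, 2, 0, 0],
    "unrelated": [2, 2, 0, 0, 1, 2],
}

planted = plant_child_genotypes(genotypes, "child", "unrelated", 1, 4)

# The evaluation dataset keeps the parent and the modified recipient but
# drops the child, so the planted region is the only known shared segment.
dataset = {s: g for s, g in planted.items() if s != "child"}
print(dataset)
```

Power would then be the fraction of planted regions in which the IBD caller reports a segment between the parent and the modified recipient.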
Lastly, I found the demographic inferences interesting, but the dates seemed a bit too old given the genetic length of the segments. It wasn't clear to me why it should be that "the typical age of a 10cM block shared by two individuals from the UK is between 32 and 52 generations." I did some calculations for 10 cM chunks that suggested an average age of closer to 25 generations. There may be a good reason for this discrepancy, but it wasn't clear to me from my read.
Overall, I really liked this paper, and I think it is valuable in its current form. Addressing these issues and questions could make the results even more compelling, and I look forward to reading the paper in its final form (as well as hopefully engaging on these topics here).
Thanks for the substantive and useful comments; they are all spot-on (and, as a
side note, they echo closely our reviews, both in their helpful depth and
concerns). First, the concerns:
BEAGLE’s varying power and false positive rate by population was one of our
major concerns (and our reviewers’); we’ve done a few more careful analyses of
this issue to add into the paper. The short answer is that there is
significant variation in FP by population, but only for short segments: the
difference is tiny and/or insignificant past 2cM (with the exception of false
positives between Portuguese pairs, strangely). The major regional differences
(e.g. Italy having much lower rates) hold even for really long blocks (say,
above 5cM, where BEAGLE has very high power). This is visible in figure 3 in
the paper, but a better way of viewing it is as
an animation, with tooltip labels [UPDATED];
there it’s possible to see that IBD rates of different population
pairs, relative to the overall trend lines, are positioned more or less the
same for different length categories. This is one thing that really convinced
us that the differences between pairs of populations are real.
We agree, though, that it would be great to estimate the relevant error rates
using more populations. Your idea about how to use trios to avoid phasing
problems is really sharp; I wish we’d thought of that. Minor errors in false
positive rates or power are more of a concern when we estimate ages of common
ancestors, but we have done some exploration of this, and differences caused by
realistic variation in FP and TP rates are small compared to the intrinsic
uncertainty due to the conversion of block lengths to ages. We’re going to put
some exploration of this into the next draft, however.
And, the questions:
1) For the descriptive statistics, we used a cutoff of 1cM, and no, changing
that doesn’t affect anything much at all (see, for instance, the animation
linked above). We miss most of the real 1cM blocks, and a bunch of what we have
are FP, but there’s still plenty of signal.
For the inference of block age, we restricted to 2cM for the reason you
mention, and also played around with this lower bound, but the differences here
are small compared to our intrinsic uncertainty, as above.
2) So, we do that calculation (finding the distribution of ages of common
ancestors who left a 10cM IBD block) basically using equations (6) and (7) with
no error — the mean number of 10cM blocks coming from generation n is mu(n) *
(d/dx)K(n, 10), the coalescent rate at generation n multiplied by the density
of the kernel of equation (6). Call this m(n). Then p(n) = m(n) / ( m(1) +
… + m(really big) ) is the proportion of 10cM blocks in the population that
we expect to come from generation n. It depends quite strongly on the
demographics, through mu; in our calculation we used various of our estimated
consistent histories to get the range 32–52. This is surprising, but we think
plausible; the consistent histories we estimate do a good job of predicting the
observed rates of long IBD sharing. What did you use for your calculation?
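The recipe p(n) = m(n) / (m(1) + … + m(really big)) can be sketched numerically. The code below is only an illustration: the block-length density is a simplified stand-in, (2n)^2 exp(-2nx) with x in Morgans and chromosome ends ignored, rather than the paper's equations (6) and (7), and both coalescent-rate functions are hypothetical, not estimated histories. It is meant to show how strongly the answer depends on mu.

```python
from math import exp


def block_density(generation, length_morgans):
    """Assumed density of blocks of a given length from ancestors
    `generation` back: (2n)^2 * exp(-2n * x), ignoring chromosome ends."""
    meioses = 2 * generation
    return meioses ** 2 * exp(-meioses * length_morgans)


def age_distribution(length_cm, coalescent_rate, max_generation=2000):
    """p(n) = m(n) / sum_k m(k), with m(n) = mu(n) * density(n, length)."""
    x = length_cm / 100.0
    weights = [coalescent_rate(n) * block_density(n, x)
               for n in range(1, max_generation + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_age(probabilities):
    """Mean generation of the ancestors behind blocks of the chosen length."""
    return sum(n * p for n, p in enumerate(probabilities, start=1))


def constant_rate(n):
    """Hypothetical history: constant population size."""
    return 1.0


def rising_rate(n):
    """Hypothetical, deliberately exaggerated history: the coalescent rate
    rises into the past, as in a rapidly growing population."""
    return exp(0.1 * n)


print(mean_age(age_distribution(10.0, constant_rate)))  # close to 15
print(mean_age(age_distribution(10.0, rising_rate)))    # close to 30
```

With a flat mu the typical 10cM block is young, while a history in which many more ancestors are shared further back pulls the typical age of the same block length substantially older.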
Again, thanks for the feedback. We’re working on the revisions and hope to get
them out… er, yesterday? Soon, anyhow. We’ll post them to the arXiv and
we’ll be interested in your thoughts on the changes.
Thanks for a great set of comments; they complement the reviewers’ comments nicely. I think Peter’s response covers these points well. One thing to note is that the FP rate [as a function of length] in the 1st submission was not just for CEU, it was a mean across all POPRES populations [constructed by permuting genotypes within countries]. Only our power was CEU specific, because those are the only publicly available pedigree-phased European individuals to our knowledge. The YRI trios are too different in allele frequencies to be a useful comparison here, we think. We accidentally included a couple of putatively African POPRES individuals [because they were “mislabeled”] in our first BEAGLE run. They shared a lot of apparent IBD, likely because they shared a lot of alleles that were rare in Europe, which would make interpretation of power studies difficult [e.g. we’d likely have good power to spot 1cM IBD African blocks].
If you know of any publicly-available southern European phased trios [on the Affy chip] we’d be interested in using them for power analyses. Perhaps we could consider using computationally phased CEU chromosomes in our power study and seeing if that had significantly higher power than using the pedigree phased CEU chromosomes. If that were the case we might just go ahead and computationally phase Italian chromosomes and use them in a Southern European version of the power study. However, that would be quite a lot of work, and would not be particularly watertight.
In the absence of such data I think our argument that our patterns hold even when we should have very good power [>5cM] is solid. Also Peter has tried altering our power curve, and it didn’t make a huge difference to our age inferences. We’ve also done a bunch of work on the effect of sample size on BEAGLE, which also shows that there are few differences across populations as a result of differences in sample size. Overall the signals we point to are pretty strong, and the inherent noise in the inversion dominates most other sources of error. We’ll put this all into the next draft and you’ll have to let us know what you think.
Other possibilities of why Eastern Europeans are broadly inter-related:
People travelled widely within the Byzantine world.
To discourage tribalism, Philip II of Macedon implemented a law forbidding marriage to a first or second cousin. This law became incorporated into Byzantine law. The impact was for men to frequently take wives from neighboring villages.
Nomadism or transhumant shepherding was and still is common in the Balkans.
A dark age never occurred in the Byzantine World in the same way that it did in Western Europe. This meant that there was an interchange of people and ideas well into the Middle Ages across Eastern Europe.
The official split between the Western and Eastern Churches (the Great Schism) didn’t occur until 1054. Many areas of Eastern Europe that are today Catholic would have come within the influence of Byzantine law and custom until about a thousand years ago.
PS. Cool paper!
Following up on the potential for differences in power across populations: it may be that BEAGLE has equivalent power in any population to identify IBD segments of >= 5 cM, but to my knowledge no one has yet demonstrated this empirically. I agree that YRI is quite different, and it is certainly not ideal for this test, but I suspect that BEAGLE has population-specific power biases that could be affecting these results. A test based on computational phasing of one or a few populations, while not ideal, would be helpful here.
Another option, and probably a superior one, would be to use an entirely different IBD method (one without a frequency-based threshold for calling IBD) to check whether you see trends similar to your current results. In particular, you could phase the data and run GERMLINE to check if the general trends, just in terms of total numbers of shared segments, are the same. This too is not ideal, since it relies on phasing (so does BEAGLE, of course; it just explores more of the haplotype state space), and we already know, e.g., that Africans have higher switch error rates than Europeans. Nevertheless, having one method that doesn’t rely on frequency thresholds as a check here seems good, since this kind of analysis hasn’t been done before.
The best way to run GERMLINE is to phase and use the -h_extend option. Without this option, GERMLINE finds likely IBD using its own model of diploid data, and some tests I’ve done on European Americans and African Americans showed that GERMLINE reports a significant excess of IBD relative to expectations, and that this bias is stronger for European Americans than for African Americans. (I compared the number of IBD segments reported to the number of homozygous by descent segments. In a randomly mating population, the number of homozygous by descent segments is expected to be 1/4 the number of IBD segments. Without -h_extend, GERMLINE reported >2.1x the expected number of segments for European Americans vs. ~1.1x the expectation for African Americans. I did not see a bias when I used the -h_extend option.)
I’ll be very interested to see how this paper develops during the review process. Best of luck with it, and again, I think this is great work!
Thanks for your comment Amy.
We are investigating various power issues and will def. consider doing more population specific power analyses, however the absence of any other non-computationally phased southern European haplotypes makes this tricky [The Af. data really isn’t suitable].
That said we have high power by >5cM, and the difference between populations is not subtle. We’d need a large difference in power to explain this result, and that seems very unlikely given the small variation in allele frequencies within Europe. For example we see a relatively large difference between Ireland and the UK, which have a tiny Fst.
I’m not sure what you mean by “the number of homozygous by descent segments is expected to be 1/4 the number of IBD segments”, as the number of IBD segments an individual shares should increase linearly with sample size, while HBD per individual should remain constant. Perhaps you mean for a pair of individuals, vs. for a single individual? That would make sense with your factor of 1/4. Also, is the “>2.1x the expected number of segments” for the number of IBD or HBD segments?
Peter and I will be at ASHG; it would be good to talk in person then if you are around.
Quantifying the power differences that would be necessary to explain the differences you see at >5cM could certainly be convincing. I wanted to highlight a concern I had about this area in general and also mention biases I’ve seen with GERMLINE, but it sounds like you have a pretty good case here.
That’s right: the number of HBD segments per individual should be ~1/4 the average number of IBD segments between pairs of individuals. Sorry. GERMLINE (without -h_extend) gives 2.1x more IBD per pair of individuals than what is expected, i.e., for European Americans, average # IBD segments / pair = 2.1 * 4 * average # HBD segments / individual.
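The consistency check described here comes down to a few lines of arithmetic. The sketch below assumes random mating, so that a pair of individuals (four haplotype pairings) is expected to carry four times as many shared segments as a single individual carries homozygous-by-descent segments (one haplotype pairing). The function name and the input numbers are made up for illustration and are not taken from any real GERMLINE run.

```python
def ibd_inflation(mean_ibd_per_pair, mean_hbd_per_individual):
    """Ratio of observed IBD per pair to the random-mating expectation,
    which is taken to be 4 * (HBD segments per individual)."""
    expected_ibd_per_pair = 4.0 * mean_hbd_per_individual
    return mean_ibd_per_pair / expected_ibd_per_pair


# Hypothetical summary statistics for two cohorts.
print(ibd_inflation(mean_ibd_per_pair=4.2, mean_hbd_per_individual=0.5))
print(ibd_inflation(mean_ibd_per_pair=2.2, mean_hbd_per_individual=0.5))
```

A ratio near 1 is what random mating predicts; population structure and differences in power to detect HBD versus IBD segments would also move it away from 1, so it is a rough diagnostic rather than a calibrated test.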
Would love to chat at ASHG. Let’s definitely meet up!
That is a clever point about using HBD. The ratio would deviate somewhat from 1/4 due to subpopulation structure etcetera. How much it deviates ought to be a good measure of something; but it’s not clear to me yet what exactly it is. It would be fun to figure this out.
Presumably the inflation of HBD vs IBD is an inbreeding coeff., i.e. an increase in coalescence over a certain time period weighted by power to detect a segment of a certain size. However, the diff. in power to detect HBD vs IBD makes this a hard stat. to use in practice.
The potential of this paper looks very promising in its ability to unravel the genetic history of Europe, including the Balkans.
However, as a person who is married to someone from the Balkans, I would ask that you proceed cautiously with your conclusions regarding selective sweeps by Huns and Slavs. The area is only just beginning to recover from a century of extreme nationalism and genocide.
I would be very interested to learn more about the genetic prehistory of this very complex region, and the issue of the Slavic invasions is one demographic point of interest, but so are others, as I point out in the preceding comments.
Another potential demographic effect in the Balkans, not mentioned above, might be due to the various epidemics there, which are well documented in history.
There has sometimes been a tendency by Western Europeans and Americans to overgeneralize about Eastern European history. I hope the authors of this paper will steer away from that tendency and consider the various possibilities for the demographic difference between Eastern and Western Europe.
While it may be the case that the effect you are seeing is due to Slavic or Hunnic invasions, the topic is worthy of a more careful investigation.
Thanks for your consideration.
^ The reason there’s extreme IBD sharing across Eastern Europe today is because the region has always been sparsely populated, but also experienced a few major in-situ expansions, like the Balto-Slavic dispersals.
The Hun hypothesis is a bit weird in this context, considering that Huns were never recorded north of the Carpathians, like in the East Baltic area, where the Eastern European IBD sharing reaches extreme levels. Instead, Huns were present in Bavaria, Italy and even in France.
@ Amy Williams
“One of the key results and a claim made in the paper is that levels of IBD sharing vary by region, including higher levels of sharing in Eastern Europe and lower sharing of Italians to other groups. While these signals may be valid, due to the complexity of IBD and the model used by BEAGLE, I’m not certain that the current analysis fully supports these conclusions.”
The results from this paper are in line with the fact that Italy has always had a relatively large population of diverse origin from around the Mediterranean region. Thus, with so many founders, Italians are less likely to be related to each other and to other Europeans.
On the other hand, Eastern Europeans, and especially Northeastern Europeans, derive from few founders. In fact, this is the region of Europe where most of the Mesolithic European ancestry has survived. As per above, this area has also experienced some massive in situ expansions. So it’s no wonder everyone’s related to each other, and that shows very clearly via the IBD stats in this paper.
Regarding Eastern Europe, it might be interesting to compare IBD stats of populations in the Carpathian-Julian Alps-Dinaric Alps-Pindos ranges against lowland continental populations and again against the coastal populations of the Adriatic, Aegean and Pontic coasts.
I would suspect that the highest level of endogamy would be in the Dinaric Alps and the lowest level along the coasts.
I really enjoyed reading this paper.
I have some minor suggestions regarding IBD and comparison with IBS that may be helpful to the authors:
1) it could be interesting to quantify how different the IBD and IBS matrices are by means of a Mantel test.
2) I would run AMOVA (or Analysis of Distance) with the IBD distance matrix and compare with IBS.
3) it could be interesting to plot the individual IBD relationships by means of Principal Coordinates (classical MDS), after transforming the IBD into a dissimilarity statistic (for example, using the approach of Gusev to obtain a normalized IBD based distance matrix between pairs of individuals). This plot could then be compared with the PCA result on genotypic data by means of a Procrustes analysis and also with the geographic coordinates of the samples.
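As a rough illustration of the first suggestion, a permutation-based Mantel test fits in a few lines of plain Python. Everything below is a sketch on toy data: the function names, the six hypothetical sampling locations, and the square-root "IBD dissimilarity" are invented for the example, and the one-sided p-value convention (counting permutations with a correlation at least as large as observed) is one common choice among several.

```python
import random
from math import sqrt


def upper_triangle(matrix, order=None):
    """Flatten the strict upper triangle, optionally after relabelling rows
    and columns with the permutation `order`."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mantel_test(a, b, permutations=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value).

    Rows and columns of `a` are permuted together, which preserves the
    dependence among entries that share an individual.
    """
    rng = random.Random(seed)
    b_flat = upper_triangle(b)
    observed = pearson(upper_triangle(a), b_flat)
    order = list(range(len(a)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        if pearson(upper_triangle(a, order), b_flat) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: six samples along a line, with a second dissimilarity that is
# a monotone transform of the first.
positions = [0, 1, 3, 6, 10, 15]
geo = [[abs(p - q) for q in positions] for p in positions]
ibd_dissimilarity = [[sqrt(d) for d in row] for row in geo]

r, p_value = mantel_test(geo, ibd_dissimilarity)
print(r, p_value)
```

The same function could be applied to an IBD-based and an IBS-based distance matrix computed on the same individuals, in the same order.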