Reconstructing the Population Genetic History of the Caribbean
Andres Moreno-Estrada, Simon Gravel, Fouad Zakharia, Jacob L. McCauley, Jake K. Byrnes, Christopher R. Gignoux, Patricia A. Ortiz-Tello, Ricardo J. Martinez, Dale J. Hedges, Richard W. Morris, Celeste Eng, Karla Sandoval, Suehelay Acevedo-Acevedo, Juan Carlos Martinez-Cruzado, Paul J. Norman, Zulay Layrisse, Peter Parham, Esteban Gonzalez Burchard, Michael L. Cuccaro, Eden R. Martin, Carlos D. Bustamante
(Submitted on 3 Jun 2013)
The Caribbean basin is home to some of the most complex interactions in recent history among previously diverged human populations. Here, by making use of genome-wide SNP array data, we characterize ancestral components of Caribbean populations on a sub-continental level and unveil fine-scale patterns of population structure distinguishing insular from mainland Caribbean populations as well as from other Hispanic/Latino groups. We provide genetic evidence for an inland South American origin of the Native American component in island populations and for extensive pre-Columbian gene flow across the Caribbean basin. The Caribbean-derived European component shows significant differentiation from parental Iberian populations, presumably as a result of founder effects during the colonization of the New World. Based on demographic models, we reconstruct the complex population history of the Caribbean since the onset of continental admixture. We find that insular populations are best modeled as mixtures absorbing two pulses of African migrants, coinciding with early and maximum activity stages of the transatlantic slave trade. These two pulses appear to have originated in different regions within West Africa, imprinting two distinguishable signatures in present day Afro-Caribbean genomes and shedding light on the genetic impact of the dynamics occurring during the slave trade in the Caribbean.
Fascinating in regards to the drift of a “Latino European” component from the parental Southwest European one. There is historical documentation of extensive polygyny among the early European male settlers, so this is not implausible. Now I am curious about possible future Y chromosomal analysis; a bottleneck should be evident in particular in this lineage. Did not see reference in the paper, so presume that there hasn’t been enough coverage (or, Y STRs don’t have the power, and we need to wait for sequencing of Y’s?).
Razib, thanks for the comments. We are working on sequencing Y’s for several projects. What’s also pretty cool is that it seems to be a much stronger signal in the Caribbean and Colombia vs. Mexico.
Carlos. Congrats on what looks like a really nice paper.
Following up on Razib’s point, have you looked at the population-specific drift on the X chromosome? Or is your sample size too small?
Also [as suggested here: http://dienekes.blogspot.com/2013/06/population-history-of-caribbean-moreno.html] do you think the location on the PCA component could reflect homogenizing gene flow in Europe subsequent to the founding of these populations?
We tried to get the PCA projection of IBD blocks working in our PLOS Bio. paper, after seeing it in one of your PNAS african popgen papers. However, it was always a mess, perhaps because migration within Europe is in all directions.
For Figure 4, the European ancestry PCA, are you running the PCA on all of the individuals? Or are you running it just on POPRES and then projecting your Eu. ancestry admixed haplotypes on to the first 2 PCAs? I think you are doing the latter, is that right?
If it is the latter, I’m missing something as I don’t see how subsequent drift in the founding of those populations could lead to them being further out on the PC axis [i.e. looking even more “Iberian”].
If we imagine a model where the ancestral population of the Eu component was isolated and underwent a bout of drift, this branch-specific drift would be independent of all previous drift. In that case, wouldn’t I expect those samples to project to their ancestral location [obv. this could break down at some point], not further out on the axis? Is that correct? It is always a little tricky to think in PCA space, so perhaps I’ve gotten myself confused.
If you are doing the former then this seems less of an issue, although I’m still not sure how I’d expect it to behave. I suspect doing it both ways would allow some additional information.
Graham, I think you mean Figure 5? I have the same intuition–if you learn PCs on POPRES and then project, in principle you shouldn’t pick up extra drift in the projected populations because that drift isn’t present in the samples you learned the PCs from.
If this is indeed the analysis that was done, I think Dienekes’s explanation (gene flow involving Iberia in the last 500 years) is plausible. If modern Iberians can be modeled as Iberians of 500 years ago plus a bit of gene flow from elsewhere in Europe, I think you’d expect the Iberian samples to be shifted towards Europe relative to the “Caribbean-derived European component” in this analysis.
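The intuition in this comment (PCs learned on a reference panel should not pick up drift private to the projected samples) can be checked with a toy simulation. The sketch below is only illustrative and is not the paper's ASPCA pipeline: the two reference populations, the allele frequencies, the drift size and the function names are all made-up assumptions. It learns PC1 from two reference populations, projects a drifted descendant of one of them, and reports how far the descendant lands from its parent relative to the distance between the two references.

```python
import random

def draw_genotypes(freqs, n, rng):
    """Draw n diploid genotypes (0, 1 or 2 copies per SNP) from allele frequencies."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs] for _ in range(n)]

def leading_pc(rows, rng, iters=50):
    """Column means and leading principal axis of `rows`, found by power iteration."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - mean[j] for j in range(m)] for r in rows]
    v = [rng.gauss(0, 1) for _ in range(m)]  # random starting direction
    for _ in range(iters):
        # One step of v <- X^T (X v), then renormalise.
        scores = [sum(c[j] * v[j] for j in range(m)) for c in centred]
        w = [sum(scores[i] * centred[i][j] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def mean_score(rows, mean, v):
    """Average PC1 coordinate of a group of individuals."""
    total = sum(sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows)
    return total / len(rows)

def relative_shift(seed=1, n_snps=300, n_per_pop=40, drift_sd=0.1):
    """|C - A| on PC1 as a fraction of |A - B|, with PC1 learned on A and B only."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(n_snps)]
    freq_a = [0.5 + 0.15 * s for s in signs]  # hypothetical reference population A
    freq_b = [0.5 - 0.15 * s for s in signs]  # hypothetical reference population B
    # C descends from A and has private drift, independent at every SNP (assumption).
    freq_c = [min(0.95, max(0.05, p + rng.gauss(0, drift_sd))) for p in freq_a]
    pop_a = draw_genotypes(freq_a, n_per_pop, rng)
    pop_b = draw_genotypes(freq_b, n_per_pop, rng)
    pop_c = draw_genotypes(freq_c, n_per_pop, rng)
    mean, v = leading_pc(pop_a + pop_b, rng)  # C is NOT used to learn the axis
    a, b, c = (mean_score(p, mean, v) for p in (pop_a, pop_b, pop_c))
    return abs(c - a) / abs(a - b)

if __name__ == "__main__":
    print(relative_shift())
```

Under these assumptions the drifted population should project close to its parent, because its private drift is uncorrelated with the axis learned from the reference panel. Running PCA on all three populations together is a different analysis and is not covered by this sketch.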
Graham, thanks for the comments and for triggering an interesting discussion. In response to your question about the European ASPCA (Fig 5), we are indeed doing as you describe in the former case, that is, using all the individuals to define the PCA space. This was exactly one of the questions that I also got from Ewan Birney at BoG in CSHL. We agree that in this case it is better to run PCA on the full dataset rather than projecting a subset onto a pre-defined PCA space.
Thanks for the response, and for joining the conversation. I wasn’t sure which one you were doing, and that makes more sense now. I think my confusion arose from the Figure 5 caption:
“ancestry derived from insular Caribbean (black symbols) and mainland populations (gray symbols) are projected onto a reference panel”
That makes it sound [at least to me] like you are using only POPRES for the PCA, and then projecting the other samples [i.e. the latter]. Perhaps a slight rewording of the caption and methods would help avoid this confusion.
Perhaps you could do the latter method, to rule out the “subsequent homogenizing by gene flow within Europe” point. If your ancestral European segments projected on top of the Iberian samples, it would show that they look just like “Iberian” when considered in the context of modern European variation. The point about the Mexican European component of ancestry [i.e. that it doesn’t look similarly drifted] makes that point, but it seems like it could be made more directly via this route.
One slight concern I have is that the European ancestral blocks might look more “spanish”, as they need to look different from global mean frequencies in order for a block to be detected. Presumably you guys have thought a bunch about that, can you rule that out?
Totally agree, Figure 5 caption will need some rewording in order to make that clear. Thanks for spotting that!
I am not sure if I quite understand your last point about EUR blocks looking more “Spanish”. We are detecting continental-level blocks at K=3 (i.e., EUR, AFR, NAT) prior to running ASPCA with larger sub-continental reference panels, so any sub-EUR block is initially called just “European” with respect to African and Native American differentiation patterns (and then allowed to cluster anywhere within Europe using ASPCA).
I had the same worry as Graham; but you have a good point. The pattern in Figure 5 could, in principle, happen if segments that looked “more strongly Iberian” were more likely to be correctly identified as “European”. This doesn’t seem terribly likely. But it would be easy to check: construct some fake admixed genomes, apply the method, and project the segments called as European back on the map.
Actually, it might be that projection of a segment along PC2 correlates with the chance of being correctly identified as European — heuristically, both would correlate with density of ancestry-informative markers (or local informativeness). If some regions of the genome were more informative than others about “ancestry”, both at a local (within Europe) and global scale, you’d actually expect this to occur.
I do think your interpretation is plausible, especially given that the same thing does not occur with Mexican populations. (However, one might still worry even given this observation, if the length distribution of the tracts in the Mexican samples is quite different?)
Have you tried using only the EUR-blocks of all the populations involved to generate the PCA?
I ask because the Iberian samples in the K=3 analysis visually appear to show some membership in the AFR-cluster as well (while the more North and East European populations show slight membership in the NAT-cluster) and presumably the null hypothesis would be that this was also the case for the founding Iberian populations of Latin America.
Is it possible that the “black” component (at best CV error value, K=7) actually represents a non-European ancestry like North African (incl. Canarian Guanche) or something else like Jewish? I’d really love if future versions of this paper would include a North African (and maybe also Jewish) control population, in order to discard (or confirm) this kind of hypothesis.
Another issue I have spotted is that in the population history reconstruction (fig. 3), the most recent African input in some Caribbean countries seems impossibly recent, being dated to a mere four generations ago, i.e. c. 1890. The most striking case is Haiti, which became independent 200+ years ago, almost double that time estimate. A similar date for Cuban African ancestry is also highly suspect.
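The date arithmetic behind this objection can be written out explicitly. The snippet below is a minimal sketch: the 30-year generation time and the 2013 sampling year are assumptions chosen to match the commenter's "c. 1890", and 1804 is used as the year of Haitian independence. None of these values come from the paper itself.

```python
def calendar_year(generations_ago, sampling_year=2013, years_per_generation=30):
    """Convert 'generations before sampling' into an approximate calendar year."""
    return sampling_year - generations_ago * years_per_generation

def generations_since(year, sampling_year=2013, years_per_generation=30):
    """Number of generations elapsed between `year` and the sampling year."""
    return (sampling_year - year) / years_per_generation

if __name__ == "__main__":
    print(calendar_year(4))         # 1893 with the assumed defaults
    print(generations_since(1804))  # generations since Haitian independence
```

With these assumed values, a pulse four generations back lands in the 1890s, while Haitian independence lies roughly seven generations back, which is the mismatch the comment points out. A shorter assumed generation time makes the inferred pulse even more recent.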
Otherwise the paper is very interesting and I am fairly excited with it. Thanks to the authors for their effort.
Maju, thanks for the comments. We have extensively tested this, including North African and Middle Eastern control populations to rule out the possibility of a non-European origin of the Latino-specific component. It is definitely a good idea to add this to a revised version of the paper. Thanks again for your interest!
Thanks to you, Andrés. I find the study very interesting overall and I do think that it is very good already for usual publication standards in this field. However that would enrich it even more.
Something that has arisen once and again in my discussions on this paper (which has already got quite a bit of interest) is that surely having more sample points in Spain or historical Castile (Andalusians, Extremadurans, etc.) would also help to clarify things about this mystery component. On the other hand the v1 variety of European populations seems a bit excessive and pointless (Scandinavians or Swiss, for example, are definitely not any meaningful source of Latin American ancestry).
One issue with this paper is that in Colombia at least it seems like people with African ancestry are heavily under-represented. There are regions of Colombia where the people have a majority of African ancestry, and other regions where there is significant African ancestry throughout a heavily admixed population.
Does anyone have information on where these samples were collected? If they were self-submitted, it would make sense if they over-represented white wealthy urban elites who have the means/interest to participate in international genomics studies. But if that’s the case, the picture being painted of Colombia seems to be quite off…
In support of the founder effect theory, has anyone looked specifically at potential inbreeding resulting from 500 years of the same DNA being kicked around these small islands? My roots go back to Puerto Rico, and after testing myself and my parents at 23andMe I noticed that both parents shared DNA with approximately 85% of my DNA cousins… Seems like a very closely related population…
We just read the paper over here, and thought it was very interesting, and well-thought out. Great data analysis. Everything is generally convincing, but here’s some feedback on bits that could have been more convincing:
First: regarding the evidence for the bottleneck in the Europeans. There were two issues with the IBD analysis (figure S12): First, at 2Mb, we estimated the false positive rate to be about 50% in Europeans. It would probably be much higher in an admixed population. Second, shouldn’t you be comparing IBD rates to the Iberian subset of POPRES, not just an (unspecified) subset of Europeans? Also, it seemed strange that ADMIXTURE estimates that the Latinos show a lower Fst to northern Europeans than to southern Europeans (table S3). Is this some sort of artifact of the clustering procedure?
Second: I think it’s plausible that you’d be able to see evidence of different origins within Africa of the different periods of the slave trade, but I don’t think figure 6 establishes this unambiguously. If I understand correctly, you’re fitting some Gaussians in some PC space to the African populations, then reporting average posterior probabilities that the African portions of each haplotype come from each population. But I think this is neglecting the error involved in the projection onto the PC space. Suppose that long and short segments both come in the same proportions from these populations, but that the projections of short segments have a larger variance around the population mean. Then you’d expect to see something more or less like what you see: short segments have more even posterior assignment probabilities. Again, this is something you could test by constructing some fake admixed genomes.
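The argument about projection error can be illustrated with a small simulation. This is a sketch of the commenter's thought experiment, not of the method used for figure 6: the one-dimensional PC space, the two source populations at -1 and +1, the 70/30 split and the two noise levels are all invented for illustration. Every segment is drawn from the same 70/30 mixture; only the projection noise differs between "long" and "short" segments.

```python
import math
import random

def posterior_a(x, mean_a=-1.0, mean_b=1.0, ref_sd=0.5):
    """P(source A | PC coordinate x) under equal priors and equal-variance Gaussians."""
    log_a = -((x - mean_a) ** 2) / (2 * ref_sd ** 2)
    log_b = -((x - mean_b) ** 2) / (2 * ref_sd ** 2)
    top = max(log_a, log_b)  # subtract the max so exp() cannot overflow
    ea, eb = math.exp(log_a - top), math.exp(log_b - top)
    return ea / (ea + eb)

def average_posterior_a(projection_sd, true_share_a=0.7, n_segments=20000, seed=1):
    """Mean posterior for source A when segments are projected with noise `projection_sd`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_segments):
        # The true source is A with probability `true_share_a`, whatever the segment length.
        centre = -1.0 if rng.random() < true_share_a else 1.0
        total += posterior_a(rng.gauss(centre, projection_sd))
    return total / n_segments

if __name__ == "__main__":
    print("long segments (low noise):   ", average_posterior_a(0.5))
    print("short segments (high noise): ", average_posterior_a(3.0))
```

In this toy setup the low-noise average stays near the true 0.7 share, while the high-noise average drifts toward 0.5, even though the underlying source proportions are identical. That is the confound being described; whether it is large enough to matter for the real data would need the fake-admixed-genome test suggested above.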