Genomic tests of variation in inbreeding among individuals and among chromosomes

Joshua G. Schraiber, Stephannie Shih, Montgomery Slatkin
(Submitted on 26 Sep 2012)

We examine the distribution of heterozygous sites in nine European and nine Yoruban individuals whose genomic sequences were made publicly available by Complete Genomics. We show that it is possible to obtain detailed information about inbreeding when a relatively small set of whole-genome sequences is available. Rather than focus on testing for deviations from Hardy-Weinberg genotype frequencies at each site, we analyze the entire distribution of heterozygotes conditioned on the number of copies of the derived (non-chimpanzee) allele. Using Levene’s exact test, we reject Hardy-Weinberg in both populations. We generalized Levene’s distribution to obtain the exact distribution of the number of heterozygous individuals given that every individual has the same inbreeding coefficient, F. We estimated F to be 0.0026 in Europeans and 0.0005 in Yorubans, but we could also reject the hypothesis that F was the same in each individual. We used a composite likelihood method to estimate F in each individual and within each chromosome. Variation in F across chromosomes within individuals was too large to be consistent with sampling effects alone. Furthermore, estimates of F for each chromosome in different populations were not correlated. Our results show how detailed comparisons of population genomic data can be made to theoretical predictions. The application of methods to the Complete Genomics data set shows that the extent of apparent inbreeding varies across chromosomes and across individuals, and estimates of inbreeding coefficients are subject to unexpected levels of variation which might be partly accounted for by selection.
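To make the conditioning in this abstract concrete, here is a minimal sketch of Levene's exact distribution for a single biallelic site: the probability of seeing h heterozygotes among n diploid individuals, given that the sample carries k copies of the derived allele, under Hardy-Weinberg. The function name and the example numbers are illustrative assumptions, and this is not the authors' code; in particular it does not include their generalization to a non-zero inbreeding coefficient F.

```python
from math import comb, factorial

def levene_het_distribution(n, k):
    """P(h heterozygotes | n diploids, k derived allele copies) under Hardy-Weinberg."""
    total = comb(2 * n, k)  # ways to place k derived alleles among 2n gene copies
    dist = {}
    # h has the same parity as k and cannot exceed the rarer allele's count
    for h in range(k % 2, min(k, 2 * n - k) + 1, 2):
        n_dd = (k - h) // 2       # derived homozygotes
        n_aa = n - h - n_dd       # ancestral homozygotes
        ways = factorial(n) // (factorial(n_dd) * factorial(h) * factorial(n_aa))
        dist[h] = ways * 2 ** h / total  # each heterozygote has 2 phase arrangements
    return dist

# Hypothetical example: 9 individuals, 4 copies of the derived allele
for h, p in levene_het_distribution(9, 4).items():
    print(h, round(p, 4))
```

Inbreeding shifts probability mass toward smaller h than this distribution predicts, which is the kind of deviation the exact test is designed to pick up.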

Diversity and abundance of the Abnormal chromosome 10 meiotic drive complex in Zea mays

Lisa B. Kanizay, Tanja Pyhäjärvi, Elizabeth G. Lowry, Matthew B. Hufford, Daniel G. Peterson, Jeffrey Ross-Ibarra, R. Kelly Dawe
(Submitted on 25 Sep 2012)

Maize Abnormal chromosome 10 (Ab10) contains a classic meiotic drive system that exploits asymmetry of meiosis to preferentially transmit itself and other chromosomes containing specialized heterochromatic regions called knobs. The structure and diversity of the Ab10 meiotic drive haplotype are poorly understood. We developed a BAC library from an Ab10 line and used the data to develop sequence-based markers, focusing on the proximal portion of the haplotype that shows partial homology to normal chromosome 10. These molecular and additional cytological data demonstrate that two previously identified Ab10 variants (Ab10-I and Ab10-II) share a common origin. Dominant PCR markers were used with FISH to assay 160 diverse teosinte and maize landrace populations from across the Americas, resulting in the identification of a previously unknown but prevalent form of Ab10 (Ab10-III). We find that Ab10 occurs in at least 75% of teosinte populations at a mean frequency of 15%. Ab10 was also found in 13% of the maize landraces, but does not appear to be fixed in any wild or cultivated population. Quantitative analyses suggest that the abundance and distribution of Ab10 are governed by a complex combination of intrinsic fitness effects as well as extrinsic environmental variability.

Comparative Analysis of Tandem Repeats from Hundreds of Species Reveals Unique Insights into Centromere Evolution

Daniël P. Melters, Keith R. Bradnam, Hugh A. Young, Natalie Telis, Michael R. May, J. Graham Ruby, Robert Sebra, Paul Peluso, John Eid, David Rank, José Fernando Garcia, Joseph L. DeRisi, Timothy Smith, Christian Tobias, Jeffrey Ross-Ibarra, Ian F. Korf, Simon W.-L. Chan
(Submitted on 22 Sep 2012)

Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic methods to identify high-copy tandem repeats from 282 species using publicly available genomic sequence and our own data. The assumption that the most abundant tandem repeat is the centromere DNA was true for most species whose centromeres have been previously characterized, suggesting this is a general property of genomes. Our methods are compatible with all current sequencing technologies. Long Pacific Biosciences sequence reads allowed us to find tandem repeat monomers up to 1,419 bp. High-copy centromere tandem repeats were found in almost all animal and plant genomes, but repeat monomers were highly variable in sequence composition and in length. Furthermore, phylogenetic analysis of sequence homology showed little evidence of sequence conservation beyond ~50 million years of divergence. We find that despite an overall lack of sequence conservation, centromere tandem repeats from diverse species showed similar modes of evolution, including the appearance of higher order repeat structures in which several polymorphic monomers make up a larger repeating unit. While centromere position in most eukaryotes is epigenetically determined, our results indicate that tandem repeats are highly prevalent at centromeres of both animals and plants. This suggests a functional role for such repeats, perhaps in promoting concerted evolution of centromere DNA across chromosomes.

Our paper: The genetic prehistory of southern Africa

[This author post is by Joe Pickrell (@joe_pickrell), Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf on The genetic prehistory of southern Africa, available from arXiv here]

The indigenous populations of southern Africa are phenotypically, linguistically, culturally, and genetically diverse. Although many groups speak Bantu languages (having arrived in the region during an expansion of Iron-Age agriculturalists), there are a number of populations who speak diverse non-Bantu languages with heavy use of click consonants. We refer to these populations as “Khoisan”. Most of the Khoisan populations are hunter-gatherers, but some are pastoralists; the extensive linguistic and cultural diversity of the Khoisan (who live in a relatively small region around the Kalahari semi-desert) is historically puzzling.

Two hunter-gatherer (or formerly hunter-gatherer) populations in East Africa, the Hadza and Sandawe, also speak languages that make use of click consonants. Linguists see little in common between the languages in southern Africa and Hadza, although Sandawe might be genealogically related to some of the Khoisan languages. Nevertheless, the shared use of click consonants and a foraging lifestyle led many to hypothesize that the southern African Khoisan populations are genetically related to the Hadza and Sandawe, which would imply that their ancestors were once considerably more widespread. This hypothesis has been controversial for decades.

Tree relating the Khoisan-like proportion of ancestry (shown in blue in the barplot) in Khoisan, Hadza, and Sandawe after accounting for non-Khoisan admixture.

In our study, we use genetic data to address the history of the diverse groups within southern Africa and their relationship to the Hadza and Sandawe. Specifically, we genotyped individuals from 16 Khoisan populations, 5 neighboring populations that speak Bantu languages, and the Hadza (the latter thanks to Brenna Henn, Joanna Mountain, and Carlos Bustamante) on a SNP array designed for studies of human history, in that the SNP ascertainment scheme is known and includes SNPs ascertained in the Khoisan. We then merged in Hadza and Sandawe samples from a recent paper by Joseph Lachance, Sarah Tishkoff and colleagues. The main conclusions are as follows:

  1. Within the southern African Khoisan, there are two genetic groups, which correspond roughly to populations in the northwest and southeast Kalahari semi-desert. Populations from these two groups have been labeled in the tree in this post (see also Figure 1B in the preprint). We estimate that these two groups diverged within the last 30,000 years. However, this date should be taken as an upper bound due to point #2 below.
  2. All southern African Khoisan groups are admixed with non-Khoisan populations. Even the most isolated Khoisan groups (i.e. the “San” from the HGDP, who are included in the “Ju|’hoan_North” group in our paper) show some evidence of admixture with agriculturalist and/or pastoralist groups. A subtle technical point is that this had not been previously noticed because methods that rely on correlations in allele frequencies are sometimes unable to detect admixture if all populations are admixed (this is related to Mr. Razib Khan’s post on why ADMIXTURE is not a test for admixture). To get around this, we developed new methods based on the decay of linkage disequilibrium.
  3. The Hadza and Sandawe trace part of their ancestry to admixture with a population related to the Khoisan. After accounting for admixture, we built a tree of “Khoisan-like” ancestry in the southern and eastern African populations (see the Figure above). The striking thing is that the Hadza and Sandawe fall with high confidence on the same branch as the Khoisan. This suggests that, prior to subsequent migrations of food-producing peoples over most of sub-Saharan Africa, populations related to the Khoisan were indeed spread continuously over a huge geographic range including Tanzania and southern Africa.

We’re excited about these results for a number of reasons. First of all, we’re now on our way towards understanding the history of the diverse Khoisan populations: for years these populations have been treated as genetically equivalent, but it’s clear that each population has its own complex history. Secondly, with the new statistical methods we’ve developed we were able to show not only the varying amounts of admixture that have occurred at different times in southern African populations, but were also able to peel away these layers of admixture to learn about the relationships among Khoisan populations that existed thousands of years ago. Finally, we think that these results have important implications for work using genetics to understand the geographic origin of modern humans within Africa. Though both southern and eastern Africa have been proposed as potential origins, from the tree in this post, we see no genetic evidence in favor of either; from our point of view this question remains open.

Joe Pickrell, Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf

An age-of-allele test of neutrality for transposable element insertions not at equilibrium

Justin P. Blumenstiel, Miaomiao He, Casey M. Bergman
(Submitted on 16 Sep 2012)

How natural selection acts to limit the proliferation of transposable elements (TEs) in genomes has been of interest to evolutionary biologists for many years. To describe TE dynamics in populations, many previous studies have relied on the assumption of equilibrium between transposition and selection. However, since TE invasions are known to happen in bursts through time, this assumption may not be reasonable. Here we derive a test of neutrality for TE insertions that does not rely on the assumption of transpositional equilibrium. We consider the case of TE insertions that have been ascertained from a single haploid reference genome sequence and have had their allele frequency estimated in a population sample. By conditioning on age information provided within the sequence of a TE insertion in the form of the number of substitutions that have occurred within the fragment since insertion into a reference genome, we derive the probability distribution for the TE allele frequency in a population sample under neutrality. Taking models of population fluctuation into account, we then test the fit of predictions of our model to allele frequency data from 190 retrotransposon insertion loci in North American and African populations of Drosophila melanogaster. Using this non-equilibrium model, we are able to explain about 80% of the variance in TE insertion allele frequencies. Controlling for nonequilibrium dynamics of transposition and host demography, we demonstrate how one may detect negative selection acting against most TEs as well as evidence for a small subset of TEs being driven to high frequency by positive selection. Our work establishes a new framework for the analysis of the evolutionary forces governing large insertion mutations like TEs or gene duplications.

Thoughts on: The date of interbreeding between Neandertals and modern humans.

The following are my (Graham Coop, @graham_coop) brief thoughts on Sriram Sankararaman et al.’s arXived article: “The date of interbreeding between Neandertals and modern humans”. You can read the authors’ guest post here, along with comments by Sriram and others.

Overall it’s a great article, so I thought I’d spend some time talking about the interpretation of the results. Please feel free to comment; our main reason for doing these posts is to facilitate early discussion of preprints.

The authors’ analysis relies on measuring the correlation along the genome between alleles that may have been inherited from the putative admixture event [so-called admixture LD]. The idea is that if there was in fact no admixture and these alleles have just been inherited from the common ancestral population (>300kya), then these correlations should be very weak, as there has been plenty of time for recombination to break down the correlation between these markers. If there has been a single admixture event, the rate at which the correlation decays with the genetic distance between the markers is proportional to this admixture time [i.e. slower decay for a more recent event, as there is less time for recombination]. These ideas for testing for admixture have been around in the literature for some time [e.g. Machado et al]; it’s the application, and specifically the genome-wide application, that is novel.

As you can tell from the title and abstract of the paper, the authors find pretty robust evidence that this curve is decaying slower than we’d expect if there had been no gene flow, and estimate this “admixture time” to be 37k-86k years ago. However, as the authors are careful to note in their discussion, this is not a definitive answer to whether modern humans and Neandertals interbred, nor is this number a definite time of admixture. Obviously the biological implications of the admixture result will get a lot of discussion, so I thought I’d instead spend a moment on these caveats. [This post has run long, so I’ll only get to the 1st point in this post and perhaps return to write another post on this later].

Okay so did Neandertals actually mate with humans?

The difficulty [as briefly discussed by the authors] is that we cannot know for sure from this analysis that the time estimated is the time of gene flow from Neandertals, and not some [now extinct] population that is somewhat closer to Neandertals than any modern humans.

Consider the figure below. We would like to say that the cartoon history on the left is true, where gene flow has happened directly from Neandertals into some subset of humans. The difficulty is that the same decay curve could be generated by the scenario on the right, where gene flow has occurred from some other population that shares more of its population history with Neandertals than any current day human population does.

Why is this? Well, allele frequency change that occurred in the red branch [e.g. due to genetic drift] means that the frequencies in population X and Neandertals are correlated. This means that when we ask questions about correlations along the genome between alleles shared between Neandertals and humans, we are also asking questions about correlations along the genome between population X and modern humans. So under scenario B I think the rate of decay of the correlation calculated in the paper is a function only of the admixture time of population X with Europeans, and so there may have been no direct admixture from Neandertals into Eurasians*.

First things first: that doesn’t diminish how interesting the result is. If interpretation of the decay as a signal of admixture is correct, then it still means that Eurasians interbred with some ancient human population, which was closer to Neandertals than other modern humans. That seems pretty awesome, regardless of whether that population is Neandertals or some yet undetermined group.

At this point you are likely saying: well, we know that Neandertals existed as a [somewhat] separate population/species, so who is this population X you keep talking about, and where are their remains? Population X could easily be a subset of what we call Neandertals, in which case you’ve been reading this all for no reason [if you only want to know if we interbred with Neandertals]. However, my view is that in the next decade of ancient human population history things are going to get really interesting. We have already seen this from the Denisovan papers [1,2], and the work on ancient admixture in Africa (e.g. Hammer et al. 2011, Lachance et al. 2012). We will likely discover a bunch of cryptic, somewhat distinct ancient populations that we’ve previously [rightly] grouped into a relatively small number of labels based on their morphology and timing in the fossil record. We are not going to have names for many of these groups, but with large amounts of genomic data [ancient and modern] we are going to find all sorts of population structure. The question then becomes not an issue of naming these populations, but understanding the divergence and population genetic relationship among them.

There’s a huge range of (likely more plausible) scenarios that are hybrids between A and B that I think would still give the same difficulties with interpretations. For example, ongoing low levels of gene flow from population X into the Ancestral “population” of modern humans, consistent with us calling population X modern humans [see Figure below, **]. But all of the scenarios likely involve something pretty interesting happening in the past 100,000 years, with some form of contact between Eurasians and a somewhat diverged population.

As I say, the authors to their credit take the time in the discussion to point out this caveat. I thought some clarification of why this is the case would be helpful. The tools to address this problem more thoroughly are under development by some of the authors on this paper [Patterson et al 2012] and others [Lawson et al.]. So these tools along with more sequencing of ancient remains will help clarify all of this. It is an exciting time for human population genomics!

* I think I’m right in saying that the intercept of the curve with zero is the only thing that changes between Fig 1A and Fig 1B.

** Note that in the case shown in Figure 2, I think Sriram et al are mostly dating the red arrow, not any of the earlier arrows. This is because they condition their subset of alleles to represent introgression into Europeans and to be at low frequency in Africa. We would likely not be able to date the deeper admixture arrow into the ancestor of Eurasians/Africans using the authors’ approach, as [I think] it relies on having a relatively non-admixed population to use as a control.

Robust identification of local adaptation from allele frequencies

Torsten Günther, Graham Coop
(Submitted on 13 Sep 2012)

Comparing allele frequencies among populations that differ in environment has long been a tool for detecting loci involved in local adaptation. However, such analyses are complicated by an imperfect knowledge of population allele frequencies and neutral correlations of allele frequencies among populations due to shared population history and gene flow. Here we develop a set of methods to robustly test for unusual allele frequency patterns, and correlations between environmental variables and allele frequencies while accounting for these complications based on a Bayesian model previously implemented in the software Bayenv. Using this model, we calculate a set of `standardized allele frequencies’ that allows investigators to apply tests of their choice to multiple populations, while accounting for sampling and covariance due to population history. We illustrate this first by showing that these standardized frequencies can be used to calculate powerful tests to detect non-parametric correlations with environmental variables, which are also less prone to spurious results due to outlier populations. We then demonstrate how these standardized allele frequencies can be used to construct a test to detect SNPs that deviate strongly from neutral population structure. This test is conceptually related to FST but should be more powerful as we account for population history. We also extend the model to next-generation sequencing of population pools, which is a cost-efficient way to estimate population allele frequencies, but it implies an additional level of sampling noise. The utility of these methods is demonstrated in simulations and by re-analyzing human SNP data from the HGDP populations. An implementation of our method will be available from this http URL.
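As a rough illustration of the "standardized allele frequencies" idea, the sketch below removes the covariance among populations by dividing through by a Cholesky factor of a covariance matrix, so that the transformed values are uncorrelated under the neutral model. Everything here is an assumption made for illustration: the function names, the toy matrix, and especially the centering on the sample mean, which stands in for whatever ancestral-frequency term and scaling the Bayenv model actually uses. It is not the Bayenv implementation.

```python
from math import sqrt

def cholesky(matrix):
    """Lower-triangular factor L with L * L^T = matrix (symmetric positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def forward_solve(lower, rhs):
    """Solve lower * x = rhs by forward substitution."""
    out = []
    for i, row in enumerate(lower):
        partial = sum(row[k] * out[k] for k in range(i))
        out.append((rhs[i] - partial) / row[i])
    return out

def standardize_frequencies(freqs, omega):
    """Whiten per-population allele frequencies with the covariance matrix omega."""
    mean = sum(freqs) / len(freqs)            # simplification: center on the sample mean
    centered = [f - mean for f in freqs]
    return forward_solve(cholesky(omega), centered)

# Hypothetical two-population example
omega = [[4.0, 2.0], [2.0, 5.0]]
print(standardize_frequencies([0.6, 0.2], omega))
```

Once the frequencies are on this scale, ordinary tests (for example a rank correlation with an environmental variable) can be applied without the shared history among populations inflating the signal.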

Our paper: The date of interbreeding between Neandertals and modern humans

This post is by Sriram Sankararaman, Nick Patterson, Heng Li, Svante Pääbo, and David Reich on their paper The date of interbreeding between Neandertals and modern humans arXived here

The relationship between modern humans and archaic hominins such as Neandertals has been the subject of intense debate. The sequencing of a Neandertal genome, a couple of years back (Green et al, Science 2010), showed that Neandertals are more closely related to non-African genomes than African genomes. One possible model consistent with this observation is one involving gene flow from Neandertals to modern non-Africans after the divergence of African and non-African populations. Another model that can explain these observations is one in which the population ancestral to modern humans and Neandertals is structured. For example, imagine that the population ancestral to Neandertals and modern humans consists of three groups, A, B and C, which represent the ancestors of modern Africans, non-Africans and Neandertals respectively. The extra proximity of Neandertals to non-Africans over Africans could occur if A and B, and B and C, exchanged genes with each other, followed by C diverging to form Neandertals, and A and B not completely hybridizing before their divergence to form Africans and non-Africans.

The Neandertal (Green et al, Science 2010) and the Denisova genome (Reich et al, Nature 2010) papers considered the possibility of both models — either scenario was shown to produce the skew in the observed D-statistics (a measure of the excess sharing of alleles across groups) that led to Neandertals appearing closer to non-Africans than Africans. Indeed, a recent paper by Eriksson and Manica (Eriksson and Manica, PNAS 2012) used an Approximate Bayesian Computation framework with D-statistics as the summary statistics and arrived at similar conclusions.
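For readers who have not met D-statistics before, here is a minimal sketch of the usual ABBA-BABA count for one haploid sequence per group. It assumes alleles have already been polarized against an outgroup (0 = ancestral, 1 = derived); the function name, the sign convention, and the toy sequences are assumptions for illustration, and real analyses add block-jackknife standard errors that are omitted here.

```python
def d_statistic(p1, p2, p3):
    """D = (ABBA - BABA) / (ABBA + BABA) for polarized 0/1 alleles.

    p1, p2: two modern human sequences (e.g. African, non-African)
    p3:     the archaic sequence (e.g. Neandertal)
    """
    abba = baba = 0
    for a, b, c in zip(p1, p2, p3):
        if c != 1:
            continue                  # only sites where the archaic carries the derived allele
        if a == 0 and b == 1:
            abba += 1                 # p2 shares the derived allele with p3
        elif a == 1 and b == 0:
            baba += 1                 # p1 shares the derived allele with p3
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)

# Toy example: two ABBA sites and one BABA site
african     = [0, 0, 1, 0, 1]
non_african = [1, 1, 0, 1, 1]
neandertal  = [1, 1, 1, 0, 1]
print(d_statistic(african, non_african, neandertal))
```

A value near zero is what a simple tree predicts; the point made in the post is that both recent gene flow and ancient structure can push it away from zero, so the statistic alone cannot tell them apart.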

A paper from Monty Slatkin’s group (Yang et al, MBE 2012) attempted to differentiate the two scenarios by using the site frequency spectrum. Yang et al considered the site frequency spectrum in Europeans conditioned on observing a derived allele in Neandertal and an ancestral allele in Africans (termed the doubly-conditioned frequency spectrum, dcfs). They used theory and simulations to show that an ancient structure model produces a linear dcfs. On the other hand, they showed that recent gene flow can produce an excess of rare variants which matches the observed dcfs. Interestingly, they also observed that bottlenecks post gene flow had the effect of making the dcfs linear suggesting that gene flow from Neandertals could not have preceded strong bottlenecks in the non-African populations.

A different idea that we explored was to ask if patterns of linkage disequilibrium (LD) might discriminate the two scenarios. If we could pick out haplotypes that came into modern humans from Neandertal, recombination is expected to break these haplotypes down at a fixed rate every generation (assuming neutrality). Haplotypes that came in 1000 generations ago (under recent gene flow) should be expected to be 10 times longer on average than haplotypes that came in 10000 generations ago (under ancient structure). And if we could measure LD precisely enough, we could even date these ancient events. To date such ancient events, we had to address two technical challenges: i) measures of LD can be sensitive to demographic events, ii) for events that occurred 1000s of generations ago, we need to measure LD at size scales at which genetic maps can be quite noisy and this noise can bias estimates of dates.

Theory indicates that the expected LD (measured by Lewontin’s D), across SNPs that arose on the Neandertal lineage and introgressed, decays exponentially with genetic distance at a rate given by the time of gene flow and is robust to demographic events. This result does not hold in practice due to imperfect ascertainment of these SNPs. We did simulations to show that this decay of LD does provide accurate estimates and can differentiate gene flow and ancient structure. We also came up with a model to assess errors in genetic maps which we then used to obtain a corrected date.
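The exponential decay described above can be sketched in a few lines. Assuming the idealized model D(d) = A * exp(-t * d), with d the genetic distance in Morgans and t the time since gene flow in generations, a straight-line fit to log(D) against d recovers t as minus the slope. The function name and the simulated values are hypothetical, and the sketch ignores everything that makes the real problem hard (ascertainment, noise, and genetic map error), so it should not be read as the method used in the paper.

```python
from math import exp, log

def estimate_admixture_time(distances_morgans, ld_values):
    """Least-squares slope of log(LD) on genetic distance; returns t in generations."""
    logs = [log(v) for v in ld_values]          # requires strictly positive LD values
    n = len(distances_morgans)
    mean_x = sum(distances_morgans) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances_morgans, logs))
    sxx = sum((x - mean_x) ** 2 for x in distances_morgans)
    return -sxy / sxx

# Hypothetical noise-free curve generated with t = 2000 generations
true_t, amplitude = 2000.0, 0.05
distances = [0.0002 * i for i in range(1, 11)]            # 0.02 to 0.2 cM
ld = [amplitude * exp(-true_t * d) for d in distances]
print(estimate_admixture_time(distances, ld))
```

The same model gives the rule of thumb quoted earlier in the post: the length scale of the decay is 1/t Morgans, so an event 1000 generations ago leaves correlations over distances ten times longer than an event 10000 generations ago.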

Our results support the recent gene flow scenario with a likely date of gene flow into the ancestors of modern Europeans 37000-86000 years BP although this does not exclude the possibility of ancient structure. A broader methodological question we are exploring is whether LD-based analyses might be generally applicable as a tool for dating other ancient gene flow events.

Sriram Sankararaman, Nick Patterson, Heng Li, Svante Pääbo, and David Reich

Our paper: A faster-X effect for gene expression in Drosophila embryos

[This author post is by Alex Kalinka and Pavel Tomancak on their paper, An excess of gene expression divergence on the X chromosome in Drosophila embryos: implications for the faster-X hypothesis, posted to the arXiv here.]

We have been working towards publishing our study of gene expression evolution on the X chromosome in Drosophila embryos since the beginning of March this year. Recently, Casey Bergman suggested that we upload our manuscript to the arXiv, and after we did so, we were kindly invited by Graham Coop to write a guest post about our work for Haldane's Sieve.

It makes sense to post here since the roots of our study go back to Haldane in 1924 [1]; he recognised that the unusual inheritance pattern of the X chromosome, in which a single copy is present in the heterogametic sex versus two copies in the homogametic sex, could in turn lead to unusual evolutionary patterns on the X relative to the autosomes. If, for example, a beneficial mutation is recessive, then it would be more exposed to natural selection in the heterogametic sex where, relative to an equivalent autosomal allele, it would spend less time being masked by the dominant, less beneficial allele [1]. The prediction that adaptive evolution might proceed more quickly on the X than on the autosomes has been dubbed the faster-X hypothesis. However, the X chromosome might also be expected to evolve more rapidly for non-adaptive reasons. In each mating pair there will be 3 copies of the X chromosome versus 4 copies of each autosome, which might in turn lead to a lower chromosomal effective population size for the X thereby increasing the strength of random genetic drift.
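The "3 copies versus 4 copies" argument can be made quantitative with the textbook effective-size formulas for a population of breeding males and females in which males carry a single X. The sketch below uses those standard results; the function name and the population sizes are illustrative assumptions, and the formulas ignore complications such as unequal variance in reproductive success between the sexes.

```python
def effective_sizes(n_males, n_females):
    """Return (autosomal Ne, X-linked Ne) for an XY system with males hemizygous."""
    autosomal = 4 * n_males * n_females / (n_males + n_females)
    x_linked = 9 * n_males * n_females / (4 * n_males + 2 * n_females)
    return autosomal, x_linked

# Hypothetical population with an equal sex ratio
ne_a, ne_x = effective_sizes(500, 500)
print(ne_a, ne_x, ne_x / ne_a)   # ratio is 3/4 when the sex ratio is equal
```

With an equal sex ratio the X has three quarters of the autosomal effective size, so drift is stronger on the X; the ratio moves away from 3/4 as the numbers of breeding males and females diverge.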

While some studies have reported evidence for a faster-X effect for adaptive protein evolution in Drosophila, other studies have reported that there is no difference between the X and the autosomes, and to date the evidence is somewhat inconclusive. As we focused our study on gene expression, we had an opportunity to relax the implicit assumption that the majority of adaptive evolution occurs in coding regions. To help disentangle adaptive and non-adaptive evolutionary signatures in our data, we used both between-species measures of gene expression divergence and within-species measures of gene expression variation using inbred strains of D. melanogaster generated by the Drosophila Genetic Reference Panel (DGRP).

We found an excess of gene expression divergence on the X chromosome between six Drosophila species (a mean increase of ~20%). In contrast, we found that the X exhibits a significantly lower level of gene expression variation between inbred strains of D. melanogaster (a mean decrease of ~10%). Taken together, these results suggest that the divergence that we find between species is not driven by a relaxation of selective constraint on the X chromosome. To further explore whether such a signature could be driven by the hemizygosity of the X, we analysed gene expression in mutation accumulation lines of D. melanogaster. If the single copy of the X in males is driving the excess of expression divergence that we found on the X, then we would expect to find an excess of expression variation between lines that have independently accumulated mutations. In fact, we found the opposite was the case – the X chromosome displayed a significantly lower rate of mutation accumulation than the autosomes suggesting that the hemizygosity of the X alone is not sufficient to drive a higher rate of fixation of gene expression mutations.

Overall, we argue that the excess divergence we find on the X is best understood within the framework of the faster-X hypothesis. In support of this interpretation, we find that there is an excess of gene expression divergence on Muller's D element along the branch leading to the obscura sub-group; Muller's D element segregates as a neo-X chromosome in the obscura sub-group, and hence provides a powerful, independent test of faster evolution of the X chromosome.

Several questions remain, however, and we hope that our findings will help to stimulate further research into the details underpinning the differences we find on the X. In particular, work needs to be done to discover the genetics underlying divergence on the X, such as the relative importance of cis versus trans-acting factors, and, crucially, we need to develop a better understanding of how variation in gene expression impacts organismal fitness. Research into the latter question is essential if we are to bridge the conspicuous gaps between sequence variation, gene expression variation, and organismal fitness.

Dave Gerrard initially found elevated gene expression divergence on the X in the course of analysing data for our developmental hourglass paper, and spoke about his findings at the 43rd population genetics conference; that was more than two years ago. Since then we collected new data, and it took a while to put the paper together although it is still not certain that it will be published in a traditional journal. The arXiv is a great way to let the scientific community know about your results before the academic process runs its course. We only regret we didn't make use of this excellent outlet back in March.

[1] Haldane JBS (1924) A mathematical theory of natural and artificial selection. Part I. Trans Camb Phil Soc 23: 19-41.

Our paper: Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture

[This author post is by John Pool on his paper: Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture arXived here.]

We are in the process of publishing this analysis of >100 sequenced Drosophila melanogaster genomes (largely haploid genomes at >25X depth).  These genomes come from more than 20 geographic locations, largely within sub-Saharan Africa, where the species is thought to originate.  Truth be told, this sampling scheme was somewhat accidental – we wanted to identify a population representing a “center of genetic diversity” for the species, which for us involved sequencing small numbers of genomes from many different population samples (some from previous lab stocks, others from newly collected lines).  Ultimately we did find the sample we were looking for, and we are in the process of sequencing ~300 genomes from this Zambian population.  Still, it seemed more than worthwhile to analyze the “geographic scatter” of genomes we had obtained from across sub-Saharan Africa (as well as one small sample from Europe).

Our ambitions for this paper were largely descriptive – a preliminary analysis of genetic variation within and among the sampled populations.  We envisioned being able to compare diversity levels and genetic structure across Africa (much as I once did with a dramatically smaller data set), and to identify specific loci with signatures of selection.  And we were able to do that.  We found the highest levels of genetic diversity in and around Zambia, raising the prospect of a southern-central African origin for D. melanogaster.  We found low-to-moderate levels of genetic structure across most of sub-Saharan Africa, with only Ethiopian populations showing stronger genetic differentiation (along with some morphological differentiation, but that’s another story).  Analyses of allele frequencies within and between populations revealed a substantial number of loci with evidence of recent natural selection – many GO categories enriched for such outliers pertained to gene regulation, much as we had observed in another recent population genomic analysis.

Of course that’s how we normally think of natural selection’s influence on genetic variation – specific beneficial mutations leading to selective sweeps (whether hard or soft, partial or complete), each one influencing diversity on a limited genomic scale.  And at least in species with large outbreeding populations like Drosophila, recurrent hitchhiking may be common enough to affect diversity at random sites in the genome (e.g. 1, 2, 3).  So we weren’t surprised to find sweep signals.  The bigger surprise to us was finding evidence that specific episodes of natural selection had affected genetic variation on the scale of whole chromosome arms or the entire genome.

The first major surprise concerned genomic patterns of non-African admixture in African D. melanogaster populations.  The occurrence of such introgression had been documented before, and there were previous findings that non-African genotypes were associated with urban environments in Africa, and that admixture levels could vary within the genome. We developed a hidden Markov model approach to detect admixed chromosomal regions (based simply on the reduced diversity found in populations outside sub-Saharan Africa).  Whereas we tend to think of admixture as a selectively neutral force, the genomic patterns of admixture we observed did not seem consistent with passive gene flow.  Non-African genotypes had displaced large portions of the gene pool of presumably quite large African populations, and this had occurred within a very short time (judging by the megabase scale of admixture tracts).  Levels of admixture across the genome showed both broad-scale heterogeneity (chromosomal differences) and relatively narrow “spikes” of admixture.  These peaks of admixture quite often overlapped with outliers for high FST between Africa and Europe, as would be expected if these regions contained functional differences between populations for which introgressing non-African alleles may now be favored in some African environments (e.g. modernizing cities).  
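To make the idea of a diversity-based hidden Markov model concrete, here is a minimal sketch; it is not the method from our manuscript. It assumes the chromosome has been cut into windows with one nucleotide diversity value each, that the two hidden states ("African" and "admixed") emit Gaussian-distributed diversity with a shared standard deviation, and that the state switches between adjacent windows with a fixed probability. The function name, the parameters, and the toy numbers are all hypothetical and chosen only for illustration.

```python
import math


def viterbi_admixture(diversity, mean_african, mean_admixed, sd, switch_prob=0.01):
    """Label each window 0 (African ancestry) or 1 (admixed) by Viterbi decoding.

    All parameters are illustrative assumptions, not values from the paper.
    """
    means = (mean_african, mean_admixed)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob)

    def log_emit(x, state):
        # Gaussian log-density, dropping the constant shared by both states.
        z = (x - means[state]) / sd
        return -0.5 * z * z

    # Uniform prior over the two states for the first window.
    score = [log_emit(diversity[0], s) for s in (0, 1)]
    back = []  # back[i][s] = best previous state if window i+1 is in state s

    for x in diversity[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [
                score[p] + (log_stay if p == s else log_switch) for p in (0, 1)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + log_emit(x, s))
        score = new_score
        back.append(pointers)

    # Trace the most probable state path backwards from the last window.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy data: three low-diversity windows embedded in a high-diversity background.
windows = [0.010, 0.011, 0.009, 0.003, 0.002, 0.003, 0.010, 0.011]
print(viterbi_admixture(windows, mean_african=0.010, mean_admixed=0.003, sd=0.002))
# → [0, 0, 0, 1, 1, 1, 0, 0]
```

With these toy settings a single low-diversity window is not enough to call an admixture tract, because two state switches cost more than one poorly fitting window. Only runs of low-diversity windows are labelled as admixed, which is the behaviour one wants when tracts are expected to be long.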

The second surprise came as we documented population genetic patterns associated with polymorphic inversions (as further analyzed in a forthcoming paper by Russ Corbett-Detig and Dan Hartl).  It was already known that inversions tend to differ in frequency between D. melanogaster populations, but theory and most empirical data suggested that only diversity around the inversion breakpoints should be affected.  Instead, we observed some African populations in which elevated inversion frequencies were associated with notable reductions in diversity for entire chromosome arms (ultimately affecting genome-wide average diversity), consistent with directional selection on rearrangements or linked loci.  Perhaps more surprisingly, most inversions found in the non-African sample (France) served to substantially increase diversity across whole chromosome arms (by up to 29% in the case of inversions on arm 3R), and by 12% genome-wide.  Here, we can only suggest that selection may have acted to favor inverted chromosomes that recently originated from a more genetically diverse (e.g. African or African-admixed) population.  Accounting for these inversions substantially alters chromosomal diversity ratios between African and European populations.

Hence, we may have the curious situation of natural selection driving introgression in both directions across the sub-Saharan/cosmopolitan population genetic divide in D. melanogaster.

You can find our draft manuscript here, supplemental items here, and the data here.

I’m definitely glad we were able to post a draft at arXiv – it was time to communicate our findings to the research community (especially to facilitate our colleagues’ analysis and publication plans for this data set), and there’s really no downside for us as authors.  I also appreciate the chance to post here at Haldane’s Sieve, and it would be great to discuss any aspect of our draft.

John Pool