Our paper: Oh sister, where art thou? Indirect fitness benefit could maintain a host defense trait

This guest post is by Pleuni Pennings on the paper “Oh sister, where art thou? Indirect fitness benefit could maintain a host defense trait”, available from the arXiv here. This is cross-posted from her website here.

Tobias Pamminger, Susanne Foitzik, Dirk Metzler and I analyzed the small scale spatial structure of ants of the species Temnothorax longispinosus. These ants are the host of a slavemaking ant. The slavemakers go on raids, and steal young from the host species to work as slaves in their nests. We wanted to know whether the slaves still have relatives in the nearby nests. If they do, then their behavior – which influences the slavemakers – could have an effect on their relatives and therefore on their indirect fitness.

To find out if slaves are related to their neighbours, we collected lots of ant nests (they nest in acorns), both in New York and in West Virginia, marked exactly where we found them and genotyped them at six microsatellites.

Ants in acorn

Photograph by Andreas Gros
Temnothorax longispinosus in acorn

We put little flags at the exact location of an ant nest to measure the distances between the nests.

Microsat Data

This is one of the figures from the manuscript. Plot R (from West Virginia) is shown to demonstrate the distribution of colonies within a plot and to show the distribution of alleles of one of the six microsatellite loci (GT1) among colonies. Each colony is represented by a pie-diagram with the frequencies of different GT1 alleles amongst the genotyped individuals of the colony. R3 is a slavemaker nest (we genotyped the slaves, not the slavemakers) and shares most of its alleles with the free nest R7. R13 and R15 are free living host colonies in close proximity and appear to be related.

Our main conclusion is that the enslaved ants are indeed related to their neighbors. The manuscript can be found on the arXiv here: http://arxiv.org/abs/1212.0790

The manuscript was peer-reviewed at Peerage of Science, a new and very useful community of scientists who agree to review each other’s papers fairly. See http://www.peerageofscience.org/

The manuscript is part of Tobias Pamminger’s PhD thesis. Tobias defends his thesis this week in Mainz!! Congrats Tobias!

Tobias came up with the awesome title for the paper “Oh sister, where art thou? Indirect fitness benefit could maintain a host defense trait.”

Our paper: Bacterial diversity associated with Drosophila in the laboratory and in the natural environment

For our next guest post, Fabian Staubach and Dmitri Petrov write about their paper (along with coauthors) Bacterial diversity associated with Drosophila in the laboratory and in the natural environment, arXived here.

Host associated bacterial communities are ubiquitous, have a variety of effects on the host phenotype and play a role in host adaptation to new environments. Some clear examples of such adaptations are known but generally these are ancient associations between host and symbiont, such as the association between aphids and the obligate symbiotic bacterium Buchnera that provides the aphid with essential amino acids or the association between bee wolfs and Streptomyces that protects bee wolf larvae from fungal infections. We are investigating the potential of bacterial communities to underlie short-term adaptation using adaptation of D. melanogaster and D. simulans to different fruit as a study system.

As the first step we profiled the diversity and composition of bacterial communities associated with Drosophila across multiple species, habitats, and substrates. We amplified and sequenced a region of the bacterial ribosomal DNA from whole body fly samples using 454 technology. We focused on comparing the bacterial communities of the sibling species D. melanogaster and D. simulans in the lab and in an ecologically and evolutionarily relevant setting: their natural environment. In most cases we were able to study flies from these two species collected by aspiration from the same fruit. We also included nine different species spanning the Drosophila phylogeny to test whether phylogenetic distance and distance between bacterial communities are correlated.

We show that natural bacterial communities associated with Drosophila contain more different bacterial taxa than previously thought. Comparison to a mammalian fecal data set reveals that although mammal-associated bacterial communities are more diverse on average, the diversity of some mammalian fecal samples lies within the range or is even lower than that of the Drosophila samples we analyzed. This finding is interesting because it has been a matter of debate whether organisms with an adaptive immune system can in general accommodate higher bacterial diversity. By comparing the bacterial communities of D. melanogaster and D. simulans collected directly from different natural food substrates we demonstrate that bacterial communities differ primarily between substrates and very weakly among fly species.

We find acetic acid bacteria of the genera Acetobacter and Gluconobacter to be associated with all wild-caught flies constituting two thirds of all sequences. Acetic acid bacteria oxidize sugars and ethanol to acetic acid and are known to be directly involved in the development of a specific process of decay called ‘sour rot’ on grapes that causes wine spoilage. There is previous evidence that Drosophila is vital for the dispersal of acetic acid bacteria among rotting fruit: grapes covered with nets in the field do acquire yeasts, but no acetic acid bacteria and acetic acid bacteria thrive on grapes only when flies are present. At the same time, Acetobacter has been shown to promote Drosophila larval growth and shorten development time under certain nutritional conditions. Therefore, we argue that the relationship between Acetobacteraceae and Drosophila is likely mutualistic.

Individual natural fly samples are dominated by bacteria known to be pathogenic in Drosophila, such as Enterococcus and Providencia. These bacteria are known to reach very high cell counts during systemic infections of Drosophila and we believe that the inclusion of systemically infected flies in these samples is the most likely explanation for the observed pattern. The observation that it is in principle possible to identify potential candidate pathogens in natural populations using standard, high throughput microbial community screening techniques opens up opportunities for large scale epidemiological studies in nature and can help to identify candidate pathogenic bacterial species for further investigation in the laboratory.

In the laboratory, fly associated bacterial communities are similar irrespective of phylogenetic distance between fly species, suggesting that host genetic factors either play a minor role in shaping the bacterial communities associated with Drosophila or, as suggested by the difference of bacterial communities between D. melanogaster and D. simulans in the wild, require natural conditions to manifest themselves. High variability of Drosophila bacterial communities within and between laboratories is a potential source of experimental noise when studying phenotypic variation. The impact of microbes on Drosophila phenotypes ranges from influencing growth to cold tolerance and it is hard to imagine traits that are not subject in principle to alteration by microbes.

We hope that our data will serve as a solid foundation for future studies especially for the growing community of scientists that are interested in the microbial communities that are associated with Drosophila.

Fabian Staubach and Dmitri Petrov

My paper: “HIV drug resistance: problems and perspectives”

Our next guest post is by Pleuni Pennings (@pleunipennings) on her paper HIV drug resistance: problems and perspectives arXived here, cross posted from her website here.

A few days ago, I submitted a review paper to Infectious Disease Reports. The review is an invited essay for the special issue they are planning around the World AIDS Day (December 1st).

I was pleasantly surprised to see that the author guidelines of Infectious Disease Reports said: “Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.” So, I decided to upload the manuscript to the arXiv.

The essay describes the current situation of drug resistance in HIV. The main conclusion is that, overall, drug resistance is not as big a problem as one may think. Treatments have become very good, which means that the rate of evolution of drug resistance is low. At the same time, many new drugs have become available so that when drug resistance evolves, the patient can be switched to another set of drugs. However, in poor countries, where viral genotyping, viral load monitoring and many new drugs are not available, drug resistance still poses a serious threat to people’s health.

In the essay, I explain that transmitted drug resistance occurs, but at a level that is lower than many would have expected. Roughly 10% of newly infected patients are infected with an HIV strain with at least one major drug-resistance mutation. If the virus is genotyped before treatment is started (as is standard in rich, but not in poor, countries), then treatment success is very high for these patients.

Acquired drug resistance (when resistance evolves during treatment) is more common than transmitted drug resistance, and resistance can evolve even after many years of successful treatment. It can also happen that the virus becomes resistant against multiple drugs. Nowadays, there are many different drugs available, so that even patients with multi-class drug resistance can often be treated successfully, although this is not the case in poor countries, simply because the newer drugs are expensive.

I also describe what is known about resistance due to treatment for the prevention of mother-to-child-transmission (which is a big problem) and resistance due to pre-exposure prophylaxis (which occurs, but is uncommon). I also discuss the issue of low-frequency resistance mutations and their clinical relevance. Throughout the essay, I explain how certain effects are expected or surprising from an evolutionary perspective.

I thank my collaborators Daniel Rosenbloom and Alison Hill (both at Harvard) for useful comments on an earlier version of the manuscript.

Pleuni Pennings

Our paper: The geography of recent genetic ancestry across Europe

This guest post is by Peter Ralph and Graham Coop (@graham_coop) on their paper The geography of recent genetic ancestry across Europe arXived here

In this paper we look at the genetic traces of very recent common ancestry between pairs of individuals from across Europe. We’ll likely write a few more accessible posts on this work when the paper is closer to publication, but for now (in the spirit of Haldane’s sieve) we write a somewhat more technical post [the full details are in the paper].

We started this project wanting to estimate recent migration rates, across continents like Europe — if we could learn how far away distant cousins are from each other, then, all else equal, we could then estimate typical migration distances. This isn’t where we ended up (that’s another project we are working on), but the basic idea, of looking at the geographic distribution of close relatives, led us to some interesting places.

As most populations lack amazing pedigrees like the one worked on by Decode in Iceland [e.g. here], we can’t actually know the true relationships between the samples (other than a few obvious siblings and full cousins). However, long segments of chromosome shared (almost) identically by descent (IBD) between two people have probably been inherited from a recent common ancestor. The length of these IBD segments tells us something about how long ago the ancestor lived, since the older the ancestor, the more opportunity for recombination to whittle down the segment.
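The link between segment length and ancestor age can be sketched numerically. This is a minimal illustration under a simplified textbook model, not the calibration used in the paper: a block inherited by two people from a common ancestor n generations back has passed through 2n meioses, and if crossovers fall roughly as a Poisson process along the chromosome, block lengths are approximately exponential with mean 100/(2n) cM. The function names and the 30-year generation time are assumptions made for the example.

```python
import math

def mean_ibd_length_cm(generations):
    """Approximate mean block length (cM) after 2 * generations meioses."""
    return 100.0 / (2 * generations)

def prob_longer_than(length_cm, generations):
    """P(block longer than length_cm) under the exponential approximation."""
    return math.exp(-2 * generations * length_cm / 100.0)

YEARS_PER_GENERATION = 30  # assumed value, for illustration only

for years in (300, 1500, 6000):
    n = years / YEARS_PER_GENERATION
    print(years, "years:",
          "mean length", mean_ibd_length_cm(n), "cM;",
          "P(longer than 3 cM)", prob_longer_than(3, n))
```

Running it shows the qualitative pattern described above: the further back the ancestor, the shorter the expected block and the smaller the chance that a block of several cM survives.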

This has been worked on by a bunch of different groups, but the historical inference has usually been applied to small or relatively isolated populations. To really push the boundaries of these approaches we used the European subset of the POPRES dataset, which consists of thousands of human individuals. This is currently one of the best genome-scale, geographically indexed datasets, and represents a huge outbred population where we’d expect patterns of variation to be (at least partly) due to continuous migration, rather than, say, recent mixing of diverged populations or bottlenecks. So, we ran BEAGLE on the dataset to find IBD segments, and got lots of wonderful signal — it turned out that most pairs of people in the European sample (around 75%) shared IBD segments that were megabases long (i.e. longer than 1 centi-Morgan, cM). After a bunch of power and false-positive simulations, we were convinced that most of those blocks of IBD had been inherited from single common ancestors.

You could think of our results in two pieces: first, doing descriptive statistics on the distribution of IBD abundances and lengths across geography; and second, doing some inference on this distribution to see what we can learn about when those common ancestors lived.

As we hoped, there was a nice relationship to geography – people nearer to each other typically shared more and longer IBD than people farther away, in a nice monotonic relationship. This convinced us that continuous, local migration had played an important role in shaping current patterns of relatedness across Europe. Geographic distance was definitely not the only factor – superimposed on top of this was distinctive regional variation. For example, one of the strongest signals we saw was that there are higher levels of IBD sharing in Eastern Europe. As you’ll see from the paper, after further work, we think this is a potential signal of the Slavic or Hunnic expansions.

There were also some surprises to us along the way, like people in the UK sharing more IBD with Irish than with other people in the UK, that turned out to make sense after thinking about rapidly growing populations with directional migration (although there are other explanations). We correlated the patterns we saw with historical events, but (as with most genomic studies of human history) there was a lot of uncertainty. Sure, the patterns we see are consistent with the story we told, but there could potentially be a lot of other explanations, especially given the complicated and often unknown demographic history of European populations. What if all the IBD we saw came from the Neolithic expansion rather than the last two or three thousand years? This turns out to be a bigger worry than you might think: it’s fairly unlikely that two people inherited a 3cM block from a single common ancestor from 6000 years ago, but if they have enough common ancestors from back then (e.g. a strong enough bottleneck), it turns out to be reasonably likely.
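The "unlikely per ancestor, likely overall" argument is simple arithmetic, sketched here with made-up numbers. The per-ancestor probability and the ancestor counts below are hypothetical placeholders, not estimates from the paper, and the sketch treats ancestors as independent, which is a simplification.

```python
def prob_at_least_one_block(p_single, n_shared_ancestors):
    """P(at least one long block) if each shared ancestor independently
    contributes one with probability p_single (a simplifying assumption)."""
    return 1.0 - (1.0 - p_single) ** n_shared_ancestors

P_SINGLE = 1e-6  # hypothetical chance of a long block from one old ancestor

for n_ancestors in (10**3, 10**5, 10**7):  # hypothetical ancestor counts
    print(n_ancestors, prob_at_least_one_block(P_SINGLE, n_ancestors))
```

With few shared ancestors the probability stays tiny, but it climbs towards one as the number of shared ancestors grows, which is why a strong bottleneck could in principle mimic a recent signal.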

So, we did some coalescent theory to work out the relationship between numbers of shared ancestors back through time (closely related to coalescent rate) and the observed distribution of IBD block lengths. We could then invert this relationship to estimate from the observed distribution of IBD blocks the mean number of ancestors that pairs of people from different parts of Europe share with each other, as a function of time. Unfortunately, this turns out to have a lot of unavoidable uncertainty: the inversion problem is “ill-conditioned” (in other terms, the likelihood surface is ridged), meaning that there were a lot of different histories that gave essentially the same IBD length distribution.
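A toy example gives a feel for why the inversion is so poorly constrained. It assumes the same crude exponential approximation for block lengths (rate 2n per Morgan for an ancestor n generations back) and compares two invented histories: all shared ancestry at generation 50, versus an even split between generations 40 and 60. The bin edges and histories are illustrative choices, not those used in the paper.

```python
import math

def binned_lengths(generations, edges_cm):
    """P(block falls in each length bin | block longer than edges_cm[0]).
    Bins are [e0, e1), [e1, e2), ..., [e_last, infinity)."""
    rate = 2 * generations / 100.0  # per cM, exponential approximation
    survival = [math.exp(-rate * (e - edges_cm[0])) for e in edges_cm] + [0.0]
    return [survival[i] - survival[i + 1] for i in range(len(edges_cm))]

EDGES_CM = list(range(1, 11))  # bins starting at 1, 2, ..., 10 cM

single_pulse = binned_lengths(50, EDGES_CM)
two_pulses = [0.5 * a + 0.5 * b
              for a, b in zip(binned_lengths(40, EDGES_CM),
                              binned_lengths(60, EDGES_CM))]

# Largest difference between the two histories in any length bin.
max_gap = max(abs(s - t) for s, t in zip(single_pulse, two_pulses))
print(max_gap)
```

The two histories differ by well under two percentage points in every bin, so telling them apart requires very large samples and careful treatment of noise, and smoother alternatives are harder still to distinguish.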

Despite this, we could still rigorously learn a lot of good information. In particular, nearly all the IBD blocks we found did actually come from ancestors living during the last 3,000 years. Although we could only tie down the ages of the common ancestors to within a few hundred years, the major patterns can likely be tied to known historical events. There is quite a bit of uncertainty about the specific interpretations (it is still not straightforward to go from pairwise numbers of shared ancestors 1,500 years ago to conclusions about demographic events at the time), but used in conjunction with other sources of information, this approach has the promise to conclusively resolve some longstanding debates about recent history.

Finally, two addendums (addenda?) about the methods: The first is that we took an empirical approach to estimating the relationship between coalescent time distribution and observed IBD block length, by simulating a bunch (actually copying over blocks and re-running BEAGLE). We did this because BEAGLE is effectively a black box, for our purposes. This sort of approach is more common in experimental physics, where the empirical properties of detectors have to be worked out (and the problem of inferring the signal is known as “data unfolding”).

Second, we should emphasize that the uncertainty we came across in inferring dates is theoretically unavoidable, using IBD block length data. We think this is a common issue for many sorts of population genetics data — situations in which, even though we have a ton of data, getting specific, tightly constrained inferences requires making fairly strong assumptions (or equivalently, working in a specific set of parametric models). This has been highlighted in some cases [like this one], but more work is needed on this to ensure that we represent the inherent uncertainties in population genetics inferences correctly.

We’d love feedback from the popgen community about what aspects of the paper they’d like to see clarified/improved [obviously it is a pretty involved paper already, so concentrate on specific suggestions]. We have a tonne more ideas of how to improve this inference technology and extend it to other applications. But we’d love to hear your thoughts too.

Our paper: An age-of-allele test of neutrality for transposable element insertions not at equilibrium

[This author post is by Justin Blumenstiel and Casey Bergman on An age-of-allele test of neutrality for transposable element insertions not at equilibrium, available from the arXiv here]

Studies over the past several decades in Drosophila melanogaster have demonstrated that TE insertion alleles in natural populations tend to segregate at low frequency, particularly in regions of the genome that have a high recombination rate where natural selection is most effective. These results have largely supported a model where natural selection acts to remove deleterious TE insertions from the genome.  The prevailing model of why TE insertions are deleterious is that they lead to chromosomal aberrations that occur when dispersed, non-allelic repeated sequences crossover with one another. This model is known as the ectopic recombination model and it has an important feature. Since each new insertion has the potential to recombine with all the other copies in the genome, fitness will go down faster and faster with each new copy. This yields a stable equilibrium in TE copy number.
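Why accelerating fitness costs give a stable equilibrium can be seen in a minimal deterministic sketch. The assumptions here are chosen for illustration and are not taken from the paper: each copy transposes at rate u per generation, and the marginal fitness cost of one more copy is a + b*n, so it grows with the mean copy number n. The mean then changes by roughly n*(u - a - b*n) per generation, and the parameter values are arbitrary.

```python
def next_copy_number(n, u, a, b):
    """One generation of the toy recursion: gain by transposition,
    loss by selection whose per-copy strength (a + b*n) grows with n."""
    return n + n * (u - (a + b * n))

def equilibrium_copy_number(u, a, b):
    """Copy number at which transposition and selection balance."""
    return (u - a) / b

def simulate(n0, u, a, b, generations):
    n = n0
    for _ in range(generations):
        n = next_copy_number(n, u, a, b)
    return n

U, A, B = 0.01, 0.001, 0.0001  # arbitrary illustrative values

print(equilibrium_copy_number(U, A, B))
print(simulate(1.0, U, A, B, 5000), simulate(200.0, U, A, B, 5000))
```

Starting from too few or too many copies, the recursion settles on the same value. If the cost per copy were constant instead (b = 0), there would be no such balance point: copy number would simply grow or shrink.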

But, are TEs at equilibrium in natural populations? Genome sequencing studies have shown that the rate of TE proliferation can vary widely over time and any given TE family may demonstrate non-equilibrium “boom and bust” behavior. How do we reconcile studies that assume equilibrium with the fact that we know TE dynamics are not at equilibrium? To deal with this problem, I began developing this model out of a class project with John Wakeley while I was a graduate student over a decade ago. This model arose out of some work I published in 2002 with Hartl and Lozovsky on the age structure of non-LTR elements in D. melanogaster. I wrote this model up for my Ph.D. thesis and presented a preliminary version in a paper with Neafsey and Hartl in 2004, but it sat on the back burner until I reviewed a paper by Bergman and Bensasson in 2007 that showed many TE families in D. melanogaster have recently inserted in the genome and may not be at equilibrium.

Shortly after their paper came out I contacted Casey with the model from my thesis and we decided to push this idea forward as a collaboration, which has taken a few years to come to fruition (both being busy with other projects and starting our labs). Things started to really move ahead when Miaomiao He in Casey’s lab generated a crucial data set that could be specifically applied to the model – strain-specific presence/absence data for a very large number of TE insertions ascertained from the D. melanogaster genome sequence. After a few more years with it on simmer, working out several kinks in the meantime (e.g. incorporating host demography, trying many different methods for estimating the posterior distribution of TE ages), Casey and I finally wrapped it up just as Haldane’s Sieve is starting to hit its stride. I expect that all my papers in the future will be pre-released on arXiv.

I could speak at length on the specific results, but I would just be saying what is already in the abstract. So, I would like to bring up three points for potential conversation.

First, what does it mean for TEs to be at transposition-selection balance when we know different TE families show a signature of “boom and bust” in genome sequences? There may be one way to reconcile this apparent problem. Any particular TE family may in fact not be at transposition-selection balance. For example, the P element, which invaded Drosophila melanogaster only a few decades ago, is hardly at transposition-selection balance. Therefore, one must be careful in using insertion frequencies for P elements to describe general TE dynamics. However, by integrating over all TE families in the genome, one may in fact reach an approximation that might be reasonable for assuming equilibrium transposition-selection balance. But one must be careful of something I call “family ascertainment bias”. Sometimes the most recently activated TEs are the ones easiest to discover and annotate because these ones are easily cloned from insertion mutations or are most frequent in genome sequences.

Second, in this paper, we derive the probability distribution for each individual TE insertion frequency based on its age. We demonstrate that this provides a method for identifying TE insertions that are either positively or negatively selected. In the case where we show allele frequencies are less than expected (i.e. predicted to be negatively selected), many of these are copies that have zero substitutions. In principle, all of these could have inserted one generation before the reference strain was collected for genome sequencing. The inference that selection is acting against these TEs implicitly assumes either: 1) this wasn’t the case for many of these insertions, and the posterior distribution of ages is a good representation of the true age distribution, or 2) it may have been the case, but natural selection has already acted to remove slightly older TEs from the population, therefore making them absent from the genome sequence.

Third, when putting the finishing touches on our analysis of TE insertion data in North America, we ran up against the issue that nobody has yet published an explicit demographic scenario for North American populations of D. melanogaster, similar to those that have been developed by Wolfgang Stephan’s Lab and others for European and African populations. We found one paper by Yukilevich et al (2010) from John True’s Lab that generated similar findings to the demography of European populations, which is consistent with the idea that North American populations of D. melanogaster are mainly derived from European ancestors. However, Yukilevich et al (2010) didn’t explicitly model the admixture with African populations, which is known to occur in North American populations as shown by Caracristi and Schlötterer in 2003. We were surprised that an explicit admixture scenario has not been published yet, especially since this is crucial for interpreting the data from population genomic projects like the Drosophila Genetic Reference Panel. This should be an important line of work for someone to pursue (if it isn’t being done already) and if anyone has information about a demographic model for North American populations of D. melanogaster, we’d be keen to know more so we can see if it might improve our analysis.

Justin and Casey

Our paper: Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution

This guest post is by Daniël Melters [@DPMelters] and Keith Bradnam [@kbradnam] on their paper [along with co-authors]: Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. ArXived here.

The centromere poses an interesting paradox; although its function is essential, its molecular components are fast evolving. Centromeres in many animal and plant genomes have been characterized by the presence of large tandem repeat arrays. Numerous studies have suggested that the composition and length of the repeat units that comprise these arrays vary between species.
In this paper we tried to answer three main questions:
1) Can we identify the candidate centromere repeat sequences in genomes from hundreds of different species?
2) Do candidate centromere repeat sequences from different species share any common properties (sequence composition, length, GC% etc)?
3) How do these tandem repeats evolve?
To answer these questions, we took advantage of the large number of species with publicly available whole genome shotgun sequence data from various sequencing platforms. In total we analyzed 282 animal and plant genomes for the presence of high copy tandem repeat sequences, with the assumption that the most abundant tandem repeat is a good candidate for the centromere repeat.

We found high copy tandem repeats in the vast majority of the 282 genomes that we analyzed. For the smaller number of species with published cytology data, we correctly identified the published repeat sequence in 38 out of 43 cases. This supports our assumption that the most abundant tandem repeat in any genome is likely to be the centromere repeat. In the five cases where we did not find the published centromere tandem repeats, we did not have data from sequencing platforms that would have allowed us to identify these repeats.

If an individual sequencing read contains at least four tandem repeats, then there is the possibility of detecting higher order repeat (HOR) structure, i.e. cases where a tandem array is made up of two alternating types of related sequence (A and B) to produce an A->B->A->B structure. In these cases, the AB dimer is more similar to other AB dimers than A is to B. We found that HOR structure was surprisingly common in the candidate centromere repeats of many different species. The very long reads from Pacific Biosciences (PacBio) sequencing allowed us to further characterize repeat structure in great detail (for a few selected species), and this revealed additional levels of HOR structure.
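The A->B->A->B idea can be sketched with a toy check. This is not the pipeline used in the paper: it assumes the monomer length is already known, that monomers are the same length with no indels, and that similarity is plain per-position identity. The function names and the example sequences are invented for illustration.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def mean_identity_at_offset(monomers, offset):
    """Average identity between monomers that are `offset` units apart."""
    pairs = [(monomers[i], monomers[i + offset])
             for i in range(len(monomers) - offset)]
    return sum(identity(a, b) for a, b in pairs) / len(pairs)

def looks_like_dimer_hor(read, monomer_len):
    """True if every-other monomers resemble each other more than
    neighbours do, the signature of an alternating A/B array."""
    monomers = [read[i:i + monomer_len]
                for i in range(0, len(read) - monomer_len + 1, monomer_len)]
    if len(monomers) < 4:  # need at least four repeats in the read
        return False
    return (mean_identity_at_offset(monomers, 2)
            > mean_identity_at_offset(monomers, 1))

A = "ACGTACGTAC"  # invented monomer variant A
B = "ACGTTCGTAA"  # invented monomer variant B, 80% identical to A

print(looks_like_dimer_hor((A + B) * 2, 10))  # alternating array
print(looks_like_dimer_hor(A * 4, 10))        # simple array of one variant
```

Real reads need alignment-based comparisons to cope with indels and sequencing errors, so this only conveys the logic of the comparison.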

To address the important question of ‘how similar are centromere repeats in different species?’, we performed an all-vs-all comparison between the most abundant tandem repeat in every species. Surprisingly, we found only 26 groups of species that shared any significant sequence similarity in their candidate centromere repeat sequence. The species that make up these 26 groups were always closely related species which had diverged less than 50 million years ago. When comparing the repeat sequences in these groups of closely related species, we found that repeats evolve not only by accumulation of mutations, but also by the spread of indels or by repeat doubling.

These results are in line with the ‘library’ hypothesis, which aims to describe how ratios of repeat variants can change over time. In addition, PacBio sequencing found very long tandem repeats (~1,500 bp). Furthermore, in switchgrass (Panicum virgatum) we identified several centromere repeat variants, but PacBio sequences did not show any mixing of these repeat variants. In summary, tandem repeats are frequently associated with the centromere function and most probably evolve according to the “library” hypothesis (a.k.a. molecular drive).

This paper is dedicated to the late Simon Chan, who passed away on the 22nd of August 2012 at the young age of 38 (see here for more information).

Daniël Melters and Keith Bradnam
PS. Supplementary table can be provided upon email request.

Our paper: Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

[This author post is by Peter Carbonetto on Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease, available from the arXiv here.]

I expect that most readers of this blog appreciate the impact that genome-wide association studies have had on our understanding of many common diseases. Still, I think it is important to reiterate a major appeal of genome-wide association studies: the analysis is conceptually straightforward to understand, even for people who have never had to suffer through a course on statistics or epidemiology. To find links between genetic loci and disease, the analysis consists of systematically searching across the genome for variants that show statistically significant correlation with susceptibility to disease. These correlations signal the presence of nearby genes—or perhaps DNA elements that regulate other genes—that are risk factors for disease.

Many readers of this blog will also appreciate, due to the multifactorial nature of most common diseases, the difficulty of establishing compelling evidence for disease-variant correlations. Hence the search for more effective data-driven strategies for discovering genetic factors underlying common diseases.

One strategy is to assess evidence for the accumulation, or “enrichment,” of disease-conferring mutations within known biological pathways. The intuition is that identifying the accumulation of small genetic effects acting in a common pathway is easier than mapping the individual genes within the pathway that contribute to disease susceptibility.

We asked whether identifying these enriched pathways can also give us useful feedback about the individual gene variants associated with disease. To answer this question, we developed a statistical method that adjusts the support for disease-variant associations to reflect enrichment of associations in a pathway. Our approach was to introduce an enrichment parameter that quantifies the increase in the probability that each variant in the pathway is associated with disease risk.
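One simple way to picture such an enrichment parameter is as a shift in the prior log-odds of association for variants inside the pathway. The sketch below is an illustration under assumptions, not the implementation from the paper: the log10-odds parameterization, the names theta0 and theta, the numerical values, and the single-variant posterior calculation are all chosen for the example, and the paper should be consulted for the actual polygenic model.

```python
def prior_inclusion_prob(theta0, theta, in_pathway):
    """Prior probability that a variant is associated with disease.
    theta0: genome-wide log10 prior odds (assumed parameterization).
    theta:  enrichment, added to the log10 odds inside the pathway."""
    log10_odds = theta0 + (theta if in_pathway else 0.0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)

def posterior_prob(bayes_factor, prior):
    """Posterior probability for one variant considered on its own,
    a simplification of the multi-variant analysis."""
    numerator = bayes_factor * prior
    return numerator / (numerator + 1.0 - prior)

THETA0, THETA = -4.0, 2.0  # hypothetical values
BAYES_FACTOR = 1000.0      # hypothetical evidence from the data

for in_pathway in (False, True):
    prior = prior_inclusion_prob(THETA0, THETA, in_pathway)
    print(in_pathway, prior, posterior_prob(BAYES_FACTOR, prior))
```

The same strength of evidence from the data yields a noticeably higher posterior probability for a variant inside an enriched pathway, which is how pathway-level information can feed back into support for individual variants.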

Is this a valid approach? To investigate, we applied our approach to data from the Wellcome Trust Crohn’s disease study from 2007. First, we identified a broad class of cytokine signaling genes that were enriched for genetic associations with Crohn’s disease. Next, by prioritizing variants in this pathway, we discovered candidates for association—including the STAT3 gene, the IBD5 locus, and the MHC class II genes—that were not identified in conventional analyses of the same data. These results help validate our approach, as these genetic associations have been independently confirmed in other studies and meta-analyses with much larger combined samples.

Several other important lessons emerged from our case study:

1. Interrogate as many pathways as possible. Because we collected over 3000 candidate pathways from several sources (Reactome, KEGG, BioCarta, BioCyc, etc.), many of the pathways highlighted in previous analyses of the same data were eclipsed by much stronger enrichment signals in our analysis.

2. Assess evidence for combinations of enriched pathways. Some pathways become interesting only after assessing enrichment of the pathway in combination with another pathway.

3. Account for the heterogeneity of effect sizes in Crohn’s disease. One of the assumptions we made in our analysis, mainly out of convenience, was that the additive effects on disease risk are normally distributed. While this assumption simplified this analysis, we suspect that a normal distribution does not adequately capture the smaller effect sizes in pathways, leading to a loss of power to detect enriched pathways.

At conferences, and around the lab, I’ve heard many complaints about pathway analysis (or gene set enrichment analysis) for genome-wide association studies. One complaint is that the results are difficult to interpret. Another common complaint is that the findings are sensitive to arbitrary significance thresholds. While we didn’t devote much space in the paper to a discussion of these issues, we believe that our approach offers a coherent solution to many of these problems.

Ultimately, we would like other researchers to use our methods to analyze data from their own genome-wide association studies. We tried to make our paper as accessible as possible, especially to biologists who are not well acquainted with Bayesian approaches, by carefully explaining how to interpret the Bayes factors and posterior statistics used in the analysis. We are working on releasing the full source code (in R and MATLAB) for all our methods, along with accompanying documentation.

Peter Carbonetto

Our paper: The genetic prehistory of southern Africa

[This author post is by Joe Pickrell (@joe_pickrell), Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf on The genetic prehistory of southern Africa, available from arXiv here]

The indigenous populations of southern Africa are phenotypically, linguistically, culturally, and genetically diverse. Although many groups speak Bantu languages (having arrived in the region during an expansion of Iron-Age agriculturalists), there are a number of populations who speak diverse non-Bantu languages with heavy use of click consonants. We refer to these populations as “Khoisan”. Most of the Khoisan populations are hunter-gatherers, but some are pastoralists; the extensive linguistic and cultural diversity of the Khoisan (who live in a relatively small region around the Kalahari semi-desert) is historically puzzling.

Two hunter-gatherer (or formerly hunter-gatherer) populations in East Africa, the Hadza and Sandawe, also speak languages that make use of click consonants. Linguists see little in common between the languages in southern Africa and Hadza, although Sandawe might be genealogically related to some of the Khoisan languages. Nevertheless, the shared use of click consonants and a foraging lifestyle led many to hypothesize that the southern African Khoisan populations are genetically related to the Hadza and Sandawe, which would imply that their ancestors were once considerably more widespread. This hypothesis has been controversial for decades.

Tree relating the Khoisan-like proportion of ancestry (shown in blue in the barplot) in Khoisan, Hadza, and Sandawe after accounting for non-Khoisan admixture.

In our study, we use genetic data to address the history of the diverse groups within southern Africa and their relationship to the Hadza and Sandawe. Specifically, we genotyped individuals from 16 Khoisan populations, 5 neighboring populations that speak Bantu languages, and the Hadza (the latter thanks to Brenna Henn, Joanna Mountain, and Carlos Bustamante) on a SNP array designed for studies of human history, in that the SNP ascertainment scheme is known and includes SNPs ascertained in the Khoisan. We then merged in Hadza and Sandawe samples from a recent paper by Joseph Lachance, Sarah Tishkoff and colleagues. The main conclusions are as follows:

  1. Within the southern African Khoisan, there are two genetic groups, which correspond roughly to populations in the northwest and southeast Kalahari semi-desert. Populations from these two groups have been labeled in the tree in this post (see also Figure 1B in the preprint). We estimate that these two groups diverged within the last 30,000 years. However, this date should be taken as an upper bound due to point #2 below.
  2. All southern African Khoisan groups are admixed with non-Khoisan populations. Even the most isolated Khoisan groups (i.e. the “San” from the HGDP, who are included in the “Ju|’hoan_North” group in our paper) show some evidence of admixture with agriculturalist and/or pastoralist groups. A subtle technical point is that this had not been previously noticed because methods that rely on correlations in allele frequencies are sometimes unable to detect admixture if all populations are admixed (this is related to Mr. Razib Khan’s post on why ADMIXTURE is not a test for admixture). To get around this, we developed new methods based on the decay of linkage disequilibrium.
  3. The Hadza and Sandawe trace part of their ancestry to admixture with a population related to the Khoisan. After accounting for admixture, we built a tree of “Khoisan-like” ancestry in the southern and eastern African populations (see the Figure above). The striking thing is that the Hadza and Sandawe fall with high confidence on the same branch as the Khoisan. This suggests that, prior to subsequent migrations of food-producing peoples over most of sub-Saharan Africa, populations related to the Khoisan were indeed spread continuously over a huge geographic range including Tanzania and southern Africa.

We’re excited about these results for a number of reasons. First of all, we’re now on our way towards understanding the history of the diverse Khoisan populations. For years these populations have been treated as genetically equivalent, but it’s clear that each population has its own complex history. Secondly, with the new statistical methods we’ve developed, we were able not only to show the varying amounts of admixture that have occurred at different times in southern African populations, but also to peel away these layers of admixture to learn about the relationships among Khoisan populations that existed thousands of years ago. Finally, we think that these results have important implications for work using genetics to understand the geographic origin of modern humans within Africa. Though both southern and eastern Africa have been proposed as potential origins, from the tree in this post, we see no genetic evidence in favor of either; from our point of view this question remains open.

Joe Pickrell, Nick Patterson, Mark Stoneking, David Reich, and Brigitte Pakendorf

Our paper: The date of interbreeding between Neandertals and modern humans

This post is by Sriram Sankararaman, Nick Patterson, Heng Li, Svante Pääbo, and David Reich on their paper The date of interbreeding between Neandertals and modern humans arXived here

The relationship between modern humans and archaic hominins such as Neandertals has been the subject of intense debate. The sequencing of a Neandertal genome a couple of years back (Green et al, Science 2010) showed that Neandertals are more closely related to non-African genomes than to African genomes. One model consistent with this observation involves gene flow from Neandertals to modern non-Africans after the divergence of African and non-African populations. Another model that can explain these observations is one in which the population ancestral to modern humans and Neandertals is structured. For example, imagine that this ancestral population consists of three groups, A, B and C, which represent the ancestors of modern Africans, non-Africans and Neandertals respectively. The extra proximity of Neandertals to non-Africans over Africans could occur if A and B, and B and C, exchanged genes with each other, followed by C diverging to form Neandertals, with A and B not completely hybridizing before their divergence to form Africans and non-Africans.

The Neandertal (Green et al, Science 2010) and the Denisova genome (Reich et al, Nature 2010) papers considered the possibility of both models — either scenario was shown to produce the skew in the observed D-statistics (a measure of the excess sharing of alleles across groups) that led to Neandertals appearing closer to non-Africans than Africans. Indeed, a recent paper by Eriksson and Manica (Eriksson and Manica, PNAS 2012) used an Approximate Bayesian Computation framework with D-statistics as the summary statistics and arrived at similar conclusions.

A paper from Monty Slatkin’s group (Yang et al, MBE 2012) attempted to differentiate the two scenarios by using the site frequency spectrum. Yang et al considered the site frequency spectrum in Europeans conditioned on observing a derived allele in Neandertal and an ancestral allele in Africans (termed the doubly-conditioned frequency spectrum, dcfs). They used theory and simulations to show that an ancient structure model produces a linear dcfs. On the other hand, they showed that recent gene flow can produce an excess of rare variants, which matches the observed dcfs. Interestingly, they also observed that bottlenecks after gene flow had the effect of making the dcfs linear, suggesting that gene flow from Neandertals could not have preceded strong bottlenecks in the non-African populations.

A different idea that we explored was to ask if patterns of linkage disequilibrium (LD) might discriminate the two scenarios. If we could pick out haplotypes that came into modern humans from Neandertals, recombination is expected to break these haplotypes down at a fixed rate every generation (assuming neutrality). Haplotypes that came in 1,000 generations ago (under recent gene flow) should be expected to be 10 times longer on average than haplotypes that came in 10,000 generations ago (under ancient structure). And if we could measure LD precisely enough, we could even date these ancient events. To date such ancient events, we had to address two technical challenges: i) measures of LD can be sensitive to demographic events, and ii) for events that occurred thousands of generations ago, we need to measure LD at size scales at which genetic maps can be quite noisy, and this noise can bias estimates of dates.
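The factor of 10 above follows from a simple back-of-the-envelope model, sketched here in Python. The sketch assumes that recombination breakpoints fall at a rate of one per Morgan per generation, so that the mean length of an introgressed haplotype after `generations` generations is roughly `1 / generations` Morgans. This is an idealization for illustration, not the model used in the paper, and the function name is hypothetical.

```python
def expected_tract_length_cm(generations):
    """Mean introgressed tract length in centimorgans.

    Assumes one recombination breakpoint per Morgan per generation,
    giving a mean length of 1 / generations Morgans (100 / generations cM).
    """
    return 100.0 / generations


recent = expected_tract_length_cm(1000)    # recent gene flow scenario
ancient = expected_tract_length_cm(10000)  # ancient structure scenario
ratio = recent / ancient
```

Under these assumptions the tracts are on the order of 0.1 cM versus 0.01 cM, which is why LD has to be measured at very fine genetic distances.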

Theory indicates that the expected LD (measured by Lewontin’s D) across SNPs that arose on the Neandertal lineage and introgressed decays exponentially with genetic distance, at a rate given by the time of gene flow, and is robust to demographic events. This result does not hold exactly in practice because these SNPs are imperfectly ascertained, so we ran simulations to show that the decay of LD nonetheless provides accurate estimates and can differentiate gene flow from ancient structure. We also came up with a model to assess errors in genetic maps, which we then used to obtain a corrected date.
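As a toy illustration of dating from LD decay, the Python sketch below fits an exponential curve to idealized, noise-free LD values and recovers the number of generations since gene flow. It assumes that LD at genetic distance `d` (in Morgans) is `amplitude * exp(-generations * d)`, and it uses an ordinary least-squares fit on the log scale. This is not the estimator from the paper: it ignores ascertainment, sampling noise and genetic map errors, and all names and numbers are hypothetical.

```python
import math


def ld_curve(distances_morgans, generations, amplitude=1.0):
    """Idealized LD at each genetic distance: amplitude * exp(-generations * d)."""
    return [amplitude * math.exp(-generations * d) for d in distances_morgans]


def estimate_generations(distances_morgans, ld_values):
    """Least-squares fit of log(LD) = log(amplitude) - generations * d.

    Returns the estimated number of generations since gene flow.
    All LD values must be positive so that the logarithm is defined.
    """
    logs = [math.log(v) for v in ld_values]
    n = len(distances_morgans)
    mean_d = sum(distances_morgans) / n
    mean_log = sum(logs) / n
    sxy = sum((d - mean_d) * (l - mean_log)
              for d, l in zip(distances_morgans, logs))
    sxx = sum((d - mean_d) ** 2 for d in distances_morgans)
    return -sxy / sxx


# Distances from 0.02 cM to 0.2 cM, expressed in Morgans.
distances = [0.0002 * k for k in range(1, 11)]
ld = ld_curve(distances, generations=2000, amplitude=0.05)
estimated = estimate_generations(distances, ld)
```

With real data the LD values are noisy and can be negative at larger distances, so a log-scale fit like this one would need to be replaced by a nonlinear fit.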

Our results support the recent gene flow scenario, with a likely date of gene flow into the ancestors of modern Europeans of 37,000-86,000 years BP, although this does not exclude the possibility of ancient structure. A broader methodological question we are exploring is whether LD-based analyses might be generally applicable as a tool for dating other ancient gene flow events.

Sriram Sankararaman, Nick Patterson, Heng Li, Svante Pääbo, and David Reich

Our paper: A faster-X effect for gene expression in Drosophila embryos

[This author post is by Alex Kalinka and Pavel Tomancak on their paper, An excess of gene expression divergence on the X chromosome in Drosophila embryos: implications for the faster-X hypothesis, posted to the arXiv here.]

We have been working towards publishing our study of gene expression evolution on the X chromosome in Drosophila embryos since the beginning of March this year. Recently, Casey Bergman suggested that we upload our manuscript to the arXiv, and after we did so, we were kindly invited by Graham Coop to write a guest post about our work for Haldane's Sieve.

It makes sense to post here since the roots of our study go back to Haldane in 1924 [1]; he recognised that the unusual inheritance pattern of the X chromosome, in which a single copy is present in the heterogametic sex versus two copies in the homogametic sex, could in turn lead to unusual evolutionary patterns on the X relative to the autosomes. If, for example, a beneficial mutation is recessive, then it would be more exposed to natural selection in the heterogametic sex where, relative to an equivalent autosomal allele, it would spend less time being masked by the dominant, less beneficial allele [1]. The prediction that adaptive evolution might proceed more quickly on the X than on the autosomes has been dubbed the faster-X hypothesis. However, the X chromosome might also be expected to evolve more rapidly for non-adaptive reasons. In each mating pair there will be 3 copies of the X chromosome versus 4 copies of each autosome, which might in turn lead to a lower chromosomal effective population size for the X thereby increasing the strength of random genetic drift.
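The "3 copies versus 4 copies" argument can be made concrete with the standard textbook expressions for effective population size with separate sexes, sketched below in Python. The formulas assume an idealized population with `n_males` breeding males and `n_females` breeding females and no other sources of variance in reproductive success; the function names and population sizes are hypothetical.

```python
def autosomal_ne(n_males, n_females):
    """Effective population size for autosomes with separate sexes."""
    return 4.0 * n_males * n_females / (n_males + n_females)


def x_linked_ne(n_males, n_females):
    """Effective population size for X-linked loci (males carry one X)."""
    return 9.0 * n_males * n_females / (4.0 * n_males + 2.0 * n_females)


# With an equal sex ratio the X has three quarters of the autosomal Ne.
ratio = x_linked_ne(500, 500) / autosomal_ne(500, 500)
```

With unequal numbers of breeding males and females the ratio moves away from three quarters, so the strength of drift on the X relative to the autosomes depends on the mating system.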

While some studies have reported evidence for a faster-X effect for adaptive protein evolution in Drosophila, other studies have reported that there is no difference between the X and the autosomes, and to date the evidence is somewhat inconclusive. As we focused our study on gene expression, we had an opportunity to relax the implicit assumption that the majority of adaptive evolution occurs in coding regions. To help disentangle adaptive and non-adaptive evolutionary signatures in our data, we used both between-species measures of gene expression divergence and within-species measures of gene expression variation using inbred strains of D. melanogaster generated by the Drosophila Genetic Reference Panel (DGRP).

We found an excess of gene expression divergence on the X chromosome between six Drosophila species (a mean increase of ~20%). In contrast, we found that the X exhibits a significantly lower level of gene expression variation between inbred strains of D. melanogaster (a mean decrease of ~10%). Taken together, these results suggest that the divergence that we find between species is not driven by a relaxation of selective constraint on the X chromosome. To further explore whether such a signature could be driven by the hemizygosity of the X, we analysed gene expression in mutation accumulation lines of D. melanogaster. If the single copy of the X in males is driving the excess of expression divergence that we found on the X, then we would expect to find an excess of expression variation between lines that have independently accumulated mutations. In fact, we found the opposite was the case – the X chromosome displayed a significantly lower rate of mutation accumulation than the autosomes suggesting that the hemizygosity of the X alone is not sufficient to drive a higher rate of fixation of gene expression mutations.

Overall, we argue that the excess divergence we find on the X is best understood within the framework of the faster-X hypothesis. In support of this interpretation, we find that there is an excess of gene expression divergence on Muller's D element along the branch leading to the obscura sub-group; Muller's D element segregates as a neo-X chromosome in the obscura sub-group, and hence provides a powerful, independent test of faster evolution of the X chromosome.

Several questions remain, however, and we hope that our findings will help to stimulate further research into the details underpinning the differences we find on the X. In particular, work needs to be done to discover the genetics underlying divergence on the X, such as the relative importance of cis versus trans-acting factors, and, crucially, we need to develop a better understanding of how variation in gene expression impacts organismal fitness. Research into the latter question is essential if we are to bridge the conspicuous gaps between sequence variation, gene expression variation, and organismal fitness.

Dave Gerrard initially found elevated gene expression divergence on the X in the course of analysing data for our developmental hourglass paper, and spoke about his findings at the 43rd population genetics conference; that was more than two years ago. Since then we collected new data, and it took a while to put the paper together, although it is still not certain that it will be published in a traditional journal. The arXiv is a great way to let the scientific community know about your results before the academic process runs its course. We only regret we didn't make use of this excellent outlet back in March.

[1] Haldane JBS (1924) A mathematical theory of natural and artificial selection. Part I. Trans Camb Phil Soc 23: 19-41.