Our paper: Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture

[This author post is by John Pool on his paper: Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture arXived here.]

We are in the process of publishing this analysis of >100 sequenced Drosophila melanogaster genomes (largely haploid genomes at >25X depth).  These genomes come from more than 20 geographic locations, largely within sub-Saharan Africa, where the species is thought to originate.  Truth be told, this sampling scheme was somewhat accidental – we wanted to identify a population representing a “center of genetic diversity” for the species, which for us involved sequencing small numbers of genomes from many different population samples (some from previous lab stocks, others from newly collected lines).  Ultimately we did find the sample we were looking for, and we are in the process of sequencing ~300 genomes from this Zambian population.  Still, it seemed more than worthwhile to analyze the “geographic scatter” of genomes we had obtained from across sub-Saharan Africa (as well as one small sample from Europe).

Our ambitions for this paper were largely descriptive – a preliminary analysis of genetic variation within and among the sampled populations.  We envisioned being able to compare diversity levels and genetic structure across Africa (much as I once did with a dramatically smaller data set), and to identify specific loci with signatures of selection.  And we were able to do that.  We found the highest levels of genetic diversity in and around Zambia, raising the prospect of a southern-central African origin for D. melanogaster.  We found low-to-moderate levels of genetic structure across most of sub-Saharan Africa, with only Ethiopian populations showing stronger genetic differentiation (along with some morphological differentiation, but that’s another story).  Analyses of allele frequencies within and between populations revealed a substantial number of loci with evidence of recent natural selection – many GO categories enriched for such outliers pertained to gene regulation, much as we had observed in another recent population genomic analysis.

Of course that’s how we normally think of natural selection’s influence on genetic variation – specific beneficial mutations leading to selective sweeps (whether hard or soft, partial or complete), each one influencing diversity on a limited genomic scale.  And at least in species with large outbreeding populations like Drosophila, recurrent hitchhiking may be common enough to affect diversity at random sites in the genome (e.g. 1, 2, 3).  So we weren’t surprised to find sweep signals.  The bigger surprise to us was finding evidence that specific episodes of natural selection had affected genetic variation on the scale of whole chromosome arms or the entire genome.

The first major surprise concerned genomic patterns of non-African admixture in African D. melanogaster populations.  The occurrence of such introgression had been documented before, and there were previous findings that non-African genotypes were associated with urban environments in Africa, and that admixture levels could vary within the genome. We developed a hidden Markov model approach to detect admixed chromosomal regions (based simply on the reduced diversity found in populations outside sub-Saharan Africa).  Whereas we tend to think of admixture as a selectively neutral force, the genomic patterns of admixture we observed did not seem consistent with passive gene flow.  Non-African genotypes had displaced large portions of the gene pool of presumably quite large African populations, and this had occurred within a very short time (judging by the megabase scale of admixture tracts).  Levels of admixture across the genome showed both broad-scale heterogeneity (chromosomal differences) and relatively narrow “spikes” of admixture.  These peaks of admixture quite often overlapped with outliers for high FST between Africa and Europe, as would be expected if these regions contained functional differences between populations for which introgressing non-African alleles may now be favored in some African environments (e.g. modernizing cities).  
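The idea of flagging admixed regions from reduced diversity can be illustrated with a toy two-state hidden Markov model. This is only a minimal sketch and not the method used in the paper: the Gaussian emissions, the per-window diversity means, the standard deviation and the switching probability are all assumed values chosen for illustration.

```python
import math

def viterbi_admixture(diversity, mean_african=0.008, mean_admixed=0.004,
                      sd=0.001, switch_prob=0.01):
    """Label each window 0 (African) or 1 (admixed) along the most likely path.

    All parameter defaults are hypothetical; they are not taken from the paper.
    """
    means = (mean_african, mean_admixed)
    stay, switch = math.log(1 - switch_prob), math.log(switch_prob)

    def log_emit(x, state):
        # Gaussian log-density, dropping the constant shared by both states
        return -0.5 * ((x - means[state]) / sd) ** 2

    # score[s] = best log-probability of any path ending in state s
    score = [math.log(0.5) + log_emit(diversity[0], s) for s in (0, 1)]
    backptr = []
    for x in diversity[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[p] + (stay if p == s else switch) for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + log_emit(x, s))
        score = new_score
        backptr.append(ptr)

    # Trace back from the best final state
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

With these assumed settings, a run of several low-diversity windows is called as an admixture tract, whereas a single low window is smoothed over because two state switches cost more than one poorly fitting emission.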

The second surprise came as we documented population genetic patterns associated with polymorphic inversions (as further analyzed in a forthcoming paper by Russ Corbett-Detig and Dan Hartl).  It was already known that inversions tend to differ in frequency between D. melanogaster populations, but theory and most empirical data suggested that only diversity around the inversion breakpoints should be affected.  Instead, we observed some African populations in which elevated inversion frequencies were associated with notable reductions in diversity for entire chromosome arms (and ultimately affecting genome-wide average diversity), consistent with directional selection on rearrangements or linked loci.  Perhaps more surprisingly, most inversions found in the non-African sample (France) served to substantially increase diversity across whole chromosome arms (by up to 29% in the case of inversions on arm 3R), and by 12% genome-wide.  Here, we can only suggest that selection may have acted to favor inverted chromosomes that recently originated from a more genetically diverse (e.g. African or African-admixed) population.  Accounting for these inversions substantially alters chromosomal diversity ratios between African and European populations.

Hence, we may have the curious situation of natural selection driving introgression in both directions across the sub-Saharan/cosmopolitan population genetic divide in D. melanogaster.

You can find our draft manuscript here, supplemental items here, and the data here.

I’m definitely glad we were able to post a draft at arXiv – it was time to communicate our findings to the research community (especially to facilitate our colleagues’ analysis and publication plans for this data set), and there’s really no downside to us as authors.  I also appreciate the chance to post here at Haldane’s Sieve, and it would be great to discuss any aspect of our draft.

John Pool

Our paper: Genealogies of rapidly adapting populations

[This author post is by Richard Neher on his paper with Oskar Hallatschek: Genealogies of rapidly adapting populations arXived here.]

That selection distorts genealogies is a well-known fact, but properties of genealogies shaped by selection are poorly understood. We set out to investigate genealogies in a simple model of rapid adaptation in asexuals: The fitness of individuals is changed by small amounts through frequent mutations, while the overall population size is kept constant by a carrying capacity. We simulated the model and tracked genealogies.
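A model of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the simulation used in the paper: it assumes discrete generations with a fixed population size, Gaussian mutational effects of standard deviation `mut_sd` in every offspring, and reproduction proportional to the exponential of fitness. The function names and default values are hypothetical.

```python
import math
import random

def simulate(pop_size=200, generations=100, mut_sd=0.02, seed=1):
    """Return parents, where parents[t][i] is the parent index of offspring i
    produced in generation t (assumed Wright-Fisher style resampling)."""
    rng = random.Random(seed)
    fitness = [0.0] * pop_size
    parents = []
    for _ in range(generations):
        # Offspring pick a parent with probability proportional to exp(fitness)
        weights = [math.exp(f) for f in fitness]
        chosen = rng.choices(range(pop_size), weights=weights, k=pop_size)
        parents.append(chosen)
        # Each offspring inherits its parent's fitness plus a small mutation
        fitness = [fitness[p] + rng.gauss(0.0, mut_sd) for p in chosen]
        # Only fitness differences matter, so recentre to keep numbers bounded
        mean = sum(fitness) / pop_size
        fitness = [f - mean for f in fitness]
    return parents

def lineage_counts(parents, sample):
    """Trace a sample of present-day individuals backward in time and return
    the number of distinct ancestral lineages in each earlier generation."""
    lineages = set(sample)
    counts = [len(lineages)]
    for chosen in reversed(parents):
        lineages = {chosen[i] for i in lineages}
        counts.append(len(lineages))
    return counts
```

Calling `lineage_counts(simulate(), range(20))` gives the number of surviving ancestral lineages generation by generation; a drop of more than one between consecutive generations means that more than two sampled lineages coalesced within a single generation.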

The genealogies we found have two striking features incompatible with the standard neutral coalescent: (i) Many lineages merge almost simultaneously. (ii) Forward in time, the trees often branch very asymmetrically, i.e., almost the entire population descends from one branch while the other branches share the remaining minority. Using branching process approximations and a mapping to range expansion problems (see Brunet et al. (2007)), we show that the genealogies are similar to those expected from the Bolthausen-Sznitman coalescent (BSC), a special case of multiple merger coalescents. Very similar conclusions have been reached in another recent preprint by Desai, Walczak and Fisher. The BSC is well studied and we can build on many results from the mathematical literature.
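To make the contrast with Kingman's coalescent concrete, the sketch below uses the standard definition of the BSC from the mathematical literature, in which any particular set of k out of b lineages merges at rate (k-2)!(b-k)!/(b-1)! (time units are arbitrary). The function names are illustrative and the code is not taken from the paper.

```python
from fractions import Fraction
from math import comb, factorial

def bsc_rate(b, k):
    """Rate at which one particular set of k out of b lineages merges."""
    return Fraction(factorial(k - 2) * factorial(b - k), factorial(b - 1))

def merger_size_distribution(b):
    """Probability that the next merger among b lineages involves k of them."""
    totals = {k: comb(b, k) * bsc_rate(b, k) for k in range(2, b + 1)}
    norm = sum(totals.values())
    return {k: rate / norm for k, rate in totals.items()}
```

Summing over all sets of size k gives a total rate of b/(k(k-1)) for k-fold mergers, so with 10 lineages only 5/9 of merger events are pairwise. Under Kingman's coalescent every merger is pairwise.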

The difference between Kingman and multiple merger coalescence is closely related to the distinct stochastic properties of genetic drift and draft. While drift describes short term fluctuations in offspring number which are bounded, draft refers to stochasticity through linked selection. Draft can result in fluctuations of the same order as the population size. Even if very rare, such large fluctuations are important. Lumping drift and draft together and labeling the result as effective population size is rarely helpful and often confusing.

Why should we care? We often want to learn about past dynamics from snapshots of populations (sequence samples). To this end, we compare the diversity patterns in the sample to model predictions and infer model parameters. If we use an inappropriate model, we get meaningless answers. Furthermore, some events that are very unlikely under Kingman’s coalescent are quite common when multiple mergers are allowed. Consider for example a lone haplotype in a large sample that connects to the root of the tree. This is very unlikely in neutral coalescent models and one might take it as evidence for immigration from a diverged population. If multiple mergers dominate coalescence, this does not come as a surprise. Similarly, an excess of singletons is not necessarily evidence for expanding populations or deleterious mutations but might be due to draft. I wonder whether more potential pitfalls of this sort exist.

Richard Neher

Our paper: Inference of population splits and mixtures from genome-wide allele frequency data

[This author post is by Joe Pickrell (@joe_pickrell) on Inference of population splits and mixtures from genome-wide allele frequency data, available from arXiv here]

Early last year, I began working (with Jonathan Pritchard) on methods for using genetics to understand population history. As we describe in our preprint, our approach was to build a parameterized model to describe the patterns of correlation in allele frequencies across populations. This type of approach dates back to brilliant work on building population trees by Luca Cavalli-Sforza, AWF Edwards, and Joe Felsenstein from around 40 years ago. The key to our work is that instead of representing history as a bifurcating tree, we additionally allow “migration events” to model admixture between populations. The output from our model (called TreeMix, and available here) is something like that shown below.
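The effect of a migration event on the correlations between populations can be sketched as follows. This is an illustration of the general idea rather than the TreeMix implementation: it assumes an admixed population whose allele frequencies are a weighted mixture of two source populations, ignores any drift after the admixture event, and uses made-up covariance values.

```python
def add_admixed_population(cov, idx_a, idx_b, weight_a):
    """Return a new covariance matrix with one extra, admixed population.

    cov      -- symmetric covariance matrix of allele frequencies (list of lists)
    idx_a/b  -- indices of the two source populations
    weight_a -- fraction of ancestry from source a (the rest comes from b)
    """
    n = len(cov)
    w, v = weight_a, 1.0 - weight_a
    # Covariance with each existing population is a weighted average
    mixed_row = [w * cov[idx_a][j] + v * cov[idx_b][j] for j in range(n)]
    # Variance of the mixture: w^2 Var(a) + 2wv Cov(a, b) + v^2 Var(b)
    variance = w * mixed_row[idx_a] + v * mixed_row[idx_b]
    new = [row[:] + [mixed_row[i]] for i, row in enumerate(cov)]
    new.append(mixed_row + [variance])
    return new

# Hypothetical values: population 0 is the main source, 1 a diverged source,
# 2 an unrelated outgroup; 85% of the admixed ancestry comes from population 0.
tree_cov = [[0.03, 0.01, 0.00],
            [0.01, 0.05, 0.00],
            [0.00, 0.00, 0.04]]
with_admixture = add_admixed_population(tree_cov, 0, 1, 0.85)
```

A strictly bifurcating tree cannot reproduce the last row of `with_admixture`, because the admixed population shares covariance with two separate branches at once. Adding an edge with a mixture weight is one way to account for that pattern.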

A graph of human population history, allowing 10 migration events. Populations are colored according to geographic region.

We applied this method to both human and dog history, with a mix of both known and novel historical results. I thought here I’d speculate about a couple of the novel results:

1. In the human data (see the graph above), one of the more surprising things to me was the arrow to the Cambodian population. The Cambodians appear to be an admixed population, with ~85% of their ancestry related to other southeast Asian populations (like the Dai) and ~15% of their ancestry from…it’s not totally clear. As you can see in the graph, the source of this admixture appears to be a population not particularly closely related to any other population in these data. So who was this population? A speculation is that this represents ancestry from a population related to the “Ancestral South Indian” population described by Reich et al. (2009), though other sources (e.g. Oceania) are plausible.

2. In the dog data (see Figures 5 and 6 in the pre-print), the most overwhelming signal in the data is that the Basenji, a central African dog breed, appears to trace ~25% of its ancestry to admixture with wolves since domestication. This signal is made somewhat surprising by the fact that there are no wolf populations currently living in Africa, which would seem to be a formidable barrier to admixture with an African dog breed. A hint for what’s going on here is provided by vonHoldt et al. (2010), who show that the Basenji has an unusual amount of shared variation with wolves from the Middle East. One speculation, then, is that as the ancestors of the Basenji moved into Africa, they came into contact with Middle Eastern wolves and admixed with them.

Other suggestions for scenarios to explain these results are of course welcome. Overall, I’m hopeful that approaches like TreeMix will eventually supplant “standard” tree-building algorithms for situations in which gene flow is known to occur, though of course further development is necessary before this becomes reality.

Joe Pickrell

Our paper: Blood ties: ABO is a trans-species polymorphism in primates

[This author post is by Laure Ségurel [a postdoc in the Przeworski Lab] on the paper Blood ties: ABO is a trans-species polymorphism in primates, posted on the arXiv here]

The mysteries of the ABO blood group were first brought to our attention by Carole Ober. When we started working on it, we were mostly surprised by how little was known about the function of such a heavily studied gene and such an important clinical phenotype. Indeed, the expression of A, B and/or O antigens at the surface of some cells is a polymorphic phenotype shared by species as diverse as macaques and baboons in Africa, gibbons in Asia, squirrel monkeys in the Americas and, of course, humans throughout most of the world. Yet many questions remain unanswered, such as: What is the biological role of ABO in different cell types? Why did Hominoids evolve toward its expression on blood cells whereas other primates express it only on epithelial/endothelial cells? Why is the O allele at such high frequency only in humans? What are the selective agents responsible for the maintenance of this polymorphism? And why did chimpanzees and bonobos apparently lose the polymorphism?

One question that we became interested in answering with population genetic tools was that of the origin of such blood types. When did the genetic polymorphism first emerge and which species share it identical by descent (as opposed to by convergent evolution)? Answers to these questions could tell us where and when having multiple alleles at this locus became advantageous. We therefore sequenced as many Hominoids, Old World monkeys and New World monkeys as we could get our hands on, and, even more interestingly, we started thinking about the expectations under a model of convergent evolution, i.e., one where the AB genetic polymorphism was created independently multiple times in different species (and then maintained by balancing selection in these lineages) versus under a model of trans-species polymorphism, i.e., in which the AB genetic polymorphism arose early in time and was transmitted identical by descent to distinct species. Key to distinguishing the two predictions is the age of different selected alleles within a polymorphic population.

We therefore compared alleles within humans, orangutans, gibbons, macaques, baboons and colobus monkeys (all polymorphic species for the A and B alleles), and showed that, at least among Hominoids and among Old World monkeys, the observed genetic pattern is not compatible with a model of convergent evolution but on the contrary matches the expectations under a model of a trans-species polymorphism maintained by multi-allelic balancing selection. In other words, the data indicate that the AB polymorphism was present at least around 20 million years ago, if not earlier. Also, interestingly, it seems that the A, B and O functional classes do not provide a complete description of the allelic classes natural selection is acting on, which underscores the need for more detailed functional studies of ABO sub-groups.

By submitting the paper to arXiv, we hope to circulate it to a diverse audience and without delay. In particular, we hope that the study will motivate more experimental/functional work about the role of this polymorphism in immune response, e.g., to pathogen infections.

Laure Ségurel

Our paper: Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals

Our next “our paper” guest post is by Vincent Lynch [@VinJLynch] who’s just joined the UChicago faculty from a postdoc at Yale. He’s posting about his recently arXived paper:

Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals. ArXived here.
_________________________________________________________________________________
Explaining how morphology evolves is a major challenge in biology. While it’s clear that changes in gene regulation are ultimately responsible for the development and evolution of complex characters, we are only just beginning to understand the molecular mechanisms of gene regulatory evolution. This is largely due to the emergence of new technologies, such as mRNA-Seq and ChIP-Seq, which give biologists the tools to explore evolution across the genome and in non-model species.

We took advantage of these methods to explore the evolution of gene expression in the uterus during the origin of pregnancy in mammals. Using mRNA-Seq, we show that gene expression evolved extremely rapidly during major stages in the evolution of pregnancy, for example during the origin of maternal resource provisioning in the stem-lineage of Mammalia, placentation in the stem-lineage of Theria, and implantation in the stem-lineage of Eutheria. Using ChIP-Seq to identify the cis-regulatory elements of genes recruited into uterine expression in mammals, we find evidence suggesting that the majority of these enhancers and promoters are derived from mammalian lineage-specific transposons.

While recent technological advances are changing the way we do biology (see Wagner 2013), as these emerging methods come into the mainstream we must collectively define our new standards of evidence. What experiments and methods build a convincing case for X? Is it sufficient, for example, to conclude that a transposon donated a novel promoter to a gene if a ChIP-Seq peak for a histone mark associated with promoters lies within the transposon? If we then expand that observation across the genome, can we reasonably conclude that transposons are causally responsible for gene regulatory change? For these reasons we chose to post our manuscript as a work-in-progress to arXiv, both as our contribution to the larger discussion of what constitutes the standards of evidence in this emerging field of biology and as an opportunity to receive feedback from our colleagues to complement formal peer-review.

Vincent Lynch

Our paper: Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster

Casey Bergman [@caseybergman and @bergmanlab] kindly wrote a post about his recently arXived paper:
Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster
ArXived here.
__________________________________________________________________
As part of the Drosophila 12 Genome Project, Steve Salzberg and colleagues published a pioneering paper in 2005 showing that complete genomes of the bacterial endosymbiont Wolbachia pipientis can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. This paper always left an impression on me as a very clever way of extracting new biology from existing genomic data, and when the era of resequencing multiple strains of D. melanogaster kicked off a few years ago, it seemed like a natural extension to ask if this approach could be adapted to next-generation sequencing data to study the co-evolution of Wolbachia and Drosophila using whole genome data.

In the current work, we used short-read next generation sequencing data from two major resequencing efforts in D. melanogaster — the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP) — together with the reference Wolbachia genome published by Wu et al. (2005) and extracted over 175 complete Wolbachia genomes and nearly 300 complete mitochondrial genomes. Readers can find the main results in the paper, which is currently in review. I’d like to discuss here the social context of the project and some of the reasons we submitted to arXiv.

This project started out in 2010 as a summer project for a masters student, Mark Richardson, who did an amazing job developing the initial pipeline and made most of the initial discoveries in the paper. Mark and I started a collaboration with Frank Jiggins and Mike McGwire shortly after to verify that our in silico genotyping results made sense, and they suggested bringing in Lucy Weinart and John Welch to help with the more sophisticated Bayesian phylogenetic analysis. Another PhD student in my lab, Raquel Linheiro, adapted her transposable element detection pipeline to identify particular Wolbachia sublineages, which was crucial to linking our data with previous results. This was a great collaboration, where everyone made significant contributions, and I would collaborate with everyone again (and I hope to!).

At the time (summer 2010), we only had access to the North American strains from the DGRP sample; knowing that North American D. melanogaster are derived populations, we were cautious about the impact that population structure had on our results. We planned in early 2011 to publish on only the DGRP dataset since Mark was going off to do a PhD in Australia and I didn’t have anyone else in the group working on this project. In the summer of 2011, the African DPGP data came online and I decided to take a peek and run the pipeline on the African strains as well. This led to a major overhaul of the project and set us back a year, since all the data had to be reanalyzed together and the interpretation of the biogeography results was substantially altered. This was in some ways lucky because our initial interpretation of evidence for a selective sweep on one of the cytoplasmic lineages was probably wrong, and it saved us from having to backpedal on this misinterpretation in a later publication.

As we plugged away at trying to finish this project, we had inquiries about the status of the project from several other groups working in the Wolbachia field. Honestly this stressed me out quite a bit, since some of the inquiries were coming from post-docs in big labs. But instead of just sitting on the data, after we finalized the dataset we decided to release these data openly on our lab blog in April 2012. We decided on an open release as a way to help these teams (and others we didn’t know about), but also to get some priority in this area by providing the “gold standard” that other groups could use (and cite!). For the record, I will note that we asked two teams who contacted us about our project if they would reciprocate by sharing unpublished genomic data or in one case published genomic data that was not submitted to GenBank; both declined.

After making the decision to release the data pre-publication, it was a natural step to submit the manuscript to arXiv. I’m an open science advocate and used the Nature Preprint server occasionally in the past. I never really liked the Nature Preprint server, though, since I thought people posted there to give their manuscript the stink of being “Nature (in prep)” on their CV. And I never posted to arXiv in the past, since I always thought it was for more hardcore computational or mathematical biology. But recently, I was convinced by Rosie Redfield, Leonid Kruglyak and colleagues putting their Arsenic Life paper on arXiv that more empirical work in quantitative biology was arXiv-able. And just as with releasing our data early, it seemed like the best way to prevent being scooped was to get our results out as quickly as possible and let people know about them.

So we went for it. And I have to say the experience has been thoroughly rewarding. Submitting was a piece of cake, easier than any journal I’ve ever submitted to. Having a URL to point to allowed me to tweet about it, which got the paper some exposure and got me some new colleagues on Twitter. It also allowed me to send a submitted manuscript around to colleagues for informal review, without cluttering up their inboxes with big attachments or creating a moral dilemma about who they can share the manuscript with. And somehow submitting to arXiv pushed the “it’s submitted” button in my brain, which made me a whole lot less stressed about the possibility of being scooped, and I’ve been more relaxed throughout the formal submission process. Finally, I know that the pre-publication release of the data and posting of the manuscript has led to a group in Russia incorporating these sequences into their work, and I’ve just gotten a manuscript to review from this group citing our arXiv manuscript and extending our results before our paper is even published! This is what research is all about, right: doing science, getting it out, and letting others build on it. I’ll definitely submit to arXiv for all papers from my lab, and look forward to the Haldane’s Sieve readership giving us a hard time about our manuscripts while they evolve into formal publications.

Casey Bergman

Our paper: The Genomic Signature of Crop-Wild Introgression in Maize

Our inaugural author post is by Matt Hufford and Jeff Ross-Ibarra [@lab_ri] on their paper:
The Genomic Signature of Crop-Wild Introgression in Maize ArXived here.

Evolutionary biologists have long been fascinated by introgressive hybridization. Numerous examples in which introgression has played an important evolutionary role are known, but genetic characterization has typically focused on only a handful of loci.

We took advantage of the recent development of inexpensive genotyping to address a long-standing question of introgression in maize evolution. Maize was domesticated in the warm low elevations of southwest Mexico, and likely colonized the highlands of central Mexico only thousands of years later. Maize is frequently cultivated in sympatry with its wild relatives the teosintes and is known to hybridize with them. Hybridization is especially common in the highlands, where maize and teosinte share several derived morphological features thought to be adaptive to high elevation.

We set out to discover the genomic extent of introgression in highland maize and teosinte populations and the degree to which this has been adaptive. We genotyped 9 sympatric population pairs of maize and teosinte at ~39,000 SNPs. We used two different algorithms (in the software STRUCTURE and HAPMIX) to model chromosomes as mosaics of maize and teosinte, and characterized regions of putative introgression. Surprisingly, we found regions of introgression that were shared across many populations and that ran primarily from teosinte into maize. To test whether this introgression may have facilitated maize adaptation to the highlands, we conducted a growth chamber experiment that revealed significant differences in putatively adaptive morphological traits between maize populations with and without introgression.

We submitted the paper to arXiv because this is a fast-moving area for empirical evolutionary genomics and we hoped to start the dialogue early on how to move forward with our results. We’d like feedback on the paper and specifically the following questions:

Are there recent advances in modeling admixture and introgression that we should apply?

Are our main findings surprising considering the putative history of maize diffusion?

Matt Hufford and Jeff Ross-Ibarra