Our paper: Blood ties: ABO is a trans-species polymorphism in primates

[This author post is by Laure Ségurel [a postdoc in the Przeworski Lab] on the paper Blood ties: ABO is a trans-species polymorphism in primates, posted on the arXiv here]

The mysteries of the ABO blood group were first brought to our attention by Carole Ober. When we started working on it, we were mostly surprised by how little was known about the function of such a heavily studied gene and such an important clinical phenotype. Indeed, the expression of A, B and/or O antigens at the surface of some cells is a polymorphic phenotype shared by species as diverse as macaques and baboons in Africa, gibbons in Asia, squirrel monkeys in the Americas and, of course, humans throughout most of the world. Yet many questions remain unanswered, such as: What is the biological role of ABO in different cell types? Why did Hominoids evolve toward its expression on blood cells whereas other primates express it only on epithelial/endothelial cells? Why is the O allele at such high frequency only in humans? What are the selective agents responsible for the maintenance of this polymorphism? And why did chimpanzees and bonobos apparently lose the polymorphism?

One question that we became interested in answering with population genetic tools was that of the origin of these blood types. When did the genetic polymorphism first emerge, and which species share it identical by descent (as opposed to by convergent evolution)? Answers to these questions could tell us where and when having multiple alleles at this locus became advantageous. We therefore sequenced as many Hominoids, Old World monkeys and New World monkeys as we could get our hands on. Even more interestingly, we started thinking about the expectations under two models: convergent evolution, in which the AB genetic polymorphism was created independently multiple times in different species (and then maintained by balancing selection in these lineages), and trans-species polymorphism, in which the AB genetic polymorphism arose early and was transmitted identical by descent to distinct species. Key to distinguishing the two predictions is the age of different selected alleles within a polymorphic population.

We therefore compared alleles within humans, orangutans, gibbons, macaques, baboons and colobus monkeys (all species polymorphic for the A and B alleles), and showed that, at least among Hominoids and among Old World monkeys, the observed genetic pattern is not compatible with a model of convergent evolution but instead matches the expectations under a model of a trans-species polymorphism maintained by multi-allelic balancing selection. In other words, the data indicate that the AB polymorphism was already present around 20 million years ago, if not earlier. Interestingly, it also seems that the A, B and O functional classes do not provide a complete description of the allelic classes natural selection is acting on, which underscores the need for more detailed functional studies of ABO sub-groups.

By submitting the paper to arXiv, we hope to circulate it to a diverse audience and without delay. In particular, we hope that the study will motivate more experimental/functional work about the role of this polymorphism in immune response, e.g., to pathogen infections.

Laure Ségurel

Haplotype-based variant detection from short-read sequencing

Erik Garrison, Gabor Marth
(Submitted on 17 Jul 2012 (v1), last revised 20 Jul 2012 (this version, v2))

The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector, FreeBayes.

Our paper: Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals

Our next “our paper” guest post is by Vincent Lynch [@VinJLynch] who’s just joined the UChicago faculty from a postdoc at Yale. He’s posting about his recently arXived paper:

Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals. ArXived here.
_________________________________________________________________________________
Explaining how morphology evolves is a major challenge in biology. While it’s clear that changes in gene regulation are ultimately responsible for the development and evolution of complex characters, we are only just beginning to understand the molecular mechanisms of gene regulatory evolution. This is largely due to the emergence of new technologies, such as mRNA-Seq and ChIP-Seq, which give biologists the tools to explore evolution across the genome and in non-model species.

We took advantage of these methods to explore the evolution of gene expression in the uterus during the origin of pregnancy in mammals. Using mRNA-Seq, we show that gene expression evolved extremely rapidly during major stages in the evolution of pregnancy, for example during the origin of maternal resource provisioning in the stem-lineage of Mammalia, placentation in the stem-lineage of Theria, and implantation in the stem-lineage of Eutheria. Using ChIP-Seq to identify the cis-regulatory elements of genes recruited into uterine expression in mammals, we find evidence suggesting that the majority of these enhancers and promoters are derived from mammalian lineage-specific transposons.

While recent technological advances are changing the way we do biology (see Wagner 2013), as these emerging methods come into the mainstream we must collectively define our new standards of evidence. What experiments and methods build a convincing case for X? Is it sufficient, for example, to conclude that a transposon donated a novel promoter to a gene if a ChIP-Seq peak for a histone mark associated with promoters lies within the transposon? If we then expand that observation across the genome, can we reasonably conclude that transposons are causally responsible for gene regulatory change? For these reasons we chose to post our manuscript as a work-in-progress to arXiv, both as our contribution to the larger discussion of what constitutes the standards of evidence in this emerging field of biology and as an opportunity to receive feedback from our colleagues to complement formal peer review.

Vincent Lynch

An analytical comparison of coalescent-based multilocus methods: The three-taxon case

Sebastien Roch
(Submitted on 17 Jul 2012)

Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. A large number of methods have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. Our analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths.
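To make the three-taxon setting concrete, here is a minimal sketch of the standard multispecies-coalescent result for gene tree topologies on a species tree ((A,B),C). It is not taken from the paper: the function name and the tree labels are hypothetical, and the internal branch length `t` is assumed to be given in coalescent units.

```python
import math


def three_taxon_topology_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t: internal branch length in coalescent units (assumed non-negative).
    Illustrative helper only; not the method analysed in the paper.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # The A and B lineages fail to coalesce on the internal branch with
    # probability exp(-t); the three topologies are then equally likely.
    discordant = math.exp(-t) / 3.0
    concordant = 1.0 - 2.0 * discordant
    return {
        "((A,B),C)": concordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.0, 0.5, 2.0):
        print(t, three_taxon_topology_probs(t))
```

As the internal branch shrinks toward zero, all three topologies approach probability 1/3, which is why ILS is most severe on short internal branches.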

A sequentially Markov conditional sampling distribution for structured populations with migration and recombination

Matthias Steinrücken, Joshua S. Paul, Yun S. Song
(Submitted on 25 Aug 2012)

Conditional sampling distributions (CSDs), sometimes referred to as copying models, underlie numerous practical tools in population genomic analyses. Though an important application that has received much attention is the inference of population structure, the explicit exchange of migrants at specified rates has not hitherto been incorporated into the CSD in a principled framework. Recently, in the case of a single panmictic population, a sequentially Markov CSD has been developed as an accurate, efficient approximation to a principled CSD derived from the diffusion process dual to the coalescent with recombination. In this paper, the sequentially Markov CSD framework is extended to incorporate subdivided population structure, thus providing an efficiently computable CSD that admits a genealogical interpretation related to the structured coalescent with migration and recombination. As a concrete application, it is demonstrated empirically that the CSD developed here can be employed to yield accurate estimation of a wide range of migration rates.

An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection

Matthias Steinrücken, Y. X. Rachel Wang, Yun S. Song
(Submitted on 25 Aug 2012)

Characterizing time-evolution of allele frequencies in a population is a fundamental problem in population genetics. In the Wright-Fisher diffusion, such dynamics is captured by the transition density function, which satisfies well-known partial differential equations. For a multi-allelic model with general diploid selection, various theoretical results exist on representations of the transition density, but finding an explicit formula has remained a difficult problem. In this paper, a technique recently developed for a diallelic model is extended to find an explicit transition density for an arbitrary number of alleles, under a general diploid selection model with recurrent parent-independent mutation. Specifically, the method finds the eigenvalues and eigenfunctions of the generator associated with the multi-allelic diffusion, thus yielding an accurate spectral representation of the transition density. Furthermore, this approach allows for efficient, accurate computation of various other quantities of interest, including the normalizing constant of the stationary distribution and the rate of convergence to this distribution.

Bayesian Methods for Genetic Association Analysis with Heterogeneous Subgroups: from Meta-Analyses to Gene-Environment Interactions

Xiaoquan Wen, Matthew Stephens
(Submitted on 4 Nov 2011 (v1), last revised 8 Nov 2011 (this version, v2))

In genetic association analyses, it is often desired to analyze data from multiple potentially-heterogeneous subgroups. The amount of expected heterogeneity can vary from modest (as might typically be expected in a meta-analysis of multiple studies of the same phenotype, for example), to large (e.g. a strong gene-environment interaction, where the environmental exposure defines discrete subgroups). Here, we consider a flexible set of Bayesian models and priors that can capture these different levels of heterogeneity. We provide accurate numerical approaches to compute approximate Bayes Factors for these different models, and also some simple analytic forms which have natural interpretations and, in some cases, close connections with standard frequentist test statistics. These approximations also have the convenient feature that they require only summary-level data from each subgroup (in the simplest case, a point estimate for the genetic effect, and its standard error, from each subgroup). We illustrate the flexibility of these approaches on three examples: an analysis of a potential gene-environment interaction for a recombination phenotype, a large scale meta-analysis of genome-wide association data from the Global Lipids consortium, and a cross-population analysis for expression quantitative trait loci (eQTLs).
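As a rough illustration of what a Bayes factor computed from summary-level data can look like, the sketch below uses a Wakefield-style normal approximation, which is a swapped-in, simpler technique and not the models or priors of the paper. The function names and the `prior_sd` parameter are hypothetical. The second helper pools subgroups by inverse-variance weighting, which corresponds only to the no-heterogeneity (fixed-effect) extreme.

```python
import math


def approx_bayes_factor(beta_hat, se, prior_sd):
    """Wakefield-style approximate Bayes factor (H1 vs H0).

    Assumes beta_hat ~ N(beta, se^2) and, under H1, beta ~ N(0, prior_sd^2).
    Illustrative only; not the paper's own approximation.
    """
    v = se ** 2
    w = prior_sd ** 2
    z2 = (beta_hat / se) ** 2
    shrink = w / (v + w)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)


def pool_fixed_effect(estimates):
    """Inverse-variance pooling of (beta_hat, se) pairs from each subgroup."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, math.sqrt(1.0 / total)


if __name__ == "__main__":
    subgroups = [(0.20, 0.10), (0.15, 0.08)]  # made-up summary statistics
    beta, se = pool_fixed_effect(subgroups)
    print(beta, se, approx_bayes_factor(beta, se, prior_sd=0.2))
```

Only a point estimate and a standard error per subgroup are needed, which is the convenience the abstract highlights; capturing heterogeneity between subgroups requires the richer priors described in the paper.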

Inference of population splits and mixtures from genome-wide allele frequency data

Joseph K. Pickrell, Jonathan K. Pritchard
(Submitted on 11 Jun 2012)

Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In this model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication, and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and “ancient” Asian breeds. Software implementing the model described here, called TreeMix, is available at this http URL
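The Gaussian approximation to drift mentioned above can be illustrated with a small sketch. It assumes a neutral Wright-Fisher population of constant diploid size and compares the exact variance of an allele frequency after drift with the linear-in-time variance that a Gaussian drift model would use. The function names and parameters are hypothetical and this is not TreeMix code.

```python
def drift_variance_exact(x, n_diploid, generations):
    """Variance of allele frequency after neutral Wright-Fisher drift.

    x: starting allele frequency; n_diploid: constant diploid population size.
    """
    decay = (1.0 - 1.0 / (2.0 * n_diploid)) ** generations
    return x * (1.0 - x) * (1.0 - decay)


def drift_variance_gaussian(x, n_diploid, generations):
    """Variance under a Gaussian drift approximation, c * x * (1 - x),
    with drift parameter c = generations / (2 * n_diploid) (an assumption
    of this sketch)."""
    c = generations / (2.0 * n_diploid)
    return c * x * (1.0 - x)


if __name__ == "__main__":
    for t in (10, 100, 5000):
        exact = drift_variance_exact(0.3, 10_000, t)
        approx = drift_variance_gaussian(0.3, 10_000, t)
        print(t, exact, approx)
```

The two variances agree closely when the number of generations is small relative to the population size, and the linear approximation overshoots as drift accumulates.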

Our paper: Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster

Casey Bergman [@caseybergman and @bergmanlab] kindly wrote a post about his recently arXived paper:
Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster
ArXived here.
__________________________________________________________________
As part of the Drosophila 12 Genome Project, Steve Salzberg and colleagues published a pioneering paper in 2005 showing that complete genomes of the bacterial endosymbiont Wolbachia pipientis can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. This paper always left an impression on me as a very clever way of extracting new biology from existing genomic data, and when the era of resequencing multiple strains of D. melanogaster kicked off a few years ago, it seemed like a natural extension to ask if this approach could be adapted to next-generation sequencing data to study the co-evolution of Wolbachia and Drosophila using whole-genome data.

In the current work, we used short-read next generation sequencing data from two major resequencing efforts in D. melanogaster — the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP) — together with the reference Wolbachia genome published by Wu et al. (2005) and extracted over 175 complete Wolbachia genomes and nearly 300 complete mitochondrial genomes. Readers can find the main results in the paper, which is currently in review. I’d like to discuss here the social context of the project and some of the reasons we submitted to arXiv.

This project started out in 2010 as a summer project for a masters student, Mark Richardson, who did an amazing job developing the initial pipeline and made most of the initial discoveries in the paper. Shortly after, Mark and I started a collaboration with Frank Jiggins and Mike Magwire to verify that our in silico genotyping results were making sense, and they suggested bringing in Lucy Weinert and John Welch to help with the more sophisticated Bayesian phylogenetic analysis. Another PhD student in my lab, Raquel Linheiro, adapted her transposable element detection pipeline to identify particular Wolbachia sublineages, which was crucial to linking our data with previous results. This was a great collaboration, where everyone made significant contributions, and I would collaborate with everyone again (and I hope to!).

At the time (summer 2010), we only had access to the North American strains from the DGRP sample; knowing that North American D. melanogaster are derived populations, we were cautious about the impact that population structure had on our results. We planned in early 2011 to publish on only the DGRP dataset since Mark was going off to do a PhD in Australia and I didn’t have anyone else in the group working on this project. In the summer of 2011, the African DPGP data came online and I decided to take a peek and run the pipeline on the African strains as well. This led to a major overhaul of the project and set us back a year, since all the data had to be reanalyzed together and the interpretation of the biogeography results was substantially altered. This was in some ways lucky because our initial interpretation of evidence for a selective sweep on one of the cytoplasmic lineages was probably wrong, and it saved us from having to backpedal on this misinterpretation in a later publication.

As we plugged away at trying to finish this project, we had inquiries about the status of the project from several other groups working in the Wolbachia field. Honestly this stressed me out quite a bit, since some of the inquiries were coming from post-docs in big labs. But instead of just sitting on the data, after we finalized the dataset we decided to release these data openly on our lab blog in April 2012. We decided on an open release as a way to help these teams (and others we didn’t know about), but also to get some priority in this area by providing the “gold standard” that other groups could use (and cite!). For the record, I will note that we asked two teams who contacted us about our project if they would reciprocate by sharing unpublished genomic data or in one case published genomic data that was not submitted to GenBank; both declined.

After making the decision to release the data pre-publication, it was a natural step to submit the manuscript to arXiv. I’m an open science advocate and have used the Nature Preprint server occasionally in the past. I never really liked the Nature Preprint server, though, since I thought people posted there to give their manuscript the stink of being “Nature (in prep)” on their CV. And I never posted to arXiv in the past, since I always thought it was for more hardcore computational or mathematical biology. But recently, I was convinced by Rosie Redfield, Leonid Kruglyak and colleagues putting their Arsenic Life paper on arXiv that more empirical work in quantitative biology was arXiv-able. And just as with releasing our data early, it seemed like the best way to prevent being scooped was to get our results out as quickly as possible and let people know about them.

So we went for it. And I have to say the experience has been thoroughly rewarding. Submitting was a piece of cake, easier than any journal I’ve ever submitted to. Having a URL to point to allowed me to tweet about it, which brought the paper some exposure and brought me some new colleagues on Twitter. It also allowed me to send a submitted manuscript around to colleagues for informal review, without cluttering up their inboxes with big attachments or creating a moral dilemma about who they can share the manuscript with. And somehow submitting to arXiv pushed the “it’s submitted” button in my brain, which made me a whole lot less stressed about the possibility of being scooped, and I’ve been more relaxed throughout the formal submission process. Finally, I know that the pre-publication release of the data and posting of the manuscript has led to a group in Russia using these sequences in their work, and I’ve just gotten a manuscript to review from this group citing our arXiv manuscript and extending our results before our paper is even published! This is what research is all about, right: doing science, getting it out, and letting others build on it. I’ll definitely submit to arXiv for all papers from my lab, and look forward to the Haldane’s Sieve readership giving us a hard time about our manuscripts while they evolve into formal publications.

Casey Bergman

Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture

John E. Pool, Russell B. Corbett-Detig, Ryuichi P. Sugino, Kristian A. Stevens, Charis M. Cardeno, Marc W. Crepeau, Pablo Duchen, J. J. Emerson, Perot Saelao, David J. Begun, Charles H. Langley
(Submitted on 23 Aug 2012)

(ABRIDGED) We report the genome sequencing of 139 wild-derived strains of D. melanogaster, representing 22 population samples from the sub-Saharan ancestral range of this species, along with one European population. Most genomes were sequenced above 25X depth from haploid embryos. Results indicated a pervasive influence of non-African admixture in many African populations, motivating the development and application of a novel admixture detection method. Admixture proportions varied among populations, with greater admixture in urban locations. Admixture levels also varied across the genome, with localized peaks and valleys suggestive of a non-neutral introgression process. Genomes from the same location differed starkly in ancestry, suggesting that isolation mechanisms may exist within African populations. After removing putatively admixed genomic segments, the greatest genetic diversity was observed in southern Africa (e.g. Zambia), while diversity in other populations was largely consistent with a geographic expansion from this potentially ancestral region. The European population showed different levels of diversity reduction on each chromosome arm, and some African populations displayed chromosome arm-specific diversity reductions. Inversions in the European sample were associated with strong elevations in diversity across chromosome arms. Genomic scans were conducted to identify loci that may represent targets of positive selection. A disproportionate number of candidate selective sweep regions were located near genes with varied roles in gene regulation. Outliers for Europe-Africa FST were found to be enriched in genomic regions of locally elevated cosmopolitan admixture, possibly reflecting a role for some of these loci in driving the introgression of non-African alleles into African populations.