Integrated Nested Laplace Approximation for Bayesian Nonparametric Phylodynamics

Integrated Nested Laplace Approximation for Bayesian Nonparametric Phylodynamics

Julia A. Palacios, Vladimir N. Minin
(Submitted on 16 Oct 2012)

The goal of phylodynamics, an area on the intersection of phylogenetics and population genetics, is to reconstruct population size dynamics from genetic data. Recently, a series of nonparametric Bayesian methods have been proposed for such demographic reconstructions. These methods rely on prior specifications based on Gaussian processes and proceed by approximating the posterior distribution of population size trajectories via Markov chain Monte Carlo (MCMC) methods. In this paper, we adapt an integrated nested Laplace approximation (INLA), a recently proposed approximate Bayesian inference for latent Gaussian models, to the estimation of population size trajectories. We show that when a genealogy of sampled individuals can be reliably estimated from genetic data, INLA enjoys high accuracy and can replace MCMC entirely. We demonstrate significant computational efficiency over the state-of-the-art MCMC methods. We illustrate INLA-based population size inference using simulations and genealogies of hepatitis C and human influenza viruses.

LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data

LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data
Alison F. Feder, Dmitri A. Petrov, Alan O. Bergland
(Submitted on 8 Oct 2012)
High-throughput pooled resequencing offers significant potential for whole genome population sequencing. However, its main drawback is the loss of haplotype information. In order to regain some of this information, we present LDx, a computational tool for estimating linkage disequilibrium (LD) from pooled resequencing data. LDx uses an approximate maximum likelihood approach to estimate LD (r2) between pairs of SNPs that can be observed within and among single reads. LDx also reports r2 estimates derived solely from observed genotype counts. We demonstrate that the LDx estimates are highly correlated with r2 estimated from individually resequenced strains. We discuss the performance of LDx using more stringent quality conditions and infer via simulation the degree to which performance can improve based on read depth. Finally we demonstrate two possible uses of LDx with real and simulated pooled resequencing data. First, we use LDx to infer genomewide patterns of decay of LD with physical distance in D. melanogaster population resequencing data. Second, we demonstrate that r2 estimates from LDx are capable of distinguishing alternative demographic models representing plausible demographic histories of D. melanogaster.

Our paper: An age-of-allele test of neutrality for transposable element insertions not at equilibrium

[This author post is by Justin Blumenstiel and Casey Bergman on An age-of-allele test of neutrality for transposable element insertions not at equilibrium, available from the arXiv here]

Studies over the past several decades in Drosophila melanogaster have demonstrated that TE insertion alleles in natural populations tend to segregate at low frequency, particularly in regions of the genome that have a high recombination rate where natural selection is most effective. These results have largely supported a model where natural selection acts to remove deleterious TE insertions from the genome.  The prevailing model of why TE insertions are deleterious is that they lead to chromosomal aberrations that occur when dispersed, non-allelic repeated sequences crossover with one another. This model is known as the ectopic recombination model and it has an important feature. Since each new insertion has the potential to recombine with all the other copies in the genome, fitness will go down faster and faster with each new copy. This yields a stable equilibrium in TE copy number.

But, are TEs at equilibrium in natural populations? Genome sequencing studies have shown that the rate of TE proliferation can vary widely over time and any given TE family may demonstrate non-equilibrium “boom and bust” behavior. How do we reconcile studies that assume equilibrium with the fact that we know TE dynamics are not at equilibrium? To deal with this problem, I began developing this model out of a class project with John Wakeley while I was a graduate student over a decade ago. This model arose of some work I published in 200­2 with Hartl and Lozovsky on the age structure of non-LTR elements in D. melanogaster. I wrote this model up for my Ph.D. thesis and presented a preliminary version in a paper with Neafsey and Hartl in 2004, but it sat on the back burner until I reviewed a paper by Bergman and Bensasson in 2007 that showed many TE families in D. melanogaster have recently inserted in the genome and may not be at equilibrium.

Shortly after their paper came out I contacted Casey with the model from my thesis and we decided to push this idea forward as a collaboration, which has taken several a few years to come to fruition (both being busy with other projects and starting our labs). Things started to really move ahead when Miaomiao He in Casey’s lab generated a crucial data set that could be specifically applied to the model – strain-specific presence/absence data for a very large number of TE insertions ascertained from the D. melanogaster genome sequence.  After a few more years with it on simmer, working out several kinks in the mean time (e.g. incorporating host  demography, trying many different methods for estimating the posterior distribution of TE ages), Casey and I finally wrapped it up just as Haldane’s Sieve is starting to hit its stride. I expect that all my papers in the future will be pre-released on arXiv.

I could speak at length on the specific results, but I would just be saying what is already in the abstract. So, I would like to bring up three points for potential conversation.

First, what does it mean for TEs to be at transposition-selection balance when we know different TE families show a signature of “boom and bust” in genome sequences? There may be one way to reconcile this apparent problem. Any particular TE family may in fact not be at transposition-selection balance. For example, the P element, which invaded Drosophila melanogaster only a few decades ago, is hardly at transposition-selection balance. Therefore, one must be careful in using insertion frequencies for P elements to describe general TE dynamics. However, by integrating over all TE families in the genome, one may in fact reach an approximation that might be reasonable for assuming equilibrium transposition-selection balance. But one must be careful of something I call “family ascertainment bias”. Sometimes the most recently activated TEs are the ones easiest to discover and annotate because these ones are easily cloned from insertion mutations or are most frequent in genome sequences.

Second, in this paper, we derive the probability distribution for each individual TE insertion frequency based on its age. We demonstrate that this provides a method for TE insertions that are either positively or negatively selected. In the case where we show allele frequencies are less than expected (i.e. predicted to be negatively selected), many of these are copies that have zero substitutions. In principle, all of these could have inserted one generation before the reference strain was collected for genome sequencing. The inference that selection is acting against these TEs implicitly assumes either: 1) this wasn’t the case for many of these insertions, and the posterior distribution of ages is a good representation of the true age distribution, or 2) it may have been the case, but natural selection has already acted to remove slightly older TEs from the population, therefore making them absent from the genome sequence.

Third, when putting the finishing touches on our analysis of TE insertion data in North America, we ran up against the issue that nobody has yet published an explicit demographic scenario for North American populations of D. melanogaster, similar to those that have been developed by Wolfgang Stephan‘s Lab and others for European and African populations. We found one paper by Yukilevich et al (2010) from John True’s Lab that generated similar findings to the demography of European populations, which is consistent with the idea that North America populations of D. melanogaster are mainly derived from European ancestors.  However, Yukilevich et al (2010) didn’t explicitly model the admixture with African populations, which is known to occur in North American populations as shown by Caracristi and Schlötterer in 2003. We were surprised that an explicit admixture scenario has not been published yet, especially since this is crucial for interpreting the data from population genomic projects like the Drosophila Genetic Reference Panel. This should be an important line of work for someone to pursue (if it isn’t being done already) and if anyone has information about this a demographic model for North American populations of D. melanogaster, we’d be keen to know more so we can see if might improve our analysis.

Justin and Casey

Genomic tests of variation in inbreeding among individuals and among chromosomes

Genomic tests of variation in inbreeding among individuals and among chromosomes

Joshua G. Schraiber, Stephannie Shih, Montgomery Slatkin
(Submitted on 26 Sep 2012)

We examine the distribution of heterozygous sites in nine European and nine Yoruban individuals whose genomic sequences were made publicly available by Complete Genomics. We show that it is possible to obtain detailed information about inbreeding when a relatively small set of whole-genome sequences is available. Rather than focus on testing for deviations from Hardy-Weinberg genotype frequencies at each site, we analyze the entire distribution of heterozygotes conditioned on the number of copies of the derived (non-chimpanzee) allele. Using Levene’s exact test, we reject Hardy-Weinberg in both populations. We generalized Levene’s distribution to obtain the exact distribution of the number of heterozygous individuals given that every individual has the same inbreeding coefficient, F. We estimated F to be 0.0026 in Europeans and 0.0005 in Yorubans, but we could also reject the hypothesis that F was the same in each individual. We used a composite likelihood method to estimate F in each individual and within each chromosome. Variation in F across chromosomes within individuals was too large to be consistent with sampling effects alone. Furthermore, estimates of F for each chromosome in different populations were not correlated. Our results show how detailed comparisons of population genomic data can be made to theoretical predictions. The application of methods to the Complete Genomics data set shows that the extent of apparent inbreeding varies across chromosomes and across individuals, and estimates of inbreeding coefficients are subject to unexpected levels of variation which might be partly accounted for by selection.

Non-stationary patterns of isolation-by-distance: inferring measures of genetic friction

Non-stationary patterns of isolation-by-distance: inferring measures of genetic friction

Nicolas Duforet-Frebourg, Michael G. B. Blum
(Submitted on 24 Sep 2012)

The pattern of isolation-by-distance arises when population differentiation increases with increasing geographic distances. This pattern is usually caused by local spatial dispersal which explains why differences of allele frequencies between populations accumulate with distance. However, the pattern of isolation-by-distance can mask complex variations of demographic parameters. Spatial variations of demographic parameters such as migration rate or population density generate non-stationary patterns of isolation-by-distance where the rate at which genetic differentiation accumulates varies across space. Barriers to gene flow are particularly well studied examples that generate non-stationary patterns of isolation-by-distance. Using the concept of genetic friction, we develop a statistical method that characterizes non-stationary patterns of isolation-by-distance. Genetic friction at a sampled site corresponds to the local genetic differentiation between the sampled population and fictive populations living in the neighborhood of the sampling site. To avoid defining populations in advance, the method can also be applied at the scale of individuals. The proposed framework is appropriate for dealing with massive data because it relies on a pairwise similarity matrix, which can be obtained with computationally efficient methods. A simulation study shows that maps of genetic friction can detect barriers to gene flow but also other patterns such as continuous variations of gene flow across habitat. The potential of the method is illustrated with 2 data sets: genome-wide SNP data for the human Swedish populations, and AFLP markers for alpine plant species. The software FRICTION implementing the method is available at this http URL

An age-of-allele test of neutrality for transposable element insertions not at equilibrium

An age-of-allele test of neutrality for transposable element insertions not at equilibrium

Justin P. Blumenstiel, Miaomiao He, Casey M. Bergman
(Submitted on 16 Sep 2012)

How natural selection acts to limit the proliferation of transposable elements (TEs) in genomes has been of interest to evolutionary biologists for many years. To describe TE dynamics in populations, many previous studies have relied on the assumption of equilibrium between transposition and selection. However, since TE invasions are known to happen in bursts through time, this assumption may not be reasonable. Here we derive a test of neutrality for TE insertions that does not rely on the assumption of transpositional equilibrium. We consider the case of TE insertions that have been ascertained from a single haploid reference genome sequence and have had their allele frequency estimated in a population sample. By conditioning on age information provided within the sequence of a TE insertion in the form of the number of substitutions that have occurred within the fragment since insertion into a reference genome, we derive the probability distribution for the TE allele frequency in a population sample under neutrality. Taking models of population fluctuation into account, we then test the fit of predictions of our model to allele frequency data from 190 retrotransposon insertion loci in North American and African populations of Drosophila melanogaster. Using this non-equilibrium model, we are able to explain about 80% of the variance in TE insertion allele frequencies. Controlling for nonequilibrium dynamics of transposition and host demography, we demonstrate how one may detect negative selection acting against most TEs as well as evidence for a small subset of TEs being driven to high frequency by positive selection. Our work establishes a new framework for the analysis of the evolutionary forces governing large insertion mutations like TEs or gene duplications.

Thoughts on: The date of interbreeding between Neandertals and modern humans.

The following are my (Graham Coop, @graham_coop) brief thoughts on Sriram Sankararaman et al.’s arXived article: “The date of interbreeding between Neandertals and modern humans.”. You can read the authors’ guest post here, along with comments by Sriram and others.

Overall it’s a great article, so I thought I’d spend sometime talking about the interpretation of the results. Please feel free to comment, our main reason for doing these posts is to facilitate early discussion of preprints.

The authors analysis relies on measuring the correlation along the genome between alleles that may have been inherited from the putative admixture event [so called admixture. The idea being that if there was in fact no admixture and these alleles have just been inherited from the common ancestral population (>300kya) then these correlations should be very weak, as there has been plenty of time for recombination to break down the correlation between these markers. If there has been a single admixture event, the rate at which the correlation decays with the genetic distance between the markers is proportional to this admixture time [i.e. slower decay for a more recent event, as there is less time for recombination]. These ideas for testing for admixture have been around in the literature for sometime [e.g. Machado et al], its the application and genome-wide application that is novel.

As you can tell from the title and abstract of the paper, the authors find pretty robust evidence that this curve is decaying slower than we’d expect if there had been no gene flow, and estimate this “admixture time” to be 37k-86k years ago. However, as the authors are careful to note in their discussion, this is not a definitive answer to whether modern humans and Neandertals interbred, nor is this number a definite time of admixture. Obviously the biological implications of the admixture result will get a lot of discussion, so I thought I’d instead spend a moment on these caveats. [This post has run long, so I’ll only get to the 1st point in this post and perhaps return to write another post on this later].

Okay so did Neandertals actually mate with humans?

The difficulty [as briefly discussed by the authors] is that we cannot know for sure from this analysis that the time estimated is the time of gene flow from Neandertals, and not some [now extinct] population that is somewhat closer to Neandertals than any modern humans.

Consider the figure below. We would like to say that the cartoon history on the left is true, where gene flow has happened directly from Neandertals into some subset of humans. The difficulty is that the same decay curve could be generated by the scenario on the right, where gene flow has occurred from some other population that shares more of its population history with Neandertals than any current day human population does.

Why is this? Well allele frequency change that occurred in the red branch [e.g. due to genetic drift] means that the frequencies in population X and Neandertals are correlated. This means that when we ask questions about correlations along the genome between alleles shared between Neanderthals and humans, we are also asking questions about correlations along the genome between population X and modern humans. So under scenario B I think the rate of decay of the correlation calculated in the paper is a function only of the admixture time of population X with Europeans, and so there may have been no direct admixture from Neandertals into Eurasians*.

First thing is first, that doesn’t diminish how interesting the result is. If interpretation of the decay as a signal of admixture is correct, then it still means that Eurasians interbred with some ancient human population, which was closer to Neandertals than other modern humans. That seems pretty awesome, regardless of whether that population is Neanderthals or some yet undetermined group.

At this point you are likely saying: well we know that Neandertals existed as a [somewhat] separate population/species who are these population X you keep talking about and where are their remains? Population X could easily be a subset of what we call Neandertals, in which case you’ve been reading this all for no reason [if you only want to know if we interbred with Neandertals]. However, my view is that in the next decade of ancient human population history things are going to get really interesting. We have already seen this from the Denisovian papers [1,2], and the work of ancient admixture in Africa (e.g. Hammer et al. 2011, Lachance et al. 2012). We will likely discover a bunch of cryptic somewhat distinct ancient populations, that we’ve previously [rightly] grouped into a relatively small number of labels based on their morphology and timing in the fossil record. We are not going to have names for many of these groups, but with large amounts of genomic data [ancient and modern] we are going to find all sorts of population structure. The question then becomes not an issue of naming these populations, but understanding the divergence and population genetic relationship among them.

There’s a huge range of (likely more plausible) scenarios that are hybrids between A and B that I think would still give the same difficulties with interpretations. For example, ongoing low levels of gene flow from population X into the Ancestral “population” of modern humans, consistent with us calling population X modern humans [see Figure below, **]. But all of the scenarios likely involve some thing pretty interesting happening in the past 100,000 years, with some form of contact between Eurasians and a somewhat diverged population.

As I say, the authors to their credit take the time in the discussion to point out this caveat. I thought some clarification of why this is the case would be helpful. The tools to address this problem more thoroughly are under development by some of the authors on this paper [Patterson et al 2012] and others [Lawson et al.]. So these tools along with more sequencing of ancient remains will help clarify all of this. It is an exciting time for human population genomics!

* I think I’m right in saying that the intercept of the curve with zero is the only thing that changes between Fig 1A and Fig 1B.

** Note that in the case shown in Figure 2, I think Sriram et al are mostly dating the red arrow, not any of the earlier arrows. This is because they condition their subset of alleles to represent introgression into European and to be at low frequency in Africa. We would likely not be able to date the deeper admixture arrow into the ancestor on Eurasian/Africa using the authors approach, as [I think] it relies on having a relatively non-admixed population to use as a control.

Robust identification of local adaptation from allele frequencies

Robust identification of local adaptation from allele frequencies

Torsten Günther, Graham Coop
(Submitted on 13 Sep 2012)

Comparing allele frequencies among populations that differ in environment has long been a tool for detecting loci involved in local adaptation. However, such analyses are complicated by an imperfect knowledge of population allele frequencies and neutral correlations of allele frequencies among populations due to shared population history and gene flow. Here we develop a set of methods to robustly test for unusual allele frequency patterns, and correlations between environmental variables and allele frequencies while accounting for these complications based on a Bayesian model previously implemented in the software Bayenv. Using this model, we calculate a set of `standardized allele frequencies’ that allows investigators to apply tests of their choice to multiple populations, while accounting for sampling and covariance due to population history. We illustrate this first by showing that these standardized frequencies can be used to calculate powerful tests to detect non-parametric correlations with environmental variables, which are also less prone to spurious results due to outlier populations. We then demonstrate how these standardized allele frequencies can be used to construct a test to detect SNPs that deviate strongly from neutral population structure. This test is conceptually related to FST but should be more powerful as we account for population history. We also extend the model to next-generation sequencing of population pools, which is a cost-efficient way to estimate population allele frequencies, but it implies an additional level of sampling noise. The utility of these methods is demonstrated in simulations and by re-analyzing human SNP data from the HGDP populations. An implementation of our method will be available from this http URL.

Our paper: The date of interbreeding between Neandertals and modern humans

This post is by Sriram Sankararaman, Nick Patterson, Heng Li, Svante Pääbo, and David Reich on their paper The date of interbreeding between Neandertals and modern humans arXived here

The relationship between modern humans and archaic hominins such as Neandertals has been the subject of intense debate. The sequencing of a Neandertal genome, a couple of years back (Green et al, Science 2010), showed that Neandertals are more closely related to non-African genomes than African genomes. One possible model consistent with this observation is one involving gene flow from Neandertals to modern non-Africans after the divergence of African and non-African populations. Another model that can explain these observations is one in which the population ancestral to modern humans and Neandertals is structured e.g. imagine that the population ancestral to Neandertals and modern humans consists of three groups, A,B and C, where A,B and C represent the ancestors of modern Africans, non-Africans and Neandertals respectively. The extra proximity of Neandertals to non-Africans over Africans could occur if A and B, and B and C exchanged genes with each other followed by C diverging to form Neandertals, and A and B not completely hybridizing before their divergence to form Africans and non-Africans.

The Neandertal (Green et al, Science 2010) and the Denisova genome (Reich et al, Nature 2010) papers considered the possibility of both models — either scenario was shown to produce the skew in the observed D-statistics (a measure of the excess sharing of alleles across groups) that led to Neandertals appearing closer to non-Africans than Africans. Indeed, a recent paper by Eriksson and Manica (Eriksson and Manica, PNAS 2012) used an Approximate Bayesian Computation framework with D-statistics as the summary statistics and arrived at similar conclusions.

A paper from Monty Slatkin’s group (Yang et al, MBE 2012) attempted to differentiate the two scenarios by using the site frequency spectrum. Yang et al considered the site frequency spectrum in Europeans conditioned on observing a derived allele in Neandertal and an ancestral allele in Africans (termed the doubly-conditioned frequency spectrum, dcfs). They used theory and simulations to show that an ancient structure model produces a linear dcfs. On the other hand, they showed that recent gene flow can produce an excess of rare variants which matches the observed dcfs. Interestingly, they also observed that bottlenecks post gene flow had the effect of making the dcfs linear suggesting that gene flow from Neandertals could not have preceded strong bottlenecks in the non-African populations.

A different idea that we explored was to ask if patterns of linkage disequilibrium (LD) might discriminate the two scenarios. If we could pick out haplotypes that came into modern humans from Neandertal, recombination is expected to break these haplotypes down at a fixed rate every generation (assuming neutrality). Haplotypes that came in 1000 generations ago (under recent gene flow) should be expected to be 10 times longer on average than haplotypes that came in 10000 generations ago (under ancient structure). And if we could measure LD precisely enough, we could even date these ancient events. To date such ancient events, we had to address two technical challenges : i) measures of LD can be sensitive to demographic events, ii) for events that occurred 1000s of generations ago, we need to measure LD at size scales at which genetic maps can be quite noisy and this noise can bias estimates of dates.

Theory indicates that the expected LD (measured by Lewontin’s D), across SNPs that arose on the Nenadertal lineage and introgressed, decays exponentially with genetic distance at a rate given by the time of gene flow and is robust to demographic events. This result does not hold in practice due to imperfect ascertainment of these SNPs. We did simulations to show that this decay of LD does provide accurate estimates and can differentiate gene flow and ancient structure. We also came up with a model to assess errors in genetic maps which we then used to obtain a corrected date.

Our results support the recent gene flow scenario with a likely date of gene flow into the ancestors of modern Europeans 37000-86000 years BP although this does not exclude the possibility of ancient structure. A broader methodological question we are exploring is whether LD-based analyses might be generally applicable as a tool for dating other ancient gene flow events.

Sriram Sankararaman, Nick Patterson, Heng Li, Svante Pääbo, and David Reich

The date of interbreeding between Neandertals and modern humans

The date of interbreeding between Neandertals and modern humans

Sriram Sankararaman, Nick Patterson, Heng Li, Svante Pääbo, David Reich
(Submitted on 10 Aug 2012)

Comparisons of DNA sequences between Neandertals and present-day humans have shown that Neandertals share more genetic variants with non-Africans than with Africans. This could be due to interbreeding between Neandertals and modern humans when the two groups met subsequent to the emergence of modern humans outside Africa. However, it could also be due to population structure that antedates the origin of Neandertal ancestors in Africa. We measure the extent of linkage disequilibrium (LD) in the genomes of present-day Europeans and find that the last gene flow from Neandertals (or their relatives) into Europeans likely occurred 37,000-86,000 years before the present (BP), and most likely 47,000-65,000 years ago. This supports the recent interbreeding hypothesis, and suggests that interbreeding may have occurred when modern humans carrying Upper Paleolithic technologies encountered Neandertals as they expanded out of Africa.