Thoughts on: Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

I (@joe_pickrell) was recently asked to review a preprint by Decker et al., Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle for a journal. Below are the comments I sent the journal.

In this paper, the authors apply a suite of population genetics analyses to a set of cattle breeds. The basic data consists of around 1,500 individuals from 143 breeds typed at around 40,000 SNPs. The authors use this data to build population trees/graphs using TreeMix and visualize population structure with PCA/ADMIXTURE. They then interpret the results of these programs in light of their knowledge of the history of cattle domestication. I had no knowledge of cattle history prior to reading this manuscript, so I enjoyed reading it. I have first a few comments on the manuscript as a whole, then on individual points.

Overall comments:

1. A lot of interpretation depends on the robustness of the inferred population graph from TreeMix. It would be extremely helpful to see that the estimated graph is consistent across different random starting points. The authors could run TreeMix, say, five different times, and compare the results across runs. I expect that many of the inferred migration edges will be consistent, but a subset will not. It’s probably most interesting to focus interpretation on the edges that are consistent.

2. Throughout the manuscript, inference from genetics is mixed in with evidence from other sources, and at points it becomes unclear which claims rest strictly on the genetics. For example, the authors write, “Anatolian breeds are admixed between European, African, and Asian cattle, and do not represent the populations originally domesticated in the region”. The first part of that statement (about admixture) could be their conclusion from the genetic data, but it is difficult to make the second statement (about the original populations in the region) from genetics alone, so presumably it rests on other sources. In general, I would suggest separating the results internal to this paper from the other statements and making a clear firewall between their results and the historical interpretation of those results (right now the authors have a “Results and Discussion” section; the easiest fix might be to split the “Results” from the “Discussion”, but this is up to the authors).

3. Related to the above point, could the authors add subsection headings to the results/discussion section? Right now the topic of the paper jumps around considerably from paragraph to paragraph, and at points I had difficulty following. One possibility would be to organize the subheadings by the claims made in the abstract, e.g. “Cline of indicine introgression into Africa”, “Wild African auroch ancestry”, etc.
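To make the consistency check in point 1 concrete, here is a minimal sketch of tallying migration edges across runs. Everything here is hypothetical (population names, and the assumption that each run's inferred edges have already been parsed into (source, destination) pairs from TreeMix's output files):

```python
from collections import Counter

def consistent_edges(runs, min_runs=None):
    """Tally migration edges across independent TreeMix runs.

    `runs` is a list with one entry per run, each a set of
    (source, destination) edges parsed from that run's output.
    Returns the edges seen in at least `min_runs` runs
    (default: all of them)."""
    if min_runs is None:
        min_runs = len(runs)
    counts = Counter(edge for run in runs for edge in set(run))
    return {edge for edge, n in counts.items() if n >= min_runs}

# Toy example: one edge recovered in all three runs, two run-specific ones.
runs = [
    {("Europe", "Africa"), ("Asia", "Europe")},
    {("Europe", "Africa"), ("Africa", "Asia")},
    {("Europe", "Africa")},
]
print(consistent_edges(runs))  # {('Europe', 'Africa')}
```

Edges that survive across all (or most) random restarts would then be the natural ones to build interpretation on.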

Specific comments:

There are quite a few results claimed in this paper, so I’m going to split my comments apart by the results reported in the abstract. As mentioned above, it would be nice if the authors clearly stated exactly which pieces of evidence they view as supporting each of these, perhaps in subheadings in the Results section. In italics is the relevant sentence in the abstract, followed by my thoughts:

Using 19 breeds, we map the cline of indicine introgression into Africa.

This claim is based on interpretation of the ADMIXTURE plot in Figure 5. I wonder if a map might make this point more clearly than Figure 5, however; the three-letter population labels in Figure 5 are not very easy to read, especially since most readers will have no knowledge of the geographic locations of these breeds.

We infer that African taurine possess a large portion of wild African auroch ancestry, causing their divergence from Eurasian taurine.

This claim appears to be largely based on the interpretation of the TreeMix plot in Figure 4. This figure shows an admixture edge from the ancestors of the European breeds into the African breeds. As noted above, it seems important that this migration edge be robust across different TreeMix runs. Also, labeling this ancestry as “wild African auroch ancestry” seems like an interpretation of the data rather than something that has been explicitly tested, since the authors don’t have wild African aurochs in their data.

Additionally, the authors claim that this result shows “there was not a third domestication process, rather there was a single origin of domesticated taurine…”. I may be missing something, but it seems that genetic data cannot distinguish whether a population was “domesticated” or “wild”. That is, it seems plausible that the source population tentatively identified in Figure 4 may have been independently domesticated. There may be other sources of evidence that refute this interpretation, but this is another example of where it would be useful to have a firewall between the genetic results and the interpretation in light of other evidence. The speculation about the role of disease resistance in introgression is similarly not based on evidence from this paper and should probably be set apart.

We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia.

The cline of taurine/indicine hybridization is based on interpretation of ADMIXTURE plots and some follow-up f4 statistics. I found this difficult to follow, especially since a significant f4 statistic can have multiple interpretations. Perhaps the authors could draw out the proposed phylogeny for these breeds and explain why they chose the particular f4 statistics they highlight.
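For readers unfamiliar with these statistics, a minimal sketch of what an f4 statistic computes, on toy simulated frequencies (not the paper's data; the simulation parameters are made up):

```python
import numpy as np

def f4(pA, pB, pC, pD):
    """f4(A, B; C, D): the average over SNPs of (pA - pB) * (pC - pD).

    For the unrooted tree ((A,B),(C,D)) with no gene flow, drift on the
    (A,B) side is independent of drift on the (C,D) side, so the
    expectation is zero; a significantly non-zero value implies that at
    least one of the four populations is admixed."""
    return float(np.mean((pA - pB) * (pC - pD)))

# Simulated tree-like data: four populations drift independently from a
# shared ancestral frequency, so f4 should be close to zero.
rng = np.random.default_rng(0)
anc = rng.uniform(0.1, 0.9, 100_000)
pA, pB, pC, pD = (np.clip(anc + rng.normal(0, 0.05, anc.size), 0, 1)
                  for _ in range(4))
print(abs(f4(pA, pB, pC, pD)) < 0.01)  # True: no signal without admixture
```

The ambiguity mentioned above is visible in the definition: a non-zero value only says the tree ((A,B),(C,D)) is wrong somewhere, not which population carries the admixture.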

We also identify the influence of species other than Bos taurus in the formation of Asian breeds.

The conclusion that species other than Bos taurus have introgressed into Asian breeds seems to be based on interpretation of branch lengths in the trees in Figures 2-3 and some f3 statistics. The interpretation of branch lengths is extremely weak evidence for introgression, probably not even worth mentioning. The f3 statistics are potentially quite informative, though. For the breeds in question (Brebes and Madura), which pairs of populations give the most negative f3 statistics? This is difficult information to extract from Supplementary Table 2, where the populations appear to be sorted alphabetically; a table showing the (for example) five most negative f3 statistics could be quite useful here. In general, if the SNP ascertainment scheme is not extremely complicated (can the authors describe the ascertainment scheme for this array?), a negative f3 statistic is very strong evidence that a target population is admixed, whereas a significant f4 statistic only means that at least one of the four populations in the statistic is admixed. This might be a useful property for the authors.
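The contrast with f4 can be made concrete with a small sketch (again toy simulated frequencies with made-up parameters, not the paper's data):

```python
import numpy as np

def f3(pT, pA, pB):
    """f3(T; A, B): the average over SNPs of (pT - pA) * (pT - pB).

    A significantly negative value is strong evidence that the target
    population T is admixed between populations related to A and B
    (barring a very complicated SNP ascertainment scheme)."""
    return float(np.mean((pT - pA) * (pT - pB)))

rng = np.random.default_rng(1)
pA = rng.uniform(0.05, 0.95, 100_000)
pB = np.clip(pA + rng.normal(0, 0.2, pA.size), 0, 1)  # relative of the 2nd source
pT = 0.5 * pA + 0.5 * pB  # target: a 50/50 mixture of A and B
print(f3(pT, pA, pB) < 0)  # True: an admixed target gives a negative f3
```

For a perfectly mixed target with no post-admixture drift, each term is -0.25*(pA - pB)^2, which is why the statistic can only go negative through admixture of the target itself.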

We detect the pronounced influence of Shorthorn cattle in the formation of European breeds.

This conclusion appears to be based on interpretation of ADMIXTURE plots in Figures S6-S9. Interpreting these types of plots is notoriously difficult. I wonder if the f3 statistics might be useful here: do the authors get negative f3 statistics in the populations they write “share ancestry with Shorthorn cattle” when using the Durham shorthorns as one reference?

Iberian and Italian cattle possess introgression from African taurine.

This conclusion is based on ADMIXTURE plots and treemix; it would be interesting to see the results from f3 statistics as well.

American Criollo cattle are shown to be of Iberian, and not African, descent.

I found this difficult to follow: the authors write that these breeds “derive 7.5% of their ancestry from African taurine introgression”, so presumably they are in fact partially of African descent?

Indicine introgression into American cattle occurred in the Americas, and not Europe

This conclusion seems difficult to make from genetic data. The authors identify “indicine” ancestry in American cattle, so I don’t see how they can determine whether this happened before or after a migration without temporal information. It would be helpful if the authors walk the reader through each logical step they’re making so that the reader can decide whether they believe the evidence for each step.

Some background on Bhaskar and Song’s paper: “The identifiability of piecewise demographic models from the sample frequency spectrum”

This post is by Graham Coop [@Graham_Coop], and is an introduction to the background of Anand Bhaskar and Yun Song's preprint: “The identifiability of piecewise demographic models from the sample frequency spectrum”, arXived here.

Anand and Yun’s preprint focuses on what we can learn about population demographic history from the site frequency spectrum. It takes as its starting point an article by Myers et al. (Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348). I wrote about Myers et al.’s article back in 2008, when it came out (see here). I thought I’d post an edited version of that post by way of additional background to Anand and Yun’s preprint and guest post.

Edited version of original post:
The best way to learn about demography from population genetic data is to look at patterns of diversity across many unlinked regions. The distribution of allele frequencies at unlinked, neutral SNPs in a population (the site frequency spectrum) is potentially very informative about population history. For example, an excess of low-frequency mutations is consistent with recent population growth: the increase in population size introduces new mutations, but these mutations have not yet had time to drift to higher frequencies. Many authors have made use of the frequency spectrum of unlinked, putatively neutral SNPs to learn about demography.
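The site frequency spectrum is simple to compute from a matrix of unlinked genotypes; a minimal sketch:

```python
import numpy as np

def site_frequency_spectrum(genotypes):
    """Unfolded SFS from a 0/1 matrix (rows: haploid samples, columns:
    unlinked SNPs). Entry i-1 counts SNPs whose derived allele is
    carried by exactly i of the n samples (i = 1 .. n-1)."""
    n = genotypes.shape[0]
    derived = genotypes.sum(axis=0)
    return np.bincount(derived, minlength=n + 1)[1:n]

# Toy matrix: 4 haplotypes, 3 SNPs with derived-allele counts 1, 1 and 2.
g = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 0],
              [0, 0, 0]])
print(site_frequency_spectrum(g))  # [2 1 0]: two singletons, one doubleton
```

An excess in the first bin (singletons), relative to the constant-size neutral expectation that the i-th bin is proportional to 1/i, is the growth signal described above.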

Back in 2008, a technical but elegant article by Myers et al. showed that, while informative about demography, the site frequency spectrum at unlinked SNPs cannot help you choose between certain demographic histories. This is not a question of imperfect knowledge of the site frequency spectrum (which more data would solve): as Myers et al. formally show, for any particular demographic model there is a large family of demographic histories that give rise to the same site frequency spectrum. They explained: ‘Informally, changes in population size at some past time are canceled out by other changes in the opposite direction’. I think this lack of information comes from the fact that each unlinked SNP only tells you about the placement of a single mutation on the genealogy of the population at that site, so across sites you learn only about the expected amount of time in different parts of the genealogy. By fluctuating the population size in just the right way, i.e. speeding up and slowing down the rate of coalescence, we can get very different population histories to give us the same expected coalescent times.

For example, Myers et al. showed that the population size history below gives the same site frequency spectrum as a constant-size population:
Figure taken from Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.

That’s somewhat worrying, as it means that when we fit a model of population size changes over time, there is actually a family of quite different-looking population histories that would give just as good a fit to the data. However, these alternative histories do look quite strange and may not be biologically reasonable.

Anand and Yun’s article takes this as its starting point. They show that if we write our population history as a piecewise function, then the parameters of that function are identifiable, and they provide simple estimates of the sample size needed. You can read more about their results in their guest post here at Haldane’s Sieve.

Thoughts on: Loss and Recovery of Genetic Diversity in Adapting Populations of HIV

This guest post is an exchange between Richard Neher and the authors of the preprint “Loss and Recovery of Genetic Diversity in Adapting Populations of HIV” (Pleuni Pennings, Sergey Kryazhimskiy, and John Wakeley). Below is the comment from Richard Neher, and then appended is the response from the authors of the study.

In this paper, the authors use a data set from a study of the anti-HIV drug efavirenz. This drug has a fairly stereotypic resistance profile which in most cases involves a mutation at amino acid 103 of the reverse transcriptase of HIV (K103N). The authors examine sequences from patients after treatment failure (drug resistant virus) and observe that in a large fraction of the cases, the drug resistance mutation K103N is present on multiple genetic backgrounds or in both of the possible codons for asparagine. This suggests frequent soft sweeps, i.e., evolution of drug resistance is not limited by the waiting time for a point mutation.

The observation of frequent soft sweeps allows one to put a lower bound on the product of population size and mutation rate. Since the mutation rate is on the order of 1e-5, the lower bound for the population size is around N > 1e5. The authors suggest that the fact that not all patients exhibit obvious soft sweeps can be used to deduce an upper bound on N. However, one has to realize that the patient sample is heterogeneous, that additional drugs are used along with efavirenz, and that most likely additional mutations have swept through the population; multiple soft sweeps in rapid succession will look like hard sweeps. The lower bound makes a lot of sense and does away with a long-held erroneous belief that the “effective” HIV population within an infected individual is small.
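One way to see where these bounds come from is the Ewens-sampling approximation from Pennings and Hermisson's earlier theoretical work (a sketch of the logic, with made-up sample sizes, not the paper's exact calculation):

```python
def prob_hard_sweep(theta, n):
    """Probability that n sampled resistant lineages trace back to a
    single mutational origin (a 'hard' sweep), under the Ewens sampling
    formula with population-scaled mutation rate theta (= 2*N*mu for a
    haploid population), following Pennings and Hermisson."""
    p = 1.0
    for j in range(1, n):
        p *= j / (j + theta)
    return p

mu = 1e-5  # rough order of magnitude for a specific HIV point mutation
for N in (1e3, 1e5, 1e7):
    print(f"N = {N:.0e}: P(hard sweep) = {prob_hard_sweep(2 * N * mu, 10):.3f}")
```

With N around 1e3 essentially every sweep looks hard, while with N at or above 1e5 hard sweeps in a sample of ten lineages become rare; observing a mix of hard and soft sweeps is what pins N between the two bounds (modulo the caveats above about successive sweeps).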

The debate about the size of the HIV population has some interesting history. In the mid-1990s, it was estimated that roughly 1e7 cells are infected by HIV within a chronically infected patient every day. Virologists studying HIV evolution concluded that every point mutation is explored many times every day (see Coffin, Science, 1995), which was consistent with the frequent failure of mono-therapy, i.e., therapy with only one drug. Around the same time, it was observed that HIV sequences within a patient typically have a common ancestor about 3 years back, which translates into roughly 500-1000 generations. Population geneticists then started to convince people that this rapid coalescence corresponds to an “effective” population size of the order of a thousand, and that this explains the observed stochasticity of HIV evolution. Not everybody was convinced, and some went through great trouble to show that very rare alleles matter and that the population size is large. In one such study, failure of efavirenz therapy was examined in monkeys: despite the fact that the resistance mutations were at frequencies below 1e-5 before treatment, both codons for asparagine at position 103 were observed a few days after treatment. Via a similar argument as in the above paper, those authors concluded that the population size is large.

There is very little reason to believe that coalescence in HIV is driven by short-term offspring-number fluctuations (drift). Instead, the coalescence is most likely driven by selection, in which case the time scale of coalescence depends only weakly on the population size.

The tendency of population geneticists to map everything onto a neutral model has, in the case of HIV, produced much confusion. This confusion would be easily avoided if people were willing to give up the concept of effective population size and simply call the time scale of coalescence what it is.

Response from Pennings et al.

Hi Richard,

Thanks a lot for your detailed comments.
We agree with most of your analysis, but let us explain why we believe that our estimate of the effective population size is not just a lower bound.
We use the observation of “soft” and “hard” sweeps to estimate the effective population size of the virus in a patient.
In 7 patients, a mixture of AAC and AAT alleles at the 103rd codon of RT replaced the wildtype, whereas in 10 patients either the AAT or the AAC allele replaced the wildtype.
When we only observe the AAC allele, it is still possible that this allele has several origins (the mutation from AAA to AAC occurred multiple times). This possibility is included in our analysis.
In addition, it is possible that originally, a mixture of AAC and AAT replaced the wildtype, but a subsequent sweep (at another site) removed one of the two alleles from the population (or reduced its frequency so that it doesn’t appear in the sample). You suggest that this process can explain our observation of hard sweeps.
We agree that this is a theoretical possibility, but we believe that our original interpretation is more parsimonious.

First, sweeps may be occurring regularly in all patients. In this case we would not expect any difference in diversity reduction between patients in which the last sweep happened at a drug resistance codon and patients in which another sweep was the last. Our data do not support this picture, because ongoing sweeps in all patients are not compatible with the significant and substantial reduction in diversity in the patients whose virus fixed a resistance mutation. Hence non-drug-resistance-related sweeps with a strong effect on diversity must be relatively rare in these viral populations.

We have plotted the reduction in diversity in intervals without the fixation of a resistance mutation and in intervals with the fixation of a resistance mutation. The 10 patients in which only the AAC or AAT allele was observed are highlighted in red. The reduction of diversity in these intervals is quite severe, and such severe reductions are not observed in the intervals without the fixation of a resistance mutation.


Second, it may be possible that, for some reason, the patients in which we see a hard sweep at site 103 actually had two or more sweeps (with the sweep at site 103 not being the last one), while patients in which we see a soft sweep had only one soft sweep at site 103. Then, indeed, the former set of patients would have a larger reduction in diversity than the latter set, and this difference in reduction would NOT be due to the fact that the former patients received only one resistance mutation. One potential scenario in which this could happen is if the time intervals during which the sweep occurred are systematically longer in patients in which we observe hard sweeps. However, this is not the case; see the figure.


Another scenario is if there is a specific structure of epistasis among mutations in HIV. In particular, after the 103 mutation has fixed, another mutation or mutations become available that were not available before K103N swept. These could be compensatory mutations, for example. In this case, in all patients there was a soft sweep at site 103. Following that, in some patients the secondary mutation occurred and swept quickly, but in others it didn’t (just by chance). In those patients where it did occur and sweep, we see a larger reduction in diversity (including at site 103) due to this secondary sweep. However, this would mean that the populations are limited by the supply of this secondary mutation rather than by the K103N mutation, which seems unlikely (especially considering that after the K103N mutation the population size would likely have gone up). Also, if this were the case, the mutations that lead to the second sweep must occur relatively far away from the K103N site; otherwise they would likely have been discovered.

Finally, it can be that what look like hard sweeps are indeed hard sweeps. We believe that this is, given our current knowledge, the most parsimonious explanation of our observations. Hence, the effective population size of the virus cannot be very large. This explanation is also compatible with the observation that resistance does not evolve in all patients.

Pleuni and Sergey (John is on vacation)

Thoughts on “Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes”

Our next guest post is by Pleuni Pennings [@pleunipennings] with her thoughts on:
Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes, Duncan Palmer, John Frater, Rodney Philips, Angela McLean, Gil McVean, arXived here


Last week, a group of people from Oxford University published an interesting paper on the arXiv. The paper is about using genealogical data (from HIV sequences) in combination with cross-sectional data (on patient and HIV phenotypes) to infer rates of evolution in HIV.

My conclusion: the approach is very interesting, and it makes total sense to use genealogical data to improve the inference from cross-sectional data. In fact, it is quite surprising to me that inferring rates from cross-sectional data works at all. However, in a previous paper, (partly) the same authors showed that it is possible to infer rates from cross-sectional data alone, and the estimates they got were very similar to the estimates from longitudinal data. The current paper provides a new and improved method, whose results are consistent with the previous papers.

The biological conclusion of the paper is that HIV adaptation is slower than many previous studies suggested. Case studies of fast evolution of the virus suffer from extreme publication bias and give the impression that evolution in HIV is always fast, whereas cross-sectional and longitudinal data show that evolution is often slow. Waiting times for CTL-escape and reversion are on the order of years.

1. What rates are they interested in?

The rates of interest here are the rate of escape from CTL pressure and the rate of reversion if there is no CTL pressure.

When someone is infected with HIV, the CTL response by the immune system of the patient can reduce the amount of virus in the patient. CTL stands for cytotoxic T lymphocyte. Which amino-acid sequences (epitopes) can be recognized by the host’s CTL response depends on the HLA genotype of the host.
Suppose I have a certain HLA genotype X, such that my CTLs can recognize virus with a specific sequence of about 9 amino acids, let’s call this sequence Y. To escape from the pressure of these CTLs, the virus can mutate sequence Y to sequence Y’. A virus with sequence Y’ is called an escape mutant. The host (patient) with HLA X is referred to as a “matched host” and hosts without HLA X are referred to as “unmatched.” The escape mutations are thought to be costly for the virus.
So, for each CTL epitope there are 4 possible combinations of host and virus:
1. matched host and wildtype virus (there is selection pressure on the virus to “escape”)
2. matched host and escape mutant virus
3. unmatched host and wildtype virus
4. unmatched host and escape mutant virus (there is selection pressure on the virus to revert)

The question is “how fast does the virus escape if it is in a matched host and how fast does it revert if it is in an unmatched host?”

2. Why do we want to know these rates?

First of all, just out of curiosity: it is interesting to study how fast things evolve, and it is surprising how little we know about rates of adaptive evolution. Secondly, escape rates are relevant for the success of a potential HIV vaccine: if escape rates are high, then vaccines will probably not be very successful.

3. What are cross-sectional data and how can we infer rates from them?

Cross-sectional data are snap-shots of the population, with information on hosts and their virus. Here, it is the number of matched and unmatched hosts with wildtype and escape virus at a given point in time.

So how do these data tell us what the escape and reversion rates are? Intuitively, it is easy to see how very high or very low rates would shape the data. For example, if escape and reversion happened very fast, then the virus would always be perfectly adapted: we’d only find wildtype virus in unmatched hosts and only escape mutant virus in matched hosts. Conversely, if escape and reversion were extremely slow, then the fraction of escape mutant virus would not differ between matched and unmatched hosts: everyone would be infected with a random virus, and this would never change.
The real situation is somewhere in between: the fraction of escape mutant virus is higher in matched hosts than in unmatched hosts. With the help of a standard epidemiological SI model (an ODE model) and an estimate of the age of the epidemic, the fraction of escape mutant virus in the two types of hosts translates into estimates of the rates of escape and reversion. In the earlier paper, this is exactly what the authors did, and the results make a lot of sense: rates range from months to years, reversion is always slower than escape, and there are large differences between CTL epitopes. The results also matched well with data from longitudinal studies. In a longitudinal study, the patients are followed over time, so evolution of the virus can be observed more directly. This is much more costly, but a much better way to estimate rates.
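A deliberately oversimplified toy version of this logic (not the authors' actual SI model, which tracks transmission dynamics; rate values and the fixed infection age T are made up):

```python
import math

def escaped_fraction_matched(phi, T):
    """Fraction of matched hosts carrying escape virus, if every matched
    host was infected with wildtype T years ago and escape occurs at
    rate phi per year."""
    return 1.0 - math.exp(-phi * T)

def escaped_fraction_unmatched(psi, T):
    """Fraction of unmatched hosts still carrying escape virus, if every
    unmatched host was infected with escape virus T years ago and the
    virus reverts at rate psi per year."""
    return math.exp(-psi * T)

def escape_rate_from_fraction(frac, T):
    """Invert the matched-host relation to get a point estimate of phi."""
    return -math.log(1.0 - frac) / T

# Round trip: a true escape rate of 0.2/year over a 10-year time scale.
observed = escaped_fraction_matched(0.2, 10.0)
print(round(escape_rate_from_fraction(observed, 10.0), 6))  # 0.2
```

The real model additionally has to average over when hosts were infected and with which viral type, which is where the epidemic's age and the SI dynamics come in.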

4. Why are the estimates from cross-sectional data not good enough?

Unfortunately, the estimates from cross-sectional data are only point estimates, and maybe not very good ones. The problem is that the method (implicitly) assumes that each virus is independently derived from an ancestor at the beginning of the epidemic. For example, if there are a lot of escape mutant viruses in the dataset, then the estimated rate of escape will be high. However, the high number of escape mutant virus may be due to one or a few escape events early on in the epidemic that got transmitted to a lot of other patients. It is a classical case of non-independence of data. It could lead us to believe that we can have more confidence in the estimates than we should have.

5. Genealogical data to the rescue!

Fortunately, the authors have viral sequences that provide much more information than just whether or not the virus is an escape mutant. The sequences of the virus can inform us about the underlying genealogical tree and can tell us how non-independent the data really are (two escape mutants that are very close to each other in the tree are not very independent). The goal of the current paper is to use the genealogical data to get better estimates of the escape and reversion rates.

A large part of the paper deals with the nuts and bolts of how to combine all the data, but in essence, this is what they do: They first estimate the genealogical tree for the viruses of the patients for which they have data (while allowing for uncertainty in the estimated tree). Then they add information on the states of the tips (wildtype vs escape for the virus and matched vs unmatched for the patient), and use the tree with the tip-labels to estimate the rates. This seems to be a very useful new method, that may give better estimates and a natural way to get credible intervals for the estimates.

The results they obtain with the new method are similar to the previous results for three CTL epitopes, with slower rates for one CTL epitope. The credible intervals are quite wide, which shows that the data (from 84 patients) really don’t contain a whole lot of information about the rates, possibly because the trees are rather star-shaped due to the exponential growth of the epidemic. Interestingly, the fact that the tree is rather star-shaped could explain why the older approach (based only on cross-sectional data) worked quite well. However, this will not necessarily be the case for other datasets.

Question for the authors

Do you use the information about the specific escape mutations in the data? Surely not all sequences that are considered “escape mutants” carry exactly the same nucleotide changes? Whenever they carry different mutations, you know they must be independent.

Thoughts on: Polygenic modeling with Bayesian sparse linear mixed models

[This post is a commentary by Alkes L. Price on "Polygenic modeling with Bayesian sparse linear mixed models" by Zhou, Carbonetto, and Stephens. The preprint is available on the arXiv here.]

Linear mixed models (LMMs) are widely used by geneticists, both for estimating the heritability explained by genotyped markers (h2g) and for phenotypic prediction (Best Linear Unbiased Prediction, BLUP); their application for computing association statistics is outside the focus of the current paper. LMMs assume that effect sizes are normally distributed, but this assumption may not hold in practice. Improved modeling of the distribution of effect sizes may lead to more precise estimates of h2g and more accurate phenotypic predictions.

Previous work (nicely summarized by the authors in Table 1) has used various mixture distributions to model effect sizes. In the current paper, the authors advocate a mixture of two normal distributions (with independently parametrized variances), and provide a prior distribution for the hyper-parameters of this mixture distribution. This approach has the advantage of generalizing LMM, so that the method produces results similar to LMM when the effect sizes roughly follow a normal distribution. Posterior estimates of the hyper-parameters and effect sizes are obtained via MCMC.
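A sketch of what this prior means generatively (parameter names and the exact parametrization here are illustrative, not the authors' hyper-parametrization):

```python
import numpy as np

def sample_effects(n_snps, pi, sigma_poly, sigma_sparse, rng):
    """Draw SNP effect sizes under a mixture of two normals: every SNP
    gets a small polygenic effect N(0, sigma_poly^2), and a fraction pi
    of SNPs additionally get a large sparse effect N(0, sigma_sparse^2)."""
    beta = rng.normal(0.0, sigma_poly, n_snps)
    is_sparse = rng.random(n_snps) < pi
    beta[is_sparse] += rng.normal(0.0, sigma_sparse, is_sparse.sum())
    return beta

rng = np.random.default_rng(2)
beta = sample_effects(10_000, pi=0.01, sigma_poly=0.01, sigma_sparse=1.0, rng=rng)
# Most of the genetic variance comes from the ~1% of SNPs with large
# effects; as sigma_sparse -> 0 the prior collapses back to the standard
# LMM assumption of a single normal.
print(beta.shape)
```

This collapse to the single-normal case is what gives the method its safety net: when effect sizes really are roughly normal, it behaves like an LMM.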

The authors show via simulations and application to real phenotypes (e.g. WTCCC) that the method performs as well or better than other methods, both for estimating h2g and for predicting phenotype, under a range of genetic architectures. For diseases with large-effect loci (e.g. autoimmune diseases), results superior to LMM are obtained. When effect sizes are close to normally distributed, results are similar to LMM — and superior to a previous Bayesian method developed by the authors based on a mixture of normally distributed and zero effect sizes, with priors specifying a small mixing weight for non-zero effects.

Have methods for estimating h2g and building phenotypic predictions reached a stage of perfection that obviates the need for further research? The authors report a running time of 77 hours to analyze data from 3,925 individuals, so computational tractability on the much larger data sets of the future is a key area for possible improvement. I wonder whether it might be possible for a simpler method to achieve similar performance.

Alkes Price

Thoughts on: Finding the sources of missing heritability in a yeast cross

[This commentary post is by Joe Pickrell on Finding the sources of missing heritability in a yeast cross by Bloom et al., available from the arXiv here]

For decades, human geneticists have used twin studies and related methods to show that a considerable amount of the phenotypic variation in humans is driven by genetic variation (without any knowledge of the underlying loci). More recently, genome-wide association studies have made incredible progress in identifying specific genetic variants that are important in various traits. However, the loci identified in the latter studies often have small effects, and the sums of their effects rarely come close to the genetic effects known from the former. The difference between the genetic effects on a trait known from heritability studies and the effects estimated from individual loci has come to be known as the “missing heritability”, and much ink has been spilled on speculation as to its cause.

Bloom et al. take an elegant and straightforward approach to this question using a model system, the budding yeast Saccharomyces cerevisiae. The insight is that in order to make progress, one needs an experimental design that isolates some of the possible causes of the “missing heritability”. To achieve this, Bloom et al. use a cross between two different yeast strains and grow the segregants in identical conditions. They thus remove much of the environmental variation in the phenotypes, while also removing the effect of allele frequency (since all alleles are at frequency 0.5 in the cross). While this means that they cannot address some controversies about potential sources of heritability in humans (for example, rare versus common variants), they are able to estimate how much phenotypic variation is due to detectable additive effects. The authors additionally develop high-throughput assays to measure phenotypes (in their case, growth rates in 46 different conditions) and genotypes in the segregants from the cross, so that they can perform high-powered mapping studies.

The main result relevant to the issue of “missing heritability” is presented in their Figure 3a (reproduced above). After performing a well-powered mapping study, the authors compare the effects from their identified loci to the narrow-sense heritability of each trait. As it turns out, the heritability is not missing! In this cross, the identified loci, though of small effect, add up to a substantial fraction of the overall (narrow-sense) heritability for most traits. The authors additionally identify some gene-gene interactions that contribute to the broad-sense heritability (but by definition not the narrow-sense heritability) of many traits.
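To make this comparison concrete, here is a toy simulation, not the authors' actual pipeline: the number of loci, effect sizes, and sample size are invented for illustration. It mimics haploid segregants from a two-parent cross (every allele at frequency 0.5) with many small additive effects, and checks that the summed per-locus variance explained approaches the narrow-sense heritability:

```python
import numpy as np

rng = np.random.default_rng(1)
n_segregants, n_loci = 1000, 50

# Haploid segregants from a two-parent cross: each locus is 0/1 at frequency 0.5.
G = rng.integers(0, 2, size=(n_segregants, n_loci)).astype(float)
beta = rng.normal(0, 0.1, n_loci)                   # many small additive effects
g = G @ beta                                        # additive genetic value
e = rng.normal(0, np.sqrt(g.var()), n_segregants)   # noise sized for h2 around 0.5
y = g + e

h2 = g.var() / y.var()                              # true narrow-sense heritability
# Variance explained by each locus individually, as a mapping study would estimate it
per_locus = sum(np.corrcoef(G[:, j], y)[0, 1] ** 2 for j in range(n_loci))
print(f"h2 = {h2:.2f}, summed per-locus r^2 = {per_locus:.2f}")
```

Because the loci here are unlinked and strictly additive, the two numbers agree up to sampling noise; the interesting empirical content of the paper is that something close to this also holds for the real mapped loci.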

The authors provocatively interpret their results as supporting a model in which the majority of the “missing heritability” lies in a large number of variants with small effect sizes (in line with the model proposed most notably by Peter Visscher and colleagues, though the authors here make no claims about the allele frequencies of the relevant variants). While this seems to be true in yeast, it remains to be seen in humans. It’s of course easy to come up with reasons why this might not hold in humans (our species is a special snowflake, and so forth), but this paper should be in the back of the mind of anyone who is thinking about this problem.

Thoughts on: The date of interbreeding between Neandertals and modern humans.

The following are my (Graham Coop, @graham_coop) brief thoughts on Sriram Sankararaman et al.’s arXived article, “The date of interbreeding between Neandertals and modern humans”. You can read the authors’ guest post here, along with comments by Sriram and others.

Overall it’s a great article, so I thought I’d spend some time talking about the interpretation of the results. Please feel free to comment; our main reason for doing these posts is to facilitate early discussion of preprints.

The authors’ analysis relies on measuring the correlation along the genome between alleles that may have been inherited from the putative admixture event [so-called admixture LD]. The idea is that if there was in fact no admixture, and these alleles have simply been inherited from the common ancestral population (>300kya), then these correlations should be very weak, as there has been plenty of time for recombination to break down the correlation between these markers. If there has been a single admixture event, the rate at which the correlation decays with the genetic distance between the markers is proportional to the time since admixture [i.e. slower decay for a more recent event, as there has been less time for recombination]. These ideas for testing for admixture have been around in the literature for some time [e.g. Machado et al.]; it’s the genome-wide application that is novel.
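The single-pulse logic above can be sketched numerically. This is the generic exponential-decay model for admixture LD, not the exact statistic used in the paper; the admixture time, noise level, and distance grid are arbitrary choices for illustration:

```python
import numpy as np

# Under a single pulse of admixture t generations ago, admixture LD between
# markers at genetic distance d (in Morgans) decays roughly as exp(-t * d).
def expected_decay(d, t, a0=1.0):
    return a0 * np.exp(-t * d)

t_true = 2000                       # e.g. ~50 kya at 25 years/generation (assumed)
d = np.linspace(0.0001, 0.01, 200)  # pairs of markers 0.01-1 cM apart, in Morgans
rng = np.random.default_rng(0)
obs = expected_decay(d, t_true) * np.exp(rng.normal(0, 0.05, d.size))  # noisy "data"

# Recover t from the decay curve with a log-linear fit: log(obs) = log(a0) - t * d
slope, intercept = np.polyfit(d, np.log(obs), 1)
t_hat = -slope
print(f"true t = {t_true}, estimated t = {t_hat:.0f} generations")
```

The slope of the decay on a log scale gives the time (in generations) since the pulse, which is essentially how a date like 37k-86k years can be read off a genome-wide curve once a generation time is assumed.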

As you can tell from the title and abstract of the paper, the authors find pretty robust evidence that this curve is decaying more slowly than we’d expect if there had been no gene flow, and estimate this “admixture time” to be 37k-86k years ago. However, as the authors are careful to note in their discussion, this is not a definitive answer to whether modern humans and Neandertals interbred, nor is this number a definite time of admixture. Obviously the biological implications of the admixture result will get a lot of discussion, so I thought I’d instead spend a moment on these caveats. [This post has run long, so I’ll only get to the first point here, and perhaps return to the rest in a later post.]

Okay so did Neandertals actually mate with humans?

The difficulty [as briefly discussed by the authors] is that we cannot know for sure from this analysis that the time estimated is the time of gene flow from Neandertals, and not some [now extinct] population that is somewhat closer to Neandertals than any modern humans.

Consider the figure below. We would like to say that the cartoon history on the left is true, where gene flow has happened directly from Neandertals into some subset of humans. The difficulty is that the same decay curve could be generated by the scenario on the right, where gene flow has occurred from some other population that shares more of its population history with Neandertals than any current day human population does.

Why is this? Well, allele frequency change that occurred on the red branch [e.g. due to genetic drift] means that the frequencies in population X and in Neandertals are correlated. This means that when we ask questions about correlations along the genome between alleles shared between Neandertals and humans, we are also asking questions about correlations along the genome between population X and modern humans. So under scenario B, I think the rate of decay of the correlation calculated in the paper is a function only of the admixture time of population X with Europeans, and so there may have been no direct admixture from Neandertals into Eurasians*.
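A cartoon simulation of this point (drift approximated as Gaussian noise around the current frequency; the branch-specific drift parameters are arbitrary) shows that two populations sharing the red branch have correlated allele-frequency deviations, while a population with no shared branch does not:

```python
import numpy as np

rng = np.random.default_rng(2)
n_snps = 20000

def drift(p, F):
    """Approximate one branch of drift: Normal(p, F*p*(1-p)), clipped to [0, 1]."""
    return np.clip(p + rng.normal(0, np.sqrt(F * p * (1 - p))), 0, 1)

p_anc = rng.uniform(0.05, 0.95, n_snps)  # ancestral allele frequencies
p_red = drift(p_anc, 0.05)               # shared drift on the "red branch"
neand = drift(p_red, 0.05)               # Neandertals
pop_x = drift(p_red, 0.05)               # population X: shares the red branch
human = drift(p_anc, 0.05)               # modern humans: no shared branch

dev = lambda p: p - p_anc                # frequency deviations from the ancestor
c_nx = np.corrcoef(dev(neand), dev(pop_x))[0, 1]
c_nh = np.corrcoef(dev(neand), dev(human))[0, 1]
print(f"corr(Neandertal, X) = {c_nx:.2f}, corr(Neandertal, human) = {c_nh:.2f}")
```

Because the shared red-branch drift shows up in both Neandertal and population X frequencies, any statistic keyed to "Neandertal-like" alleles also picks out "population-X-like" alleles, which is exactly why the decay curve alone cannot distinguish the two scenarios.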

First things first: that doesn’t diminish how interesting the result is. If the interpretation of the decay as a signal of admixture is correct, then it still means that Eurasians interbred with some ancient human population that was closer to Neandertals than other modern humans are. That seems pretty awesome, regardless of whether that population is Neandertals or some yet undetermined group.

At this point you are likely saying: well, we know that Neandertals existed as a [somewhat] separate population/species, so who is this population X you keep talking about, and where are their remains? Population X could easily be a subset of what we call Neandertals, in which case you’ve been reading all this for no reason [if you only want to know if we interbred with Neandertals]. However, my view is that the next decade of ancient human population history is going to get really interesting. We have already seen this from the Denisovan papers [1,2], and from the work on ancient admixture in Africa (e.g. Hammer et al. 2011, Lachance et al. 2012). We will likely discover a bunch of cryptic, somewhat distinct ancient populations that we’ve previously [rightly] grouped under a relatively small number of labels based on their morphology and timing in the fossil record. We are not going to have names for many of these groups, but with large amounts of genomic data [ancient and modern] we are going to find all sorts of population structure. The question then becomes not an issue of naming these populations, but of understanding the divergence and population genetic relationships among them.

There’s a huge range of (likely more plausible) scenarios that are hybrids between A and B and that I think would pose the same difficulties of interpretation. For example, ongoing low levels of gene flow from population X into the ancestral “population” of modern humans, consistent with us calling population X modern humans [see Figure below, **]. But all of these scenarios likely involve something pretty interesting happening in the past 100,000 years, with some form of contact between Eurasians and a somewhat diverged population.

As I say, the authors, to their credit, take the time in the discussion to point out this caveat; I thought some clarification of why this is the case would be helpful. The tools to address this problem more thoroughly are under development by some of the authors of this paper [Patterson et al. 2012] and others [Lawson et al.]. These tools, along with more sequencing of ancient remains, will help clarify all of this. It is an exciting time for human population genomics!

* I think I’m right in saying that the intercept of the curve at zero genetic distance is the only thing that changes between Fig 1A and Fig 1B.

** Note that in the case shown in Figure 2, I think Sriram et al. are mostly dating the red arrow, not any of the earlier arrows. This is because they condition their subset of alleles to represent introgression into Europeans and to be at low frequency in Africa. We would likely not be able to date the deeper admixture arrow into the ancestor of Eurasians/Africans using the authors’ approach, as [I think] it relies on having a relatively non-admixed population to use as a control.