Author post: Genome Sequencing Highlights the Dynamic Early History of Dogs

This guest post is by Ilan Gronau on Freedman et al. Genome Sequencing Highlights the Dynamic Early History of Dogs. This preprint was one of the most popular on Haldane’s Sieve last year, and was published yesterday in PLoS Genetics.

Earlier this week, our paper entitled “Genome Sequencing Highlights the Dynamic Early History of Dogs” was published in PLoS Genetics. This paper explores central questions about the origin of domestic canines by sequencing and analyzing the complete genomes of six individuals from carefully selected dog and wolf lineages. We posted an earlier version of this manuscript on arXiv, and received quite a lot of comments and questions regarding the methodology we employed for demography inference (many through Haldane’s Sieve). This feedback revealed a growing interest in demography inference methods that use small numbers of complete individual genomes, but it also highlighted the need to examine the strengths and weaknesses of the different methods we employed, and how best to combine them to obtain a unified and robust picture of demographic history. We discuss some of these issues in the revised version published this week, but I thought some of the points were worth spelling out a bit more explicitly, which is the purpose of this post. I’d like to thank Adam Siepel, John Novembre, and Adam Freedman for sharing their insights through the process of writing this post.

The Methods

Our study takes advantage of three recently developed demography inference methods: the Pairwise Sequentially Markovian Coalescent (PSMC; Li and Durbin, 2011), the D statistic, or as it is more commonly referred to, the ABBA/BABA test (Durand et al., 2011), and the Generalized Phylogenetic Coalescent Sampler (G-PhoCS; Gronau et al., 2011). All three methods base their inferences on the genealogical relationships among a relatively small number of individuals, taking advantage of the wealth of genealogical information encoded in individual genomes due to genetic recombination. PSMC, for instance, makes use of the information on changes in coalescent times between the two chromosome copies within a single individual to infer ancestral population sizes. The ABBA/BABA test makes use of asymmetries in genealogies spanning four chromosomes to detect post-divergence gene flow. G-PhoCS jointly considers all individuals using a multi-population coalescent-based demographic model, which includes population divergence times, changes in ancestral population size, and gene flow. A major advantage of G-PhoCS is that it produces a single detailed picture of demographic history, inferred using a unified probabilistic model for all individuals. However, in the interest of computational tractability, the method relies on several simplifying modeling assumptions not required by the other methods. In addition, because they are applied to small subsets of individuals, the PSMC and ABBA/BABA approaches are free to specialize in capturing particular aspects of demographic history (ancestral Ne and admixture, respectively), which G-PhoCS treats more coarsely. Thus, we found all three methods to be complementary and their combination to be particularly informative about the demographic history of these wild and domestic canids.

Inference of Ancestral Population Sizes

Both PSMC and G-PhoCS provide information about ancestral population sizes (Ne). PSMC is specifically designed for this task, and provides a high-resolution trace of changes in ancestral Ne by separately analyzing each diploid genome. Using PSMC, we could detect sharp declines in Ne for wolves and dogs without making any assumptions regarding the canid phylogeny. However, we found that the traces of ancestral Ne inferred by PSMC should be interpreted with care; simulations we conducted showed that the gradual reduction in Ne inferred by PSMC was also consistent with a more recent severe population bottleneck. Eventually, the coarser model inferred by G-PhoCS, in which shortly after the divergence of dogs and wolves the two ancestral populations suffered severe bottlenecks, ended up fitting the data better than the model implied by the PSMC traces. Thus in our case we found the phylogenetic context to be quite useful for dating the major changes in canid population size. Ideally, it should be possible to directly infer PSMC-style traces along the branches of the population phylogeny (alongside inference of divergence times), but this would involve a fairly major methodological undertaking.
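As background for why pairwise coalescent times carry information about ancestral Ne, the standard coalescent result for two lineages can be written down explicitly. This is textbook theory rather than a description of PSMC's actual implementation, which discretizes time and models recombination along the genome; the symbols t (generations before present) and N_e(t) (diploid effective size at that time) are notation assumed here for illustration.

```latex
% Density of the coalescence time t (in generations) for two lineages
% in a diploid population whose effective size N_e(t) varies over time.
\[
f(t) \;=\; \frac{1}{2N_e(t)}\,
\exp\!\left(-\int_0^{t} \frac{ds}{2N_e(s)}\right)
\]
```

Periods with small N_e(t) have a high coalescence rate, so they leave an excess of pairwise coalescent times in that interval.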

Detection of Admixture and Post-Divergence Gene Flow

One of the central findings in our study was that gene flow, particularly between dogs and wolves, played a prominent role in the history of canids. Indeed, we found that several previous claims about dog origins in the Middle East and East Asia were likely influenced by ancient gene flow from wolves to dogs in these regions. The ABBA/BABA test was an obvious choice for a method to detect admixture. It is a fairly simple method, sensitive to even low amounts of gene flow, and robust to assumptions about the demographic history of the populations being tested. By applying this test separately to all sample quartets that include the jackal outgroup, we were able to obtain a good set of candidate ancestral admixture events between dogs and wolves. However, interpreting some of these signals and combining them into a single unified hypothesis was not straightforward, especially since we found signatures for multiple ancestral admixture events among overlapping pairs of populations (e.g., Basenji-Israeli wolf and Boxer-Israeli wolf). G-PhoCS is better suited to deal with this more complex scenario of gene flow, because it jointly analyzes all samples and can consider multiple migration bands in a single analysis. We exploited this feature to find strong evidence of gene flow between wolves and jackals and to show that the signal found for Boxer-Israeli wolf admixture in the ABBA/BABA test was a result of ancestral gene flow from Basenji to Israeli wolf. Still, coming up with a scenario of ancestral gene flow that best fit the data required developing a fairly complex framework for model comparison that involved a combination of multiple separate G-PhoCS runs and comparison with simulated data (see below).
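To make the ABBA/BABA idea concrete, here is a minimal sketch of the D statistic for one quartet (P1, P2, P3, outgroup), following the usual definition D = (nABBA - nBABA) / (nABBA + nBABA). The function name, the input format (four aligned haploid sequences given as strings) and the toy data are assumptions made for illustration; this is not the pipeline used in the paper, and it omits the significance testing (typically a block jackknife) that a real analysis needs.

```python
from collections import Counter


def d_statistic(p1, p2, p3, outgroup):
    """Return (D, n_ABBA, n_BABA) for four aligned haploid sequences.

    The outgroup allele is treated as ancestral ("A"); the other allele
    at a biallelic site is derived ("B").
    """
    counts = Counter()
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Keep only biallelic sites where P3 carries the derived allele.
        if c == o or len({a, b, c, o}) != 2:
            continue
        if a == o and b == c:
            counts["ABBA"] += 1  # P2 shares the derived allele with P3
        elif b == o and a == c:
            counts["BABA"] += 1  # P1 shares the derived allele with P3
    total = counts["ABBA"] + counts["BABA"]
    if total == 0:
        raise ValueError("no informative ABBA/BABA sites")
    d = (counts["ABBA"] - counts["BABA"]) / total
    return d, counts["ABBA"], counts["BABA"]


# Toy alignment (hypothetical data): three ABBA sites and one BABA site.
d, n_abba, n_baba = d_statistic("AAAGAG", "GGGAAG", "GGGGAG", "AAAAAA")
print(d, n_abba, n_baba)
```

Without gene flow, the two patterns are expected to be equally frequent, so D should be close to zero. A clearly positive D in this orientation suggests excess allele sharing between P2 and P3, and a clearly negative D suggests excess sharing between P1 and P3.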

Model Comparison

Addressing subtle questions about the origin of dogs and post-divergence gene flow with wolves required the ability to compare alternative hypotheses for dog domestication in terms of their fit to the data. We did this by considering a collection of plausible topologies for the population phylogeny augmented with different sets of migration bands, and using G-PhoCS to infer demographic parameters for each case. This provided us with a complete demographic model we could associate with each alternative hypothesis and then use to simulate data representing that hypothesis. The hypotheses were then assessed by comparing the simulated data with the real data. While this approach does not constitute a formal model-testing method, it did allow us to explore the space of plausible models in a systematic way and show that the data support a model with a single origin for dogs, one that was ancient and similarly distant from all sampled wolf populations.

Future Development

This study allowed us to closely examine recently developed methods for demography inference and ways of combining them to obtain a unified and robust inference of demographic history. While the different methods used in our study were all shown to be quite powerful, particularly when combined, there is obvious room for improvement. In my view, the most promising developments in this field will come from methods (such as G-PhoCS) that capture all major aspects of the demographic history—divergence times, ancestral population sizes and post-divergence gene flow—in a single analysis. The great advantage of such methods is that they provide a framework for rigorous hypothesis testing and model comparison. In principle, the fully Bayesian nature of G-PhoCS enables this quite naturally through the use of Bayes factors for comparison of different sets of model assumptions. Bayes factors are essentially the relative probabilities of different models given the data. Throughout this study, we experimented with various simple ways of estimating Bayes factors based on the data likelihoods of the MCMC samples generated by G-PhoCS, but we were not able to robustly capture the differences in likelihoods of genealogies sampled for the different hypotheses. Solving this important problem will require additional work, but it is definitely within reach. Another important set of extensions involves using richer models that rely on weaker sets of assumptions. This includes modeling recombination, gradual changes in ancestral population sizes, and more realistic models for gene flow. Progress is being made in these directions as well (see, for example, our recent work on ancestral recombination graph inference), and there is much room for optimism that the next generation of demography inference methods, coupled with emerging genomic data sets, will allow researchers an unprecedented capability to investigate the demographic history of additional species.
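For readers less familiar with Bayes factors, the standard definition is sketched below. The notation (D for the data, M_1 and M_2 for two competing demographic models, θ_i for the unknowns of model i) is assumed here for illustration and is not taken from the paper.

```latex
% Bayes factor: ratio of marginal likelihoods of the two models.
\[
\mathrm{BF}_{12} \;=\; \frac{P(D \mid M_1)}{P(D \mid M_2)},
\qquad
P(D \mid M_i) \;=\; \int P(D \mid \theta_i, M_i)\, P(\theta_i \mid M_i)\, d\theta_i
\]
% Relation to posterior model probabilities.
\[
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} \;=\; \mathrm{BF}_{12} \times \frac{P(M_1)}{P(M_2)}
\]
```

The Bayes factor therefore equals the ratio of posterior model probabilities when the two models are given equal prior probability. Each marginal likelihood averages over everything unknown in the model, which in a genealogy sampler would include the sampled genealogies as well as the demographic parameters, and this is part of what makes it hard to estimate from MCMC output.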


9 thoughts on “Author post: Genome Sequencing Highlights the Dynamic Early History of Dogs”

Pingback: Nibbles: British foods, NBPGR wheat, CIMMYT wheat, Innovation, Ghana cowpea, Nordic grog, Medicinal purposes, Heirloom chocolate, Grünewoche, Gog genomics
Maju on January 18, 2014 at 8:51 am said:

I read the study yesterday but I have some serious concerns:

1. We know for a fact that an ancient dog from Altai, dated to c. 33 Ka BP, is ancestral to at least some modern dogs (Native American breeds), v. Druzhkova 2013: (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0057754). This is in outright contradiction with the results of this paper.

2. We also know that, in autosomal genetics, larger samples are far more informative than longer sequences, v. Ke Hao 2008 (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000109). Instead this study chose the less productive approach: full coverage genomic sequences but with just three samples.

3. Finally the “molecular clock” estimates that the study offers as “conclusions” can only be what they are: mere estimates (quite often far from reality). They can only have some validity when confirmed by other more solid and less merely theoretical data, for example ancient DNA, and in this case this one is clearly not supporting the conclusions (not to mention that there is other evidence of dog domestication deep in the Upper Paleolithic).

- Joe Pickrell on January 18, 2014 at 10:29 am said:
  
  There’s a lot in the study; I think you’re referring specifically to the conclusion about the date of domestication? That is, you’re arguing that the inferred domestication date of ~15 kya does not seem consistent with the ancient DNA results from the 33 kya bone? This seems like a reasonable point.
  
  - Maju on January 18, 2014 at 2:35 pm said:
    
    Exactly: I mean that conclusion. There is more, such as dogs appearing to have both East Asian and West Asian wolf ancestors, which I do not dare to judge but which may well be reasonable, because previous studies contradict each other on this point.
Ilan Gronau on January 20, 2014 at 1:03 am said:

I am glad that this post is promoting discussion of these issues. Let me try to address these concerns.

(1) Genetic dating is always a bit problematic and depends on reliable estimates of mutation rate. Currently, there is a 2-fold difference between the lower and upper bounds on mammalian mutation rate. We have a fairly detailed account of this in the Discussion section of the paper, leading to a conclusion that domestication may have occurred as early as 34 kya. Hence, our findings are not strictly contradictory to the ancient DNA finding of Druzhkova et al (2013). Additionally, while the Altai dog findings are pretty convincing, the genetic analysis is of a single line of maternal ancestry (through mtDNA). Druzhkova et al (2013) admit in their paper that analysis of more loci is needed to conclusively address the issue of date and source population. I think we all agree that ancient DNA will play a very important role in addressing these issues.

(2) You may ask why we chose to base our main set of estimates on a single mutation rate, which is in the upper part of the plausible range (1.4e-8 per generation, including CpG mutations). The main reason for that was to allow easy comparison with genetic estimates from other recent studies. All of these studies assumed a single mutation rate, which was in most cases compatible with the one we assumed (with the exception of Wang et al, 2013, which we specifically note in that discussion paragraph). A secondary reason was that we wanted our main set of estimates to be conservative with respect to the hypothesis of pre-agriculture domestication, which is why we preferred a high mutation rate to a lower one. With more data accumulating, I think we’ll eventually be able to narrow down the mutation rate issue, and we’re likely to find out that the actual rates are lower than commonly assumed for dogs (lower than 1e-8 per generation). The main objective of our analysis, in terms of dating, was to address the other sources of uncertainty (sparse data, assumptions on gene flow, etc.), which were pretty overwhelming in previous studies.

(4) Regarding the issue of “more loci” vs. “more samples”, this is where I have to disagree with you. First of all, Hao et al (2008) address this issue in the context of association mapping, which is quite a different story from demography reconstruction. We examined power issues quite thoroughly in the G-PhoCS paper (Gronau et al, 2011 http://www.nature.com/ng/journal/v43/n10/full/ng.937.html), and found that the number of loci you get from complete genome sequences (tens of thousands) gives you very accurate estimation of demographic parameters (population sizes and divergence times) even from very small sample sizes. In fact, if you’re interested in reconstructing deep demography in the range of dates we were interested in, then more loci give you more power than more samples. The coalescent-based intuition behind this is that additional samples are likely to coalesce with one of the current samples relatively quickly, and those recent coalescent events are not informative about deep demography. In contrast, an additional unlinked locus is more likely to contain “new” deep coalescent events.
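The coalescent intuition in point (4) can be illustrated with a small calculation. The sketch below assumes the simplest possible setting, a single randomly mating population of constant size under the standard (Kingman) coalescent, which is far simpler than the multi-population models discussed in the post; the function name is hypothetical. With time measured in units of 2N generations, the expected time during which k lineages remain is 2 / (k(k-1)), so the expected time to the most recent common ancestor of n sampled chromosomes is 2(1 - 1/n).

```python
from fractions import Fraction


def expected_tmrca(n):
    """Expected time to the MRCA of n sampled chromosomes.

    Time is in units of 2N generations, assuming one constant-size,
    randomly mating population (standard coalescent).
    """
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    # While k lineages remain, the expected waiting time is 2 / (k * (k - 1)).
    return sum(Fraction(2, k * (k - 1)) for k in range(2, n + 1))


for n in (2, 6, 100):
    depth = expected_tmrca(n)
    # The final interval (two lineages left) has expectation 1 for any n,
    # so everything above 1 is contributed by more recent coalescences.
    print(n, float(depth), float(depth - 1))
```

Under these assumptions the expected depth of a genealogy can never exceed 2 however many chromosomes are sampled, and most of that limit is already reached with a handful of samples. Each additional unlinked locus, by contrast, contributes a new, independently drawn genealogy with its own deep coalescent events.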

- Maju on January 21, 2014 at 7:09 am said:
  
  Thanks for your reply, Ilan. Regarding point #4 (shouldn’t it be #3?), I accept your point of view because my knowledge is indeed too limited in this aspect, so I have to bow to what seems a deeper understanding than my own.
  
  However re. points #1 and #2, I must underline that, in my understanding, there is a generalized problem in population genetics scholarship (barring the rare exception) of “endogamic feedback” so to say or, as I often put it: scholasticism. You admit to it when you say: “The main reason for that was to allow easy comparison with genetic estimates from other recent studies”. So what really matters is not that the age estimates are realistic but that they approximate or follow the same methodologies as other studies, the vast majority of which fall into the same serious problem of underestimating realistic ages (when contrasted with archaeological and paleontological data). This applies to Pan-Homo split estimates, as to the “out of Africa” ones, almost invariably way too recent when contrasted with the latest material findings (Sahelanthropus for example in the first case, but also the age estimates of Langergraber 2012, or the recent piling up of archaeological and fossil human data strongly supporting a H. sapiens migration of c. 100 Ka ago to South and East Asia, after maybe a 25 Ka hiatus in Arabia/Palestine).
  
  This is about dogs, not humans, but the same systemic problems seem to be present. I really think that “molecular chronology” needs a strong revision – and for that researchers need to depart from scholastic “endogamous feedback”.
  
Pingback: Wolf/dog genomes paper out | jnpopgen
Pingback: Most viewed on Haldane’s Sieve: January 2014 | Haldane's Sieve
Pingback: Sifting through 2014 on Haldane’s Sieve | Haldane's Sieve
