This guest post is by Ilan Gronau on Freedman et al., “Genome Sequencing Highlights the Dynamic Early History of Dogs.” This preprint was one of the most popular on Haldane’s Sieve last year, and was published yesterday in PLoS Genetics.
Earlier this week, our paper entitled “Genome Sequencing Highlights the Dynamic Early History of Dogs” was published in PLoS Genetics. The paper explores central questions about the origin of domestic canines by sequencing and analyzing the complete genomes of six individuals from carefully selected dog and wolf lineages. We posted an earlier version of the manuscript on arXiv and received quite a lot of comments and questions about the methodology we employed for demography inference (many through Haldane’s Sieve). This feedback revealed a growing interest in demography inference methods that use small numbers of complete individual genomes, but it also highlighted the need to examine the strengths and weaknesses of the different methods we employed, and how best to combine them to obtain a unified and robust picture of demographic history. We discuss some of these issues in the revised version published this week, but I thought some of the points were worth spelling out more explicitly, which is the purpose of this post. I’d like to thank Adam Siepel, John Novembre, and Adam Freedman for sharing their insights while I was writing this post.
The Methods
Our study takes advantage of three recently developed demography inference methods: the Pairwise Sequential Markovian Coalescent (PSMC; Li and Durbin, 2011); the D statistic, more commonly referred to as the ABBA/BABA test (Durand et al., 2011); and the Generalized Phylogenetic Coalescent Sampler (G-PhoCS; Gronau et al., 2011).
All three methods base their inferences on the genealogical relationships among a relatively small number of individuals, taking advantage of the wealth of genealogical information encoded in individual genomes due to genetic recombination. PSMC, for instance, uses the changes in coalescent times along the genome between the two chromosome copies within a single individual to infer ancestral population sizes. The ABBA/BABA test uses asymmetries in genealogies spanning four chromosomes to detect post-divergence gene flow. G-PhoCS jointly considers all individuals using a multi-population coalescent-based demographic model, which includes population divergence times, changes in ancestral population size, and gene flow. A major advantage of G-PhoCS is that it produces a single detailed image of demographic history, inferred using a unified probabilistic model for all individuals. However, in the interest of computational tractability, the method relies on several simplifying modeling assumptions not required by the other methods. In addition, because they are restricted to subsets of individuals, the PSMC and ABBA/BABA approaches are free to specialize in capturing particular aspects of demographic history (ancestral Ne and admixture, respectively), which G-PhoCS treats more coarsely. Thus, we found all three methods to be complementary and their combination to be particularly informative about the demographic history of these wild and domestic canids.
Inference of Ancestral Population Sizes
Both PSMC and G-PhoCS provide information about ancestral population sizes (Ne). PSMC is specifically designed for this task, and provides a high-resolution trace of changes in ancestral Ne by separately analyzing each diploid genome. Using PSMC, we could detect sharp declines in Ne for wolves and dogs without making any assumptions regarding the canid phylogeny. However, we found that the traces of ancestral Ne inferred by PSMC should be interpreted with care; simulations we conducted showed that the gradual reduction in Ne inferred by PSMC was also consistent with a more recent severe population bottleneck. Ultimately, the coarser model inferred by G-PhoCS, in which the two ancestral populations suffered severe bottlenecks shortly after the divergence of dogs and wolves, fit the data better than the model implied by the PSMC traces. Thus, in our case, we found the phylogenetic context to be quite useful for dating the major changes in canid population size. Ideally, it should be possible to directly infer PSMC-style traces along the branches of the population phylogeny (alongside inference of divergence times), but this would involve a fairly major methodological undertaking.
Detection of Admixture and Post-Divergence Gene Flow
One of the central findings in our study was that gene flow, particularly between dogs and wolves, played a prominent role in the history of canids. Indeed, we found that several previous claims about dog origins in the Middle East and East Asia were likely influenced by ancient gene flow from wolves to dogs in these regions. The ABBA/BABA test was an obvious choice for a method to detect admixture. It is a fairly simple method, sensitive to even low amounts of gene flow, and robust to assumptions about the demographic history of the populations being tested. By applying this test separately to all sample quartets that include the jackal outgroup, we were able to obtain a good set of candidate ancestral admixture events between dogs and wolves. However, interpreting some of these signals and combining them into a single unified hypothesis was not straightforward, especially since we found signatures for multiple ancestral admixture events among overlapping pairs of populations (e.g., Basenji-Israeli wolf and Boxer-Israeli wolf). G-PhoCS is better suited to deal with this more complex scenario of gene flow, because it jointly analyzes all samples and can consider multiple migration bands in a single analysis. We exploited this feature to find strong evidence of gene flow between wolves and jackals and to show that the signal found for Boxer-Israeli wolf admixture in the ABBA/BABA test was a result of ancestral gene flow from Basenji to Israeli wolf. Still, coming up with a scenario of ancestral gene flow that best fit the data required developing a fairly complex framework for model comparison that involved a combination of multiple separate G-PhoCS runs and comparison with simulated data (see below).
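To make the ABBA/BABA idea concrete, here is a minimal sketch of how the D statistic can be computed for one quartet. It is an illustration under simplifying assumptions, not the pipeline used in the paper: it takes one aligned haploid sequence per population, treats the outgroup allele as ancestral, and uses made-up toy sequences. The function name `d_statistic` and its argument names are hypothetical. Genome-scale analyses typically also assess significance (for example with a block jackknife), which this sketch does not attempt.

```python
import math


def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA sites and return (abba, baba, D).

    Assumes the tree ((P1, P2), P3), Outgroup) and four aligned haploid
    sequences of equal length. The outgroup allele is treated as ancestral.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Skip sites with missing or ambiguous calls.
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue
        # Only biallelic sites are informative.
        if len({a, b, c, o}) != 2:
            continue
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele
    total = abba + baba
    d = (abba - baba) / total if total else math.nan
    return abba, baba, d


# Toy sequences, invented for illustration only.
p1 = "AATAAC"
p2 = "GGCAAT"
p3 = "GGTAGT"
out = "AACAAC"
abba, baba, d = d_statistic(p1, p2, p3, out)
print(abba, baba, d)
```

With no gene flow, ABBA and BABA sites are expected to be equally frequent, so D should be close to zero. In this sign convention, an excess of ABBA sites (D > 0) points to gene flow between P2 and P3, and an excess of BABA sites (D < 0) to gene flow between P1 and P3.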
Model Comparison
Addressing subtle questions about the origin of dogs and post-divergence gene flow with wolves required the ability to compare alternative hypotheses for dog domestication in terms of their fit to the data. We did this by considering a collection of plausible topologies for the population phylogeny augmented with different sets of migration bands, and using G-PhoCS to infer demographic parameters for each case. This provided us with a complete demographic model we could associate with each alternative hypothesis and then use to simulate data representing that hypothesis. The hypotheses were then assessed by comparing the simulated data with the real data. While this approach does not constitute a formal model-testing method, it did allow us to explore the space of plausible models in a systematic way and show that the data support a model with a single origin for dogs, one that was ancient and similarly distant from all sampled wolf populations.
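The general shape of such a simulation-based comparison can be sketched as follows. This is only a schematic, not the procedure from the paper: the post does not specify which summary statistics or which measure of fit were used, so the root-mean-square z-score below, the function names, the hypothesis labels, and all the numbers are assumptions made up for illustration.

```python
import math
from statistics import mean, stdev


def fit_score(observed, replicates):
    """Root-mean-square z-score of observed statistics against simulations.

    `observed` is a list of summary statistics from the real data, and
    `replicates` is a list of such lists, one per simulated data set
    (at least two replicates are needed to estimate a standard deviation).
    Lower scores mean the simulations resemble the real data more closely.
    """
    squared = []
    for i, obs in enumerate(observed):
        column = [rep[i] for rep in replicates]
        mu, sd = mean(column), stdev(column)
        if sd == 0:
            z = 0.0 if obs == mu else math.inf
        else:
            z = (obs - mu) / sd
        squared.append(z * z)
    return math.sqrt(sum(squared) / len(squared))


def rank_hypotheses(observed, simulations):
    """Return (score, name) pairs sorted from best to worst fit."""
    return sorted((fit_score(observed, reps), name)
                  for name, reps in simulations.items())


# Toy numbers, invented for illustration only.
observed = [0.10, 0.30]
simulations = {
    "single_origin": [[0.09, 0.31], [0.11, 0.29], [0.10, 0.30]],
    "two_origins": [[0.20, 0.10], [0.22, 0.12], [0.21, 0.11]],
}
ranking = rank_hypotheses(observed, simulations)
print(ranking)
```

A ranking like this is descriptive rather than a formal test: it says which fitted model reproduces the chosen statistics best, not how much more probable one hypothesis is than another.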
Future Development
This study allowed us to closely examine recently developed methods for demography inference and ways of combining them to obtain a unified and robust inference of demographic history. While the different methods used in our study were all shown to be quite powerful, particularly when combined, there is obvious room for improvement. In my view, the most promising developments in this field will come from methods (such as G-PhoCS) that capture all major aspects of demographic history (divergence times, ancestral population sizes, and post-divergence gene flow) in a single analysis. The great advantage of such methods is that they provide a framework for rigorous hypothesis testing and model comparison. In principle, the fully Bayesian nature of G-PhoCS enables this quite naturally through the use of Bayes factors for comparison of different sets of model assumptions. A Bayes factor is the ratio of the marginal likelihoods of the data under two models; when the models are given equal prior probability, it equals the ratio of their posterior probabilities given the data. Throughout this study, we experimented with various simple ways of estimating Bayes factors based on the data likelihoods of the MCMC samples generated by G-PhoCS, but we were not able to robustly capture the differences in likelihoods of genealogies sampled for the different hypotheses. Solving this important problem will require additional work, but it is definitely within reach. Another important set of extensions involves using richer models that rely on weaker sets of assumptions. This includes modeling recombination, gradual changes in ancestral population sizes, and more realistic models for gene flow. Progress is being made in these directions as well (see, for example, our recent work on ancestral recombination graph inference), and there is much room for optimism that the next generation of demography inference methods, coupled with emerging genomic data sets, will give researchers an unprecedented capability to investigate the demographic history of additional species.
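For readers unfamiliar with estimating Bayes factors from MCMC output, the sketch below shows one well-known simple estimator, the harmonic mean of the sampled likelihoods. It is offered purely as an illustration of the kind of calculation involved; I am not claiming it is one of the estimators we tried, and the function names and numbers are invented. The harmonic mean estimator is known to be unstable because it is dominated by the lowest-likelihood samples, which is one reason robust Bayes factor estimation is hard in practice.

```python
import math


def log_marginal_harmonic(log_likelihoods):
    """Harmonic mean estimate of the log marginal likelihood.

    Takes log-likelihoods of posterior MCMC samples and returns
    log(N) - logsumexp(-log_likelihoods), computed stably in log space.
    """
    n = len(log_likelihoods)
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(n) - log_sum


def log_bayes_factor(log_liks_a, log_liks_b):
    """Log Bayes factor of model A over model B (positive favors A)."""
    return log_marginal_harmonic(log_liks_a) - log_marginal_harmonic(log_liks_b)


# Toy log-likelihood traces, invented for illustration only.
model_a = [-1000.2, -1001.5, -999.8, -1000.9]
model_b = [-1004.1, -1003.7, -1005.0, -1004.4]
print(log_bayes_factor(model_a, model_b))
```

More stable alternatives, such as thermodynamic integration or stepping-stone sampling, require additional MCMC runs at intermediate "temperatures" rather than post-processing of a single posterior sample.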