Inference of Ancestral Recombination Graphs
As the title indicates, our paper is about the problem of inferring an “ancestral recombination graph,” or ARG, from sequence data. This is a topic that may strike many readers as impenetrably obscure and technical, so I will first try to explain, in plain language, what the ARG describes and why it has so much potential to be useful in many kinds of genetic analysis. Then, I will tell the story of how I and members of my research group have become increasingly fascinated by this problem over the years, how we have struggled with it, and how we finally achieved the conceptual breakthrough that is described in our paper. As will become evident, Matt Rasmussen, a former postdoc in the group and lead author of our paper, was central in this achievement.
What is the ARG?
The ARG is an elegantly simple yet superbly rich data structure that describes the complete evolutionary history of a collection of genetic sequences drawn from individuals in one or more populations. It was invented in the mid 1990s by the mathematicians Bob Griffiths and Paul Marjoram. The ARG captures essentially all evolutionary information relevant for genetic analysis of such sequences. Statisticians say that it fully defines the “correlation structure” of the sequences, meaning that it explains most similarities and differences among the sequences in terms of their patterns of shared ancestry.
The ARG is something like a family tree, only richer, because it not only defines the relationships among individuals, but it also traces the histories of specific segments of DNA sequences. For example, if you were to replace your family tree with an ARG, you could tell exactly which pieces of your genome came from your eccentric great grandmother and which pieces you share with your charming, intelligent, and handsome third cousin.
Now, this is obviously all a bit vague and superficial. What does the ARG actually describe? First, the ARG is a “graph” in the mathematical sense, meaning that it consists of a set of “nodes” and a set of “edges” between nodes. Traditionally, nodes and edges are depicted with circles and connecting lines, respectively, making graphs look a bit like those molecular modeling kits you may have used in your high school chemistry class. In the case of the ARG, there are two types of nodes, representing two key classes of events in the history of the sequences: (i) recombination nodes, which describe how a sequence is assembled by concatenating two ancestral sequences when DNA is mixed-and-matched during the production of sperm and egg cells or other gametes (meiosis); and (ii) coalescence nodes, which describe how two sequences trace back to a common ancestor. Edges between these nodes represent lines of descent, or lineages, for segments of DNA. Each node is annotated with the time of the corresponding event. In addition, for reasons that will become apparent in a moment, recombination nodes are annotated with the position along the sequences at which the ancestral sequences were cut and then joined together.
The significance of recombinations and coalescences comes from the fact that these are the two ways in which lineages can join or split over time. The best way to understand them is to think about the behavior of lineages as one looks backward in time. The graph is typically laid out with time on the vertical axis, so that the bottom of the graph represents the present time and each node is assigned a height above this baseline indicating the time before the present at which the associated event occurred. Therefore, to look backward in time, we look upward in the graph. As we do so, we see that recombination events cause a single lineage to split into two ancestral lineages (representing the two sequence fragments that were joined together by the recombination in forward time), and coalescence events cause two lineages to join into one. Therefore, recombination nodes have one edge coming in and two going out, and coalescence nodes have two edges coming in and one going out. One way of thinking about it is that, given a fragment of modern DNA, recombinations have the effect of increasing its set of ancestors, while coalescences have the effect of decreasing its set of ancestors.
All of this joining and splitting results in a complex network that, to the uninitiated, might vaguely resemble a map of the London Tube. For the ARG-literate, however, this graph is richly informative about evolutionary history. The key to interpreting the graph is to make use of the positional information associated with the recombination nodes (the positions at which the recombinations occurred). For any position along the sequence, one can extract an evolutionary tree, or genealogy, for the samples in question, simply by following their lineages upward on the page and taking a left or a right turn at each recombination node based on whether the genomic position in question falls to the left or the right, respectively, of the recombination event. It is easy to see that a subgraph extracted in this way will contain only joins (coalescences) and no splits, and must eventually reduce to a single lineage, and therefore will define a tree. Thus, the ARG has embedded within it a “local tree,” or genealogy, for every position in the sequence!
I have to pause here to say that I find this property of the ARG incredibly beautiful. In a highly compact manner, the graph manages to capture both a complete record of all recombination events and a genealogy for every position in the sequence, and it does so in a way that shows how neighboring genealogies can be reconciled in terms of historical recombination events! Figure 1 of our paper illustrates it with a simple ARG for four sequences. Don’t be confused by the absence of explicit nodes (circles) in the diagram. By tradition, nodes are not drawn as circles in ARGs, but are simply understood to be the points at which edges meet.
It is worth emphasizing that this representation is very general. It can be used to describe the history of a particular gene of interest in individuals from a single well-defined population, or the history of whole genomes (with one ARG per chromosome) for individuals from many diverse populations. It can even be used to describe the histories of sequences from representatives of different species, such as humans, chimpanzees, and gorillas. As long as the sequences in question are orthologous and collinear—meaning, essentially, that they are derived from a common ancestral sequence in the absence of duplication and rearrangement events—then the coalescence and recombination events defined by the ARG are sufficient for describing precisely how the sequences derive from their common ancestor, and, hence, how they are correlated with each other.
Why is the ARG useful?
Many biologists are familiar with the value of “tree-thinking” in understanding evolutionary relationships. For example, a diagram of a tree intuitively and precisely captures the fact that humans and chimpanzees are more closely related to each other than either humans or chimpanzees are to rhesus macaques. Similarly, it explains why even neutrally evolving sequences (“junk DNA,” if the term is not too loaded) will tend to be more similar between humans and chimpanzees, and between rats and mice, than between humans and mice (this is essentially what we mean by “correlation structure”). Against the background of neutral evolution, trees help us to identify genes and noncoding functional elements, to detect evolutionarily conserved and adaptively evolving sequences, to date speciation events, and so on.
The problem with trees on population genetic time scales, however, is that they change along the sequence, due to recombination. As noted above, the ARG precisely describes these trees and the way they change. Therefore, it enables tree-thinking with population genetic data.
Viewing population genetics in terms of the ARG can clarify one’s thinking about many problems of interest. For instance, the ARG makes it clear that divergence times for genetically isolated populations can be estimated by looking across the ARG for the most recent coalescences that cross population boundaries. Similarly, given an estimated divergence time, the rate of gene flow or migration between populations can be estimated, in a fairly straightforward manner, in terms of the rates of inter-population coalescence events across the ARG. Ancestral effective population sizes can be estimated from the density of coalescence events in the ARG over time. Signatures of natural selection, including hitchhiking and background selection, can be detected by various kinds of local distortion of the ARG. In general, the ARG provides a unifying framework for the field, and many challenging statistical problems in population genetics can properly be seen as problems of revealing relevant features of the ARG.
What would a reconstructed ARG mean in practical terms? First, I should be clear that we have no intension of actually drawing an ARG for dozens of complete human genome sequences. Such a drawing would be far too large and complex to be useful. Rather, the value of a reconstructed ARG is as a rich data structure that could be interrogated for many features of interest, such as local trees, recombination events, mutation ages, or regions of identity by descent. Because these features would be derived from a unified description of the evolutionary history of the sample, they would be guaranteed to be internally consistent, unlike ones based on simpler estimators. In this way, the ARG would be useful in many problems of interest in statistical genetics, ranging from demography inference (e.g., estimation of population divergence times or rates of gene flow between populations), to the detection of regions influenced by natural selection, to the detection of genotype/phenotype associations.
Why is it so difficult to find a good ARG?
In practice, most population geneticists do not work with ARGs, but instead work with surrogates such as principle components, site frequency spectra, and spectra of identity by descent. The reason people work with these simpler, lower-dimensional summaries of genetic data, of course, is that explicit ARG reconstruction is forbiddingly difficult. From a statistical and computational perspective, there are two major issues in reconstructing the ARG. First, the problem of searching all possible ARGs for one that best fits the data is computationally intractable, even in a restricted, parsimony-based formulation of the problem (it belongs to the class of problems computer scientists call “NP-hard”). Second, and perhaps more importantly, in most cases of interest there is simply not enough information in the data to reconstruct a single ARG with high confidence. Rather, in general, a large family of ARGs will be more or less equally compatible with observed sequences.
For these reasons, it would be misleading to suggest that there is any hope of producing a magical computer program that will allow the user to input a collection of sequences and obtain the true ARG for those sequences as output. Instead, we must consider many possible ARGs, weighting them in some way by their plausibility. In other words, we must consider a statistical distribution of ARGs given the data.
Because of the awkwardness of the space of ARGs (each ARG is a complex, combinatorial object, difficult to summarize in terms of low-dimensional features), we and others have come to the conclusion that the best way to get at these distributions is by making use of statistical sampling methods. In our case, we use an approach, called Markov chain Monte Carlo (MCMC), that chooses samples that are guaranteed to be representative of the distribution of ARGs given the data and the model, provided the sampling program is run long enough. After collecting fairly large numbers of samples, we can make useful statements about general features of the ARG even if we have limited confidence in each individual sample. For example, the average of the times to most recent common ancestry (TMRCA) in the sampled ARGs at a particular position along the sequence can be used as an estimator of the true TMRCA at that position. We show that our methods can be used to summarize various useful features of this kind, including recombination and coalescence rates, and the ages of mutations that are polymorphic in the sample, as well as TMRCAs.
How we became interested in the problem of ARG inference—or, the evolution of an obsession
The ARG seems to have a unique ability to inspire obsession in a particular kind of mathematically inclined geneticist. As discussed above, it is potentially richly informative but extremely difficult to obtain, which makes it a natural “Holy Grail” for evolutionary genetics. At the same time, the ARG is fairly simple to describe, it is straightforward to simulate, and sequence data are strongly informative about many of its features, making solutions to the inference problem seem tantalizingly within reach.
I first encountered the ARG and the problem of ARG inference in 2005 when I picked up the book Gene genealogies, variation, and evolution: a primer in coalescent theory, by Hein, Schierup, and Wiuf. The authors of this book had obviously been bitten by the ARG bug (in fact, the lead author, Jotun Hein had already been working on variations on the ARG inference problem for more than 15 years) and I found their enthusiasm contagious. At the time I was primarily a phylogeneticist, new to population genetics, so the book was a perfect introduction to the topic for me. It is written largely from a phylogenetic perspective with a strong emphasis on combinatorial analysis and algorithms and without a lot of intimidating population genetics jargon. Not long afterward I got to know Jotun Hein at leisurely meeting in Barbados and became further interested in the topic through my conversations with him.
As time went on, and I attended more and more talks on population genetics (they are hard to avoid at Cornell!), I became increasingly convinced that the ARG provided the right framework for thinking about a wide variety of problems in population genetics. While population mixture models, principle components analysis, copy-and-paste models, models of the site-frequency spectrum, and similar approaches clearly have their place, for a committed tree thinker like me, there is something deeply unsatisfying about the use of these techniques in evolutionary modeling and inference. I had a strong feeling that it must be possible to do better by more directly modeling the true evolution process that gives rise to the observed sequences.
In 2009, I was invited to write a review article on primate comparative genomics for Genome Research. This exercise led me to carry out a broad review of the literature and reflect on many of the problems of current interest in population genetics. Perhaps inevitably, the ARG ended up playing a central role in my discussion of the field. Not many people have cited this rather long-winded review but the act of writing it was extremely useful in clarifying my thinking about population genetics and the ARG.
Soon afterward, I began to try to interest students and postdocs in tackling the problem of explicit ARG inference in earnest, with an eye toward applications involving dozens or possibly hundreds of complete genomes. I had a hunch that it should be possible to improve on existing methods by making carefully selected approximations and drawing from our standard bag of tricks for probabilistic modeling. I did manage to convince a couple of students to work on the problem but progress was slow and limited, and they moved onto other problems before obtaining publishable results.
When Matt Rasmussen joined the group as a postdoc in 2011, he jumped at the chance to tackle the ARG inference problem. It quickly became clear that Matt was an ideal match for the problem, given his background in molecular evolution and phylogenetics, his strong skills in algorithms and probabilistic modeling, and his programming talents. He made early progress by setting up a forward simulator for ARGs, studying the properties of simulated ARGs in the presence of selection, and developing intuition about which features in the data were most informative for inference.
Still, the inference problem proved challenging. Matt was focusing on an approach I had suggested involving the use of a small set of “representative genealogies” that were selected to span, as effectively as possible, the (infinitely large) space of all genealogies. His plan was to use these genealogies to define a phylogenetic hidden Markov model, which could be used for approximate inference. He had devised a number of clever heuristics for selecting these genealogies, for example, based on the collection of “splits” (bipartitions of sequences) implied by informative sites in the data, but it gradually became clear that this approach was limited in effectiveness and had poor scaling properties. It was just too hard to pick a good set of representative genealogies. Almost a year into the project, we faced a choice between settling for an approach that was clearly suboptimal or going back to the drawing board and beginning anew.
Sequence “threading”—exaptation of an idea
Our breakthrough came some time during the winter of 2012. I remember the occasion vividly. It was one of those rare moments when the outlines of a solution to a difficult puzzle suddenly come into focus at all once—as close to a mythical “eureka” moment as I have experienced.
Matt and I were reviewing possible solutions to the problem of selecting representative genealogies, when I was reminded of an idea I had discussed with a student a couple of years earlier. Could there be a way to build the ARG up one sequence at a time, by “threading” an nth sequence into an ARG of size n–1? This idea had intuitive appeal but the previous student and I had never been able to make it work with full genealogies. However, Matt immediately picked up the idea and ran with it. It fit naturally with the way he was representing the local genealogies and thinking about how they were related and he guessed that it ought to be possible to think of it as a problem of adding one branch to each local tree, position by position along the sequence.
As we discussed further how one might set up a solution to the threading problem, I was suddenly struck by a parallel to a problem my group had worked on several years earlier. That problem had to do with gene prediction in a multi-species setting, where the exon-intron boundaries for genes were allowed to differ from one species to the next. The approach we had settled on was to repeatedly sample the gene annotations for one species conditional on those for all of the others, using a probabilistic model called a phylogenetic hidden Markov model. It turned out that this sampling problem could be elegantly solved by dynamic programming, using an algorithm for hidden Markov models known as the stochastic traceback algorithm.
Matt’s threading problem, it was evident, had the same essential structure, although it differed in many details. Therefore, it ought to be possible to carry out the threading operation exactly, in time proportional to the sequence length, using the stochastic traceback algorithm. Furthermore, it ought to be possible to structure this operation so that repeated applications would guarantee that we sampled from the true distribution of ARGs given the data and our model (i.e., so that we had a proper Gibbs sampler, in statistical jargon). It was immediately clear to both of us that this was the right path forward.
As usual, there was a good deal of work in taking these ideas from white-board scratchings to a fully defined model and a working implementation. Not least among these was the formidable task of generalizing the threading method to multiple sequences, which proved necessary for good MCMC convergence properties with more than a few sequences. Matt was exceptionally diligent, resourceful, and creative in working through all of these steps, and eventually demonstrated that the method performs remarkably well, perhaps better than either of us anticipated. Nevertheless, it was in that one exciting discussion about the threading operation that the project came into focus, making it clear that we had fundamentally new way of tackling this longstanding problem that was worth the additional months that would be needed to see it through to a full-fledged implementation.
Most evolutionary biologists are familiar with the concept of “exaptation,” popularized by Stephen Jay Gould and Elizabeth Vrba. Exaptation refers to the process by which evolution takes a phenotypic characteristic developed for one purpose, and co-opts it for use in another way. Examples include the gas bladders used for buoyancy in some fish (derived from lungs) and the use of bird feathers for flight (thought to have evolved for temperature regulation). The phenomenon, of course, is also common in the evolution of ideas. To me, the threading idea was a perfect example of an algorithmic exaptation, the reuse of an idea developed for one problem in evolutionary sequence analysis and co-opted for use in another. In this case, the threading operation proved considerably more powerful and useful for ARG inference than for gene prediction.
In this post, I have tried to convey some of my excitement about genome-wide ARG inference, a topic that might strike the casual reader as dry and obscure. Space does not allow for a discussion of all of the possible ways in which we are thinking of making use of Matt’s ARG sampling program (called ARGweaver), but our manuscript does include a fairly lengthy Discussion section, which lays out some of our ideas for future work. Clearly there is much still to be done, but I am convinced that this is going to be an enormously powerful tool for population genomic analysis.
Finally, I should say that, while I have focused on our own efforts in this post, we are far from the only researchers actively working on this topic. As described in our paper, there is a great deal of closely related recent work by Yun Song, Richard Durbin, Thomas Mailund, Asger Hobolth, Mikkel Schierup, and others. In short, there is no shortage of opportunities for exaptation of ideas in this vibrant field.