This guest post is by Adam Siepel (@asiepel) on his paper with Matthew Rasmussen (@mattrasmus): Rasmussen and Siepel “Genome-wide inference of ancestral recombination graphs” , arXived here.
Inference of Ancestral Recombination Graphs
As the title indicates, our paper is about the problem of inferring an “ancestral recombination graph,” or ARG, from sequence data. This is a topic that may strike many readers as impenetrably obscure and technical, so I will first try to explain, in plain language, what the ARG describes and why it has so much potential to be useful in many kinds of genetic analysis. Then, I will tell the story of how I and members of my research group have become increasingly fascinated by this problem over the years, how we have struggled with it, and how we finally achieved the conceptual breakthrough that is described in our paper. As will become evident, Matt Rasmussen, a former postdoc in the group and lead author of our paper, was central in this achievement.
What is the ARG?
The ARG is an elegantly simple yet superbly rich data structure that describes the complete evolutionary history of a collection of genetic sequences drawn from individuals in one or more populations. It was invented in the mid 1990s by the mathematicians Bob Griffiths and Paul Marjoram. The ARG captures essentially all evolutionary information relevant for genetic analysis of such sequences. Statisticians say that it fully defines the “correlation structure” of the sequences, meaning that it explains most similarities and differences among the sequences in terms of their patterns of shared ancestry.
The ARG is something like a family tree, only richer, because it not only defines the relationships among individuals, but it also traces the histories of specific segments of DNA sequences. For example, if you were to replace your family tree with an ARG, you could tell exactly which pieces of your genome came from your eccentric great grandmother and which pieces you share with your charming, intelligent, and handsome third cousin.
Now, this is obviously all a bit vague and superficial. What does the ARG actually describe? First, the ARG is a “graph” in the mathematical sense, meaning that it consists of a set of “nodes” and a set of “edges” between nodes. Traditionally, nodes and edges are depicted with circles and connecting lines, respectively, making graphs look a bit like those molecular modeling kits you may have used in your high school chemistry class. In the case of the ARG, there are two types of nodes, representing two key classes of events in the history of the sequences: (i) recombination nodes, which describe how a sequence is assembled by concatenating two ancestral sequences when DNA is mixed-and-matched during the production of sperm and egg cells or other gametes (meiosis); and (ii) coalescence nodes, which describe how two sequences trace back to a common ancestor. Edges between these nodes represent lines of descent, or lineages, for segments of DNA. Each node is annotated with the time of the corresponding event. In addition, for reasons that will become apparent in a moment, recombination nodes are annotated with the position along the sequences at which the ancestral sequences were cut and then joined together.
The significance of recombinations and coalescences comes from the fact that these are the two ways in which lineages can join or split over time. The best way to understand them is to think about the behavior of lineages as one looks backward in time. The graph is typically laid out with time on the vertical axis, so that the bottom of the graph represents the present time and each node is assigned a height above this baseline indicating the time before the present at which the associated event occurred. Therefore, to look backward in time, we look upward in the graph. As we do so, we see that recombination events cause a single lineage to split into two ancestral lineages (representing the two sequence fragments that were joined together by the recombination in forward time), and coalescence events cause two lineages to join into one. Therefore, recombination nodes have one edge coming in and two going out, and coalescence nodes have two edges coming in and one going out. One way of thinking about it is that, given a fragment of modern DNA, recombinations have the effect of increasing its set of ancestors, while coalescences have the effect of decreasing its set of ancestors.
All of this joining and splitting results in a complex network that, to the uninitiated, might vaguely resemble a map of the London Tube. For the ARG-literate, however, this graph is richly informative about evolutionary history. The key to interpreting the graph is to make use of the positional information associated with the recombination nodes (the positions at which the recombinations occurred). For any position along the sequence, one can extract an evolutionary tree, or genealogy, for the samples in question, simply by following their lineages upward on the page and taking a left or a right turn at each recombination node based on whether the genomic position in question falls to the left or the right, respectively, of the recombination event. It is easy to see that a subgraph extracted in this way will contain only joins (coalescences) and no splits, and must eventually reduce to a single lineage, and therefore will define a tree. Thus, the ARG has embedded within it a “local tree,” or genealogy, for every position in the sequence!
I have to pause here to say that I find this property of the ARG incredibly beautiful. In a highly compact manner, the graph manages to capture both a complete record of all recombination events and a genealogy for every position in the sequence, and it does so in a way that shows how neighboring genealogies can be reconciled in terms of historical recombination events! Figure 1 of our paper illustrates it with a simple ARG for four sequences. Don’t be confused by the absence of explicit nodes (circles) in the diagram. By tradition, nodes are not drawn as circles in ARGs, but are simply understood to be the points at which edges meet.
It is worth emphasizing that this representation is very general. It can be used to describe the history of a particular gene of interest in individuals from a single well-defined population, or the history of whole genomes (with one ARG per chromosome) for individuals from many diverse populations. It can even be used to describe the histories of sequences from representatives of different species, such as humans, chimpanzees, and gorillas. As long as the sequences in question are orthologous and collinear—meaning, essentially, that they are derived from a common ancestral sequence in the absence of duplication and rearrangement events—then the coalescence and recombination events defined by the ARG are sufficient for describing precisely how the sequences derive from their common ancestor, and, hence, how they are correlated with each other.
Why is the ARG useful?
Many biologists are familiar with the value of “tree-thinking” in understanding evolutionary relationships. For example, a diagram of a tree intuitively and precisely captures the fact that humans and chimpanzees are more closely related to each other than either humans or chimpanzees are to rhesus macaques. Similarly, it explains why even neutrally evolving sequences (“junk DNA,” if the term is not too loaded) will tend to be more similar between humans and chimpanzees, and between rats and mice, than between humans and mice (this is essentially what we mean by “correlation structure”). Against the background of neutral evolution, trees help us to identify genes and noncoding functional elements, to detect evolutionarily conserved and adaptively evolving sequences, to date speciation events, and so on.
The problem with trees on population genetic time scales, however, is that they change along the sequence, due to recombination. As noted above, the ARG precisely describes these trees and the way they change. Therefore, it enables tree-thinking with population genetic data.
Viewing population genetics in terms of the ARG can clarify one’s thinking about many problems of interest. For instance, the ARG makes it clear that divergence times for genetically isolated populations can be estimated by looking across the ARG for the most recent coalescences that cross population boundaries. Similarly, given an estimated divergence time, the rate of gene flow or migration between populations can be estimated, in a fairly straightforward manner, in terms of the rates of inter-population coalescence events across the ARG. Ancestral effective population sizes can be estimated from the density of coalescence events in the ARG over time. Signatures of natural selection, including hitchhiking and background selection, can be detected by various kinds of local distortion of the ARG. In general, the ARG provides a unifying framework for the field, and many challenging statistical problems in population genetics can properly be seen as problems of revealing relevant features of the ARG.
What would a reconstructed ARG mean in practical terms? First, I should be clear that we have no intension of actually drawing an ARG for dozens of complete human genome sequences. Such a drawing would be far too large and complex to be useful. Rather, the value of a reconstructed ARG is as a rich data structure that could be interrogated for many features of interest, such as local trees, recombination events, mutation ages, or regions of identity by descent. Because these features would be derived from a unified description of the evolutionary history of the sample, they would be guaranteed to be internally consistent, unlike ones based on simpler estimators. In this way, the ARG would be useful in many problems of interest in statistical genetics, ranging from demography inference (e.g., estimation of population divergence times or rates of gene flow between populations), to the detection of regions influenced by natural selection, to the detection of genotype/phenotype associations.
Why is it so difficult to find a good ARG?
In practice, most population geneticists do not work with ARGs, but instead work with surrogates such as principle components, site frequency spectra, and spectra of identity by descent. The reason people work with these simpler, lower-dimensional summaries of genetic data, of course, is that explicit ARG reconstruction is forbiddingly difficult. From a statistical and computational perspective, there are two major issues in reconstructing the ARG. First, the problem of searching all possible ARGs for one that best fits the data is computationally intractable, even in a restricted, parsimony-based formulation of the problem (it belongs to the class of problems computer scientists call “NP-hard”). Second, and perhaps more importantly, in most cases of interest there is simply not enough information in the data to reconstruct a single ARG with high confidence. Rather, in general, a large family of ARGs will be more or less equally compatible with observed sequences.
For these reasons, it would be misleading to suggest that there is any hope of producing a magical computer program that will allow the user to input a collection of sequences and obtain the true ARG for those sequences as output. Instead, we must consider many possible ARGs, weighting them in some way by their plausibility. In other words, we must consider a statistical distribution of ARGs given the data.
Because of the awkwardness of the space of ARGs (each ARG is a complex, combinatorial object, difficult to summarize in terms of low-dimensional features), we and others have come to the conclusion that the best way to get at these distributions is by making use of statistical sampling methods. In our case, we use an approach, called Markov chain Monte Carlo (MCMC), that chooses samples that are guaranteed to be representative of the distribution of ARGs given the data and the model, provided the sampling program is run long enough. After collecting fairly large numbers of samples, we can make useful statements about general features of the ARG even if we have limited confidence in each individual sample. For example, the average of the times to most recent common ancestry (TMRCA) in the sampled ARGs at a particular position along the sequence can be used as an estimator of the true TMRCA at that position. We show that our methods can be used to summarize various useful features of this kind, including recombination and coalescence rates, and the ages of mutations that are polymorphic in the sample, as well as TMRCAs.
How we became interested in the problem of ARG inference—or, the evolution of an obsession
The ARG seems to have a unique ability to inspire obsession in a particular kind of mathematically inclined geneticist. As discussed above, it is potentially richly informative but extremely difficult to obtain, which makes it a natural “Holy Grail” for evolutionary genetics. At the same time, the ARG is fairly simple to describe, it is straightforward to simulate, and sequence data are strongly informative about many of its features, making solutions to the inference problem seem tantalizingly within reach.
I first encountered the ARG and the problem of ARG inference in 2005 when I picked up the book Gene genealogies, variation, and evolution: a primer in coalescent theory, by Hein, Schierup, and Wiuf. The authors of this book had obviously been bitten by the ARG bug (in fact, the lead author, Jotun Hein had already been working on variations on the ARG inference problem for more than 15 years) and I found their enthusiasm contagious. At the time I was primarily a phylogeneticist, new to population genetics, so the book was a perfect introduction to the topic for me. It is written largely from a phylogenetic perspective with a strong emphasis on combinatorial analysis and algorithms and without a lot of intimidating population genetics jargon. Not long afterward I got to know Jotun Hein at leisurely meeting in Barbados and became further interested in the topic through my conversations with him.
As time went on, and I attended more and more talks on population genetics (they are hard to avoid at Cornell!), I became increasingly convinced that the ARG provided the right framework for thinking about a wide variety of problems in population genetics. While population mixture models, principle components analysis, copy-and-paste models, models of the site-frequency spectrum, and similar approaches clearly have their place, for a committed tree thinker like me, there is something deeply unsatisfying about the use of these techniques in evolutionary modeling and inference. I had a strong feeling that it must be possible to do better by more directly modeling the true evolution process that gives rise to the observed sequences.
In 2009, I was invited to write a review article on primate comparative genomics for Genome Research. This exercise led me to carry out a broad review of the literature and reflect on many of the problems of current interest in population genetics. Perhaps inevitably, the ARG ended up playing a central role in my discussion of the field. Not many people have cited this rather long-winded review but the act of writing it was extremely useful in clarifying my thinking about population genetics and the ARG.
Soon afterward, I began to try to interest students and postdocs in tackling the problem of explicit ARG inference in earnest, with an eye toward applications involving dozens or possibly hundreds of complete genomes. I had a hunch that it should be possible to improve on existing methods by making carefully selected approximations and drawing from our standard bag of tricks for probabilistic modeling. I did manage to convince a couple of students to work on the problem but progress was slow and limited, and they moved onto other problems before obtaining publishable results.
When Matt Rasmussen joined the group as a postdoc in 2011, he jumped at the chance to tackle the ARG inference problem. It quickly became clear that Matt was an ideal match for the problem, given his background in molecular evolution and phylogenetics, his strong skills in algorithms and probabilistic modeling, and his programming talents. He made early progress by setting up a forward simulator for ARGs, studying the properties of simulated ARGs in the presence of selection, and developing intuition about which features in the data were most informative for inference.
Still, the inference problem proved challenging. Matt was focusing on an approach I had suggested involving the use of a small set of “representative genealogies” that were selected to span, as effectively as possible, the (infinitely large) space of all genealogies. His plan was to use these genealogies to define a phylogenetic hidden Markov model, which could be used for approximate inference. He had devised a number of clever heuristics for selecting these genealogies, for example, based on the collection of “splits” (bipartitions of sequences) implied by informative sites in the data, but it gradually became clear that this approach was limited in effectiveness and had poor scaling properties. It was just too hard to pick a good set of representative genealogies. Almost a year into the project, we faced a choice between settling for an approach that was clearly suboptimal or going back to the drawing board and beginning anew.
Sequence “threading”—exaptation of an idea
Our breakthrough came some time during the winter of 2012. I remember the occasion vividly. It was one of those rare moments when the outlines of a solution to a difficult puzzle suddenly come into focus at all once—as close to a mythical “eureka” moment as I have experienced.
Matt and I were reviewing possible solutions to the problem of selecting representative genealogies, when I was reminded of an idea I had discussed with a student a couple of years earlier. Could there be a way to build the ARG up one sequence at a time, by “threading” an nth sequence into an ARG of size n–1? This idea had intuitive appeal but the previous student and I had never been able to make it work with full genealogies. However, Matt immediately picked up the idea and ran with it. It fit naturally with the way he was representing the local genealogies and thinking about how they were related and he guessed that it ought to be possible to think of it as a problem of adding one branch to each local tree, position by position along the sequence.
As we discussed further how one might set up a solution to the threading problem, I was suddenly struck by a parallel to a problem my group had worked on several years earlier. That problem had to do with gene prediction in a multi-species setting, where the exon-intron boundaries for genes were allowed to differ from one species to the next. The approach we had settled on was to repeatedly sample the gene annotations for one species conditional on those for all of the others, using a probabilistic model called a phylogenetic hidden Markov model. It turned out that this sampling problem could be elegantly solved by dynamic programming, using an algorithm for hidden Markov models known as the stochastic traceback algorithm.
Matt’s threading problem, it was evident, had the same essential structure, although it differed in many details. Therefore, it ought to be possible to carry out the threading operation exactly, in time proportional to the sequence length, using the stochastic traceback algorithm. Furthermore, it ought to be possible to structure this operation so that repeated applications would guarantee that we sampled from the true distribution of ARGs given the data and our model (i.e., so that we had a proper Gibbs sampler, in statistical jargon). It was immediately clear to both of us that this was the right path forward.
As usual, there was a good deal of work in taking these ideas from white-board scratchings to a fully defined model and a working implementation. Not least among these was the formidable task of generalizing the threading method to multiple sequences, which proved necessary for good MCMC convergence properties with more than a few sequences. Matt was exceptionally diligent, resourceful, and creative in working through all of these steps, and eventually demonstrated that the method performs remarkably well, perhaps better than either of us anticipated. Nevertheless, it was in that one exciting discussion about the threading operation that the project came into focus, making it clear that we had fundamentally new way of tackling this longstanding problem that was worth the additional months that would be needed to see it through to a full-fledged implementation.
Most evolutionary biologists are familiar with the concept of “exaptation,” popularized by Stephen Jay Gould and Elizabeth Vrba. Exaptation refers to the process by which evolution takes a phenotypic characteristic developed for one purpose, and co-opts it for use in another way. Examples include the gas bladders used for buoyancy in some fish (derived from lungs) and the use of bird feathers for flight (thought to have evolved for temperature regulation). The phenomenon, of course, is also common in the evolution of ideas. To me, the threading idea was a perfect example of an algorithmic exaptation, the reuse of an idea developed for one problem in evolutionary sequence analysis and co-opted for use in another. In this case, the threading operation proved considerably more powerful and useful for ARG inference than for gene prediction.
In this post, I have tried to convey some of my excitement about genome-wide ARG inference, a topic that might strike the casual reader as dry and obscure. Space does not allow for a discussion of all of the possible ways in which we are thinking of making use of Matt’s ARG sampling program (called ARGweaver), but our manuscript does include a fairly lengthy Discussion section, which lays out some of our ideas for future work. Clearly there is much still to be done, but I am convinced that this is going to be an enormously powerful tool for population genomic analysis.
Finally, I should say that, while I have focused on our own efforts in this post, we are far from the only researchers actively working on this topic. As described in our paper, there is a great deal of closely related recent work by Yun Song, Richard Durbin, Thomas Mailund, Asger Hobolth, Mikkel Schierup, and others. In short, there is no shortage of opportunities for exaptation of ideas in this vibrant field.
Pingback: James Watson on DNA | DNAeXplained – Genetic Genealogy
Pingback: Molecular Genetics of Recombination (Topics in Current Genetics) Reviews | HEALTHCARE
Pingback: Most viewed on Haldane’s Sieve: July 2013 | Haldane's Sieve
Pingback: Most viewed on Haldane’s Sieve: August 2013 | Haldane's Sieve
Hi Adam. Any chance that you and Matt could post the supplement as part of the arXiv submission. The main text doesn’t seem to describe the method needed for the rethreading of subtrees, and it would be really nice to have a read of that.
We read the paper as a lab for our journal club. Overall I think everything is laid out in the paper (although some is obv. in supp, and so not out yet). However, I think one of the difficulties is that you are trying to describe quite a complicated and layered procedure, that makes it hard to follow the paper.
We had trouble in a number of places, in part because the way the threading is formulated in counter intuitive. We think you are sampling the new coalescent times/branches for the n+1 lineage, followed by sampling the recombination events because this simplifies the state space. However, this is somewhat counter intuitive, as the switches in coalescent times obviously imply recombination events. Now we understand that is correctly handled (it seems under the SMC), but it does make it tough to think through.
We thought that Figure 2 could do a better job of laying this out.
A) Perhaps the 1st row could just be the ARG along the sequence for the n-1 lineages.
B) You could then have another row show the sampling of the Ys (coal. times for new nth sequence) but (somehow) leave the Zs unspecified in the figure. Perhaps this could be done by showing a range of possible Zs for each recombination.
C) In the third row show the sampling of the Zs.
In general I think a clear statement that you sample the Ys summing over all possible Zs, and then sample Zs conditional on the Ys. It feels like the Zs must be reasonably constrained by the choice of Y. It was a little hard to get a sense of where the efficiencies come from that make this possible. Is it this slightly counterintuitive setup that makes the state space easier to work with?
Given the number of variables that carry through the paper it may also help to give more of a verbal description of each new expression, so that the reader is reminded more frequently of what the terms represent. We wrote a variable list up on the whiteboard during our discussion, so we could keep track of them all.
Anyhow, congratulations on what seems like an important step forward in sampling ARGs. This paper is going to open the floodgates on a lot of interesting applications, so it is definitely worth making it as approachable as possible.
Also wondered a bunch about how to think about the problem of being “biased” towards the prior in high recombination regions. I understand why it can happen, but how do we spot that? One way we could do that is by measuring the properties of the trees in high and low recom regions of the data, however, linked selection would also have exactly the type of bias effect you are looking for. I’ve not really parsed your “Ne” around genes section to know if you are accounting for that in someway. I’m guessing that you could do simulations with realistic demography fit from low recombination regions, but that doesn’t feel particularly satisfactory (but is perhaps it is practical). Alternatively, I could learn about the coalescent rate in a PSMC-like fashion, as a way of updating my prior. But that sounds like it might reduce the flexibility of the model. Any thoughts?
Thanks for this, Graham. I have looked back over the material and can see how the treatment of the two-step (coalescence events followed by recombination events) threading process might be confusing to the reader. I have some ideas about how to make this clearer and more explicit and will try to improve the description over the next week or two. I also have recently finished thoroughly revising the supplement and will be sure to include it in the next arXiv revision (the previous version was rough in places and had some notational inconsistencies with the paper). If you are interested in looking at it before the next post just drop me an email and I’ll send it along. I’ll also try to spell out the notation a bit more clearly, as you suggest.
Regarding the influence of the prior, I think this is a general issue with the approach, with implications not just for recombination rates but for coalescence rates (inverse Ne), population structure, migration rates, and so on. Our current approach is to use a highly simplified model in the prior and hope that there is enough information in the data to overcome the prior and drive the posterior. This works reasonably well in many respects, but it is clear that this is not true for features about which the data is only weakly informative. Ultimately, I think we will need to work with richer parametric models to get at these parameters, such as we do with G-PhoCS for population phylogenies and migration rates, and such as PSMC does with Ne. The problem, of course, is that the inference problem starts to get really unwieldy when you are sampling not only full ARGs but also a whole battery of population genetic and demographic parameters. I think it will be possible to make progress in this direction but it’s going to require serious efforts in optimization and engineering (including full parallel implementations, etc.) to make it happen.
Pingback: Some preprint comment streams at Haldane’s sieve and related sites | Haldane's Sieve
I am very interested to try out this new algorithm on some simulations and real data that I’ve been working with – can you give any indication when alpha or beta software might be released??
reply from Adam via twitter: https://twitter.com/asiepel/status/377952088040296448
Good timing. I just released a first version of the ARGweaver code on github: https://github.com/mdrasmus/argweaver More documentation to come. If you have any questions, just let me know @mattrasmus
Good timing. I just released a first version of the ARGweaver code on github: https://github.com/mdrasmus/argweaver More documentation to come. If you have any questions, just let me know
Hi Adam and Matt,
I enjoyed the paper a lot! Really great work. I’d like to apply to my own data. Do you know if I can apply to Radseq genomic data, which means I have reads covered of 1% of the genome, but not completely. Can I still use the program (phase them and divide them into 2mb blocks)? Thank you!!
Hi Qixin — This is an interesting question. ARGweaver does allow for missing data (we make use of this feature when masking regions of whole genome sequences likely to be enriched for errors of various kinds) and, in principle, it could be applied to low coverage RAD-seq data. In practice, however, I think its usefulness would depend strongly on the distribution of contig sizes you were able to obtain. If coverage is very low and contigs are very short, you would probably gain little information about haplotype structure and recombination rates and might do better by simply using a method that assumes negligible rates of intralocus recombination and free recombination between loci, such as our G-PhoCS program (http://compgen.bscb.cornell.edu/GPhoCS/). In this case you could subsample the data as needed to meet the modeling assumptions but you would likely still be left with quite a lot of data. But if you think your sequence coverage was high enough to provide useful information about recombination you could certainly try ARGweaver. We’d be glad to explain how to set it up in this case.
Thank you for this detailed explanation. I have all the reads around 80-90 bp long, and scattered across the genome and covers 1% of the genome. Will that still be useful in trying ARGweaver? I’d like to know how to set it up in this case. Certainly I’ll also try GPhoCS as well if it doesn’t work out with my data. Thank you!
My hunch is that you won’t get much out of ARGweaver in this case, but I’d be very interested to see what happens if you do try it. If you are interested, the best approach is probably to download and install the program (see Matt’s link above) and contact us directly by email about how to handle missing data.
Pingback: Most viewed on Haldane’s Sieve: September 2013 | Haldane's Sieve
Pingback: Sifting through 2013 with Haldane’s Sieve | Haldane's Sieve
Pingback: Author post: Inferring human population size and separation history from multiple genome sequences | Haldane's Sieve
Pingback: Our Paper: Genome-wide inference of ancestral recombination graphs | Siepel lab
Pingback: Sifting through 2014 on Haldane’s Sieve | Haldane's Sieve