Our paper: Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales

This guest post is by Mike Harvey on his paper with coauthors: Tilston-Smith, Harvey, et al., "Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales", arXived here.

This paper is a result of work on developing markers and methods for generating genomic data for species without available genomes (I’ll refer to these as “non-model” species). The work is a collaborative effort between some researchers who are really on top of developments in sequencing technologies (and are also a blast to work with) – Travis Glenn at UGA, Brant Faircloth at UCLA, and John McCormack at Occidental – and our lab here at LSU. We think the marker sets we have been developing (ultraconserved elements) and more generally the method we are using (sequence capture) have the potential to make the genomic revolution more accessible to researchers studying the population genetics of diverse non-model organisms.

Background

Although genomic resources for humans and other model systems are increasing rapidly, the bottleneck for those of us working on the population genetics of non-model systems is simply our ability to generate data. Many of us are still struggling to take advantage of the increase in sequencing capacity provided by next-generation platforms. For many projects, sequencing entire genomes is neither feasible (yet) nor necessary, so researchers have focused on finding reasonable methods of subsampling the genome in a repeatable way such that the same subset of genomic regions can be sampled for many individuals. We often have to do this, however, with little to no prior genomic information from our particular study organism.

Most methods for subsampling the genome thus far have involved “random” sampling from across the genome by using restriction enzymes to digest genomic DNA and then sequencing fragments that fall in a particular part of the fragment size distribution. Drawbacks of these methods include (1) the fact that the researcher has no prior knowledge of where in the genome sequences will be coming from or what function the genomic region might serve, and (2) that the repeatability of the method, specifically the ability to generate data from the same loci across samples, depends on the conservation of the enzyme cut sites, and these often are not conserved at deeper timescales. Sequencing transcriptomes is also a popular method for subsampling the genome, but this simply isn’t an option for those of us working with museum specimens and tissues or old blood samples in which RNA hasn’t been properly preserved.

Sequence capture, a molecular technique involving genome enrichment by hybridization to RNA or DNA ‘probes’, is a flexible alternative that allows researchers to subsample whatever portions of the genome they like. The drawback of sequence capture, however, is that you need enough prior genomic information to design the synthetic oligos used as probes. This is not a problem for e.g. exome capture in humans in which the targeted genes are well characterized, but it is a challenge for non-model systems without sequenced genomes.

This is where ultraconserved elements come in. Ultraconserved elements (UCEs) are short genomic regions that are highly conserved across widely divergent species (e.g. all amniotes). Because they are so conserved, UCE sequences can be easily used as probes for sequence capture in diverse non-model organisms, even if the organisms themselves have little or no genomic information available. If you are not working on amniotes or fishes (for which we have already designed probe arrays), all you may need to find UCEs is a couple of genomes from species that diverged from your study organism within the last few hundred million years. Of course, this general approach is not specific to loci that fall into our narrow definition of UCEs, but is limited merely by the availability of genomic information that can be used to design probes. As additional genomic information becomes available from a given group additional loci, including protein-coding regions, can easily be added to capture arrays.

Our question for this paper – does sequence capture of UCEs work for population genetics?

We have previously used sequence capture of UCEs to address deeper-level phylogenetic questions. We’ve found that at deep timescales, the flanking regions of UCEs contain a large amount of informative variation. The goals of the present study were (1) to see if sufficient information existed in UCEs to enable studies at shallow evolutionary (read "population genetic or phylogeographic") timescales, and (2) to explore some of the analyses that might be possible with population genetic data from non-model organisms. For our study, we sampled two individuals from each of four populations in five different species of non-model Neotropical birds. We conducted sequence capture using probes designed from 2,386 UCEs shared by amniotes and we sequenced the resulting libraries using an Illumina HiSeq. We then examined the number of loci recovered and the amount of informative variation in those loci for each of the five species. We also conducted some standard analyses – species tree estimation, demographic modeling, and species delimitation – for each species.

We were able to recover between 776 and 1,516 UCE regions across the five species, and these contained sufficient variation to conduct population genetic analyses in each species. Species tree estimates, demographic parameters, and species limits mostly corresponded with prior estimates based on morphology or mitochondrial DNA sequences. Confidence intervals around demographic parameter estimates from the UCEs were much narrower than estimates from mitochondrial DNA using similar methods, supporting the idea that larger datasets will allow more precise estimates of species histories.

Some conclusions

Pending faster and cheaper methods for sequencing and de novo assembling whole genomes, methods for sampling a subset of the genome will be a practical necessity for population genetic studies in non-model organisms. Sequence capture is both intuitively appealing and practical in that it allows researchers to select a priori the regions of the genome in which they are interested. Ultraconserved elements pair nicely with sequence capture because they allow us to collect data from the same loci shared across a very broad spectrum of organisms (e.g. all amniotes or all fishes). As genomic data for diverse groups increases, UCE capture probes will certainly be augmented with additional genomic regions. In the meantime, sequence capture of UCEs has a lot to offer for population genetic studies of non-model organisms. See our paper for more information, or visit ultraconserved.org, where our probe sets, protocols, code, and other information are available under open-source licenses (BSD-style and Creative Commons) for anyone to use.

Fluctuating selection models and McDonald-Kreitman type analyses

Fluctuating selection models and McDonald-Kreitman type analyses
Toni I. Gossmann, David Waxman, Adam Eyre-Walker
(Submitted on 25 Aug 2013)

It is likely that the strength of selection acting upon a mutation varies through time due to changes in the environment. However, most population genetic theory assumes that the strength of selection remains constant. Here we investigate the consequences of fluctuating selection pressures on the quantification of adaptive evolution using McDonald-Kreitman (MK) style approaches. In agreement with previous work, we show that fluctuating selection can generate evidence of adaptive evolution even when the expected strength of selection on a mutation is zero. However, we also find that, under fluctuating selection models, the mutations that contribute to both polymorphism and divergence tend, on average, to be positively selected during their lifetime. This is because mutations that fluctuate, by chance, to positively selected values tend to reach higher frequencies in the population than those that fluctuate towards negative values. Hence the evidence of adaptive evolution detected under a fluctuating selection model by MK type approaches is genuine, since fixed mutations tend to be advantageous on average during their lifetime. Nevertheless, we show that these methods tend to underestimate the rate of adaptive evolution when selection fluctuates.
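For readers new to MK-style approaches, here is a minimal sketch of the standard quantity being estimated: the proportion of adaptive nonsynonymous substitutions, alpha. The counts below are invented purely for illustration.

```python
def mk_alpha(Pn, Ps, Dn, Ds):
    """Estimate the proportion of adaptive nonsynonymous substitutions,
    alpha = 1 - (Ds * Pn) / (Dn * Ps), from counts of nonsynonymous and
    synonymous polymorphisms (Pn, Ps) and divergences (Dn, Ds)."""
    return 1 - (Ds * Pn) / (Dn * Ps)

# Illustrative counts: an excess of nonsynonymous divergence relative to
# nonsynonymous polymorphism suggests adaptive substitutions.
print(mk_alpha(Pn=10, Ps=40, Dn=30, Ds=60))  # 0.5
```

The paper's point is that when selection coefficients fluctuate over a mutation's lifetime, this estimator remains meaningful but is downwardly biased.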

Maximum likelihood evidence for Neandertal admixture in Eurasian populations from three genomes

Maximum likelihood evidence for Neandertal admixture in Eurasian populations from three genomes
Konrad Lohse, Laurent A.F. Frantz
(Submitted on 31 Jul 2013)

Although there has been much interest in estimating divergence and admixture from genomic data, it has proven difficult to distinguish gene flow after divergence from alternative histories involving structure in the ancestral population. The lack of a formal test to distinguish these scenarios has sparked recent controversy about the possibility of interbreeding between Neandertals and modern humans in Eurasia. We derive the probability of mutational configurations in non-recombining sequence blocks under alternative histories of divergence with admixture and ancestral structure. Dividing the genome into short blocks makes it possible to compute maximum likelihood estimates of parameters under both models. We apply this method to triplets of human and Neandertal genomes and quantify the relative support for models of long-term population structure in the ancestral African population and admixture from Neandertals into Eurasian populations after their expansion out of Africa. Our analysis allows us, for the first time, to formally reject a history of ancestral population structure and instead reveals strong support for admixture from Neandertals into Eurasian populations at a higher rate (3.4%-7.9%) than suggested previously.

The Population Genetic Signature of Polygenic Local Adaptation

The Population Genetic Signature of Polygenic Local Adaptation
Jeremy J. Berg, Graham Coop
(Submitted on 29 Jul 2013)

Adaptation in response to selection on polygenic phenotypes occurs via subtle allele frequency shifts at many loci. Current population genomic techniques are not well suited to identifying such signals. In the past decade, detailed knowledge about the specific loci underlying polygenic traits has begun to emerge from genome-wide association studies (GWAS). Here we combine this knowledge from GWAS with robust population genetic modeling to identify traits that have undergone local adaptation. Using GWAS data, we estimate the mean additive genetic value for a given phenotype across many populations as simple weighted sums of allele frequencies. We model the expected differentiation of GWAS loci among populations under neutrality to develop simple tests of selection across an arbitrary number of populations with arbitrary population structure. To find support for the role of specific environmental variables in local adaptation we test for correlations with the estimated genetic values. We also develop a general test of local adaptation to identify overdispersion of the estimated genetic values among populations. This test is a natural generalization of QST/FST comparisons based on GWAS predictions. Finally we lay out a framework to identify the individual populations or groups of populations that contribute to the signal of overdispersion. These tests have considerably greater power than their single locus equivalents due to the fact that they look for positive covariance between like effect alleles. We apply our tests to the human genome diversity panel dataset using GWAS data for six different traits. This analysis uncovers a number of putative signals of local adaptation, and we discuss the biological interpretation and caveats of these results.
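The "simple weighted sums" mentioned in the abstract can be sketched in a few lines. Effect sizes and frequencies below are invented for illustration, and the factor of 2 assumes diploid individuals.

```python
# Mean additive genetic value of a population, estimated as twice the sum of
# GWAS effect sizes times the population's allele frequencies (diploids).
# All numbers here are made up for illustration.
effects = [0.10, -0.05, 0.20]   # per-allele effect sizes from a GWAS
freqs_pop = [0.5, 0.2, 0.8]     # frequencies of those alleles in one population

genetic_value = 2 * sum(b * p for b, p in zip(effects, freqs_pop))
print(round(genetic_value, 3))  # 0.4
```

Computing this quantity in many populations, and asking whether it is more overdispersed than neutral drift at the underlying loci would predict, is the core of the test described above.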

Ancient west Eurasian ancestry in southern and eastern Africa

Ancient west Eurasian ancestry in southern and eastern Africa
Joseph K. Pickrell, Nick Patterson, Po-Ru Loh, Mark Lipson, Bonnie Berger, Mark Stoneking, Brigitte Pakendorf, David Reich
(Submitted on 30 Jul 2013)

The history of southern Africa involved interactions between indigenous hunter-gatherers and a range of populations that moved into the region. Here we use genome-wide genetic data to show that there are at least two admixture events in the history of Khoisan populations (southern African hunter-gatherers and pastoralists who speak non-Bantu languages with click consonants). One involved populations related to Niger-Congo-speaking African populations, and the other introduced ancestry most closely related to west Eurasian (European or Middle Eastern) populations. We date this latter admixture event to approximately 900-1,800 years ago, and show that it had the largest demographic impact in Khoisan populations that speak Khoe-Kwadi languages. A similar signal of west Eurasian ancestry is present throughout eastern Africa. In particular, we also find evidence for two admixture events in the history of Kenyan, Tanzanian, and Ethiopian populations, the earlier of which involved populations related to west Eurasians and which we date to approximately 2,700-3,300 years ago. We reconstruct the allele frequencies of the putative west Eurasian population in eastern Africa, and show that this population is a good proxy for the west Eurasian ancestry in southern Africa. The most parsimonious explanation for these findings is that west Eurasian ancestry entered southern Africa indirectly through eastern Africa.

Robust forward simulations of recurrent positive selection

Robust forward simulations of recurrent positive selection
Lawrence H. Uricchio, Ryan D. Hernandez
(Submitted on 24 Jul 2013)

It is well known that recurrent positive selection reduces the amount of genetic variation at linked sites. In recent decades, analytical results have been proposed to quantify the magnitude of this reduction with simple Wright-Fisher models and diffusion approximations. However, extending these results to include interference between selected sites, arbitrary selection schemes, and complicated demographic processes has proved to be challenging. Forward simulation can provide insights into these processes, but few studies have examined recurrent positive selection in a forward simulation context due to computational constraints. Here, we extend the flexible forward simulator SFS_CODE to greatly improve the efficiency of simulations of recurrent positive selection. Forward simulations are computationally intensive and often necessitate rescaling of relevant parameters (e.g., population size and sequence length) to achieve computational feasibility. However, it is not obvious that parameter rescaling will maintain expected patterns of diversity in all parameter regimes. We develop a simple method for parameter rescaling that provides the best possible computational performance for a given error tolerance, and a detailed theoretical analysis of the robustness of rescaling across the parameter space. These results show that ad hoc approaches to parameter rescaling under the recurrent hitchhiking model may not always provide sufficiently accurate dynamics, potentially skewing patterns of diversity in simulated DNA sequences.

Guidelines for the design of evolve and resequencing studies

Guidelines for the design of evolve and resequencing studies
Robert Kofler, Christian Schlötterer
(Submitted on 18 Jul 2013)

Standing genetic variation provides a rich reservoir of potentially useful mutations facilitating adaptation to novel environments. Experimental evolution studies have demonstrated that rapid and strong phenotypic responses to selection can also be obtained in the laboratory. When combined with next-generation sequencing technology, these experiments promise to identify the individual loci contributing to adaptation. Nevertheless, until now, very little has been known about the design of such evolve and resequencing (E&R) studies. Here, we use forward simulations of entire genomes to evaluate different experimental designs that aim to maximize the power to detect selected variants. We show that low linkage disequilibrium in the starting population, population size, duration of the experiment, and the number of replicates are the key factors in determining the power and accuracy of E&R studies. Furthermore, replication of E&R is more important for detecting the targets of selection than increasing the population size. Using an optimized design, beneficial loci with a selective advantage as low as s=0.005 can be identified at the nucleotide level. Even when a large number of loci are selected simultaneously, up to 56% can be reliably detected without incurring large numbers of false positives. Our computer simulations suggest that, with an adequate experimental design, E&R studies are a powerful tool to identify adaptive mutations from standing genetic variation and thereby provide an excellent means to analyze the trajectories of selected alleles in evolving populations.

A model-based approach for identifying signatures of balancing selection in genetic data

A model-based approach for identifying signatures of balancing selection in genetic data
Michael DeGiorgio, Kirk E. Lohmueller, Rasmus Nielsen
(Submitted on 16 Jul 2013)

While much effort has focused on detecting positive and negative directional selection in the human genome, relatively little work has been devoted to balancing selection. This lack of attention is likely due to the paucity of sophisticated methods for identifying sites under balancing selection. Here we develop two composite likelihood ratio tests for detecting balancing selection. Using simulations, we show that these methods outperform competing methods under a variety of assumptions and demographic models. We apply the new methods to whole-genome human data, and find a number of previously-identified loci with strong evidence of balancing selection, including several HLA genes. Additionally, we find evidence for many novel candidates, the strongest of which is FANK1, an imprinted gene that suppresses apoptosis, is expressed during meiosis in males, and displays marginal signs of segregation distortion. We hypothesize that balancing selection acts on this locus to stabilize the segregation distortion and negative fitness effects of the distorter allele. Thus, our methods are able to reproduce many previously-hypothesized signals of balancing selection, as well as discover novel interesting candidates.

Genomic identification of founding haplotypes reveals the history of the selfing species Capsella rubella

Genomic identification of founding haplotypes reveals the history of the selfing species Capsella rubella
Yaniv Brandvain, Tanja Slotte, Khaled Hazzouri, Stephen Wright, Graham Coop
(Submitted on 15 Jul 2013)

The shift from outcrossing to self-fertilization is among the most common transitions in plants. Until recently, however, a genome-wide view of this transition has been obscured by a dearth of appropriate data and the lack of population genomic methods to interpret such data. Here, we present novel analyses detailing the origin of the selfing species, Capsella rubella, which recently split from its outcrossing sister, Capsella grandiflora. Due to the recency of the split, most variation within C. rubella is found within C. grandiflora. We can therefore identify genomic regions where two C. rubella individuals have inherited the same or different segments of ancestral diversity (i.e. founding haplotypes) present in C. rubella’s founder(s). Based on this analysis, we show that C. rubella was founded by multiple individuals drawn from a diverse ancestral population closely related to extant C. grandiflora, that drift and selection have rapidly homogenized most of this ancestral variation since C. rubella’s founding, and that little novel variation has accumulated within this time. Despite the extensive loss of ancestral variation, the approximately 25% of the genome for which two C. rubella individuals have inherited different founding haplotypes makes up roughly 90% of the genetic variation between them. To extend these findings, we develop a coalescent model that utilizes the inferred frequency of founding haplotypes and variation within founding haplotypes to estimate that C. rubella was founded by a potentially large number of individuals 50-100 kya, and has subsequently experienced a 20X reduction in its effective population size. As population genomic data from an increasing number of outcrossing/selfing pairs are generated, analyses like those presented here will facilitate a fine-scaled view of the evolutionary and demographic impact of the transition to self-fertilization.

Our Paper: Genome-wide inference of ancestral recombination graphs

This guest post is by Adam Siepel (@asiepel) on his paper with Matthew Rasmussen (@mattrasmus): Rasmussen and Siepel “Genome-wide inference of ancestral recombination graphs” , arXived here.

Inference of Ancestral Recombination Graphs

 As the title indicates, our paper is about the problem of inferring an “ancestral recombination graph,” or ARG, from sequence data.  This is a topic that may strike many readers as impenetrably obscure and technical, so I will first try to explain, in plain language, what the ARG describes and why it has so much potential to be useful in many kinds of genetic analysis.  Then, I will tell the story of how I and members of my research group have become increasingly fascinated by this problem over the years, how we have struggled with it, and how we finally achieved the conceptual breakthrough that is described in our paper.  As will become evident, Matt Rasmussen, a former postdoc in the group and lead author of our paper, was central in this achievement.

What is the ARG?

The ARG is an elegantly simple yet superbly rich data structure that describes the complete evolutionary history of a collection of genetic sequences drawn from individuals in one or more populations.  It was invented in the mid 1990s by the mathematicians Bob Griffiths and Paul Marjoram.  The ARG captures essentially all evolutionary information relevant for genetic analysis of such sequences.  Statisticians say that it fully defines the “correlation structure” of the sequences, meaning that it explains most similarities and differences among the sequences in terms of their patterns of shared ancestry.

The ARG is something like a family tree, only richer, because it not only defines the relationships among individuals, but it also traces the histories of specific segments of DNA sequences.  For example, if you were to replace your family tree with an ARG, you could tell exactly which pieces of your genome came from your eccentric great grandmother and which pieces you share with your charming, intelligent, and handsome third cousin.

Now, this is obviously all a bit vague and superficial.  What does the ARG actually describe?  First, the ARG is a “graph” in the mathematical sense, meaning that it consists of a set of “nodes” and a set of “edges” between nodes.  Traditionally, nodes and edges are depicted with circles and connecting lines, respectively, making graphs look a bit like those molecular modeling kits you may have used in your high school chemistry class.  In the case of the ARG, there are two types of nodes, representing two key classes of events in the history of the sequences: (i) recombination nodes, which describe how a sequence is assembled by concatenating two ancestral sequences when DNA is mixed-and-matched during the production of sperm and egg cells or other gametes (meiosis); and (ii) coalescence nodes, which describe how two sequences trace back to a common ancestor. Edges between these nodes represent lines of descent, or lineages, for segments of DNA.  Each node is annotated with the time of the corresponding event.  In addition, for reasons that will become apparent in a moment, recombination nodes are annotated with the position along the sequences at which the ancestral sequences were cut and then joined together.

The significance of recombinations and coalescences comes from the fact that these are the two ways in which lineages can join or split over time.  The best way to understand them is to think about the behavior of lineages as one looks backward in time.  The graph is typically laid out with time on the vertical axis, so that the bottom of the graph represents the present time and each node is assigned a height above this baseline indicating the time before the present at which the associated event occurred. Therefore, to look backward in time, we look upward in the graph.  As we do so, we see that recombination events cause a single lineage to split into two ancestral lineages (representing the two sequence fragments that were joined together by the recombination in forward time), and coalescence events cause two lineages to join into one. Therefore, recombination nodes have one edge coming in and two going out, and coalescence nodes have two edges coming in and one going out.  One way of thinking about it is that, given a fragment of modern DNA, recombinations have the effect of increasing its set of ancestors, while coalescences have the effect of decreasing its set of ancestors.
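The degree rules just described can be encoded directly. Here is a minimal sketch in Python; the node layout and names are illustrative, not taken from any ARG software.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ArgNode:
    kind: str                             # "sample", "recombination", or "coalescence"
    time: float                           # time before the present
    parents: List[str] = field(default_factory=list)   # edges going back in time
    children: List[str] = field(default_factory=list)  # edges toward the present
    breakpoint: Optional[float] = None    # cut position; recombination nodes only

def valid_degrees(node: ArgNode) -> bool:
    """Looking backward in time: a recombination node has one edge in and two
    edges out; a coalescence node has two edges in and one edge out."""
    if node.kind == "recombination":
        return len(node.children) == 1 and len(node.parents) == 2
    if node.kind == "coalescence":
        return len(node.children) == 2 and len(node.parents) == 1
    return True  # sample (leaf) nodes are unconstrained here

rec = ArgNode("recombination", 1.5, parents=["left_anc", "right_anc"],
              children=["descendant"], breakpoint=120.0)
print(valid_degrees(rec))  # True
```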

All of this joining and splitting results in a complex network that, to the uninitiated, might vaguely resemble a map of the London Tube.  For the ARG-literate, however, this graph is richly informative about evolutionary history.  The key to interpreting the graph is to make use of the positional information associated with the recombination nodes (the positions at which the recombinations occurred). For any position along the sequence, one can extract an evolutionary tree, or genealogy, for the samples in question, simply by following their lineages upward on the page and taking a left or a right turn at each recombination node based on whether the genomic position in question falls to the left or the right, respectively, of the recombination event.  It is easy to see that a subgraph extracted in this way will contain only joins (coalescences) and no splits, and must eventually reduce to a single lineage, and therefore will define a tree.  Thus, the ARG has embedded within it a “local tree,” or genealogy, for every position in the sequence!
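The left-or-right-turn procedure for extracting a local tree can be made concrete. Below is a toy two-sample ARG with a single recombination at position 100 (node names, times, and layout are invented); tracing lineages upward recovers a different most recent common ancestor on each side of the breakpoint.

```python
# A toy ARG for two samples with one recombination at position 100.
ARG = {
    "s1": {"kind": "sample",        "time": 0.0, "parent": "r"},
    "s2": {"kind": "sample",        "time": 0.0, "parent": "c1"},
    "r":  {"kind": "recombination", "time": 0.5, "breakpoint": 100,
           "left_parent": "c1", "right_parent": "c2"},
    "c1": {"kind": "coalescence",   "time": 1.0, "parent": "c2"},
    "c2": {"kind": "coalescence",   "time": 2.0, "parent": None},  # root
}

def lineage_at(node, position):
    """Follow one lineage upward (back in time), taking a left or right
    turn at each recombination node depending on the query position."""
    path = [node]
    while True:
        info = ARG[node]
        if info["kind"] == "recombination":
            node = (info["left_parent"] if position < info["breakpoint"]
                    else info["right_parent"])
        else:
            node = info["parent"]
        if node is None:
            return path
        path.append(node)

def local_tmrca(a, b, position):
    """TMRCA of two samples at a position: the first node on a's path
    that also lies on b's path."""
    path_b = set(lineage_at(b, position))
    mrca = next(n for n in lineage_at(a, position) if n in path_b)
    return mrca, ARG[mrca]["time"]

print(local_tmrca("s1", "s2", 50))   # ('c1', 1.0) -- left of the breakpoint
print(local_tmrca("s1", "s2", 150))  # ('c2', 2.0) -- right of the breakpoint
```

Repeating this trace for every sample at a given position yields exactly the embedded local tree described above.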

I have to pause here to say that I find this property of the ARG incredibly beautiful.  In a highly compact manner, the graph manages to capture both a complete record of all recombination events and a genealogy for every position in the sequence, and it does so in a way that shows how neighboring genealogies can be reconciled in terms of historical recombination events!  Figure 1 of our paper illustrates it with a simple ARG for four sequences.  Don’t be confused by the absence of explicit nodes (circles) in the diagram.  By tradition, nodes are not drawn as circles in ARGs, but are simply understood to be the points at which edges meet.

It is worth emphasizing that this representation is very general.  It can be used to describe the history of a particular gene of interest in individuals from a single well-defined population, or the history of whole genomes (with one ARG per chromosome) for individuals from many diverse populations.  It can even be used to describe the histories of sequences from representatives of different species, such as humans, chimpanzees, and gorillas.  As long as the sequences in question are orthologous and collinear—meaning, essentially, that they are derived from a common ancestral sequence in the absence of duplication and rearrangement events—then the coalescence and recombination events defined by the ARG are sufficient for describing precisely how the sequences derive from their common ancestor, and, hence, how they are correlated with each other.

Why is the ARG useful?

Many biologists are familiar with the value of “tree-thinking” in understanding evolutionary relationships.  For example, a diagram of a tree intuitively and precisely captures the fact that humans and chimpanzees are more closely related to each other than either humans or chimpanzees are to rhesus macaques.  Similarly, it explains why even neutrally evolving sequences (“junk DNA,” if the term is not too loaded) will tend to be more similar between humans and chimpanzees, and between rats and mice, than between humans and mice (this is essentially what we mean by “correlation structure”).  Against the background of neutral evolution, trees help us to identify genes and noncoding functional elements, to detect evolutionarily conserved and adaptively evolving sequences, to date speciation events, and so on.

The problem with trees on population genetic time scales, however, is that they change along the sequence, due to recombination.  As noted above, the ARG precisely describes these trees and the way they change.  Therefore, it enables tree-thinking with population genetic data.

Viewing population genetics in terms of the ARG can clarify one’s thinking about many problems of interest.  For instance, the ARG makes it clear that divergence times for genetically isolated populations can be estimated by looking across the ARG for the most recent coalescences that cross population boundaries.  Similarly, given an estimated divergence time, the rate of gene flow or migration between populations can be estimated, in a fairly straightforward manner, in terms of the rates of inter-population coalescence events across the ARG.  Ancestral effective population sizes can be estimated from the density of coalescence events in the ARG over time. Signatures of natural selection, including hitchhiking and background selection, can be detected by various kinds of local distortion of the ARG.  In general, the ARG provides a unifying framework for the field, and many challenging statistical problems in population genetics can properly be seen as problems of revealing relevant features of the ARG.

What would a reconstructed ARG mean in practical terms?  First, I should be clear that we have no intention of actually drawing an ARG for dozens of complete human genome sequences.  Such a drawing would be far too large and complex to be useful.  Rather, the value of a reconstructed ARG is as a rich data structure that could be interrogated for many features of interest, such as local trees, recombination events, mutation ages, or regions of identity by descent.  Because these features would be derived from a unified description of the evolutionary history of the sample, they would be guaranteed to be internally consistent, unlike ones based on simpler estimators.  In this way, the ARG would be useful in many problems of interest in statistical genetics, ranging from demographic inference (e.g., estimation of population divergence times or rates of gene flow between populations), to the detection of regions influenced by natural selection, to the detection of genotype/phenotype associations.

Why is it so difficult to find a good ARG?

In practice, most population geneticists do not work with ARGs, but instead work with surrogates such as principal components, site frequency spectra, and spectra of identity by descent.  The reason people work with these simpler, lower-dimensional summaries of genetic data, of course, is that explicit ARG reconstruction is forbiddingly difficult.  From a statistical and computational perspective, there are two major issues in reconstructing the ARG.  First, the problem of searching all possible ARGs for one that best fits the data is computationally intractable, even in a restricted, parsimony-based formulation of the problem (it belongs to the class of problems computer scientists call “NP-hard”).  Second, and perhaps more importantly, in most cases of interest there is simply not enough information in the data to reconstruct a single ARG with high confidence.  Rather, in general, a large family of ARGs will be more or less equally compatible with observed sequences.

For these reasons, it would be misleading to suggest that there is any hope of producing a magical computer program that will allow the user to input a collection of sequences and obtain the true ARG for those sequences as output.  Instead, we must consider many possible ARGs, weighting them in some way by their plausibility.  In other words, we must consider a statistical distribution of ARGs given the data.

Because of the awkwardness of the space of ARGs (each ARG is a complex, combinatorial object, difficult to summarize in terms of low-dimensional features), we and others have come to the conclusion that the best way to get at these distributions is by making use of statistical sampling methods.  In our case, we use an approach called Markov chain Monte Carlo (MCMC), which draws samples that are guaranteed to be representative of the distribution of ARGs given the data and the model, provided the sampling program is run long enough.  After collecting fairly large numbers of samples, we can make useful statements about general features of the ARG even if we have limited confidence in each individual sample.  For example, the average of the times to most recent common ancestry (TMRCA) in the sampled ARGs at a particular position along the sequence can be used as an estimator of the true TMRCA at that position.  We show that our methods can be used to summarize various useful features of this kind, including recombination and coalescence rates, and the ages of mutations that are polymorphic in the sample, as well as TMRCAs.
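The averaging step itself is simple once the samples are in hand.  Here is a toy sketch, with synthetic draws standing in for real MCMC output (the array `tmrca_samples` and its dimensions are illustrative, not ARGweaver's actual interface):

```python
import numpy as np

# Hypothetical setup: suppose each MCMC iteration yields, for every
# position along the alignment, the TMRCA implied by that sample's
# local tree.  tmrca_samples[i, j] = TMRCA at position j in sample i.
rng = np.random.default_rng(0)
n_samples, n_positions = 200, 1000
tmrca_samples = rng.gamma(shape=2.0, scale=0.5,
                          size=(n_samples, n_positions))

# The posterior-mean estimator at each position is the column average;
# credible intervals come from quantiles across the sampled ARGs.
tmrca_mean = tmrca_samples.mean(axis=0)
tmrca_lo, tmrca_hi = np.quantile(tmrca_samples, [0.025, 0.975], axis=0)
```

The same pattern (average or take quantiles across sampled ARGs) applies to the other features mentioned above, such as recombination rates or mutation ages.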

How we became interested in the problem of ARG inference—or, the evolution of an obsession

The ARG seems to have a unique ability to inspire obsession in a particular kind of mathematically inclined geneticist.  As discussed above, it is potentially richly informative but extremely difficult to obtain, which makes it a natural “Holy Grail” for evolutionary genetics.  At the same time, the ARG is fairly simple to describe, it is straightforward to simulate, and sequence data are strongly informative about many of its features, making solutions to the inference problem seem tantalizingly within reach.

I first encountered the ARG and the problem of ARG inference in 2005 when I picked up the book Gene genealogies, variation, and evolution: a primer in coalescent theory, by Hein, Schierup, and Wiuf.  The authors of this book had obviously been bitten by the ARG bug (in fact, the lead author, Jotun Hein, had already been working on variations on the ARG inference problem for more than 15 years) and I found their enthusiasm contagious.  At the time I was primarily a phylogeneticist, new to population genetics, so the book was a perfect introduction to the topic for me.  It is written largely from a phylogenetic perspective with a strong emphasis on combinatorial analysis and algorithms and without a lot of intimidating population genetics jargon.  Not long afterward I got to know Jotun Hein at a leisurely meeting in Barbados and became further interested in the topic through my conversations with him.

As time went on, and I attended more and more talks on population genetics (they are hard to avoid at Cornell!), I became increasingly convinced that the ARG provided the right framework for thinking about a wide variety of problems in population genetics.  While population mixture models, principal components analysis, copy-and-paste models, models of the site-frequency spectrum, and similar approaches clearly have their place, for a committed tree thinker like me, there is something deeply unsatisfying about the use of these techniques in evolutionary modeling and inference.  I had a strong feeling that it must be possible to do better by more directly modeling the true evolutionary process that gives rise to the observed sequences.

In 2009, I was invited to write a review article on primate comparative genomics for Genome Research.  This exercise led me to carry out a broad review of the literature and reflect on many of the problems of current interest in population genetics.  Perhaps inevitably, the ARG ended up playing a central role in my discussion of the field.  Not many people have cited this rather long-winded review but the act of writing it was extremely useful in clarifying my thinking about population genetics and the ARG.

Soon afterward, I began to try to interest students and postdocs in tackling the problem of explicit ARG inference in earnest, with an eye toward applications involving dozens or possibly hundreds of complete genomes.  I had a hunch that it should be possible to improve on existing methods by making carefully selected approximations and drawing from our standard bag of tricks for probabilistic modeling.  I did manage to convince a couple of students to work on the problem but progress was slow and limited, and they moved on to other problems before obtaining publishable results.

When Matt Rasmussen joined the group as a postdoc in 2011, he jumped at the chance to tackle the ARG inference problem.  It quickly became clear that Matt was an ideal match for the problem, given his background in molecular evolution and phylogenetics, his strong skills in algorithms and probabilistic modeling, and his programming talents.  He made early progress by setting up a forward simulator for ARGs, studying the properties of simulated ARGs in the presence of selection, and developing intuition about which features in the data were most informative for inference.

Still, the inference problem proved challenging.  Matt was focusing on an approach I had suggested involving the use of a small set of “representative genealogies” that were selected to span, as effectively as possible, the (infinitely large) space of all genealogies.  His plan was to use these genealogies to define a phylogenetic hidden Markov model, which could be used for approximate inference.  He had devised a number of clever heuristics for selecting these genealogies, for example, based on the collection of “splits” (bipartitions of sequences) implied by informative sites in the data, but it gradually became clear that this approach was limited in effectiveness and had poor scaling properties.  It was just too hard to pick a good set of representative genealogies.  Almost a year into the project, we faced a choice between settling for an approach that was clearly suboptimal or going back to the drawing board and beginning anew.

Sequence “threading”—exaptation of an idea

Our breakthrough came some time during the winter of 2012.  I remember the occasion vividly.  It was one of those rare moments when the outlines of a solution to a difficult puzzle suddenly come into focus all at once—as close to a mythical “eureka” moment as I have experienced.

Matt and I were reviewing possible solutions to the problem of selecting representative genealogies, when I was reminded of an idea I had discussed with a student a couple of years earlier.  Could there be a way to build the ARG up one sequence at a time, by “threading” an nth sequence into an ARG of size n–1?  This idea had intuitive appeal but the previous student and I had never been able to make it work with full genealogies.  However, Matt immediately picked up the idea and ran with it.  It fit naturally with the way he was representing the local genealogies and thinking about how they were related and he guessed that it ought to be possible to think of it as a problem of adding one branch to each local tree, position by position along the sequence.

As we discussed further how one might set up a solution to the threading problem, I was suddenly struck by a parallel to a problem my group had worked on several years earlier.  That problem had to do with gene prediction in a multi-species setting, where the exon-intron boundaries for genes were allowed to differ from one species to the next.  The approach we had settled on was to repeatedly sample the gene annotations for one species conditional on those for all of the others, using a probabilistic model called a phylogenetic hidden Markov model.  It turned out that this sampling problem could be elegantly solved by dynamic programming, using an algorithm for hidden Markov models known as the stochastic traceback algorithm.
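For readers unfamiliar with the stochastic traceback, the idea in a generic HMM setting is to run the forward algorithm and then sample a hidden state path backward, exactly from the posterior distribution over paths.  The following is a toy implementation with scaled forward probabilities; the function and its arguments are illustrative, not the actual gene-prediction or ARGweaver code:

```python
import numpy as np

def sample_hmm_path(pi, A, E, obs, rng):
    """Stochastic traceback for a discrete HMM: compute (scaled) forward
    probabilities, then sample a state path from P(path | obs).
    pi: initial state probs (K,); A: transition matrix (K, K);
    E: emission probs (K, n_symbols); obs: sequence of symbol indices."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * E[:, obs[0]]
    alpha[0] /= alpha[0].sum()          # rescale to avoid underflow
    for t in range(1, T):
        alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    # Trace back, sampling each state conditional on its successor.
    path = np.empty(T, dtype=int)
    path[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * A[:, path[t + 1]]
        path[t] = rng.choice(K, p=w / w.sum())
    return path

# Tiny two-state example with sticky transitions.
rng = np.random.default_rng(1)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
E = np.array([[0.8, 0.2], [0.2, 0.8]])
path = sample_hmm_path(pi, A, E, [0, 0, 1, 1, 0], rng)
```

The key property is that repeated calls yield exact, independent draws from the posterior over paths in time linear in the sequence length, which is what makes the analogous threading operation tractable.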

Matt’s threading problem, it was evident, had the same essential structure, although it differed in many details.  Therefore, it ought to be possible to carry out the threading operation exactly, in time proportional to the sequence length, using the stochastic traceback algorithm.  Furthermore, it ought to be possible to structure this operation so that repeated applications would guarantee that we sampled from the true distribution of ARGs given the data and our model (i.e., so that we had a proper Gibbs sampler, in statistical jargon).  It was immediately clear to both of us that this was the right path forward.
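Schematically, the resulting Gibbs sampler alternates between removing one sequence from the ARG and re-threading it conditional on the rest.  The following is pseudocode only: `remove_thread` and `sample_thread` are placeholder names for the operations described above, not real functions.

```python
# Pseudocode sketch of a Gibbs sweep built from the threading operation.
def gibbs_sweep(arg, sequences, rng):
    for i in rng.permutation(len(sequences)):
        partial_arg = remove_thread(arg, i)            # ARG of size n-1
        arg = sample_thread(partial_arg, sequences[i]) # exact conditional draw
    return arg
```

Because each re-threading is an exact draw from the conditional distribution, repeated sweeps converge to the true distribution of ARGs given the data and the model.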

As usual, there was a good deal of work in taking these ideas from white-board scratchings to a fully defined model and a working implementation.  Not least among these challenges was the formidable task of generalizing the threading method to multiple sequences, which proved necessary for good MCMC convergence properties with more than a few sequences.  Matt was exceptionally diligent, resourceful, and creative in working through all of these steps, and eventually demonstrated that the method performs remarkably well, perhaps better than either of us anticipated.  Nevertheless, it was in that one exciting discussion about the threading operation that the project came into focus, making it clear that we had a fundamentally new way of tackling this longstanding problem that was worth the additional months that would be needed to see it through to a full-fledged implementation.

Most evolutionary biologists are familiar with the concept of “exaptation,” popularized by Stephen Jay Gould and Elizabeth Vrba.  Exaptation refers to the process by which evolution takes a phenotypic characteristic developed for one purpose, and co-opts it for use in another way.  Examples include the gas bladders used for buoyancy in some fish (derived from lungs) and the use of bird feathers for flight (thought to have evolved for temperature regulation).  The phenomenon, of course, is also common in the evolution of ideas.  To me, the threading idea was a perfect example of an algorithmic exaptation, the reuse of an idea developed for one problem in evolutionary sequence analysis and co-opted for use in another.  In this case, the threading operation proved considerably more powerful and useful for ARG inference than for gene prediction.

Concluding notes

In this post, I have tried to convey some of my excitement about genome-wide ARG inference, a topic that might strike the casual reader as dry and obscure.  Space does not allow for a discussion of all of the possible ways in which we are thinking of making use of Matt’s ARG sampling program (called ARGweaver), but our manuscript does include a fairly lengthy Discussion section, which lays out some of our ideas for future work.  Clearly there is much still to be done, but I am convinced that this is going to be an enormously powerful tool for population genomic analysis.

Finally, I should say that, while I have focused on our own efforts in this post, we are far from the only researchers actively working on this topic.  As described in our paper, there is a great deal of closely related recent work by Yun Song, Richard Durbin, Thomas Mailund, Asger Hobolth, Mikkel Schierup, and others.  In short, there is no shortage of opportunities for exaptation of ideas in this vibrant field.

Adam Siepel