Genome-wide inference of ancestral recombination graphs
Matthew D. Rasmussen, Adam Siepel
(Submitted on 21 Jun 2013)
The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the “ancestral recombination graph” (ARG), a complete record of all coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are extremely computationally intensive, depend on fairly crude approximations, or are limited to small numbers of samples. As a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to be applied on the scale of dozens of complete human genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call “threading”. Using techniques based on hidden Markov models, this threading operation can be performed exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG, for twenty or more sequences generated under realistic parameters for human populations. We also report initial results from applications of ARGweaver to high-coverage individual human genome sequences from Complete Genomics. Work is in progress on further applications of these methods to genome-wide sequence data.
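
As a rough illustration of how a path can be sampled exactly from a hidden Markov model, the sketch below runs a forward pass followed by a stochastic traceback on a generic discrete HMM. It is not ARGweaver's implementation. The function name `sample_hidden_path` and the toy parameters are hypothetical. Reading each hidden state as the point where the new sequence joins the existing genealogy at a given site is an assumption made only for illustration, and the sketch assumes a non-empty observation sequence with nonzero likelihood.

```python
import random


def sample_hidden_path(init, trans, emit, obs, rng=random):
    """Draw one hidden-state path from P(path | obs) for a discrete HMM.

    init[k]     : probability of starting in state k
    trans[j][k] : probability of moving from state j to state k
    emit[k][o]  : probability of observing symbol o in state k
    obs         : non-empty list of observed symbol indices
    """
    n_states = len(init)

    # Forward pass: fwd[t][k] is proportional to P(obs[0..t], state_t = k).
    fwd = []
    for t, o in enumerate(obs):
        if t == 0:
            row = [init[k] * emit[k][o] for k in range(n_states)]
        else:
            prev = fwd[-1]
            row = [
                emit[k][o] * sum(prev[j] * trans[j][k] for j in range(n_states))
                for k in range(n_states)
            ]
        total = sum(row)
        fwd.append([x / total for x in row])  # rescale each row to limit underflow

    # Stochastic traceback: sample the last state from the final forward row,
    # then each earlier state conditional on the state chosen after it.
    path = [0] * len(obs)
    path[-1] = rng.choices(range(n_states), weights=fwd[-1])[0]
    for t in range(len(obs) - 2, -1, -1):
        nxt = path[t + 1]
        weights = [fwd[t][j] * trans[j][nxt] for j in range(n_states)]
        path[t] = rng.choices(range(n_states), weights=weights)[0]
    return path


# Toy two-state example (all values are made up for illustration).
toy_init = [0.5, 0.5]
toy_trans = [[0.9, 0.1], [0.2, 0.8]]
toy_emit = [[0.8, 0.2], [0.3, 0.7]]
toy_path = sample_hidden_path(toy_init, toy_trans, toy_emit, [0, 0, 1, 1, 0])
```

Each call returns an independent draw from the posterior over paths, which is the property that lets a resampling step of this kind serve as a move within a Markov chain Monte Carlo sampler.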