Genome-wide inference of ancestral recombination graphs

Matthew D. Rasmussen, Adam Siepel

(Submitted on 21 Jun 2013)

The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the “ancestral recombination graph” (ARG), a complete record of all coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are extremely computationally intensive, depend on fairly crude approximations, or are limited to small numbers of samples. As a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to be applied on the scale of dozens of complete human genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call “threading”. Using techniques based on hidden Markov models, this threading operation can be performed exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated applications of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG, for twenty or more sequences generated under realistic parameters for human populations. We also report initial results from applications of ARGweaver to high-coverage individual human genome sequences from Complete Genomics. Work is in progress on further applications of these methods to genome-wide sequence data.

Pingback: Our Paper: Genome-wide inference of ancestral recombination graphs | Haldane's Sieve

Looks really nice – I think this area of research is very important for inference of population history using full genomes (or, at least, full chromosomes). I’m wondering if you’ve had a chance to look at ACG (http://www.biomedcentral.com/1471-2105/14/40/), which seems to have some similar features? ACG also uses a Bayesian MCMC procedure, but it’s certainly not doing any Gibbs sampling, and its proposal algorithms borrow more from BEAST and SMARTIE (now defunct, I believe) than PSMC. IMHO it might be nice to include a comparison to ACG since it seems to tackle the same problem. Excellent work – ARGweaver seems really interesting.

Thanks Brendan. We had seen your recent Genetics paper but not this one. We will have a look. I will, say, however, that I think the threading approach has some major advantages over the alternatives as a proposal distribution. The key is that it is very directly driven by the data, as opposed to making proposals under the prior. This allows it to make relatively large MCMC “moves” and maintain a high acceptance rate, even with large samples. In this way, it addresses a major limitation of all previous MCMC methods for ARG sampling that I know of.

Pingback: Most viewed on Haldane’s Sieve: July 2013 | Haldane's Sieve

Pingback: Most viewed on Haldane’s Sieve: August 2013 | Haldane's Sieve

Pingback: Most viewed on Haldane’s Sieve: September 2013 | Haldane's Sieve

Pingback: Our Paper: Genome-wide inference of ancestral recombination graphs | Siepel lab