Fernando Racimo, Gabriel Renaud, Montgomery Slatkin
When an ancient DNA sample from a hominin fossil is sequenced, DNA from present-day humans involved in excavation and extraction is sequenced along with the endogenous material. This contamination is problematic for downstream analyses because it introduces a bias towards the population to which the contaminating individuals belong. Quantifying the extent of contamination is therefore a crucial step, as it allows researchers to account for the biases that may arise in downstream genetic analyses. Here, we present an MCMC algorithm to co-estimate the contamination rate, sequencing error rate and demographic parameters, including drift times and admixture rates, for an ancient nuclear genome obtained from human remains, when the putative contaminating DNA comes from present-day humans. We assume we have a large panel representing the putative contaminating population (e.g. European, East Asian or African). The method is implemented in a C++ program called ‘Demographic Inference with Contamination and Error’ (DICE). The program can also be used to determine the most likely population to which the contaminant DNA belongs. We applied it to simulations and to Neanderthal genome data, and recovered accurate estimates of all parameters, even when the average sequencing coverage was low (0.5X) and the per-read contamination rate was high (25%).
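To make the idea of estimating contamination from read counts and a contaminant panel concrete, the following is a minimal Python sketch; it is not the DICE implementation. It assumes a simplified per-read model in which a read is a contaminant with probability `contam` (and then carries the derived allele with the panel frequency `panel_freq`), is otherwise drawn from the ancient genotype, and is flipped with probability `err`. The function names, the fixed genotype prior (standing in for the demographic model), the flat prior on the contamination rate and the random-walk Metropolis sampler are all illustrative assumptions. Only the contamination rate is sampled here, with the error rate held fixed.

```python
import math
import random


def read_derived_prob(genotype, contam, err, panel_freq):
    """Probability that one read shows the derived allele.

    genotype: number of derived alleles in the ancient individual (0, 1 or 2).
    """
    # Mixture of contaminant reads and endogenous reads (assumed model).
    q = contam * panel_freq + (1.0 - contam) * (genotype / 2.0)
    # Symmetric sequencing error flips the observed allele.
    return q * (1.0 - err) + (1.0 - q) * err


def site_log_likelihood(derived, total, contam, err, panel_freq, genotype_prior):
    """Log-likelihood of `derived` derived reads out of `total` at one site,
    summing over the three possible ancient genotypes."""
    lik = 0.0
    for genotype, prior in enumerate(genotype_prior):
        p = read_derived_prob(genotype, contam, err, panel_freq)
        lik += (prior * math.comb(total, derived)
                * p ** derived * (1.0 - p) ** (total - derived))
    return math.log(lik) if lik > 0.0 else -math.inf


def metropolis_contamination(sites, err, genotype_prior,
                             n_iter=2000, step=0.05, seed=0):
    """Random-walk Metropolis sampler for the contamination rate only.

    sites: list of (derived_reads, total_reads, panel_freq) tuples.
    A flat prior on [0, 1] is assumed for the contamination rate.
    """
    rng = random.Random(seed)

    def log_post(c):
        if not 0.0 <= c <= 1.0:
            return -math.inf  # outside the support of the flat prior
        return sum(site_log_likelihood(d, n, c, err, w, genotype_prior)
                   for d, n, w in sites)

    current = 0.5
    current_lp = log_post(current)
    samples = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples
```

As a usage example under these assumptions, calling `metropolis_contamination([(25, 100, 1.0)] * 20, err=0.001, genotype_prior=(1.0, 0.0, 0.0))` and averaging the samples after a burn-in gives a point estimate of the contamination rate for toy data in which the ancient individual is homozygous ancestral and the panel is fixed for the derived allele. A fuller treatment would also sample the error rate and replace the fixed genotype prior with one derived from drift times and admixture rates.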