A note on the distribution of admixture segment lengths and ancestry proportions under pulse and two-wave admixture models
Shai Carmi, James Xue, Itsik Pe’er
(Submitted on 19 Sep 2015)
Admixed populations are formed by the merging of two or more ancestral populations, and the ancestry of each locus in an admixed genome derives from either source. Consider a simple “pulse” admixture model, where populations A and B merged t generations ago without subsequent gene flow. We derive the distribution of the proportion of an admixed chromosome that has A (or B) ancestry, as a function of the chromosome length L, t, and the initial contribution of the A source, m. We demonstrate that these results can be used for inference of the admixture parameters. For more complex admixture models, we derive an expression in Laplace space for the distribution of ancestry proportions that depends on having the distribution of the lengths of segments of each ancestry. We obtain explicit results for the special case of a “two-wave” admixture model, where population A contributed additional migrants in one of the generations between the present and the initial admixture event. Specifically, we derive formulas for the distribution of A and B segment lengths and numerical results for the distribution of ancestry proportions. We show that for recent admixture, data generated under a two-wave model can hardly be distinguished from that generated under a pulse model.
Inference under a Wright-Fisher model using an accurate beta approximation
Paula Tataru, Thomas Bataillon, Asger Hobolth
The large amount and high quality of genomic data available today enables, in principle, accurate inference of evolutionary history of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytic form. Existing approximations build on the computationally intensive diffusion limit, or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (zero and one). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here, we introduce the beta with spikes, an extension of the beta approximation, which explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations, with comparable performance to existing state-of-the-art method.
An empirical approach to demographic inference
Peter L. Ralph
(Submitted on 21 May 2015)
Inference with population genetic data usually treats the population pedigree as a nuisance parameter, the unobserved product of a past history of random mating. However, the history of genetic relationships in a given population is a fixed, unobserved object, and so an alternative approach is to treat this network of relationships as a complex object we wish to learn about, by observing how genomes have been noisily passed down through it. This paper explores this point of view, showing how to translate questions about population genetic data into calculations with a Poisson process of mutations on all ancestral genomes. This method is applied to give a robust interpretation to the f4 statistic used to identify admixture, and to design a new statistic that measures covariances in mean times to most recent common ancestor between two pairs of sequences. The method more generally interprets population genetic statistics in terms of sums of specific functions over ancestral genomes, thereby providing concrete, broadly interpretable interpretations for these statistics. This provides a method for describing demographic history without simplified demographic models. More generally, it brings into focus the population pedigree, which is averaged over in model-based demographic inference.
Inference of Ancestral Recombination Graphs through Topological Data Analysis
Pablo G. Camara, Arnold J. Levine, Raul Rabadan
(Submitted on 21 May 2015)
The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relations within and across species. Recombination, reassortment, horizontal gene transfer, and species hybridization constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from tens or hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs (ARGs) represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, ARGs are computationally costly to reconstruct and usually become infeasible for more than few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build on previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. For that aim, we extend the notion of barcodes in persistent homology, largely increasing their sensitivity to recombination, and present a new type of summary graph (topological ARG, or tARG), analogous to ARGs, that capture ensembles of minimal recombination histories. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations and horizontal evolution in finches inhabiting the Gal\’apagos Islands.
Bayesian Nonparametric Inference of Population Size Changes from Sequential Genealogies
Julia A Palacios , John Wakeley, Sohini Ramachandran
Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Accurate methods are available for data from a single locus or from independent loci. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model which allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method’s credible intervals for population size as a function of time cover 90 percent of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with ARGweaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.
Detecting recent selective sweeps while controlling for mutation rate and background selection
Christian D. Huber , Michael DeGiorgio , Ines Hellmann , Rasmus Nielsen
A composite likelihood ratio test implemented in the program SweepFinder is a commonly used method for scanning a genome for recent selective sweeps. SweepFinder uses information on the spatial pattern of the site frequency spectrum (SFS) around the selected locus. To avoid confounding effects of background selection and variation in the mutation process along the genome, the method is typically applied only to sites that are variable within species. However, the power to detect and localize selective sweeps can be greatly improved if invariable sites are also included in the analysis. In the spirit of a Hudson-Kreitman-Aguadé test, we suggest to add fixed differences relative to an outgroup to account for variation in mutation rate, thereby facilitating more robust and powerful analyses. We also develop a method for including background selection modeled as a local reduction in the effective population size. Using simulations we show that these advances lead to a gain in power while maintaining robustness to mutation rate variation. Furthermore, the new method also provides more precise localization of the causative mutation than methods using the spatial pattern of segregating sites alone.
Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 Genomes data
Nicolas Duforet-Frebourg, Guillaume Laval, Eric Bazin, Michael G.B. Blum
(Submitted on 8 Apr 2015)
Large-scale genomic data offers the perspective to decipher the genetic architecture of natural selection. To characterize natural selection, various analytical methods for detecting candidate genomic regions have been developed. We propose to perform genome-wide scans of natural selection using principal component analysis. We show that the common Fst index of genetic differentiation between populations can be viewed as a proportion of variance explained by the principal components. Looking at the correlations between genetic variants and each principal component provides a conceptual framework to detect genetic variants involved in local adaptation without any prior definition of populations. To validate the PCA-based approach, we consider the 1000 Genomes data (phase 1) after removal of recently admixed individuals resulting in 850 individuals coming from Africa, Asia, and Europe. The number of genetic variants is of the order of 36 millions obtained with a low-coverage sequencing depth (3X). The correlations between genetic variation and each principal component provide well-known targets for positive selection (EDAR, SLC24A5, SLC45A2, DARC), and also new candidate genes (APPBPP2, TP1A1, RTTN, KCNMA, MYO5C) and non-coding RNAs. In addition to identifying genes involved in biological adaptation, we identify two biological pathways involved in polygenic adaptation that are related to the innate immune system (beta defensins) and to lipid metabolism (fatty acid omega oxidation). PCA-based statistics retrieve well-known signals of human adaptation, which is encouraging for future whole-genome sequencing project, especially in non-model species for which defining populations can be difficult. Genome scan based on PCA is implemented in the open-source and freely available PCAdapt software.