A note on the distribution of admixture segment lengths and ancestry proportions under pulse and two-wave admixture models

Shai Carmi, James Xue, Itsik Pe’er
(Submitted on 19 Sep 2015)

Admixed populations are formed by the merging of two or more ancestral populations, and the ancestry of each locus in an admixed genome derives from either source. Consider a simple “pulse” admixture model, where populations A and B merged t generations ago without subsequent gene flow. We derive the distribution of the proportion of an admixed chromosome that has A (or B) ancestry, as a function of the chromosome length L, t, and the initial contribution of the A source, m. We demonstrate that these results can be used for inference of the admixture parameters. For more complex admixture models, we derive an expression in Laplace space for the distribution of ancestry proportions that depends on having the distribution of the lengths of segments of each ancestry. We obtain explicit results for the special case of a “two-wave” admixture model, where population A contributed additional migrants in one of the generations between the present and the initial admixture event. Specifically, we derive formulas for the distribution of A and B segment lengths and numerical results for the distribution of ancestry proportions. We show that for recent admixture, data generated under a two-wave model can hardly be distinguished from that generated under a pulse model.
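The pulse model above can be explored with a minimal simulation sketch; this is not the authors' derivation. It assumes the common Markov approximation in which ancestry switches along the chromosome as a two-state process: A segments are exponentially distributed with rate (1 - m)t per Morgan and B segments with rate mt. The function name and the parameter values are illustrative only.

```python
import random

def simulate_ancestry_proportion(L, t, m, rng):
    """Return the fraction of a chromosome of length L (Morgans) with A ancestry.

    Assumption: A segments end at rate (1 - m) * t and B segments at rate
    m * t (two-state Markov approximation). Requires 0 < m < 1 and t > 0.
    """
    rate_a = (1.0 - m) * t   # rate of leaving an A segment
    rate_b = m * t           # rate of leaving a B segment
    in_a = rng.random() < m  # ancestry at the left end, drawn with P(A) = m
    pos = 0.0
    a_length = 0.0
    while True:
        seg = rng.expovariate(rate_a if in_a else rate_b)
        if pos + seg >= L:
            # The last segment is truncated at the chromosome end.
            if in_a:
                a_length += L - pos
            break
        if in_a:
            a_length += seg
        pos += seg
        in_a = not in_a
    return a_length / L

rng = random.Random(1)
proportions = [simulate_ancestry_proportion(1.0, 10, 0.3, rng) for _ in range(4000)]
mean_proportion = sum(proportions) / len(proportions)
```

Under these assumptions the mean simulated proportion should be close to m, while the spread of the proportions across chromosomes depends on L and t, which is what makes the distribution informative about the admixture parameters.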


Inference under a Wright-Fisher model using an accurate beta approximation

Paula Tataru, Thomas Bataillon, Asger Hobolth
doi: http://dx.doi.org/10.1101/021261

The large amount and high quality of genomic data available today enables, in principle, accurate inference of evolutionary history of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytic form. Existing approximations build on the computationally intensive diffusion limit, or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (zero and one). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here, we introduce the beta with spikes, an extension of the beta approximation, which explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations, with comparable performance to existing state-of-the-art methods.
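The moment-matching idea behind the plain beta approximation can be sketched as follows. This covers only the baseline beta fit under pure drift (no mutation or selection), not the beta with spikes introduced in the paper. It uses the standard Wright-Fisher drift moments for a diploid population of size N: the mean stays at the initial frequency x0, and the variance is x0(1 - x0)(1 - (1 - 1/(2N))^t). The function names are hypothetical.

```python
def drift_moments(x0, n_diploid, t):
    """Mean and variance of the allele frequency after t generations of pure drift."""
    decay = (1.0 - 1.0 / (2.0 * n_diploid)) ** t
    mean = x0
    var = x0 * (1.0 - x0) * (1.0 - decay)
    return mean, var

def beta_from_moments(mean, var):
    """Shape parameters (alpha, beta) of the beta distribution with the given moments.

    Requires 0 < mean < 1 and 0 < var < mean * (1 - mean).
    """
    scale = mean * (1.0 - mean) / var - 1.0
    return mean * scale, (1.0 - mean) * scale

mean, var = drift_moments(x0=0.2, n_diploid=100, t=50)
alpha, beta = beta_from_moments(mean, var)
```

A beta distribution places no probability mass exactly at 0 or 1, which is the limitation that motivates adding explicit spikes for loss and fixation.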

An empirical approach to demographic inference

Peter L. Ralph
(Submitted on 21 May 2015)

Inference with population genetic data usually treats the population pedigree as a nuisance parameter, the unobserved product of a past history of random mating. However, the history of genetic relationships in a given population is a fixed, unobserved object, and so an alternative approach is to treat this network of relationships as a complex object we wish to learn about, by observing how genomes have been noisily passed down through it. This paper explores this point of view, showing how to translate questions about population genetic data into calculations with a Poisson process of mutations on all ancestral genomes. This method is applied to give a robust interpretation to the f4 statistic used to identify admixture, and to design a new statistic that measures covariances in mean times to most recent common ancestor between two pairs of sequences. The method more generally interprets population genetic statistics in terms of sums of specific functions over ancestral genomes, thereby providing concrete, broadly interpretable interpretations for these statistics. This provides a method for describing demographic history without simplified demographic models. More generally, it brings into focus the population pedigree, which is averaged over in model-based demographic inference.
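For readers unfamiliar with the f4 statistic mentioned above, here is a minimal sketch of its usual allele-frequency form: the average over loci of (p_A - p_B)(p_C - p_D). It illustrates only that conventional estimator, not the pedigree-based interpretation developed in the paper. The function name and the input frequencies are made up for the example.

```python
def f4(p_a, p_b, p_c, p_d):
    """Average over loci of (p_A - p_B) * (p_C - p_D).

    Each argument is a list of allele frequencies, one entry per locus,
    for populations A, B, C and D respectively.
    """
    if not (len(p_a) == len(p_b) == len(p_c) == len(p_d)):
        raise ValueError("all populations need the same number of loci")
    if not p_a:
        raise ValueError("at least one locus is required")
    total = sum((a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d))
    return total / len(p_a)

# Illustrative frequencies at two loci (not real data).
value = f4([0.1, 0.5], [0.3, 0.5], [0.2, 0.9], [0.6, 0.1])
```

When the pairs (A, B) and (C, D) are separated by an unadmixed tree, frequency differences within one pair are uncorrelated with those in the other, so the expected value is zero; systematic departures from zero are what make the statistic useful for detecting admixture.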

Inference of Ancestral Recombination Graphs through Topological Data Analysis

Pablo G. Camara, Arnold J. Levine, Raul Rabadan
(Submitted on 21 May 2015)

The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relations within and across species. Recombination, reassortment, horizontal gene transfer, and species hybridization constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from tens or hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs (ARGs) represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, ARGs are computationally costly to reconstruct and usually become infeasible for more than a few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build on previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. For that aim, we extend the notion of barcodes in persistent homology, largely increasing their sensitivity to recombination, and present a new type of summary graph (topological ARG, or tARG), analogous to ARGs, that captures ensembles of minimal recombination histories. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations and horizontal evolution in finches inhabiting the Galápagos Islands.

Bayesian Nonparametric Inference of Population Size Changes from Sequential Genealogies

Julia A Palacios, John Wakeley, Sohini Ramachandran
doi: http://dx.doi.org/10.1101/019216

Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Accurate methods are available for data from a single locus or from independent loci. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model which allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method’s credible intervals for population size as a function of time cover 90 percent of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with ARGweaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.

Detecting recent selective sweeps while controlling for mutation rate and background selection

Christian D. Huber, Michael DeGiorgio, Ines Hellmann, Rasmus Nielsen
doi: http://dx.doi.org/10.1101/018697

A composite likelihood ratio test implemented in the program SweepFinder is a commonly used method for scanning a genome for recent selective sweeps. SweepFinder uses information on the spatial pattern of the site frequency spectrum (SFS) around the selected locus. To avoid confounding effects of background selection and variation in the mutation process along the genome, the method is typically applied only to sites that are variable within species. However, the power to detect and localize selective sweeps can be greatly improved if invariable sites are also included in the analysis. In the spirit of a Hudson-Kreitman-Aguadé test, we suggest adding fixed differences relative to an outgroup to account for variation in mutation rate, thereby facilitating more robust and powerful analyses. We also develop a method for including background selection modeled as a local reduction in the effective population size. Using simulations we show that these advances lead to a gain in power while maintaining robustness to mutation rate variation. Furthermore, the new method also provides more precise localization of the causative mutation than methods using the spatial pattern of segregating sites alone.

Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 Genomes data

Nicolas Duforet-Frebourg, Guillaume Laval, Eric Bazin, Michael G.B. Blum
(Submitted on 8 Apr 2015)

Large-scale genomic data offers the perspective to decipher the genetic architecture of natural selection. To characterize natural selection, various analytical methods for detecting candidate genomic regions have been developed. We propose to perform genome-wide scans of natural selection using principal component analysis. We show that the common Fst index of genetic differentiation between populations can be viewed as a proportion of variance explained by the principal components. Looking at the correlations between genetic variants and each principal component provides a conceptual framework to detect genetic variants involved in local adaptation without any prior definition of populations. To validate the PCA-based approach, we consider the 1000 Genomes data (phase 1) after removal of recently admixed individuals resulting in 850 individuals coming from Africa, Asia, and Europe. The number of genetic variants is of the order of 36 million obtained with a low-coverage sequencing depth (3X). The correlations between genetic variation and each principal component provide well-known targets for positive selection (EDAR, SLC24A5, SLC45A2, DARC), and also new candidate genes (APPBPP2, TP1A1, RTTN, KCNMA, MYO5C) and non-coding RNAs. In addition to identifying genes involved in biological adaptation, we identify two biological pathways involved in polygenic adaptation that are related to the innate immune system (beta defensins) and to lipid metabolism (fatty acid omega oxidation). PCA-based statistics retrieve well-known signals of human adaptation, which is encouraging for future whole-genome sequencing projects, especially in non-model species for which defining populations can be difficult. Genome scan based on PCA is implemented in the open-source and freely available PCAdapt software.

Fast principal components analysis reveals independent evolution of ADH1B gene in Europe and East Asia

Kevin J Galinsky, Gaurav Bhatia, Po-Ru Loh, Stoyan Georgiev, Sayan Mukherjee, Nick J Patterson, Alkes L Price
doi: http://dx.doi.org/10.1101/018143

Principal components analysis (PCA) is a widely used tool for inferring population structure and correcting confounding in genetic data. We introduce a new algorithm, FastPCA, that leverages recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using a new test for natural selection based on population differentiation along these PCs, we replicate previously known selected loci and identify three new signals of selection, including selection in Europeans at the ADH1B gene. The coding variant rs1229984 has previously been associated with alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents.

Predicting Carriers of Ongoing Selective Sweeps Without Knowledge of the Favored Allele

Roy Ronen, Glenn Tesler, Ali Akbari, Shay Zakov, Noah A Rosenberg, Vineet Bafna

Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.
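The abstract does not spell out how the HAF score is computed. The sketch below assumes the simplest form usually given for it: the score of a haplotype is the sum, over the derived alleles it carries, of the number of haplotypes in the sample carrying each of those alleles. Treat that definition, the function name, and the toy data as assumptions made for illustration; this is not the PreCIOSS algorithm.

```python
def haf_scores(haplotypes):
    """Return one HAF score per haplotype.

    `haplotypes` is a list of equal-length lists of 0/1 values, where 1 marks
    a derived allele. Assumed definition: a haplotype's score is the sum of
    the sample counts of the derived alleles it carries.
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must cover the same sites")
    # Number of haplotypes carrying the derived allele at each site.
    derived_counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    return [
        sum(derived_counts[j] for j in range(n_sites) if h[j] == 1)
        for h in haplotypes
    ]

# Toy sample: four haplotypes, three sites (not real data).
sample = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
scores = haf_scores(sample)
```

Haplotypes that share many common derived alleles, as carriers of a sweeping allele tend to, receive high scores under this definition, while a haplotype carrying only a rare variant scores low.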

New Routes to Phylogeography

Nicola De Maio, Chieh-Hsi Wu, Kathleen M O’Reilly, Daniel Wilson
(Submitted on 27 Mar 2015)

Phylogeographic methods aim to infer migration trends and the history of sampled lineages from genetic data. Applications of phylogeography are broad, and in the context of pathogens include the reconstruction of transmission histories and the origin and emergence of outbreaks. Phylogeographic inference based on bottom-up population genetics models is computationally expensive, and as a result faster alternatives based on the evolution of discrete traits have become popular. In this paper, we show that inference of migration rates and root locations based on discrete trait models is extremely unreliable and sensitive to biased sampling. To address this problem, we introduce BASTA (BAyesian STructured coalescent Approximation), a new approach implemented in BEAST2 that combines the accuracy of methods based on the structured coalescent with the computational efficiency required to handle more than just a few populations. We illustrate the potentially severe implications of poor model choice for phylogeographic analyses by investigating the zoonotic transmission of Ebola virus. Whereas the structured coalescent analysis correctly infers that successive human Ebola outbreaks have been seeded by a large unsampled non-human reservoir population, the discrete trait analysis implausibly concludes that undetected human-to-human transmission has allowed the virus to persist over the past four decades. As genomics takes on an increasingly prominent role informing the control and prevention of infectious diseases, it will be vital that phylogeographic inference provides robust insights into transmission history.