Fast principal components analysis reveals independent evolution of ADH1B gene in Europe and East Asia
Kevin J Galinsky, Gaurav Bhatia, Po-Ru Loh, Stoyan Georgiev, Sayan Mukherjee, Nick J Patterson, Alkes L Price
Principal components analysis (PCA) is a widely used tool for inferring population structure and correcting confounding in genetic data. We introduce a new algorithm, FastPCA, that leverages recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using a new test for natural selection based on population differentiation along these PCs, we replicate previously known selected loci and identify three new signals of selection, including selection in Europeans at the ADH1B gene. The coding variant rs1229984 has previously been associated with alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents.
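To make the idea of approximating top PCs with random projections concrete, here is a minimal sketch of a generic randomized PCA in Python with NumPy. It is not the FastPCA implementation: the function name, the oversampling and iteration parameters, and the use of QR re-orthonormalization are all assumptions chosen for illustration. The point it shows is that every step is a product of the genotype matrix with a thin matrix, so cost grows linearly with the number of individuals rather than quadratically.

```python
import numpy as np

def randomized_pca(X, k, n_oversample=10, n_iter=4, seed=0):
    """Approximate the top-k PCs of X (individuals x SNPs).

    Hypothetical helper for illustration; not the FastPCA code.
    Returns (U, s, Vt) restricted to the top k components.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)              # center each SNP column
    n, m = Xc.shape
    l = k + n_oversample                 # a few extra directions improve accuracy
    Q = rng.standard_normal((m, l))      # random starting subspace in SNP space
    for _ in range(n_iter):
        # Power iteration: alternate between individual and SNP space,
        # re-orthonormalizing each time for numerical stability.
        Q, _ = np.linalg.qr(Xc @ Q)      # n x l
        Q, _ = np.linalg.qr(Xc.T @ Q)    # m x l
    Q, _ = np.linalg.qr(Xc @ Q)          # n x l basis for the dominant column space
    B = Q.T @ Xc                         # small l x m matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                      # lift back to individual space
    return U[:, :k], s[:k], Vt[:k]
```

The exact eigendecomposition of the individuals-by-individuals covariance matrix is never formed here, which is where a quadratic cost in sample size would otherwise come from.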
Predicting Carriers of Ongoing Selective Sweeps Without Knowledge of the Favored Allele
Roy Ronen, Glenn Tesler, Ali Akbari, Shay Zakov, Noah A Rosenberg, Vineet Bafna
Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.
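The abstract does not spell out how the HAF score is calculated. As a rough illustration, the sketch below assumes the simplest form of the definition: the score of a haplotype is the sum, over the derived alleles it carries, of the number of sampled haplotypes carrying each of those alleles. The function name and the binary-matrix input format are hypothetical, and the paper should be consulted for the exact definition and its variants.

```python
import numpy as np

def haf_scores(H):
    """HAF-style score per haplotype (assumed definition, see lead-in).

    H is a binary matrix, haplotypes x sites, with 1 = derived allele.
    """
    H = np.asarray(H)
    derived_counts = H.sum(axis=0)   # copies of the derived allele at each site
    # Each haplotype sums the counts of the derived alleles it carries.
    return H @ derived_counts

# Toy sample: three haplotypes, three segregating sites.
H = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
print(haf_scores(H))   # [3 2 1]
```

Under this assumed definition, haplotypes that share many common derived alleles score highly, which is the property one would expect of carriers of a sweeping haplotype.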
New Routes to Phylogeography
Nicola De Maio, Chieh-Hsi Wu, Kathleen M O’Reilly, Daniel Wilson
(Submitted on 27 Mar 2015)
Phylogeographic methods aim to infer migration trends and the history of sampled lineages from genetic data. Applications of phylogeography are broad, and in the context of pathogens include the reconstruction of transmission histories and the origin and emergence of outbreaks. Phylogeographic inference based on bottom-up population genetics models is computationally expensive, and as a result faster alternatives based on the evolution of discrete traits have become popular. In this paper, we show that inference of migration rates and root locations based on discrete trait models is extremely unreliable and sensitive to biased sampling. To address this problem, we introduce BASTA (BAyesian STructured coalescent Approximation), a new approach implemented in BEAST2 that combines the accuracy of methods based on the structured coalescent with the computational efficiency required to handle more than just a few populations. We illustrate the potentially severe implications of poor model choice for phylogeographic analyses by investigating the zoonotic transmission of Ebola virus. Whereas the structured coalescent analysis correctly infers that successive human Ebola outbreaks have been seeded by a large unsampled non-human reservoir population, the discrete trait analysis implausibly concludes that undetected human-to-human transmission has allowed the virus to persist over the past four decades. As genomics takes on an increasingly prominent role informing the control and prevention of infectious diseases, it will be vital that phylogeographic inference provides robust insights into transmission history.
Introgression obscures and reveals historical relationships among the American live oaks
Deren Eaton, Antonio Gonzalez-Rodriguez, Andrew Hipp, Jeannine Cavender-Bares
Introgressive hybridization challenges the concepts we use to define species and our ability to infer their evolutionary relationships. Methods for inferring historical introgression from the genomes of extant species are now widely used; however, few guidelines have been articulated for how best to interpret their results. Because these tests are inherently comparative, we show that they are sensitive to the effects of missing data (unsampled species) and to non-independence (hierarchical relationships among species). We demonstrate this using genomic RAD data sampled from populations across the geographic ranges of all extant species in the American live oaks (Quercus series Virentes), a group notorious for hybridization. By considering all species in the clade, and their phylogenetic relationships, we were able to distinguish true hybridizing lineages from those that falsely appear admixed due to phylogenetic structure among hybridizing relatives. Six of seven species show evidence of admixture, often with multiple other species, but which can be explained by hybrid introgression among few related lineages where they occur in close proximity. We identify the Cuban oak as a highly admixed lineage and use an information-theoretic model comparison approach to test alternative scenarios for its origin. Hybrid speciation is a poor fit compared to a model in which a population from Central America colonized Cuba and received subsequent gene flow from Florida. The live oaks form a continuous ring-like distribution around the Gulf of Mexico, connected in Cuba, across which they could effectively exchange alleles. However, introgression appears to remain localized to areas of sympatry, suggesting that oak species boundaries and their geographic ranges have remained relatively stable over evolutionary time.
The Spatial Mixing of Genomes in Secondary Contact Zones
Alisa Sedghifar, Yaniv Brandvain, Peter L. Ralph, Graham Coop
Recent genomic studies have highlighted the important role of admixture in shaping genome-wide patterns of diversity. Past admixture leaves a population genomic signature of linkage disequilibrium (LD), reflecting the mixing of parental chromosomes by segregation and recombination. The extent of this LD can be used to infer the timing of admixture. However, the results of inference can depend strongly on the assumed demographic model. Here, we introduce a theoretical framework for modeling patterns of LD in a geographic contact zone where two differentiated populations are diffusing back together. We derive expressions for the expected LD and admixture tract lengths across geographic space as a function of the age of the contact zone and the dispersal distance of individuals. We develop an approach to infer age of contact zones using population genomic data from multiple spatially sampled populations by fitting our model to the decay of LD with recombination distance. We use our approach to explore the fit of a geographic contact zone model to three human population genomic datasets from populations along the Indonesian archipelago, populations in Central Asia and populations in India.
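As a point of comparison for the idea of fitting LD decay against recombination distance, the sketch below fits the standard single-pulse admixture model, in which admixture LD between two markers separated by genetic distance d (in Morgans) decays as A·exp(−t·d) after t generations. This is deliberately not the spatial contact-zone model of the paper, whose expressions are not given in the abstract; the function name and the log-linear fitting approach are assumptions for illustration only.

```python
import numpy as np

def fit_pulse_admixture_time(d_morgans, ld):
    """Fit ld ~ A * exp(-t * d) by least squares on log(ld).

    Hypothetical helper: a single-pulse baseline, not the paper's
    geographic model. Returns (t, A), with t in generations.
    Requires strictly positive LD values.
    """
    d = np.asarray(d_morgans, dtype=float)
    y = np.log(np.asarray(ld, dtype=float))
    slope, intercept = np.polyfit(d, y, 1)   # straight line in log space
    return -slope, np.exp(intercept)

# Synthetic, noise-free decay curve for a pulse 50 generations ago.
d = np.linspace(0.005, 0.2, 40)
ld = 0.3 * np.exp(-50 * d)
t_hat, a_hat = fit_pulse_admixture_time(d, ld)
```

The abstract's caution applies directly to a fit like this: the estimated time is only as good as the assumed demographic model, which is the motivation for modeling the contact zone explicitly.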
Eight thousand years of natural selection in Europe
Iain Mathieson, Iosif Lazaridis, Nadin Rohland, Swapan Mallick, Bastien Llamas, Joseph Pickrell, Harald Meller, Manuel A. Rojo Guerra, Johannes Krause, David Anthony, Dorcas Brown, Carles Lalueza Fox, Alan Cooper, Kurt W. Alt, Wolfgang Haak, Nick Patterson, David Reich
The arrival of farming in Europe beginning around 8,500 years ago required adaptation to new environments, pathogens, diets, and social organizations. While evidence of natural selection can be revealed by studying patterns of genetic variation in present-day people, these patterns are only indirect echoes of past events, and provide little information about where and when selection occurred. Ancient DNA makes it possible to examine populations as they were before, during and after adaptation events, and thus to reveal the tempo and mode of selection. Here we report the first genome-wide scan for selection using ancient DNA, based on 83 human samples from Holocene Europe analyzed at over 300,000 positions. We find five genome-wide signals of selection, at loci associated with diet and pigmentation. Surprisingly, in light of suggestions of selection on immune traits associated with the advent of agriculture and denser living conditions, we find no strong sweeps associated with immunological phenotypes. We also report a scan for selection for complex traits, and find two signals of selection on height: for short stature in Iberia after the arrival of agriculture, and for tall stature on the Pontic-Caspian steppe earlier than 5,000 years ago. A surprise is that in Scandinavian hunter-gatherers living around 8,000 years ago, there is a high frequency of the derived allele at the EDAR gene that is the strongest known signal of selection in East Asians and that is thought to have arisen in East Asia. These results document the power of ancient DNA to reveal features of past adaptation that could not be understood from analyses of present-day people.
Efficient computation of the joint sample frequency spectra for multiple populations
John A. Kamm, Jonathan Terhorst, Yun S. Song
(Submitted on 3 Mar 2015)
A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences. In particular, recently there has been growing interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. Although much methodological progress has been made, existing SFS-based inference methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable efficient computation of the expected joint SFS for multiple populations related by a complex demographic model with arbitrary population size histories (including piecewise exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study involving tens of populations, we demonstrate our improvements to numerical stability and computational complexity.
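For readers unfamiliar with the joint SFS, the sketch below tabulates the observed joint spectrum for two populations from binary haplotype matrices. It does not implement the paper's contribution, which is computing the expected joint SFS under a demographic model; it only shows the data summary that such expectations would be compared against. The function name and input layout are assumptions for illustration.

```python
import numpy as np

def joint_sfs(H1, H2):
    """Observed two-population joint SFS (illustrative helper).

    H1: n1 x L and H2: n2 x L binary matrices over the same L sites,
    with 1 = mutant (derived) allele. Entry [i, j] of the result counts
    sites with i mutant copies in population 1 and j in population 2.
    """
    H1, H2 = np.asarray(H1), np.asarray(H2)
    n1, n2 = H1.shape[0], H2.shape[0]
    c1, c2 = H1.sum(axis=0), H2.sum(axis=0)   # mutant counts per site
    sfs = np.zeros((n1 + 1, n2 + 1), dtype=int)
    np.add.at(sfs, (c1, c2), 1)               # accumulates repeated index pairs
    return sfs

# Toy data: 2 haplotypes in population 1, 3 in population 2, 3 sites.
H1 = [[1, 0, 1],
      [1, 0, 0]]
H2 = [[0, 1, 1],
      [0, 1, 1],
      [0, 0, 1]]
sfs = joint_sfs(H1, H2)
```

The table has (n1 + 1)(n2 + 1) entries for two populations and grows multiplicatively with each added population, which gives some intuition for why computational cost becomes a concern with many populations and large samples.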