# Introgression obscures and reveals historical relationships among the American live oaks

Deren Eaton, Antonio Gonzalez-Rodriguez, Andrew Hipp, Jeannine Cavender-Bares
doi: http://dx.doi.org/10.1101/016238

Introgressive hybridization challenges the concepts we use to define species and our ability to infer their evolutionary relationships. Methods for inferring historical introgression from the genomes of extant species are now widely used; however, few guidelines have been articulated for how best to interpret their results. Because these tests are inherently comparative, we show that they are sensitive to the effects of missing data (unsampled species) and to non-independence (hierarchical relationships among species). We demonstrate this using genomic RAD data sampled from populations across the geographic ranges of all extant species in the American live oaks (Quercus series Virentes), a group notorious for hybridization. By considering all species in the clade, and their phylogenetic relationships, we were able to distinguish true hybridizing lineages from those that falsely appear admixed due to phylogenetic structure among hybridizing relatives. Six of seven species show evidence of admixture, often with multiple other species, but this can be explained by hybrid introgression among a few related lineages where they occur in close proximity. We identify the Cuban oak as a highly admixed lineage and use an information-theoretic model comparison approach to test alternative scenarios for its origin. Hybrid speciation is a poor fit compared to a model in which a population from Central America colonized Cuba and received subsequent gene flow from Florida. The live oaks form a continuous ring-like distribution around the Gulf of Mexico, connected in Cuba, across which they could effectively exchange alleles. However, introgression appears to remain localized to areas of sympatry, suggesting that oak species boundaries and their geographic ranges have remained relatively stable over evolutionary time.
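
The abstract does not name the introgression tests it evaluates. One widely used comparative test is Patterson's D (the ABBA-BABA statistic), and the sketch below assumes that kind of four-taxon test purely for illustration. The function name `patterson_d` and the input layout (per-site derived-allele frequencies for taxa P1, P2, P3 and an outgroup) are hypothetical choices, not taken from the paper.

```python
def patterson_d(p1, p2, p3, outgroup):
    """Patterson's D from per-site derived-allele frequencies.

    Each argument is a sequence of frequencies (0.0 to 1.0) of the
    derived allele "B", one value per site, for the tree
    (((P1, P2), P3), outgroup).
    """
    abba = 0.0
    baba = 0.0
    for f1, f2, f3, fo in zip(p1, p2, p3, outgroup):
        # ABBA: P2 and P3 share the derived allele, P1 and outgroup do not.
        abba += (1 - f1) * f2 * f3 * (1 - fo)
        # BABA: P1 and P3 share the derived allele, P2 and outgroup do not.
        baba += f1 * (1 - f2) * f3 * (1 - fo)
    total = abba + baba
    if total == 0:
        return 0.0  # no informative sites
    return (abba - baba) / total


# Three toy sites: two ABBA patterns and one BABA pattern.
d = patterson_d([0, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0])
```

A positive D suggests excess allele sharing between P2 and P3, a negative D between P1 and P3. As the abstract cautions, the sign alone does not identify which lineages actually hybridized when relatives of P3 are unsampled or phylogenetically nested.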

# The Spatial Mixing of Genomes in Secondary Contact Zones

Alisa Sedghifar, Yaniv Brandvain, Peter L. Ralph, Graham Coop
doi: http://dx.doi.org/10.1101/016337

Recent genomic studies have highlighted the important role of admixture in shaping genome-wide patterns of diversity. Past admixture leaves a population genomic signature of linkage disequilibrium (LD), reflecting the mixing of parental chromosomes by segregation and recombination. The extent of this LD can be used to infer the timing of admixture. However, the results of inference can depend strongly on the assumed demographic model. Here, we introduce a theoretical framework for modeling patterns of LD in a geographic contact zone where two differentiated populations are diffusing back together. We derive expressions for the expected LD and admixture tract lengths across geographic space as a function of the age of the contact zone and the dispersal distance of individuals. We develop an approach to infer the age of contact zones using population genomic data from multiple spatially sampled populations by fitting our model to the decay of LD with recombination distance. We use our approach to explore the fit of a geographic contact zone model to three human population genomic datasets: populations along the Indonesian archipelago, in Central Asia, and in India.
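
To make "fitting the decay of LD with recombination distance" concrete, here is a minimal sketch under the simplest demographic assumption, a single pulse of admixture, where admixture LD at recombination distance $r$ (in Morgans) decays roughly as $A e^{-rt}$ after $t$ generations. This is not the spatial contact-zone model derived in the paper; it is the kind of baseline model whose assumptions the abstract says can strongly affect the inferred timing. The function name `fit_admixture_time` is hypothetical.

```python
import math


def fit_admixture_time(distances_morgans, ld_values):
    """Estimate (t, A) in LD(r) = A * exp(-r * t) by least squares on log(LD).

    distances_morgans: recombination distances between marker pairs.
    ld_values: positive LD measurements at those distances.
    Assumes a single admixture pulse t generations ago.
    """
    xs = list(distances_morgans)
    ys = [math.log(v) for v in ld_values]  # log-linearize the exponential
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free LD curve generated with t = 50 and A = 0.2.
r = [0.001, 0.005, 0.01, 0.02, 0.05]
ld = [0.2 * math.exp(-50 * x) for x in r]
t_hat, a_hat = fit_admixture_time(r, ld)
```

With real data the LD values are noisy and can be negative at long distances, so a nonlinear fit with weights would usually be preferable to this log-linear shortcut.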

# Eight thousand years of natural selection in Europe

Iain Mathieson, Iosif Lazaridis, Nadin Rohland, Swapan Mallick, Bastien Llamas, Joseph Pickrell, Harald Meller, Manuel A. Rojo Guerra, Johannes Krause, David Anthony, Dorcas Brown, Carles Lalueza Fox, Alan Cooper, Kurt W. Alt, Wolfgang Haak, Nick Patterson, David Reich
doi: http://dx.doi.org/10.1101/016477

The arrival of farming in Europe beginning around 8,500 years ago required adaptation to new environments, pathogens, diets, and social organizations. While evidence of natural selection can be revealed by studying patterns of genetic variation in present-day people, these patterns are only indirect echoes of past events, and provide little information about where and when selection occurred. Ancient DNA makes it possible to examine populations as they were before, during and after adaptation events, and thus to reveal the tempo and mode of selection. Here we report the first genome-wide scan for selection using ancient DNA, based on 83 human samples from Holocene Europe analyzed at over 300,000 positions. We find five genome-wide signals of selection, at loci associated with diet and pigmentation. Surprisingly, in light of suggestions of selection on immune traits associated with the advent of agriculture and denser living conditions, we find no strong sweeps associated with immunological phenotypes. We also report a scan for selection for complex traits, and find two signals of selection on height: for short stature in Iberia after the arrival of agriculture, and for tall stature on the Pontic-Caspian steppe earlier than 5,000 years ago. A surprise is that in Scandinavian hunter-gatherers living around 8,000 years ago, there is a high frequency of the derived allele at the EDAR gene that is the strongest known signal of selection in East Asians and that is thought to have arisen in East Asia. These results document the power of ancient DNA to reveal features of past adaptation that could not be understood from analyses of present-day people.

# Efficient computation of the joint sample frequency spectra for multiple populations

John A. Kamm, Jonathan Terhorst, Yun S. Song
(Submitted on 3 Mar 2015)

A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences. In particular, recently there has been growing interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. Although much methodological progress has been made, existing SFS-based inference methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable efficient computation of the expected joint SFS for multiple populations related by a complex demographic model with arbitrary population size histories (including piecewise exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study involving tens of populations, we demonstrate our improvements to numerical stability and computational complexity.
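
For readers unfamiliar with the statistic, the sketch below tabulates an observed joint SFS for two populations from toy haplotype data. It only illustrates the data summary itself; it does not reproduce the paper's formulas for the expected joint SFS, and it is unrelated to the momi implementation. The function name `joint_sfs` and the 0/1 haplotype encoding (1 = mutant allele) are assumptions made for this example.

```python
def joint_sfs(pop1, pop2):
    """Observed joint sample frequency spectrum for two populations.

    pop1, pop2: lists of haplotypes; each haplotype is a list of 0/1
    alleles (1 = mutant) over the same sites.
    Returns a table where table[i][j] is the number of sites with
    i mutant alleles in pop1 and j mutant alleles in pop2.
    """
    n1, n2 = len(pop1), len(pop2)
    n_sites = len(pop1[0])
    table = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for site in range(n_sites):
        i = sum(hap[site] for hap in pop1)  # mutant count in population 1
        j = sum(hap[site] for hap in pop2)  # mutant count in population 2
        table[i][j] += 1
    return table


# Two haplotypes per population, three sites.
sfs = joint_sfs([[0, 1, 1], [0, 1, 0]], [[1, 1, 0], [0, 1, 0]])
```

The table has $(n_1 + 1)(n_2 + 1)$ entries for two populations, and the number of entries grows multiplicatively with each added population, which gives a sense of why the abstract stresses computational cost for many populations and large samples.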

# Quality assessment for different haplotyping methods and GWAS sensitivity to phasing errors

Giovanni Busonera, Marco Cogoni, Gianluigi Zanetti
doi: http://dx.doi.org/10.1101/015669

In this report we present a multimarker association tool (Flash) based on a novel algorithm to generate haplotypes from raw genotype data. It belongs to the entropy minimization class of methods and is composed of a two-stage deterministic-heuristic part and an optional stochastic optimization. The algorithm scales well to huge datasets and runs faster than competing tools such as BEAGLE and MACH while maintaining comparable accuracy. A quality assessment of the results is carried out by comparing the switch error. Finally, the haplotypes are used to perform a haplotype-based Genome-wide Association Study (GWAS). The association results are compared with a multimarker and a single SNP association test performed with Plink. Our experiments confirm that the multimarker association test can be more powerful than the single SNP one, as stated in the literature. Moreover, Flash and Plink show similar results for the multimarker association test, but Flash reduces the computation time by about an order of magnitude when using haplotypes of 5 SNPs.
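
The switch error used for the quality assessment can be illustrated with a short sketch. It assumes the common definition: among consecutive heterozygous sites, the fraction of adjacent pairs at which the inferred phase flips relative to the true phase. The function name `switch_error_rate` and the input format are hypothetical, and the report may compute the metric differently in detail.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Switch error rate between a true and an inferred haplotype.

    Both inputs are 0/1 allele lists for ONE haplotype of the same
    individual, restricted to heterozygous sites (the only sites where
    phase is ambiguous).
    """
    if len(true_hap) != len(inferred_hap):
        raise ValueError("haplotypes must cover the same sites")
    # 0 where the inferred haplotype follows the true one,
    # 1 where it follows the complementary haplotype instead.
    mismatch = [int(t != i) for t, i in zip(true_hap, inferred_hap)]
    if len(mismatch) < 2:
        return 0.0  # no adjacent pairs, so no possible switches
    switches = sum(1 for a, b in zip(mismatch, mismatch[1:]) if a != b)
    return switches / (len(mismatch) - 1)


# The inferred phase flips after site 2 and flips back after site 5.
rate = switch_error_rate([0, 1, 0, 1, 1, 0], [0, 1, 1, 0, 0, 0])
```

Note that a fully complementary haplotype scores zero, because swapping the two haplotype labels of an individual is not a phasing error.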

# Differential Evolution Approach to Detect Recent Admixture

Konstantin Kozlov, Dmitry Chebotarov, Mehedi Hassan, Petr Triska, Martin Triska, Pavel Flegontov, Tatiana V Tatarinova
doi: http://dx.doi.org/10.1101/015446

The genetic structure of human populations is extraordinarily complex and of fundamental importance to studies of anthropology, evolution, and medicine. As increasingly many individuals are of mixed origin, there is an unmet need for tools that can infer multiple origins. Misclassification of such individuals can lead to incorrect and costly misinterpretations of genomic data, primarily in disease studies and drug trials. We present an advanced tool to infer ancestry that can identify the biogeographic origins of highly mixed individuals. reAdmix can incorporate an individual’s knowledge of their ancestors (e.g. having some ancestors from Turkey or a Scottish grandmother). reAdmix is an online tool available at http://chcb.saban-chla.usc.edu/reAdmix/.

# A Spatial Framework for Understanding Population Structure and Admixture.

Gideon Bradburd, Peter L. Ralph, Graham Coop
doi: http://dx.doi.org/10.1101/013474

Geographic patterns of genetic variation within modern populations, produced by complex histories of migration, can be difficult to infer and visually summarize. A general consequence of geographically limited dispersal is that samples from nearby locations tend to be more closely related than samples from distant locations, and so genetic covariance often recapitulates geographic proximity. We use genome-wide polymorphism data to build “geogenetic maps”, which, when applied to stationary populations, produce a map of the geographic positions of the populations, but with distances distorted to reflect historical rates of gene flow. In the underlying model, allele frequency covariance is a decreasing function of geogenetic distance, and nonlocal gene flow such as admixture can be identified as anomalously strong covariance over long distances. This admixture is explicitly co-estimated and depicted as arrows, from the source of admixture to the recipient, on the geogenetic map. We demonstrate the utility of this method on a circum-Tibetan sampling of the greenish warbler (Phylloscopus trochiloides), in which we find evidence for gene flow between the adjacent, terminal populations of the ring species. We also analyze a global sampling of human populations, for which we largely recover the geography of the sampling, with support for significant histories of admixture in many samples. This new tool for understanding and visualizing patterns of population structure is implemented in a Bayesian framework in the program SpaceMix.

# Scaling probabilistic models of genetic variation to millions of humans

Prem Gopalan, Wei Hao, David M. Blei, John D. Storey
doi: http://dx.doi.org/10.1101/013227

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. Researchers have developed sophisticated statistical methods to capture the complex population structure that underlies observed genotypes in humans. The number of humans that have been densely genotyped across the genome has grown significantly in recent years. In aggregate about 1M individuals have been densely genotyped to date, and if we could analyze this data then we would have a nearly complete picture of human genetic variation. Existing state-of-the-art methods, however, cannot scale to data of this size. To this end, we have developed TeraStructure. TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets ($10^{12}$ observed genotypes, e.g., 1M individuals at 1M SNPs). It is a principled approach to approximate Bayesian inference that iterates between subsampling locations of the genome and updating an estimate of the latent population structure. On real and simulated data sets of up to 10K individuals, TeraStructure is twice as fast as existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scales, TeraStructure continues to be accurate and is the only method that can complete its analysis.

# A new $F_{\text{ST}}$-based method to uncover local adaptation using environmental variables

Pierre de Villemereuil, Oscar E. Gaggiotti
Comments: 18 pages, 5 figures, Supplementary Information at the end of the document
Subjects: Populations and Evolution (q-bio.PE)

Genome-scan methods are used for screening genome-wide patterns of DNA polymorphism to detect signatures of positive selection. There are two main types of methods: (i) “outlier” detection methods based on $F_{\text{ST}}$ that detect loci with high differentiation compared to the rest of the genome and (ii) environmental association methods that test the association between allele frequencies and environmental variables. In this article, we present a new $F_{\text{ST}}$-based genome scan method, BayeScEnv, which incorporates environmental information in the form of “environmental differentiation”. It is based on the F model, but as opposed to existing approaches it considers two locus-specific effects, one due to divergent selection and another due to other processes such as differences in mutation rates across loci or background selection. Simulation studies showed that our method has a much lower false positive rate than an existing $F_{\text{ST}}$-based method, BayeScan, under a wide range of demographic scenarios. Although it had lower power, it leads to a better compromise between power and false positive rate. We apply our method to human and salmon datasets and show that it can be used successfully to study local adaptation. The method was developed in C++ and is available at this http URL
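
As background for the “outlier” idea, the sketch below computes a textbook per-locus $F_{\text{ST}} = (H_T - H_S)/H_T$ for a biallelic locus, assuming equally weighted populations and known allele frequencies. It is not the F model or the Bayesian machinery of BayeScEnv or BayeScan, and the function name `wright_fst` is hypothetical.

```python
def wright_fst(freqs):
    """Per-locus F_ST = (H_T - H_S) / H_T for a biallelic locus.

    freqs: allele frequency of one allele in each population
    (populations are weighted equally, an assumption of this sketch).
    """
    k = len(freqs)
    p_bar = sum(freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0  # locus is fixed everywhere, no differentiation
    h_s = sum(2 * p * (1 - p) for p in freqs) / k  # mean within-population
    return (h_t - h_s) / h_t


# Per-locus values; the third locus is the most differentiated.
loci = [[0.5, 0.5], [0.4, 0.6], [0.2, 0.8]]
fst_values = [wright_fst(f) for f in loci]
```

A naive outlier scan would flag loci whose value sits far in the upper tail of `fst_values`. The abstract's point is that high differentiation alone can also arise from processes other than divergent selection, which is why the method models a second locus-specific effect.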

# Visualizing spatial population structure with estimated effective migration surfaces

Desislava Petkova, John Novembre, Matthew Stephens
doi: http://dx.doi.org/10.1101/011809

Genetic data often exhibit patterns that are broadly consistent with “isolation by distance” – a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow can accelerate the genetic differentiation between groups located close in space. We use the concept of “effective migration” to model the relationship between genetics and geography: in this paradigm, effective migration is low in regions where genetic similarity decays quickly. We present a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data. Our approach uses a population genetic model to relate underlying migration rates to expected pairwise genetic dissimilarities, and estimates migration rates by matching these expectations to the observed dissimilarities. We illustrate the potential and limitations of our method using simulations and data from elephant, human, and Arabidopsis thaliana populations. The resulting visualizations highlight important features of the spatial population structure that are difficult to discern using existing methods for summarizing genetic variation such as principal components analysis.