Hierarchical Bayesian model of population structure reveals convergent adaptation to high altitude in human populations

Matthieu Foll, Oscar E. Gaggiotti, Josephine T. Daub, Laurent Excoffier
(Submitted on 18 Feb 2014)

Detecting genes involved in local adaptation is challenging and of fundamental importance in evolutionary, quantitative, and medical genetics. To this end, a standard strategy is to perform genome scans in populations of different origins and environments, looking for genomic regions of high differentiation. Because shared population history or population sub-structure may lead to an excess of false positives, analyses are often done on multiple pairs of populations, which leads to i) a loss of power as compared to a global analysis, and ii) the need for multiple-testing corrections. In order to alleviate these problems, we introduce a new hierarchical Bayesian method to detect markers under selection that can deal with complex demographic histories, where sampled populations share part of their history. Simulations show that our approach is both more powerful and less prone to false positive loci than approaches based on separate analyses of pairs of populations or those ignoring existing complex structures. In addition, our method can identify selection occurring at different levels (i.e. population or region-specific adaptation), as well as convergent selection in different regions. We apply our approach to the analysis of a large SNP dataset from low- and high-altitude human populations from America and Asia. The simultaneous analysis of these two geographic areas allows us to identify several new candidate genome regions for altitudinal selection, and we show that convergent evolution among continents has been quite common. In addition to identifying several genes and biological processes involved in high altitude adaptation, we identify two specific biological pathways that could have evolved in both continents to counter toxic effects induced by hypoxia.

Investigating speciation in face of polyploidization: what can we learn from approximate Bayesian computation approach?

Camille Roux, John Pannell

Despite its importance in the diversification of many eukaryote clades, particularly plants, detailed genomic analysis of polyploid species is still in its infancy, with published analyses of only a handful of model species to date. Fundamental questions concerning the origin of polyploid lineages (e.g., auto- vs. allopolyploidy) and the extent to which polyploid genomes display different modes of inheritance are poorly resolved for most polyploids, not least because they have hitherto required detailed karyotypic analysis or the analysis of allele segregation at multiple loci in pedigrees or artificial crosses, which are often not practical for non-model species. However, the increasing availability of sequence data for non-model species now presents an opportunity to apply established approaches for the evolutionary analysis of genomic data to polyploid species complexes. Here, we ask whether approximate Bayesian computation (ABC), applied to sequence data produced by next-generation sequencing technologies from polyploid taxa, allows correct inference of the evolutionary and demographic history of polyploid lineages and their close relatives. We use simulations to investigate how the number of sampled individuals, the number of surveyed loci and their length affect the accuracy and precision of evolutionary and demographic inferences by ABC, including the mode of polyploidisation, mode of inheritance of polyploid taxa, the relative timing of genome duplication and speciation, and effective population sizes of contributing lineages. We also apply the ABC framework we develop to sequence data from diploid and polyploid species of the plant genus Capsella, for which we infer an allopolyploid origin for tetraploid C. bursa-pastoris ≈ 90,000 years ago. In general, our results indicate that ABC is a promising and powerful method for uncovering the origin and subsequent evolution of polyploid species.
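
For readers unfamiliar with ABC, the basic rejection scheme can be sketched in a few lines. This is a toy illustration only, not the authors' framework: their analysis compares population-genetic summary statistics from simulated polyploid histories, whereas this sketch infers a single success probability from a count. The function names, the Uniform(0, 1) prior, the tolerance and the observed values are all assumptions chosen for the example.

```python
import random


def simulate_successes(p, n, rng):
    """Simulate the summary statistic: successes in n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def abc_rejection(observed, n, n_draws=20000, tolerance=1, seed=0):
    """Keep prior draws of p whose simulated statistic is close to `observed`."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()  # draw a candidate from the Uniform(0, 1) prior (assumed)
        simulated = simulate_successes(p, n, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(p)  # accepted draws approximate the posterior
    return accepted


# Hypothetical data: 35 successes observed in 50 trials.
posterior = abc_rejection(observed=35, n=50)
posterior_mean = sum(posterior) / len(posterior)
print(f"accepted {len(posterior)} draws, posterior mean of p is about {posterior_mean:.2f}")
```

The accepted draws stand in for the posterior distribution. In a realistic setting the simulator would be a coalescent model, and the distance would be computed over many summary statistics rather than a single count.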

Nonparametric inference of the distribution of fitness effects across functional categories in humans

Fernando Racimo, Joshua G Schraiber

Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental importance to our understanding of evolution, disease genetics and the maintenance of variation genome-wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral, or rely on fitting the DFE of new nonsynonymous mutations to a particular parametric probability distribution, which is poorly motivated on a biological level. We rely on a previously developed method that utilizes a variety of published annotations (including conservation scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome based on how likely they are to be affected by negative selection, controlling for mutation rate. We map this score to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson random field model. We then use our coefficient mapping to quantify the distribution of all scored single-nucleotide polymorphisms in Yoruba and Europeans. Our method serves to approximate the DFE of any type of segregating mutations, regardless of its genomic consequence, and so allows us to compare the proportion of mutations that are negatively selected or neutral across various genomic categories, including different types of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly leptokurtic, with a strong peak at neutrality, while the distribution of nonsynonymous polymorphisms is bimodal, with a neutral peak and a second peak at s ≈ −10^(−4). Other types of polymorphisms have shapes that fall roughly in between these two.

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Gad Abraham, Michael Inouye

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy compared with existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.

Footprints of ancient balanced polymorphisms in genetic variation data

Ziyue Gao, Molly Przeworski, Guy Sella
(Submitted on 29 Jan 2014)

When long-lived, balancing selection can lead to trans-species polymorphisms that are shared by two or more species identical by descent. In this case, the gene genealogies at the selected sites cluster by allele instead of by species and, because of linkage, nearby neutral sites also have unusual genealogies. Although it is clear that this scenario should lead to discernible footprints in genetic variation data, notably the presence of additional neutral polymorphisms shared between species and the absence of fixed differences, the effects remain poorly characterized. We focus on the case of a single site under long-lived balancing selection and derive approximations for summaries of the data that are sensitive to a trans-species polymorphism: the length of the segment that carries most of the signals, the expected number of shared neutral SNPs within the segment and the patterns of allelic associations among them. Coalescent simulations of ancient balancing selection confirm the accuracy of our approximations. We further show that for humans and chimpanzees, and more generally for pairs of species with low genetic diversity levels, the patterns of genetic variation on which we focus are highly unlikely to be generated by neutral recurrent mutations, so these statistics are specific as well as sensitive. We discuss the implications of our results for the design and interpretation of genome scans for ancient balancing selection in apes and other taxa.

Estimate of Within Population Incremental Selection Through Branch Imbalance in Lineage Trees

Gilad Liberman, Jennifer Benichou, Lea Tsaban, Yaakov Maman, Jacob Glanville, Yoram Louzoun

Incremental selection within a population, defined as a limited fitness change following a mutation, is an important aspect of many evolutionary processes and can significantly affect a large number of mutations through the genome. Strongly advantageous or deleterious mutations are detected through the fixation of mutations in the population, using the synonymous to non-synonymous mutations ratio in sequences. There are currently no precise methods to estimate incremental selection occurring over limited periods. We here provide for the first time such a detailed method and show its precision and its applicability to the genomic analysis of selection. A special case of evolution is rapid, short-term micro-evolution, where organisms are under constant adaptation, occurring for example in viruses infecting a new host, B cells mutating during germinal center reactions or mitochondria evolving within a given host. The proposed method is a novel mixed lineage tree/sequence based method to detect within population selection as defined by the effect of mutations on the average number of offspring. Specifically, we propose to measure the log of the ratio between the number of leaves in lineage tree branches following synonymous and non-synonymous mutations. This method does not suffer from the need of a baseline model and is practically not affected by sampling biases. In order to show the wide applicability of this method, we apply it to multiple cases of micro-evolution, and show that it can detect genes and inter-genic regions using the selection rate and detect selection pressures in viral proteins and in the immune response to pathogens.

A C++ template library for efficient forward-time population genetic simulation of large populations

Kevin R. Thornton
(Submitted on 15 Jan 2014)

fwdpp is a C++ library of routines intended to facilitate the development of forward-time simulations under arbitrary mutation and fitness models. The library design provides a combination of speed, low memory overhead, and modeling flexibility not currently available from other forward simulation tools. The library is particularly useful when the simulation of large populations is required, as programs implemented using the library are much more efficient than other available forward simulation programs.

Response to a population bottleneck can be used to infer recessive selection

Daniel J. Balick, Ron Do, David Reich, Shamil R. Sunyaev
(Submitted on 11 Dec 2013)

Here we present the first genome-wide statistical test for recessive selection. This test uses explicitly non-equilibrium demographic differences between populations to infer the mode of selection. By analyzing the transient response to a population bottleneck and subsequent re-expansion, we qualitatively distinguish between alleles under additive and recessive selection. We analyze the response of the average number of deleterious mutations per haploid individual and describe the time dependence of this quantity. We introduce a statistic, BR, to compare the number of mutations in different populations and detail its functional dependence on the strength of selection and the intensity of the population bottleneck. This test can be used to detect the predominant mode of selection on the genome-wide or regional level, as well as among a sufficiently large set of medically or functionally relevant alleles.

Evaluating the use of ABBA-BABA statistics to locate introgressed loci

Simon Henry Martin, John William Davey, Chris D Jiggins

Several methods have been proposed to test for introgression across genomes. One method identifies an excess of shared derived alleles between taxa using Patterson’s D statistic, but does not establish which loci show such an excess or whether the excess is due to introgression or ancestral population structure. Smith and Kronforst (2013) propose that, at loci identified as outliers for the D statistic, introgression is indicated by a reduction in absolute genetic divergence (dXY) between taxa with shared ancestry, whereas ancestral structure produces no reduction in dXY at these loci. Here, we use simulations and Heliconius butterfly data to investigate the behavior of D when applied to small genomic regions. We find that D imperfectly identifies loci with shared ancestry in many scenarios due to a bias in regions with few segregating sites. A related statistic, f, is mostly robust to this bias but becomes less accurate as gene flow becomes more ancient. Although reduced dXY does indicate introgression when loci with shared ancestry can be accurately detected, both D and f systematically identify regions of lower dXY in the presence of both gene flow and ancestral structure, so detecting a reduction in dXY at D or f outliers is not sufficient to infer introgression. However, models including gene flow produced a larger reduction in dXY than models including ancestral structure in almost all cases, so this reduction may be suggestive, but not conclusive, evidence for introgression.
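
As background for the abstract above, Patterson's D in its simplest count-based form is (ABBA - BABA) / (ABBA + BABA), computed over biallelic sites in four taxa ordered (P1, P2, P3, outgroup), with A the ancestral and B the derived allele. The sketch below is a minimal illustration under that assumption; the function name and the encoding of sites as four-letter strings are hypothetical, and the paper itself works with allele-frequency versions of the statistic and with the related f statistic, which are not shown here.

```python
def patterson_d(site_patterns):
    """Count-based Patterson's D over sites coded as (P1, P2, P3, outgroup).

    Each site is a 4-character string such as "ABBA" or "BABA", where
    "A" is the ancestral allele and "B" the derived allele.
    Returns None when the window contains no ABBA or BABA sites.
    """
    abba = sum(1 for site in site_patterns if site == "ABBA")
    baba = sum(1 for site in site_patterns if site == "BABA")
    informative = abba + baba
    if informative == 0:
        return None  # D is undefined without informative sites
    return (abba - baba) / informative


# A window with several informative sites gives an intermediate value.
large_window = ["ABBA"] * 12 + ["BABA"] * 8 + ["BBAA"] * 30
print(patterson_d(large_window))

# A small window with a single informative site forces D to an extreme.
small_window = ["ABBA", "BBAA", "AABA"]
print(patterson_d(small_window))  # one ABBA site only: D = 1.0
```

The second call shows why small genomic regions are problematic: with only one or two informative sites, D can only take extreme values, which is consistent with the bias in regions with few segregating sites that the authors report.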

Probabilistic models of genetic variation in structured populations applied to global human studies

Wei Hao, Minsun Song, John D. Storey
(Submitted on 7 Dec 2013)

Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important, unsolved problem is how to formulate and estimate probabilistic models of observed genotypes that allow for complex population structure. We formulate two general probabilistic models, and we propose computationally efficient algorithms to estimate them. First, we show how principal component analysis (PCA) can be utilized to estimate a general model that includes the well-known Pritchard-Stephens-Donnelly mixed-membership model as a special case. Noting some drawbacks of this approach, we introduce a new “logistic factor analysis” (LFA) framework that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure. We demonstrate these advances on data from the Human Genome Diversity Panel and the 1000 Genomes Project, where we are able to identify SNPs that are highly differentiated with respect to structure while making minimal modeling assumptions.