An improved sequence measure used to scan genomes for regions of recent gene flow
Anthony J. Geneva, Christina A. Muirhead, LeAnne M. Lovato, Sarah B. Kingan, Daniel Garrigan
(Submitted on 6 Mar 2014)
The study of complex speciation, or speciation with gene flow, requires the identification of genomic regions that are either unusually divergent or that have experienced recent gene flow. Furthermore, the rapid growth of population genomic datasets relevant to studying complex speciation requires that analytical tools be scalable to the level of whole-genome analysis. We present a simple sequence measure, Gmin, which is specifically designed to identify regions of diverging genomes as candidates for experiencing recent gene flow. Gmin is defined as the ratio of the minimum number of nucleotide differences between sequences from two different populations to the average number of between-population differences. We compare the sensitivity of Gmin to that of the widely used index of population differentiation, Fst. Extensive computer simulations demonstrate that Gmin has greater sensitivity and specificity to detect gene flow than Fst. Additionally, the sensitivity of Gmin to detect gene flow is robust with respect to both the population mutation and recombination rates, suggesting that it is flexible and can be applied to a variety of biological scenarios. Finally, a scan of Gmin across the X chromosome of Drosophila melanogaster identifies candidate regions of introgression between sub-Saharan African and cosmopolitan populations that were previously missed by other methods. These results demonstrate that Gmin is a biologically straightforward, yet powerful, alternative to Fst, as well as to more computationally intensive model-based methods for detecting gene flow.
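The definition of Gmin given in the abstract can be illustrated with a minimal Python sketch. It assumes aligned, equal-length sequences with no missing data and a single genomic window; the function names `hamming` and `gmin` are hypothetical and are not taken from the authors' software.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def gmin(pop1, pop2):
    """Ratio of the minimum to the mean between-population difference.

    pop1 and pop2 are lists of aligned sequences (strings) covering the
    same genomic window, one list per population.
    """
    # All pairwise comparisons between a sequence from pop1 and one from pop2.
    dists = [hamming(s, t) for s, t in product(pop1, pop2)]
    if not dists:
        raise ValueError("both populations need at least one sequence")
    mean_dist = sum(dists) / len(dists)
    if mean_dist == 0:
        # No between-population differences: the ratio is undefined.
        return float("nan")
    return min(dists) / mean_dist


if __name__ == "__main__":
    window_pop1 = ["AAAA", "AAAT"]
    window_pop2 = ["AATT", "TTTT"]
    print(gmin(window_pop1, window_pop2))
```

In a genome scan, a function like this would be applied to successive windows along the chromosome. Values well below 1 indicate that at least one between-population pair is much more similar than the average pair, which is the pattern the abstract associates with recent gene flow.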
A renewal theory approach to IBD sharing
Shai Carmi, Itsik Pe’er
(Submitted on 6 Mar 2014)
Long genomic segments that are nearly identical between a pair of individuals and are inherited from a recent common ancestor without recombination are called identical-by-descent (IBD) segments. IBD sharing has numerous applications in genetics, from demographic inference to phasing, imputation, pedigree reconstruction, and disease mapping. Here, we provide a theoretical analysis of IBD sharing under Markovian approximations of the coalescent with recombination. We describe a general framework for the IBD process along the chromosome under the Markovian models (SMC/SMC’), as well as introduce and justify a new model, which we term the renewal approximation, under which lengths of successive segments are independent. Then, considering the infinite-chromosome limit of the IBD process, we recover previous results (for SMC) and derive new results (for SMC’) for the average fraction of the chromosome found in long shared segments and the average number of such segments. A number of new results for tree heights in SMC’ are proved as lemmas. We then use renewal theory to derive an expression (in Laplace space) for the distribution of the number of shared segments and demonstrate implications for demographic inference. We also use renewal theory to compute the distribution of the fraction of the chromosome shared. While the expression is again in Laplace space, we could invert the first two moments and compare a number of approximations. Finally, we generalized all results to populations with variable historical effective size.
Decoding coalescent hidden Markov models in linear time
Kelley Harris, Sara Sheehan, John A. Kamm, Yun S. Song
(Submitted on 4 Mar 2014)
In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computation resources. We also apply the method to data from the 1000 Genomes project, inferring a high-resolution history of size changes in the European population.
Genome scans for detecting footprints of local adaptation using a Bayesian factor model
N. Duforet-Frebourg, E. Bazin, M.G.B. Blum
(Submitted on 21 Feb 2014)
A central part of population genomics consists of finding genomic regions implicated in local adaptation. Population genomic analyses are based on genotyping numerous molecular markers and looking for outlier loci in terms of patterns of genetic differentiation. One of the most common approaches for selection scans is based on statistics that measure population differentiation, such as FST. However, there are important caveats with approaches related to FST because they require grouping individuals into populations and they additionally assume a particular model of population structure. Here we implement a more flexible individual-based approach based on Bayesian factor models. Using hierarchical Bayesian modeling, we both infer population structure and identify outlier loci that are candidates for local adaptation. Factor models are strongly related to principal components analysis (PCA) and they model population structure with latent variables called factors. The hierarchical factor model considers that outlier loci are atypically explained by one of the factors. In a model of population divergence, we show that it can achieve a 2-fold or more reduction of false discovery rate compared to the software BayeScan or compared to an FST approach. We show that our software can handle large SNP datasets by analyzing the HGDP SNP dataset. The Bayesian factor model is implemented in the command-line PCAdapt software.
Hierarchical Bayesian model of population structure reveals convergent adaptation to high altitude in human populations
Matthieu Foll, Oscar E. Gaggiotti, Josephine T. Daub, Laurent Excoffier
(Submitted on 18 Feb 2014)
Detecting genes involved in local adaptation is challenging and of fundamental importance in evolutionary, quantitative, and medical genetics. To this aim, a standard strategy is to perform genome scans in populations of different origins and environments, looking for genomic regions of high differentiation. Because shared population history or population sub-structure may lead to an excess of false positives, analyses are often done on multiple pairs of populations, which leads to i) a global loss of power as compared to a global analysis, and ii) the need for multiple-testing corrections. In order to alleviate these problems, we introduce a new hierarchical Bayesian method to detect markers under selection that can deal with complex demographic histories, where sampled populations share part of their history. Simulations show that our approach is both more powerful and less prone to false positive loci than approaches based on separate analyses of pairs of populations or those ignoring existing complex structures. In addition, our method can identify selection occurring at different levels (i.e. population or region-specific adaptation), as well as convergent selection in different regions. We apply our approach to the analysis of a large SNP dataset from low- and high-altitude human populations from America and Asia. The simultaneous analysis of these two geographic areas allows us to identify several new candidate genome regions for altitudinal selection, and we show that convergent evolution among continents has been quite common. In addition to identifying several genes and biological processes involved in high altitude adaptation, we identify two specific biological pathways that could have evolved in both continents to counter toxic effects induced by hypoxia.
Investigating speciation in face of polyploidization: what can we learn from approximate Bayesian computation approach?
Camille Roux, John Pannell
Despite its importance in the diversification of many eucaryote clades, particularly plants, detailed genomic analysis of polyploid species is still in its infancy, with published analysis of only a handful of model species to date. Fundamental questions concerning the origin of polyploid lineages (e.g., auto- vs. allopolyploidy) and the extent to which polyploid genomes display different modes of inheritance are poorly resolved for most polyploids, not least because they have hitherto required detailed karyotypic analysis or the analysis of allele segregation at multiple loci in pedigrees or artificial crosses, which are often not practical for non-model species. However, the increasing availability of sequence data for non-model species now presents an opportunity to apply established approaches for the evolutionary analysis of genomic data to polyploid species complexes. Here, we ask whether approximate Bayesian computation (ABC), applied to sequence data produced by next-generation sequencing technologies from polyploid taxa, allows correct inference of the evolutionary and demographic history of polyploid lineages and their close relatives. We use simulations to investigate how the number of sampled individuals, the number of surveyed loci and their length affect the accuracy and precision of evolutionary and demographic inferences by ABC, including the mode of polyploidisation, mode of inheritance of polyploid taxa, the relative timing of genome duplication and speciation, and effective population sizes of contributing lineages. We also apply the ABC framework we develop to sequence data from diploid and polyploid species of the plant genus Capsella, for which we infer an allopolyploid origin for tetraploid C. bursa-pastoris ≈ 90,000 years ago. In general, our results indicate that ABC is a promising and powerful method for uncovering the origin and subsequent evolution of polyploid species.
Nonparametric inference of the distribution of fitness effects across functional categories in humans
Fernando Racimo, Joshua G Schraiber
Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental importance to our understanding of evolution, disease genetics and the maintenance of variation genome-wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral, or rely on fitting the DFE of new nonsynonymous mutations to a particular parametric probability distribution, which is poorly motivated on a biological level. We rely on a previously developed method that utilizes a variety of published annotations (including conservation scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome based on how likely they are to be affected by negative selection, controlling for mutation rate. We map this score to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson random field model. We then use our coefficient mapping to quantify the distribution of all scored single-nucleotide polymorphisms in Yoruba and Europeans. Our method serves to approximate the DFE of any type of segregating mutations, regardless of its genomic consequence, and so allows us to compare the proportion of mutations that are negatively selected or neutral across various genomic categories, including different types of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly leptokurtic, with a strong peak at neutrality, while the distribution of nonsynonymous polymorphisms is bimodal, with a neutral peak and a second peak at s ≈ −10^(−4). Other types of polymorphisms have shapes that fall roughly in between these two.
Fast Principal Component Analysis of Large-Scale Genome-Wide Data
Gad Abraham, Michael Inouye
Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy compared with existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
Footprints of ancient balanced polymorphisms in genetic variation data
Ziyue Gao, Molly Przeworski, Guy Sella
(Submitted on 29 Jan 2014)
When long-lived, balancing selection can lead to trans-species polymorphisms that are shared by two or more species identical by descent. In this case, the gene genealogies at the selected sites cluster by allele instead of by species and, because of linkage, nearby neutral sites also have unusual genealogies. Although it is clear that this scenario should lead to discernible footprints in genetic variation data, notably the presence of additional neutral polymorphisms shared between species and the absence of fixed differences, the effects remain poorly characterized. We focus on the case of a single site under long-lived balancing selection and derive approximations for summaries of the data that are sensitive to a trans-species polymorphism: the length of the segment that carries most of the signals, the expected number of shared neutral SNPs within the segment and the patterns of allelic associations among them. Coalescent simulations of ancient balancing selection confirm the accuracy of our approximations. We further show that for humans and chimpanzees, and more generally for pairs of species with low genetic diversity levels, the patterns of genetic variation on which we focus are highly unlikely to be generated by neutral recurrent mutations, so these statistics are specific as well as sensitive. We discuss the implications of our results for the design and interpretation of genome scans for ancient balancing selection in apes and other taxa.
Estimate of Within Population Incremental Selection Through Branch Imbalance in Lineage Trees
Gilad Liberman, Jennifer Benichou, Lea Tsaban, Yaakov Maman, Jacob Glanville, Yoram Louzoun
Incremental selection within a population, defined as a limited fitness change following a mutation, is an important aspect of many evolutionary processes and can significantly affect a large number of mutations through the genome. Strongly advantageous or deleterious mutations are detected through the fixation of mutations in the population, using the synonymous to non-synonymous mutations ratio in sequences. There are currently no precise methods to estimate incremental selection occurring over limited periods. We here provide for the first time such a detailed method and show its precision and its applicability to the genomic analysis of selection. A special case of evolution is rapid, short-term micro-evolution, where organisms are under constant adaptation, occurring for example in viruses infecting a new host, B cells mutating during a germinal center reaction or mitochondria evolving within a given host. The proposed method is a novel mixed lineage tree/sequence based method to detect within population selection as defined by the effect of mutations on the average number of offspring. Specifically, we propose to measure the log of the ratio between the number of leaves in lineage tree branches following synonymous and non-synonymous mutations. This method does not suffer from the need of a baseline model and is practically not affected by sampling biases. In order to show the wide applicability of this method, we apply it to multiple cases of micro-evolution, and show that it can detect genes and inter-genic regions using the selection rate and detect selection pressures in viral proteins and in the immune response to pathogens.
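The leaf-count measure described in the abstract can be sketched roughly in Python. The abstract does not specify the tree representation, how counts are aggregated across branches, or which mutation class forms the numerator, so everything here is an assumption made for illustration: trees are nested dicts whose edges carry a label ("S" for synonymous, "N" for non-synonymous, anything else ignored), counts are averaged per class, and the ratio is taken as non-synonymous over synonymous. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def count_leaves(node):
    """Count the leaves in the subtree rooted at node."""
    children = node.get("children", [])
    if not children:
        return 1
    return sum(count_leaves(child) for _, child in children)


def leaf_counts_by_mutation(node, counts=None):
    """Collect, per mutation class, the leaf count below each labeled edge."""
    if counts is None:
        counts = {"S": [], "N": []}
    for label, child in node.get("children", []):
        if label in counts:
            counts[label].append(count_leaves(child))
        leaf_counts_by_mutation(child, counts)
    return counts


def log_leaf_ratio(tree):
    """log(mean leaves after N edges / mean leaves after S edges).

    The direction of the ratio and the use of means are assumptions.
    """
    counts = leaf_counts_by_mutation(tree)
    if not counts["S"] or not counts["N"]:
        raise ValueError("need at least one S edge and one N edge")
    mean_s = sum(counts["S"]) / len(counts["S"])
    mean_n = sum(counts["N"]) / len(counts["N"])
    return math.log(mean_n / mean_s)


if __name__ == "__main__":
    leaf = {}
    toy_tree = {
        "children": [
            ("S", {"children": [(None, leaf), (None, leaf), (None, leaf)]}),
            ("N", leaf),
        ]
    }
    print(log_leaf_ratio(toy_tree))
```

Under these assumptions, a negative value means branches following non-synonymous mutations leave fewer descendants on average than branches following synonymous ones.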