A new FST-based method to uncover local adaptation using environmental variables

Pierre de Villemereuil, Oscar E. Gaggiotti
Comments: 18 pages, 5 figures, Supplementary Information at the end of the document
Subjects: Populations and Evolution (q-bio.PE)

Genome-scan methods are used for screening genome-wide patterns of DNA polymorphism to detect signatures of positive selection. There are two main types of methods: (i) “outlier” detection methods based on $F_{\text{ST}}$ that detect loci with high differentiation compared to the rest of the genome, and (ii) environmental association methods that test the association between allele frequencies and environmental variables. In this article, we present a new $F_{\text{ST}}$-based genome scan method, BayeScEnv, which incorporates environmental information in the form of “environmental differentiation”. It is based on the F model but, unlike existing approaches, it considers two locus-specific effects: one due to divergent selection and another due to other processes such as differences in mutation rates across loci or background selection. Simulation studies showed that our method has a much lower false positive rate than an existing $F_{\text{ST}}$-based method, BayeScan, under a wide range of demographic scenarios. Although it had lower power, it achieved a better compromise between power and false positive rate. We apply our method to human and salmon datasets and show that it can be used successfully to study local adaptation. The method was developed in C++ and is available at this http URL
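To make the "outlier" idea concrete, here is a minimal Python sketch that computes Wright's classical $F_{\text{ST}}$ for a single biallelic locus from per-population allele frequencies. It is not the BayeScEnv or BayeScan estimator (both are Bayesian and based on the F model); the function name and the example frequencies are hypothetical and serve only to illustrate how a strongly differentiated locus stands out.

```python
def fst_from_frequencies(freqs):
    """Wright's F_ST for one biallelic locus: Var(p) / (p_bar * (1 - p_bar)).

    `freqs` holds the allele frequency in each population (equal weights assumed).
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    if p_bar in (0.0, 1.0):
        # Allele fixed or absent everywhere: no differentiation to measure.
        return 0.0
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n
    return var_p / (p_bar * (1.0 - p_bar))


# Hypothetical allele frequencies in four populations for two loci.
loci = {
    "neutral_like": [0.48, 0.52, 0.50, 0.49],
    "outlier_like": [0.10, 0.90, 0.15, 0.85],
}
fst = {name: fst_from_frequencies(f) for name, f in loci.items()}
for name, value in fst.items():
    print(f"{name}: F_ST = {value:.3f}")
```

A genome scan compares such locus-level values against the genome-wide background; the methods in the abstract do this within a probabilistic model rather than with a raw moment estimate.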

Visualizing spatial population structure with estimated effective migration surfaces

Desislava Petkova, John Novembre, Matthew Stephens
doi: http://dx.doi.org/10.1101/011809

Genetic data often exhibit patterns that are broadly consistent with “isolation by distance” – a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow can accelerate the genetic differentiation between groups located close in space. We use the concept of “effective migration” to model the relationship between genetics and geography: in this paradigm, effective migration is low in regions where genetic similarity decays quickly. We present a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data. Our approach uses a population genetic model to relate underlying migration rates to expected pairwise genetic dissimilarities, and estimates migration rates by matching these expectations to the observed dissimilarities. We illustrate the potential and limitations of our method using simulations and data from elephant, human, and Arabidopsis thaliana populations. The resulting visualizations highlight important features of the spatial population structure that are difficult to discern using existing methods for summarizing genetic variation such as principal components analysis.

Demographic inference using genetic data from a single individual: separating population size variation from population structure
Olivier Mazet, Willy Rodríguez, Lounès Chikhi
doi: http://dx.doi.org/10.1101/011866

The rapid development of sequencing technologies represents new opportunities for population genetics research. It is expected that genomic data will increase our ability to reconstruct the history of populations. While this increase in genetic information will likely help biologists and anthropologists to reconstruct the demographic history of populations, it also represents new challenges. Recent work has shown that structured populations generate signals of population size change. As a consequence, it is often difficult to determine whether demographic events such as expansions or contractions (bottlenecks) inferred from genetic data are real or due to the fact that populations are structured in nature. Given that few inferential methods allow us to account for that structure, and that genomic data will necessarily increase the precision of parameter estimates, it is important to develop new approaches. In the present study we analyse two demographic models. The first is a model of instantaneous population size change whereas the second is the classical symmetric island model. We (i) re-derive the distribution of coalescence times under the two models for a sample of size two, (ii) use a maximum likelihood approach to estimate the parameters of these models, (iii) validate this estimation procedure under a wide array of parameter combinations, and (iv) implement and validate a model choice procedure by using a Kolmogorov-Smirnov test. Altogether we show that it is possible to estimate parameters under several models and perform efficient model choice using genetic data from a single diploid individual.
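The following Python sketch illustrates the general idea for the first model only, and is not the authors' implementation. It assumes time is measured in coalescent units of the present-day population size, and that the size was `alpha` times larger before time `T` in the past, so the pairwise coalescence rate is 1 for t ≤ T and 1/alpha afterwards. The parameter values and function names are hypothetical. Simulated coalescence times are compared with two candidate distributions using a hand-rolled Kolmogorov-Smirnov statistic.

```python
import math
import random


def survival_size_change(t, T, alpha):
    """P(T2 > t) for a sample of size two under one instantaneous size change."""
    if t <= T:
        return math.exp(-t)
    return math.exp(-T - (t - T) / alpha)


def sample_size_change(T, alpha, rng):
    """Draw one coalescence time by inverting the survival function."""
    s = 1.0 - rng.random()  # uniform on (0, 1]
    if s >= math.exp(-T):
        return -math.log(s)
    return T + alpha * (-math.log(s) - T)


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance between a sample and a CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(1)
T, alpha = 0.5, 10.0  # hypothetical: ancestral size ten times the present size
times = [sample_size_change(T, alpha, rng) for _ in range(2000)]

# Distance to the generating model versus a constant-size (exponential) model.
d_true = ks_statistic(times, lambda t: 1.0 - survival_size_change(t, T, alpha))
d_const = ks_statistic(times, lambda t: 1.0 - math.exp(-t))
print(f"KS distance, size-change model: {d_true:.3f}")
print(f"KS distance, constant-size model: {d_const:.3f}")
```

A small distance to the size-change model and a large distance to the constant-size model is the kind of contrast a KS-based model choice procedure exploits; the island model would supply a third candidate CDF.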

A Hidden Markov Model for Investigating Recent Positive Selection through Haplotype Structure

Hua Chen, Jody Hey, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/011247

Recent positive selection can increase the frequency of an advantageous mutant rapidly enough that a relatively long ancestral haplotype remains intact around it. We present a hidden Markov model (HMM) to identify such haplotype structures. Given the HMM-identified haplotype structures, a population genetic model for the extent of ancestral haplotypes is then adopted to infer the selection intensity and the allele age. Simulations show that this method can detect selection under a wide range of conditions and has higher power than the existing frequency spectrum-based method. In addition, it provides good estimates of the selection coefficients and allele ages for strong selection. The method analyzes large data sets in a reasonable amount of running time. This method is applied to HapMap III data for a genome scan, and identifies a list of candidate regions putatively under recent positive selection. It is also applied to several genes known to be under recent positive selection, including the LCT, KITLG and TYRP1 genes in Northern Europeans, and OCA2 in East Asians, to estimate their allele ages and selection coefficients.

Estimating the Relative Rate of Recombination to Mutation in Bacteria from Single-Locus Variants using Composite Likelihood Methods

Paul Fearnhead, Shoukai Yu, Patrick Biggs, Barbara Holland, Nigel French
(Submitted on 5 Nov 2014)

A number of studies have suggested using comparisons between DNA sequences of closely related bacterial isolates to estimate the relative rate of recombination to mutation for that bacterial species. We consider such an approach, which uses single locus variants: pairs of isolates whose DNA differs at a single gene locus. One way of deriving point estimates for the relative rate of recombination to mutation from such data is to use composite likelihood methods. We extend recent work in this area so as to be able to construct confidence intervals for our estimates, without needing to resort to computationally intensive bootstrap procedures, and to develop a test for whether the relative rate varies across loci. Both our test and method for constructing confidence intervals are obtained by modelling the dependence structure in the data, and then applying asymptotic theory regarding the distribution of estimators obtained using a composite likelihood. We applied these methods to multi-locus sequence typing (MLST) data from eight bacteria, finding strong evidence for considerable rate variation in three of these: Bacillus cereus, Enterococcus faecium and Klebsiella pneumoniae.

Conflations of short IBD blocks can bias inferred length of IBD

Charleston W.K. Chiang, Peter Ralph, John Novembre
Comments: 12 figures, 1 table
Subjects: Populations and Evolution (q-bio.PE)

Identity-by-descent (IBD) is a fundamental concept in genetics with many applications. Often, segments between two haplotypes are said to be IBD if they are inherited from a recent shared common ancestor without intervening recombination. Long IBD blocks (> 1cM) can be efficiently detected by a number of computer programs using high-density SNP array data from a population sample. However, all programs detect IBD based on contiguous segments of identity-by-state, so an inferred block can be due to the conflation of smaller, nearby IBD blocks. We quantified this effect using coalescent simulations, finding that nearly 40% of inferred blocks 1-2cM long are false conflations of two or more shorter blocks, under demographic scenarios typical for modern humans. This biases the inferred IBD block length distribution, and so can affect downstream inferences. We observed this conflation effect universally across different IBD detection programs and human demographic histories, and found inference of segments longer than 2cM to be much more reliable (less than 5% conflation rate). We then present and analyze a novel estimator of the de novo mutation rate using IBD blocks, and demonstrate that the biased length distribution of the IBD segments due to conflation can strongly affect this estimator if the conflation is not modeled. Thus, the conflation effect should be carefully considered, especially as methods to detect shorter IBD blocks using sequencing data are being developed.

Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Eric Y Durand, Chuong B Do, Joanna L Mountain, J. Michael Macpherson
doi: http://dx.doi.org/10.1101/010512

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. However, most existing methods for ancestry deconvolution are limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country of origin from the member database of the personal genetics company 23andMe, Inc., excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
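As an illustration of the third stage only, the sketch below fits an isotonic (non-decreasing) regression with the standard pool-adjacent-violators algorithm. It is a generic example, not the Ancestry Composition code: the abstract does not describe how the recalibration is set up, so the input layout (binary correctness outcomes ordered by raw confidence) and all names here are assumptions.

```python
def isotonic_fit(values, weights=None):
    """Pool-adjacent-violators: weighted least-squares non-decreasing fit.

    `values` must already be ordered by the predictor (here, raw confidence).
    """
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical data: 1.0 if the ancestry call was correct, 0.0 otherwise,
# listed in order of increasing raw confidence score.
outcomes = [0.0, 1.0, 0.0, 1.0, 1.0]
calibrated = isotonic_fit(outcomes)
print(calibrated)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

The fitted values are monotone in the raw confidence and can be read as recalibrated probabilities of a correct call; in practice one would fit on held-out data and interpolate between fitted points.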