Author Post: The Population Genetic Signature of Polygenic Local Adaptation

This guest post is by Jeremy Berg [@JeremyJBerg] and Graham Coop [@Graham_coop] on their paper “The Population Genetic Signature of Polygenic Local Adaptation”, arXived here.

The field of population genetics has devoted a lot of time to identifying signals of adaptation. These tests are usually predicated on the fact that local adaptation can drive large allele frequency changes between populations. However, we’ve known for almost a century that many traits are highly polygenic, so that adaptation can occur through subtle shifts in allele frequencies at many loci. Until now we’ve been unable to detect such signals, but genome-wide association studies (GWAS) now give us a way of potentially learning about selection on quantitative traits from population genetic data. In this paper we develop a set of approaches to do this in a robust population genetic framework.

GWAS usually assume a simple additive model, i.e. no epistasis or dominance, to test for and estimate effect sizes for a genome-wide set of loci. To test whether local adaptation has shaped the genetic basis of the trait, we do the perhaps boneheaded thing of taking the GWAS results at face value. For each population we simply sum up the product of the frequency at each GWAS SNP and the effect size of that SNP. This gives us an estimate of the mean additive genetic value for the phenotype in each population. This is not the mean phenotype of the population, as it ignores the fact that we don’t know all the variants affecting our trait, as well as environmental change across populations, gene-by-environment interactions, and changes in allele frequencies that have altered the dominance and epistatic relationships between alleles (i.e. all that good stuff that makes life interesting). However, these additive genetic values do have the very useful property that they are simple linear functions of the allele frequencies, which means that we can construct a simple and robust model of genetic drift causing these phenotypes to diverge across populations.
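The sum described above can be sketched in a few lines of Python. This is only an illustration: the function name `genetic_values`, the toy frequencies and the toy effect sizes are all made up here, and any constant scaling (for example a factor of 2 for diploids) is left out because it does not change how populations compare.

```python
import numpy as np

def genetic_values(freqs, effects):
    """Estimated mean additive genetic value per population.

    freqs   : array of shape (populations, loci), allele frequency of the
              effect allele at each GWAS SNP in each population
    effects : array of shape (loci,), GWAS effect size of that allele
    """
    freqs = np.asarray(freqs, dtype=float)
    effects = np.asarray(effects, dtype=float)
    if freqs.ndim != 2 or freqs.shape[1] != effects.shape[0]:
        raise ValueError("freqs must be (populations, loci) and match effects")
    # Sum over loci of (frequency x effect size), one value per population.
    return freqs @ effects

# Toy data (hypothetical): 3 populations, 3 GWAS SNPs.
toy_freqs = np.array([
    [0.10, 0.50, 0.90],
    [0.20, 0.50, 0.80],
    [0.40, 0.60, 0.70],
])
toy_effects = np.array([0.5, -0.2, 0.1])
toy_values = genetic_values(toy_freqs, toy_effects)
print(toy_values)
```

Because the result is a linear function of the allele frequencies, any model for how frequencies drift carries over directly to the genetic values.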

[Figure A: estimated genetic values for height across populations]

In Figure A we show our estimated genetic values using the human height GWAS of Lango Allen et al. (2010). As you can see, populations show deviations around the global mean genetic value, and populations from the same geographic regions covary somewhat in the deviation they take, reflecting the fact that allele frequencies at each GWAS locus tend to covary in their shared genetic drift due to population history and migration. For example, in Figure B we show allele frequencies at one of the GWAS height loci.

[Figure B: allele frequencies across populations at one GWAS height SNP]

We can approximately model the allele frequencies at a single locus by assuming that they are multivariate normally distributed around the global mean. The covariance matrix of this distribution is given by a matrix closely related to the kinship matrix of our populations, which can be calculated from a genome-wide sample of putatively neutral loci. As our vector of phenotypic genetic values across populations is simply a weighted sum of the individual allele frequencies, our vector of genetic values also follows a multivariate normal distribution. Given that we are summing up a lot of loci, even if the multivariate normal model is a poor approximation to drift at one locus, the central limit theorem suggests that it should still be a good fit to the distribution of the genetic values.
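In symbols, the argument can be sketched as follows. The notation is an assumption made for this sketch and may not match the paper: $\vec{p}_\ell$ is the vector of population allele frequencies at locus $\ell$, $\epsilon_\ell$ its global mean, $\alpha_\ell$ the effect size, $\mathbf{F}$ the population covariance matrix estimated from neutral loci, and loci are treated as independent. Constant scaling factors are omitted.

```latex
\begin{align}
\vec{p}_\ell &\sim \mathrm{MVN}\!\left(\epsilon_\ell \vec{1},\; \epsilon_\ell (1-\epsilon_\ell)\,\mathbf{F}\right), \\
\vec{Z} &= \sum_{\ell} \alpha_\ell\, \vec{p}_\ell, \\
\vec{Z} &\sim \mathrm{MVN}\!\left(\mu \vec{1},\; V\,\mathbf{F}\right),
\qquad \mu = \sum_{\ell} \alpha_\ell \epsilon_\ell,
\qquad V = \sum_{\ell} \alpha_\ell^{2}\, \epsilon_\ell (1-\epsilon_\ell).
\end{align}
```

The last line follows because a weighted sum of independent multivariate normal vectors is itself multivariate normal, with means and covariances adding up term by term.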

This simple neutral model, based on multivariate normal distributions, gives us a solid basis for developing tests of selection. Our most basic test is a test for the over-dispersion of the variance of genetic values (i.e. too great an among-population variance, once population structure has been accounted for). We also develop a test for environmental correlations and a way to identify outlier populations and regions to further understand the signal of local adaptation.
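One generic way to quantify over-dispersion under a multivariate normal null is a quadratic form that weights deviations by the inverse of the population covariance matrix. The sketch below is a minimal illustration of that idea and is not claimed to be the exact statistic used in the paper. The function name, the generalized least squares estimate of the mean, and the argument `scale` (the variance factor multiplying the covariance matrix) are all assumptions of this sketch.

```python
import numpy as np

def overdispersion_statistic(z, F, scale):
    """Quadratic-form over-dispersion statistic (illustrative sketch).

    z     : genetic values, one per population (length M)
    F     : (M, M) positive-definite covariance matrix from neutral loci
    scale : variance factor so that Cov(z) = scale * F under the null

    Returns (q, df). If z ~ MVN(mu * 1, scale * F), q is chi-squared
    distributed with df = M - 1 degrees of freedom.
    """
    z = np.asarray(z, dtype=float)
    F = np.asarray(F, dtype=float)
    ones = np.ones_like(z)
    # Generalized least squares estimate of the shared mean.
    mu_hat = (ones @ np.linalg.solve(F, z)) / (ones @ np.linalg.solve(F, ones))
    resid = z - mu_hat
    # Deviations weighted by the inverse covariance, rescaled by the variance.
    q = float(resid @ np.linalg.solve(F, resid) / scale)
    return q, z.size - 1

# Toy check with independent populations (F = identity), hypothetical values.
q_toy, df_toy = overdispersion_statistic([1.0, 2.0, 3.0], np.eye(3), scale=1.0)
print(q_toy, df_toy)
```

A value of `q` that is large relative to a chi-squared distribution with `df` degrees of freedom would indicate more among-population variance than drift alone predicts.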

We apply our tests to six different GWAS datasets using the HGDP as our set of populations. Our tests reveal widespread evidence of selection shaping polygenic traits across populations, although many of the signals are quite subtle. Somewhat surprisingly, we find little evidence for selection on the loci involved in Type 2 diabetes, somewhat of a poster-child for adaptation shaping the genetic basis of a disease thanks to the thrifty gene hypothesis.

We think our approach is a promising way forward to look for selection on the genetic basis of quantitative traits as viewed by GWAS. However, it also highlights some concerns. In developing our tests we found that we had developed a set of methods that already have equivalents in the quantitative trait community, in particular QST, a phenotypic analog of FST (and its extensions by a number of authors). This raises the question of whether, in systems where common garden experiments are possible, there is a need to do GWAS if we are only interested in how local adaptation has shaped traits, or if QST-style approaches are the best that one can do. We do think that there is much more that could be learnt by our style of approach, but it should also give researchers pause to consider why they want to “find the genes” for local adaptation.

We’ve already gotten some very helpful comments via Haldane’s sieve. We’d love more comments, particularly about points of confusion that could be clarified, other datasets that might be good to apply this to, or other applications we could develop.

A Gene Regulatory Model of heterosis and speciation

Peter M. F. Emmrich, Vera Pancaldi, Hannah E. Roberts, Krystyna A. Kelly, David C. Baulcombe
(Submitted on 15 Sep 2013)

Crossing individuals from genetically distinct populations often results in improvements in quantitative traits, such as growth rate, biomass production and stress resistance; this phenomenon is known as heterosis. We have taken a computational approach to explore the mechanisms underlying heterosis, developing a simulation of evolution and hybridization of Gene Regulatory Networks (GRNs) in a Boolean framework. These artificial regulatory networks exhibit biologically realistic topological properties and fitness is measured as the ability of a network to respond to external inputs in the correct way. Our model reproduced experimental observations from the literature on heterosis using only biologically meaningful parameters, such as mutation rates. Hybrid vigor was observed, its extent was seen to increase as parental populations diverged until it collapses when the two populations have become incompatible. Thus, the model also describes a process of speciation and links it to collapsing hybrid fitness due to genetic incompatibility of the separated populations. We also reproduce for the first time in a model the fact that hybrid vigor cannot easily be fixed by crossing hybrids, which is currently an important drawback of the use of hybrid crops. The simulation allows us to study the effects of three standard models for the genetic basis of heterosis, dominance, over-dominance, and epistasis. In our simulation over-dominance is the main factor contributing to hybrid vigour, whereas under-dominance and epistatic incompatibility are responsible for the fitness collapse. As the parental populations diverge, a single mutation can determine an almost sudden incompatibility leading to low fitness hybrids.

Universality and predictability in the evolution of molecular quantitative traits

Armita Nourmohammad, Torsten Held, Michael Lässig
(Submitted on 12 Sep 2013)

Molecular traits, such as gene expression levels or protein binding affinities, are increasingly accessible to quantitative measurement by modern high-throughput techniques. Such traits measure molecular functions and, from an evolutionary point of view, are important as targets of natural selection. Here we discuss recent developments in the evolutionary theory of quantitative traits that reach beyond classical quantitative genetics. We focus on universal evolutionary characteristics: these are largely independent of a trait’s genetic basis, which is often at least partially unknown. We show that universal measurements can be used to infer selection on a quantitative trait, which determines its evolutionary mode of conservation or adaptation. Furthermore, universality is closely linked to predictability of trait evolution across lineages. We argue that universal trait statistics extends over a range of cellular scales and opens new avenues of quantitative evolutionary systems biology.

Inferring Heterogeneous Evolutionary Processes Through Time: from sequence substitution to phylogeography

Filip Bielejec, Philippe Lemey, Guy Baele, Andrew Rambaut, Marc A Suchard
(Submitted on 12 Sep 2013)

Molecular phylogenetic and phylogeographic reconstructions generally assume time-homogeneous substitution processes. Motivated by computational convenience, this assumption sacrifices biological realism and offers little opportunity to uncover the temporal dynamics in evolutionary histories. Here, we extend and generalize an evolutionary approach that relaxes the time-homogeneous process assumption by allowing the specification of different infinitesimal substitution rate matrices across different time intervals, called epochs, along the evolutionary history. We focus on an epoch model implementation in a Bayesian inference framework that offers great modeling flexibility in drawing inference about any discrete data type characterized as a continuous-time Markov chain, including phylogeographic traits. To alleviate the computational burden that the additional temporal heterogeneity imposes, we adopt a massively parallel approach that achieves both fine- and coarse-grain parallelization of the computations across branches that accommodate epoch transitions, making extensive use of graphics processing units. Through synthetic examples, we assess model performance in recovering evolutionary parameters from data generated according to different evolutionary scenarios that comprise different numbers of epochs for both nucleotide and codon substitution processes. We illustrate the usefulness of our inference framework in two different applications to empirical data sets: the selection dynamics on within-host HIV populations throughout infection and the seasonality of global influenza circulation. In both cases, our epoch model captures key features of temporal heterogeneity that remained difficult to test using ad hoc procedures.

Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

Bogdan Pasaniuc, Noah Zaitlen, Huwenbo Shi, Gaurav Bhatia, Alexander Gusev, Joseph Pickrell, Joel Hirschhorn, David P Strachan, Nick Patterson, Alkes L. Price
(Submitted on 12 Sep 2013)

Imputation using external reference panels is a widely used approach for increasing power in GWAS and meta-analysis. Existing HMM-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants (increasing to 87% (60%) when summary LD information is available from target samples) versus 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and is computationally very fast. As an empirical demonstration, we apply our method to 7 case-control phenotypes from the WTCCC data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of $\chi^2$ association statistics) compared to HMM-based imputation from individual-level genotypes at the 227 (176) published SNPs in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of 4 lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic vs. non-genic loci for these traits, as compared to an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
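The core idea of Gaussian imputation can be sketched as a conditional mean under a multivariate normal model: association z-scores at untyped SNPs are predicted from z-scores at typed SNPs using linkage disequilibrium (LD) correlations from a reference panel. The code below is a minimal sketch of that general technique, not the authors' software. The function name, the argument layout and the default ridge value of 0.1 (standing in for an adjustment for finite reference panel size) are assumptions.

```python
import numpy as np

def impute_z_scores(z_typed, ld_typed, ld_cross, ridge=0.1):
    """Conditional-mean imputation of z-scores (illustrative sketch).

    z_typed  : z-scores at T typed SNPs
    ld_typed : (T, T) LD correlation matrix among typed SNPs
    ld_cross : (U, T) LD correlations between U untyped and T typed SNPs
    ridge    : hypothetical regularization added to the diagonal of ld_typed
    """
    z_typed = np.asarray(z_typed, dtype=float)
    ld_typed = np.asarray(ld_typed, dtype=float)
    ld_cross = np.asarray(ld_cross, dtype=float)
    regularized = ld_typed + ridge * np.eye(ld_typed.shape[0])
    # weights = ld_cross @ inv(regularized), computed without an explicit inverse.
    weights = np.linalg.solve(regularized, ld_cross.T).T
    return weights @ z_typed

# Toy example (hypothetical numbers): two typed SNPs, one untyped SNP that is
# in perfect LD with the first typed SNP.
toy_ld_typed = np.array([[1.0, 0.5], [0.5, 1.0]])
toy_ld_cross = np.array([[1.0, 0.5]])
toy_z = np.array([3.0, 1.0])
print(impute_z_scores(toy_z, toy_ld_typed, toy_ld_cross, ridge=0.0))
print(impute_z_scores(toy_z, toy_ld_typed, toy_ld_cross))  # shrunk by the ridge
```

With no regularization, an untyped SNP in perfect LD with a typed SNP simply inherits that SNP's z-score. A positive ridge shrinks the imputed value toward zero.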

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown
(Submitted on 11 Sep 2013)

K-mer abundance analysis is widely used for many purposes in sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a CountMin Sketch. The CountMin Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support streaming k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a CountMin Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, and DSK. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer error rates. Khmer is implemented in C++ wrapped with a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
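For readers unfamiliar with the data structure, here is a textbook CountMin Sketch applied to k-mer counting. It is a sketch of the general technique only and does not reproduce khmer's implementation or API. The class name, the choice of BLAKE2 as the hash function, and the table dimensions are assumptions made for illustration.

```python
import hashlib

class CountMinSketch:
    """Textbook CountMin Sketch: `depth` hash rows, each with `width` counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, obtained by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate and never undercounts.
        return min(self.tables[row][col] for row, col in self._cells(item))

def kmers(sequence, k):
    """Yield every overlapping k-mer in a sequence."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]

# Online counting: k-mers are added as they stream past, and only counts are stored.
sketch = CountMinSketch(width=1024, depth=4)
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
print(sketch.count("ACGT"))  # at least 2, the true count in this toy sequence
```

Memory use is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is where the savings on sparse data come from. The price is the one-sided overcount mentioned in the abstract.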

A Survey on Migration-Selection Models in Population Genetics

Reinhard Bürger
(Submitted on 10 Sep 2013)

This survey focuses on the most important aspects of the mathematical theory of population genetic models of selection and migration between discrete niches. Such models are most appropriate if the dispersal distance is short compared to the scale at which the environment changes, or if the habitat is fragmented. The general goal of such models is to study the influence of population subdivision and gene flow among subpopulations on the amount and pattern of genetic variation maintained. Only deterministic models are treated. Because space is discrete, they are formulated in terms of systems of nonlinear difference or differential equations. A central topic is the exploration of the equilibrium and stability structure under various assumptions on the patterns of selection and migration. Another important, closely related topic concerns conditions (necessary or sufficient) for fully polymorphic (internal) equilibria. First, the theory of one-locus models with two or multiple alleles is laid out. Then, mostly very recent, developments about multilocus models are presented. Finally, as an application, analysis and results of an explicit two-locus model emerging from speciation theory are highlighted.

A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

M. Cyrus Maher, Ryan D. Hernandez
(Submitted on 9 Sep 2013)

Ortholog detection (OD) is a critical step for comparative genomic analysis of protein-coding sequences. There is a range of methods available for OD. However, relative performance varies by application, stymying attempts to identify a single best method. In this paper, we present a novel tool, MOSAIC, which is capable of integrating the entire swath of OD methods. We analyze the results of applying MOSAIC over four methodologically diverse OD methods. Relative to component and competing methods, we demonstrate large gains in the number of detected orthologs while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.

Inferring selective constraint and recent gain and loss of function from population genomic data

Daniel R. Schrider, Andrew D. Kern
(Submitted on 10 Sep 2013)

The comparative genomics revolution of the past decade has enabled the discovery of functional elements in the human genome via sequence comparison. While that is so, an important class of elements, those specific to humans, is entirely missed by searching for sequence conservation across species. Here we present an analysis based on variation data among human genomes that utilizes a supervised machine learning approach for the identification of human specific function in the genome. Using only allele frequency information from the complete low coverage 1000 Genomes Project dataset in conjunction with a support vector machine trained from known functional and non-functional portions of the genome, we are able to identify functional portions of the genome with extremely high accuracy (~88%). Our method identifies previously known human-specific gains or losses of function and uncovers many novel candidates. Candidate targets for gain of function along the human lineage include a novel isoform of a killer cell immunoglobulin-like receptor, while loss of function candidates include many members of a gene cluster involved in shaping the complexity of synaptic connections in the brain. Finally, we show that the majority of the genome is unconstrained by natural selection currently, in agreement with what has been estimated from phylogenetic methods but in sharp contrast to estimates based on transcriptomics or other high-throughput functional methods.

Our paper: The inevitability of unconditionally deleterious substitutions during adaptation

This author post is by Joshua B. Plotkin and David McCandlish on their preprint “The inevitability of unconditionally deleterious substitutions during adaptation”, arXived here.

The idea for this paper came to us while we were re-reading an earlier study by Sergey Kryazhimskiy and others, on the dynamics of adapting populations (Kryazhimskiy et al. 2009). Kryazhimskiy et al. studied the “fitness trajectory” — that is, the mean fitness across an ensemble of populations, as a function of time. The basic idea of their study was to infer the structure of an underlying fitness landscape by observing the fitness trajectory in experimental populations of evolving organisms (such as the ones from Lenski’s long-term experiments).

In re-reading Sergey’s paper, we noticed that the fitness trajectories were always monotonic, that is, the expected fitness would either always decrease or always increase. Indeed Kryazhimskiy et al. 2009 had presented a detailed analytical theory for how the fitness trajectory should behave (at least for a large class of models), and according to this theory the fitness trajectories should always be monotonic. However, when we looked more carefully at how this analytical theory was derived, we saw that the apparent impossibility of non-monotonic fitness trajectories was actually an unintended consequence of a seemingly innocuous technical assumption. The theory had been thoroughly tested against simulations for the examples explored in the paper, and it had performed quite well. But still, we wondered, could we construct fitness landscapes with non-monotonic fitness trajectories?

The answer was yes. In fact, we found conditions that produced non-monotonic fitness trajectories in one of the simplest and most widely used models of a fitness landscape: the house of cards model, where the fitness of each new mutation is drawn from some fixed probability distribution. We also noticed an interesting pattern. If the population starts at a very low fitness then the fitness trajectory is monotonically increasing. But if the starting fitness of the population is closer to the equilibrium mean fitness (that is, the value that the fitness trajectory would eventually tend to) the fitness trajectories will become non-monotonic: fitness will initially decrease, and then, eventually, increase to its asymptotic value.

After much coffee, we eventually proved that this basic pattern must occur for any house of cards model whose equilibrium fitness distribution has a finite mean (at least under a Moran process in the limit of weak mutation). That result was the germ that eventually developed into our paper, which includes further results on the house of cards model, and on Fisher’s geometric model.

Why are non-monotonic fitness trajectories interesting? On the one hand, this is a population-genetic curiosity in a vein similar to McVean and Charlesworth (1999)’s observation that increasing the strength of purifying selection can sometimes increase the nucleotide site diversity. It’s somewhat counter-intuitive that the expected selection coefficient of the first mutation to fix in an adapting population can be negative, even on a fitness landscape that contains no local maxima!

On the other hand, we think that this result has important implications for studying adaptive evolution. It is common in such studies to assume that deleterious mutations can never fix (e.g. by approximating the probability of fixation for a new mutation as 2s). Our results on the surprising prevalence of deleterious substitutions during adaptation should hopefully spur others to consider carefully the circumstances under which ignoring deleterious fixations is justified.
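To make the point about deleterious fixations concrete, here is a small sketch using the standard fixation probability for a single mutant in a Moran process. The function name and the example values of s and population size are assumptions chosen for illustration. Under this formula the large-population limit for a beneficial mutation is s/(1+s), which plays the role of the "2s"-style approximation mentioned above.

```python
def moran_fixation_probability(s, n):
    """Fixation probability of one mutant with relative fitness 1 + s
    in a Moran population of size n (requires s > -1)."""
    if s <= -1:
        raise ValueError("s must be greater than -1")
    if s == 0:
        return 1.0 / n  # neutral case
    r = 1.0 + s
    return (1.0 - 1.0 / r) / (1.0 - r ** (-n))

# Hypothetical example values: population of 100, |s| = 0.01.
n = 100
p_beneficial = moran_fixation_probability(0.01, n)
p_neutral = moran_fixation_probability(0.0, n)
p_deleterious = moran_fixation_probability(-0.01, n)
print(p_beneficial, p_neutral, p_deleterious)
```

The deleterious mutation fixes less often than a neutral one, but its fixation probability is strictly positive. An approximation that is linear in s would instead assign it zero or a negative value, so deleterious fixations drop out of the model entirely.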

Joshua B. Plotkin and David McCandlish

Works cited:

Kryazhimskiy, S., Tkacik, G., and Plotkin, J. B. (2009). The dynamics of adaptation on correlated fitness landscapes. PNAS, 106:18638-18643.

McVean, G. A., and Charlesworth, B. (1999). A population genetic model for the evolution of synonymous codon usage: patterns and predictions. Genetical Research, 74:145-158.