Natural selection. V. How to read the fundamental equations of evolutionary change in terms of information theory

Steven A. Frank
(Submitted on 16 Nov 2012)

The equations of evolutionary change by natural selection are commonly expressed in statistical terms. Fisher’s fundamental theorem emphasizes the variance in fitness. Quantitative genetics expresses selection with covariances and regressions. Population genetic equations depend on genetic variances. How can we read those statistical expressions with respect to the meaning of natural selection? One possibility is to relate the statistical expressions to the amount of information that populations accumulate by selection. However, the connection between selection and information theory has never been compelling. Here, I show the correct relations between statistical expressions for selection and information theory expressions for selection. Those relations link selection to the fundamental concepts of entropy and information in the theories of physics, statistics, and communication. We can now read the equations of selection in terms of their natural meaning. Selection causes populations to accumulate information about the environment.
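The link between a statistical reading and an information reading of selection can be checked numerically on a toy example. The sketch below is not taken from the paper; it assumes a simple discrete model in which frequencies update as q_i' = q_i w_i / w̄, and the frequencies and fitnesses are made-up values. Under that assumption, the change in mean fitness (holding fitnesses fixed) equals the variance in fitness divided by mean fitness, and the change in mean log relative fitness equals the Jeffreys divergence (the symmetrized Kullback-Leibler divergence) between the populations before and after selection.

```python
import math

def select(q, w):
    """One round of selection: q_i' = q_i * w_i / w_bar."""
    w_bar = sum(qi * wi for qi, wi in zip(q, w))
    return [qi * wi / w_bar for qi, wi in zip(q, w)], w_bar

def kl(p, r):
    """Kullback-Leibler divergence sum_i p_i * log(p_i / r_i), in nats."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

# Hypothetical type frequencies and fitnesses (illustrative values only).
q = [0.5, 0.3, 0.2]
w = [1.0, 1.5, 2.0]

q_new, w_bar = select(q, w)

# Statistical reading: change in mean fitness, with fitnesses held fixed,
# equals the variance in fitness divided by mean fitness.
var_w = sum(qi * (wi - w_bar) ** 2 for qi, wi in zip(q, w))
delta_w_bar = sum(qn * wi for qn, wi in zip(q_new, w)) - w_bar

# Information reading: with m_i = log(w_i / w_bar) = log(q_i' / q_i),
# the change in the mean of m equals KL(q'||q) + KL(q||q').
m = [math.log(wi / w_bar) for wi in w]
delta_m_bar = sum((qn - qi) * mi for qn, qi, mi in zip(q_new, q, m))
jeffreys = kl(q_new, q) + kl(q, q_new)

print(delta_w_bar, var_w / w_bar)
print(delta_m_bar, jeffreys)
```

Each printed pair should agree up to floating-point error. The second identity follows directly from the definition of m_i and does not depend on the particular numbers chosen.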

Evolution of male life histories and age-dependent sexual signals under female choice

Joel James Adamson
(Submitted on 16 Nov 2012)

Strategic models have predicted that males could benefit from age-dependent sexual advertisement following evolution of increased lifespan. Dynamical considerations may play a crucial role in the origin of age-dependent sexual signals, despite strategic advantages in populations with established signals and preferences. I investigated the problem that rare trait-bearing males may suffer low viability due to small young-age signals, restricting the favorable conditions for age-dependent trait evolution. I also asked when age-dependence will prevail during trait evolution if males bearing age-dependent traits co-occur with males carrying age-independent traits. I used numerical simulations to analyze the evolution of an age-structured haploid population with no genetic drift. Age-dependence limits the evolution of male traits to cases of relatively weak selection against the trait, but the trait fixes at smaller sizes when age-dependent than when age-independent. When mode of expression (age-dependence versus age-independence) evolved along with the trait, age-independence prevailed over much of parameter space, although mode of expression remained polymorphic at small trait sizes under weak selection. The ubiquity of age-dependent traits in nature shows that many species’ life-histories satisfy the conditions for age-dependent trait evolution. My results suggest that high adult male survival facilitates sexual selection by favoring the evolution of age-dependent sexual signals under fairly broad conditions.

Bacterial diversity associated with Drosophila in the laboratory and in the natural environment

Fabian Staubach, John F. Baines, Sven Kuenzel, Elisabeth M. Bik, Dmitri A. Petrov
(Submitted on 14 Nov 2012)

All higher organisms are associated with bacterial communities. Bacteria have a range of effects on their metazoan hosts from being indispensable for survival to being lethal pathogens. Because bacteria have phenotypic effects on their hosts, they can also be involved in host adaptation to the environment. The fruit fly Drosophila is a classic model organism to study adaptation as well as the relationship between genetic variation and phenotypes. Recently, Drosophila has received attention in immunology and studies of host-microbe interaction. Although bacterial communities associated with Drosophila might be important for many aspects of Drosophila biology, little is known about their diversity and composition or the factors shaping these communities. We used 454-based sequencing of a variable region of the bacterial 16S ribosomal gene to characterize the bacterial communities associated with wild and laboratory Drosophila isolates. In order to specifically investigate effects of food source and host species on bacterial communities, we analyzed samples from wild Drosophila melanogaster and D. simulans flies collected from a variety of natural substrates, as well as from adults and larvae of nine laboratory-reared Drosophila species. We find substantial variation of bacterial communities within and between laboratories that could interfere with phenotype studies. We show that bacterial communities associated with wild-caught Drosophila contain more bacterial species than laboratory-raised flies, but that they are on average less diverse than vertebrate communities. The natural Drosophila-associated microbiota appears to be predominantly shaped by food substrate with an additional but smaller effect of host species identity.

Improved haplotyping of rare variants using next-generation sequence data

Fouad Zakharia, Carlos Bustamante
(Submitted on 9 Nov 2012)

Accurate identification of haplotypes in sequenced human genomes can provide invaluable information about population demography and fine-scale correlations along the genome, thus empowering both population genomic and medical association studies. Yet phasing unrelated individuals remains a challenging problem. Incorporating available data from high-throughput sequencing into traditional statistical phasing approaches is a promising avenue to alleviate this problem. We present a novel statistical method that expands on an existing graphical haplotype reconstruction method (shapeIT) to incorporate phasing information from paired-end read data. The algorithm harnesses the haplotype graph information estimated by shapeIT from genotypes across the population and refines haplotype likelihoods for a given individual to be compatible with the sequencing data. Applying the method to HapMap individuals genotyped on the Affymetrix Axiom chip at 7,745,081 SNPs and on a trio sequenced by Complete Genomics, we found that the inclusion of paired-end read data significantly improved phasing, with reductions in switch error on the order of 4-15% against shapeIT across all panels. As expected, the improvements were found to be most significant at sites harboring rare variants; furthermore, we found that longer read sizes and higher throughput translated to greater decreases in switching error, as did higher variance in the size of the insert separating the two reads, suggesting that multi-platform next-generation sequencing may be exploited to yield particularly accurate haplotypes. Overall, the phasing improvements afforded by this new method highlight the power of integrating sequencing read information and population genotype data for reconstructing haplotypes in unrelated individuals.
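For readers unfamiliar with the switch error metric mentioned above, here is a minimal sketch of how it is commonly computed. It assumes the usual definition (the number of times the inferred phase flips relative to the true phase between consecutive heterozygous sites); the abstract does not spell out the authors' exact procedure, and the function name and toy data are hypothetical.

```python
def switch_errors(truth, inferred):
    """Count switch errors between two phasings of one individual.

    Each argument is a sequence of (allele_on_hap1, allele_on_hap2) pairs,
    one per site, in genomic order. Sites that are homozygous in the truth
    carry no phase information and are skipped.
    Returns (number_of_switches, number_of_opportunities).
    """
    orientation = []
    for t, i in zip(truth, inferred):
        t, i = tuple(t), tuple(i)
        if t[0] == t[1]:
            continue  # homozygous site: uninformative for phasing
        if i == t:
            orientation.append(0)  # inferred phase matches the truth
        elif i == (t[1], t[0]):
            orientation.append(1)  # inferred phase is flipped
        else:
            raise ValueError("genotypes disagree at a heterozygous site")
    # A switch is any change of orientation between adjacent het sites.
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches, max(len(orientation) - 1, 0)

# Toy data: five sites, the third one homozygous.
truth = [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0)]
inferred = [(0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]

switches, opportunities = switch_errors(truth, inferred)
print(switches, opportunities)
```

In this toy case the inferred haplotypes follow the truth for the first two heterozygous sites and are flipped for the last two, so there is one switch out of three opportunities.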

Our paper: The McDonald-Kreitman Test and its Extensions under Frequent Adaptation: Problems and Solutions

For our next guest post Philipp Messer and Dmitri Petrov write about their paper, The McDonald-Kreitman Test and its Extensions under Frequent Adaptation: Problems and Solutions, arXived here.

The McDonald-Kreitman (MK) test is the basis of most modern approaches to measure the rate of adaptation from population genomic data. This test was used to argue that in some organisms, such as Drosophila, the rate of adaptation is surprisingly high. However, the MK test, and in fact most of the current machinery of population genetics, relies on the assumption that adaptation is rare so that the effects of selective sweeps on linked variation can be neglected. We test this assumption using a powerful forward simulation and show that the MK test is severely biased even when the rate of adaptation is only moderate. The biases arise from the complex linkage effects between slightly deleterious and strongly advantageous mutations. In order to deal with these biases, we suggest a new robust approach based on a simple asymptotic extension of the MK test.
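As background, the widely used MK-based estimate of the fraction of adaptive substitutions is α = 1 − (D_s P_n) / (D_n P_s), where D and P are counts of divergent and polymorphic sites and the subscripts n and s denote nonsynonymous and synonymous. The sketch below computes that classic estimate and, as one plausible reading of an "asymptotic extension", the same quantity within derived-allele-frequency bins, so that the trend toward high frequencies can be inspected. The binning, the function names, and all counts are assumptions made for illustration; the post does not give the authors' exact procedure, and no curve fitting or extrapolation is attempted here.

```python
def mk_alpha(d_n, d_s, p_n, p_s):
    """Classic MK-based estimate: alpha = 1 - (D_s * P_n) / (D_n * P_s)."""
    return 1.0 - (d_s * p_n) / (d_n * p_s)

def alpha_by_frequency(d_n, d_s, p_n_bins, p_s_bins):
    """alpha computed separately within each derived-allele-frequency bin.

    p_n_bins and p_s_bins hold nonsynonymous and synonymous polymorphism
    counts per bin, ordered from low to high frequency.
    """
    return [mk_alpha(d_n, d_s, pn, ps) for pn, ps in zip(p_n_bins, p_s_bins)]

# Hypothetical counts (not from the paper).
d_n, d_s = 100, 200          # nonsynonymous / synonymous divergence
p_n_bins = [60, 20, 10]      # nonsynonymous polymorphisms, low -> high frequency
p_s_bins = [100, 50, 40]     # synonymous polymorphisms, low -> high frequency

pooled = mk_alpha(d_n, d_s, sum(p_n_bins), sum(p_s_bins))
per_bin = alpha_by_frequency(d_n, d_s, p_n_bins, p_s_bins)
print(pooled, per_bin)
```

With these made-up counts the per-bin values rise with frequency, which is the pattern expected if slightly deleterious nonsynonymous polymorphisms are concentrated at low frequencies and pull the pooled estimate downward.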

We further show that already under very moderate amounts of adaptation, linkage effects from recurrent selective sweeps can profoundly affect key population genetic parameters, such as the fixation probabilities of deleterious mutations and the frequency distributions of polymorphisms. In synonymous polymorphism data, these linkage effects leave signatures that can easily be mistaken for the signatures of recent, severe population expansion.

The bigger claim of our paper is that the effects of linked selection cannot be simply swept under the rug by introducing effective parameters, such as effective population size or effective strength of selection, and then using these effective parameters in formulae derived from the diffusion approximation under the assumption of free recombination. Given that most of our estimates of the key evolutionary parameters are still obtained from methods based on this paradigm, we argue that it is crucial to verify whether they are robust to linkage effects.

Philipp Messer and Dmitri Petrov

The evolution of genetic architectures underlying quantitative traits

Etienne Rajon, Joshua B. Plotkin
(Submitted on 31 Oct 2012)

In the classic view introduced by R.A. Fisher, a quantitative trait is encoded by many loci with small, additive effects. Recent advances in QTL mapping have begun to elucidate the genetic architectures underlying vast numbers of phenotypes across diverse taxa, producing observations that sometimes contrast with Fisher’s blueprint. Despite these considerable empirical efforts to map the genetic determinants of traits, it remains poorly understood how the genetic architecture of a trait should evolve, or how it depends on the selection pressures on the trait. Here we develop a simple, population-genetic model for the evolution of genetic architectures. Our model predicts that traits under moderate selection should be encoded by many loci with highly variable effects, whereas traits under either weak or strong selection should be encoded by relatively few loci. We compare these theoretical predictions to qualitative trends in the genetics of human traits, and to systematic data on the genetics of gene expression levels in yeast. Our analysis provides an evolutionary explanation for broad empirical patterns in the genetic basis of traits, and it introduces a single framework that unifies the diversity of observed genetic architectures, ranging from Mendelian to Fisherian.

Asexual Evolution Waves: Fluctuations and Universality

Daniel S. Fisher
(Submitted on 23 Oct 2012)

In large asexual populations, multiple beneficial mutations arise in the population, compete, interfere with each other, and accumulate on the same genome, before any of them fix. The resulting dynamics, although studied by many authors, is still not fully understood, fundamentally because the effects of fluctuations due to the small numbers of the fittest individuals are large even in enormous populations. In this paper, branching processes and various asymptotic methods for analyzing the stochastic dynamics are further developed and used to obtain information on fluctuations, time dependence, and the distributions of sizes of subpopulations, jumps in the mean fitness, and other properties. The focus is on the behavior of a broad class of models: those with a distribution of selective advantages of available beneficial mutations that falls off more rapidly than exponentially. For such distributions, many aspects of the dynamics are universal – quantitatively so for extremely large populations. On the most important time scale that controls coalescent properties and fluctuations of the speed, the dynamics is reduced to a simple stochastic model that couples the peak and the high-fitness “nose” of the fitness distribution. Extensions to other models and distributions of available mutations are discussed briefly.

Our paper: Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs


This guest post is by Christopher Brown, Lara Mangravite, and Barbara Engelhardt on their paper: Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs arXived here.

Why do we study eQTLs? Why don’t we count bristles?

The genetic dissection of complex trait models, independent of the particular phenotype, is useful for improving our understanding of the genetic architecture underlying the biochemical function that regulates complex traits in general. In the last ten years, gene expression levels themselves have emerged as useful phenotypes amenable to genetic dissection with several advantages, most notably that it is easy to accurately quantify tens of thousands of traits simultaneously (indeed even more when we address splicing and promoter usage). While the identification of SNPs that are associated with variation in gene expression (eQTLs) is certainly interesting at this basic level, an additional critical use for eQTL data has emerged. Because the majority of common human phenotypic variation appears to be driven by non-coding sequence variants, eQTL analyses are beginning to help with the mechanistic interpretation of GWAS results. In light of these interests and applications, we believe that eQTL analyses are hampered by (at least) three important limitations, which we have attempted to address in our recent preprint:

(1) Methodological (non) uniformity. Most eQTL studies have been performed by different groups, on different genotyping and gene expression platforms, with different association methods, and using different criteria for defining significance. This lack of uniformity complicates even simple cross-study comparisons; for example, what fraction of genes has one or more independently associated eQTL when analyzed across tissues? We address this issue by testing for eQTL associations across a diverse set of cell types using a uniform pipeline with standardized analysis parameters to perform all analytical steps starting from raw data. As a fairly trivial example, our analyses across the eleven studies demonstrated that nearly all of the variation in the proportion of genes with significant eQTL associations identified within each study can be explained by just two factors: study size and replicate gene expression measurements. The proportion of genes with one or more independently associated eQTLs, then, is probably not 5-10% as has been hypothesized, but most or all of them; a clearer picture should emerge once studies are designed with sufficient power.

(2) Undercharacterized cell specificity. It is generally agreed that some eQTLs regulate gene expression in a cell type specific manner. When using eQTLs to interpret the genetic contribution to complex clinical traits, it is important to consider the cell type(s) most relevant to the trait of interest. However, if we don’t know what cell type is responsible for a phenotype or if we don’t have eQTL data for the cell type of interest, we are forced to extrapolate from eQTLs identified in other cell types. By enabling the simultaneous comparison of within and between cell type eQTL replication for multiple cell type combinations and integrating these results with cis-regulatory element (CRE) mapping data from ENCODE, we have addressed several unresolved questions concerning the nature of cell type specific and ubiquitous eQTL SNPs. We find that eQTL-CRE overlap is frequently cell type specific and that this information can be used to predict cell specificity of eQTLs in the absence of additional gene expression data from the cell type of interest. While these results are certainly preliminary (and indeed we see many possible improvements), we hope this will improve the utility of eQTL-GWAS comparisons, particularly in situations where the GWAS cell type of interest lacks eQTL data.

(3) Resolution, causality, and mechanism. Lead tag SNPs are probably causal variants less than 30% of the time. While larger and more diverse genomic sample sets are essential to improve the resolution for identifying causal variants, this is not always possible due to time or budget constraints. However, the application of orthogonal genomic data also has the potential to considerably refine resolution with the added benefit of providing insight into the mechanism through which a causal variant acts. We approach this (as a few other groups have – notably Dan Gaffney et al.) by integrating CRE data into our analyses, because it appears that genetic variants that overlap certain types of CREs are much more likely to be functional than those that do not. We believe that this hypothesis, and the methods used to address it, need to be validated with directed functional assays, but we see no reason to doubt the principle of understanding heritable phenotypes using genotype functional analyses. Furthermore, the analysis of cell specific eQTL data in the context of cell specific CRE data, which is now possible, enables predictions about the regulatory mechanisms that are affected by a specific eQTL, which will allow us to place GWAS hits into pathways or provide other meaningful biological insights.

Why did we submit the paper to arXiv and Haldane’s Sieve?

We are big proponents of open access publication, open data, and transparent methods and analysis. At least part of what we’ve done here is to create a resource that we hope will be useful to the broader community. We are open to pre and post publication review of and commentary on our motivations and methods. Furthermore, we have submitted all of the eQTLs we identify to a database of eQTLs (eqtl.uchicago.edu), and we are currently securing funding to develop open access, online tools to help GWAS researchers follow up specific functional variants using our methods.

Christopher Brown, Lara Mangravite, Barbara Engelhardt

The equivalence between weak and strong purifying selection

Benjamin H Good, Michael M Desai
(Submitted on 16 Oct 2012)

Weak purifying selection, acting on many linked mutations, may play a major role in shaping patterns of molecular evolution in natural populations. Yet efforts to infer these effects from DNA sequence data are limited by our incomplete understanding of weak selection on local genomic scales. Here, we demonstrate a natural symmetry between weak and strong selection, in which the effects of many weakly selected mutations on patterns of molecular evolution are equivalent to a smaller number of more strongly selected mutations. By introducing a coarse-grained “effective selection coefficient,” we derive an explicit mapping between weakly selected populations and their strongly selected counterparts, which allows us to make accurate and efficient predictions across the full range of selection strengths. This suggests that an effective selection coefficient and effective mutation rate — not an effective population size — is the most accurate summary of the effects of selection over locally linked regions. Moreover, this correspondence places fundamental limits on our ability to resolve the effects of weak selection from contemporary sequence data alone.

Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs

Christopher D Brown, Lara M Mangravite, Barbara E Engelhardt
(Submitted on 11 Oct 2012)

Genetic variants in cis-regulatory elements or trans-acting regulators commonly influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in four parts: first we identified eQTLs from eleven studies on seven cell types; next we quantified cell type specific eQTLs across the studies; then we integrated eQTL data with cis-regulatory element (CRE) data sets from the ENCODE project; finally we built a classifier to predict cell type specific eQTLs. Consistent with prior studies, we demonstrate that allelic heterogeneity is pervasive at cis-eQTLs and that cis-eQTLs are often cell type specific. Within and between cell type eQTL replication is associated with eQTL SNP overlap with hundreds of cell type specific CRE element classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. Using a random forest classifier including 526 CRE data sets as features, we successfully predict the cell type specificity of eQTL SNPs in the absence of gene expression data from the cell type of interest. We anticipate that such integrative, predictive modeling will improve our ability to understand the mechanistic basis of human complex phenotypic variation.