Our paper: The inevitability of unconditionally deleterious substitutions during adaptation

This author post is by Joshua B. Plotkin and David McCandlish on their preprint “The inevitability of unconditionally deleterious substitutions during adaptation”, arXived here.

The idea for this paper came to us while we were re-reading an earlier study by Sergey Kryazhimskiy and others, on the dynamics of adapting populations (Kryazhimskiy et al. 2009). Kryazhimskiy et al. studied the “fitness trajectory” — that is, the mean fitness across an ensemble of populations, as a function of time. The basic idea of their study was to infer the structure of an underlying fitness landscape by observing the fitness trajectory in experimental populations of evolving organisms (such as the ones from Lenski’s long-term experiments).

In re-reading Sergey’s paper, we noticed that the fitness trajectories were always monotonic, that is, the expected fitness would either always decrease or always increase. Indeed Kryazhimskiy et al. 2009 had presented a detailed analytical theory for how the fitness trajectory should behave (at least for a large class of models), and according to this theory the fitness trajectories should always be monotonic. However, when we looked more carefully at how this analytical theory was derived, we saw that the apparent impossibility of non-monotonic fitness trajectories was actually an unintended consequence of a seemingly innocuous technical assumption. The theory had been thoroughly tested against simulations for the examples explored in the paper, and it had performed quite well. But still, we wondered, could we construct fitness landscapes with non-monotonic fitness trajectories?

The answer was yes. In fact, we found conditions that produced non-monotonic fitness trajectories in one of the simplest and most widely used models of a fitness landscape: the house of cards model, where the fitness of each new mutation is drawn from some fixed probability distribution. We also noticed an interesting pattern. If the population starts at a very low fitness, then the fitness trajectory is monotonically increasing. But if the starting fitness of the population is closer to the equilibrium mean fitness (that is, the value that the fitness trajectory would eventually tend to), the fitness trajectory becomes non-monotonic: fitness will initially decrease, and then, eventually, increase to its asymptotic value.

After much coffee, we eventually proved that this basic pattern must occur for any house of cards model whose equilibrium fitness distribution has a finite mean (at least under a Moran process in the limit of weak mutation). That result was the germ that eventually developed into our paper, which includes further results on the house of cards model, and on Fisher’s geometric model.
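The pattern described above can be explored numerically. What follows is a minimal sketch, not the derivation used in the paper. It assumes a house of cards model in which the log-fitness of each new mutation is drawn from a Gaussian with mean 0 and standard deviation `sigma`, and a haploid Moran population of size `n` under weak mutation. The function names, the Gaussian choice, and the parameter values are illustrative assumptions. Under these assumptions the equilibrium mean log-fitness is `(n - 1) * sigma**2`, and the sketch computes the expected selection coefficient of the first substitution for a given starting log-fitness `x0`.

```python
import math


def moran_fixation_prob(s, n):
    """Fixation probability of one mutant with log-fitness advantage s
    in a haploid Moran population of size n."""
    if abs(s) < 1e-12:
        return 1.0 / n  # neutral limit
    # (1 - e^{-s}) / (1 - e^{-n s}), written with expm1 for accuracy
    return math.expm1(-s) / math.expm1(-n * s)


def expected_first_step(x0, n, sigma, grid=4001, width=8.0):
    """Expected selection coefficient of the first substitution when the
    population starts at log-fitness x0 (assumed Gaussian house of cards)."""
    lo = -width * sigma
    step = 2.0 * width * sigma / (grid - 1)
    num = 0.0
    den = 0.0
    for i in range(grid):
        x = lo + i * step                      # log-fitness of the new mutation
        s = x - x0                             # its selection coefficient
        density = math.exp(-0.5 * (x / sigma) ** 2)  # unnormalized Gaussian
        weight = density * moran_fixation_prob(s, n)
        num += s * weight
        den += weight
    return num / den  # normalizing constants cancel in the ratio


if __name__ == "__main__":
    n, sigma = 10, 0.01                 # illustrative values only
    eq_mean = (n - 1) * sigma ** 2      # equilibrium mean under these assumptions
    for x0 in (-0.02, 0.3 * eq_mean, 0.7 * eq_mean):
        print(x0, expected_first_step(x0, n, sigma))
```

With these illustrative parameters, a start far below the equilibrium mean gives a positive expected first step, while a start at 70% of the equilibrium mean gives a negative one, even though the long-run expectation lies above the starting point.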

Why are non-monotonic fitness trajectories interesting? On the one hand, this is a population-genetic curiosity in a vein similar to the observation by McVean and Charlesworth (1999) that increasing the strength of purifying selection can sometimes increase the nucleotide site diversity. It’s somewhat counter-intuitive that the expected selection coefficient of the first mutation to fix in an adapting population can be negative, even on a fitness landscape that contains no local maxima!

On the other hand, we think that this result has important implications for studying adaptive evolution. It is common in such studies to assume that deleterious mutations can never fix (e.g. by approximating the probability of fixation for a new mutation as 2s). Our results on the surprising prevalence of deleterious substitutions during adaptation should hopefully spur others to consider carefully the circumstances under which ignoring deleterious fixations is justified.
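To make the 2s point concrete, here is a small sketch comparing the 2s approximation with a formula that lets deleterious mutations fix. It uses Kimura's diffusion approximation for a haploid Wright-Fisher population as a stand-in. That choice is an assumption for illustration and is not the Moran-process setting of the paper. The function names and parameter values are likewise hypothetical.

```python
import math


def haldane_2s(s):
    """The common 2s approximation; deleterious mutations never fix."""
    return max(0.0, 2.0 * s)


def kimura_fixation_prob(s, n):
    """Kimura's diffusion approximation for a new mutation in a haploid
    Wright-Fisher population of size n (assumed model, for illustration)."""
    if abs(s) < 1e-12:
        return 1.0 / n  # neutral limit
    # (1 - e^{-2s}) / (1 - e^{-2 n s})
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * n * s)


if __name__ == "__main__":
    # A strongly beneficial mutation: the two formulas roughly agree.
    print(haldane_2s(0.05), kimura_fixation_prob(0.05, 1000))
    # A weakly deleterious mutation: 2s says it never fixes, while the
    # diffusion formula gives a probability close to the neutral value 1/n.
    print(haldane_2s(-0.001), kimura_fixation_prob(-0.001, 100))
```

The comparison suggests why the approximation is reasonable for strongly beneficial mutations, and why it may mislead when selection coefficients are of order 1/N or smaller.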

Joshua B. Plotkin and David McCandlish

Works cited:

Kryazhimskiy, S., Tkacik, G., and Plotkin, J. B. (2009). The dynamics of adaptation on correlated fitness landscapes. PNAS, 106:18638-18643.

McVean, G. A., and Charlesworth, B. (1999). A population genetic model for the evolution of synonymous codon usage: patterns and predictions. Genetical Research, 74:145-158.

Our paper: An integrative genomic approach illuminates the causes and consequences of genetic background effects

This is a guest post by Dr. Chris Chandler on “An integrative genomic approach illuminates the causes and consequences of genetic background effects”, arXived here.

Biologists have long recognized that a mutation can have variable effects on an organism’s phenotype; even introductory genetics classes often make this observation by introducing the concepts of penetrance and expressivity. More mysterious, however, are the factors that influence the phenotypic expression of a mutation or allele. We know, for instance, that introducing the same mutation into two different but otherwise wild-type genetic backgrounds can result in vastly different phenotypes. But what specific differences between these two genetic backgrounds interact with the mutation, and how? And how does gene expression fit into this puzzle? Answering these questions has not been an easy task, which is not too surprising when you realize that penetrance and expressivity are, in reality, complex quantitative traits. We therefore adopted a multi-pronged genetic and genomic approach to tease apart the mechanisms mediating background dependence in a mutation affecting wing development in the fly Drosophila melanogaster.

The phenotypic patterns seen in our model trait have already been characterized: the scalloped[E3] (sd[E3]) mutation has strong effects in the Oregon-R (ORE) background, resulting in a tiny, underdeveloped wing, while its effects in the Samarkand (SAM) background are still obvious but much less extreme, resulting in a blade-like wing.

To try to find out what causes these differences, we generated and combined a variety of datasets: whole-genome re-sequencing of the parental strains and a panel of introgression lines to map the background modifiers of the sd[E3] phenotype; transcription profiling (using two microarray datasets and one RNA-seq-like dataset), including analyses of allele-specific expression in flies carrying a “hybrid” genetic background; predictions of binding sites for the SD protein, which is a transcription factor; and a screen for deletion alleles that enhance or suppress the sd[E3] phenotype in a background-dependent fashion.

Our results point to a complex genetic basis for this background dependence. We found evidence for a number of loci that are likely to modulate the effects of the sd[E3] allele. However, some unexpected inconsistencies provide a cautionary tale for those intending to take a similar mapping-by-introgression approach for their trait of interest: do multiple replicates, and introgress in both directions, or you may inadvertently end up mapping some other trait! Although the number of candidate genes we identified was generally large, by combining those results with data from our other datasets, we were able to narrow our focus to those showing a consistent signal, yielding a robust set of candidate genes for further study. Without getting into too much detail, we also used a novel approach to show that background-dependent modifier deletions of the sd[E3] phenotype (of which there are many) involve higher-order epistatic interactions between the sd[E3] mutation, the deletion, and the genetic background, rather than quantitative non-complementation (so more than two genes were involved).

Overall, we think that an integrative approach like this could be useful for others trying to understand complex traits, including genetic background-dependence of mutations. In addition, if you’re a Drosophila researcher working with the commonly used Samarkand or Oregon-R strains, our genome re-sequencing data (raw and assembled), including SNPs, will soon be available in public repositories for genetic data.

Bayesian Model Selection in Complex Linear Systems, as Illustrated in Genetic Association Studies

Xiaoquan Wen
(Submitted on 3 Sep 2013)

Motivated by examples from genetic association studies, this paper considers the model selection problem in a general complex linear model system and in a Bayesian framework. We discuss formulating model selection problems and incorporating context-dependent a priori information through different levels of prior specifications. We also derive analytic Bayes factors and their approximations to facilitate model selection and discuss their theoretical and computational properties. We demonstrate our Bayesian approach based on an implemented Markov Chain Monte Carlo (MCMC) algorithm in simulations and a real data application of mapping tissue-specific eQTLs. Our novel results on Bayes factors provide a general framework to perform efficient model comparisons in complex linear model systems.

The inevitability of unconditionally deleterious substitutions during adaptation

David M. McCandlish, Charles L. Epstein, Joshua B. Plotkin
(Submitted on 4 Sep 2013)

Studies on the genetics of adaptation typically neglect the possibility that a deleterious mutation might fix. Nonetheless, here we show that, in many regimes, the first substitution is most often deleterious, even when fitness is expected to increase in the long term. In particular, we prove that this phenomenon occurs under weak mutation for any house-of-cards model with an equilibrium distribution. We find that the same qualitative results hold under Fisher’s geometric model. We also provide a simple intuition for the surprising prevalence of unconditionally deleterious substitutions during early adaptation. Importantly, the phenomenon we describe occurs on fitness landscapes without any local maxima and is therefore distinct from “valley-crossing”. Our results imply that the common practice of ignoring deleterious substitutions leads to qualitatively incorrect predictions in many regimes. Our results also have implications for the substitution process at equilibrium and for the response to a sudden decrease in population size.

MOSAIK: A hash-based algorithm for accurate next-generation sequencing read mapping

Wan-Ping Lee (1), Michael Stromberg (1 and 2), Alistair Ward (1), Chip Stewart (1 and 3), Erik Garrison (1), Gabor T. Marth (1) ((1) Department of Biology, Boston College, Chestnut Hill, MA, (2) Illumina, Inc., San Diego, CA, (3) Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA)
(Submitted on 4 Sep 2013)

This paper presents an accurate short-read mapper for next-generation sequencing data that is widely used in the 1000 Genomes Project and in genome studies of human clinical samples and other species.

Predicting the ancestral character changes in a tree is typically easier than predicting the root state

Olivier Gascuel, Mike Steel
(Submitted on 4 Sep 2013)

Predicting the ancestral sequences of a group of homologous sequences related by a phylogenetic tree has been the subject of many studies, and numerous methods have been proposed for this purpose. Theoretical results are available that show that when the mutation rate becomes too large, reconstructing the ancestral state at the tree root is no longer feasible. Here, we also study the reconstruction of the ancestral changes that occurred along the tree edges. We show that, depending on the tree and branch length distribution, reconstructing these changes (i.e. reconstructing the ancestral state of all internal nodes in the tree) may be easier or harder than reconstructing the ancestral root state. However, results from information theory indicate that for the standard Yule tree, the task of reconstructing internal node states remains feasible, even for very high substitution rates. Moreover, computer simulations demonstrate that for more complex trees and scenarios, this result still holds. For a large variety of counting, parsimony-based and likelihood-based methods, the predictive accuracy for a randomly selected internal node in the tree is indeed much higher than the accuracy of the same method when applied to the tree root. Moreover, parsimony- and likelihood-based methods appear to be remarkably robust to sampling bias and model mis-specification.

Evolutionary consequences of assortativeness in haploid genotypes

David M. Schneider, Ayana B. Martins, Eduardo do Carmo, Marcus A.M. de Aguiar
(Submitted on 3 Sep 2013)

We study the evolution of allele frequencies in a large population where random mating is violated in a particular way that is related to recent works on speciation. Specifically, we consider non-random encounters in haploid organisms described by biallelic genes at two loci and assume that individuals whose alleles differ at both loci are incompatible. We show that evolution under these conditions leads to the disappearance of one of the alleles and substantially reduces the diversity of the population. The allele that disappears, and the other allele frequencies at equilibrium, depend only on their initial values, and so does the time to equilibration. However, certain combinations of allele frequencies remain constant during the process, revealing the emergence of strong correlation between the two loci promoted by the epistatic mechanism of incompatibility. We determine the geometrical structure of the haplotype frequency space and solve the dynamical equations, obtaining a simple rule to determine the equilibrium solution from the initial conditions. We show that our results are equivalent to selection against double heterozygotes for a population of diploid individuals and discuss the relevance of our findings to speciation.

Biological Averaging in RNA-Seq

Surojit Biswas, Yash N. Agrawal, Tatiana S. Mucyn, Jeffery L. Dangl, Corbin D. Jones
(Submitted on 3 Sep 2013)

RNA-seq has become a de facto standard for measuring gene expression. Traditionally, RNA-seq experiments are mathematically averaged: they sequence the mRNA of individuals from different treatment groups, hoping to correlate phenotype with differences in arithmetic read count averages at shared loci of interest. Alternatively, the tissue from the same individuals may be pooled prior to sequencing in what we refer to as a biologically averaged design. As mathematical averaging sequences all individuals, it controls for both biological and technical variation; however, is the statistical resolution gained always worth the additional cost? To compare biological and mathematical averaging, we examined theoretical and empirical estimates of statistical efficiency and relative cost efficiency. Though less efficient at a fixed sample size, we found that biological averaging can be more cost efficient than mathematical averaging. With this motivation, we developed a differential expression classifier, ICRBC, that can detect alternatively expressed genes between biologically averaged samples. In simulation studies, we found that biological averaging and subsequent analysis with our classifier performed comparably to existing methods, such as ASC, edgeR, and DESeq, especially when individuals were pooled evenly and less than 20% of the regulome was expected to be differentially regulated. In two technically distinct mouse datasets and one plant dataset, we found that our method was over 87% concordant with edgeR for the 100 most significant features. We therefore conclude biological averaging may sufficiently control biological variation to a level that differences in gene expression may be detectable. In such situations, ICRBC can enable reliable exploratory analysis at a fraction of the cost, especially when interest lies in the most differentially expressed loci.

An integrative genomic approach illuminates the causes and consequences of genetic background effects

Christopher H. Chandler, Sudarshan Chari, David Tack, Ian Dworkin
(Submitted on 2 Sep 2013)

(abridged) – The phenotypic consequences of mutations are modulated by the wild type genetic background in which they occur, sometimes dramatically so. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist, nor do we understand the mechanisms underlying it. We also lack knowledge on how mutations interact with the genetic background to influence gene expression patterns, and how gene expression may in turn mediate mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets, from a mapping-by-introgression experiment, as well as a tagged RNA gene expression dataset. In addition, we used whole genome re-sequencing of the parental lines (two commonly used laboratory strains) to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutational screen. By searching for genes showing a congruent signal in multiple datasets, we identified candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers are caused by higher-order epistasis, not quantitative non-complementation of alleles. Our results also suggest that cis-regulatory variation contributes little to the background dependence of this mutant phenotype. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well.

Human Genome Variation and the concept of Genotype Networks

Giovanni Marco Dall’Olio (1), Jaume Bertranpetit (1), Andreas Wagner (2, 3, 4), Hafid Laayouni (1) ((1) Institut de Biologia Evolutiva, CSIC-Universitat Pompeu Fabra, Barcelona, Spain. (2) Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland. (3) The Swiss Institute of Bioinformatics, Lausanne, Switzerland. (4) The Santa Fe Institute, Santa Fe, USA.)
(Submitted on 3 Sep 2013)

In 1970, John Maynard Smith introduced the concept of “Protein Space”, a representation of all the possible protein sequences, as a framework to describe how evolutionary processes take place. Since then, the concepts of protein space and of networks of sequences have been applied to a variety of systems, from protein modeling to RNA evolution, and to metabolic systems. Here, we adapted these concepts to the analysis of human DNA sequence data. We focused on the variation that can be represented from Single Nucleotide Variants (SNV) data, and we used the 1000 Genomes dataset to determine how human populations have explored this genotype space.
Our results include a genome-wide survey of how the genotype networks of human populations vary along the genome, and a framework to calculate the properties of these networks from sequencing data. Moreover, we found that, in coding regions, these networks tend to be both more “extended” in the space, and also more connected, than in non-coding regions. The application of the concept of genotype networks can provide a new opportunity to understand the evolutionary processes that shaped our genome. If we learn how human populations have explored the genotype space, we can achieve a better understanding of how selective pressures such as pathogens and diseases have shaped the evolution of a region of the genome, and how different regions have evolved. Combined with the availability of larger datasets of sequencing data, genotype networks represent a new approach to the study of human genetic diversity.