Our paper: Epistasis not needed to explain low dN/dS

This guest post is by Joshua Plotkin on his group’s paper McCandlish et al. Epistasis not needed to explain low dN/dS arXived here.

Our lab has recently begun to post research pre-prints on arXiv. All members of the group enthusiastically support this trend, both within our own group and within the broader scientific community. The merits of sharing pre-prints have been described elsewhere. The benefits of pre-prints are so immediately apparent, I feel, that there is no need to add further verses to the praises that have already been sung.

Recently, however, my research group and I faced an unusual and difficult question: whether we should post a pre-print that does not describe primary research, but rather is a critique of a recent paper published by another group – a paper on the role of epistasis in molecular evolution from the group led by Fyodor Kondrashov. My group and I have never before written such a commentary; and so I faced this choice with some uncertainty. Here are some thoughts on our group’s decision to write the commentary and to post it to arXiv.

Kondrashov’s group is at the vanguard of contemporary research in molecular evolution. In this particular paper from his group, Breen et al. contend that epistasis is “pervasive throughout protein evolution”; a view that I mostly support and indeed have expressed, in a more limited scope, in several publications and commentaries (e.g. here, here, and here). However, in discussing the paper by Breen et al. over lunch, our research group came to the consensus that their argument is logically flawed. Breen et al. reached their conclusion because the dN/dS values observed in some genes are much lower than their expectation in the absence of epistasis. But when calculating the expected dN/dS ratio in the absence of epistasis, Breen et al. assumed that all amino acids observed in a protein alignment at any particular position have equal fitness. This assumption is unrealistic because, simply, some amino acids may be more fit than others. When we relaxed this unrealistic assumption, we found that the observed dN/dS values and the observed patterns of amino acid diversity at each site are perfectly consistent with a non-epistatic model of protein evolution, for all the nuclear and chloroplast genes in the Breen et al. dataset (but, interestingly, not for their mitochondrial genes).
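To make the logic of this argument concrete, here is a minimal toy sketch. It is not the analysis from our paper or from Breen et al. It assumes a single site, weak mutation, uniform mutation among the 20 amino acids (ignoring the genetic code), lethal unobserved amino acids, and a fixation rate relative to neutral of S / (1 - exp(-S)) for a scaled selection coefficient S. The function names and the fitness values are hypothetical, chosen only for illustration. Under these assumptions the site spends time in each amino acid in proportion to exp(fitness), and no epistasis is involved.

```python
import math

N_AMINO_ACIDS = 20  # toy assumption: a mutation can lead to any of the 19 other amino acids


def relative_fixation_rate(s):
    """Fixation rate relative to a neutral mutation, for scaled selection coefficient s."""
    if abs(s) < 1e-9:
        return 1.0  # neutral limit of s / (1 - exp(-s))
    return s / -math.expm1(-s)


def expected_dnds(fitnesses):
    """Stationary dN/dS at one site whose observed amino acids have the given
    scaled fitnesses; every amino acid not listed is treated as lethal."""
    weights = [math.exp(f) for f in fitnesses]
    total = sum(weights)
    rate = 0.0
    for i, f_i in enumerate(fitnesses):
        stationary_freq = weights[i] / total
        # Sum the relative fixation rates of mutations to the other viable amino acids.
        outflow = sum(
            relative_fixation_rate(f_j - f_i)
            for j, f_j in enumerate(fitnesses)
            if j != i
        )
        rate += stationary_freq * outflow
    # Normalise by the 19 possible amino acid changes, each neutral under dS.
    return rate / (N_AMINO_ACIDS - 1)


# Five observed amino acids, all equally fit (the equal-fitness assumption).
equal = expected_dnds([0.0, 0.0, 0.0, 0.0, 0.0])

# The same five amino acids, but one is fitter than the rest (hypothetical values).
unequal = expected_dnds([0.0, -3.0, -3.0, -3.0, -3.0])

print(f"equal fitness:   dN/dS = {equal:.3f}")
print(f"unequal fitness: dN/dS = {unequal:.3f}")
```

In this toy setting the equal-fitness case gives dN/dS = 4/19, while the unequal-fitness case gives a much lower value even though the same five amino acids would all eventually be observed at the site. That is the sense in which a low dN/dS does not, by itself, require epistasis.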

In an ideal world, scientific disagreements would be resolved by straightforward transactions based solely on logic and data. But in reality, such disagreements inevitably involve intellectual biases, not to mention personalities, politics, reputations, et cetera. In fact, we (my research group and I) are colleagues and admirers of Kondrashov and his comrades (these two papers of his are among our favorites). Why risk our collegiality by publishing a critique on arXiv?

The answer is two-fold. First, we are passionate about understanding molecular evolution, both as individuals and within the context of a scientific community – and we believe this exchange will advance that understanding. Second, we have had extensive email correspondence with Fedya about the scientific issues at hand. This correspondence has been completely open and straightforward: we have shared our computer code so that Fedya can reproduce our analyses, and Fedya has agreed with our critique, in principle, although he has some reservations and may appreciate subtleties of his data that we do not. In any case, I feel that the scientific exchange has been honest, and I hope it will avoid the snark that sometimes accompanies such disagreements and focus instead on the scientific issues at stake.

I wish to thank Graham Coop for inviting me to contribute to Haldane’s Sieve. And thanks of course to my co-authors, including our own fearless leader, David McCandlish.

—Joshua B. Plotkin

N.B.: This blog post is meant as an exchange among scientific colleagues, and not as an advertisement to the media.


Epistasis not needed to explain low dN/dS

In Response to “Epistasis as the primary factor in molecular evolution” by Breen et al. Nature 490, 535-538 (2012)
David M. McCandlish, Etienne Rajon, Premal Shah, Yang Ding, Joshua B. Plotkin
(Submitted on 20 Dec 2012)

An important question in molecular evolution is whether an amino acid that occurs at a given position makes an independent contribution to fitness, or whether its effect depends on the state of other loci in the organism’s genome, a phenomenon known as epistasis. In a recent letter to Nature, Breen et al. (2012) argued that epistasis must be “pervasive throughout protein evolution” because the observed ratio between the per-site rates of non-synonymous and synonymous substitutions (dN/dS) is much lower than would be expected in the absence of epistasis. However, when calculating the expected dN/dS ratio in the absence of epistasis, Breen et al. assumed that all amino acids observed in a protein alignment at any particular position have equal fitness. Here, we relax this unrealistic assumption and show that any dN/dS value can in principle be achieved at a site, without epistasis. Furthermore, for all nuclear and chloroplast genes in the Breen et al. dataset, we show that the observed dN/dS values and the observed patterns of amino acid diversity at each site are jointly consistent with a non-epistatic model of protein evolution.

A statistical framework for joint eQTL analysis in multiple tissues

Timothée Flutre, Xiaoquan Wen, Jonathan Pritchard, Matthew Stephens
(Submitted on 19 Dec 2012)

Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and widely-adopted approach to identifying putative regulatory variants and linking them to specific genes. Up to now eQTL studies have been conducted in a relatively narrow range of tissues or cell types. However, understanding the biology of organismal phenotypes will involve understanding regulation in multiple tissues, and ongoing studies are collecting eQTL data in dozens of cell types. Here we present a statistical framework for powerfully detecting eQTLs in multiple tissues or cell types (or, more generally, multiple subgroups). The framework explicitly models the potential for each eQTL to be active in some tissues and inactive in others. By modeling the sharing of active eQTLs among tissues this framework increases power to detect eQTLs that are present in more than one tissue compared with “tissue-by-tissue” analyses that examine each tissue separately. Conversely, by modeling the inactivity of eQTLs in some tissues, the framework allows the proportion of eQTLs shared across different tissues to be formally estimated as parameters of a model, addressing the difficulties of accounting for incomplete power when comparing overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our framework to re-analyze data from transformed B cells, T cells and fibroblasts we find that it substantially increases power compared with tissue-by-tissue analysis, identifying 63% more genes with eQTLs (at FDR=0.05). Further the results suggest that, in contrast to previous analyses of the same data, the majority of eQTLs detectable in these data are shared among all three tissues.

easyGWAS: An integrated interspecies platform for performing genome-wide association studies

Dominik Grimm, Bastian Greshake, Stefan Kleeberger, Christoph Lippert, Oliver Stegle, Bernhard Schölkopf, Detlef Weigel, Karsten Borgwardt
(Submitted on 19 Dec 2012)

Motivation: The rapid growth in genome-wide association studies (GWAS) in plants and animals has brought about the need for a central resource that facilitates i) performing GWAS, ii) accessing data and results of other GWAS, and iii) enabling all users regardless of their background to exploit the latest statistical techniques without having to manage complex software and computing resources.
Results: We present easyGWAS, a web platform that provides methods, tools and dynamic visualizations to perform and analyze GWAS. In addition, easyGWAS makes it simple to reproduce results of others, validate findings, and access larger sample sizes through merging of public datasets.
Availability: Detailed method and data descriptions as well as tutorials are available in the supplementary materials. easyGWAS is available at this http URL
Contact: dominik.grimm@tuebingen.mpg.de

Selection biases the prevalence and type of epistasis along adaptive trajectories

Jeremy A. Draghi, Joshua B. Plotkin
(Submitted on 17 Dec 2012)

The contribution to an organism’s phenotype from one genetic locus may depend upon the status of other loci. Such epistatic interactions among loci are now recognized as fundamental to shaping the process of adaptation in evolving populations. Although little is known about the structure of epistasis in most organisms, recent experiments with bacterial populations have concluded that antagonistic interactions abound and tend to decelerate the pace of adaptation over time. Here, we use a broad class of mathematical fitness landscapes to examine how natural selection biases the mutations that substitute during evolution based on their epistatic interactions. We find that, even when beneficial mutations are rare, these biases are strong and change substantially throughout the course of adaptation. In particular, epistasis is less prevalent than the neutral expectation early in adaptation and much more prevalent later, with a concomitant shift from predominantly antagonistic interactions early in adaptation to synergistic and sign epistasis later in adaptation. We observe the same patterns when re-analyzing data from a recent microbial evolution experiment. Since these biases depend on the population size and other parameters, they must be quantified before we can hope to use experimental data to infer an organism’s underlying fitness landscape or to understand the role of epistasis in shaping its adaptation. In particular, we show that when the order of substitutions is not known to an experimentalist, then standard methods of analysis may suggest that epistasis retards adaptation when in fact it accelerates it.

Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

Katarzyna Bryc, Nick Patterson, David Reich
(Submitted on 17 Dec 2012)

High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual’s genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual is limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual’s genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step calling genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first by its performance on simulated sequence data, and secondly on real sequence data where we obtain estimates using low coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse world-wide populations, and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show filters can correct for the confounding by sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates, and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher coverage data.

The GenoChip: A New Tool for Genetic Anthropology

Eran Elhaik, Elliott Greenspan, Sean Staats, Thomas Krahn, Chris Tyler-Smith, Yali Xue, Sergio Tofanelli, Paolo Francalacci, Francesco Cucca, Luca Pagani, Li Jin, Hui Li, Theodore G. Schurr, Bennett Greenspan, R. Spencer Wells, the Genographic Consortium
(Submitted on 17 Dec 2012)

The Genographic Project is an international effort using genetic data to chart human migratory history. The project is non-profit and non-medical, and through its Legacy Fund supports locally led efforts to preserve indigenous and traditional cultures. In its second phase, the project is focusing on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide SNP genotyping, they were designed for medical genetic studies and contain medically related markers that are not appropriate for global population genetic studies. GenoChip, the Genographic Project’s new genotyping array, was designed to resolve these issues and enable higher-resolution research into outstanding questions in genetic anthropology. We developed novel methods to identify AIMs and genomic regions that may be enriched with alleles shared with ancestral hominins. Overall, we collected and ascertained AIMs from over 450 populations. Containing an unprecedented number of Y-chromosomal and mtDNA SNPs and over 130,000 SNPs from the autosomes and X-chromosome, the chip was carefully vetted to avoid inclusion of medically relevant markers. The GenoChip results were successfully validated. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays for three continental populations. While all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The GenoChip is a dedicated genotyping platform for genetic anthropology and promises to be the most powerful tool available for assessing population structure and migration history.

Comment on “Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”

Nicolas Bray, Lior Pachter
(Submitted on 13 Dec 2012)

Ward and Kellis (Reports, September 5 2012) identify regulatory regions in the human genome exhibiting lineage-specific constraint and estimate the extent of purifying selection. There is no statistical rationale for the examples they highlight, and their estimates of the fraction of the genome under constraint are biased by arbitrary designations of completely constrained regions.

Assembling large, complex environmental metagenomes

Adina Chuang Howe, Janet Jansson, Stephanie A. Malfatti, Susannah G. Tringe, James M. Tiedje, C. Titus Brown
(Submitted on 12 Dec 2012)

The large volumes of sequencing data required to deeply sample complex environments pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two data reduction approaches, digital normalization and partitioning, to this challenge. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license.

Compensatory evolution and the origins of innovations.

Compensatory evolution and the origins of innovations. (arXiv:1212.2658v1 [q-bio.PE])
by Etienne Rajon, Joanna Masel

Cryptic genetic sequences have attenuated effects on phenotypes. In the classic view, relaxed selection allows cryptic genetic diversity to build up across individuals in a population, providing alleles that may later contribute to adaptation when co-opted – e.g. following a mutation increasing expression from a low, attenuated baseline. This view is described, for example, by the metaphor of the spread of a population across a neutral network in genotype space. As an alternative view, consider the fact that most phenotypic traits are affected by multiple sequences, including cryptic ones. Even in a strictly clonal population, the co-option of cryptic sequences at different loci may have different phenotypic effects and offer the population multiple adaptive possibilities. Here, we model the evolution of quantitative phenotypic characters encoded by cryptic sequences, and compare the relative contributions of genetic diversity and of variation across sites to the phenotypic potential of a population. We show that most of the phenotypic variation accessible through co-option would exist even in populations with no polymorphism. This is made possible by a history of compensatory evolution, whereby the phenotypic effect of a cryptic mutation at one site was balanced by mutations elsewhere in the genome, leading to a diversity of cryptic effect sizes across sites rather than across individuals. Cryptic sequences might accelerate adaptation and facilitate large phenotypic changes even in the absence of genetic diversity, as traditionally defined in terms of alternative alleles.