Author post: Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance

This guest post is by Pleuni Pennings about her paper (with co-authors) Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance, bioRxived here. This is a cross-post from Pleuni’s blog.

Almost three years ago, in early 2012, I attended a talk by Martin Nowak. He talked about cancer and one of the things he said was that treatment with multiple drugs at the same time is a good idea because it helps prevent the evolution of drug resistance. Specifically, he explained, when treatment is with multiple drugs, the pathogen (tumor cells in the case of cancer) needs to acquire multiple resistance mutations at the same time in order to escape drug pressure.

As I listened to Martin Nowak’s talk, I was thinking of HIV, not cancer. At that time, I had already spent about two years working on drug resistance in HIV. Treatment of HIV is always with multiple drugs, for the same reason that Martin Nowak highlighted in his talk: it helps prevent the evolution of drug resistance.

However, as I read the HIV drug resistance literature and analyzed sequence data from HIV patients, I found evidence that drug resistance mutations in HIV tend to accumulate one at a time. This is contrary to the generally accepted idea that the pathogen must acquire resistance mutations simultaneously.

There seemed to be a clear mismatch between data and theory. Data show mutations are acquired one at a time, and theory says mutations must be acquired simultaneously. One of the two must be wrong, and it can’t be the data![1]


After Martin Nowak’s talk, I went up to him and told him how I thought data didn’t fit the theory. Martin’s response: “Oh, that is interesting!” (Imagine this being said with an Austrian accent). “Let’s meet and talk about it.”

So, we met. Logically, Alison Hill and Daniel Rosenbloom, then grad students in Martin’s group, were there too. I had already met with Alison and Daniel many times, since they were also working on drug resistance in HIV. John Wakeley (my advisor at Harvard) came to the meeting too.

Between the five of us, we brainstormed and fairly quickly realized that one solution to the conundrum was to assume that a body’s patient consisted of different compartments and that each drug may not penetrate into each compartment. Maybe we found this solution quickly because Alison and Daniel had already been thinking of the issue of drug penetration in the context of another project. A body compartment that has only one drug instead of two or three would allow a pathogen that has acquired one drug resistance mutation to replicate. If a pathogen with just one mutation has a place to replicate, this makes it possible for the pathogen to acquire resistance mutations one at a time.

We decided to start a collaboration to analyze a formal model to see whether our intuition was correct. Over the following three years, there were some personnel changes and several moves, graduations and new jobs. Stefany Moreno joined the project as a student from the European MEME Master’s program when she spent a semester in Martin’s group. When I moved to Stanford, Dmitri Petrov became involved in the project. Next, Alison and Daniel each got their PhD and started postdocs (Alison at Harvard, Daniel at Columbia), Stefany got her MSc and started a PhD in Groningen, I had a baby and became an assistant professor at SFSU. No one would have been surprised if the project would never have been finished! But we stuck with it and after many hours of work, especially by the first authors Alison and Stefany, and uncountable Google Hangout meetings, we can now confidently say that our initial intuition from that meeting in 2012 was correct. Compartments with imperfect drug penetration indeed allow pathogens to acquire drug resistance one mutation at a time. And, importantly, the evolution of multi-drug resistance can happen fast if mutations can be acquired one at a time, much faster than when simultaneous mutations are needed.

Our manuscript can be found on the BioRxiv (link). It is entitled “Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance.” We hope you find it useful!

[1]Of course, it could be my interpretation of the data!

Author post: Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study

This guest post is by Xiaoquan (William) Wen, Francesca Luca, and Roger Pique-Regi on their preprint Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study, bioRxived here.

Our paper presents an integrative analysis framework to perform fine mapping and functional analysis of cis-eQTLs. In particular, we consider a setting where eQTL data are collected from multiple population groups. Although the details of our methods and analysis results are described in the manuscript, here we’d like to take the opportunity to discuss some of our main features and interesting findings of this work.

From the methodological perspective, the Bayesian inference framework that we present in this paper enables efficient multiple SNP analysis in the presence of multiple heterogeneous (population) groups. This framework is a natural extension from our previous works in dealing with heterogeneous genetic association data in single SNP analysis (Flutre et al 2013, Wen and Stephens 2014). The output from our multiple cis-eQTL analysis fully characterizes the uncertainty of eQTL calls, which becomes critical in the downstream functional analysis. This represents a significant advantage over the commonly applied conditional analysis approach, which is non-trivial to generalize when there are multiple heterogeneous subgroups. Taking advantage of these features, we further extend our analysis framework to incorporate functional genomic annotations and assess their levels of enrichment in association signals. Although in this paper we solely focus on eQTLs, it should be noted that our statistical methods are completely general, and applicable in other contexts of genetic association analysis.

Applying this analysis framework, we re-analyzed the eQTL data from the GUEVADIS project that consists of samples from five population groups. Importantly, a key motivation is to identify eQTL signals that are consistently presented in all population groups. This analysis yields some interesting findings, which we will highlight below:

  1. Cross-population meta-analysis greatly improves the power of eQTL discovery. The power gain by integrating data across population groups is well expected: with a combined sample size ~400 in five population groups, we are able to identify 6,555 genes that harbor at least one cis-eQTL (which we refer to as “eGenes”) from 11,838 tested protein coding and lincRNA genes at 5% FDR level; in comparison, the union set from the population-by-population analysis yields 3,447 eGenes.
  2. Cross-population samples provide unique resources to fine map cis-eQTLs. We perform multiple SNP analysis for each identified eGene, and find that for a non-trivial proportion of genes (7% of all genes analyzed, or 14% of identified eGenes), two or more independent cis-eQTL signals can be confidently identified in the GEUVADIS data. In most of those cases, we are relatively certain about the existence of multiple eQTL signals, but cannot pinpoint the causal variants by fully resolving the LD. Nevertheless, we find that utilizing cross-population samples, the population heterogeneity in local patterns of LD can be effectively leveraged to narrow down the genomic regions that harbor causal eQTLs, a phenomenon that we refer to as “LD filtering”. Using the GEUVADIS data, we are able to quantify the effect of LD filtering. More specifically, we select a set of genes that are identified harboring exactly one cis-eQTL with high confidence, and construct credible regions for each eQTL signal based on both the population-by-population and the cross-population analyses. We find that for the majority of the genes tested (92% of 526 selected genes), the joint analysis yields a smaller credible region comparing to the minimum credible region length from separate population analyses. The median reduction in region length from the separate analysis to joint analysis is close to 50% in the set of genes examined.

    On the other hand, there are cases that population specific LD patterns can cause some SNPs to display large degree of heterogeneity across populations in their estimated effect sizes from single SNP analysis. In some extreme cases, a SNP may appear to possess strong “population specific” effects. As we acknowledge that genuine population specific eQTLs are certainly interesting phenomena and very much likely exist, we suggest interpreting highly heterogeneous eQTL signals from single SNP analysis with caution. In the paper, we demonstrate one such example where a set of SNPs in LD, when analyzed alone, appear to show strong but opposite effects on expression levels in European and African populations. The multiple SNP analysis yields a seemingly much more plausible alternative explanation: it identifies two independent eQTL signals in the region, and the “opposite effect” eQTLs tag one signal in the African population and the other signal in the European populations. This example, we believe, fully demonstrates the necessity and benefit of multiple SNP analysis using cross-population samples.

  3. Genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs. This point is demonstrated by our functional analysis approach based on the fine mapping results of cis-eQTLs. In brief, we classify every cis-SNP into three mutually exclusive categories based on the computational predictions of CENTIPEDE model: 1) SNPs strongly affecting TF binding 2) SNPs residing in a DNAse-I footprint region but with little or no effects on TF binding 3) all other SNPs, or baseline SNPs. We find that the first category of SNPs are 1.49 fold more likely than baseline SNPs to be eQTLs, and its enrichment level is statistically highly significant (p-value = 4.93 x 10-22). The SNPs in category 2 is also enriched but with much less impressive fold change (1.15) and statistical significance (p-value = 0.0035). Very interestingly, this finding seems in agreement with the results reported in our recent work Moyerbrailean et al 2014) where other cellular and organismal phenotype QTLs are examined.

Overall, the ability of our method to disentangle multiple eQTL signals represents a significant step forward towards fully comprehending the complex mechanisms regulating gene expression. Using the natural interventions represented by genetic polymorphisms can be used in future studies to identify multiple functional regulatory elements for a gene. The computational methods used in this paper are implemented in the software packages FM-eQTL and eQTLBMA. Our analysis results are also available for browsing and downloading at this site.


1. T Flutre, X Wen, J Pritchard, M Stephens (2013). A statistical framework for joint eQTL analysis in multiple tissues. PLoS genetics 9 (5), e1003486

2. X Wen, M Stephens (2014). Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene–environment interactions. The Annals of Applied Statistics 8 (1), 176-203

3. T Lappalainen et al (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511

4. Moyberbrailean et al (2014) Are all genetic variants in DNase I sensitivity regions functional? bioRxiv, 007559

Author post: Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage

This guest post is by Claude Becker, Jörg Hagmann and Detlef Weigel on their preprint Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage, bioRxived here.

This paper is the result of a collaboration between experts in machine learning and statistical analysis (from the group of Karsten Borgwardt at the Max Planck Institute of Intelligent Systems), a lab that has spearheaded the assembly and SNP genotyping of a world-wide collection of Arabidopsis thaliana specimen (Joy Bergelson’s lab at the University of Chicago), a group specialized in large-scale phenotyping (the lab of Thomas Altmann at the Leibniz Institute of Plant Genetics and Crop Plant Research in Gatersleben) and our epigenomics group at the Max Planck Institute for Developmental Biology in Tübingen.

The epigenome of an organism, in a restricted definition, consists of the entirety of post-translational histone modifications (e.g. methylation, acetylation, etc.) and chemical modifications to the DNA, such as methylation of cytosines. Epigenetic marks can influence the transcriptional activity of genes and transposable elements by locally modulating the accessibility of the DNA. The local configuration of the epigenome can change (i) spontaneously, (ii) in dependence of genetic rearrangements, or (iii) as a consequence of external signals. That the epigenome reacts to external signals such as stress and nutrient supply and that it can influence physiological processes – even behavior – has caused much recent excitement. Academic and popular scientific articles have raised the question whether the epigenome has the potential to maintain environmental footprints across generations. The epigenome is thus presented as an entity that fuels acclimation to rapidly changing environmental conditions and that enables adaptation in subsequent generations. Studies investigating the epigenetic basis of the inheritance of acquired traits, however, often either lack the depth of analysis necessary for the identification of locus-specific epigenetic changes or investigate inheritance over a rather short time period of only one or two generations. Moreover, many study designs do not allow for easy distinction between genetic variation causing the observed epigenetic change and epigenetic differences independent of DNA sequence variation.

In our new study we aim to tackle the question to what extent long exposure to varying and diverse environmental conditions can change the heritable DNA methylation landscape. We overcome several of the above-mentioned problems and limitations by studying variation of DNA methylation in a quasi-isogenic lineage of the model plant Arabidopsis thaliana. North America (NA) was only recently colonized by A. thaliana, and approximately half of the current population is made of a single lineage that underwent a recent population bottleneck, having diverged from a common ancestor more a century or two ago, resulting in minimal genetic diversity in the current population [1].

We sequenced the genome and DNA methylome of thirteen closely related NA accessions originating from different geographical locations in order to determine the spectrum, frequency and effect of epigenetic variants. We then compared the epigenetic variation in the NA lineage to that of a previously analyzed set of isogenic A. thaliana lines that had been propagated for 30 generations in the greenhouse [2,3].

Pairwise comparison of the NA accessions revealed that only 3% of the genome-wide methylation showed variable methylation. By using the genetic mutations as a molecular clock, we found that – contrary to our expectation – epimutations did not accumulate at a higher rate under varying natural conditions compared to growth in a stable greenhouse environment. Even more surprisingly, changes in DNA methylation of single cytosines and of larger contiguous regions were often seen in both NA and greenhouse-grown accessions. In both datasets, accumulation of epimutations over time was non-linear, likely reflecting frequent reversions of methylation changes back to the initial configuration. Population structure inferred from methylation data reflected the genetic relatedness of the accessions and showed no signal of a genome-wide environmental footprint. This, together with the fact that most epigenetic variants were neutral and did not correlate with changes in gene expression, indicated that epigenetic variants accumulate to a large extent as a function of time and genetic diversification rather than as a consequence of local adaptation to environmental changes.

In summary, we have shown that long-term methylome variation of plants grown in varying and diverse natural sites is largely stable at the whole-genome level and in several aspects is intriguingly similar to that of lines raised in uniform conditions. This does not rule out a limited number of subtle adaptive DNA methylation changes that are linked to specific growth conditions, but it is in stark contrast to the published claims of broad, genome-wide epigenetic variation reflecting local adaptation. Heritable polymorphisms that arise in response to specific growth conditions certainly appear to be much less frequent than those that arise spontaneously or due to genetic variation.

In addition to the biological findings discussed above, an important part of our paper is an improved method for the detection of differentially methylated regions. Past studies have relied on clustering of differentially methylated positions or on fixed sliding windows, with the caveat of high rates of false negatives and false positives, respectively. We have adapted a Hidden Markov Model, initially developed for animal methylation data, to the more complex DNA methylation patterns in plants. Upon identification of methylated regions in each strain, these are then tested for differential methylation between strains. Our method results in increased specificity and higher accuracy and we believe it will be of broad interest to the epigenomics community.


1. Platt A, Horton M, Huang YS, Li Y, Anastasio AE, et al. (2010) The scale of population structure in Arabidopsis thaliana. PLoS Genet 6: e1000843.

2. Becker C, Hagmann J, Müller J, Koenig D, Stegle O, et al. (2011) Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature 480: 245-249.

3. Schmitz RJ, Schultz MD, Lewsey MG, O’Malley RC, Urich MA, et al. (2011) Transgenerational epigenetic instability is a source of novel methylation variants. Science 334: 369-373.

Author post: Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids

This guest post is by Russ Corbett-Detig, Emily Jacobs-Palmer, and Hopi Hoekstra (@hopihoekstra) on their paper Corbett-Detig et al Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids bioRxived here.

What are segregation distorters and how can they contribute to reproductive isolation?

Within an individual, somatic cells are typically genetic clones of one another; in contrast, haploid gametes are related to their compatriots at only half of all loci on average, opening doors to intra-individual competition and conflict. Eggs and sperm may express selfish genetic elements called segregation distorters (SDs) that disable or destroy competitor gametes carrying unrelated alleles. The resulting transmission advantage attained by SDs allows them to invade populations without improving the fitness of individuals that harbor them. Indeed, SDs often negatively impact carriers’ fitness because such hosts transmit fewer fit (or viable) gametes. Hence natural selection favors the evolution of alleles that suppress distortion and thereby restore fertility.

Coevolution of SDs and their suppressors can in turn contribute to the evolution of reproductive isolation between diverging lineages. How? If two populations become temporarily isolated from one another, SDs and later their accompanying suppressors may arise and eventually fix in one isolated population, possibly multiple times over. Should the two populations then encounter each other again, the sperm of hybrid males, for example, will contain one or more distorters without the appropriate suppressors, and these males will suffer decreased fertility. Over time, gene flow may be substantially and perhaps permanently hindered leading to the formation of two reproductively isolated species.

In some Drosophila species pairs, and in many crop plants, it is clear that the coevolution of SDs and their suppressors are major, even primary, contributors to the evolution of reproductive isolation between diverging lineages. At present, however, the relative importance of SDs-suppressor systems to reproductive isolation in broader taxonomic swathes of sexually reproducing organisms (e.g. mammals) is largely unexplored.

Our solution to the practical challenges of studying SDs


The primary impediment to addressing this important question in evolutionary biology is practical, not conceptual. Conventionally, researchers detect SD-suppressor systems by crossing two strains to produce a large second-generation hybrid population; they then genotype these hybrids at a set of markers across the genome to identify loci that show substantive deviations from 50:50 mendelian ratios—putative SDs. Ultimately, this traditional approach suffers from two major pitfalls. First, for many organisms it is not feasible to raise and genotype enough hybrids (hundreds to thousands) to have sufficient statistical power to detect SDs, especially those with weaker effects. Second, by genotyping these second generation hybrids, rather than the gametes of their parents, one conflates SD with hybrid inviability, and it can be very difficult to disentangle these two factors.

How to circumvent these challenges? In this work, we develop an alternative approach that avoids these practical challenges. We first obtain high quality, motile sperm from first generation hybrid males (generated from two strains with available genome sequences), and then sequence these sperm in bulk as well as a somatic ‘control’ tissue. We then contrast the relative representation of the parental chromosomes in windows across the genome in both samples, searching for regions where the sperm allele ratios show more DNA copies of one parental haplotype, but the somatic alleles do not. Importantly, this approach is very general, and it can easily be applied to any number of interspecific or intraspecific crosses where it is possible to obtain large quantities of viable gametes.

Little evidence for SDs in house mouse hybrids

We apply this method to a nascent pair of Mus musculus subspecies,M. m. castaneus and M. m. domesticus. We chose these subspecies because hybrid males formed in this cross are known to be partially reproductively dysfunctional. Nonetheless, using our novel method we find no evidence supporting the presence of SDs—no genomics regions showing a statistical deviation from 50:50 compared to control tissue—despite strong statistical power to detect them. We conclude that SDs do not contribute appreciably to the evolution of reproductive isolation in this nascent species pair. Instead, reproductive isolation in these mammalian subspecies likely stems from other incompatibilities in spermatogenesis or ejaculate production unrelated to SD-suppressor coevolution.

So what’s next? Because this approach—bulk sequencing of sperm from hybrid males—can be used on almost any pair of interfertile taxa, we can begin to better understand the prevalence of SD and its role in speciation in a wide diversity of species.

Author post: Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics

Thus guest post is by Irene Gallego Romero (@ee_reh_neh) on her paper Gallego Romero et al “Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics” bioRxived here.

Genetic divergence in protein coding regions between humans and chimpanzees cannot explain phenotypic differences between the two species, or, more broadly, between other closely related groups. Although we have known this since the early days of genetic sequencing, it has been very hard to formally test the hypothesis that follows logically – that it may be changes in gene expression and regulation that underlie the divergence in phenotypes. This is especially true in the great apes, where there are plenty of ethical and practical impediments to experimentation. For instance, our ability to carry out functional studies and really decode cellular mechanisms is restricted to tissues that can be sampled non-invasively. To date, this has mostly meant fibroblasts and immortalised lymphoblastoid cell lines. The rest of comparative work in primates tends to be done in tissue samples collected post-mortem, where experimental manipulation is not a possibility.

Together, these limitations provided the impetus for us to develop a panel of high-quality induced pluripotent stem cell (iPSC) lines from chimpanzees. The promise of this panel lies, of course, not just in insights into the pluripotent state in chimpanzees (although that is certainly a worthy subject) but in how it opens the door to a tantalizing number of previously inaccessible questions, when we combine it with any of the many protocols available for differentiating iPSCs into particular somatic cell types that have remained out of reach until now.

The amount of work that went into developing an effective reprogramming protocol is not readily apparent in our preprint, but it was exhaustive – and exhausting! We began by using retroviral vectors to deliver the four factors that are commonly used to reprogram somatic cells to pluripotency, but soon encountered two fairly sizable problems with that approach. First, these viral vectors are integrated into the host genome during the course of reprogramming, and one never knows what they’re going to disrupt. This is an issue that everyone using retro- or lentiviral vectors has to contend with, and indeed, when we began working on the project three and a half years ago they were the most reliable and established reprogramming method around, so we were prepared to take our chances and scan the resulting lines to determine insertion sites. Regardless, the thought of random insertions of pluripotency genes set us somewhat on edge!

However, for reasons that we never fully understood, those chimpanzee lines had a lot of trouble silencing the retroviral vectors and maintaining pluripotency solely through endogenous mechanisms, as we show in one of our supplemental figures. At the time, we were making human iPSC lines in tandem using exactly the same vector stocks. While the human lines would lose most exogenous vector expression after 12 to 15 passages, in chimpanzee iPSCs of the same age we would generally find that expression of at least one, if not more, exongenous genes was as high as it had been on day one. This did not bode well for the lines, or for our ability to do interesting things with them! So we scrapped the integrating approach, and began optimizing protocols all over again. Fortunately for us, Shinya Yamanaka’s group had just published a very thorough protocol on reprogramming cells using non-integrating episomal vectors, which ended up laying the foundations of the one we present in our preprint.

The lines we have generated with it are of fantastic quality, and they have passed every test we have thrown at them with flying colours. Pluripotency is being endogenously maintained, they’re karyotypically normal, and they differentiate into all three germ layers both spontaneously as embryoid bodies and teratomas when injected into mice, and when we use directed protocols to push them towards a particular fate.

We were very interested in quantifying how human and chimpanzee iPSC lines differ from each other. To this end, we collected RNA-sequencing and methylation data from the chimpanzee iPSCs and the fibroblast lines they were generated from, as well as from seven human iPSC lines from various ethnic and cellular origins and their precursors, and compared them to one another. We find large numbers of inter-species differences both before and after reprogramming, but crucially, most of them are not the same differences. Of all the genes with strong evidence for differential expression between species at the iPSC stage, only 38% are also differentially expressed before reprogramming, and the situation is quite similar with regards to methylation.

Another thing we have found very striking in the data is the very clear increase in homogeneity within (and possibly between, although our design makes that harder to effectively quantify) species at the iPSC level relative to the precursor cells, both in gene expression levels and in DNA methylation. This finding will be very interesting to keep in mind as we go forward and differentiate the iPSCs into a suite of somatic cell types and see how these measures fluctuate through differentiation.

Ultimately, however, where the biggest significance of this work lies for us is in the fact that the lines are not just for our own use. They’re available to other researchers, and this is something we have had in mind from the earliest stages of the work. There is no possible way for our lab to even begin to tackle all the questions that these lines can be used to answer. So if you want to work with our chimpanzee iPSC lines, get in touch.

Author post: Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

This guest post is by Adam Siepel on his group’s paper Gulko et al. Probabilities of Fitness Consequences for Point Mutations Across the Human Genome, bioRxived here.

Four Genomicists in a Subaru

The idea for this paper emerged during a long drive across New York State, from Cold Spring Harbor to Ithaca, after the 2011 Biology of Genomes meeting. Two postdocs in my research group, Ilan Gronau and Leo Arbiza, were riding with me in my old Subaru, trying not to express too much alarm at my distracted driving. Also with us in the car was Ran Blekhman, who was at the time a postdoc with Andy Clark. (Ran is now an assistant professor at the University of Minnesota.)

Our conversation turned to important open questions in computational genomics, and in particular, to ways of making better use of the vast quantities of functional genomic data being pumped out of projects such as ENCODE. At the time, Ilan, Leo, and I were thinking a lot about how to use patterns of within-species polymorphism and between-species divergence to shed light on the influence of natural selection on regulatory sequences. Along these lines, we had spent much of the spring developing our new INSIGHT method [1,2], and we had just presented this work for the first time at Biology of Genomes. Ran, however, was pushing us to think less about abstract evolutionary questions and more about genomic function and disease association. He made a strong case that the biggest obstacle to progress in medically related human genomics was the absence of adequate functional annotations in noncoding regions of the human genome.

For a while the conversation went in circles, as we grasped for ways of measuring “functional potential” across the genome that would make use of genomic data yet be grounded in evolutionary theory. Then it suddenly dawned on us that we already had in hand a key piece of what we needed. The INSIGHT program was designed to estimate, for any collection of nucleotide positions across the genome, the fraction (denoted ρ) of those positions that were directly influenced by natural selection, in the sense that point mutations at those positions tended either to increase or to decrease the fitness of an organism. We realized that an alternative way of interpreting ρ was as a probability that the nucleotide at each position in the analyzed collection influenced fitness, assuming exchangeability of sites (as INSIGHT does).

All that we needed, therefore, was a general way of partitioning nucleotide positions from across the genome into distinct classes that were reasonably homogeneous in their functional roles. We could then estimate ρ for each class using INSIGHT, and assign to each genomic position the estimate of ρ for the partition to which that position belonged. This procedure would produce a “score” across the genome that looked roughly like widely used evolutionary conservation scores, but instead of representing local divergence patterns across the mammalian phylogeny, the score at each position would be estimated from groupwise patterns of polymorphism and divergence and would be directly interpretable as a probability of fitness consequences. Later Ilan would dub these “fitCons” scores, to emphasize this fitness-related interpretation. (“FitCons” also nicely parallels “phastCons,” our first conservation-scoring method.)

Because INSIGHT measures selection on recent time scales, fitCons scores would circumvent a major shortcoming of standard evolutionary conservation scores—that they require functional roles to have remained consistent over very long evolutionary time periods (tens to hundreds of millions of years) in order to be detectable in divergence patterns at individual sites or small loci. In principle, fitCons scores should be able to detect selection (hence potential function) at sites whose functional role had emerged quite recently, perhaps even along the human lineage.

The Problem of Grouping

The piece that was still missing in our plan was a particular scheme for grouping together similar genomic sites from across the genome. We did not get to the point of working this problem out in any detail during our revelatory drive to Ithaca, and, as it happened, it took several more months to settle on a solution. By this time, a Ph.D. student from Computer Science, Brad Gulko, had joined the project and assumed the lead in implementing a prototype of the scores.

At first, Brad, Ilan, Leo, and I spent some time thinking about fancy algorithms for clustering genomic sites that would consider functional and evolutionary information jointly. However, it did not take long to realize that this was a hard problem. Eventually, we decided to move forward with a simple grouping scheme, based on functional genomic data alone. This would allow us to cluster genomic sites in a pre-processing step and avoid the need for an iterative solution. Our hunch was that the scores would not be too sensitive to the grouping scheme as long as it was reasonable. As we discuss in our article, it may be worthwhile to revisit this clustering approach eventually, but it appears to be adequate for our current purposes. (I hope to convince Brad to discuss some of the technical issues with the clustering problem in an upcoming blog post.)

Relevance to the “Share Under Selection” in the Human Genome

By the fall of 2012, we had finally settled on an initial set of fitCons tracks and were beginning to observe decent prediction performance for cis-regulatory elements, when the human genomics community was thrown into a frenzy by a deluge of publications and accompanying press releases from the ENCODE Consortium. This event led to the now-famous controversy over what fraction of the genome is truly “functional” and whether ENCODE’s measures of “reproducible biochemical activity” (which apply to over 80% of the genome) were comparable in any meaningful way to the “share under selection” (SUS) estimated from comparative genomics (which generally came out to 5–10%).

I do not wish to rehash the familiar terms of this debate here, but I do want to focus on one aspect of it that was particularly relevant to our work. Many of the criticisms of ENCODE reminded readers that comparative genomic analyses pointed to a SUS of ~5–10%, suggesting that 80% might be a gross over-estimate of the functional content of the genome. However, others pointed out that these comparative-genomic estimates applied only to the fraction of the genome that had been under long-term selective constraint, because evolutionary turnover of functional elements—if it occurred at appreciable rates—could bias estimates based on long-term genomic divergence substantially downward. (For the latest chapter in this saga, see a recent paper by Gerton Lunter, Chris Ponting, and colleagues [3].)

We realized that the fitCons scores could help address aspects of this controversy, because they were based on patterns of variation over much more recent time scales and should therefore be much less sensitive to turnover than scores based on divergence patterns across the mammalian phylogeny. Moreover, the fitCons scores, by making use of INSIGHT to interpret patterns of polymorphism and divergence, might provide substantially better estimates of the quantities of interest than simple analyses of SNP densities or allele frequencies, a few of which had appeared among the ENCODE publications. Finally, the INSIGHT-based estimates are unique in that they directly predict the SUS, without the need for separate thresholding, mixture deconvolutions, or enrichment analyses.

Somewhat surprisingly, when we estimated the SUS based on fitCons scores, we obtained values (4.2–7.5%) that were quite similar to those based on conservation patterns in mammals. There are a number of tricky technical issues involved in this type of estimation—for example, concerning the corrections for local mutation rates and coalescence times—but violations of our modeling assumptions generally should tend to push our estimated upper bound (7.5%) to conservatively high values, implying that the true value is lower than 7.5%. In addition, the correction we have applied to obtain our lower bound (4.2%) is quite conservative, making it likely that the true value is higher than 4.2%. Therefore we have high confidence that the fraction of the genome under detectable selection from the available polymorphism and divergence data is indeed fairly close to 5%. As we discuss in the paper, it is important to bear in mind that the absolute values of these estimates reflect constraint on the identities of individual nucleotides only, and do not take into consideration higher order constraints, for example, on element lengths or spacing. Nevertheless, the similarity of the estimates based on mammalian divergence and human polymorphism suggest that evolutionary turnover has not produced a major downward bias in conservation-based estimates of the SUS.

Ilan and I soon realized that we could go a step further in this analysis and compare the fitCons-based estimates with parallel estimates based on the same functional categories but a measure of natural selection based on divergence only. The idea here was to perform a direct “apples to apples” comparison of the fraction of the genome under selection as measured on two different time scales: the 1–5 million-year time scale measured by fitCons and the ~30 million-year time scale measured by an analogous method based on divergence patterns in four primate genomes (human, chimpanzee, orangutan, and rhesus macaque), which we called “fitConsD” (the “D” is for “divergence”). I won’t attempt to describe this analysis in detail here, but our general conclusion is that the estimates of selection are highly similar on these two different time scales, suggesting further that evolutionary turnover has not had a dramatic effect on the functional content of the human genome over the past 30 million years or so. It is worth noting that Lunter and colleagues’ recent analysis is not strictly incompatible with ours (they estimate 7.1–9.2% constraint at present and focus on turnover over longer time scales) but their qualitative interpretation suggests large amounts of turnover, while ours suggests modest amounts.

Scooped by CADD… or Perhaps Not

As grant proposals, other manuscripts, and job searches led to delays in writing up our work through 2013, we began to hear rumblings on social media about a method called CADD, developed by Greg Cooper and Jay Shendure’s groups, that sounded alarmingly similar to fitCons. Then, in early 2014, a paper by Kircher, Witten et al. describing CADD appeared in Nature Genetics [4]. When we saw this paper, our initial impression was that we had been scooped by sitting on a good idea for too long. CADD was described as a method that integrated functional and evolutionary data and produced a measure of “relative pathogenicity” across the entire genome, and it was motivated, in part, by its potential usefulness in noncoding as well as coding regions. The paper included several impressive-looking ROC plots in which CADD apparently outperformed conservation based methods such as phastCons, phyloP, and GERP by a significant margin. In addition, CADD made use of a support vector machine (SVM), which was potentially a highly flexible and powerful means for considering large numbers of covariates with arbitrarily complex correlations.

We decided, with a certain amount of dread, that we needed to add CADD to our empirical performance comparisons on putative regulatory elements. At the time, fitCons was showing clear advantages in predictive power for cis regulatory elements compared with conservation-based methods and a functional annotation database called RegulomeDB. FitCons had several potential advantages over CADD—for example, it made direct use of polymorphism data for prediction, it considered covariates in a cell-type-specific manner, and it avoided a need for brute-force simulations through its use of INSIGHT for inference—but we thought that the use of the SVM in CADD made it unlikely that fitCons could compete with it in a pure classification task. Nevertheless, Brad dutifully downloaded the CADD scores, added them to his experiments, and displayed curves for CADD in his ROC plots for three types of regulatory elements (ChIP-seq-supported transcription factor binding sites, eQTLs, and enhancer predictions based on chromatin marks).

To our surprise, fitCons significantly outperformed CADD in all of these tests. This was true for three different types of putative regulatory elements, and true whether or not we considered cell-type-specific test sets. In fact, CADD performed essentially no better than conventional conservation scores in these tests, in apparent contradiction to the results presented in the CADD paper.

A closer reading of the CADD paper revealed a possible explanation for these observations. While the method was motivated, in part, by its applicability to the entire genome, the authors’ validation experiments heavily emphasized coding regions. In fact, it appears that even the ROC plot for “genome-wide” results (Figure 3a in the paper) is actually based almost exclusively (>92% by our interpretation of the paper) on missense variants in coding regions. The experiments that included substantial numbers of noncoding sites, in turn, were much more indirect—for example, by showing correlations with derived allele frequencies (Figure 2), known disease-causing status (Figure 4), and changes in expression in saturation mutagenesis experiments at two enhancers and one promoter. It is possible to have correlations of this kind without having substantial predictive power for regulatory variants.

When Greg Cooper saw our initial preprint on bioRxiv, he raised two major objections to our validation experiments. First, he pointed out that we were measuring the sensitivity of the fitCons scores in terms of bulk coverage of elements, when those elements actually consist of a mixture of sites at which mutations are deleterious and sites at which mutations are neutral or nearly neutral (such as degenerate positions in transcription factor binding sites). This approach to measuring sensitivity may be overly generous to the fitCons scores, which are relatively “blocky” along the genome, varying little from one site to the next, in comparison to higher-resolution prediction methods that properly distinguish between functionally important and neutral sites within elements. Second, Greg pointed out that we were using a naive genome-wide background set for our eQTL, which did not properly account for the ascertainment scheme used for eQTL identification.

We felt that these were fair and reasonable criticisms, and needed to be addressed. Therefore, we revised our validation experiments to consider only high-information-content positions in transcription factor binding sites (a proxy for functionally important nucleotides), and to use a more appropriate control for eQTL. The details of these follow-up experiments are described in our revised preprint (now on bioRxiv), but the bottom line was that they had almost no effect on our ROC plots. In other words, the apparent performance advantages of fitCons over CADD and other divergence-based methods is not an artifact of our experimental design but appears to reflect real advantages of the method. While Greg is correct that the coarse, “blocky” nature of the fitCons scores is a limitation of our current methods, the method still appears to perform significantly better than any competing method in distinguishing putatively functional regulatory nucleotides from background sequence. In other words, while scores that exhibit more variation from one nucleotide to the next—such as CADD, GERP, and phyloP—may appear on the surface to have higher predictive resolution, much of that variation is uninformative about regulatory function, and, on balance, the “blocky” fitCons scores are more useful in prediction.

We have spent some time trying to understand the differences in performance between fitCons and CADD, and believe we have some insights into why fitCons performs significantly better on regulatory elements. (What follows are our conjectures only; the authors of CADD do not agree with our analysis.) While the SVM in CADD is potentially a strength, we believe that it is substantially limited in this case by the use of a linear kernel and by pooling features across cell types, rather than focusing separately on each cell type of interest. In addition, we think there is a fundamental problem with the optimization scheme used by CADD. The SVM in CADD is trained by a global strategy, in the sense that a single set of parameter values is selected (for a given choice of training set and generalization parameter) to obtain an optimal fit, on average, across all examples in the training set. Thus, if it is true that different covariates are relevant in coding and noncoding regions, as expected, then the method will have to make tradeoffs between these types of sites. If the “signal” for training (i.e., the contribution toward the SVM’s objective function) is stronger in one type of region than another, it is likely that the tradeoff will favor that type of region. Because constraint will be strongest, on average, in coding regions, leading to higher rates of difference between simulated and observed variants in these regions, it seems likely that these regions will indeed dominate in the training procedure, and this may explain CADD’s superior performance in coding regions and its weaker performance in noncoding regions. FitCons avoids this problem by applying INSIGHT separately to each class of sites.

These observations raise the interesting possibility of a modified CADD that addresses some of these limitations. There is no reason why CADD couldn’t be trained separately on noncoding and coding regions, perhaps with different sets of covariates for each type of sites. Moreover, regulation-associated covariates could be treated in a cell-type-specific manner. A modified CADD designed along these lines (regulatory CADD, or rCADD?) could provide an interesting alternative to fitCons.


When we first discussed the idea for the fitCons scores during our drive across New York State three years ago, I envisioned a quick spin-off project that could be completed in perhaps half a year. As so often happens in research, several unanticipated challenges arose in completing this work, but we also found unexpected opportunities to connect our analysis with important open questions in the field. In addition, we were stimulated to think about the problem of combining functional and evolutionary data in new and deeper ways by another paper that addressed a similar problem but in a fundamentally different way. The end result is a paper I am quite proud of—one that provides what I think will be a useful resource to the genomics community and that also offers new insights into longstanding evolutionary questions.


[1] Gronau, I., Arbiza, L., Mohammed, J., & Siepel, A. (2013). Inference of Natural Selection from Interspersed Genomic Elements Based on Polymorphism and Divergence. Molecular Biology and Evolution. doi:10.1093/molbev/mst019

[2] Arbiza, L., Gronau, I., Aksoy, B. A., Hubisz, M. J., Gulko, B., Keinan, A., & Siepel, A. (2013). Genome-wide inference of natural selection on human transcription factor binding sites. Nature Genetics, 45(7), 723–729. doi:10.1038/ng.2658

[3] Rands, C. M., Meader, S., Ponting, C. P., & Lunter, G. (2014). 8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genetics, 10(7), e1004525. doi:10.1371/journal.pgen.1004525

[4] Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics. doi:10.1038/ng.2892

Author post: An amino acid polymorphism in the Drosophila insulin receptor demonstrates pleiotropic and adaptive function in life history trait

This next guest post is by Annalise Paaby on her paper: Paaby et al. “An amino acid polymorphism in the Drosophila insulin receptor demonstrates pleiotropic and adaptive function in life history traits” bioRxived here.

Find the alleles!
Organisms vary, even within populations, in ways that appear adaptive. We would very much like to identify the genetic elements that encode these phenotypic differences—but this is a challenging task. For polygenic traits, the tiny contributions of single loci can be near-impossible to detect in an experimental setting. In contrast, natural selection operates on a grand scale, with power to discriminate between alleles. We took advantage of the fact that Drosophila melanogaster are distributed across an extreme environmental gradient in order to identify a specific polymorphism that contributes to adaptive variation.D. melanogaster live along the east coasts of North America and Australia. On both continents, flies in low-latitude, warm environments develop faster and are more fecund, while flies in high-latitude, cold environments live longer and are more resistant to most stresses.Knocking out insulin signaling genes extends lifespan, increases stress tolerance, and reduces reproduction. Given these phenotypes, we wondered whether insulin signaling genes might vary in natural populations and influence life history. In a paper published a few years ago, we showed that alleles of a polymorphism in the Insulin-like Receptor (InR) showed clines in frequency in both North America and Australia. Since the populations were founded at different times from different source populations, the replicated pattern on separate continents is good evidence that the polymorphism is a target of selection.

What is this polymorphism?
The polymorphism we discovered is a complex indel that disrupts a region of glutamines and histidines in the first exon of InR. In our original survey, we found many segregating alleles, all differing in length by multiples of three nucleotides.H owever, two alleles comprise the majority. An allele we call InRshort is common at high latitudes, and InRlong, which is six nucleotides longer, is common at low latitudes. The alleles differ in four amino acids across a span of 16 residues.

The alleles affect signaling
In our current study, we show that InRshort and InRlong affect levels of insulin signaling. We took InRshort and InRlong flies from a single population in New York, replaced the X and second chromosomes, and randomized the genetic backgrounds of the third chromosome, on which InR resides. We measured levels of insulin signaling in test lines by performing qPCR on seven transcriptional targets in the pathway, all downstream of the receptor.We found that for five of the seven targets (four of which were significant), signaling was highest in InRlong, lowest in InRshort, and intermediate in the heterozygote—suggesting that InRshort and InRlong act additively on signaling levels. The directionality of these results makes sense: reduction of insulin signaling is known to extend lifespan, increase stress tolerance and reduce reproductive success, and these are the phenotypes we see at high latitudes where InRshort is common.

Fluctuations over time
In our new study, we returned to the North American populations we evaluated five years prior. However, this time around we mapped 100-bp paired-end reads from pooled population samples. (These data relate to Alan Bergland’s larger exploration of spatial and temporal variation in D. melanogaster, described here on arXiv.) We called each of the discrete polymorphisms within the complex indel polymorphism—SNPs or small indels—individually. Some of those discrete polymorphisms distinguish between the InRshort and InRlong alleles, and they confirm that the clines persist in North America.We reasoned that alleles prevalent in high-latitude, cold climates might be selected for in the winter, and alleles prevalent in low-latitude, warm climates might be selected for in the summer. We examined a Pennsylvania population at multiple timepoints over three years and saw dramatic fluctuations in allele frequency (changes of approximately 20%) for discrete polymorphisms associated with InRshort and InRlong. As predicted, the “winter” and “summer” alleles were those common at high and low latitudes, respectively.However, the polymorphisms that showed the most dramatic fluctuations over seasonal time were not necessarily those with the strongest clines in frequency across geographical space. We suggest that aspects of demography and selection probably vary between seasonal and geographical environments, even in the face of apparently similar climatic pressures.

A question of pleiotropy
A longstanding question in the field of life history evolution is whether single alleles affect multiple traits at once (pleiotropy) or affect traits individually but reside near each other (linkage). The question itself arises from the observation, made many times over, that life history traits are typically correlated. For example, long-lived individuals often show reduced reproductive fitness. Longevity is also often positively correlated with the ability to tolerate stress. Do the same genetic variants encode multiple trait phenotypes?We assayed our InRshort and InRlong test lines for multiple phenotypes: fecundity, development time, body size and allometry, body weight and lipid content, tolerance for multiple stresses, and lifespan. We used the test lines described above, a replicate set of InRshort and InRlong lines derived from a second population, and lines in which we measured the effects of InRshort and InRlong in an InRhypomorph mutant background.Our full report can be found in the manuscript, but the take-home message is that InRshort and InRlong are significantly associated with all of the tested traits, in directions predicted by a selection regime favoring fast development time, rapid egg-laying, and high heat tolerance in warm climates, and resistance to cold and starvation stresses in cold climates. The InRshort allele was also associated with increased lifespan in males, though we do not necessarily expect that lifespan itself is associated with fitness.In conclusion, our results implicate insulin signaling as a major mediator of life history adaptation in D. melanogaster, and suggest that tradeoffs can be explained by extensive pleiotropy at a single locus.

Some other things I would like to mention
I value this study for its functional tests—phenotypic effects of candidate polymorphisms are often missing from evolutionary studies. However, and this is a major caveat: the InRshort and InRlong alleles were embedded in genotypic backgrounds that extended well beyond the locus in the test lines. On their own, I do not consider the functional tests definitive. But D. melanogaster have low linkage disequilibrium, which we know decays rapidly just outside our candidate polymorphism. In my opinion, the segregation of InRshort and InRlong in large, recombining wild populations pinpoints the functional alleles, while the experimental assays confirm our hypotheses about the selection regime.When we first measured fecundity, we counted every single egg laid by every single female over every single one of their lives. And the InRlong females, which we knew were more fecund—their culture bottles grew like gangbusters—laid only five more eggs on average than InRshort females! Highly non-significant. But, it looked like the InRlong flies laid eggs faster. We set up a different assay to measure eggs laid in the first day, and InRlong was six times more fecund. I think this provides an important lesson. We can easily imagine big fitness consequences for egg laying rate, but we might not think to measure it in the lab. Many studies, especially those from a molecular genetics point of view, have been keen to emphasize decoupling of lifespan and reproduction for so-called longevity genes. For conclusions drawn about natural genetic variants (which are the ones of utmost relevance, in my opinion), the question of tradeoffs must consider those fitness axes that are relevant to the wild organism. And these are often unknowable.We found that InRshort and InRlong were associated with smaller and larger body sizes, respectively. This makes sense in terms of levels of insulin signaling, but not in terms of body sizes in wild populations. High latitude flies are typically larger, not smaller. So, if InRshort and InRlong alleles affect body size, they either do so epistatically with other body size loci or they suffer antagonistic selection pressures along multiple fitness axes. Interesting!