Author post: Coalescent times and patterns of genetic diversity in species with facultative sex

This guest post is by Matthew Hartfield (@mathyhartfield) on “Coalescent times and patterns of genetic diversity in species with facultative sex”.

Our paper “Coalescent times and patterns of genetic diversity in species with facultative sex”, in which we investigate the genealogies of facultative sexuals, is now available from the biorxiv.

Most evolutionary biologists are obsessed with sex. Explaining why organisms reproduce sexually by combining genetic material is a tough problem. The main issue lies with the fact that asexuality (reproduction via clonality) should be able to outcompete sexuals due to sheer weight of numbers. Various theories have been put forward to explain why sex is so widespread. The majority of these revolve around the idea that exchanging genetic material enables the fittest possible genotype to be created, while that of asexuals should degrade over time.

While such theories are ubiquitous, data to test them has been scarce. Recent years have seen a boom in exploring the evolution of sex experimentally using facultative sexual organisms: species that can switch between sexual and asexual reproduction. Such experiments have demonstrated how sexual reproduction can evolve when exposed to stressful environments, or when moving between environmentally different areas. Yet major questions remain regarding what the underlying genetic causes of these transitions are. In addition, there are plenty of organisms that undergo ‘cryptic’ sex, which cannot be observed directly but can with genomic sequence analyses.

Coalescent models are important for analysing genomic data. These tools determine the relationship between neutral markers, and hence make predictions on how genetic diversity is affected depending on environmental structuring, localised natural selection, or other effects. However, classic models cannot be applied to systems with partial asexuality, as they assume the population reproduces entirely sexually.

We worked on introducing partial rates of sex into these models. In the simplest case (one population with a fixed rate of sex), we recovered a classic prediction that extensive divergence between alleles at the same site arises. This phenomenon occurs since lack of sex keeps the two alleles distinct over evolutionary time; only a rare bout of sex has any chance of creating the segregation needed for them to be descended from the same allele.

A schematic of Allelic Sequence Divergence (ASD) in asexuals: Xs are distinct mutations at each neutral site.

A schematic of Allelic Sequence Divergence (ASD) in asexuals: Xs are distinct mutations at each neutral site.

After recovering this familiar result, we worked to extend coalescent theory in partial asexuals to include various other biological phenomena. Two effects we looked at were gene conversion, and heterogeneity in sex rates that change over time or space.

Gene conversion, where one DNA sequence replaces part of a homologous chromosome, is usually regarded as being of minor evolutionary importance. Yet numerous studies of facultative sexuals often observe it as a common force, especially in species not exhibiting allelic sequence divergence (ASD). Could the two be related? Excitingly, we found that low rates of gene conversion become important in organisms with low rates of sex. That is, once sex becomes so rare as to caused ASD, small rates of gene conversion can then reverse the process, homogenizing alleles again. Rather than having higher diversity than otherwise similar sexual populations as expected with ASD (in the absence of gene conversion), asexual populations will have less diversity than comparable sexual populations if gene conversion is not too low.

It is also known that many organisms change their rates of sex over time or location, which can be triggered by environmental cues or organismal stress. By investigating such variation in the rate of sex, the analysis elegantly shows how even a short burst to obligate sex (over tens of generations) is enough to jumble genomes in the population, hence giving the same outcome as long-term obligate sex. If rates of sex are also different in separate geographical locals, then these differences can be detected if there is little gene flow between regions. Otherwise, both areas display intermediate rates of sex.

Coalescent tools are popular since they can be used to simulate complex evolutionary outcomes, which are then tested against genomic data. We used the mathematical analyses to outline a coalescent algorithm to account for partial rates of sex, and predict genetic diversity, under numerous scenarios. The code is available online ( for others to use.

These are exciting times for population genetics and evolution, with cheaper sequencing costs making it possible to wade through the genomes of more individuals than before. Yet accurately exploring the genetic landscape requires the creation of mathematical tools that accounts for organismal life history. These results will provide the first of many buildings blocks to determine the effects of selection and the environment on the evolution of facultative sexuals. They might eventually reveal why sex is so prevalent in nature.


Author post: Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis

This guest post is by Zamin Iqbal [@ZaminIqbal] and Phelim Bradley [@Phelimb]

Our paper “Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis” has just appeared on the Biorxiv. We’re excited about it for a number of reasons.

The idea of using a graph of genetic variation as a reference, instead of a linear genome, has been discussed for some while, and in fact a previous biorxiv preprint of ours applying them to the MHC has just come out in Nature Genetics:

In this paper we apply those ideas to bacteria, where we let go of the linear coordinate system in order to handle plasmid-mediated genes. Our idea is simple – we want to see if genomic data can be used to predict antibiotic resistance in bacteria, and we explicitly want to build a general framework that will extend to many species, and handle mixed infections.

The paper does not deal with the issue of discovering mechanisms/mutations/genes which drive drug resistance – we take a set of geno-pheno rules as prerequisite, and then use a graph of resistance mutations and genes on different genetic backgrounds to detect presence of alleles and compare statistical models – is the population clonal susceptible, minor resistant or major resistant? Although it is accepted that minor alleles can sweep to fixation, in general there is neither consensus nor quantitative data on the correlation between allele frequency and in vitro phenotypic resistance or patient outcome (the latter obviously being much harder). At a practical level,in some cases a clinician might avoid a drug if they knew there was a 5%-frequency resistance allele, and in others they might increase the dose. Resistance is of course a quantitative trait, often measured in terms of the minimum concentration of a drug required to stop growth of a fixed inoculum – but commonly a threshold is drawn and samples are classified in a binary fashion.

A paper last year from some of us ( showed that a simple panel of SNPs and genes was enough to predict resistance with high sensitivity and specificity for S. aureus (where SNPs, indels, chromosomal genes and plasmid-mediated genes can all cause resistance) – once you discard all samples with any mixed strains. (Standard process is to take a patient sample and culture “overnight” (12-24 hours), thus removing almost all diversity and samples which show any morphological signs of diversity after culture are discarded or subcultured). By contrast, for M. tuberculosis (which causes TB), known resistance mutations explain a relatively low proportion of phenotypic resistance (~85%) for first-line drugs, and even less for 2nd line (I explain below what 1st/2nd line are). The Mtb population within-host is highly structured and multiple genotypes can evolve in different loci within the body, so it’s important to be able to deal with mixtures. Typical phenotyping relies on several weeks of solid culture (Mtb is slow growing), but mixtures are more able to survive this type of culture than in the case of S. aureus.

We show with simulations that we can use the graph to detect low frequency mutations and genes (no surprise), and that for S. aureus we make no minor calls for our validation set of ~500 blood-cultured samples (no surprise). Each sample is phenotyped with 2 standard lab methods, and where they disagree a higher quality test is used to arbitrate. This consensus allows us to estimate error rates both for our method (called Mykrobe predictor) and for the phenotypic tests. As a result we’re able to show not only that we do comparably with FDA requirements for a diagnostic, but also that we match or beat the common phenotypic tests.

On the other hand for TB, the story is much more complex and interesting. We analyse ~3500 genomes in total, split into ~1900 training samples and ~1600 for validation. For M. tuberculosis, a sample is classed as resistant if after some weeks of culturing under drug pressure, the number of surviving colonies is >1% of the number of colonies from a control strain treated identically – the number 1% is of course arbitrary (set down by Canetti in the 1960s I think), though it has been shown that phenotypic resistance does correlate with worse patient outcome. Sequencing on the other hand is done before the drug pressure, so we are fundamentally testing a different population, and we can’t simply mirror that 1% allele frequency expectation. This is what we use the 1900 training samples for – determining what frequency to set for our minor-resistant model. We ended up using 20%, and also found that there
was an appreciable amount of lower frequency resistance, which did not survive the 6-week drug-pressure susceptibility test, but which might cause resistance in a patient.

Mtb infections can last a long time, and despite their slow growth, the sheer number of bacilli in a host result in a vast in-host diversity. As a result, mono therapies fail, as resistant strains sweep to fixation – standard treatment is therefore with 4 “first-line” drugs, reducing the chance that any strain has enough mutations to resist them all. If the first-line drugs fail, or if the strain is known to be resistant, then it is necessary to fall back to more toxic and less effective second-line drugs. We found, somewhat to our surprise, that

1. Overall, minor alleles contribute very little to phenotypic resistance in first-line drugs, but they do make a significant contribution to second-line drugs, improving predictive power by >15%. This matches previous reports that patient samples had mixed R and S alleles for 2nd line drugs. This could have major public health consequences, as resistance to these drugs needs to be detected to distinguish MDR-TB (resistant to isoniazid, rifampicin) from XDR-TB (isoniazid, rifampicin + second-line), a major concern for the WHO.

2. Interestingly, a noticeable number of rifampicin false-positive calls were due to SNPs which confer resistance but have been shown to slow growth. Since the phenotyping test is intrinsically a measure of relative growth, these strains may be misclassified as susceptible – i.e. these are probably false-susceptible calls due to an artefact of the nature of the test. This has been reported before by the way.

Anyway – please check out the paper for details. We think this large-scale analysis of whether minor alleles contribute to in vitro phenotype, and whether they should be used for prediction is new and interesting both scientifically and in terms of translation. The bigger question is what the consequences are for patient outcome, and how to deal with in-host diversity, and for that we of course need data collection and sharing. We’ve spent a lot of time in the Oxford John Radcliffe Hospital working with clinicians, and trying to determine what information they really need from this kind of predictive test, and we’ve produced both Windows/Mac apps with very simple user-interfaces (drag-the-fastq on, and let it run) for them to use; we’ve also produced an Illumina Basespace app, currently submitted to Illumina for approval, which should enable automated cloud-use.

Our paper also has a whole bunch of work I’ve not mentioned here, where we needed to identify species, and detect contaminants – most interesting when common contaminants can contain the same resistance gene as the species under test.

Our software is up on github here
including some desktop apps and example fastq files so you can test it.

Comments very welcome!

Zam and Phelim

PS By the way, the 4 first-line drugs have different effectiveness in different body compartments – see this interesting paper for the modelling of the consequences:

Author post: Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing

This guest post is by William Wen on his preprint “Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing”, arXived here.

Our paper “Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing” has been published in the journal Biostatistics, the preprint is also updated on the arXiv. The paper discusses linear mixed models (LMM) and the models commonly used for (rare variants) SNP set testing in a unified Bayesian framework, where fixed and random effects are naturally treated as corresponding prior distributions. Based on this general Bayesian representation, we derive Bayes factors as our primary inference device and demonstrate their usage in solving problems of hypothesis testing (e.g., single SNP and SNP set testing) and variable selection (e.g., multiple SNP analysis) in genetic association analysis. Here, we take the opportunity to summarize our main findings for a general audience.

Agreement with Frequentist Inference

We are able to derive various forms of analytic approximations of the Bayes factor based on the unified Bayesian model, and we find that these analytic approximations are connected to the commonly used frequentist test statistics, namely the Wald statistic, score statistic in LMM and the variance component score statistic in SNP set testing.

In the case of LMM-based single SNP testing, we find that under a specific prior specification of genetic effects, the approximate Bayes factors become monotonic transformations of the Wald or score test statistics (hence their corresponding p-values) obtained from LMM. This connection is very similar to what is reported by Wakefield (2008) in the context of simple logistic regression. It should be noted the specific prior specification (following Wakefield (2008), we call it implicit p-value prior) essentially assumes a larger a priori effect for SNPs that are less informative (either due to a smaller sample size or minor allele frequency). Although, from the Bayesian point of view, there seems to be a lack of proper justification for such prior assumptions in general, we often note that the overall effect of the implicit p-value prior on the final inference may be negligible in practice, especially when the sample size is large (We demonstrate this point with numerical experiments in the paper).

For SNP set testing, we show that the variance component score statistic in the popular SKAT model (Wu et al. (2011)) is also monotonic to the approximate Bayes factor in our unified model if the prior effect size under the alternative scenario is assumed small. Interestingly, such prior assumption represents a “local” alternative scenario for which score tests are known to be most powerful.

The above connections are well expected: after all, the frequentist models and the Bayesian representation share the exact same likelihood functions. From the Bayesian point of view, the connections reveal the implicit prior assumptions in the Frequentist inference. These connections also provide a principled way to “translate” the relevant frequentist statistics/p-values into Bayes factors for fine mapping analysis using Bayesian hierarchical models as demonstrated by Maller et al. (2012) and Pickrell (2014).

Advantages of Bayesian Inference

Bayesian Model Averaging

Bayesian model averaging allows simultaneous modeling of multiple alternative scenarios which may not be nested or compatible with each other. One interesting example is in the application of rare variants SNP set testing, where two primary classes of competing models, burden model (assuming most rare causal variants in a SNP set are either consistently deleterious or consistently protective) and SKAT model (assuming the rare causal variants in a SNP set can have bi-directional effects), target complimentary alternative scenarios. In our Bayesian framework, we show that these two types of alternative models correspond to different prior specifications, and a Bayes factor jointly considering the two types of models can be trivially computed by model averaging. A frequentist approach, SKAT-O, proposed by Lee et al. (2012) achieves the similar goal by using a mixture kernel (or prior from the Bayesian perspective). We discuss the subtle theoretical difference between the Bayesian model averaging and the use of SKAT-O prior and show by simulations, the two approaches have similar performance. Moreover, we find that the Bayesian model averaging is more general and flexible. To this end, we demonstrate a Bayesian SNP set testing example where three categories of alternative scenarios are jointly considered: in addition to the two aforementioned rare SNP association models, a common SNP association model is also included for averaging. Such application can be useful for eQTL studies to identify genes harboring cis-eQTLs.

Prior Specification for Genetic Effects

The explicit specification of the prior distributions on genetic effects for alternative models is seemingly a distinct feature of Bayesian inference. However, as we have shown, even the most commonly applied frequentist test statistics can be viewed as resulting from some implicit Bayesian priors. Therefore, it is only natural to regard the prior specification as an integrative component in modeling alternative scenarios. Many authors have shown that it is effective to incorporate functional annotations into genetic association analysis through prior specifications. In addition, we also show that in many practical settings, the desired priors can be sufficiently “learned” from data facilitated by Bayes factors.

Multiple SNP Association Analysis

Built upon the Bayes factor results, we demonstrate an example of multiple SNP fine mapping analysis via Bayesian variable selection in the context of LMM. The advantages of Bayesian variable selection and its comparison to the popular conditional analysis approach have been thoroughly discussed in another recent paper of ours (Wen et al. 2015).

Take-home Message

If single SNP/SNP set association testing is the end point of the analysis, the Bayesian and the commonly applied frequentist approaches yield similar results with very little practical difference. However, going beyond the simple hypothesis testing in genetic association analysis, we believe that the Bayesian approaches possess many unique advantages and is conceptually simple to apply in rather complicated practical settings.

The software/scripts, simulated and real data sets used in the paper are publicly available at github.


1. Wakefield, J. (2009). Bayes factors for genome‐wide association studies: comparison with P‐values. Genetic epidemiology, 33(1), 79-86.
2. Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1), 82-93.
3. Maller, J. B., McVean, G., Byrnes, J., Vukcevic, D., Palin, K., Su, Z., Wellcome Trust Case Control Consortium., et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature genetics, 44(12), 1294-1301.
4. Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. The American Journal of Human Genetics, 94(4), 559-573.
5. Lee, S., Wu, M. C., & Lin, X. (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13(4), 762-775.
6. Wen, X., Luca, F., & Pique-Regi, R. (2014). Cross-population meta-analysis of eQTLs: fine mapping and functional study. bioRxiv, 008797.

Author post: Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance

This guest post is by Pleuni Pennings about her paper (with co-authors) Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance, bioRxived here. This is a cross-post from Pleuni’s blog.

Almost three years ago, in early 2012, I attended a talk by Martin Nowak. He talked about cancer and one of the things he said was that treatment with multiple drugs at the same time is a good idea because it helps prevent the evolution of drug resistance. Specifically, he explained, when treatment is with multiple drugs, the pathogen (tumor cells in the case of cancer) needs to acquire multiple resistance mutations at the same time in order to escape drug pressure.

As I listened to Martin Nowak’s talk, I was thinking of HIV, not cancer. At that time, I had already spent about two years working on drug resistance in HIV. Treatment of HIV is always with multiple drugs, for the same reason that Martin Nowak highlighted in his talk: it helps prevent the evolution of drug resistance.

However, as I read the HIV drug resistance literature and analyzed sequence data from HIV patients, I found evidence that drug resistance mutations in HIV tend to accumulate one at a time. This is contrary to the generally accepted idea that the pathogen must acquire resistance mutations simultaneously.

There seemed to be a clear mismatch between data and theory. Data show mutations are acquired one at a time, and theory says mutations must be acquired simultaneously. One of the two must be wrong, and it can’t be the data![1]


After Martin Nowak’s talk, I went up to him and told him how I thought data didn’t fit the theory. Martin’s response: “Oh, that is interesting!” (Imagine this being said with an Austrian accent). “Let’s meet and talk about it.”

So, we met. Logically, Alison Hill and Daniel Rosenbloom, then grad students in Martin’s group, were there too. I had already met with Alison and Daniel many times, since they were also working on drug resistance in HIV. John Wakeley (my advisor at Harvard) came to the meeting too.

Between the five of us, we brainstormed and fairly quickly realized that one solution to the conundrum was to assume that a body’s patient consisted of different compartments and that each drug may not penetrate into each compartment. Maybe we found this solution quickly because Alison and Daniel had already been thinking of the issue of drug penetration in the context of another project. A body compartment that has only one drug instead of two or three would allow a pathogen that has acquired one drug resistance mutation to replicate. If a pathogen with just one mutation has a place to replicate, this makes it possible for the pathogen to acquire resistance mutations one at a time.

We decided to start a collaboration to analyze a formal model to see whether our intuition was correct. Over the following three years, there were some personnel changes and several moves, graduations and new jobs. Stefany Moreno joined the project as a student from the European MEME Master’s program when she spent a semester in Martin’s group. When I moved to Stanford, Dmitri Petrov became involved in the project. Next, Alison and Daniel each got their PhD and started postdocs (Alison at Harvard, Daniel at Columbia), Stefany got her MSc and started a PhD in Groningen, I had a baby and became an assistant professor at SFSU. No one would have been surprised if the project would never have been finished! But we stuck with it and after many hours of work, especially by the first authors Alison and Stefany, and uncountable Google Hangout meetings, we can now confidently say that our initial intuition from that meeting in 2012 was correct. Compartments with imperfect drug penetration indeed allow pathogens to acquire drug resistance one mutation at a time. And, importantly, the evolution of multi-drug resistance can happen fast if mutations can be acquired one at a time, much faster than when simultaneous mutations are needed.

Our manuscript can be found on the BioRxiv (link). It is entitled “Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance.” We hope you find it useful!

[1]Of course, it could be my interpretation of the data!

Author post: Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study

This guest post is by Xiaoquan (William) Wen, Francesca Luca, and Roger Pique-Regi on their preprint Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study, bioRxived here.

Our paper presents an integrative analysis framework to perform fine mapping and functional analysis of cis-eQTLs. In particular, we consider a setting where eQTL data are collected from multiple population groups. Although the details of our methods and analysis results are described in the manuscript, here we’d like to take the opportunity to discuss some of our main features and interesting findings of this work.

From the methodological perspective, the Bayesian inference framework that we present in this paper enables efficient multiple SNP analysis in the presence of multiple heterogeneous (population) groups. This framework is a natural extension from our previous works in dealing with heterogeneous genetic association data in single SNP analysis (Flutre et al 2013, Wen and Stephens 2014). The output from our multiple cis-eQTL analysis fully characterizes the uncertainty of eQTL calls, which becomes critical in the downstream functional analysis. This represents a significant advantage over the commonly applied conditional analysis approach, which is non-trivial to generalize when there are multiple heterogeneous subgroups. Taking advantage of these features, we further extend our analysis framework to incorporate functional genomic annotations and assess their levels of enrichment in association signals. Although in this paper we solely focus on eQTLs, it should be noted that our statistical methods are completely general, and applicable in other contexts of genetic association analysis.

Applying this analysis framework, we re-analyzed the eQTL data from the GUEVADIS project that consists of samples from five population groups. Importantly, a key motivation is to identify eQTL signals that are consistently presented in all population groups. This analysis yields some interesting findings, which we will highlight below:

  1. Cross-population meta-analysis greatly improves the power of eQTL discovery. The power gain by integrating data across population groups is well expected: with a combined sample size ~400 in five population groups, we are able to identify 6,555 genes that harbor at least one cis-eQTL (which we refer to as “eGenes”) from 11,838 tested protein coding and lincRNA genes at 5% FDR level; in comparison, the union set from the population-by-population analysis yields 3,447 eGenes.
  2. Cross-population samples provide unique resources to fine map cis-eQTLs. We perform multiple SNP analysis for each identified eGene, and find that for a non-trivial proportion of genes (7% of all genes analyzed, or 14% of identified eGenes), two or more independent cis-eQTL signals can be confidently identified in the GEUVADIS data. In most of those cases, we are relatively certain about the existence of multiple eQTL signals, but cannot pinpoint the causal variants by fully resolving the LD. Nevertheless, we find that utilizing cross-population samples, the population heterogeneity in local patterns of LD can be effectively leveraged to narrow down the genomic regions that harbor causal eQTLs, a phenomenon that we refer to as “LD filtering”. Using the GEUVADIS data, we are able to quantify the effect of LD filtering. More specifically, we select a set of genes that are identified harboring exactly one cis-eQTL with high confidence, and construct credible regions for each eQTL signal based on both the population-by-population and the cross-population analyses. We find that for the majority of the genes tested (92% of 526 selected genes), the joint analysis yields a smaller credible region comparing to the minimum credible region length from separate population analyses. The median reduction in region length from the separate analysis to joint analysis is close to 50% in the set of genes examined.

    On the other hand, there are cases that population specific LD patterns can cause some SNPs to display large degree of heterogeneity across populations in their estimated effect sizes from single SNP analysis. In some extreme cases, a SNP may appear to possess strong “population specific” effects. As we acknowledge that genuine population specific eQTLs are certainly interesting phenomena and very much likely exist, we suggest interpreting highly heterogeneous eQTL signals from single SNP analysis with caution. In the paper, we demonstrate one such example where a set of SNPs in LD, when analyzed alone, appear to show strong but opposite effects on expression levels in European and African populations. The multiple SNP analysis yields a seemingly much more plausible alternative explanation: it identifies two independent eQTL signals in the region, and the “opposite effect” eQTLs tag one signal in the African population and the other signal in the European populations. This example, we believe, fully demonstrates the necessity and benefit of multiple SNP analysis using cross-population samples.

  3. Genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs. This point is demonstrated by our functional analysis approach based on the fine mapping results of cis-eQTLs. In brief, we classify every cis-SNP into three mutually exclusive categories based on the computational predictions of CENTIPEDE model: 1) SNPs strongly affecting TF binding 2) SNPs residing in a DNAse-I footprint region but with little or no effects on TF binding 3) all other SNPs, or baseline SNPs. We find that the first category of SNPs are 1.49 fold more likely than baseline SNPs to be eQTLs, and its enrichment level is statistically highly significant (p-value = 4.93 x 10-22). The SNPs in category 2 is also enriched but with much less impressive fold change (1.15) and statistical significance (p-value = 0.0035). Very interestingly, this finding seems in agreement with the results reported in our recent work Moyerbrailean et al 2014) where other cellular and organismal phenotype QTLs are examined.

Overall, the ability of our method to disentangle multiple eQTL signals represents a significant step forward towards fully comprehending the complex mechanisms regulating gene expression. Using the natural interventions represented by genetic polymorphisms can be used in future studies to identify multiple functional regulatory elements for a gene. The computational methods used in this paper are implemented in the software packages FM-eQTL and eQTLBMA. Our analysis results are also available for browsing and downloading at this site.


1. T Flutre, X Wen, J Pritchard, M Stephens (2013). A statistical framework for joint eQTL analysis in multiple tissues. PLoS genetics 9 (5), e1003486

2. X Wen, M Stephens (2014). Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene–environment interactions. The Annals of Applied Statistics 8 (1), 176-203

3. T Lappalainen et al (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511

4. Moyberbrailean et al (2014) Are all genetic variants in DNase I sensitivity regions functional? bioRxiv, 007559

Author post: Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage

This guest post is by Claude Becker, Jörg Hagmann and Detlef Weigel on their preprint Century-scale methylome stability in a recently diverged Arabidopsis thaliana lineage, bioRxived here.

This paper is the result of a collaboration between experts in machine learning and statistical analysis (from the group of Karsten Borgwardt at the Max Planck Institute of Intelligent Systems), a lab that has spearheaded the assembly and SNP genotyping of a world-wide collection of Arabidopsis thaliana specimen (Joy Bergelson’s lab at the University of Chicago), a group specialized in large-scale phenotyping (the lab of Thomas Altmann at the Leibniz Institute of Plant Genetics and Crop Plant Research in Gatersleben) and our epigenomics group at the Max Planck Institute for Developmental Biology in Tübingen.

The epigenome of an organism, in a restricted definition, consists of the entirety of post-translational histone modifications (e.g. methylation, acetylation, etc.) and chemical modifications to the DNA, such as methylation of cytosines. Epigenetic marks can influence the transcriptional activity of genes and transposable elements by locally modulating the accessibility of the DNA. The local configuration of the epigenome can change (i) spontaneously, (ii) in dependence of genetic rearrangements, or (iii) as a consequence of external signals. That the epigenome reacts to external signals such as stress and nutrient supply and that it can influence physiological processes – even behavior – has caused much recent excitement. Academic and popular scientific articles have raised the question whether the epigenome has the potential to maintain environmental footprints across generations. The epigenome is thus presented as an entity that fuels acclimation to rapidly changing environmental conditions and that enables adaptation in subsequent generations. Studies investigating the epigenetic basis of the inheritance of acquired traits, however, often either lack the depth of analysis necessary for the identification of locus-specific epigenetic changes or investigate inheritance over a rather short time period of only one or two generations. Moreover, many study designs do not allow for easy distinction between genetic variation causing the observed epigenetic change and epigenetic differences independent of DNA sequence variation.

In our new study we aim to tackle the question to what extent long exposure to varying and diverse environmental conditions can change the heritable DNA methylation landscape. We overcome several of the above-mentioned problems and limitations by studying variation of DNA methylation in a quasi-isogenic lineage of the model plant Arabidopsis thaliana. North America (NA) was only recently colonized by A. thaliana, and approximately half of the current population is made of a single lineage that underwent a recent population bottleneck, having diverged from a common ancestor more a century or two ago, resulting in minimal genetic diversity in the current population [1].

We sequenced the genome and DNA methylome of thirteen closely related NA accessions originating from different geographical locations in order to determine the spectrum, frequency and effect of epigenetic variants. We then compared the epigenetic variation in the NA lineage to that of a previously analyzed set of isogenic A. thaliana lines that had been propagated for 30 generations in the greenhouse [2,3].

Pairwise comparison of the NA accessions revealed that only 3% of the genome-wide methylation showed variable methylation. By using the genetic mutations as a molecular clock, we found that – contrary to our expectation – epimutations did not accumulate at a higher rate under varying natural conditions compared to growth in a stable greenhouse environment. Even more surprisingly, changes in DNA methylation of single cytosines and of larger contiguous regions were often seen in both NA and greenhouse-grown accessions. In both datasets, accumulation of epimutations over time was non-linear, likely reflecting frequent reversions of methylation changes back to the initial configuration. Population structure inferred from methylation data reflected the genetic relatedness of the accessions and showed no signal of a genome-wide environmental footprint. This, together with the fact that most epigenetic variants were neutral and did not correlate with changes in gene expression, indicated that epigenetic variants accumulate to a large extent as a function of time and genetic diversification rather than as a consequence of local adaptation to environmental changes.

In summary, we have shown that long-term methylome variation of plants grown in varying and diverse natural sites is largely stable at the whole-genome level and in several aspects is intriguingly similar to that of lines raised in uniform conditions. This does not rule out a limited number of subtle adaptive DNA methylation changes that are linked to specific growth conditions, but it is in stark contrast to the published claims of broad, genome-wide epigenetic variation reflecting local adaptation. Heritable polymorphisms that arise in response to specific growth conditions certainly appear to be much less frequent than those that arise spontaneously or due to genetic variation.

In addition to the biological findings discussed above, an important part of our paper is an improved method for the detection of differentially methylated regions. Past studies have relied on clustering of differentially methylated positions or on fixed sliding windows, with the caveat of high rates of false negatives and false positives, respectively. We have adapted a Hidden Markov Model, initially developed for animal methylation data, to the more complex DNA methylation patterns in plants. Upon identification of methylated regions in each strain, these are then tested for differential methylation between strains. Our method results in increased specificity and higher accuracy and we believe it will be of broad interest to the epigenomics community.


1. Platt A, Horton M, Huang YS, Li Y, Anastasio AE, et al. (2010) The scale of population structure in Arabidopsis thaliana. PLoS Genet 6: e1000843.

2. Becker C, Hagmann J, Müller J, Koenig D, Stegle O, et al. (2011) Spontaneous epigenetic variation in the Arabidopsis thaliana methylome. Nature 480: 245-249.

3. Schmitz RJ, Schultz MD, Lewsey MG, O’Malley RC, Urich MA, et al. (2011) Transgenerational epigenetic instability is a source of novel methylation variants. Science 334: 369-373.

Author post: Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids

This guest post is by Russ Corbett-Detig, Emily Jacobs-Palmer, and Hopi Hoekstra (@hopihoekstra) on their paper Corbett-Detig et al Segregation distorters are not a primary source of Dobzhansky-Muller incompatibilities in house mouse hybrids bioRxived here.

What are segregation distorters and how can they contribute to reproductive isolation?

Within an individual, somatic cells are typically genetic clones of one another; in contrast, haploid gametes are related to their compatriots at only half of all loci on average, opening doors to intra-individual competition and conflict. Eggs and sperm may express selfish genetic elements called segregation distorters (SDs) that disable or destroy competitor gametes carrying unrelated alleles. The resulting transmission advantage attained by SDs allows them to invade populations without improving the fitness of individuals that harbor them. Indeed, SDs often negatively impact carriers’ fitness because such hosts transmit fewer fit (or viable) gametes. Hence natural selection favors the evolution of alleles that suppress distortion and thereby restore fertility.

Coevolution of SDs and their suppressors can in turn contribute to the evolution of reproductive isolation between diverging lineages. How? If two populations become temporarily isolated from one another, SDs and later their accompanying suppressors may arise and eventually fix in one isolated population, possibly multiple times over. Should the two populations then encounter each other again, the sperm of hybrid males, for example, will contain one or more distorters without the appropriate suppressors, and these males will suffer decreased fertility. Over time, gene flow may be substantially and perhaps permanently hindered leading to the formation of two reproductively isolated species.

In some Drosophila species pairs, and in many crop plants, it is clear that the coevolution of SDs and their suppressors are major, even primary, contributors to the evolution of reproductive isolation between diverging lineages. At present, however, the relative importance of SDs-suppressor systems to reproductive isolation in broader taxonomic swathes of sexually reproducing organisms (e.g. mammals) is largely unexplored.

Our solution to the practical challenges of studying SDs


The primary impediment to addressing this important question in evolutionary biology is practical, not conceptual. Conventionally, researchers detect SD-suppressor systems by crossing two strains to produce a large second-generation hybrid population; they then genotype these hybrids at a set of markers across the genome to identify loci that show substantive deviations from 50:50 mendelian ratios—putative SDs. Ultimately, this traditional approach suffers from two major pitfalls. First, for many organisms it is not feasible to raise and genotype enough hybrids (hundreds to thousands) to have sufficient statistical power to detect SDs, especially those with weaker effects. Second, by genotyping these second generation hybrids, rather than the gametes of their parents, one conflates SD with hybrid inviability, and it can be very difficult to disentangle these two factors.

How to circumvent these challenges? In this work, we develop an alternative approach that avoids these practical challenges. We first obtain high quality, motile sperm from first generation hybrid males (generated from two strains with available genome sequences), and then sequence these sperm in bulk as well as a somatic ‘control’ tissue. We then contrast the relative representation of the parental chromosomes in windows across the genome in both samples, searching for regions where the sperm allele ratios show more DNA copies of one parental haplotype, but the somatic alleles do not. Importantly, this approach is very general, and it can easily be applied to any number of interspecific or intraspecific crosses where it is possible to obtain large quantities of viable gametes.

Little evidence for SDs in house mouse hybrids

We apply this method to a nascent pair of Mus musculus subspecies,M. m. castaneus and M. m. domesticus. We chose these subspecies because hybrid males formed in this cross are known to be partially reproductively dysfunctional. Nonetheless, using our novel method we find no evidence supporting the presence of SDs—no genomics regions showing a statistical deviation from 50:50 compared to control tissue—despite strong statistical power to detect them. We conclude that SDs do not contribute appreciably to the evolution of reproductive isolation in this nascent species pair. Instead, reproductive isolation in these mammalian subspecies likely stems from other incompatibilities in spermatogenesis or ejaculate production unrelated to SD-suppressor coevolution.

So what’s next? Because this approach—bulk sequencing of sperm from hybrid males—can be used on almost any pair of interfertile taxa, we can begin to better understand the prevalence of SD and its role in speciation in a wide diversity of species.