Author post: VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

This guest post is by Olly Burren and Chris Wallace on their preprint, VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes, arXived here.

The idea for this paper came from reading a study by Liu et al. ( and the fact that summary p values from genome wide association studies are increasingly becoming publicly available. In the field of human disease, genome-wide association studies have been very successful in isolating regions of the genome that confer disease susceptibility. The next step however, is to understand mechanistically exactly how variation in these loci gives rise to this susceptibility. There are a myriad of pre-existing methods available for integrating genetic and genomic datasets, however things are complicated by the high degree of linkage disequilibrium that exists, which causes substantial inflation in the variance of any test statistic. This inter-SNP correlation must be taken into account, classically by permuting case/control status and recomputing association, requiring access to raw genotyping data. Indeed, this approach was taken in our previously published method see Heing et al. ( which uses a non-parametric test to compare distribution of GWAS p values from two sets of SNPs (“test” and “control”). As most researchers working with GWAS know gaining access to raw genotyping data is often difficult, and then how to include meta-analysis and imputed data? Liu et al., got around this by estimating the inter-SNP correlation using public datasets and sampling from a multivariate normal to generate simulated p values, analogous to the permuted p values possible with permuting phenotype status when raw data are available. VEGAS uses genotype data publicly available through the International HapMap project and aims to integrate GWAS results with trans eQTLs to identify causal disease genes.

Our thought was that by combining our previously published method, with the VEGAS approach, we could create a novel approach that would allow the integration of genetic information from GWAS with functional information from for example a set of micro-array experiments, crucially without the need for genotype information. The rationale being that it would help to prioritise future mechanistic studies, which can be costly and time-consuming to conduct. We also upped the stakes, and decided to use 1000 Genomes Project genotyping information for our estimations, to allow application to dense-genotyping technologies. The result was a software pipeline that takes as input a gene set of interest, a matched ‘control’ set and a summary set of GWAS statistics and computes an enrichment score.

Note that this approach differs from the Bayesian model suggested by Pickrell ( as it focuses on comparing broad regions, rather than on considering more targeted genomic annotation, and in that sense is perhaps more akin to pathway analysis, although we do suggest that functionally defined genes sets, such as those found by knock down experiments in cell lines, may be more productive than using manually annotated pathways whose completeness can vary considerably.

To illustrate the method we applied it to a large meta-analysis GWAS study of type 1 diabetes (8000 case vs 8000 controls), and an interesting dataset examining the effect on gene-expression of knocking down a series of 59 transcription factors in a lymphoblastoid cell line see Cusanovich et al ( We identified three transcription factors, IKZF3, BATF and ESRRA, whose putative targets are significantly enriched for variation associated with type 1 diabetes susceptibility. IKZF3 overlaps a known type 1 diabetes susceptibility region, whereas BATF and ESRRA overlap other autoimmune susceptibility regions, validating our approach. Of course there are caveats interpreting results derived from cell lines, however we think it’s promising that our top hit lies in a region already associated with type 1 diabetes susceptibility.
Using the quantities already computed, once enrichment is detected, we implemented a simple technique to prioritise genes within the set. This allows the generation of a succinct list of genes that are responsible for the enrichment detected on the global level. Cross referenced with other information these can either be informative in their own right or be used to inform future studies.

This study is also an example of the preprint process speeding up scientific discovery. We knew about the Cusanovich dataset because they released a preprint on arXiv, which was caught by Haldane’s Sieve ( in October 2013. One email, and the authors kindly shared their complete results. Had we waited for it to be published in PLoS Genetics in March 2014, we’d have been five months behind where we are.

The major benefit is that all of the datasets employed are within the public domain. Our hope is that either this or other methods in the same vein will help to bridge the gap between GWAS and disease mechanisms, ultimately fuelling the development of new therapeutics.

Author post: Genetic influences on translation in yeast

This guest post is by Frank Albert and Leonid Kruglyak on their preprint (with co-authors) “Genetic influences on translation in yeast.

This post is on a manuscript we recently posted to both biorXiv and arXiv on how genetic differences between yeast strains influence protein translation. Here, we give some background for the project and what the results mean in our eyes. We also highlight a few interesting aspects of regression analysis that may be useful to other researchers in genomics, but aren’t currently widely appreciated (at least they weren’t obvious to us before we dove into these data). We appreciate any comments and thoughts!

Protein translation and the genetics of gene expression

Our study was motivated by earlier reports that regulatory variation among individuals in a species can have surprisingly different effects on mRNA vs. on protein levels. For those not fully up to speed on these questions, here’s a quick recap. Individuals differ from each other in many aspects (e.g. in appearance and disease susceptibility), and for many traits, these differences are at least in part due to genetic variation. Research in genetics (both in humans and in other species) is focused on identifying DNA sequence variants in the genome that contribute to phenotypic variation, and on understanding the molecular mechanisms through which these variants alter traits. One general class of such mechanisms is alteration of gene expression, and genetic loci that affect gene expression are known as “expression quantitative trait loci” (eQTL). DNA sequence variants can change expression levels of genes in a number of ways, and some of the expression differences in turn are thought to alter organismal traits. Large and growing eQTL catalogues exist for several species. However, to measure “gene expression”, virtually all these studies measure mRNA rather than protein levels. This isn’t because mRNA is the more relevant molecule to look at, but primarily because mRNA is much easier and cheaper to measure on a global basis than are proteins. There have been a few studies in model organisms and—more recently— in humans that examined genetic variants that influence protein levels. Their results were surprising: the loci that influenced proteins (protein QTL, pQTL) were apparently often different from those that influenced mRNA levels, and vice versa. These results were both troubling and exciting. They were troubling because if genetic changes in protein levels are not a faithful reflection of effects on mRNA levels, the relevance of mRNA-based eQTL maps is less obvious. The results are exciting from a basic science perspective, because they suggest that lots of genetic variants that specifically influence posttranscriptional processes remain to be discovered.

Our current paper begins to explore these issues by measuring one such posttranscriptional process: protein translation, the process by which mRNA molecules are read by ribosomes and “translated” into peptide chains. Translation is interesting because there is literature that suggests that its regulation is a major determinant of protein levels, perhaps as important as the regulation of mRNA levels. Translation is also convenient because it is the last step along the gene expression cascade that can still be assayed using the high throughput sequencing technologies that underlie much of the current boom in genomics in general and eQTL detection in particular. The trick is to isolate only those bits of mRNA that sit inside of ribosomes as they march along the mRNAs during translation. These “ribosome footprints” can then be sequenced, counted, and compared to mRNA levels measured in parallel to get a quantitative readout of how much each gene is being translated.

For our current paper, we teamed up with Jonathan Weissman at UCSF. The Weissman lab pioneered measuring translation by sequencing, and Dale Muzzey (a postdoc with Jonathan) generated the data for our current experiment. We chose a very simple but powerful design: we compared translation in two strains of yeast that are genetically different from each other, and also measured allele-specific translation in the diploid hybrid between these two strains. Using the strain comparison, we can quantify the aggregate effect of all genetic differences between the two strains on translation. In the hybrid, we can specifically see the effects of those variants that act in cis, an important sub-group of eQTL.

We encourage you to read the detailed results in the paper, but in a nutshell, we found that genetic differences clearly do have an effect on translation—but not a terribly large one. Genes that differ in mRNA abundance typically also differ in footprint abundance, and the effect of translation was typically to subtly modulate mRNA differences, rather than erase them or create protein differences from scratch. While there are some exceptions, the take-home message is that on average, differences in protein synthesis are reasonably well approximated by differences in mRNA levels.

Thus, translation does not appear to create major discrepancies between eQTL and pQTL. So what explains such reported discrepancies? Without going into great detail, we think that at least a part of the explanation is that those discrepancies may have been overestimated. This fits with our recent paper in which we used extremely large yeast populations to map pQTL with higher statistical power, and recovered many pQTL at sites where earlier work had found an eQTL, but no pQTL. Are pQTL in most cases simply a reflection of eQTL? It is too early to tell, and we’ll need improved designs and datasets to answer to answer this question with certainty. At the very least, our recent work suggests that genetic influences on protein levels more accurately reflect those on mRNA levels than previous reports had suggested.

“Spurious” correlations and adventures in line fitting

While we worked on the translation paper, we encountered a few technical aspects that are worth sharing in some detail. Specifically, a natural question to ask is how differences in translation compare to differences in mRNA levels. When a gene differs in mRNA abundance between strains, does translation typically lead to a stronger or weaker footprint difference? Does translation typically “reinforce” or “buffer” mRNA differences (or is there no preference either way)? An intuitively attractive analysis is to plot the mRNA differences versus differences in “translation efficiency”, or TE (i.e. the amount of ribosome footprints divided by the amount of mRNA for a given gene). This comparison is shown in Figure 1 for our hybrid data.


In the figure, differences are plotted as log2-transformed fold changes so that zero indicates “no change”, and the resulting distributions of differences are more or less normally distributed. We see a seductively strong negative correlation between mRNA differences and TE differences. Taken at face value, this seems to suggest that mRNA differences are typically accompanied by a difference in TE that buffers the mRNA difference: more mRNA in one strain is counteracted by lower TE, presumably resulting in a protein difference that is smaller than the mRNA difference would predict.

However, this analysis is misleading. The key point is that the TE difference is not independent from the mRNA difference. In log2 space, the TE difference is simply the footprint difference minus the mRNA difference (in non-log space, the TE difference is the footprint difference divided by the mRNA difference). Therefore, the larger the mRNA difference becomes, the smaller the TE difference becomes simply by definition. The fact that the correlation between TE differences and mRNA differences is negative is by itself not informative about the relationship between differences in translation and mRNA levels. This can further be illustrated in Figure 2 (which is Supplementary Figure S4 in our preprint). Here, we simply draw two uncorrelated samples a and b from a normal distribution. The plot of b – a over a has a strong negative correlation, although there is no systematic relationship at all between these two quantities.


It turns out that the problems with comparing ratios (such as the TE difference) to their components (e.g., the mRNA difference, which serves as the denominator in computing TE) have been noted a long time ago (e.g., Karl Pearson termed them “spurious” correlations in 1897). They are worth remembering when carrying out functional genomic and systems biology data analyses in which multiple classes of molecules (mRNA, proteins, metabolites, etc.) are compared to each other and to ratios between them.

Because of this effect, we directly analyzed differences in the two quantities we measured: mRNA and ribosome footprint abundance. The slope of a linear regression between these two quantities was less than one (the blue line in Figure 3, which shows the data from the strain comparison). This might once again suggest a predominance of buffering interactions—for any given mRNA difference, we would predict a smaller footprint difference.


However, this inference is again misleading, because regression to the mean ensures that even if we measure the same quantity twice with some measurement noise (which is usually unavoidable), the slope of the regression line between these two replicates is always less than one. This is because when the first measure for a given gene is by chance larger than its true value, the measure is likely to be smaller in the second replicate. We estimated the regression slope between footprint differences and mRNA differences that would be expected if there were no preference for reinforcing or buffering interactions by using a randomization test that you can read about in the manuscript. The upshot is that the observed regression slope was steeper than those in the randomized data. We therefore concluded that on average, translation more often reinforces than buffers mRNA differences.

A related issue (pointed out to us by J.J. Emerson at UC Irvine) is that linear regression fits a line by minimizing only the error on the y-axis. This makes linear regression ideal for predicting y from x, but less than ideal for measuring the relationship (the slope) between x and y. For our purposes, it is better to use a different way of fitting lines to bivariate data: “Major Axis Estimation”. You can read all about MA in this great review, but briefly, the idea is to fit a line that minimizes the perpendicular distance of points from the line. The minimized error gets distributed between the y and the x axis, resulting in fitted lines that, unlike regression, provide unbiased estimates of the slope, and hence of the true linear relationship between x and y. Using MA on our data did not alter the conclusions we arrived at by using regression together with the randomization test, but it did provide more directly interpretable values for the various slopes. For example, the slopes in the randomized datasets were now centered on 1, whereas they had been degraded to be less than one using linear regression. The MA slope in our data was greater than one (the red line in Figure 3), supporting the inference of reinforcement.

These points on “spurious” correlations, regression to the mean and MA will be obvious to some (they were not to us going in), but we suspect and hope that they may prove useful for others working on genomic datasets.

Author post: Genome scans for detecting footprints of local adaptation using a Bayesian factor model

This guest post is by Michael Blum, Eric Bazin, and Nicolas Duforet-Frebourg on their preprint Genome scans for detecting footprints of local adaptation using a Bayesian factor model, available from the arXiv here.

Finding genomic regions subject to local adaptation is a central part of population genomics, which is based on genotyping numerous molecular markers and looking for outlier loci. Most common approaches use measures of genetic differentiation such as Fst. There are many software implementing genome scans based on statistics related to Fst (BayeScan, DetSel, FDist2 , Lositan), and they contribute to the popularity of this approach in population genomics.

However, there are different statistical and computational problems that may arise with approaches based on Fst or related measures. The first problem arises because methods related to Fst assume the so-called F-model, which corresponds to a particular covariance structure for gene frequencies among populations (Bierne et al. 2013). When spatial structure departs from the assumption of the F-model, it can generate many false positives. A second potential problem concerns the computational burden of some Bayesian approaches, which can become an obstacle with large number of SNPs. The last problem is that individuals should be grouped into populations in advance whereas working at the scale of individuals is desirable because it avoids defining populations.

Using a Bayesian factor model, we address the three aforementioned problems. Factor models capture population structure by inferring latent variables called factors. Factor models have already been proposed to ascertain population structure (Engelhardt and Stephens 2010). Here we extend the framework of factor model in order to identify outlier loci in addition to the ascertainment of population structure. Our approach is not the first one to account for deviations to the assumptions of the F-model (Bonhomme et al. 2010, Günther and Coop 2013) but it does not require to define populations by contrast to the previous approaches. Using simulations, we show that factor model can achieve a 2-fold or more reduction of false discovery rate compared to the Fst-related approaches. We also analyze the HGDP human dataset to provide an example of how factor models can be used to detect local adaptation with a large number of SNPs. The Bayesian factor model is implemented in the PCAdapt software and we would be happy to answer to comments or questions regarding the software.

To explain why the factor model generates less false discoveries, we can introduce the notions of mechanistic and phenomenological models. Mechanistic models aim to mimic the biological processes that are thought to have given rise to the data whereas phenomenological models seek only to best describe the data using a statistical model. In the spectrum between mechanistic and phenomenological model, the F-model would stand close to mechanistic models whereas factor models would be closer to the phenomenological ones. Mechanistic models are appealing because they provide quantitative measures that can be related to biologically meaningful parameters. For instance, the parameters of the F-model measures genetic drift that can be related to migration rates, divergence times or population sizes. By contrast, phenomenological models work with mathematical abstractions such as latent factors that can be difficult to interpret biologically. The downside of mechanistic models is that violation of the modeling assumption can invalidate the proposed framework and generate many false discoveries in the context of selection scan. The F-model assumes a particular covariance matrix between populations which is found with star-like population trees for instance. However, more complex models of population structure can arise for various reasons including non-instantaneous divergence or isolation-by-distance, and they will violate the mechanistic assumptions and make phenomenological models preferable.

Michael Blum, Eric Bazin, and Nicolas Duforet-Frebourg

Author post: Hierarchical Bayesian model of population structure reveals convergent adaptation to high altitude in human populations

This guest post is by Matthieu Foll and Laurent Excoffier on their preprint (with co-authors) Hierarchical Bayesian model of population structure reveals convergent adaptation to high altitude in human populations, arXived here.


Since the seminal paper of Lewontin and Krakauer (1973), Fst-based genome scan methods had to struggle with the confounding effect of population structure. These methods started to be very popular with the FDIST software implemented by Beaumont and Nichols (1996), which was based on an island model. At that time it was proposed that the island model was robust to different demographic scenario (recent divergence and growth, isolation by distance or heterogeneous levels of gene flow between populations). A Bayesian version of this model (generally called the F-model) in which populations can receive unequal number of migrants has then been proposed (Beaumont and Balding 2004; Foll and Gaggiotti 2008), and implemented in the BayeScan software (, which is now quite widely used.

However, all these models assume that migrant genes originate from a unique and common migrant pool. We started to realize that this assumption could lead to a massive amount of false positive when we tried to analyze the HGDP data, where this assumption was clearly not supported. To overcome this problem, we proposed an extension of Beaumont and Nichols’s (1996) based on a hierarchical island model (Excoffier et al. 2009) in which populations were assigned to different groups or regions. An island model was assumed in each group, and the group themselves were assumed to follow an island model. This new method was then implemented in Arlequin (Excoffier and Lischer 2010).

Note that alternative ways to deal with complex genetic structure have also been proposed (Coop et al. 2010; Bonhomme et al. 2010; Fariello et al. 2013; Günther and Coop 2013), but the main message people took from these papers was that methods aiming at identifying loci under selection can be quite sensitive to some hidden (or unaccounted) population structure and should be used with caution. Hermisson (2009) even rather provocatively asked: “Who believes in whole-genome scans for selection?”

One radical way to deal with the problem of complex genetic structure is to reduce the number of sampled populations to just two (Vitalis et al. 2001). This leads to a GWAS-like strategy where people sample two populations living in contrasting environments (playing the role of cases and controls in GWAS) with potentially different selection pressures. However, other problems occur when doing so: (i) having only two populations leads to a reduction in power, (ii) related to this first point, one generally needs to sample a larger number of individuals to have sufficient power, (iii) the comparison of results obtained from different pairs of populations can be problematic, especially when one is interested in detecting convergent selection by looking at the overlap in lists of candidate genes. In the last few years, studies comparing pairs of populations living in different environments have accumulated. Typically, each pairwise comparison produced a set of candidate loci, and people often used some informal criterion to identify “repeated outliers” based on the number of times they were identified in the different tests performed (see Nosil et al. 2008 or Paris et al. 2010 for example).

A new hierarchical F-model

We started to think that the introduction of a Bayesian F-model dealing with a hierarchical population structure could solve some of these problems. We therefore introduced a hierarchical F-model where populations are assigned to different groups. In each group the genetic structure is modeled with a classical F-model, and the group themselves are modeled with a higher-level F-model. One advantage is that the Beaumont and Balding (2004) decomposition of Fst as population- and locus-specific effects can be done in each group separately as well as between groups. This allows the identification of selection at different levels: within a specific group of populations, or at a higher level (among groups). Here again, an interesting question is to identify loci responding similarly to selection in several groups. In order to look at that particular case, we explicitly included a convergent selection model were at any given locus all groups share the same locus-specific effect. Posterior probabilities of all possible models of selection are then evaluated using a Reversible Jump MCMC algorithm.

Adaptation to high altitude

We applied this new method to the very interesting case of high altitude adaptation in humans. We reanalyzed a published large SNP dataset (Bigham et al. 2010) including two populations living at high altitude in the Andes and in Tibet, as well as two lowland related populations from Central-America and East Asia. One of the most striking results we find is that convergent selection is much more common than previously found based on separate analyses in the two continents. We checked with simulations that this was in fact expected: being able to analyze the four populations together is indeed more powerful than performing two separate pairwise tests. In addition to confirming several known candidate genes and biological processes involved in high altitude adaptation, we were able to identify additional new genes and processes under convergent selection. In particular, we were very excited to find two specific biological pathways that could have evolved to counter the toxic levels of fatty acids and the neuronal excitotoxicity induced by hypoxia in both continents. Interestingly, several genes included in these pathways had been identified in high altitude Ethiopians (Scheinfeldt et al. 2012; Alkorta-Aranburu et al. 2012; Huerta-Sánchez et al. 2013), suggesting that these pathways could represent a striking example of convergent adaptation in three continents.


Our hierarchical F-model appears very flexible and can cope with a variety of sampling strategies to identify adaptation. Whereas we have considered only two groups of two populations in our paper, it is worth noting that our method can handle more than two groups and more than two populations per group. An alternative sampling scheme to detect selection could for instance to contrast several genetically related high altitude populations to several related lowland populations (see e.g. Pagani et al. 2011). Our method could also deal with such a sampling scheme, but this time, one would focus on the decomposition of the genetic differentiation between the groups (i.e. Fct). In summary, our approach allows the simultaneous analysis of populations living in contrasting environments in several geographic regions. It can be used to specifically test for convergent adaptation, and this approach is more powerful than previous methods contrasting pairs of populations separately.

Matthieu Foll and Laurent Excoffier


Alkorta-Aranburu, G., C. M. Beall, D. B. Witonsky, A. Gebremedhin, et al., 2012 The genetic architecture of adaptations to high altitude in ethiopia. PLoS Genet 8: e1003110.

Beaumont, M. A., and R. A. Nichols, 1996 Evaluating Loci for Use in the Genetic Analysis of Population Structure. Proc Biol Sci 263: 1619-1626.

Beaumont, M. A., and D. J. Balding, 2004 Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol 13: 969-980.

Bigham, A., M. Bauchet, D. Pinto, X. Y. Mao, et al., 2010 Identifying Signatures of Natural Selection in Tibetan and Andean Populations Using Dense Genome Scan Data. PLoS Genet 6: e1001116.

Bonhomme, M., C. Chevalet, B. Servin, S. Boitard, et al., 2010 Detecting selection in population trees: the Lewontin and Krakauer test extended. Genetics 186: 241-262.

Coop, G., D. Witonsky, A. Di Rienzo, and J. K. Pritchard, 2010 Using environmental correlations to identify loci underlying local adaptation. Genetics 185: 1411-1423.

Excoffier, L., T. Hofer, and M. Foll, 2009 Detecting loci under selection in a hierarchically structured population. Heredity (Edinb) 103: 285-298.

Excoffier, L., and H. E. Lischer, 2010 Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10: 564-567.

Fariello, M. I., S. Boitard, H. Naya, M. SanCristobal, and B. Servin, 2013 Detecting signatures of selection through haplotype differentiation among hierarchically structured populations. Genetics 193: 929-941.

Foll, M., and O. Gaggiotti, 2008 A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180: 977-993.

Günther, T., and G. Coop, 2013 Robust identification of local adaptation from allele frequencies. Genetics 195: 205-220.

Hermisson, J., 2009 Who believes in whole-genome scans for selection? Heredity (Edinb) 103: 283-284.

Huerta-Sánchez, E., M. Degiorgio, L. Pagani, A. Tarekegn, et al., 2013 Genetic signatures reveal high-altitude adaptation in a set of ethiopian populations. Mol Biol Evol 30: 1877-1888.

Lewontin, R. C., and J. Krakauer, 1973 Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74: 175-195.

Nosil, P., S. P. Egan, and D. J. Funk, 2008 Heterogeneous genomic differentiation between walking-stick ecotypes: “isolation by adaptation” and multiple roles for divergent selection. Evolution 62: 316-336.

Pagani, L., Q. Ayub, D. G. Macarthur, Y. Xue, et al., 2011 High altitude adaptation in Daghestani populations from the Caucasus. Hum Genet 131: 423-433.

Paris, M., S. Boyer, A. Bonin, A. Collado, et al., 2010 Genome scan in the mosquito Aedes rusticus: population structure and detection of positive selection after insecticide treatment. Mol Ecol 19: 325-337.

Scheinfeldt, L. B., S. Soi, S. Thompson, A. Ranciaro, et al., 2012 Genetic adaptation to high altitude in the Ethiopian highlands. Genome Biol 13: R1.

Vitalis, R., K. Dawson, and P. Boursot, 2001 Interpretation of variation across marker loci as evidence of selection. Genetics 158: 1811-1823.

Author post: Genome Sequencing Highlights the Dynamic Early History of Dogs

This guest post is by Ilan Gronau on Freedman et al. Genome Sequencing Highlights the Dynamic Early History of Dogs. This preprint was one of the most popular on Haldane’s Sieve last year, and was published yesterday in PLoS Genetics.

Earlier this week, our paper entitled “Genome Sequencing Highlights the Dynamic Early History of Dogs” was published in PLoS Genetics. This paper explores central questions having to do with the origin of domestic canines by sequencing and analyzing the complete genomes of six individuals from carefully selected dog and wolf lineages. We posted an earlier version of this manuscript on ArXiv, and received quite a lot of comments and questions regarding the methodology we employed for demography inference (many through Haldane’s Sieve). This feedback exposed an increasing interest in demography inference methods that utilize small numbers of complete individual genomes, but it also highlighted the need to examine the strengths and weaknesses of the different methods we employed, and how best to combine them to obtain a unified and robust picture of demographic history. We discuss some of these issues in the revised version published this week, but I thought some of the points were worth spelling out a bit more explicitly, which is the purpose of this post. I’d like to thank Adam Siepel, John Novembre, and Adam Freedman for sharing their insights through the process of writing this post.

The Methods

Our study takes advantage of three recently developed demography inference methods:
the Pairwise Sequential Markovian Coalescent (PSMC; Li and Durbin, 2011), the D statistic, or as it is more commonly referred to, the ABBA/BABA test (Durand et al., 2011), and the Generalized Phylogenetic Coalescent Sampler (G-PhoCS; Gronau, et al., 2011). All three methods base their inferences on the genealogical relationships among a relatively small number of individuals, taking advantage of the wealth of genealogical information encoded in individual genomes due to genetic recombination. PSMC, for instance, makes use of the information on changes in coalescent times between the two chromosome copies within a single individual to infer ancestral population sizes. The ABBA/BABA test makes use of asymmetries in genealogies spanning four chromosomes to detect post-divergence gene flow. G-PhoCS jointly considers all individuals using a multi-population coalescent-based demographic model, which includes population divergence times, changes in ancestral population size, and gene flow. A major advantage of G-PhoCS is that it produces a single detailed image of demographic history, inferred using a unified probabilistic model for all individuals. However, in the interest of computational tractability, the method relies on several simplifying modeling assumptions not required by the other methods. In addition, by being constrained to subsets of individuals, the PSMC and ABBA/BABA approaches are free to specialize in capturing particular aspects of demographic history (ancestral Ne and admixture, resp.), which G-PhoCS treats more coarsely. Thus, we found all three methods to be complementary and their combination to be particularly informative about the demographic history of these wild and domestic canids.

Inference of Ancestral Population Sizes

Both PSMC and G-PhoCS provide information about ancestral population sizes (Ne). PSMC is specifically designed for this task, and provides a high-resolution trace of changes in ancestral Ne by separately analyzing each diploid genome. Using PSMC, we could detect sharp declines in Ne for wolves and dogs without making any assumptions regarding the canid phylogeny. However, we found that the traces of ancestral Ne inferred by PSMC should be interpreted with care; simulations we conducted showed that the gradual reduction in Ne inferred by PSMC was also consistent with a more recent severe population bottleneck. Eventually, the coarser model inferred by G-PhoCS, in which shortly after the divergence of dogs and wolves the two ancestral populations suffered severe bottlenecks, ended up fitting the data better than the model implied by the PSMC traces. Thus in our case we found the phylogenetic context to be quite useful for dating the major changes in canid population size. Ideally, it should be possible to directly infer PSMC-style traces along the branches of the population phylogeny (alongside inference of divergence times), but this would involve a fairly major methodological undertaking.

Detection of Admixture and Post Divergence Gene Flow

One of the central findings in our study was that gene flow, particularly between dogs and wolves, played a prominent role in the history of canids. Indeed, we found that several previous claims about dog origins in the Middle East and East Asia were likely influenced by ancient gene flow from wolves to dogs in these regions. The ABBA/BABA test was an obvious choice for a method to detect admixture. It is a fairly simple method, sensitive to even low amounts of gene flow, and robust to assumptions about the demographic history of the populations being tested. By applying this test separately to all sample quartets that include the jackal outgroup, we were able to obtain a good set of candidate ancestral admixture events between dogs and wolves. However, interpreting some of these signals and combining them into a single unified hypothesis was not straightforward, especially since we found signatures for multiple ancestral admixture events among overlapping pairs of populations (e.g., Basenji-Israeli wolf and Boxer-Israeli wolf). G-PhoCS is better suited to deal with this more complex scenario of gene flow, because it jointly analyzes all samples and can consider multiple migration bands in a single analysis. We exploited this feature to find strong evidence of gene flow between wolves and jackals and to show that the signal found for Boxer-Israeli wolf admixture in the ABBA/BABA test was a result of ancestral gene flow from Basenji to Israeli wolf. Still, coming up with a scenario of ancestral gene flow that best fit the data required developing a fairly complex framework for model comparison that involved a combination of multiple separate G-PhoCS runs and comparison with simulated data (see below).

Model Comparison

Addressing subtle questions having to do with the origin of dogs and post-divergence gene flow with wolves required the ability to compare alternative hypotheses for dog domestication in terms of their fit to the data. We did this by considering a collection of plausible topologies for the population phylogeny augmented with different sets of migration bands, and using G-PhoCS to infer demographic parameters for each case. This provided us with a complete demographic model we could associate with each alternative hypothesis and then use to simulate data representing that hypothesis. The hypotheses were then assessed by comparing the simulated data with the real data. While this approach does not constitute a formal model-testing method, it did allow us to explore the space of plausible models in a systematic way and show that the data supports a model with single origin for dogs and that the origin was ancient and similarly distant from all sampled wolf populations.

Future Development

This study allowed us to closely examine recently developed methods for demography inference and ways of combining them to obtain a unified and robust inference of demographic history. While the different methods used in our study were all shown to be quite powerful, particularly when combined, there is obvious room for improvement. In my view, the most promising developments in this field will come from methods (such as G-PhoCS) that capture all major aspects of the demographic history—divergence times, ancestral population sizes and post-divergence gene flow—in a single analysis. The great advantage of such methods is that they provide a framework for rigorous hypothesis testing and model comparison. In principle, the fully Bayesian nature of G-PhoCS enables this quite naturally through the use of Bayes factors for comparison of different sets of model assumptions. Bayes factors are essentially the relative probabilities of different models given the data. Throughout this study, we experimented with various simple ways of estimating Bayes factors based on the data likelihoods of the MCMC samples generated by G-PhoCS, but we were not able to robustly capture the differences in likelihoods of genealogies sampled for the different hypotheses. Solving this important problem will require additional work, but it is definitely within reach. Another important set of extensions involves using richer models that rely on weaker sets of assumptions. This includes modeling recombination, gradual changes in ancestral population sizes, and more realistic models for gene flow. Progress is being made in these directions as well (see, for example, our recent work on ancestral recombination graph inference), and there is much room for optimism that the next generation of demography inference methods, coupled with emerging genomic data sets, will allow researchers an unprecedented capability to investigate the demographic history of additional species.

Author post: Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

This guest post is by Jared Decker on his preprint (with colleagues) “Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle“, arXived here. The post is a response to the review posted by Joe Pickrell here.

I have posted an updated version of my preprint on arXiv. Because Joe Pickrell posted his review of my preprint “Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle” on Haldane’s Sieve, I thought readers might enjoying seeing my response. I have really enjoyed having the process open to the public.

Reviewers comments are in blue.
My comments are in black.
Quotes from the manuscript are in Arial font.

Reviewer #1 [Joe Pickrell]

Overall comments:

1. A lot of interpretation depends on the robustness of the inferred population graph from TreeMix. It would be extremely helpful to see that the estimated graph is consistent across different random starting points. The authors could run TreeMix, say, five different times, and compare the results across runs. I expect that many of the inferred migration edges will be consistent, but a subset will not. Itís probably most interesting to focus interpretation on the edges that are consistent.

We followed Reviewer #1ís recommendation and have included 6 phylogenetic networks (the original network and 5 replicates) as supplementary Figure S4. The admixed histories of several of the sample populations are quite complex, and as seen in Figure S4, the same relationships can be represented multiple ways. For example if population A is admixed between populations B, C, and D, it can be placed sister to population B with migration edges from C and D, or it can be placed sister to C with migration edges from B and D. We have tried to note in the manuscript when migration edges are not consistent. But, one of the main points of the paper, introgression for an ancestral population into the African taurine clade is consistent across all replicates.

To the third paragraph of the Admixture in Europe subsection we added, “The placement of Italian breeds is not consistent across independent TreeMix runs (Figure S4), likely due to their complicated history of admixture.”

In the second to last paragraph of the manuscript we state. “In TreeMix replicates, Texas Longhorn and Romosinuano are either sister to admixed Anatolian breeds or they receive a migration edge that originates near Brahman (Figure S4).

2. Throughout the manuscript, inference from genetics is mixed in with evidence from other sources. At points it sometimes becomes unclear which points are made strictly from genetics and which are not.

We have edited the manuscript by adding citations to clarify which inference is from genetics and which is from previous studies or breed histories.

For example, the authors write, “Anatolian breeds are admixed between European, African, and Asian cattle, and do not represent the populations originally domesticated in the region”. It seems possible that the first part of that statement (about admixture) could be their conclusion from the genetic data, but itís difficult to make the second statement (about the original populations in the region) from genetics, so presumably this is based on other sources.

We edited this sentence to say, “Anatolian breeds (AB, EAR, TG, ASY, and SAR) are admixed between blue European-like, grey African-like, and green indicine-like cattle (Figures 5 and 6), and we infer they do not represent the taurine populations originally domesticated in this region due to a history of admixture.

In general, I would suggest splitting the results internal to this paper apart from the other statements and making a clear firewall between their results and the historical interpretation of the results (right now the authors have a “Results and Discussion” section, but it might be easiest to do this by splitting the “Results” from the “Discussion”. But this is up to the authors.).

The corresponding authors of this manuscript (Decker and Taylor) prefer to have the results and discussion sections combined, so we appreciate Review #1 leaving that decision up to us. But, we recognize that he brings up a valid point and have strived to make the distinction between results and discussion clearer throughout the manuscript.

3. Related to the above point, could the authors add subsection headings to the results/discussion section? Right now the topic of the paper jumps around considerably from paragraph to paragraph, and at points I had difficulty following. One possibility would be to organize subheading by the claims made in the abstract, e.g. “Cline of indicine introgression into Africa”, “wild African auroch ancestry”, etc.

Subsection headings have been added.

Specific comments:

There are quite a few results claimed in this paper, so Iím going to split my comments apart by the results reported in the abstract. As mentioned above, it would be nice if the authors clearly stated exactly which pieces of evidence they view as supporting each of these, perhaps in subheadings in the Results section. In italics is the relevant sentence in the abstract, followed by my thoughts:

Using 19 breeds, we map the cline of indicine introgression into Africa.

This claim is based on interpretation of the ADMIXTURE plot in Figure 5. I wonder if a map might make this point more clearly than Figure 5, however; the three-letter population labels in Figure 5 are not very easy to read, especially since most readers will have no knowledge of the geographic locations of these breeds.

Map added as Figure 5, with previous ADMIXTURE figure as Figure 6 so that readers can still see individual breed ancestries.

“We infer that African taurine possess a large portion of wild African auroch ancestry, causing their divergence from Eurasian taurine.”

This claim appears to be largely based on the interpretation of the treemix plot in Figure 4. This figure shows an admixture edge from the ancestors of the European breeds into the African breeds. As noted above, it seems important that this migration edge be robust across different treemix runs. Also, labeling this ancestry as “wild African auroch ancestry” seem like an interpretation of the data rather than something that has been explicitly tested, since the authors don’t have wild African aurochs in their data.

This migration edge is robust across 6 different TreeMix runs. The edge is from a node that is ancestral to European, Asian, and African taurine, and this node is approximately halfway between the common ancestor of domesticated indicine and the common ancestor of domesticated taurine. African auroch are extinct. Most, if not all, bovine ancient DNA samples come from much colder climates than northern Africa. So we are unable to sample African aurochs.

But, we feel it is a strength of the TreeMix analysis to identify introgression from ancestral populations that have not been sampled. We feel the interpretation that the introgression is from African auroch is the most parsimonious explanation of our PCA, ADMIXTURE, and TreeMix results.

Additionally, the authors claim that this result shows “there was not a third domestication process, rather there was a single origin of domesticated taurine”. I may be missing something, but it seems that genetic data cannot distinguish whether a population was “domesticated” or “wild”. That is, it seems plausible that the source population tentatively identified in Figure 4 may have been independently domesticated. There may be other sources of evidence that refute this interpretation, but this is another example of where it would be useful to have a firewall between the genetic results and the interpretation in light of other evidence. The speculation about the role of disease resistance in introgression is similarly not based on evidence from this paper and should probably be set apart.

The claim that there was a single origin of domesticated taurine is based upon the topology of the phylogenetic network, as European, Asian, and African taurine all share a common ancestor, and the Asian clade is sister to the rest of the ingroup. This rules out the possibility of a separate domestication in Africa as a separate domestication would cause African domesticates to be sister to the rest of taurus. Larson and Burger (2013) do not consider admixture a separate domestication, and we choose to follow their definition. Two domestications with the resulting population in Africa a mixture of the two is not very parsimonious. The most parsimonious explanation is admixture from a wild relative.

We agree that we have not tested the influence of trypanosomiasis resistance on driving admixture, but we feel it is an interesting hypothesis that explains the force that drove admixture. We have rephrased the sentence as:

“We hypothesize that the introgression in Africa may have been driven by trypanosomiasis resistance in African auroch which may be the source of resistance in African taurine populations [48].”

“We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia.”

The cline of taurine/indicine hybridization is based on interpretation of ADMIXTURE plots and some follow-up f4 statistics. I found this difficult to follow, especially since a significant f4 statistic can have multiple interpretations. Perhaps the authors could draw out the proposed phylogeny for these breeds and explain the reasons they chose particular f4 statistics to highlight.

We have added a map figure so that the ADMIXTURE estimates will be easier to interpret in a geographic frame. We also added, From previous research [3] and Figures 2 and 3, these relationships should be tree-like if there were no admixture. For 53 of the possible 280 tests, the Z-score was more extreme than ±2.575829. The most extreme test statistics were f4(Wagyu, Mongolian; Simmental, Shorthorn) = -0.003 (Z-score = -5.21, other rearrangements of these groups had Z-scores of 7.32 and 16.55) and f4(Hanwoo, Wagyu; Piedmontese, Shorthorn) = 0.002 (Z-score = 4.90, other rearrangements of these groups had Z-scores of 21.79 and 27.77)

While the f4 statistics do have multiple interpretations, we do feel confident that the ADMIXTURE analysis highlights which interpretation is the most likely.

“We also identify the influence of species other than Bos taurus in the formation of Asian breeds.”

The conclusion that other species other than Bos taurus have introgressed into Asian breeds seems to be based on interpretation of branch lengths in the trees in Figures 2-3 and some f3 statistics. The interpretation of branch lengths is extremely weak evidence for introgression, probably not even worth mentioning. The f3 statistics are potentially quite informative though. For the breeds in question (Brebes and Madura), which pairs of populations give the most negative f3 statistics? This is difficult information to extract from Supplementary Table 2, where the populations appear to be sorted alphabetically. A table showing the (for example) five most negative f3 statistics could be quite useful here.

Supplementary Table 2 has been updated to report the 5 most negative statistics. The Z-scores for Brebes are smaller than -18 and the Z-scores for Madura are smaller than -13. We also note that these results are supported by the ADMIXTURE analysis.

In general, if the SNP ascertainment scheme is not extremely complicated (can the authors describe the ascertainment scheme for this array?), a negative f3 statistic is very strong evidence that a target population is admixed, which a significant f4 statistic only means that at least one of the four populations in the statistic is admixed. This might be a useful property for the authors.

The SNPs were ascertained multiple ways, they were either a SNP in the reference Hereford animal, discovered from Sanger resequencing of 9 breeds, or reduced representation sequencing of Angus, Holstein, or a pool of breeds. Most of the SNPs were ascertained in Hereford, Angus, or Holstein.

“We detect the pronounced influence of Shorthorn cattle in the formation of European breeds.”

This conclusion appears to be based on interpretation of ADMIXTURE plots in Figures S6-S9. Interpreting these types of plots is notoriously difficult. I wonder if the f3 statistics might be useful here: do the authors get negative f3 statistics in the populations they write ìshare ancestry with Shorthorn cattleî when using the Durham shorthorns as one reference?

Durham Shorthorn is the ancestral breed of Beef Shorthorn, Milking Shorthorn, and Lincoln Red (reference 30 from the manuscript), and as these are direct relationships (tree-like) we wouldnít expect significant f-statistics. We added Table S3 to report the negative f3 statistics for Maine Anjou, Santa Gertrudis, and Beefmaster. We suspect Belgian Blue have undergone too much change in allele frequencies due to intense selection and small effective population sizes since admixture to produce significant f3 statistics. We have edited the sentence to say:

“As shown in Figures S6 through S9, Table S3, and from their breed histories [31], many breeds share ancestry with Shorthorn cattle, including Milking Shorthorn, Beef Shorthorn, Lincoln Red, Maine-Anjou, Belgian Blue, Santa Gertrudis, and Beefmaster.”

Charolais and Holstein did not produce significant f3 statistics. Although they did produce significant f4 statistics, we choose to not report these.

“Iberian and Italian cattle possess introgression from African taurine.”

This conclusion is based on ADMIXTURE plots and treemix; it would be interesting to see the results from f3 statistics as well.

We added this as the last paragraph of the Admixture in Europe subsection.

“We also used f-statistics to explore the evidence for African taurine introgression into Spain and Italy. We did not see any significant f3 statistics, but this test may be underpowered because of the low-level of introgression. With Italian and Spanish breeds as a sister group and African breeds, including OulmËs Zaer, as the other sister group, we see 321 significant tests out of 1911 possible tests. Of these 321 significant tests, 218 contained Oulmes Zaer. We also calculated f4 statistics with the Spanish breeds as sister and the African taurine breeds as sister (excluding Oulmes Zaer). With this setup, out of the possible 675 tests we only see 1 significant test, f4(Berrenda en Negro, Pirenaica;Lagune, N’Dama (ND2)) = 0.0007, Z-score = 3.064. With Italian cattle as sister and African taurine as sister (excluding Oulmes Zaer), we see 17 significant test out of 90 possible. Patterson et al. [27] define the f4-ratio as f4(A, O; X, C)/f4(A, O; B, C), where A and B are a sister group, C is sister to (A,B), X is a mixture of B and C, and O is the outgroup. This ratio estimates the ancestry from B, denoted as α. We calculated this ratio using Shorthorn as A, Montbeliard as B, Lagune as C, Morucha as X, and Hariana as O. We choose Shorthorn, Montbeliard, Lagune, and Hariana as they appeared the least admixed in the ADMIXTURE analyses. We choose Morucha because it is solid red with African ancestry in Figure S10. This statistic estimates that Morucha is 91.23% European (α†= 0.0180993/0.0198386) and 8.77% African, which is similar to the proportion estimated by TreeMix. The multiple f4 statistics with Italian breeds as sister and African breeds as the opposing sister support African admixture into Italy. The f4-ratio test with Morucha also supports our conclusion of African admixture in Spain.”

We understand that the f4 statistics are not as easily interpreted, but the f4-ratio seems to have a straight-forward interpretation.

“American Criollo cattle are shown to be of Iberian, and not African, decent.”

I found this difficult to follow-the authors write that these breeds “derive 7.5% of their ancestry from African taurine introgression”, so presumably they are in fact partially of African descent?

We reworded as:

“American Criollo cattle are shown to be imported from Iberia, and not directly from Africa, and African ancestry is inherited via Iberian ancestors.”

“Indicine introgression into American cattle occurred in the Americas, and not Europe”

This conclusion seems difficult to make from genetic data. The authors identify “indicine” ancestry in American cattle, so I don’t see how they can determine whether this happened before or after a migration without temporal information. It would be helpful if the authors walk the reader through each logical step they’re making so that the reader can decide whether they believe the evidence for each step.

We added this sentence:

“To reiterate, Iberian cattle do not have indicine ancestry, American Criollo breeds originated from exportations from Iberia, Brahman cattle were developed in the United States in the 1880ís [31], and American Criollo breeds carry indicine ancestry, and the introgression likely occurred from Brahman cattle.”

Other responses [NB: these are responses to comments from another reviewer, but his/her comments are not printed]:

We have attempted to make the manuscript easier to read for a wider audience, but welcome additional feedback.
NOTE TO BLOG READERS: Please send me your feed back as well! @pop_gen_JED

We have rearranged the nodes of Figure 4 and we believe it is now easier to read.

The position of the migration edges denote where in time or ancestry the migration occurred. The more basal a migration edge is placed, the
migration occurred earlier in time or from a more divergent population.

As mentioned above, the placement of the migration edges is meaningful, so we prefer to keep the information displayed in this manner. We have added a brief explanation of TreeMix to the manuscript under the TreeMix analysis paragraph of the Methods section.

The geographic origin of all the populations is given in Table S1. We have edited these two sentences to say,

We find that the Indonesian Brebes (BRE) and Madura (MAD) breeds have significant Bos javanicus (BALI) ancestry demonstrated by the short branch lengths in Figures 2-4, shared ancestry with Bali in ADMIXTURE analyses (light green in Figures S7-S9), and significant f3 statistics (Table S2). The Indonesian Pesisir and Aceh and the Chinese Hainan and Luxi breeds also have Bali ancestry (migration edge c in Figure 4, and light green in Figures S8 and S9).

We agree that the reference to Murray adds confusion and have deleted these references from the manuscript.

We add “previously suggested” to this statement to identify that these two waves have previously been inferred in the literature from archeology and genetics. We also feel that the use of “possibly” suggests that this is an interpretation and not a concrete result. In regards to the evidence to support our interpretation, we see two analyses, ADMIXTURE and TreeMix, suggesting two clades of indicine introgression.

Durham Shorthorns are the ancestors of Beef Shorthorns, Milking Shorthorns, Lincoln Reds, Belgium Blues, and Maine Anjous. We add a parenthetical with a citation to reference 31 to clarify this.

Table 1 was moved to the supplement.

One of the main assumptions and conclusions of McTavish et al is that there are no pure taurines in Africa; all cattle in Africa have indicine ancestry. Our results suggest that this is not true and pure taurines do exist in Africa. We have added, “Thus, we conclude that contrary to the assumptions and conclusions of [57] cattle with pure taurine ancestry do exist in Africa.

Added “The f3 and f4 statistics look for correlations in allele frequencies that are not compatible with a bifurcating tree; these statistics provide support for admixture in the history of the tested populations [26,27].” as the first sentence of the f3 and f4 statistics subsection in the Methods section.

If cattle were separately domesticated in Africa they would be the most divergent taurine clade. But, TreeMix finds, separate from user intervention, that the best model for the relationship between indicine, Asian taurine, African taurine, and European taurine is indicine as the outgroup, European and African taurine† as sister groups, and Asian taurine as the most divergent taurine group. I.e. (indicine,(Asian taurine, (African taurine, European taurine))). But, this model also includes admixture from an unsampled ancestral population that is approximately equally divergent from taurine and indicine. Our sampling is quite extensive and has sampled populations across Europe and Africa. But, we are unable to sample African auroch as they are extinct. Rather than separate domestication and admixture being indistinguishable, the gene frequencies suggest that there was introgression into African domesticated taurine from an ancestral population. We strongly feel the most parsimonious explanation is introgression from African auroch.

From Stock and Gifford-Gonzalez 2013, “The central fact around which disparate speculations about the origins of African cattle turn is one upon which all can also agree: northern Africa was home to wild aurochsen, Bos primigenius, from the Middle Pleistocene onwards (Linseele 2004).” We have added citations to Stock and Gifford-Gonzalez 2013 and Linseele 2004 to our manuscript.

Other changes:

Changed “elucidate” to “reveal” in Author Summary.

Second paragraph of TreeMix subsection of Methods section: Changed migration rate to migration proportion

Results section, Worldwide patterns subsection, 2nd paragraph, 6th sentence: Changed “(Figure 4)” to “(Figures 4 and 5, discussed in detail in the following subsections)”.

Results section,
Divergence within the taurine lineage subsection, 1st paragraph. Added “
We also see some runs of TreeMix placing a migration edge from Chianina cattle to Asian taurines (Figure S4).

Author post: Physical constraints determine the logic of bacterial promoter architectures

Our next guest post is by Radu Zabet on his manuscript (with co-workers) Physical constraints determine the logic of bacterial promoter architectures, arXived here

Earlier last year we explored the possibility of understanding ‘real biology’ using our stochastic simulation framework GRiP ( That software simulates how transcription factors (TFs) find their target sites in the genome, using a combination of three-dimensional diffusion around and one-dimensional walk on the DNA. This biophysical mechanism is quite well studied and is commonly termed ‘facilitated diffusion’. Unlike a homing missile, the trace of a TF molecule to its target site occurs somewhat erratic, and with many other factors around, even ‘traffic jams’ on the DNA seem possible (that and other interesting phenomena were subject of two other arXiv contributions we put online last year – see Haldane’s Sieve for more, and the two publications: and

Often times, TF binding sites are closely packed or even overlapping. In our latest paper, we explore how the spacing of binding sites along the DNA can influence the probability of a “TF traffic jam” occurring, and thereby influencing the length of a TF’s “commute” to its binding site ( We notice that one of the promoter organisations that we predict would cause massive traffic jams is underrepresented among E. coli promoters, suggesting that this phenomenon may have an important biological role.

One of the most common approaches to predicting TF occupancy is statistical thermodynamics, which assumes that the system is in steady state. Here we show that under biologically relevant parameters, a TF might take longer than a cell cycle to arrive to its binding site when the promoter is organized in a “traffic jam” inducing way. Therefore, it is important to consider the dynamics of TF binding, rather than just the steady state.

Usually, transcriptional logic refers to the idea that the specific combinations of TFs that bind to a gene promoter control the expression level of that gene. We extend this notion of transcriptional logic by proposing that the response to multiple regulatory inputs can also depend on the dynamics of TF binding. In other words: not only the final combinatorial pattern, but also the order in which these sites are occupied matters. In this context, we suggest that the spatial organisation of the promoter encodes the logic, influenced by TF concentrations that modulate promoter occupancy dynamics in biologically relevant time scales.

Using computer simulations of the search process, we show that the logic of complex bacterial promoters can be explained by the combinatorial action of three commonly found basic building blocks: switches, barriers and clusters, whose characteristics we analyse in detail.

The precise spacing of TF binding sites plays a key role in our model, and we show that physically constrained promoter organizations are commonly found in bacterial genomes and are conserved.

Finally, we also developed a new web-based computational tool (faster GRiP, or fGRIP), which is able to generate the dynamics of promoter occupancy for bacterial systems. This tool is available at

Author post: Extensive Phenotypic Changes Associated with Large-scale Horizontal Gene Transfer

Our next guest post is by David Baltrus (@surt_lab) on his group’s preprint Extensive Phenotypic Changes Associated with Large-scale Horizontal Gene Transfer, posted on bioRviv here.

The function of modern pickup trucks is usually to haul heavy loads from point A to point B. However, the F-150 sitting in my driveway right now looks very different from its Model T ancestor from ~100 years ago. Over the years, as truck design has been modified and improved, all of the parts (brakes, air conditioning systems, doors, wheels, etc…) have been crafted to fit and work efficiently together. In process, each of the parts you see on a pickup truck today have been selectively co-evolving with all of the other design elements on the truck. The function of a house is to provide shelter.You can easily extend the the co-evolutionary metaphor from above to explain how different aspects of the house I live in relate to one another.

Some time ago, someone had the brilliant idea merge houses and pickup trucks into a camper. These hybrids between pickups and houses provide the functionality of being able to drive around, while also maintaining the ability to provide shelter. However, in the beginning, these hybrids likely didn’t accelerate as fast and consumed more energy and resources than unweighted pickups. They were likely a little taller than unweighted pickups, and as such might not be able to use certain bridges or tunnels. The brakes probably didn’t work as well. I can go on and on, but that would belabor the point I’m trying to make. In the beginning, if you just place two independently designed systems together Rube Goldberg style, the result will likely be functional but inefficient. Over the years, as engineers have worked to smoothly merge all of the systems of pickup and house together, campers have gotten much better at doing both jobs simultaneously.

Fig. 1: A truck-house hybrid is born. Images from Wikipedia

Fig. 1: A truck-house hybrid is born. Images from Wikipedia

Continue reading

Author post: Sex-biased microRNAs in Drosophila melanogaster

This guest post is by Antonio Marco (@amarcobio) on his paper: Sex-biased microRNAs in Drosophila melanogaster

The expression profile of a gene affects its evolutionary fate. Conversely, the evolutionary history of a gene is reflected in its expression pattern. Understanding the complex relationship between expression and evolution is a major challenge in evolutionary genetics. In particular, the evolution of sex-biased gene expression is an all-time favourite. With the advent of high-throughput technologies, sex-biased expression has been widely studied, and a number of significant observations are generally accepted.

First, there is a paucity of male-biased genes in the X chromosome (in X/Y species). However, recently emerged genes in the X tend to be male-biased. An ongoing demasculinization of X chromosomes may explain this pattern. (Interestingly, recent works suggest that demasculinization may not be happening in Drosophila.) On the contrary, female-biased genes are enriched in the X chromosome and less frequently found in the autosomes. However, these studies are based on protein-coding genes, and little is known about other genes.

MicroRNAs are short regulatory RNAs which repress translation. MicroRNAs are now known to be involved in many developmental process, including sex differentiation. Unlike protein-coding genes, microRNA genes frequently emerge de novo in the genome. Also, a microRNA transcript frequently produces multiple products, including other microRNAs or protein-coding genes. In summary, the biology of microRNAs is substantially different to that of protein-coding genes, and so must be its evolutionary dynamics. In a recent paper deposited in arXiv, I explore the evolutionary origin of sex-biased microRNAs in Drosophila melanogaster.

By analysing deep sequencing data from multiple sources I observed that sex-biased microRNAs are, as expected, involved in the reproductive function. Contrary to protein-coding genes, there is an enrichment of male-biased genes in the X chromosome. Also, there is no conclusive evidence of demasculinization affecting microRNAs. On the other hand, female-biased microRNAs are encoded in the autosomes. Interestingly, many female-biased microRNAs are encoded within the introns of female-biased protein-coding genes. A detailed analysis reveals that maternally transmitted microRNAs may be hitch-hiked by the maternal deposition of the host gene transcript. Ongoing work in the lab is aimed to confirm this hypothesis.

In summary, the chromosomal distribution of sex-biased expressed microRNAs is exactly the opposite we observe in protein-coding genes. This analysis suggests that this is a consequence of a differential evolutionary dynamics. As novel microRNAs frequently emerge in the X chromosome, they acquire ‘at birth’ male biased expression. However, instead of a movement out-of-the-X, these microRNAs get eventually lost. Hence, there is an enrichment of male-biased microRNAs in the X. On the contrary, female-biased expression is frequently acquired by microRNAs encoded in the intron of female expressed host genes. The origin and evolution of sex-biased microRNAs is, therefore, a consequence of a high rate of de novo emergence.

Author post: Joint analysis of functional genomic data and genome-wide association studies of 18 human traits

The following post is by Joe Pickrell [@joe_pickrell] on his preprint Joint analysis of functional genomic data and genome-wide association studies of 18 human traits, available on bioRxiv here.

Until recently, the field of human genetics struggled to identify genetic variants that influence complex traits and diseases like height or diabetes. With the arrival of genome-wide association studies (GWAS), studies now regularly identify tens to hundreds of genomic regions that contain such variants. The question going forward is clear: how do these variants influence traits?

One way to answer this question involves annotating variants according to their potential functions–does a given variant change the sequence of a protein? Or does it disrupt the splicing of a gene? Or does it fall in a regulatory region in an important cell type? Many groups (like those that are part of the ENCODE project) are generating hundreds of datasets that are potentially informative about these types of questions. But which of these hundreds of datasets are relevant when studying a given trait?


In this preprint, I develop a statistical method (an empirical Bayes hierarchical model) that takes summary statistics from a genome-wide association study of a given trait and identifies the types of genomic annotation that are relevant for the trait; software implementing this method is available here. I then applied this method to a set of 18 traits and 450 genomic annotations. Feedback on the method itself is of course welcome, but I’d also like to highlight what I think are the most interesting biological results:

  1. The relative importance of protein-coding versus regulatory variants varies across traits. The fraction of GWAS hits driven by changes in protein sequence depends on the trait, ranging from a low of around 2% up to around 20% (see above).
  2. Repressed chromatin is depleted for loci that influence traits. I was surprised to find that the most informative type of information for interpreting a GWAS is often repressed chromatin, which is depleted for loci influencing traits. This type of chromatin covers up to 70% of the genome.
  3. Cell type-specific DNase-I hypersensitive sites are enriched for loci that influence traits. A “hypothesis-free” scan across all regulatory regions in many tissues can identify unexpected connections between traits and tissues. For example, loci that influence bone density are enriched in gene regulatory regions in muscle tissue, and loci that influence Crohn’s disease are enriched in regulatory regions in fibroblasts.
  4. Incorporating functional information into a GWAS increases power to detect loci. Finally, re-weighting a GWAS using this method increases the number of loci identified in each GWAS by around 5%; many of the loci identified with this method have been replicated in larger studies.

I’m hopeful that this method (and others like it) will be useful in making the transition from identifying statistical associations in a GWAS to understanding the underlying biology; comments and criticisms are welcome.