Author post: Estimating transcription factor abundance and specificity from genome-wide binding profiles

This guest post is by Radu Zabet on his preprint (with Boris Adryan) “Estimating transcription factor abundance and specificity from genome-wide binding profiles”, arXived here.

Binding of transcription factors (TFs) to the genome controls gene activity by either increasing or reducing the rate of transcription. We previously used stochastic simulations of the TF search mechanism (facilitated diffusion, which combines three-dimensional diffusion with a one-dimensional random walk on the DNA) to investigate the binding of TFs to the genome; see http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0073714#pone-0073714-g006 and http://nar.oxfordjournals.org/content/42/7/4196; also covered on https://haldanessieve.org/2013/04/09/our-paper-the-effects-of-transcription-factor-competition-on-gene-regulation/ and https://haldanessieve.org/2014/01/10/author-post-physical-constraints-determine-the-logic-of-bacterial-promoter-architectures/. Our results confirmed that the binding profiles of TFs are mainly affected by two quantities: the binding energy between the TF and the DNA (usually represented by the Position Weight Matrix, PWM) and the number of TF molecules. This means that the binding profiles can be approximated by the equilibrium occupancy and, thus, instead of running computationally expensive stochastic simulations, one can use the statistical thermodynamics framework to predict these binding profiles.

The statistical thermodynamics framework entails computing the statistical weight of each possible configuration of the system (each specific combination of locations on the DNA where TF molecules are bound). The number of possible configurations grows combinatorially with the size of the DNA segment, which makes a direct computation of genome-wide profiles infeasible. We addressed this using several approximations within the statistical thermodynamics framework and, based on these approximations, derived an analytical solution. This allows genome-wide binding profiles to be computed by scanning the DNA, much as in more naïve PWM-based approaches. Our model takes four parameters as inputs: (i) the PWM scores, (ii) DNA accessibility data, (iii) the number of bound molecules and (iv) a factor that controls the specificity of the TF by rescaling the PWM scores. The first two are usually known from experimental data, while the last two are difficult to estimate from experiments and are usually computed by fitting the model to the data.
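As a rough illustration of this kind of scan, the Python sketch below computes a per-site occupancy from the four inputs. The functional form used here (a Boltzmann-like weight, accessibility times exp(score / lambda), competing against the summed weight of all sites) is an assumption chosen for illustration and is not quoted from the preprint. The function names and the toy PWM are hypothetical.

```python
import math

def pwm_score(pwm, site):
    """Sum the PWM entries matching each base of the site."""
    return sum(column[base] for column, base in zip(pwm, site))

def occupancy_profile(sequence, pwm, accessibility, n_bound, lam):
    """Return one occupancy value in [0, 1) per window start position.

    Assumed form (illustrative only):
        occ_j = N * w_j / (N * w_j + sum_i w_i),  w_j = a_j * exp(score_j / lam)
    """
    width = len(pwm)
    n_sites = len(sequence) - width + 1
    weights = [
        accessibility[j] * math.exp(pwm_score(pwm, sequence[j:j + width]) / lam)
        for j in range(n_sites)
    ]
    background = sum(weights)  # competition from every site on the segment
    if background == 0.0:      # nothing is accessible, so nothing is bound
        return [0.0] * n_sites
    return [n_bound * w / (n_bound * w + background) for w in weights]

# Toy width-2 PWM that prefers the site "AC" (hypothetical values).
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
profile = occupancy_profile("GACGT", toy_pwm, [1.0, 1.0, 1.0, 1.0], n_bound=10, lam=1.0)
```

In this sketch, raising `n_bound` pushes every accessible site towards saturation, while raising `lam` flattens the differences between strong and weak sites, which is the sense in which the fourth parameter controls specificity.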

To test our model, we applied it to five ChIP-seq data sets (for Drosophila Bicoid, Caudal, Giant, Hunchback and Kruppel). Our results confirmed that, when including DNA accessibility data, the model fits the ChIP-seq profile with high accuracy (correlation coefficient > 0.65 for 4/5 TFs). Interestingly, we found that most TFs display lower abundance (in the range of 10-1000) than previously estimated (10000-100000). In addition, we also observed that while Bicoid and Caudal display high specificity (and our model predicts with good accuracy their ChIP-seq profiles), Giant, Hunchback and Kruppel display a lower specificity. Finally, we would like to emphasize that our method is applicable to any eukaryotic system for which the required data is available and can be applied genome-wide.
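The post does not describe how the two free parameters were fitted, so the following Python sketch is only one plausible way to do it: an exhaustive grid search that keeps the pair (number of bound molecules, specificity factor) whose predicted profile has the highest Pearson correlation with the observed profile. The grid search, the function names and the toy predictor are all assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def fit_by_grid(predict, observed, n_values, lam_values):
    """Try every (n_bound, lam) pair and return (best_r, best_n, best_lam)."""
    best = None
    for n_bound in n_values:
        for lam in lam_values:
            r = pearson(predict(n_bound, lam), observed)
            if best is None or r > best[0]:
                best = (r, n_bound, lam)
    return best

def toy_predict(n_bound, lam, scores=(2.0, -2.0, 0.5, -1.0, 1.5)):
    """Hypothetical stand-in for the model: occupancy from fixed site scores."""
    weights = [math.exp(s / lam) for s in scores]
    background = sum(weights)
    return [n_bound * w / (n_bound * w + background) for w in weights]

# Pretend the "observed" profile was generated with n_bound=100, lam=1.0,
# then check that the grid search recovers those values.
observed = toy_predict(100, 1.0)
best_r, best_n, best_lam = fit_by_grid(
    toy_predict, observed, n_values=[10, 100, 1000], lam_values=[0.5, 1.0, 2.0]
)
```

With real ChIP-seq data the observed profile would be noisy, so the best correlation would fall below 1 and the grid would need to be finer than the three values per parameter used here.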

Our paper is accompanied by a how-to and all raw data to replicate our results: http://logic.sysbiol.cam.ac.uk/nrz/ChIPprofile/.

The evolution of genetic diversity in changing environments

Oana Carja, Uri Liberman, Marcus W. Feldman

The production and maintenance of genetic and phenotypic diversity under temporally fluctuating selection and the signatures of environmental and selective volatility in the patterns of genetic and phenotypic variation have been important areas of focus in population genetics. On one hand, stretches of constant selection pull the genetic makeup of populations towards local fitness optima. On the other, in order to cope with changes in the selection regime, populations may evolve mechanisms that create a diversity of genotypes. By tuning the rates at which variability is produced, such as the rates of recombination, mutation or migration, populations may increase their long-term adaptability. Here we use theoretical models to gain insight into how the rates of these three evolutionary forces are shaped by fluctuating selection. We compare and contrast the evolution of recombination, mutation and migration under similar patterns of environmental change and show that these three sources of phenotypic variation are surprisingly similar in their response to changing selection. We show that knowing the shape, size, variance and asymmetry of environmental runs is essential for accurate prediction of genetic evolutionary dynamics.

A novel Bayesian method for inferring and interpreting the dynamics of adaptive landscapes from phylogenetic comparative data

Josef C Uyeda, Luke J Harmon

Our understanding of macroevolutionary patterns of adaptive evolution has greatly increased with the advent of large-scale phylogenetic comparative methods. Widely used Ornstein-Uhlenbeck (OU) models can describe an adaptive process of divergence and selection. However, inference of the dynamics of adaptive landscapes from comparative data is complicated by interpretational difficulties, lack of identifiability among parameter values and the common requirement that adaptive hypotheses must be assigned a priori. Here we develop a reversible-jump Bayesian method of fitting multi-optima OU models to phylogenetic comparative data that estimates the placement and magnitude of adaptive shifts directly from the data. We show how biologically informed hypotheses can be tested against this inferred posterior of shift locations using Bayes Factors to establish whether our a priori models adequately describe the dynamics of adaptive peak shifts. Furthermore, we show how the inclusion of informative priors can be used to restrict models to biologically realistic parameter space and test particular biological interpretations of evolutionary models. We argue that Bayesian model-fitting of OU models to comparative data provides a framework for integrating multiple sources of biological data, such as microevolutionary estimates of selection parameters and paleontological time series, allowing inference of adaptive landscape dynamics with explicit, process-based biological interpretations.
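For readers unfamiliar with OU models, the Python sketch below simulates a single trait under the standard OU equation dX = alpha (theta - X) dt + sigma dW, with an optimum theta that shifts once. It is only meant to show what an "adaptive shift" looks like in such a model. It uses a simple Euler-Maruyama discretisation on one lineage, it is not the reversible-jump method of the abstract, and the function names and parameter values are hypothetical.

```python
import math
import random

def simulate_ou(x0, optimum_at, alpha, sigma, t_end, dt=0.01, seed=0):
    """Euler-Maruyama simulation of an OU process with a time-varying optimum.

    optimum_at(t) returns the optimum theta at time t; alpha is the strength
    of the pull towards the optimum and sigma the intensity of random drift.
    """
    rng = random.Random(seed)
    n_steps = int(round(t_end / dt))
    x = x0
    path = [x]
    for step in range(n_steps):
        theta = optimum_at(step * dt)
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x += alpha * (theta - x) * dt + noise
        path.append(x)
    return path

def shifted_optimum(t):
    """Hypothetical adaptive shift: the optimum jumps from 0 to 5 at t = 5."""
    return 0.0 if t < 5.0 else 5.0

path = simulate_ou(0.0, shifted_optimum, alpha=1.0, sigma=0.5, t_end=20.0)
```

After the shift the trait relaxes towards the new optimum at a rate set by `alpha`, which is why the placement and magnitude of shifts leave a signal in trait values at the tips of a phylogeny.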

Soft selective sweeps in complex demographic scenarios

Benjamin A Wilson, Dmitri Petrov, Philipp W Messer

Recent studies have shown that adaptation from de novo mutation often produces so-called soft selective sweeps, where adaptive mutations of independent mutational origin sweep through the population at the same time. Population genetic theory predicts that soft sweeps should be likely if the product of the population size and the mutation rate towards the adaptive allele is sufficiently large, such that multiple adaptive mutations can establish before one has reached fixation; however, it remains unclear how demographic processes affect the probability of observing soft sweeps. Here we extend the theory of soft selective sweeps to realistic demographic scenarios that allow for changes in population size over time. We first show that population bottlenecks can lead to the removal of all but one adaptive lineage from an initially soft selective sweep. The parameter regime under which such ‘hardening’ of soft selective sweeps is likely is determined by a simple heuristic condition. We further develop a generalized analytical framework, based on an extension of the coalescent process, for calculating the probability of soft sweeps under arbitrary demographic scenarios. Two important limits emerge within this analytical framework: In the limit where population size fluctuations are fast compared to the duration of the sweep, the likelihood of soft sweeps is determined by the harmonic mean of the variance effective population size estimated over the duration of the sweep; in the opposing slow fluctuation limit, the likelihood of soft sweeps is determined by the instantaneous variance effective population size at the onset of the sweep. We show that as a consequence of this finding the probability of observing soft sweeps becomes a function of the strength of selection. Specifically, in species with sharply fluctuating population size, strong selection is more likely to produce soft sweeps than weak selection. 
Our results highlight the importance of accurate demographic estimates over short evolutionary timescales for understanding the population genetics of adaptation from de novo mutation.
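The two limits in the abstract can be made concrete with a small Python sketch. It takes a list of per-generation population sizes and returns either their harmonic mean over the sweep (fast-fluctuation limit) or the size at the onset of the sweep (slow-fluctuation limit). The function names and the example trajectory are hypothetical, and the sketch treats the listed sizes as variance effective sizes without deriving them.

```python
def harmonic_mean(sizes):
    """Harmonic mean of a sequence of positive population sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

def relevant_population_size(sizes, onset, duration, fast_fluctuations):
    """Population size that governs the likelihood of soft sweeps.

    fast_fluctuations=True : harmonic mean over the duration of the sweep
    fast_fluctuations=False: instantaneous size at the onset of the sweep
    """
    if fast_fluctuations:
        return harmonic_mean(sizes[onset:onset + duration])
    return sizes[onset]

# Nine generations at one million individuals, then one sharp bottleneck.
sizes = [1_000_000] * 9 + [1_000]
fast_limit = relevant_population_size(sizes, onset=0, duration=10, fast_fluctuations=True)
slow_limit = relevant_population_size(sizes, onset=0, duration=10, fast_fluctuations=False)
```

For this trajectory the harmonic mean is roughly 10^4, about two orders of magnitude below the onset size of 10^6, because a single bottleneck generation dominates the sum of reciprocals. A strongly selected sweep that completes before the bottleneck would therefore experience a much larger relevant population size than a slow sweep that spans it.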

Author post: VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

This guest post is by Olly Burren and Chris Wallace on their preprint, VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes, arXived here.

The idea for this paper came from reading a study by Liu et al. (http://www.sciencedirect.com/science/article/pii/S0002929710003125) and from the fact that summary p values from genome-wide association studies (GWAS) are increasingly becoming publicly available. In the field of human disease, GWAS have been very successful in isolating regions of the genome that confer disease susceptibility. The next step, however, is to understand mechanistically how variation in these loci gives rise to this susceptibility. There are a myriad of pre-existing methods for integrating genetic and genomic datasets; however, things are complicated by the high degree of linkage disequilibrium between SNPs, which causes substantial inflation in the variance of any test statistic. This inter-SNP correlation must be taken into account, classically by permuting case/control status and recomputing association, which requires access to raw genotyping data. Indeed, this approach was taken in our previously published method (see Heinig et al., http://www.nature.com/nature/journal/v467/n7314/full/nature09386.html), which uses a non-parametric test to compare the distribution of GWAS p values from two sets of SNPs (“test” and “control”). As most researchers working with GWAS know, gaining access to raw genotyping data is often difficult, and it is unclear how to include meta-analysis and imputed data. Liu et al. got around this by estimating the inter-SNP correlation using public datasets and sampling from a multivariate normal distribution to generate simulated p values, analogous to the permuted p values obtained by permuting phenotype status when raw data are available. Their method, VEGAS, uses genotype data publicly available through the International HapMap project and aims to integrate GWAS results with trans eQTLs to identify causal disease genes.
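The simulation idea can be sketched in a few lines of Python: given a matrix of pairwise SNP correlations, draw correlated standard-normal scores via a Cholesky factor and convert each score to a two-sided p value. This is a minimal illustration under stated assumptions, not the VEGAS or VSEAMS code. The correlation matrix is a made-up example, the function names are hypothetical, and a real analysis would estimate the matrix from reference genotypes and use a numerical library.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular factor L of a positive-definite matrix, with L L^T = matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def simulate_null_p_values(ld_matrix, n_sims, seed=0):
    """Draw n_sims sets of correlated two-sided p values under the null."""
    rng = random.Random(seed)
    lower = cholesky(ld_matrix)
    n = len(ld_matrix)
    sims = []
    for _ in range(n_sims):
        indep = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Multiplying by L gives scores whose covariance is the LD matrix.
        z = [sum(lower[i][k] * indep[k] for k in range(i + 1)) for i in range(n)]
        # Two-sided p value for a standard-normal score: erfc(|z| / sqrt(2)).
        sims.append([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return sims

# Hypothetical correlation matrix for three SNPs in strong-to-moderate LD.
ld = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
null_p = simulate_null_p_values(ld, n_sims=1000)
```

Each simulated set plays the role of one permutation of case/control status, but needs only the correlation structure and not the individual-level genotypes of the study itself.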

Our thought was that by combining our previously published method with the VEGAS approach, we could create a novel approach that integrates genetic information from GWAS with functional information from, for example, a set of microarray experiments, crucially without the need for genotype information. The rationale was that this would help to prioritise future mechanistic studies, which can be costly and time-consuming to conduct. We also upped the stakes and decided to use 1000 Genomes Project genotyping information for our estimations, to allow application to dense-genotyping technologies. The result was a software pipeline that takes as input a gene set of interest, a matched ‘control’ set and a set of summary GWAS statistics, and computes an enrichment score.
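One way such an enrichment score could be put together is sketched below in Python. It computes a Wilcoxon rank-sum statistic comparing test-set and control-set p values, then standardises it against the same statistic computed on simulated null p values, so that the variance reflects inter-SNP correlation. This is an assumption about how the pieces might fit and is not taken from the VSEAMS source. The function names and the example numbers are hypothetical.

```python
import statistics

def rank_sum(test_values, control_values):
    """Wilcoxon rank-sum statistic of the test set (ties share average ranks)."""
    combined = sorted(
        [(v, 0) for v in test_values] + [(v, 1) for v in control_values]
    )
    total = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        n_test_in_tie = sum(1 for k in range(i, j) if combined[k][1] == 0)
        total += average_rank * n_test_in_tie
        i = j
    return total

def enrichment_z(test_p, control_p, null_pairs):
    """Standardise the observed rank sum against simulated null rank sums.

    null_pairs is a list of (simulated_test_p, simulated_control_p) tuples.
    A negative score means the test set holds smaller p values than expected.
    """
    observed = rank_sum(test_p, control_p)
    null_stats = [rank_sum(t, c) for t, c in null_pairs]
    return (observed - statistics.mean(null_stats)) / statistics.stdev(null_stats)

# Tiny made-up example: three simulated null draws, one observed data set.
null_pairs = [
    ([0.5, 0.9], [0.1, 0.3, 0.7]),
    ([0.1, 0.7], [0.3, 0.5, 0.9]),
    ([0.3, 0.5], [0.1, 0.7, 0.9]),
]
z = enrichment_z([0.01, 0.02], [0.5, 0.6, 0.7], null_pairs)
```

In practice the null would come from thousands of simulated draws, not three, and the test and control sets would each contain many SNPs.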

Note that this approach differs from the Bayesian model suggested by Pickrell (https://haldanessieve.org/2013/12/16/author-post-joint-analysis-of-functional-genomic-data-and-genome-wide-association-studies-of-18-human-traits): it focuses on comparing broad regions rather than on more targeted genomic annotation, and in that sense is perhaps more akin to pathway analysis. We do suggest, however, that functionally defined gene sets, such as those found by knock-down experiments in cell lines, may be more productive than manually annotated pathways, whose completeness can vary considerably.

To illustrate the method, we applied it to a large meta-analysis GWAS of type 1 diabetes (8000 cases vs 8000 controls) and to an interesting dataset examining the effect on gene expression of knocking down a series of 59 transcription factors in a lymphoblastoid cell line; see Cusanovich et al. (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004226). We identified three transcription factors, IKZF3, BATF and ESRRA, whose putative targets are significantly enriched for variation associated with type 1 diabetes susceptibility. IKZF3 overlaps a known type 1 diabetes susceptibility region, whereas BATF and ESRRA overlap other autoimmune susceptibility regions, validating our approach. Of course, there are caveats in interpreting results derived from cell lines; however, we think it’s promising that our top hit lies in a region already associated with type 1 diabetes susceptibility.

Once enrichment is detected, we use the quantities already computed in a simple technique to prioritise genes within the set. This allows the generation of a succinct list of genes that are responsible for the enrichment detected at the global level. Cross-referenced with other information, these can either be informative in their own right or be used to inform future studies.

This study is also an example of the preprint process speeding up scientific discovery. We knew about the Cusanovich dataset because they released a preprint on arXiv, which was caught by Haldane’s Sieve (https://haldanessieve.org/2013/10/22/the-functional-consequences-of-variation-in-transcription-factor-binding/) in October 2013. One email, and the authors kindly shared their complete results. Had we waited for it to be published in PLoS Genetics in March 2014, we’d have been five months behind where we are.

The major benefit is that all of the datasets employed are within the public domain. Our hope is that either this or other methods in the same vein will help to bridge the gap between GWAS and disease mechanisms, ultimately fuelling the development of new therapeutics.

Estimating transcription factor abundance and specificity from genome-wide binding profiles


Nicolae Radu Zabet, Boris Adryan
Comments: 39 pages, 25 figures, 10 tables
Subjects: Quantitative Methods (q-bio.QM)

The binding of transcription factors (TFs) is essential for gene expression. One important characteristic is the actual occupancy of a putative binding site in the genome. In this study, we propose an analytical model to predict genomic occupancy that incorporates the preferred target sequence of a TF in the form of a position weight matrix (PWM), DNA accessibility data (in case of eukaryotes), the number of TF molecules expected to be bound to the DNA and a parameter that modulates the specificity of the TF. Given actual occupancy data in form of ChIP-seq profiles, we backwards inferred copy number and specificity for five Drosophila TFs during early embryonic development: Bicoid, Caudal, Giant, Hunchback and Kruppel. Our results suggest that these TFs display a lower number of DNA-bound molecules than previously assumed (in the range of tens and hundreds) and that, while Bicoid and Caudal display a higher specificity, the other three transcription factors (Giant, Hunchback and Kruppel) display lower specificity in their binding (despite having PWMs with higher information content). This study gives further weight to earlier investigations into TF copy numbers that suggest a significant proportion of molecules are not bound to the DNA.

Are we able to detect mass extinction events using phylogenies ?

Sacha S.J. Laurent, Marc Robinson-Rechavi, Nicolas Salamin
Comments: 14 pages, 8 figures
Subjects: Populations and Evolution (q-bio.PE)

The estimation of the rates of speciation and extinction provides important information on the macro-evolutionary processes shaping biodiversity through time (Ricklefs 2007). Since the seminal paper by Nee et al. (1994), much work has been done to extend the applicability of the birth-death process, which now allows us to test a wide range of hypotheses on the dynamics of the diversification process. Several approaches have been developed to identify the changes in rates of diversification occurring along a phylogenetic tree. Among them, we can distinguish between lineage-dependent, trait-dependent, time-dependent and density-dependent changes. Lineage-specific methods identify changes in speciation and extinction rates (λ and μ, respectively) at inner nodes of a phylogenetic tree (Rabosky et al. 2007; Alfaro et al. 2009; Silvestro et al. 2011). We can also identify trait-dependence in macro-evolutionary rates if the states of the particular trait of interest are known for the species under study (Maddison et al. 2007; FitzJohn et al. 2009; Mayrose et al. 2011). It is also possible to look for concerted changes in rates on independent branches of the phylogenetic tree by dividing the tree into time slices (Stadler 2011a). Finally, density-dependent effects can be detected when changes of diversification are correlated with overall species number (Etienne et al. 2012). Most methods can correct for incomplete taxon sampling, by assigning species numbers at tips of the phylogeny (Alfaro et al. 2009; Stadler and Bokma 2013), or by introducing a sampling parameter (Nee et al. 1994). By taking into account this sampling parameter at time points in the past, one can also look for events of mass extinction (Stadler 2011a).