This guest post is by Radu Zabet on his preprint (with Boris Adryan) “Estimating transcription factor abundance and specificity from genome-wide binding profiles”, arXived here.
Binding of transcription factors (TFs) to the genome controls gene activity by either increasing or reducing the rate of transcription. We previously used stochastic simulations of the TF search mechanism (facilitated diffusion, which combines three-dimensional diffusion with a one-dimensional random walk along the DNA) to investigate how TFs bind to the genome; see http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0073714#pone-0073714-g006 and http://nar.oxfordjournals.org/content/42/7/4196; also covered on https://haldanessieve.org/2013/04/09/our-paper-the-effects-of-transcription-factor-competition-on-gene-regulation/ and https://haldanessieve.org/2014/01/10/author-post-physical-constraints-determine-the-logic-of-bacterial-promoter-architectures/. Our results confirmed that the binding profiles of TFs are mainly determined by two factors: the binding energy between the TF and the DNA (usually represented by a Position Weight Matrix, PWM) and the number of TF molecules. This means that the binding profiles can be approximated by the equilibrium occupancy, so instead of running computationally expensive stochastic simulations, one can use the statistical thermodynamics framework to predict them.
The statistical thermodynamics framework entails computing the statistical weight of each possible configuration of the system (each specific combination of locations on the DNA where TF molecules are bound). The number of possible configurations grows rapidly with the size of the DNA segment, which makes it infeasible to compute genome-wide profiles by direct enumeration. We addressed this with several approximations within the statistical thermodynamics framework and, based on these approximations, derived an analytical solution. This allows genome-wide binding profiles to be computed by scanning the DNA, much as in more naïve PWM-based approaches. Our model takes four inputs: (i) the PWM scores, (ii) DNA accessibility data, (iii) the number of bound molecules and (iv) a factor that controls the specificity of the TF by rescaling the PWM scores. The first two are usually known from experimental data, while the last two are difficult to measure experimentally and are usually estimated by fitting the model to the data.
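To give a feel for how the four inputs combine, here is a minimal Python sketch of a scan-based occupancy calculation. It is an illustration under stated assumptions, not the analytical solution from the preprint: the functional form (each site's weight competing against the summed weight of all sites), the toy PWM, and the names `pwm_scores`, `occupancy`, `n_molecules` and `lam` are all hypothetical choices made for this example. Only the forward strand is scanned.

```python
import math

# Hypothetical 3-bp PWM (log-odds-like scores) that favours the motif "ACG".
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]


def pwm_scores(seq, pwm):
    """Input (i): PWM score of every window of len(pwm) along seq."""
    k = len(pwm)
    return [
        sum(pwm[i][seq[j + i]] for i in range(k))
        for j in range(len(seq) - k + 1)
    ]


def occupancy(scores, accessibility, n_molecules, lam):
    """Approximate occupancy per site (assumed form, not the paper's formula).

    accessibility: input (ii), one value in [0, 1] per site.
    n_molecules:   input (iii), number of bound TF molecules.
    lam:           input (iv), rescales the PWM scores; larger values
                   flatten the score differences, i.e. lower specificity.
    """
    weights = [a * math.exp(w / lam) for w, a in zip(scores, accessibility)]
    background = sum(weights)  # competition from all sites on the segment
    if background == 0.0:      # fully inaccessible segment: nothing binds
        return [0.0] * len(weights)
    return [n_molecules * wt / (n_molecules * wt + background) for wt in weights]


seq = "TTACGTTTTT"
scores = pwm_scores(seq, TOY_PWM)
accessibility = [1.0] * len(scores)  # assume fully open chromatin
profile = occupancy(scores, accessibility, n_molecules=10, lam=1.0)
for j, p in enumerate(profile):
    print(j, seq[j:j + 3], round(p, 3))
```

The point of the sketch is the cost: one pass over the sequence and one pass over the scores, rather than an enumeration of every configuration of bound molecules.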
To test our model, we applied it to five ChIP-seq data sets (for Drosophila Bicoid, Caudal, Giant, Hunchback and Kruppel). Our results confirmed that, when DNA accessibility data are included, the model fits the ChIP-seq profiles with high accuracy (correlation coefficient > 0.65 for four of the five TFs). Interestingly, we found that most TFs display lower abundance (in the range of 10-1000 molecules) than previously estimated (10000-100000). We also observed that Bicoid and Caudal display high specificity (and our model predicts their ChIP-seq profiles with good accuracy), whereas Giant, Hunchback and Kruppel display lower specificity. Finally, we would like to emphasize that our method can be applied genome-wide and to any eukaryotic system for which the required data are available.
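Abundance and specificity estimates of this kind come from fitting the two free parameters to an observed profile. The Python sketch below shows one simple way such a fit could be set up: a grid search that minimises the mean squared error between predicted and observed profiles. Everything here is an assumption made for illustration: the occupancy form, the grid values, the error measure and the synthetic "observed" profile are hypothetical and are not taken from the preprint or its data.

```python
import math
from itertools import product


def occupancy(scores, accessibility, n_molecules, lam):
    """Assumed occupancy form (illustrative, not the paper's formula)."""
    weights = [a * math.exp(w / lam) for w, a in zip(scores, accessibility)]
    background = sum(weights)
    if background == 0.0:
        return [0.0] * len(weights)
    return [n_molecules * wt / (n_molecules * wt + background) for wt in weights]


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def fit(scores, accessibility, observed, n_grid, lam_grid):
    """Return the (abundance, specificity) pair with the lowest squared error."""
    best = None
    for n, lam in product(n_grid, lam_grid):
        predicted = occupancy(scores, accessibility, n, lam)
        mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
        if best is None or mse < best["mse"]:
            best = {"n_molecules": n, "lam": lam, "mse": mse,
                    "correlation": pearson(predicted, observed)}
    return best


# Synthetic example: made-up scores and accessibility, with an "observed"
# profile generated from known parameters so the fit can be checked.
scores = [3.0, -3.0, 1.0, -1.0, 0.5, -2.0]
accessibility = [1.0, 1.0, 0.5, 1.0, 1.0, 0.2]
observed = occupancy(scores, accessibility, n_molecules=100, lam=1.0)

result = fit(scores, accessibility, observed,
             n_grid=[10, 100, 1000], lam_grid=[0.5, 1.0, 2.0])
print(result)
```

A real ChIP-seq signal would first need to be normalised to a scale comparable with the predicted occupancy, and a finer grid or a proper optimiser would replace the three-by-three grid used here.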
Our paper is accompanied by a how-to and all raw data to replicate our results: http://logic.sysbiol.cam.ac.uk/nrz/ChIPprofile/.