Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description

Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description
Marc Santolini, Thierry Mora, Vincent Hakim
(Submitted on 18 Feb 2013)

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair independently contributes to the transcription factor (TF) binding, despite mounting evidence of interdependence between base pairs positions. The recent availability of genome-wide data on TF-bound DNA regions offers the possibility to revisit this question in detail for TF binding {\em in vivo}. Here, we use available fly and mouse ChIPseq data, and show that the independent model generally does not reproduce the observed statistics of TFBS, generalizing previous observations. We further show that TFBS description and predictability can be systematically improved by taking into account pairwise correlations in the TFBS via the principle of maximum entropy. The resulting pairwise interaction model is formally equivalent to the disordered Potts models of statistical mechanics and it generalizes previous approaches to interdependent positions. Its structure allows for co-variation of two or more base pairs, as well as secondary motifs. Although models consisting of mixtures of PWMs also have this last feature, we show that pairwise interaction models outperform them. The significant pairwise interactions are found to be sparse and found dominantly between consecutive base pairs. Finally, the use of a pairwise interaction model for the identification of TFBSs is shown to give significantly different predictions than a model based on independent positions.

Thoughts on “Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes”

Our next guest post is by Pleuni Pennings [@pleunipennings] with her thoughts on:
Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes, Duncan Palmer, John Frater, Rodney Philips, Angela McLean, Gil McVean, arXived here


Last week, a group of people from Oxford University published an interesting paper on the ArXiv. The paper is about using genealogical data (from HIV sequences), in combination with cross-sectional data (on patient and HIV phenotypes) to infer rates of evolution in HIV.

My conclusion: the approach is very interesting, and it makes total sense to use genealogical data to improve the inference from cross-sectional data. In fact, it is quite surprising to me that inferring rates from cross-sectional data works at all. However, in a previous paper by (partly) the same people, they show that it is possible to infer rates from using cross-sectional data only, and the estimates they get are very similar to the estimates from longitudinal data. The current paper provides a new and improved method, whose results are consistent with the previous papers.

The biological conclusion of the paper is that HIV adaptation is slower than many previous studies suggested. Case studies of fast evolution of the virus suffer from extreme publication bias and give the impression that evolution in HIV is always fast, whereas cross-sectional and longitudinal data show that evolution is often slow. Waiting times for CTL-escape and reversion are on the order of years.

1. What rates are they interested in?

The rates of interest here are the rate of escape from CTL pressure and the rate of reversion if there is no CTL pressure.

When someone is infected with HIV, the CTL response by the immune system of the patient can reduce the amount of virus in the patient. CTL stands for cytotoxic lymphocytes. Which amino-acid sequences (epitopes) can be recognized by the host’s CTL response depends on the HLA genotype of the host.
Suppose I have a certain HLA genotype X, such that my CTLs can recognize virus with a specific sequence of about 9 amino acids, let’s call this sequence Y. To escape from the pressure of these CTLs, the virus can mutate sequence Y to sequence Y’. A virus with sequence Y’ is called an escape mutant. The host (patient) with HLA X is referred to as a “matched host” and hosts without HLA X are referred to as “unmatched.” The escape mutations are thought to be costly for the virus.
So, for each CTL epitope there are 4 possible combinations of host and virus:
1. matched host and wildtype virus (there is selection pressure on the virus to “escape”)
2. matched host and escape mutant virus
3. unmatched host and wildtype virus
4. unmatched host and escape mutant virus (there is selection pressure on the virus to revert)

The question is “how fast does the virus escape if it is in a matched host and how fast does it revert if it is in an unmatched host?”

2. Why do we want to know these rates?

First of all, just out of curiosity, it is interesting to study how fast things evolve – it is surprising how little we know about rates of adaptive evolution. Secondly, because escape rates are relevant for the success of a potential HIV vaccine, if escape rates are high, then vaccines will probably not be very successful.

3. What are cross-sectional data and how can we infer rates from them?

Cross-sectional data are snap-shots of the population, with information on hosts and their virus. Here, it is the number of matched and unmatched hosts with wildtype and escape virus at a given point in time.

So how do these data tell us what escape rates and reversion rates are? Intuitively, it is easy to see how very high or very low rates would shape the data. For example, if escape and reversion would happen very fast, then the virus would always be perfectly adapted: we’d only find wildtype virus in unmatched hosts and only escape mutant virus in matched hosts. Conversely, if escape and reversion would be extremely slow, than the fraction of escape mutant virus would not differ between matched and unmatched hosts. Everyone would be infected with a random virus and this would never change.
The real situation is somewhere in between: the fraction of escape mutant virus is higher in matched hosts than in unmatched hosts. With the help of an standard epidemiological SI-model (ODE-model) and an estimate of the age of the epidemic, the fraction of escape mutant virus in the two types of hosts translates into estimates of the rates of escape and reversion. In the earlier paper, this is exactly what the authors did, and the results make a lot of sense. Rates range from months to years, reversion is always slower than escape, and there are large differences between CTLs. The results also matched well with data from longitudinal studies. In a longitudinal study, the patients are followed over time and evolution of the virus can be more directly observed. This is much more costly, but a much better way to estimate rates.

4. Why are the estimates from cross-sectional data not good enough?

Unfortunately, the estimates from cross-sectional data are only point estimates, and maybe not very good ones. The problem is that the method (implicitly) assumes that each virus is independently derived from an ancestor at the beginning of the epidemic. For example, if there are a lot of escape mutant viruses in the dataset, then the estimated rate of escape will be high. However, the high number of escape mutant virus may be due to one or a few escape events early on in the epidemic that got transmitted to a lot of other patients. It is a classical case of non-independence of data. It could lead us to believe that we can have more confidence in the estimates than we should have.

5. Genealogical data to the rescue!

Fortunately, the authors have viral sequences that provide much more information than just whether or not the virus is an escape mutant. The sequences of the virus can inform us about the underlying genealogical tree and can tell us how non-independent the data really are (two escape mutants that are very close to each other in the tree are not very independent). The goal of the current paper is to use the genealogical data to get better estimates of the escape and reversion rates.

A large part of the paper deals with the nuts and bolts of how to combine all the data, but in essence, this is what they do: They first estimate the genealogical tree for the viruses of the patients for which they have data (while allowing for uncertainty in the estimated tree). Then they add information on the states of the tips (wildtype vs escape for the virus and matched vs unmatched for the patient), and use the tree with the tip-labels to estimate the rates. This seems to be a very useful new method, that may give better estimates and a natural way to get credible intervals for the estimates.

The results they obtain with the new method are similar to the previous results for three CTL epitopes and slower rates for one CTL epitope. The credible intervals are quite wide, which shows that the data (from 84 patients) really don’t contain a whole lot of information about the rates, possibly because the trees are rather star-shaped, due to the exponential growth of the epidemic. Interestingly, the fact that the tree is rather star-shaped could explain why the older approach (based only on cross-sectional data) worked quite well. However, this will not necessarily be the case for other datasets.

Question for the authors

Do you use the information about the specific escape mutations in the data? Certainly not all sequences that are considered “escape mutants” carry exactly the same nucleotide changes? Whenever they carry different mutations, you know they must be independent.

Robust estimation of microbial diversity in theory and in practice

Robust estimation of microbial diversity in theory and in practice
Bart Haegeman, Jérôme Hamelin, John Moriarty, Peter Neal, Jonathan Dushoff, Joshua S. Weitz
(Submitted on 15 Feb 2013)

Quantifying diversity is of central importance for the study of structure, function and evolution of microbial communities. The estimation of microbial diversity has received renewed attention with the advent of large-scale metagenomic studies. Here, we consider what the diversity observed in a sample tells us about the diversity of the community being sampled. First, we argue that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions. The reason for this is that sample data do not contain information about the number of rare species in the tail of species abundance distributions. We illustrate the difficulty in comparing species richness estimates by applying Chao’s estimator of species richness to a set of in silico communities: they are ranked incorrectly in the presence of large numbers of rare species. Next, we extend our analysis to a general family of diversity metrics (“Hill diversities”), and construct lower and upper estimates of diversity values consistent with the sample data. The theory generalizes Chao’s estimator, which we retrieve as the lower estimate of species richness. We show that Shannon and Simpson diversity can be robustly estimated for the in silico communities. We analyze nine metagenomic data sets from a wide range of environments, and show that our findings are relevant for empirically-sampled communities. Hence, we recommend the use of Shannon and Simpson diversity rather than species richness in efforts to quantify and compare microbial diversity.

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor
Simon Anders, Davis J. McCarthy, Yunshen Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, Mark D. Robinson
(Submitted on 15 Feb 2013)

RNA sequencing (RNA-seq) has been rapidly adopted for the multilayered profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations), while optionally adjusting for other systematic factors that affect the data collection process. There are a number of subtle yet critical aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, thus there is a need for guidance on current best practices. This protocol presents a “state-of-the-art” computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and in particular, two widely-used tools DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 hour, with computation time <1 day, even with modest resources.

Muller’s ratchet with overlapping generations

Muller’s ratchet with overlapping generations
Jakob J. Metzger, Stephan Eule
(Submitted on 14 Feb 2013)

Muller’s ratchet is a paradigmatic model for the accumulation of deleterious mutations in a population of finite size. A click of the ratchet occurs when all individuals with the least number of deleterious mutations are lost irreversibly due to a stochastic fluctuation. In spite of the simplicity of the model, a quantitative understanding of the process remains an open challenge. In contrast to previous works, we here study a model of the ratchet with overlapping generations. Employing an approximation which describes the fittest individuals as one class and the rest as a second class, we obtain closed analytical expressions of the ratchet rate in the rare clicking regime. As a click in this regime is caused by a rare large fluctuation from a metastable state, we do not resort to a diffusion approximation but apply an approximation scheme which is especially well suited to describe extinction events from metastable states. This method also allows for a derivation of expressions for the quasi-stationary distribution of the fittest class. Additionally, we show numerically that the formulation with overlapping generations leads to similar results as the standard model with non-overlapping generations and the diffusion approximation in the regime where the ratchet clicks frequently. For parameter values closer to the rare clicking regime, however, we find that the rate of Muller’s ratchet strongly depends on the microscopic reproduction model.

Disentangling the effects of geographic and ecological isolation on genetic differentiation

Disentangling the effects of geographic and ecological isolation on genetic differentiation
Gideon Bradburd, Peter Ralph, Graham Coop
(Submitted on 13 Feb 2013)

Populations can be genetically isolated by geographic distance and by differences in their ecology or environment that decrease the rate of successful migration. Empirical studies often seek to investigate the relationship between genetic differentiation and some ecological variable(s) while accounting for geographic distance, but common approaches to this problem (e.g. the partial Mantel test) have a number of drawbacks. In this article, we present a Bayesian method that enables users to quantify the relative contributions of geographic distance and ecological distance to genetic differentiation between sampled populations or individuals. We model the allele frequencies in populations at a set of unlinked loci as spatial Gaussian processes, and model the covariance structure of pairs of populations as a decreasing function of both geographic and ecological distance between that pair. Parameters of the model are estimated using a Markov chain Monte Carlo algorithm. We have implemented this method, Bayesian Estimation of Differentiation in Alleles by Spatial Structure and Local Ecology (BEDASSLE), in a user-friendly format in the statistical platform R, and we demonstrate its utility with a simulation study and empirical applications to human and teosinte datasets.

Slow evolution of vertebrates with large genomes

Slow evolution of vertebrates with large genomes
Bianca Sclavi, John Herrick

Darwin introduced the concept of the “living fossil” to describe species belonging to lineages that have experienced little evolutionary change, and suggested that species in more slowly evolving lineages are more prone to extinction (1). Recent studies revealed that some living fossils such as the lungfish are indeed evolving more slowly than other vertebrates (2, 3). The reason for the slower rate of evolution in these lineages remains unclear, but the same observations suggest a possible genome size effect on rates of evolution. Genome size (C-value) in vertebrates varies over 200 fold ranging from pufferfish (0.4 pg) to lungfish (132.8 pg) (4). Variation in genome size and architecture is a fundamental cellular adaptation that remains poorly understood (5). C-value is correlated with several allometric traits such as body size and developmental rates in many, but not all, organisms (6, 7). To date, no consensus exists concerning the mechanisms driving genome size evolution or the effect that genome size has on species traits such as evolutionary rates (8-12). In the following we show that: 1) within the same range of divergence times, genetic diversity decreases as genome size increases and 2) average rates of molecular evolution decline with increasing genome size in vertebrates. Together, these observations indicate that genome size is an important factor influencing rates of speciation and extinction.