Methods to study splicing from high-throughput RNA Sequencing data

Methods to study splicing from high-throughput RNA Sequencing data
Gael P. Alamancos, Eneritz Agirre, Eduardo Eyras
(Submitted on 22 Apr 2013)

The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful mean to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they do, to facilitate the comparison and choice of methods.

Our paper: Inferring non-neutral regulatory change in pathways from transcriptional profiling data

This post is by Josh Schraiber on his paper (along with coauthors): Schraiber et al. Inferring non-neutral regulatory change in pathways from transcriptional profiling data arXived here.

We’ve known for a long time now that gene sequence alone does not determine phenotype. From the trivial example of differentiated cell types (which all have the same DNA) to now-common examples where species adapt to their environment by changing something other than protein-coding sequence, it’s clear that the expression level of a gene plays just as important a role in phenotypic development as does its sequence. Despite this fact, we still lack the kinds of tools that are widely available for detecting non-neutral evolution at the level of gene expression (in packages like PAML). Part of this problem lies in a fundamental lack of power. A single gene may have hundreds of sites, and the patterns that occur at all of those sites give us plenty of information to learn about accelerated substation rates and the like. But a gene (in a given environment) has just one expression level, so the sample size is often small and power is reduced.

This same problem occurs, of course, in phylogenetic studies of quantitative characters at the organismal level. The difference is that in those cases, researchers typically have access to tens, if not hundreds, of species with good quality measurements. Unfortunately, transcriptome-wide gene expression data can be difficult and costly to collect, so large-scale studies are few and far between.

Instead of trying to leverage large collections of species, we sought to utilize one of the benefits of transcriptome-wide profiles: data from lots and lots of genes. A common practice in molecular evolution is to run tests for selection on a gene-by-gene basis and then look for functional groups that are overrepresented (e.g. Gene Ontology enrichment). We turned that around and instead started with a priori defined gene groups (in our case, from Gene Ontology), looking to detect signal for a history of lineage-specific gene expression evolution, by jointly analyzing all the genes in a group simultaneously.

Doing this would potentially run into a problem of overfitting: should we try to fit a separate rate of evolution for each gene in the group? Instead, we borrowed a page from Ziheng Yang’s book and assumed that the rate of evolution across genes was inverse-gamma distributed. We chose this distribution mostly for for computational convenience, but it is important to note that it can cover a wide range of possibilities—from a model in which every gene evolves at the same rate to a distribution so fat-tailed that there is no average rate of evolution across the group! By fitting a distribution of rates across genes in a group, we are able to look for examples of lineage-specific evolution without being confounded by outlying genes.

We encourage you to check out our paper and let us know what you think
of our approach. In addition, our method will soon be available as an
R package (once I get around to doing all the documentation…) and we
would love to see people using it. If you are interested in getting an
early version of our package, please don’t hesitate to contact me:
jgschraiber@berkeley.edu.

Inferring non-neutral regulatory change in pathways from transcriptional profiling data

Inferring non-neutral regulatory change in pathways from transcriptional profiling data
Joshua G. Schraiber, Yulia Mostovoy, Tiffany Y. Hsu, Rachel B. Brem
(Submitted on 19 Apr 2013)

An outstanding question in comparative genomics is the evolutionary importance of gene expression differences between species. Rigorous molecular-evolution methods to infer evidence for natural selection from transcriptional profiling data are at a premium in the field, and to date, phylogenetic approaches have not been well-suited to address the question in the small sets of taxa profiled in standard surveys of gene expression. To meet this challenge, we have developed a strategy to infer evolutionary histories from expression data by analyzing suites of genes of common function. In a manner conceptually similar to molecular-evolution models in which the evolutionary rates of DNA sequence at multiple loci follow a gamma distribution, we modeled expression of the genes of an a priori-defined pathway with rates drawn from an inverse-gamma distribution. We then developed a fitting strategy to infer the parameters of this distribution from expression measurements, and to identify gene groups whose expression patterns were consistent with evolutionary constraint or rapid evolution in particular species. Simulations confirmed the power and accuracy of our inference method. As an experimental testbed for our approach, we generated and analyzed transcriptional profiles of four Saccharomyces yeasts. The results revealed pathways with signatures of constrained and accelerated regulatory evolution in individual yeasts, and across the phylogeny, highlighting the prevalence of pathway- level expression change during the divergence of yeast species. We anticipate that our pathway-based phylogenetic approach will be of broad utility in the search to understand the evolutionary relevance of regulatory change.

Clusters of microRNAs emerge by new hairpins in existing transcripts

Clusters of microRNAs emerge by new hairpins in existing transcripts
Antonio Marco, Maria Ninova, Matthew Ronshaugen, Sam Griffiths-Jones
(Submitted on 9 Apr 2013)

Genetic linkage may result in the expression of multiple products from a single polycistronic transcript, under the control of a single promoter. In animals, protein-coding polycistronic transcripts are rare. However, microRNAs are frequently clustered in the genomes of animals and plants, and these clusters are often transcribed as a single unit. The evolution of microRNA clusters has been the subject of much speculation, and a selective advantage of clusters of functionally related microRNAs is often proposed. However, the origin of microRNA clusters has not been so far systematically explored. Here we study the evolution of all microRNA clusters in Drosophila melanogaster, and suggest a number of models for their emergence. We observed that a majority of microRNA clusters arose by the de novo formation of new microRNA-like hairpins in existing microRNA transcripts. Some clusters also emerged by tandem duplication of a single microRNA. Comparative genomics show that these clusters, once formed, are unlikely to split or undergo rearrangements. We did not find any instances of clusters appearing by rearrangement of pre-existing microRNA genes. We propose a model for microRNA cluster origin and evolution in which selection over one of the microRNAs in the cluster interferes with the evolution of the other tightly linked microRNAs. Our analysis suggests that the evolutionary study of microRNAs and other small RNAs must consider and account for linkage associations.

The effects of transcription factor competition on gene regulation

The effects of transcription factor competition on gene regulation

Nicolae Radu Zabet, Boris Adryan
(Submitted on 27 Mar 2013)

We performed stochastic simulations of transcription factor (TF) molecules translocating by facilitated diffusion (a combination of 3D diffusion in the cytoplasm and 1D random walk on the DNA), and consider various abundances of cognate and non-cognate TFs to assess the influence of competitor molecules that also move along the DNA. We show that molecular crowding on the DNA always leads to longer times required by TF molecules to locate their target sites as well as to lower occupancy, which may confer a general mechanism to control gene activity levels globally. Finally, we show that crowding on the DNA may increase transcriptional noise through increased variability of the occupancy time of the target sites.

The influence of transcription factor competition on the relationship between occupancy and affinity

The influence of transcription factor competition on the relationship between occupancy and affinity

Nicolae Radu Zabet, Robert Foy, Boris Adryan
(Submitted on 27 Mar 2013)

Transcription factors (TFs) are proteins that bind to specific sites on the DNA and regulate gene activity. Identifying where TF molecules bind and how much time they spend on their target sites is key for understanding transcriptional regulation. It is usually assumed that the free energy of binding of a TF to the DNA (the affinity of the site) is highly correlated to the amount of time the TF remains bound (the occupancy of the site). However, knowing the binding energy is not sufficient to infer actual binding site occupancy. This mismatch between the occupancy predicted by the affinity and the observed occupancy may be caused by various factors, such as TF abundance, competition between TFs or the arrangement of the sites on the DNA. We investigated the relationship between the affinity of a TF for a set of binding sites and their occupancy. In particular, we considered the case of lac repressor (lacI) in E.coli and performed stochastic simulations of the TF dynamics on the DNA for various combinations of lacI abundance in competition with TFs that contribute to macromolecular crowding. Our results showed that for medium and high affinity sites, TF competition does not play a significant role in genomic occupancy, except in cases when the abundance of lacI is significantly increased or when a low-information content PWM was used. Nevertheless, for medium and low affinity sites, an increase in TF abundance (for both lacI or other molecules) leads to an increase in occupancy at several sites. Keywords: facilitated diffusion, Position Weight Matrix, thermodynamic equilibrium, motif information content, molecular crowding

Gene expression in early Drosophila embryos is highly conserved despite extensive divergence of transcription factor binding

Gene expression in early Drosophila embryos is highly conserved despite extensive divergence of transcription factor binding
Mathilde Paris, Tommy Kaplan, Xiao Yong Li, Jacqueline E. Villalta, Susan E. Lott, Michael B. Eisen
(Submitted on 1 Mar 2013)

To better characterize how variation in regulatory sequences drives divergence in gene expression, we undertook a systematic study of transcription factor binding and gene expression in the blastoderm embryos of four species that sample much of the diversity in the 60 million-year old genus Drosophila: D. melanogaster, D. yakuba, D. pseudoobscura and D. virilis. We compared gene expression, as measured by mRNA-seq to the genome-wide binding of four transcription factors involved in early development, as measured by ChIP-seq (Bicoid, Giant, Hunchback and Kr\”uppel). Surprisingly, we found that mRNA levels are much better conserved than individual binding events. We looked at binding characteristics that may explain such evolutionary disparity. As expected, we found that binding divergence increases with phylogenetic distance. Interestingly, binding events in non-coding regions that were bound strongly by single factors, or bound by multiple factors, were more likely to be conserved. As this class of sites are most likely to be involved in gene regulation, the divergence of other bound regions may simply reflect their lack of function. We used a model of quantitative trait evolution to compare the changes of gene expression with nearby regulatory TF binding. We found that changes in gene expression were poorly explained by changes in associated TF binding. These results suggest that some of the differences in sequence and binding have limited effect on gene expression or act in a compensatory manner to maintain the overall expression levels of regulated genes.

Our paper: Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Our next guest post is by Mike Eisen [@mbeisen] on his paper with Peter Combs [@rflrob]
Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression. arXived here.

This is cross posted from Mike’s blog.

It’s no secret to people who read this blog that I hate the way scientific publishing works today. Most of my efforts in this domain have focused on removing barriers to the access and reuse of published papers. But there are other things that are broken with the way scientists communicate with each other, and chief amongst them is pre-publication peer review. I’ve written about this before, and won’t rehash the arguments here, save to say that I think we should publish first, and then review. But one could argue that I haven’t really practiced what I preach, as all of my lab’s papers have gone through peer review before they were published.

No more. From now on we are going to post all of our papers online when we feel they’re ready to share – before they go to a journal. We’ll then solicit comments from our colleagues and use them to improve the work prior to formal publication. Physicists and mathematicians have been doing this for decades, as have an increasing number of biologists. It’s time for this to become standard practice.

Some ground rules. I will not filter comments except to remove obvious spam. You are welcome to post comments under your name or under a pseudonym – I will not reveal anyone’s identity – but I urge you to use your real name as I think we should have fully open peer review in science.

OK. Now for the paper, which is posted on arxiv and can be linked to, cited there. We also have a copy here, in case you’re having trouble with figures on arXiv.

Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Several years ago a postdoc in my lab, Susan Lott (now at UC Davis) developed methods to sequence the RNA’s from single Drosophila embryos. She was interested in looking at expression differences between males and females in early embryogenesis, and published a beautiful paper on that topic.

Although we were initially worried that we wouldn’t be able to get enough RNA from single embryos to get reliable sequencing results, it turns out we got more than enough. Each embryo yielded around 100ng of total RNA, and we would end up loading only ~10% of the sample onto the sequencer. So it occurred to us that maybe we could work with material from pieces of individual embryos and thereby get spatial expression information on a genomic scale in a single quick experiment – an alternative to highly informative, but slow imaging-based methods.

I recruited a new biophysics student, Peter Combs, to work on slicing embryos with a microtome along the anterior-posterior axis and sequencing each of the sections to identify genes with patterned expression along the A-P axis. In typical PI fashion, I figured this would take a few weeks, but it ended up taking over a year to get right.

The major challenge was that, while a tenth of an embyro contains more than enough RNA to analyze by mRNA-seq, it turned out to be very difficult to shepherd that RNA successfully from a single cryosection to the sequencer. Peter was routinely failing to recover RNA and make libraries from these samples using methods that worked great for whole embryos. While there are various protocols out there claiming to analyze RNA from single cells, we were reluctant to use these amplification-based strategies.

The typical way people deal with loss of small quantities of nucleic acids during experimental manipulation is to add carrier RNA or DNA – something like tRNA or salmon sperm DNA. We didn’t want to do that, since we would just end up with tons of useless sequencing reads. So we came up with a different strategy – adding embryos from distantly related Drosophila species to each slice at an early stage in the process. This brought the total amount of RNA in each sample well amove the threshold where our purification and library preparation worked robustly, and we could easily separate the D. melanogaster RNA we were interested in for this experiment from that of the “carrier” embryo. But we could avoid wasting sequencing reads by turning the carrier RNAs into an experiment of their own – in this case looking at expression variation between species.

With this trick, the method now works great, and the paper is really just a description of the method and a demonstration that accurate expression patterns can be recovered from individual cryosectioned embryos. The resolution here is not that great – we used 6 slices of ~60um each per embryo. But we’ve started to make smaller sections, and a back of the envelope calculation suggests we can, with available sample handling and sequencing techniques, make up to 100 slices per embryo. This would be more than enough to see stripes and other subtle patterns missed in the current dataset.

Our immediate near term goals are to do a developmental time course, compare patterns in male and female embryos, look at other species and examine embryos from strains carrying various patterning defects. For those of you going to the fly meeting in DC in April, Peter’s talk will, I hope, have some of this new data.

Anyway, we would love comments on either the method or the manuscript.

Sequencing mRNA from cryo-siced Drosophila embryos to determine genome-wide spatial patterns of gene expression

Sequencing mRNA from cryo-siced Drosophila embryos to determine genome-wide spatial patterns of gene expression
Peter A. Combs, Michael B. Eisen
(Submitted on 19 Feb 2013)

Complex spatial and temporal patterns of gene expression underlie embryo differentiation, yet methods do not yet exist for the efficient genome-wide determination of spatial patterns of gene expression. {\em In situ} imaging of transcripts and proteins is the gold-standard, but is difficult and time consuming to apply to an entire genome, even when highly automated. Sequencing, in contrast, is fast and genome-wide, but generally applied to homogenized tissues, thereby discarding spatial information. At some point, these methods will converge, and we will be able to sequence RNAs {\em in situ}, simultaneously determining their identity and location. As a step along this path, we developed methods to cryosection individual blastoderm stage {\em Drosophila melanogaster} embryos along the anterior-posterior axis and sequence the mRNA isolated from each 60\micron{} slice. The spatial patterns of gene expression we infer closely match patterns determined by {\em in situ} hybridization and microscopy, where such data exist, and thus we conclude that we have generated the first genome-wide map of spatial patterns in the {\em Drosophila} embryo. We identify numerous genes with spatial patterns that have not yet been screened in the several ongoing systematic in situ based projects, the majority of which are localized to the posterior end of the embryo, likely in the pole cells. This simple experiment demonstrates the potential for combining careful anatomical dissection with high-throughput sequencing to obtain spatially resolved gene expression on a genome-wide scale.

Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description

Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description
Marc Santolini, Thierry Mora, Vincent Hakim
(Submitted on 18 Feb 2013)

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair independently contributes to the transcription factor (TF) binding, despite mounting evidence of interdependence between base pairs positions. The recent availability of genome-wide data on TF-bound DNA regions offers the possibility to revisit this question in detail for TF binding {\em in vivo}. Here, we use available fly and mouse ChIPseq data, and show that the independent model generally does not reproduce the observed statistics of TFBS, generalizing previous observations. We further show that TFBS description and predictability can be systematically improved by taking into account pairwise correlations in the TFBS via the principle of maximum entropy. The resulting pairwise interaction model is formally equivalent to the disordered Potts models of statistical mechanics and it generalizes previous approaches to interdependent positions. Its structure allows for co-variation of two or more base pairs, as well as secondary motifs. Although models consisting of mixtures of PWMs also have this last feature, we show that pairwise interaction models outperform them. The significant pairwise interactions are found to be sparse and found dominantly between consecutive base pairs. Finally, the use of a pairwise interaction model for the identification of TFBSs is shown to give significantly different predictions than a model based on independent positions.