Differential meta-analysis of RNA-seq data from multiple studies

Andrea Rau (GABI), Guillemette Marot (INRIA Lille – Nord Europe, CERIM), Florence Jaffrézic (GABI)
(Submitted on 16 Jun 2013)

High-throughput sequencing is now regularly used for studies of the transcriptome (RNA-seq), particularly for comparisons among experimental conditions. For the time being, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential expression. As their cost continues to decrease, it is likely that additional follow-up studies will be conducted to re-address the same biological question. We demonstrate how p-value combination techniques previously used for microarray meta-analyses can be used for the differential analysis of RNA-seq data from multiple related studies. These techniques are compared to a negative binomial generalized linear model (GLM) including a fixed study effect on simulated data and real data on human melanoma cell lines. The GLM with fixed study effect performed well for low inter-study variation and small numbers of studies, but was outperformed by the meta-analysis methods for moderate to large inter-study variability and larger numbers of studies. To conclude, the p-value combination techniques illustrated here are a valuable tool to perform differential meta-analyses of RNA-seq data by appropriately accounting for biological and technical variability within studies as well as additional study-specific effects. An R package metaRNASeq is available on the R Forge.
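As a rough illustration of what a p-value combination looks like, here is a minimal sketch of Fisher's method, one classic combination technique. The abstract does not spell out the exact variant or weighting used, so treat this as an assumption-laden example rather than the metaRNASeq implementation; the function name `fisher_combine` is hypothetical. It takes the per-study p-values for a single gene and assumes they are independent.

```python
import math


def fisher_combine(pvalues):
    """Combine independent per-study p-values for one gene (Fisher's method)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Fisher's statistic is -2 * sum(log p); under the null it follows a
    # chi-square distribution with 2k degrees of freedom. We keep half of it.
    half_stat = -sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square survival function is
    # closed form: P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)
```

Two studies that each give a borderline p-value of 0.05 for the same gene combine to a smaller p-value, which is the gain in detection power that motivates pooling studies.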

SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads

Yinlong Xie, Gengxiong Wu, Jingbo Tang, Ruibang Luo, Jordan Patterson, Shanlin Liu, Weihua Huang, Guangzhu He, Shengchang Gu, Shengkang Li, Xin Zhou, Tak-Wah Lam, Yingrui Li, Xun Xu, Gane Ka-Shu Wong, Jun Wang
(Submitted on 29 May 2013)

Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining the sequences of many (but not all) of the genes from an organism with no reference genome. With the rapidly increasing throughputs and decreasing costs of next generation sequencing, RNA-Seq has gained in popularity; but given the short reads (e.g. 2 * 90 bp paired ends), de novo assembly to recover complete full length gene sequences remains an algorithmic challenge.
Results: We present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. Its performance was evaluated on 2Gb and 5Gb of transcriptome data from mouse and rice. Using the known transcripts from these two well-annotated genomes as a benchmark, we assessed how SOAPdenovo-Trans and other competing software handle the practical issues of alternative splicing and variable expression levels. Compared with other de novo transcriptome assemblers, SOAPdenovo-Trans provides higher contiguity, lower redundancy, and faster execution.

Stochastic gene expression with delay

Martin Jansen, Peter Pfaffelhuber
(Submitted on 28 May 2013)

The expression of genes usually follows a two-step procedure. First, a gene (encoded in the genome) is transcribed resulting in a strand of (messenger) RNA. Afterwards, the RNA is translated into protein. Classically, this gene expression is modeled using a Markov jump process including activation and deactivation of the gene, transcription and translation rates together with degradation of RNA and protein. We extend this model by adding delays (with arbitrary distributions) to transcription and translation. Such delays can mean, e.g., that RNA has to be transported to a different part of a cell before translation can be initiated. Already in the classical model, production of RNA and protein comes in bursts by activation and deactivation of the gene, resulting in a large variance of the number of RNA and proteins in equilibrium. We derive precise formulas for this second-order structure with the model including delay in equilibrium. As a general fact, the delay decreases the variance of the number of RNA and proteins.
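To make the classical model concrete, the sketch below simulates the delay-free Markov jump process with the standard Gillespie algorithm. It does not implement the delays that are the paper's contribution. The function name, the rate keys and the mass-action form of the propensities are all assumptions made for illustration.

```python
import random


def simulate_classical_model(t_end, rates, seed=0):
    """Gillespie simulation of the classical (delay-free) model.

    Returns the state (active, rna, protein) reached by time t_end.
    """
    rng = random.Random(seed)
    t, active, rna, protein = 0.0, 0, 0, 0
    # State change (d_active, d_rna, d_protein) of each reaction, in order.
    changes = [
        (+1, 0, 0),  # gene activation
        (-1, 0, 0),  # gene deactivation
        (0, +1, 0),  # transcription (only while the gene is active)
        (0, 0, +1),  # translation (one protein per RNA per event)
        (0, -1, 0),  # RNA degradation
        (0, 0, -1),  # protein degradation
    ]
    while True:
        propensities = [
            rates["activate"] * (1 - active),
            rates["deactivate"] * active,
            rates["transcribe"] * active,
            rates["translate"] * rna,
            rates["degrade_rna"] * rna,
            rates["degrade_protein"] * protein,
        ]
        total = sum(propensities)
        if total <= 0.0:
            break  # no reaction can fire
        t += rng.expovariate(total)  # exponential waiting time to next jump
        if t > t_end:
            break
        # Pick the reaction with probability proportional to its propensity.
        idx = rng.choices(range(len(changes)), weights=propensities)[0]
        if propensities[idx] == 0.0:
            continue  # guard against a floating-point edge case
        d_active, d_rna, d_protein = changes[idx]
        active += d_active
        rna += d_rna
        protein += d_protein
    return active, rna, protein
```

Running this many times with different seeds and recording the RNA and protein counts at a late time point gives an empirical estimate of the equilibrium variance that the abstract refers to for the classical case.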

Methods to study splicing from high-throughput RNA Sequencing data

Gael P. Alamancos, Eneritz Agirre, Eduardo Eyras
(Submitted on 22 Apr 2013)

The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they do, to facilitate the comparison and choice of methods.

Our paper: Inferring non-neutral regulatory change in pathways from transcriptional profiling data

This post is by Josh Schraiber on his paper (along with coauthors): Schraiber et al. Inferring non-neutral regulatory change in pathways from transcriptional profiling data arXived here.

We’ve known for a long time now that gene sequence alone does not determine phenotype. From the trivial example of differentiated cell types (which all have the same DNA) to now-common examples where species adapt to their environment by changing something other than protein-coding sequence, it’s clear that the expression level of a gene plays just as important a role in phenotypic development as does its sequence. Despite this fact, we still lack the kinds of tools that are widely available for detecting non-neutral evolution at the level of gene expression (in packages like PAML). Part of this problem lies in a fundamental lack of power. A single gene may have hundreds of sites, and the patterns that occur at all of those sites give us plenty of information to learn about accelerated substitution rates and the like. But a gene (in a given environment) has just one expression level, so the sample size is often small and power is reduced.

This same problem occurs, of course, in phylogenetic studies of quantitative characters at the organismal level. The difference is that in those cases, researchers typically have access to tens, if not hundreds, of species with good quality measurements. Unfortunately, transcriptome-wide gene expression data can be difficult and costly to collect, so large-scale studies are few and far between.

Instead of trying to leverage large collections of species, we sought to utilize one of the benefits of transcriptome-wide profiles: data from lots and lots of genes. A common practice in molecular evolution is to run tests for selection on a gene-by-gene basis and then look for functional groups that are overrepresented (e.g. Gene Ontology enrichment). We turned that around and instead started with a priori defined gene groups (in our case, from Gene Ontology), looking to detect signal for a history of lineage-specific gene expression evolution, by jointly analyzing all the genes in a group simultaneously.

Doing this would potentially run into a problem of overfitting: should we try to fit a separate rate of evolution for each gene in the group? Instead, we borrowed a page from Ziheng Yang’s book and assumed that the rate of evolution across genes was inverse-gamma distributed. We chose this distribution mostly for computational convenience, but it is important to note that it can cover a wide range of possibilities, from a model in which every gene evolves at the same rate to a distribution so fat-tailed that there is no average rate of evolution across the group! By fitting a distribution of rates across genes in a group, we are able to look for examples of lineage-specific evolution without being confounded by outlying genes.
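To see how one family of distributions spans those extremes, here is a small sketch that draws per-gene rates from an inverse-gamma distribution. It is not the fitting procedure from the paper; the function names and the shape/scale parameterization are assumptions made for illustration. It relies on the standard fact that if X is gamma distributed with shape a and scale 1/b, then 1/X is inverse-gamma with shape a and scale b, with mean b / (a - 1) when a > 1 and no finite mean otherwise.

```python
import random


def sample_inverse_gamma_rates(n_genes, shape, scale, seed=0):
    """Draw one evolutionary rate per gene from InvGamma(shape, scale)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); inverting a Gamma(a, 1/b)
    # draw gives an inverse-gamma draw with shape a and scale b.
    return [1.0 / rng.gammavariate(shape, 1.0 / scale) for _ in range(n_genes)]


def inverse_gamma_mean(shape, scale):
    """Return the mean rate, or None when the tail is too heavy for one to exist."""
    if shape <= 1.0:
        return None  # the "no average rate" regime
    return scale / (shape - 1.0)
```

Holding the mean fixed while increasing the shape (for example, scale = shape - 1) concentrates the draws around a single shared rate, while a shape at or below 1 gives the fat-tailed regime with no average rate.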

We encourage you to check out our paper and let us know what you think of our approach. In addition, our method will soon be available as an R package (once I get around to doing all the documentation…) and we would love to see people using it. If you are interested in getting an early version of our package, please don’t hesitate to contact me: jgschraiber@berkeley.edu.

Inferring non-neutral regulatory change in pathways from transcriptional profiling data

Joshua G. Schraiber, Yulia Mostovoy, Tiffany Y. Hsu, Rachel B. Brem
(Submitted on 19 Apr 2013)

An outstanding question in comparative genomics is the evolutionary importance of gene expression differences between species. Rigorous molecular-evolution methods to infer evidence for natural selection from transcriptional profiling data are at a premium in the field, and to date, phylogenetic approaches have not been well-suited to address the question in the small sets of taxa profiled in standard surveys of gene expression. To meet this challenge, we have developed a strategy to infer evolutionary histories from expression data by analyzing suites of genes of common function. In a manner conceptually similar to molecular-evolution models in which the evolutionary rates of DNA sequence at multiple loci follow a gamma distribution, we modeled expression of the genes of an a priori-defined pathway with rates drawn from an inverse-gamma distribution. We then developed a fitting strategy to infer the parameters of this distribution from expression measurements, and to identify gene groups whose expression patterns were consistent with evolutionary constraint or rapid evolution in particular species. Simulations confirmed the power and accuracy of our inference method. As an experimental testbed for our approach, we generated and analyzed transcriptional profiles of four Saccharomyces yeasts. The results revealed pathways with signatures of constrained and accelerated regulatory evolution in individual yeasts, and across the phylogeny, highlighting the prevalence of pathway-level expression change during the divergence of yeast species. We anticipate that our pathway-based phylogenetic approach will be of broad utility in the search to understand the evolutionary relevance of regulatory change.

Clusters of microRNAs emerge by new hairpins in existing transcripts

Antonio Marco, Maria Ninova, Matthew Ronshaugen, Sam Griffiths-Jones
(Submitted on 9 Apr 2013)

Genetic linkage may result in the expression of multiple products from a single polycistronic transcript, under the control of a single promoter. In animals, protein-coding polycistronic transcripts are rare. However, microRNAs are frequently clustered in the genomes of animals and plants, and these clusters are often transcribed as a single unit. The evolution of microRNA clusters has been the subject of much speculation, and a selective advantage of clusters of functionally related microRNAs is often proposed. However, the origin of microRNA clusters has not so far been systematically explored. Here we study the evolution of all microRNA clusters in Drosophila melanogaster, and suggest a number of models for their emergence. We observed that a majority of microRNA clusters arose by the de novo formation of new microRNA-like hairpins in existing microRNA transcripts. Some clusters also emerged by tandem duplication of a single microRNA. Comparative genomics shows that these clusters, once formed, are unlikely to split or undergo rearrangements. We did not find any instances of clusters appearing by rearrangement of pre-existing microRNA genes. We propose a model for microRNA cluster origin and evolution in which selection over one of the microRNAs in the cluster interferes with the evolution of the other tightly linked microRNAs. Our analysis suggests that the evolutionary study of microRNAs and other small RNAs must consider and account for linkage associations.

The effects of transcription factor competition on gene regulation

Nicolae Radu Zabet, Boris Adryan
(Submitted on 27 Mar 2013)

We performed stochastic simulations of transcription factor (TF) molecules translocating by facilitated diffusion (a combination of 3D diffusion in the cytoplasm and 1D random walk on the DNA), and consider various abundances of cognate and non-cognate TFs to assess the influence of competitor molecules that also move along the DNA. We show that molecular crowding on the DNA always leads to longer times required by TF molecules to locate their target sites as well as to lower occupancy, which may confer a general mechanism to control gene activity levels globally. Finally, we show that crowding on the DNA may increase transcriptional noise through increased variability of the occupancy time of the target sites.

The influence of transcription factor competition on the relationship between occupancy and affinity

Nicolae Radu Zabet, Robert Foy, Boris Adryan
(Submitted on 27 Mar 2013)

Transcription factors (TFs) are proteins that bind to specific sites on the DNA and regulate gene activity. Identifying where TF molecules bind and how much time they spend on their target sites is key for understanding transcriptional regulation. It is usually assumed that the free energy of binding of a TF to the DNA (the affinity of the site) is highly correlated to the amount of time the TF remains bound (the occupancy of the site). However, knowing the binding energy is not sufficient to infer actual binding site occupancy. This mismatch between the occupancy predicted by the affinity and the observed occupancy may be caused by various factors, such as TF abundance, competition between TFs or the arrangement of the sites on the DNA. We investigated the relationship between the affinity of a TF for a set of binding sites and their occupancy. In particular, we considered the case of lac repressor (lacI) in E. coli and performed stochastic simulations of the TF dynamics on the DNA for various combinations of lacI abundance in competition with TFs that contribute to macromolecular crowding. Our results showed that for medium and high affinity sites, TF competition does not play a significant role in genomic occupancy, except in cases when the abundance of lacI is significantly increased or when a low-information content PWM was used. Nevertheless, for medium and low affinity sites, an increase in TF abundance (for either lacI or other molecules) leads to an increase in occupancy at several sites. Keywords: facilitated diffusion, Position Weight Matrix, thermodynamic equilibrium, motif information content, molecular crowding

Gene expression in early Drosophila embryos is highly conserved despite extensive divergence of transcription factor binding

Mathilde Paris, Tommy Kaplan, Xiao Yong Li, Jacqueline E. Villalta, Susan E. Lott, Michael B. Eisen
(Submitted on 1 Mar 2013)

To better characterize how variation in regulatory sequences drives divergence in gene expression, we undertook a systematic study of transcription factor binding and gene expression in the blastoderm embryos of four species that sample much of the diversity in the 60-million-year-old genus Drosophila: D. melanogaster, D. yakuba, D. pseudoobscura and D. virilis. We compared gene expression, as measured by mRNA-seq, to the genome-wide binding of four transcription factors involved in early development, as measured by ChIP-seq (Bicoid, Giant, Hunchback and Krüppel). Surprisingly, we found that mRNA levels are much better conserved than individual binding events. We looked at binding characteristics that may explain such evolutionary disparity. As expected, we found that binding divergence increases with phylogenetic distance. Interestingly, binding events in non-coding regions that were bound strongly by single factors, or bound by multiple factors, were more likely to be conserved. As this class of sites is most likely to be involved in gene regulation, the divergence of other bound regions may simply reflect their lack of function. We used a model of quantitative trait evolution to compare the changes of gene expression with nearby regulatory TF binding. We found that changes in gene expression were poorly explained by changes in associated TF binding. These results suggest that some of the differences in sequence and binding have limited effect on gene expression or act in a compensatory manner to maintain the overall expression levels of regulated genes.