RNA-seq gene profiling – a systematic empirical comparison

Nuno A Fonseca, John A Marioni, Alvis Brazma

Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found, with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNA-seq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNA-seq data, it still allows us to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.

When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis

Mauricio A. Galdos, Alexandra A. Erwin, Michelle L. Wickersheim, Chris C. Harrison, Kendra D. Marr, Justin Blumenstiel

In sexually reproducing species, the union of gametes that are not closely related can result in genomic incompatibility. Hybrid dysgenic syndromes represent a form of genomic incompatibility that can arise when transposable element (TE) abundance differs between two parents. When TEs lacking in the female parent are transmitted paternally, a lack of corresponding silencing small RNAs (piRNAs) transmitted through the female germline can lead to TE mobilization in progeny. The epigenetic nature of this phenomenon is demonstrated by the fact that genetically identical females of the reciprocal cross are normal. Here we show that in the hybrid dysgenic syndrome of Drosophila virilis, an excess of paternally inherited TE families leads not only to increased expression of these TEs, but also coincides with derepression of TEs in equal abundance within parents. Moreover, TE derepression is stable as flies age and is associated with piRNA biogenesis defects for only some TEs. At the same time, TE activation is associated with a genome-wide shift in the distribution of endogenous gene expression and an increase in abundance of off-target genic piRNAs. To identify regions of the maternal genome that most protect against dysgenesis, we performed an F3 backcross analysis. We find that pericentric regions play a dominant role in maternal protection. This F3 backcross approach additionally allowed us to clarify the properties of genic paramutation in D. virilis. Overall, the results support a model in which early germline events in dysgenesis establish a chronic, stable state of mis-expression that is maintained through adulthood. Such early events in the germline that are mediated by parent-of-origin effects may be important in determining patterns of gene expression in natural populations.

Evolution of bow-tie architectures in biology

Tamar Friedlander, Avraham E. Mayo, Tsvi Tlusty, Uri Alon
Subjects: Molecular Networks (q-bio.MN)

Bow-tie or hourglass structure is a common architectural feature found in biological and technological networks. A bow-tie in a multi-layered structure occurs when intermediate layers have far fewer components than the input and output layers. Examples include metabolism, where a handful of building blocks mediate between multiple input nutrients and multiple output biomass components, and signaling networks, where information from numerous receptor types passes through a small set of signaling pathways to regulate multiple output genes. Little is known, however, about how bow-tie architectures evolve. Here, we address the evolution of bow-tie architectures using simulations of multi-layered systems evolving to fulfill a given input-output goal. We find that bow-ties spontaneously evolve when two conditions are met: (i) the evolutionary goal is rank deficient, where the rank corresponds to the minimal number of input features on which the outputs depend, and (ii) the effects of mutations on interaction intensities between components are described by a product rule, namely the mutated element is multiplied by a random number. Product-rule mutations are more biologically realistic than the commonly used sum-rule mutations that add a random number to the mutated element. These conditions robustly lead to bow-tie structures. The minimal width of the intermediate network layers (the waist or knot of the bow-tie) equals the rank of the evolutionary goal. These findings can help explain the presence of bow-ties in diverse biological systems, and can also be relevant for machine learning applications that employ multi-layered networks.
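
To make the distinction between the two mutation rules in the abstract concrete, here is a minimal Python sketch. It is not the authors' simulation code: the function names, the Gaussian distributions and the `sigma` value are assumptions chosen for illustration, since the abstract only says that a random number is added (sum rule) or multiplied (product rule).

```python
import random


def mutate_sum(weights, i, j, sigma=0.1, rng=random):
    """Sum-rule mutation: add a random number to one interaction strength.

    The Gaussian noise with standard deviation `sigma` is an assumption.
    """
    mutated = [row[:] for row in weights]  # copy so the input is untouched
    mutated[i][j] += rng.gauss(0.0, sigma)
    return mutated


def mutate_product(weights, i, j, sigma=0.1, rng=random):
    """Product-rule mutation: multiply one interaction strength by a random number.

    The Gaussian factor centred on 1 is an assumption.
    """
    mutated = [row[:] for row in weights]
    mutated[i][j] *= rng.gauss(1.0, sigma)
    return mutated


rng = random.Random(0)  # fixed seed so the example is repeatable
weights = [[0.0, 0.5],
           [2.0, 0.0]]  # toy interaction matrix between two layers

after_sum = mutate_sum(weights, 0, 0, rng=rng)
after_product = mutate_product(weights, 0, 0, rng=rng)
```

In this sketch an absent interaction (a zero entry) stays at zero under the product rule, whereas the sum rule turns it into a small non-zero value. That is one intuitive way to see why the two rules could favour different network structures, although the abstract itself does not spell out the mechanism.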

Estimating transcription factor abundance and specificity from genome-wide binding profiles


Nicolae Radu Zabet, Boris Adryan
Comments: 39 pages, 25 figures, 10 tables
Subjects: Quantitative Methods (q-bio.QM)

The binding of transcription factors (TFs) is essential for gene expression. One important characteristic is the actual occupancy of a putative binding site in the genome. In this study, we propose an analytical model to predict genomic occupancy that incorporates the preferred target sequence of a TF in the form of a position weight matrix (PWM), DNA accessibility data (in the case of eukaryotes), the number of TF molecules expected to be bound to the DNA and a parameter that modulates the specificity of the TF. Given actual occupancy data in the form of ChIP-seq profiles, we worked backwards to infer copy number and specificity for five Drosophila TFs during early embryonic development: Bicoid, Caudal, Giant, Hunchback and Kruppel. Our results suggest that these TFs display a lower number of DNA-bound molecules than previously assumed (in the range of tens and hundreds) and that, while Bicoid and Caudal display a higher specificity, the other three transcription factors (Giant, Hunchback and Kruppel) display lower specificity in their binding (despite having PWMs with higher information content). This study gives further weight to earlier investigations into TF copy numbers that suggest a significant proportion of molecules are not bound to the DNA.

READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad Ulrich Förstner, Jörg Vogel, Cynthia Mira Sharma

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. To draw biological conclusions from RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

A chromatin structure based model accurately predicts DNA replication timing in human cells

Yevgeniy Gindin, Manuel S. Valenzuela, Mirit I. Aladjem, Paul S. Meltzer, Sven Bilke
Subjects: Subcellular Processes (q-bio.SC); Genomics (q-bio.GN)

The metazoan genome is replicated in a precise, cell-lineage-specific temporal order. However, the mechanism controlling this orchestrated process is poorly understood, as no molecular mechanisms have been identified that actively regulate the firing sequence of genome replication. Here we develop a mechanistic model of genome replication capable of predicting the empirically observed replication timing program in humans with accuracy rivaling experimental repeats. In our model, replication is initiated in an uncoordinated (time-stochastic) manner at well-defined sites. The model contains, in addition to the choice of the genomic landmark that localizes initiation, only a single adjustable parameter of direct biological relevance: the number of replication forks. We find that DNase hypersensitive sites are optimal and independent determinants of DNA replication initiation. We demonstrate that the DNA replication timing program in human cells is a robust emergent phenomenon that, by its very nature, does not require a regulatory mechanism determining a proper replication initiation firing sequence.

motifDiverge: a model for assessing the statistical significance of gene regulatory motif divergence between two DNA sequences

Dennis Kostka, Tara Friedrich, Alisha K. Holloway, Katherine S. Pollard
(Submitted on 1 Feb 2014)

Next-generation sequencing technology enables the identification of thousands of gene regulatory sequences in many cell types and organisms. We consider the problem of testing if two such sequences differ in their number of binding site motifs for a given transcription factor (TF) protein. Binding site motifs impart regulatory function by providing TFs the opportunity to bind to genomic elements and thereby affect the expression of nearby genes. Evolutionary changes to such functional DNA are hypothesized to be major contributors to phenotypic diversity within and between species; but despite the importance of TF motifs for gene expression, no method exists to test for motif loss or gain. Assuming that motif counts are Binomially distributed, and allowing for dependencies between motif instances in evolutionarily related sequences, we derive the probability mass function of the difference in motif counts between two nucleotide sequences. We provide a method to numerically estimate this distribution from genomic data and show through simulations that our estimator is accurate. Finally, we introduce the R package motifDiverge that implements our methodology and illustrate its application to gene regulatory enhancers identified by a mouse developmental time course experiment. While this study was motivated by analysis of regulatory motifs, our results can be applied to any problem involving two correlated Bernoulli trials.
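
As a rough illustration of the quantity the abstract refers to, the sketch below computes the probability mass function of the difference between two binomial counts. It makes a strong simplifying assumption that the paper does not make: the two counts are treated as independent, whereas the authors explicitly allow for dependence between evolutionarily related sequences. The function names and parameters are hypothetical and are not taken from the motifDiverge R package.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def diff_pmf(d, n1, p1, n2, p2):
    """P(X - Y = d) for INDEPENDENT X ~ Bin(n1, p1) and Y ~ Bin(n2, p2).

    Independence is an assumption of this sketch only.
    """
    return sum(
        binom_pmf(y + d, n1, p1) * binom_pmf(y, n2, p2)
        for y in range(n2 + 1)
    )


# Toy parameters: 10 candidate motif positions in each sequence.
n1, p1, n2, p2 = 10, 0.3, 10, 0.2
distribution = {d: diff_pmf(d, n1, p1, n2, p2) for d in range(-n2, n1 + 1)}
```

Summing the tail of `distribution` beyond an observed difference would give a simple p-value under the independence assumption. Handling correlated counts, as the paper does, requires the joint distribution rather than this product of marginals.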

Impact of RNA degradation on measurements of gene expression

Irene Gallego Romero, Athma A. Pai, Jenny Tung, Yoav Gilad

The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. It is unclear if transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of degradation can be normalized, or whether different transcripts are degraded at different rates, potentially biasing measurements of expression levels. This concern has rendered the use of low quality RNA samples in whole-genome expression profiling problematic. Yet, low quality samples are at times the sole means of addressing specific questions – e.g., samples collected in the course of fieldwork. We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of RNA Integrity Number (RIN) values (a quality metric commonly used to assess RNA quality). We observed widespread effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library complexity in more degraded samples. While standard normalizations failed to account for the effects of degradation, we found that a simple linear model that controls for the effects of RIN can correct for the majority of these effects. We conclude that in instances where RIN and the effect of interest are not associated, this approach can help recover biologically meaningful signals in data from degraded RNA samples.
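
The idea of a simple linear model that controls for RIN can be sketched for a single gene as follows. This is an illustrative assumption about the general approach, not the authors' actual model specification: expression is regressed on RIN by ordinary least squares, and the fitted RIN effect is removed. The function name and the toy numbers are made up for the example.

```python
def regress_out(values, covariate):
    """Fit values ~ a + b * covariate by least squares and remove the b term.

    Returns the fitted slope and the values adjusted to the mean covariate.
    """
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    adjusted = [y - slope * (x - mean_x) for x, y in zip(covariate, values)]
    return slope, adjusted


# Toy values (not real data): expression of one gene in five samples
# whose RNA quality ranges from heavily degraded to intact.
rin = [2.0, 4.0, 6.0, 8.0, 10.0]
expression = [3.0, 4.0, 5.0, 6.0, 7.0]

slope, corrected = regress_out(expression, rin)
```

In this toy case the expression values depend only on RIN, so the corrected values are all equal. As the abstract notes, the correction is only helpful when RIN is not associated with the effect of interest, because otherwise removing the RIN effect would also remove the biological signal.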

Author post: Sex-biased microRNAs in Drosophila melanogaster

This guest post is by Antonio Marco (@amarcobio) on his paper: Sex-biased microRNAs in Drosophila melanogaster

The expression profile of a gene affects its evolutionary fate. Conversely, the evolutionary history of a gene is reflected in its expression pattern. Understanding the complex relationship between expression and evolution is a major challenge in evolutionary genetics. In particular, the evolution of sex-biased gene expression is an all-time favourite. With the advent of high-throughput technologies, sex-biased expression has been widely studied, and a number of significant observations are generally accepted.

First, there is a paucity of male-biased genes in the X chromosome (in X/Y species). However, recently emerged genes in the X tend to be male-biased. An ongoing demasculinization of X chromosomes may explain this pattern. (Interestingly, recent work suggests that demasculinization may not be happening in Drosophila.) In contrast, female-biased genes are enriched in the X chromosome and less frequently found in the autosomes. However, these studies are based on protein-coding genes, and little is known about other genes.

MicroRNAs are short regulatory RNAs which repress translation. MicroRNAs are now known to be involved in many developmental processes, including sex differentiation. Unlike protein-coding genes, microRNA genes frequently emerge de novo in the genome. Also, a microRNA transcript frequently produces multiple products, including other microRNAs or protein-coding genes. In summary, the biology of microRNAs is substantially different to that of protein-coding genes, and so must be their evolutionary dynamics. In a recent paper deposited in arXiv, I explore the evolutionary origin of sex-biased microRNAs in Drosophila melanogaster.

By analysing deep sequencing data from multiple sources, I observed that sex-biased microRNAs are, as expected, involved in the reproductive function. Contrary to protein-coding genes, there is an enrichment of male-biased genes in the X chromosome. Also, there is no conclusive evidence of demasculinization affecting microRNAs. On the other hand, female-biased microRNAs are encoded in the autosomes. Interestingly, many female-biased microRNAs are encoded within the introns of female-biased protein-coding genes. A detailed analysis reveals that maternally transmitted microRNAs may be hitch-hiked by the maternal deposition of the host gene transcript. Ongoing work in the lab aims to confirm this hypothesis.

In summary, the chromosomal distribution of sex-biased microRNAs is exactly the opposite of what we observe in protein-coding genes. This analysis suggests that this is a consequence of different evolutionary dynamics. As novel microRNAs frequently emerge in the X chromosome, they acquire male-biased expression ‘at birth’. However, instead of moving out of the X, these microRNAs are eventually lost. Hence, there is an enrichment of male-biased microRNAs in the X. In contrast, female-biased expression is frequently acquired by microRNAs encoded in the introns of female-expressed host genes. The origin and evolution of sex-biased microRNAs is, therefore, a consequence of a high rate of de novo emergence.

Sex-biased expression of microRNAs in Drosophila melanogaster

Antonio Marco
(Submitted on 11 Dec 2013)

Most animals have separate sexes. The differential expression of gene products, in particular that of gene regulators, underlies sexual dimorphism. Analyses of sex-biased expression have focused mostly on protein-coding genes. Several lines of evidence indicate that microRNAs, a class of major gene regulators, are likely to have a significant role in sexual dimorphism. This role has not been systematically explored so far. Here I study the sex-biased expression pattern of microRNAs in the model species Drosophila melanogaster. As with protein-coding genes, sex-biased microRNAs are associated with the reproductive function. Strikingly, contrary to protein-coding genes, male-biased microRNAs are enriched in the X chromosome whilst female-biased microRNAs are mostly autosomal. I propose that the chromosomal distribution is a consequence of high rates of de novo emergence, and a preference of new microRNAs to be expressed in the testis. I also suggest that demasculinization of the X chromosome may not affect microRNAs. Interestingly, female-biased microRNAs are often encoded within protein-coding genes that are also expressed in females. These results strongly suggest that the sex-biased expression of microRNAs is mainly a consequence of high rates of microRNA emergence in the X (male bias) or hitch-hiked expression by host genes (female bias).