READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad Ulrich Förstner, Jörg Vogel, Cynthia Mira Sharma

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and its documentation are available under DOI:10.6084/m9.figshare.977849.

A chromatin structure based model accurately predicts DNA replication timing in human cells

Yevgeniy Gindin, Manuel S. Valenzuela, Mirit I. Aladjem, Paul S. Meltzer, Sven Bilke
Subjects: Subcellular Processes (q-bio.SC); Genomics (q-bio.GN)

The metazoan genome is replicated in a precise, cell-lineage-specific temporal order. However, the mechanism controlling this orchestrated process is poorly understood, as no molecular mechanisms have been identified that actively regulate the firing sequence of genome replication. Here we develop a mechanistic model of genome replication capable of predicting, with accuracy rivaling experimental repeats, the empirically observed replication timing program in humans. In our model, replication is initiated in an uncoordinated (time-stochastic) manner at well-defined sites. The model contains, in addition to the choice of the genomic landmark that localizes initiation, only a single adjustable parameter of direct biological relevance: the number of replication forks. We find that DNase hypersensitive sites are optimal and independent determinants of DNA replication initiation. We demonstrate that the DNA replication timing program in human cells is a robust emergent phenomenon that, by its very nature, does not require a regulatory mechanism determining a proper replication initiation firing sequence.
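
The idea of uncoordinated initiation under a limited fork budget can be illustrated with a toy simulation. The sketch below is not the authors' model: the discrete bins, the unit fork speed, the uniform random choice among unfired sites, and the function name `simulate_replication` are all assumptions made for illustration only.

```python
import random


def simulate_replication(n_bins, sites, n_forks, rng):
    """Toy S-phase simulation (illustrative assumptions, not the published model).

    n_bins  : number of genomic bins on a linear chromosome
    sites   : bin indices where initiation may occur (e.g. DHS positions)
    n_forks : maximum number of simultaneously active forks
    rng     : a random.Random instance
    Returns a list with the time step at which each bin was replicated.
    """
    sites = sorted(set(sites))
    if not sites or n_forks < 2:
        raise ValueError("need at least one site and two forks")
    if sites[0] < 0 or sites[-1] >= n_bins:
        raise ValueError("sites must lie inside the chromosome")

    rep_time = [None] * n_bins
    active = []  # moving forks as (position, direction)
    n_done = 0
    t = 0
    while n_done < n_bins:
        # Stochastic initiation: fire random unreplicated sites while
        # a pair of forks is still available.
        free = n_forks - len(active)
        candidates = [s for s in sites if rep_time[s] is None]
        rng.shuffle(candidates)
        while free >= 2 and candidates:
            s = candidates.pop()
            rep_time[s] = t
            n_done += 1
            active.extend([(s, -1), (s, +1)])
            free -= 2
        # Elongation: every fork advances one bin; a fork is released when
        # it reaches a chromosome end or already replicated DNA.
        still_active = []
        for pos, direction in active:
            nxt = pos + direction
            if 0 <= nxt < n_bins and rep_time[nxt] is None:
                rep_time[nxt] = t + 1
                n_done += 1
                still_active.append((nxt, direction))
        active = still_active
        t += 1
    return rep_time


# With a single site the outcome is deterministic: timing grows with distance.
profile = simulate_replication(11, [5], 2, random.Random(0))
print(profile)
```

Averaging such profiles over many runs would give a mean timing curve per bin; in this toy version, as in the abstract, no firing order is prescribed anywhere.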

motifDiverge: a model for assessing the statistical significance of gene regulatory motif divergence between two DNA sequences

Dennis Kostka, Tara Friedrich, Alisha K. Holloway, Katherine S. Pollard
(Submitted on 1 Feb 2014)

Next-generation sequencing technology enables the identification of thousands of gene regulatory sequences in many cell types and organisms. We consider the problem of testing if two such sequences differ in their number of binding site motifs for a given transcription factor (TF) protein. Binding site motifs impart regulatory function by providing TFs the opportunity to bind to genomic elements and thereby affect the expression of nearby genes. Evolutionary changes to such functional DNA are hypothesized to be major contributors to phenotypic diversity within and between species; but despite the importance of TF motifs for gene expression, no method exists to test for motif loss or gain. Assuming that motif counts are Binomially distributed, and allowing for dependencies between motif instances in evolutionarily related sequences, we derive the probability mass function of the difference in motif counts between two nucleotide sequences. We provide a method to numerically estimate this distribution from genomic data and show through simulations that our estimator is accurate. Finally, we introduce the R package motifDiverge that implements our methodology and illustrate its application to gene regulatory enhancers identified by a mouse developmental time course experiment. While this study was motivated by analysis of regulatory motifs, our results can be applied to any problem involving two correlated Bernoulli trials.
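
To make the "difference of two correlated counts" concrete, here is a minimal Python sketch under a strong simplifying assumption that is not taken from the paper: both sequences contribute the same number `n` of paired positions, and the pairs are independent of each other. It does not reproduce the derivation in the paper or the interface of the motifDiverge R package; the function names and parameters are hypothetical.

```python
def convolve(a, b):
    """Discrete convolution of two probability vectors."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def count_difference_pmf(n, p10, p01):
    """PMF of D = sum(X_i) - sum(Y_i) for n independent pairs (X_i, Y_i).

    p10 = P(X=1, Y=0) and p01 = P(X=0, Y=1); the correlation within a pair
    enters only through these two joint probabilities.
    Returns (support, pmf) with support running from -n to n.
    """
    if p10 < 0 or p01 < 0 or p10 + p01 > 1:
        raise ValueError("p10 and p01 must be probabilities summing to <= 1")
    step = [p01, 1.0 - p10 - p01, p10]  # P(step = -1), P(step = 0), P(step = +1)
    pmf = [1.0]
    for _ in range(n):
        pmf = convolve(pmf, step)
    return list(range(-n, n + 1)), pmf


def two_sided_tail(support, pmf, observed):
    """P(|D| >= |observed|), one possible (assumed) way to score a difference."""
    return sum(p for d, p in zip(support, pmf) if abs(d) >= abs(observed))


support, pmf = count_difference_pmf(20, p10=0.05, p01=0.05)
print(two_sided_tail(support, pmf, observed=4))
```

Each pair contributes a step of -1, 0, or +1, so the distribution of the total difference is the n-fold convolution of a three-point distribution.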

Impact of RNA degradation on measurements of gene expression

Irene Gallego Romero, Athma A. Pai, Jenny Tung, Yoav Gilad

The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. It is unclear if transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of degradation can be normalized, or whether different transcripts are degraded at different rates, potentially biasing measurements of expression levels. This concern has rendered the use of low quality RNA samples in whole-genome expression profiling problematic. Yet, low quality samples are at times the sole means of addressing specific questions – e.g., samples collected in the course of fieldwork. We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of RNA Integrity Number (RIN) values (a quality metric commonly used to assess RNA quality). We observed widespread effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library complexity in more degraded samples. While standard normalizations failed to account for the effects of degradation, we found that a simple linear model that controls for the effects of RIN can correct for the majority of these effects. We conclude that in instances where RIN and the effect of interest are not associated, this approach can help recover biologically meaningful signals in data from degraded RNA samples.
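
As an illustration of what "a simple linear model that controls for the effects of RIN" could look like for a single gene, the sketch below fits expression against RIN by ordinary least squares and removes the fitted RIN trend. This is an assumed formulation, not the authors' code, and the function name `regress_out_rin` is hypothetical. As the abstract notes, such a correction is only sensible when RIN is not associated with the effect of interest.

```python
def regress_out_rin(expression, rin):
    """Remove the linear RIN trend from one gene's expression values.

    expression : expression levels of one gene, one value per sample
    rin        : RNA Integrity Number of each sample, same order
    Returns expression values adjusted to the mean RIN of the samples.
    """
    n = len(rin)
    if n != len(expression) or n < 2:
        raise ValueError("need matching vectors with at least two samples")
    mean_x = sum(rin) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in rin)
    if sxx == 0:
        raise ValueError("RIN does not vary across samples")
    # Ordinary least-squares slope of expression on RIN.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(rin, expression)) / sxx
    # Subtract the part of each value explained by its deviation from mean RIN.
    return [y - slope * (x - mean_x) for x, y in zip(rin, expression)]


# Toy values (not data from the study): expression depends only on RIN.
rin = [2.0, 4.0, 6.0, 8.0, 10.0]
expression = [5.0, 9.0, 13.0, 17.0, 21.0]
print(regress_out_rin(expression, rin))
```

In the toy input the expression values are an exact linear function of RIN, so the adjusted values collapse to a constant; with real data, the residual variation is what remains available for the biological comparison.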

Author post: Sex-biased microRNAs in Drosophila melanogaster

This guest post is by Antonio Marco (@amarcobio) on his paper: Sex-biased microRNAs in Drosophila melanogaster

The expression profile of a gene affects its evolutionary fate. Conversely, the evolutionary history of a gene is reflected in its expression pattern. Understanding the complex relationship between expression and evolution is a major challenge in evolutionary genetics. In particular, the evolution of sex-biased gene expression is an all-time favourite. With the advent of high-throughput technologies, sex-biased expression has been widely studied, and a number of significant observations are generally accepted.

First, there is a paucity of male-biased genes in the X chromosome (in X/Y species). However, recently emerged genes in the X tend to be male-biased. An ongoing demasculinization of X chromosomes may explain this pattern. (Interestingly, recent works suggest that demasculinization may not be happening in Drosophila.) On the contrary, female-biased genes are enriched in the X chromosome and less frequently found in the autosomes. However, these studies are based on protein-coding genes, and little is known about other genes.

MicroRNAs are short regulatory RNAs which repress translation. MicroRNAs are now known to be involved in many developmental processes, including sex differentiation. Unlike protein-coding genes, microRNA genes frequently emerge de novo in the genome. Also, a microRNA transcript frequently produces multiple products, including other microRNAs or protein-coding genes. In summary, the biology of microRNAs is substantially different to that of protein-coding genes, and so must be their evolutionary dynamics. In a recent paper deposited in arXiv, I explore the evolutionary origin of sex-biased microRNAs in Drosophila melanogaster.

By analysing deep sequencing data from multiple sources I observed that sex-biased microRNAs are, as expected, involved in the reproductive function. Contrary to protein-coding genes, there is an enrichment of male-biased genes in the X chromosome. Also, there is no conclusive evidence of demasculinization affecting microRNAs. On the other hand, female-biased microRNAs are encoded in the autosomes. Interestingly, many female-biased microRNAs are encoded within the introns of female-biased protein-coding genes. A detailed analysis reveals that maternally transmitted microRNAs may be hitch-hiked by the maternal deposition of the host gene transcript. Ongoing work in the lab aims to confirm this hypothesis.

In summary, the chromosomal distribution of sex-biased microRNAs is exactly the opposite of what we observe in protein-coding genes. This analysis suggests that this is a consequence of differential evolutionary dynamics. As novel microRNAs frequently emerge in the X chromosome, they acquire ‘at birth’ male-biased expression. However, instead of moving out of the X, these microRNAs are eventually lost. Hence, there is an enrichment of male-biased microRNAs in the X. On the contrary, female-biased expression is frequently acquired by microRNAs encoded in the introns of female-expressed host genes. The origin and evolution of sex-biased microRNAs is, therefore, a consequence of a high rate of de novo emergence.

Sex-biased expression of microRNAs in Drosophila melanogaster

Antonio Marco
(Submitted on 11 Dec 2013)

Most animals have separate sexes. The differential expression of gene products, in particular that of gene regulators, underlies sexual dimorphism. Analyses of sex-biased expression have focused mostly on protein-coding genes. Several lines of evidence indicate that microRNAs, a class of major gene regulators, are likely to have a significant role in sexual dimorphism. This role has not been systematically explored so far. Here I study the sex-biased expression pattern of microRNAs in the model species Drosophila melanogaster. As with protein-coding genes, sex-biased microRNAs are associated with the reproductive function. Strikingly, contrary to protein-coding genes, male-biased microRNAs are enriched in the X chromosome whilst female-biased microRNAs are mostly autosomal. I propose that the chromosomal distribution is a consequence of high rates of de novo emergence, and a preference of new microRNAs to be expressed in the testis. I also suggest that demasculinization of the X chromosome may not affect microRNAs. Interestingly, female-biased microRNAs are often encoded within protein-coding genes that are also expressed in females. These results strongly suggest that the sex-biased expression of microRNAs is mainly a consequence of high rates of microRNA emergence in the X (male bias) or hitch-hiked expression by host genes (female bias).

Comment on “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions” by Kim et al.

Alexander Dobin, Thomas R Gingeras

In the recent paper by Kim et al. (Genome Biology, 2013. 14(4): p. R36) the accuracy of TopHat2 was compared to other RNA-seq aligners. In this comment we re-examine the most important analyses from this paper and identify several deficiencies that significantly diminished the performance of some of the aligners, including incorrect choice of mapping parameters, unfair comparison metrics, and unrealistic simulated data. Using STAR (Dobin et al., Bioinformatics, 2013. 29(1): p. 15-21) as an exemplar, we demonstrate that correcting these deficiencies makes its accuracy equal to or better than that of TopHat2. Furthermore, this exercise highlighted some serious issues with the TopHat2 algorithms, such as poor recall of alignments with a moderate (>3) number of mismatches, low sensitivity and high false discovery rate for splice junction detection, loss of precision for the realignment algorithm, and a large number of false chimeric alignments.
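
For readers unfamiliar with the metrics named above, sensitivity and false discovery rate for splice junction detection can be computed by comparing predicted junctions against a known truth set, as in simulated data. The sketch below uses the standard definitions; the junction representation as `(chromosome, start, end)` tuples, the function name, and the toy coordinates are assumptions for illustration, not the evaluation code used in the comment.

```python
def junction_metrics(predicted, truth):
    """Sensitivity and false discovery rate of a set of predicted junctions.

    predicted, truth : iterables of hashable junction identifiers,
                       here (chromosome, start, end) tuples.
    Returns (sensitivity, fdr); a value is NaN when its denominator is zero.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Sensitivity (recall): fraction of true junctions that were detected.
    sensitivity = true_positives / len(truth) if truth else float("nan")
    # FDR: fraction of detected junctions that are not in the truth set.
    fdr = (len(predicted) - true_positives) / len(predicted) if predicted else float("nan")
    return sensitivity, fdr


# Toy coordinates, invented for illustration only.
truth = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90), ("chr2", 500, 700)]
predicted = [("chr1", 100, 200), ("chr2", 50, 90), ("chr2", 60, 95)]
sensitivity, fdr = junction_metrics(predicted, truth)
print(sensitivity, fdr)
```

Here two of four true junctions are recovered and one of three predictions is false, which shows why an aligner has to be judged on both numbers at once.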

Joint analysis of functional genomic data and genome-wide association studies of 18 human traits

Joseph Pickrell

Annotations of gene structures and regulatory elements can inform genome-wide association studies (GWAS). However, choosing the relevant annotations for interpreting an association study of a given trait remains challenging. We describe a statistical model that uses association statistics computed across the genome to identify classes of genomic element that are enriched or depleted for loci that influence a trait. The model naturally incorporates multiple types of annotations. We applied the model to GWAS of 18 human traits, including red blood cell traits, platelet traits, glucose levels, lipid levels, height, BMI, and Crohn’s disease. For each trait, we evaluated the relevance of 450 different genomic annotations, including protein-coding genes, enhancers, and DNase-I hypersensitive sites in over a hundred tissues and cell lines. We show that the fraction of phenotype-associated SNPs that influence protein sequence ranges from around 2% (for platelet volume) up to around 20% (for LDL cholesterol); that repressed chromatin is significantly depleted for SNPs associated with several traits; and that cell type-specific DNase-I hypersensitive sites are enriched for SNPs associated with several traits (for example, fibroblasts in Crohn’s disease and muscle tissue in bone density). Finally, by re-weighting each GWAS using information from functional genomics, we increase the number of loci with high-confidence associations by around 5%.

Drosophila embryogenesis scales uniformly across temperature and developmentally diverse species

Steven Gregory Kuntz, Michael B Eisen

Temperature affects both the timing and outcome of animal development, but the detailed effects of temperature on the progress of early development have been poorly characterized. To determine the impact of temperature on the order and timing of events during Drosophila melanogaster embryogenesis, we used time-lapse imaging to track the progress of embryos from shortly after egg laying through hatching at seven precisely maintained temperatures between 17.5°C and 32.5°C. We employed a combination of automated and manual annotation to determine when 36 milestones occurred in each embryo. D. melanogaster embryogenesis takes 33 hours at 17.5°C, and accelerates with increasing temperature to a low of 16 hours at 27.5°C, above which embryogenesis slows slightly. Remarkably, while the total time of embryogenesis varies over twofold, the relative timing of events from cellularization through hatching is constant across temperatures. To further explore the relationship between temperature and embryogenesis, we expanded our analysis to cover ten additional Drosophila species of varying climatic origins. Six of these species, like D. melanogaster, are of tropical origin, and embryogenesis time at different temperatures was similar for them all. D. mojavensis, a sub-tropical fly, develops more slowly than the tropical species at lower temperatures, while D. virilis, a temperate fly, exhibits slower development at all temperatures. The alpine sister species D. persimilis and D. pseudoobscura develop as rapidly as tropical flies at cooler temperatures, but exhibit diminished acceleration above 22.5°C and have drastically slowed development by 30°C. Despite ranging from 13 hours for D. erecta at 30°C to 46 hours for D. virilis at 17.5°C, the relative timing of events from cellularization through hatching is constant across all of the species and temperatures examined here, suggesting the existence of a previously unrecognized timer controlling the progress of embryogenesis that has been tuned by natural selection in response to the thermal environment in which each species lives.
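
"Constant relative timing" can be stated operationally: rescale each embryo's milestone times to the interval between cellularization and hatching and compare the resulting fractions. The sketch below shows that normalization; the function name and the milestone times are invented for illustration and are not measurements from the study.

```python
def relative_timing(milestone_times):
    """Rescale milestone times to fractions of the observed interval.

    milestone_times : times (hours) of successive milestones for one embryo,
                      first entry = cellularization, last entry = hatching.
    Returns fractions between 0.0 (cellularization) and 1.0 (hatching).
    """
    start, end = milestone_times[0], milestone_times[-1]
    if end <= start:
        raise ValueError("hatching must occur after cellularization")
    return [(t - start) / (end - start) for t in milestone_times]


# Invented example: the same four milestones in a slow and a fast embryo.
slow_embryo = [4.0, 12.0, 20.0, 36.0]
fast_embryo = [2.0, 6.0, 10.0, 18.0]
print(relative_timing(slow_embryo))
print(relative_timing(fast_embryo))
```

If development scales uniformly, as the abstract reports, the two normalized lists coincide even though the absolute durations differ by a factor of two.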

Improved annotation of 3-prime untranslated regions and complex loci by combination of strand-specific Direct RNA Sequencing, RNA-seq and ESTs

Nick Schurch, Christian Cole, Alexander Sherstnev, Junfang Song, Céline Duc, Kate G. Storey, W. H. Irwin McLean, Sara J. Brown, Gordon G. Simpson, Geoffrey J. Barton
(Submitted on 11 Nov 2013)

The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct annotation is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3-prime untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for human, chicken and A. thaliana samples by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences, which locates 3-prime polyadenylation sites to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data.