Long non-coding RNAs as a source of new peptides

Jorge Ruiz-Orera, Xavier Messeguer, Juan A. Subirana, M.Mar Albà
(Submitted on 16 May 2014)

Deep transcriptome sequencing has revealed the existence of many transcripts that lack long or conserved open reading frames and which have been termed long non-coding RNAs (lncRNAs). Despite the existence of several well-characterized lncRNAs that play roles in the regulation of gene expression, the vast majority of them do not yet have a known function. Motivated by the existence of ribosome profiling data for several species, we have tested the hypothesis that they may act as a repository for the synthesis of new peptides using data from human, mouse, zebrafish, fruit fly, Arabidopsis and yeast. The ribosome protection patterns are consistent with the presence of translated open reading frames (ORFs) in a very large number of lncRNAs. Most of the ribosome-protected ORFs are shorter than 100 amino acids and usually cover less than half the transcript. Ribosome density in these ORFs is high and contrasts sharply with the 3' UTR, in which very often there is no detectable ribosome binding, similar to bona fide protein-coding genes. The coding potential of ribosome-protected ORFs, measured using hexamer frequencies, is significantly higher than that of randomly selected intronic ORFs and similar to that of evolutionarily young coding sequences. Selective constraints in ribosome-protected ORFs from lncRNAs are lower than in typical protein-coding genes but again similar to young proteins. These results strongly suggest that lncRNAs play an important role in de novo protein evolution.

READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad Ulrich Förstner, Jörg Vogel, Cynthia Mira Sharma

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

Automation and Evaluation of the SOWH Test of Phylogenetic Topologies with SOWHAT

Samuel H. Church, Joseph F. Ryan, Casey W. Dunn

The Swofford-Olsen-Waddell-Hillis (SOWH) test is a method to evaluate incongruent phylogenetic topologies. It is used, for example, when an investigator wishes to know if the maximum likelihood tree recovered in their analysis is significantly different from an alternative phylogenetic hypothesis. The SOWH test compares the observed difference in likelihood between the topologies to a null distribution of differences in likelihood generated by parametric resampling. The SOWH test is a well-established and important phylogenetic method, but it can be difficult to implement and its sensitivity to various factors is not well understood. We wrote SOWHAT, a program that automates the SOWH test. In test analyses, we find that variation in parameter estimation as well as the use of a more complex model of parameter estimation have little impact on results, but that results can be inconsistent when an insufficient number of replicates is used to estimate the null distribution. We provide methods of analyzing the sampling as well as a simple stopping criterion for sufficient bootstrap replicates, which increase the overall reliability of the approach. Applications of the SOWH test should include explicit evaluations of sampling adequacy. SOWHAT is available for download from https://github.com/josephryan/SOWHAT.
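
To make the comparison step concrete, here is a minimal R sketch of how an observed likelihood difference could be compared against a null distribution of differences. It is not SOWHAT's implementation: the function name sowh_p_value, the observed value, and the exponential placeholder null are all assumptions made for illustration. In a real analysis each null value would come from a dataset simulated under the alternative hypothesis.

```r
# Hypothetical helper (not part of SOWHAT): one-sided empirical p-value,
# i.e. the fraction of null differences at least as large as the observed one.
sowh_p_value <- function(observed_delta, null_deltas) {
  mean(null_deltas >= observed_delta)
}

set.seed(1)
# Placeholder null distribution; real values would be likelihood differences
# computed on parametrically simulated datasets.
null_deltas <- rexp(1000, rate = 1)
observed_delta <- 3  # made-up observed difference in log-likelihood

p <- sowh_p_value(observed_delta, null_deltas)

# Running estimate of the p-value as replicates accumulate; a curve that is
# still moving suggests that more replicates are needed.
running_p <- cumsum(null_deltas >= observed_delta) / seq_along(null_deltas)
```

Plotting running_p against the replicate index is one simple way to inspect sampling adequacy, in the spirit of the evaluation the abstract recommends.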

Phylogenetic confidence intervals for the optimal trait value

Krzysztof Bartoszek, Serik Sagitov

We consider a stochastic evolutionary model for a phenotype developing amongst n related species with unknown phylogeny. The unknown tree is modelled by a Yule process conditioned on n contemporary nodes. The trait value is assumed to evolve along lineages as an Ornstein-Uhlenbeck process. As a result, the trait values of the n species form a sample with dependent observations. We establish three limit theorems for the sample mean corresponding to three domains for the adaptation rate. In the case of fast adaptation, we show that for large n the normalized sample mean is approximately normally distributed. Using these limit theorems, we develop novel confidence interval formulae for the optimal trait value.
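
For readers unfamiliar with the trait model, the standard form of an Ornstein-Uhlenbeck process can be written as follows. The symbols are assumed notation rather than taken from the abstract: \alpha for the adaptation rate, \theta for the optimal trait value, \sigma for the noise intensity, and B_t for a standard Brownian motion.

```latex
dX_t = -\alpha \, (X_t - \theta) \, dt + \sigma \, dB_t ,
\qquad
\mathbb{E}[X_t \mid X_0] = \theta + (X_0 - \theta) \, e^{-\alpha t} .
```

The second expression shows why the adaptation rate matters: the larger \alpha is, the faster the expected trait value of a lineage forgets its ancestral state and approaches \theta.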

Adaptation to a novel predator in Drosophila melanogaster: How well are we able to predict evolutionary responses?

Michael DeNieu, William Pitchers, Ian Dworkin

Evolutionary theory is sufficiently well developed to allow for short-term prediction of evolutionary trajectories. In addition to the presence of heritable variation, prediction requires knowledge of the form of natural selection on relevant traits. While many studies estimate the form of natural selection, few examine the degree to which traits evolve in the predicted direction. In this study we examine the form of natural selection imposed by mantid predation on wing size and shape in the fruit fly, Drosophila melanogaster. We then evolve populations of D. melanogaster under predation pressure, and examine the extent to which wing size and shape have responded in the predicted direction. We demonstrate that wing form partially evolves along the predicted vector from selection, more so than for control lineages. Furthermore, we re-examined phenotypic selection after ~30 generations of experimental evolution. We observed that the magnitude of selection on wing size and shape was diminished in populations evolving with mantid predators, while the direction of the selection vector differed from that of the ancestral population for shape. We discuss these findings in the context of the predictability of evolutionary responses, and the need for fully multivariate approaches.

Author post: When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis

This guest post is by Justin Blumenstiel on his preprint (with co-authors) When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis, available from bioRxiv here.

Does the activation of one transposable element (TE) family typically lead to the activation of many? If so, this would indicate a synergism between different TE families with significance for TE dynamics in natural populations. A standard model of TE dynamics typically takes into account population size, transposition rate (which may vary based on host defense) and selection against TE insertions. If the mobilization of one TE can lead to the mobilization of others, the transposition rate of one TE family could influence the transposition rate of others.

Hybrid dysgenic syndromes in Drosophila are an important model for TE dynamics when one TE family becomes mobilized. In the 1980s, it was generally concluded that P element dysgenesis did not lead to mobilization of other TEs. However, studies in the D. virilis system of hybrid dysgenesis indicated otherwise. More recently, an analysis of transposition in the P element system indicated movement of elements other than the P element. Thus, it appears that co-mobilization may be a common feature of dysgenic syndromes.

What is the mechanism of co-mobilization? For the P element system, studies by William Theurkauf and colleagues point to the DNA damage response as key. Specifically, via Chk2 kinase, DNA damage signaling leads to perturbed piRNA biogenesis, which in turn leads to the activation of other elements under control of piRNA-based silencing. Does this mechanism also apply to other systems?

To study the mechanism of TE co-mobilization in the D. virilis system, we performed small RNA sequencing and mRNA sequencing experiments using germline material of reciprocal females of the dysgenic and non-dysgenic crosses. In contrast to the P element and I element systems, hybrid dysgenesis in D. virilis is more complex. For one, there is not a single element that has been proven to be the sole cause. This was previously shown, but in this study we identified several more elements that likely contribute. From small RNA sequencing, we find that TE mis-expression persists in the progeny of the dysgenic cross, without a persisting global defect in piRNA biogenesis. Rather, it appears that piRNA biogenesis defects are idiosyncratic across different TE families. Interestingly, we also find evidence that piRNA silencing loses specificity in the dysgenic cross, with some highly expressed genes becoming non-specific targets.

Overall, this study provided several insights, but the mechanism of co-mobilization in the D. virilis system remains unknown. The complexity of this syndrome makes it a challenge for study, but it may provide significant insight into genome dynamics of hybrids whose parents differ for more than one TE family. Future genetic analysis may allow us to determine the role of the DNA damage response in maintaining the activity of some TE families, but not others.

Author post: qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots

This guest post is by Stephen Turner on his preprint qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots, available on bioRxiv here. This is a modified version of a post from his blog.

Three years ago I wrote a blog post on how to create manhattan plots in R. After hundreds of comments pointing out bugs and other issues, I've finally cleaned up this code and turned it into an R package.

The qqman R package is on CRAN: http://cran.r-project.org/web/packages/qqman/

The source code is on GitHub: https://github.com/stephenturner/qqman

The pre-print is on bioRxiv: Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. bioRxiv DOI: 10.1101/005165.

Here's a short demo of the package for creating Q-Q and manhattan plots from GWAS results.

Installation

First, let's install and load the package. We can see more examples by viewing the package vignette.

# Install only once:
install.packages("qqman")

# Load every time you use it:
library(qqman)

The qqman package includes functions for creating manhattan plots (the manhattan() function) and Q-Q plots (with the qq() function) from GWAS results. The gwasResults data.frame included with the package has simulated results for 16,470 SNPs on 22 chromosomes in a format similar to the output from PLINK. Take a look at the data:

head(gwasResults)
  SNP CHR BP      P
1 rs1   1  1 0.9148
2 rs2   1  2 0.9371
3 rs3   1  3 0.2861
4 rs4   1  4 0.8304
5 rs5   1  5 0.6417
6 rs6   1  6 0.5191

Creating manhattan and Q-Q plots

Let's make a basic manhattan plot. If you're using results from PLINK where columns are named SNP, CHR, BP, and P, you only need to call the manhattan() function on the results data.frame you read in.

manhattan(gwasResults)

[Figure: manhattan_01_basic]
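
If your results do not use the PLINK column names, the manhattan() function lets you say which columns to use through its chr, bp, p, and snp arguments. The sketch below uses a made-up data.frame called myResults with hypothetical column names (marker, chrom, pos, pval) and random p-values, purely to show the call.

```r
library(qqman)

# Hypothetical results with non-PLINK column names and random p-values
set.seed(42)
myResults <- data.frame(
  marker = paste0("rs", 1:200),
  chrom  = rep(1:2, each = 100),   # chromosome must be numeric
  pos    = rep(1:100, times = 2),
  pval   = runif(200)
)

# Point manhattan() at the right columns by name
manhattan(myResults, chr = "chrom", bp = "pos", p = "pval", snp = "marker")
```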

We can also change the colors, add a title, and remove the genome-wide significance and “suggestive” lines:

manhattan(gwasResults, col = c("blue4", "orange3"), main = "Results from simulated trait",
    genomewideline = FALSE, suggestiveline = FALSE)

[Figure: manhattan_02_color_title_lines]

Let's highlight some SNPs of interest on chromosome 3. The 100 SNPs we're highlighting here are in a character vector called snpsOfInterest. You'll get a warning if you try to highlight SNPs that don't exist.

head(snpsOfInterest)
[1] "rs3001" "rs3002" "rs3003" "rs3004" "rs3005" "rs3006"
manhattan(gwasResults, highlight = snpsOfInterest)

[Figure: manhattan_03_highlight]

We can combine highlighting and limiting to a single chromosome to “zoom in” on an interesting chromosome or region:

manhattan(subset(gwasResults, CHR == 3), highlight = snpsOfInterest, main = "Chr 3 Results")

[Figure: manhattan_04_highlight_chr]

Finally, creating Q-Q plots is straightforward – simply supply a vector of p-values to the qq() function. You can optionally provide a title.

qq(gwasResults$P, main = "Q-Q plot of GWAS p-values")

[Figure: qq]
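
If you're curious what a Q-Q plot of p-values actually compares, here is a rough base R sketch that needs no packages. It is an illustration of the idea, not the package's source code: the helper qq_points is a hypothetical name, and the p-values are random uniform numbers standing in for real results.

```r
# Hypothetical helper: pair sorted observed p-values with the quantiles
# expected under the null (uniform) distribution, both on the -log10 scale.
qq_points <- function(pvalues) {
  p <- sort(pvalues[!is.na(pvalues) & pvalues > 0 & pvalues <= 1])
  data.frame(
    expected = -log10(ppoints(length(p))),
    observed = -log10(p)
  )
}

set.seed(1)
pts <- qq_points(runif(1000))  # placeholder p-values

plot(pts$expected, pts$observed,
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
abline(0, 1, col = "red")  # points on this line match the null expectation
```

Points that rise above the diagonal at the right-hand end indicate more small p-values than expected by chance.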

Read the blog post or check out the package vignette for more examples and options.

vignette("qqman")

Deleterious passengers in adapting populations

Benjamin H Good, Michael M Desai
Subjects: Populations and Evolution (q-bio.PE)

Most new mutations are deleterious and are eventually eliminated by natural selection. But in an adapting population, the rapid amplification of beneficial mutations can hinder the removal of deleterious variants in nearby regions of the genome, altering the patterns of sequence evolution. Here, we analyze the interactions between beneficial “driver” mutations and linked deleterious “passengers” during the course of adaptation. We derive analytical expressions for the substitution rate of a deleterious mutation as a function of its fitness cost, as well as the reduction in the beneficial substitution rate due to the genetic load of the passengers. We find that the fate of each deleterious mutation varies dramatically with the rate and spectrum of beneficial mutations, with a non-monotonic dependence on both the population size and the rate of adaptation. By quantifying this dependence, our results allow us to estimate which deleterious mutations will be likely to fix, and how many of these mutations must arise before the progress of adaptation is significantly reduced.

Locus architecture affects mRNA expression levels in Drosophila embryos

Tara Lydiard-Martin, Meghan Bragdon, Kelly B Eckenrode, Zeba Wunderlich, Angela H DePace

Structural variation in the genome is common due to insertions, deletions, duplications and rearrangements. However, little is known about the ways structural variants impact gene expression. Developmental genes are controlled by multiple regulatory sequence elements scattered over thousands of bases; developmental loci are therefore a good model to test the functional impact of structural variation on gene expression. Here, we measured the effect of rearranging two developmental enhancers from the even-skipped (eve) locus in Drosophila melanogaster blastoderm embryos. We systematically varied orientation, order, and spacing of the enhancers in transgenic reporter constructs and measured expression quantitatively at single cell resolution in whole embryos to detect changes in both level and position of expression. We found that the position of expression was robust to changes in locus organization, but levels of expression were highly sensitive to the spacing between enhancers and order relative to the promoter. Our data demonstrate that changes in locus architecture can dramatically impact levels of gene expression. To quantitatively predict gene expression from sequence, we must therefore consider how information is integrated both within enhancers and across gene loci.

RNA-seq gene profiling – a systematic empirical comparison

Nuno A Fonseca, John A Marioni, Alvis Brazma

Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found, with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNA-seq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNA-seq data, it still allows us to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
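
As a rough illustration of the kind of comparison described, the base R sketch below scores one set of estimated expression values against known simulated values. It is not the authors' evaluation code: the variable names, the log-normal toy data, and the choice of Spearman correlation and relative error as metrics are all assumptions.

```r
set.seed(7)
n_genes <- 500

# Toy "ground truth" and a noisy estimate standing in for one pipeline's output
true_expr <- rlnorm(n_genes, meanlog = 3, sdlog = 1)
est_expr  <- true_expr * rlnorm(n_genes, meanlog = 0, sdlog = 0.2)

# Overall agreement between estimated and true expression
rho <- cor(true_expr, est_expr, method = "spearman")

# Per-gene relative error, and the indices of the ten worst-estimated genes
rel_err <- abs(est_expr - true_expr) / true_expr
worst   <- order(rel_err, decreasing = TRUE)[1:10]
```

Repeating this for each pipeline and data set, then checking which genes keep appearing among the worst, would be one way to look for genes with consistently high error.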