Author post: Long non-coding RNAs as a source of new peptides

This post is by M.Mar Albà on her preprint (with co-authors), Long non-coding RNAs as a source of new peptides, available from arXiv.

Several recent studies based on deep sequencing of ribosome-protected fragments have reported that many long non-coding RNAs (lncRNAs) associate with ribosomes (see, for example, Everything old is new again: (linc)RNAs make proteins!, a comment by Stephen M Cohen). We have analyzed the original data from experiments performed in six different eukaryotic species and confirmed that this is a widespread phenomenon. This is paradoxical because lncRNAs apparently have very little coding capacity, with only short open reading frames (ORFs) that do not show sequence similarity to known proteins.

In contrast to typical mRNAs, many lncRNAs are lineage-specific. Therefore, if they are translated, they should be similar to recently evolved protein-coding genes. This is exactly what we have found. It turns out that transcripts encoding young proteins show very similar properties to lncRNAs: short and non-conserved ORFs, low coding sequence potential, and relatively weak selective constraints.

Evidence has accumulated in recent years that new protein-coding genes are continuously evolving (The continuing evolution of genes by Carl Zimmer). The birth of a new functional protein is a process of trial and error that most likely requires the expression of many transcripts that will not survive the test of time. LncRNAs seem to fit the bill for this role.

Author post: Diversity and evolution of centromere repeats in the maize genome

This guest post is by Paul Bilinski on his paper with coauthors Diversity and evolution of centromere repeats in the maize genome BioRxived here.

Centromeres have the potential to play a central role in speciation, yet our ability to study them has been limited because of their repetitive nature. The centromeres of many eukaryotes consist partly of large arrays of short tandem repeats, though the actual sequence of the repeat varies widely across taxa. To investigate whether the variation found in the tandem repeats themselves could inform our understanding of their evolutionary history, we made use of the reference maize genome as well as resequencing data from several lines of maize and its wild relative teosinte.

Although tandem repeats should be identical upon duplication, our analysis of CentC in maize revealed that most copies genome-wide are unique. We observed only three instances where adjacent copies were identical in sequence and length, driving home the idea that these tandem repeats have accumulated immense diversity. Given such diversity, we wanted to investigate genetic relatedness across CentC copies.

Using positional and genetic relatedness information from the fully sequenced centromeres 2 and 5, we found high within-cluster similarity, suggesting that tandem duplications drove most CentC copy number increase. Contrary to patterns seen in Arabidopsis (Kawabe and Nasuda 2005), principal coordinate analysis of repeats found no clustering by chromosome, with groups of CentC with similar sequence distributed across all of the chromosomes.

Another surprising discovery involved the origin of the biggest arrays of CentC. As an ancient tetraploid, maize originally had 20 chromosomes with 20 centromeres. Processes of fractionation and rearrangement have led to the 10 chromosomes in the extant maize genome. Schnable et al (2011) were able to identify which chromosomal segments derive from each of maize’s ancient parents, referred to as subgenomes 1 and 2. Wang and Bennetzen (2011) built on this information, and found that about half of the modern centromeres came from each parent. Inferring subgenome of origin by flanking regions, we found that all of the CentC clusters >20kb in length derive from subgenome 1. The proportions are less skewed when looking at clusters >10kb, though in all cases we see more bp of CentC assigned to subgenome 1 than we expect based on its total bp in the genome. This is particularly interesting because subgenome 1 also shows higher overall gene expression and fewer deletions than subgenome 2 (Schnable et al 2011).

The diversity of CentC seen might suggest that CentC repeats were reasonably static in the genome, persisting in the same spot for a long time with occasional increases in copy number via tandem duplication. However, fluorescent in situ hybridization suggested that domestication resulted in a large loss of CentC signal across many of maize’s 10 chromosomes. We confirmed and quantified the loss of CentC using resequencing data from a set of maize and teosinte lines (Chia et al. 2012).

Combined, our results suggest long term stability of CentC clusters with new copies arising from tandem duplication, while mutation serves to homogenize rather than separate clusters. We hope our insights into centromere repeat evolution will build toward a better understanding of their role in evolution.

Evidence for strong co-evolution of mitochondrial and somatic genomes

Michael G. Sadovsky
(Submitted on 20 May 2014)

We studied the relation between the triplet frequency composition of mitochondrial genomes and the phylogeny of their bearers. First, clusters in 63-dimensional space were identified using K-means. Second, the clade composition of those clusters was studied. It was found that genomes are distributed among the clusters very regularly, with strong correlation to taxonomy. Strong co-evolution manifests through this correlation: proximity in frequency space was determined from the mitochondrial genomes, while proximity in taxonomy was determined morphologically.
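
As a rough illustration of the kind of analysis the abstract describes, the R sketch below counts triplets in a few simulated sequences, converts the counts to frequencies, and clusters them with K-means. Everything in it is an assumption made for illustration: the random sequences, the helper triplet_freq(), the use of overlapping windows, and the choice of two clusters are not taken from the paper. Because the 64 triplet frequencies sum to one, only 63 of them are independent, which is one possible reading of the 63-dimensional space.

```r
# Hypothetical helper: frequencies of all 64 overlapping triplets in a DNA string.
triplet_freq <- function(dna) {
  bases <- c("A", "C", "G", "T")
  grid <- expand.grid(bases, bases, bases, stringsAsFactors = FALSE)
  all_triplets <- sort(paste0(grid[[1]], grid[[2]], grid[[3]]))
  starts <- seq_len(nchar(dna) - 2)
  observed <- substring(dna, starts, starts + 2)
  counts <- table(factor(observed, levels = all_triplets))
  as.numeric(counts) / sum(counts)
}

# Simulated stand-ins for genomes: two groups with different base composition.
set.seed(1)
simulate_genome <- function(n, probs) {
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE, prob = probs),
        collapse = "")
}
at_rich <- replicate(5, simulate_genome(3000, c(0.4, 0.1, 0.1, 0.4)))
gc_rich <- replicate(5, simulate_genome(3000, c(0.1, 0.4, 0.4, 0.1)))
genomes <- c(at_rich, gc_rich)

# One row per genome, one column per triplet; drop one column to get 63 dimensions.
freqs <- t(vapply(genomes, triplet_freq, numeric(64), USE.NAMES = FALSE))
freqs63 <- freqs[, -64]

# K-means with two clusters (the number of clusters is an arbitrary choice here).
fit <- kmeans(freqs63, centers = 2, nstart = 10)
table(cluster = fit$cluster,
      group = rep(c("AT-rich", "GC-rich"), each = 5))
```

With real data, the group labels in the final table would be replaced by taxonomic assignments, so that cluster membership can be compared against clades.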

Long non-coding RNAs as a source of new peptides

Jorge Ruiz-Orera, Xavier Messeguer, Juan A. Subirana, M.Mar Albà
(Submitted on 16 May 2014)

Deep transcriptome sequencing has revealed the existence of many transcripts that lack long or conserved open reading frames and which have been termed long non-coding RNAs (lncRNAs). Despite the existence of several well-characterized lncRNAs that play roles in the regulation of gene expression, the vast majority of them do not yet have a known function. Motivated by the existence of ribosome profiling data for several species, we have tested the hypothesis that they may act as a repository for the synthesis of new peptides using data from human, mouse, zebrafish, fruit fly, Arabidopsis and yeast. The ribosome protection patterns are consistent with the presence of translated open reading frames (ORFs) in a very large number of lncRNAs. Most of the ribosome-protected ORFs are shorter than 100 amino acids and usually cover less than half the transcript. Ribosome density in these ORFs is high and contrasts sharply with the 3'UTR region, in which very often there is no detectable ribosome binding, similar to bona fide protein-coding genes. The coding potential of ribosome-protected ORFs, measured using hexamer frequencies, is significantly higher than that of randomly selected intronic ORFs and similar to that of evolutionarily young coding sequences. Selective constraints in ribosome-protected ORFs from lncRNAs are lower than in typical protein-coding genes but again similar to young proteins. These results strongly suggest that lncRNAs play an important role in de novo protein evolution.

READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad Ulrich Förstner, Jörg Vogel, Cynthia Mira Sharma

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

Automation and Evaluation of the SOWH Test of Phylogenetic Topologies with SOWHAT

Samuel H. Church, Joseph F. Ryan, Casey W. Dunn

The Swofford-Olsen-Waddell-Hillis (SOWH) test is a method to evaluate incongruent phylogenetic topologies. It is used, for example, when an investigator wishes to know if the maximum likelihood tree recovered in their analysis is significantly different from an alternative phylogenetic hypothesis. The SOWH test compares the observed difference in likelihood between the topologies to a null distribution of differences in likelihood generated by parametric resampling. The SOWH test is a well-established and important phylogenetic method, but it can be difficult to implement and its sensitivity to various factors is not well understood. We wrote SOWHAT, a program that automates the SOWH test. In test analyses, we find that variation in parameter estimation as well as the use of a more complex model of parameter estimation have little impact on results, but that results can be inconsistent when an insufficient number of replicates is used to estimate the null distribution. We provide methods of analyzing the sampling as well as a simple stopping criterion for sufficient bootstrap replicates, which increase the overall reliability of the approach. Applications of the SOWH test should include explicit evaluations of sampling adequacy. SOWHAT is available for download from https://github.com/josephryan/SOWHAT.
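
The comparison at the heart of the test can be sketched in a few lines of R. This is not SOWHAT's code: the observed difference and the null values below are made-up placeholders standing in for likelihood differences from real and simulated data sets, the helper sowh_p() is hypothetical, and the empirical p-value with a +1 correction is one common convention that is simply assumed here.

```r
# Placeholder values only; a real analysis would obtain these from
# likelihood searches on the observed and the simulated alignments.
set.seed(42)
observed_delta <- 12.5                     # hypothetical observed lnL difference
null_deltas <- rexp(1000, rate = 1 / 3)    # hypothetical null distribution

# Hypothetical helper: one-sided empirical p-value, i.e. the share of null
# differences at least as large as the observed one (with a +1 correction).
sowh_p <- function(observed, null) {
  (sum(null >= observed) + 1) / (length(null) + 1)
}
p_value <- sowh_p(observed_delta, null_deltas)

# Recompute the estimate as replicates accumulate; a value that is still
# drifting suggests that more replicates are needed.
replicate_counts <- seq(100, 1000, by = 100)
running_p <- vapply(replicate_counts,
                    function(n) sowh_p(observed_delta, null_deltas[seq_len(n)]),
                    numeric(1))
data.frame(replicates = replicate_counts, p = running_p)
```

Tracking the estimate against the number of replicates is only one simple way to look at sampling adequacy; the paper's own stopping criterion may differ.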

Phylogenetic confidence intervals for the optimal trait value

Krzysztof Bartoszek, Serik Sagitov

We consider a stochastic evolutionary model for a phenotype developing amongst n related species with unknown phylogeny. The unknown tree is modelled by a Yule process conditioned on n contemporary nodes. The trait value is assumed to evolve along lineages as an Ornstein-Uhlenbeck process. As a result, the trait values of the n species form a sample with dependent observations. We establish three limit theorems for the sample mean corresponding to three domains for the adaptation rate. In the case of fast adaptation, we show that for large n the normalized sample mean is approximately normally distributed. Using these limit theorems, we develop novel confidence interval formulae for the optimal trait value.

Adaptation to a novel predator in Drosophila melanogaster: How well are we able to predict evolutionary responses?

Michael DeNieu, William Pitchers, Ian Dworkin

Evolutionary theory is sufficiently well developed to allow for short-term prediction of evolutionary trajectories. In addition to the presence of heritable variation, prediction requires knowledge of the form of natural selection on relevant traits. While many studies estimate the form of natural selection, few examine the degree to which traits evolve in the predicted direction. In this study we examine the form of natural selection imposed by mantid predation on wing size and shape in the fruit fly, Drosophila melanogaster. We then evolve populations of D. melanogaster under predation pressure, and examine the extent to which wing size and shape have responded in the predicted direction. We demonstrate that wing form partially evolves along the predicted vector from selection, more so than for control lineages. Furthermore, we re-examined phenotypic selection after ~30 generations of experimental evolution. We observed that the magnitude of selection on wing size and shape was diminished in populations evolving with mantid predators, while the direction of the selection vector differed from that of the ancestral population for shape. We discuss these findings in the context of the predictability of evolutionary responses, and the need for fully multivariate approaches.

Author post: When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis

This guest post is by Justin Blumenstiel on his preprint (with co-authors) When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis, available from bioRxiv here.

Does the activation of one transposable element (TE) family typically lead to the activation of many? If so, this would indicate a synergism between different TE families with significance for TE dynamics in natural populations. A standard model of TE dynamics typically takes into account population size, transposition rate (which may vary based on host defense) and selection against TE insertions. If the mobilization of one TE can lead to the mobilization of others, the transposition rate of one TE family could influence the transposition rate of others.

Hybrid dysgenic syndromes in Drosophila are an important model for TE dynamics when one TE family becomes mobilized. In the 1980s, it was generally concluded that P element dysgenesis did not lead to mobilization of other TEs. However, studies in the D. virilis system of hybrid dysgenesis indicated otherwise. More recently, an analysis of transposition in the P element system indicated movement of elements other than the P element. Thus, it appears that co-mobilization may be a common feature of dysgenic syndromes.

What is the mechanism of co-mobilization? For the P element system, studies by William Theurkauf and colleagues point to the DNA damage response as key. Specifically, via Chk2 kinase, DNA damage signaling leads to perturbed piRNA biogenesis, which in turn leads to the activation of other elements under control of piRNA-based silencing. Does this mechanism also apply to other systems?

To study the mechanism of TE co-mobilization in the D. virilis system, we performed small RNA sequencing and mRNA sequencing experiments using germline material of reciprocal females of the dysgenic and non-dysgenic crosses. In contrast to the P element and I element systems, hybrid dysgenesis in D. virilis is more complex. For one, there is not a single element that has been proven to be the sole cause. This was previously shown, but in this study we identified several more elements that likely contribute. From small RNA sequencing, we find that TE mis-expression persists in the progeny of the dysgenic cross, without a persisting global defect in piRNA biogenesis. Rather, it appears that piRNA biogenesis defects are idiosyncratic across different TE families. Interestingly, we also find evidence that piRNA silencing loses specificity in the dysgenic cross, with some highly expressed genes becoming non-specific targets.

Overall, this study provided several insights, but the mechanism of co-mobilization in the D. virilis system remains unknown. The complexity of this syndrome makes it a challenge for study, but it may provide significant insight into genome dynamics of hybrids whose parents differ for more than one TE family. Future genetic analysis may allow us to determine the role of the DNA damage response in maintaining the activity of some TE families, but not others.

Author post: qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots

This guest post is by Stephen Turner on his preprint qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots, available on bioRxiv here. This is a modified version of a post from his blog.

qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots

Three years ago I wrote a blog post on how to create manhattan plots in R. After hundreds of comments pointing out bugs and other issues, I've finally cleaned up this code and turned it into an R package.

The qqman R package is on CRAN: http://cran.r-project.org/web/packages/qqman/

The source code is on GitHub: https://github.com/stephenturner/qqman

The pre-print is on bioRxiv: Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. bioRxiv DOI: 10.1101/005165.

Here's a short demo of the package for creating Q-Q and manhattan plots from GWAS results.

Installation

First, let's install and load the package. We can see more examples by viewing the package vignette.

# Install only once:
install.packages("qqman")

# Load every time you use it:
library(qqman)

The qqman package includes functions for creating manhattan plots (the manhattan() function) and Q-Q plots (with the qq() function) from GWAS results. The gwasResults data.frame included with the package has simulated results for 16,470 SNPs on 22 chromosomes in a format similar to the output from PLINK. Take a look at the data:

head(gwasResults)
  SNP CHR BP      P
1 rs1   1  1 0.9148
2 rs2   1  2 0.9371
3 rs3   1  3 0.2861
4 rs4   1  4 0.8304
5 rs5   1  5 0.6417
6 rs6   1  6 0.5191

Creating manhattan and Q-Q plots

Let's make a basic manhattan plot. If you're using results from PLINK where columns are named SNP, CHR, BP, and P, you only need to call the manhattan() function on the results data.frame you read in.

manhattan(gwasResults)

[Figure: basic manhattan plot]
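
If your columns carry other names, the chr, bp, p and snp arguments of manhattan() tell it where to look. In the sketch below the data and the column names (marker, chrom, pos, pval) are invented for illustration. It also recodes a non-numeric chromosome label first, since manhattan() expects the chromosome column to be numeric; coding X as 23 is an assumption borrowed from common practice.

```r
# Simulated results with non-default column names (all names are hypothetical).
set.seed(1)
otherResults <- data.frame(
  marker = paste0("m", 1:300),
  chrom  = rep(c("1", "2", "X"), each = 100),
  pos    = rep(1:100, times = 3),
  pval   = runif(300),
  stringsAsFactors = FALSE
)

# Recode "X" to a number before plotting, then convert the column to numeric.
otherResults$chrom <- as.numeric(
  ifelse(otherResults$chrom == "X", "23", otherResults$chrom)
)

# Plot only if qqman is installed, pointing manhattan() at the custom columns.
if (requireNamespace("qqman", quietly = TRUE)) {
  qqman::manhattan(otherResults, chr = "chrom", bp = "pos",
                   p = "pval", snp = "marker")
}
```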

We can also change the colors, add a title, and remove the genome-wide significance and “suggestive” lines:

manhattan(gwasResults, col = c("blue4", "orange3"), main = "Results from simulated trait",
    genomewideline = FALSE, suggestiveline = FALSE)

[Figure: manhattan plot with custom colors and a title, without significance lines]
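
The two lines are drawn on the -log10 scale, so they can be moved as well as removed. The sketch below places the genome-wide line at a Bonferroni threshold for the number of SNPs tested. The 0.05 level, the 1e-4 suggestive cutoff and the simulated data frame sim are assumptions chosen for illustration.

```r
# Simulated results in the default SNP/CHR/BP/P layout (illustrative only).
set.seed(2)
sim <- data.frame(
  SNP = paste0("rs", 1:1000),
  CHR = rep(1:5, each = 200),
  BP  = rep(1:200, times = 5),
  P   = runif(1000),
  stringsAsFactors = FALSE
)

# Thresholds expressed as -log10(p), the scale used on the y axis.
bonferroni_line <- -log10(0.05 / nrow(sim))
suggestive_line <- -log10(1e-4)

# Plot only if qqman is installed.
if (requireNamespace("qqman", quietly = TRUE)) {
  qqman::manhattan(sim, genomewideline = bonferroni_line,
                   suggestiveline = suggestive_line)
}
```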

Let's highlight some SNPs of interest on chromosome 3. The 100 SNPs we're highlighting here are in a character vector called snpsOfInterest. You'll get a warning if you try to highlight SNPs that don't exist.

head(snpsOfInterest)
[1] "rs3001" "rs3002" "rs3003" "rs3004" "rs3005" "rs3006"
manhattan(gwasResults, highlight = snpsOfInterest)

[Figure: manhattan plot with the SNPs of interest highlighted]
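
A vector like snpsOfInterest is just a set of SNP names, so it can be built from the results themselves. The sketch below, on simulated data, picks the SNPs that fall in a positional window on one chromosome. The window bounds and the object names sim and mySnps are hypothetical.

```r
# Simulated results in the default SNP/CHR/BP/P layout (illustrative only).
set.seed(3)
sim <- data.frame(
  SNP = paste0("rs", 1:1000),
  CHR = rep(1:5, each = 200),
  BP  = rep(1:200, times = 5),
  P   = runif(1000),
  stringsAsFactors = FALSE
)

# Names of all SNPs on chromosome 3 between positions 50 and 99.
in_window <- sim$CHR == 3 & sim$BP >= 50 & sim$BP <= 99
mySnps <- sim$SNP[in_window]

# Plot only if qqman is installed.
if (requireNamespace("qqman", quietly = TRUE)) {
  qqman::manhattan(sim, highlight = mySnps)
}
```

The same pattern works for other selection rules, for example sim$SNP[sim$P < 1e-3] to highlight SNPs below a p-value cutoff.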

We can combine highlighting and limiting to a single chromosome to “zoom in” on an interesting chromosome or region:

manhattan(subset(gwasResults, CHR == 3), highlight = snpsOfInterest, main = "Chr 3 Results")

[Figure: manhattan plot of chromosome 3 with the SNPs of interest highlighted]

Finally, creating Q-Q plots is straightforward – simply supply a vector of p-values to the qq() function. You can optionally provide a title.

qq(gwasResults$P, main = "Q-Q plot of GWAS p-values")

[Figure: Q-Q plot of GWAS p-values]
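
For readers curious about what such a plot shows, here is a base R sketch of the underlying calculation: sorted observed p-values on the -log10 scale against the quantiles expected under a uniform distribution. It is written from the general definition of a Q-Q plot and is not taken from the package source. The genomic inflation factor at the end is an extra summary that qq() is not claimed to report, and the p-values are simulated.

```r
# Simulated p-values under the null (illustrative only).
set.seed(4)
p <- runif(5000)

# Observed versus expected quantiles, both on the -log10 scale.
observed <- -log10(sort(p))
expected <- -log10(ppoints(length(p)))

plot(expected, observed,
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
abline(0, 1, col = "red")   # points follow this line when there is no inflation

# Genomic inflation factor: median 1-df chi-square statistic implied by the
# p-values, divided by the median of the 1-df chi-square distribution.
chisq <- qchisq(p, df = 1, lower.tail = FALSE)
lambda <- median(chisq) / qchisq(0.5, df = 1)
lambda
```

A lambda close to 1 indicates p-values that behave as expected under the null; values well above 1 point to inflation.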

Read the blog post or check out the package vignette for more examples and options.

vignette("qqman")