SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads

Yinlong Xie, Gengxiong Wu, Jingbo Tang, Ruibang Luo, Jordan Patterson, Shanlin Liu, Weihua Huang, Guangzhu He, Shengchang Gu, Shengkang Li, Xin Zhou, Tak-Wah Lam, Yingrui Li, Xun Xu, Gane Ka-Shu Wong, Jun Wang
(Submitted on 29 May 2013)

Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining the sequences of many (but not all) of the genes from an organism with no reference genome. With the rapidly increasing throughput and decreasing costs of next generation sequencing, RNA-Seq has gained in popularity; but given the short reads (e.g. 2 × 90 bp paired ends), de novo assembly to recover complete full-length gene sequences remains an algorithmic challenge.
Results: We present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. Its performance was evaluated on 2 Gb and 5 Gb of transcriptome data from mouse and rice. Using the known transcripts from these two well-annotated genomes as a benchmark, we assessed how SOAPdenovo-Trans and other competing software handle the practical issues of alternative splicing and variable expression levels. Compared with other de novo transcriptome assemblers, SOAPdenovo-Trans provides higher contiguity, lower redundancy, and faster execution.

Stochastic gene expression with delay

Martin Jansen, Peter Pfaffelhuber
(Submitted on 28 May 2013)

The expression of genes usually follows a two-step procedure. First, a gene (encoded in the genome) is transcribed, resulting in a strand of (messenger) RNA. Afterwards, the RNA is translated into protein. Classically, this gene expression is modeled using a Markov jump process including activation and deactivation of the gene, transcription and translation rates together with degradation of RNA and protein. We extend this model by adding delays (with arbitrary distributions) to transcription and translation. Such delays can arise, for example, when RNA has to be transported to a different part of a cell before translation can be initiated. Already in the classical model, production of RNA and protein comes in bursts by activation and deactivation of the gene, resulting in a large variance of the number of RNA and proteins in equilibrium. We derive precise formulas for this second-order structure with the model including delay in equilibrium. As a general fact, the delay decreases the variance of the number of RNA and proteins.
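
To make the classical model in this abstract concrete, here is a minimal sketch of a Gillespie-style simulation of the Markov jump process without delays. It is not the authors' code, and it does not implement the delayed model from the paper. The function name, the rate names, and the rate values are all assumptions chosen for illustration.

```python
import random

# Hypothetical rate constants, chosen only for illustration.
EXAMPLE_RATES = {
    "activate": 0.5,       # gene switches on
    "deactivate": 0.5,     # gene switches off
    "transcribe": 5.0,     # RNA production while the gene is on
    "translate": 2.0,      # protein production per RNA molecule
    "rna_decay": 1.0,      # degradation per RNA molecule
    "protein_decay": 0.2,  # degradation per protein molecule
}


def simulate_gene_expression(t_end, rates, seed=0):
    """Simulate the classical (no-delay) model up to time t_end.

    Returns the final state as (gene_active, rna_count, protein_count).
    """
    rng = random.Random(seed)
    t, active, rna, protein = 0.0, 0, 0, 0
    while True:
        # Rate of each possible jump in the current state.
        propensities = [
            rates["activate"] * (1 - active),
            rates["deactivate"] * active,
            rates["transcribe"] * active,
            rates["translate"] * rna,
            rates["rna_decay"] * rna,
            rates["protein_decay"] * protein,
        ]
        total = sum(propensities)
        if total == 0.0:
            break  # no jump is possible any more
        t += rng.expovariate(total)  # exponential waiting time
        if t > t_end:
            break
        # Pick the jump with probability proportional to its rate.
        event = rng.choices(range(6), weights=propensities)[0]
        if event == 0:
            active = 1
        elif event == 1:
            active = 0
        elif event == 2:
            rna += 1
        elif event == 3:
            protein += 1  # translation does not consume the RNA
        elif event == 4:
            rna -= 1
        else:
            protein -= 1
    return active, rna, protein


if __name__ == "__main__":
    print(simulate_gene_expression(50.0, EXAMPLE_RATES, seed=1))
```

Running many replicates with different seeds and computing the sample variance of the RNA and protein counts would give a rough picture of the bursty behaviour the abstract describes for the classical model.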

Our paper: Genetic recombination is targeted towards gene promoter regions in dogs

This guest post is by Adam Auton (@adamauton) on his paper (along with coauthors) Genetic recombination is targeted towards gene promoter regions in dogs arXived here.

In this paper, we investigate the age-old question of how meiotic recombination is distributed in the genome of dogs. Before you stop reading, I’d like to spend a couple of paragraphs explaining why this is an interesting topic.

Recombination in mammalian genomes tends to occur in highly localized regions known as recombination hotspots. There are probably about 30,000 or so recombination hotspots in the human genome, each of which is about 2 kb wide with recombination rates that can be thousands of times that of the surrounding region. Until a few years ago, the mechanism by which recombination hotspots are localized was largely unknown. This all began to change with the discovery of PRDM9 as the gene responsible for localizing hotspots [1-3]. The role of PRDM9 is to recognize and bind to specific DNA motifs in the genome, which are subsequently epigenetically marked as preferred locations of recombination.

PRDM9 turns out to be quite a fascinating gene. There is extensive variation in PRDM9 both within and across species, which points to strong selective pressures. Importantly, variation in PRDM9 can alter the recognized DNA motifs, thereby altering the locations of recombination hotspots in the genome. The high level of variation in PRDM9 between species appears to explain why recombination hotspots tend not to be shared between even closely related species, such as humans and chimpanzees.

We’ve learnt much about the importance of PRDM9 from studies in mice. Knock-out of Prdm9 in mice results in infertility and, most interestingly of all, certain alleles of mouse Prdm9 appear to be incompatible with each other [4,5]. Specifically, Mus m. musculus / Mus m. domesticus hybrid male mice are infertile if they are heterozygous for specific Prdm9 alleles. As such, Prdm9 has been called a ‘speciation gene’, as it has the potential to restrict gene flow between nascent species, and is the only known such example in mammals.

Given this importance, it was surprising to note that dogs, uniquely amongst mammals, appear to carry a dysfunctional version of PRDM9 [6]. This raises the question of how recombination occurs in dogs, and provides the motivation for our paper.

Estimating recombination rates directly is challenging, as only a few dozen events occur during any given meiosis. Therefore, to characterize large numbers of recombination events on a genome-wide basis, large pedigrees need to be genotyped, which can be both laborious and costly in non-model organisms. Luckily, an experiment of this nature has been previously performed in dogs, which revealed a recombination landscape that was reasonably consistent with patterns observed in other mammals [7].

However, without enormous sample sizes, such methods can only investigate patterns at scales far greater than the scale of individual hotspots. In order to investigate fine-scale patterns on a genome-wide basis, one must turn to indirect statistical methods, and it is this approach that we have adopted in our study. First, we whole-genome sequenced a collection of 51 outbred dogs and used these data to call single nucleotide polymorphisms. Having done so, we used the statistical method LDhat, which infers historical recombination rates via analysis of patterns of linkage disequilibrium. This is similar to the approach adopted by Axelsson et al. [8], who used microarrays to gain strong insights into canine recombination, although our use of sequencing allows us to investigate patterns at a much finer scale.
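
For readers unfamiliar with linkage disequilibrium, the following is a minimal sketch of the pairwise r² statistic between two biallelic SNPs, computed from phased haplotypes. It is not LDhat and not the method used in the paper; it only illustrates the kind of signal such methods analyse. The function name and the 0/1 haplotype encoding are assumptions made for this example.

```python
def ld_r_squared(hap_a, hap_b):
    """Return r^2 between two SNPs given phased haplotypes.

    hap_a and hap_b are equal-length sequences of 0/1 alleles,
    one entry per sampled chromosome (hypothetical encoding).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n  # frequency of allele 1 at the first SNP
    p_b = sum(hap_b) / n  # frequency of allele 1 at the second SNP
    # Frequency of chromosomes carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


if __name__ == "__main__":
    print(ld_r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated SNPs
    print(ld_r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # independent SNPs
```

Broadly, LD between nearby SNPs that decays quickly with distance suggests a history of frequent recombination, while LD that persists suggests little recombination. Turning that intuition into rate estimates requires a population genetic model, which is what LDhat provides.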

Our results agree nicely with the broad-scale experimental estimates, but reveal a quite unusual landscape at the fine scale. In particular, we find that canine recombination is strongly enriched in regions with high CpG content. As such, recombination rates are very high around the CpG-rich regions associated with gene promoters, which contrasts with other mammalian species in which recombination hotspots do not show any particularly strong affinity for gene promoter regions. However, it is also reminiscent of patterns seen in Prdm9 knock-out mice which, although infertile, still produce double-strand breaks that cluster in gene promoter regions [9].

Interestingly, the dog genome is known to have very high CpG content. It has previously been suggested that one potential mechanism by which this may have occurred is biased gene conversion, which can result in the preferential transmission of G-C alleles over A-T alleles in the vicinity of recombination events. To investigate this phenomenon, we also sequenced a related fox species, which allowed us to see if G-C alleles are being gained or lost around recombination hotspots. We see that dog recombination hotspots do indeed appear to be acquiring GC content. This could imply a runaway process, by which CpG-rich regions have become recombinogenic, and hence have started to acquire more GC content, and hence become more recombinogenic.

As such, our results show that recombination in the dog genome appears to have some quite interesting properties. However, questions remain. The loss of PRDM9 in dogs appears to have resulted in some qualitative features that are consistent with knock-out mice, and yet dogs somehow avoid the associated infertility. Perhaps canine meiosis manages to complete without a PRDM9 ortholog, or perhaps an as-yet-unknown gene in the dog genome has adopted the role of PRDM9. In either case, the investigation of recombination in dogs provides a valuable means for building our understanding of how recombination occurs and its importance in shaping the genome.

1. Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327: 836-840.
2. Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327: 876-879.
3. Parvanov ED, Petkov PM, Paigen K (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327: 835.
4. Flachs P, Mihola O, Simecek P, Gregorova S, Schimenti JC, et al. (2012) Interallelic and intergenic incompatibilities of the Prdm9 (Hst1) gene in mouse hybrid sterility. PLoS Genet 8: e1003044.
5. Mihola O, Trachtulec Z, Vlcek C, Schimenti JC, Forejt J (2009) A mouse speciation gene encodes a meiotic histone H3 methyltransferase. Science 323: 373-375.
6. Oliver PL, Goodstadt L, Bayes JJ, Birtle Z, Roach KC, et al. (2009) Accelerated evolution of the Prdm9 speciation gene across diverse metazoan taxa. PLoS Genet 5: e1000753.
7. Wong AK, Ruhe AL, Dumont BL, Robertson KR, Guerrero G, et al. (2010) A comprehensive linkage map of the dog genome. Genetics 184: 595-605.
8. Axelsson E, Webster MT, Ratnakumar A, Ponting CP, Lindblad-Toh K (2012) Death of PRDM9 coincides with stabilization of the recombination landscape in the dog genome. Genome Res 22: 51-63.
9. Brick K, Smagulova F, Khil P, Camerini-Otero RD, Petukhova GV (2012) Genetic recombination is directed away from functional genomic elements in mice. Nature 485: 642-645.

Genetic recombination is targeted towards gene promoter regions in dogs

Adam Auton, Ying Rui Li, Jeffrey Kidd, Kyle Oliveira, Julie Nadel, J. Kim Holloway, Jessica J. Howard, Paula E. Cohen, John M. Greally, Jun Wang, Carlos D. Bustamante, Adam R. Boyko
(Submitted on 28 May 2013)

The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. Compared to genetic maps obtained in other mammalian species, the canine map is notably different at the fine scale. While broad-scale patterns exhibit typical properties, our fine-scale estimates indicate that recombination is more uniformly distributed than has been observed in other mammalian species. In addition, highly elevated recombination rates are observed in the vicinity of CpG-rich regions including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. Finally, by comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred.

Agreeing to disagree, some ironies, disappointing scientific practice and a call for better: reply to The poor performance of TMM on microRNA-Seq

Mark D. Robinson
(Submitted on 27 May 2013)

This letter is a response to a Divergent Views article entitled “The poor performance of TMM on microRNA-Seq” (Garmire and Subramaniam 2013), which was a response to our Divergent Views article entitled “miRNA-seq normalization comparisons need improvement” (Zhou et al. 2013). Using reproducible code examples, we showed that they incorrectly used our normalization method and highlighted additional concerns with their study. Here, I wish to debunk several untrue or misleading statements made by the authors (hereafter referred to as GS) in their response. Unlike GS’s, my claims are supported by R code, citations and email correspondence. I finish by making a call for better practice.

The common ancestor process revisited

Sandra Kluth, Thiemo Hustedt, Ellen Baake
(Submitted on 25 May 2013)

We consider the Moran model in continuous time with two types, mutation, and selection. We concentrate on the ancestral line and its stationary type distribution. Building on work by Fearnhead (J. Appl. Prob. 39 (2002), 38-54) and Taylor (Electron. J. Probab. 12 (2007), 808-847), we characterise this distribution via the fixation probability of the offspring of all individuals of favourable type (regardless of the offspring’s types). We concentrate on a finite population and stay with the resulting discrete setting all the way through. This way, we extend previous results and gain new insight into the underlying particle picture.

Statistical properties of the site-frequency spectrum associated with Lambda-coalescents

Matthias Birkner, Jochen Blath, Bjarki Eldon
(Submitted on 26 May 2013)

Statistical properties of the site frequency spectrum associated with Lambda-coalescents are our objects of study. In particular, we derive recursions for the expected value, variance, and covariance of the spectrum, extending earlier results of Fu (1995) for the classical Kingman coalescent. Our focus is on estimating coalescent parameters introduced by certain Lambda-coalescents for datasets too large for full likelihood methods. The recursions for the expected values we obtain can be used to find the parameter values which give the best fit to the observed frequency spectrum. The expected values are also used to approximate the probability that a (derived) mutation arises on a branch subtending a given number of leaves (DNA sequences), allowing us to apply a pseudo-likelihood inference to estimate coalescence parameters associated with certain subclasses of Lambda-coalescents. The properties of the pseudo-likelihood approach are investigated on real and simulated datasets. Our results for two subclasses of Lambda-coalescents show that one can distinguish these subclasses from the Kingman coalescent, as well as between the Lambda-subclasses. In addition, our results yield further support for multiple merger coalescents as an appropriate ‘null’ model at the mitochondrial DNA level for high-fecundity Atlantic cod (Gadus morhua).
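
As a point of reference for the Kingman baseline mentioned in this abstract, here is a minimal sketch of the unfolded site frequency spectrum and its classical expectation E[ξ_i] = θ/i for a sample of n sequences. It does not implement the Lambda-coalescent recursions or the pseudo-likelihood method of the paper. The function names and the input format (one derived-allele count per segregating site) are assumptions made for this example.

```python
def observed_sfs(derived_counts, n):
    """Tabulate the unfolded site frequency spectrum.

    derived_counts holds, for each segregating site, the number of
    sampled sequences (out of n) carrying the derived allele.
    Entry i - 1 of the result counts sites with i derived copies.
    """
    sfs = [0] * (n - 1)
    for count in derived_counts:
        if not 1 <= count <= n - 1:
            raise ValueError("derived count must lie between 1 and n - 1")
        sfs[count - 1] += 1
    return sfs


def kingman_expected_sfs(theta, n):
    """Expected spectrum under the Kingman coalescent: theta / i."""
    return [theta / i for i in range(1, n)]


def watterson_theta(sfs):
    """Watterson's estimate of theta from the number of segregating sites."""
    n = len(sfs) + 1
    harmonic = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / harmonic


if __name__ == "__main__":
    sfs = observed_sfs([1, 1, 2, 3], n=4)
    theta_hat = watterson_theta(sfs)
    print(sfs, theta_hat, kingman_expected_sfs(theta_hat, 4))
```

Comparing the observed spectrum with the Kingman expectation at the estimated θ is one simple way to see departures from the Kingman model. The paper's recursions supply the corresponding expectations for Lambda-coalescents.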