SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads

Yinlong Xie, Gengxiong Wu, Jingbo Tang, Ruibang Luo, Jordan Patterson, Shanlin Liu, Weihua Huang, Guangzhu He, Shengchang Gu, Shengkang Li, Xin Zhou, Tak-Wah Lam, Yingrui Li, Xun Xu, Gane Ka-Shu Wong, Jun Wang
(Submitted on 29 May 2013)

Motivation: Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining the sequences of many (but not all) of the genes from an organism with no reference genome. With the rapidly increasing throughputs and decreasing costs of next generation sequencing, RNA-Seq has gained in popularity; but given the short reads (e.g. 2 × 90 bp paired ends), de novo assembly to recover complete full length gene sequences remains an algorithmic challenge.
Results: We present SOAPdenovo-Trans, a de novo transcriptome assembler designed specifically for RNA-Seq. Its performance was evaluated on 2Gb and 5Gb of transcriptome data from mouse and rice. Using the known transcripts from these two well-annotated genomes as a benchmark, we assessed how SOAPdenovo-Trans and other competing software handle the practical issues of alternative splicing and variable expression levels. Compared with other de novo transcriptome assemblers, SOAPdenovo-Trans provides higher contiguity, lower redundancy, and faster execution.

Stochastic gene expression with delay

Martin Jansen, Peter Pfaffelhuber
(Submitted on 28 May 2013)

The expression of genes usually follows a two-step procedure. First, a gene (encoded in the genome) is transcribed, resulting in a strand of (messenger) RNA. Afterwards, the RNA is translated into protein. Classically, this gene expression is modeled using a Markov jump process including activation and deactivation of the gene, transcription and translation rates together with degradation of RNA and protein. We extend this model by adding delays (with arbitrary distributions) to transcription and translation. Such delays can arise, for example, when RNA has to be transported to a different part of a cell before translation can be initiated. Already in the classical model, production of RNA and protein comes in bursts by activation and deactivation of the gene, resulting in a large variance of the number of RNA and proteins in equilibrium. We derive precise formulas for this second-order structure with the model including delay in equilibrium. As a general fact, the delay decreases the variance of the number of RNA and proteins.

Our paper: Genetic recombination is targeted towards gene promoter regions in dogs

This guest post is by Adam Auton (@adamauton) on his paper (along with coauthors) Genetic recombination is targeted towards gene promoter regions in dogs arXived here.

In this paper, we investigate the age-old question of how meiotic recombination is distributed in the genome of dogs. Before you stop reading, I’d like to spend a couple of paragraphs explaining why this is an interesting topic.

Recombination in mammalian genomes tends to occur in highly localized regions known as recombination hotspots. There are probably about 30,000 or so recombination hotspots in the human genome, each of which is about 2kb wide with recombination rates that can be thousands of times that of the surrounding region. Until a few years ago, the mechanism by which recombination hotspots are localized was largely unknown. This all began to change with the discovery of PRDM9 as the gene responsible for localizing hotspots [1-3]. The role of PRDM9 is to recognize and bind to specific DNA motifs in the genome, which are subsequently epigenetically marked as preferred locations of recombination.

PRDM9 turns out to be quite a fascinating gene. There is extensive variation in PRDM9 both within and across species, which points to strong selective pressures. Importantly, variation in PRDM9 can alter the recognized DNA motifs, thereby altering the locations of recombination hotspots in the genome. The high level of variation in PRDM9 between species appears to explain why recombination hotspots tend to not be shared between even closely related species, such as human and chimpanzees.

We’ve learnt much about the importance of PRDM9 from studies in mice. Knock-out of Prdm9 in mice results in infertility and, most interestingly of all, certain alleles of mouse Prdm9 appear to be incompatible with each other [4,5]. Specifically, Mus m. musculus / Mus m. domesticus hybrid male mice are infertile if they are heterozygous for specific Prdm9 alleles. As such, Prdm9 has been called a ‘speciation gene’, as it has the potential to restrict gene flow between nascent species, and is the only known such example in mammals.

Given this importance, it was therefore surprising to note that dogs, uniquely amongst mammals, appear to carry a dysfunctional version of PRDM9 [6]. This therefore begs the question of how recombination occurs in dogs, and provides the motivation for our paper.

Estimating recombination rates directly is challenging and costly, as only a few dozen events occur during any given meiosis. Therefore, to characterize large numbers of recombination events on a genome-wide basis, large pedigrees need to be genotyped, which can be both laborious and costly to do in non-model organisms. Luckily, an experiment of this nature has been previously performed in dogs, which revealed a recombination landscape that was reasonably consistent with patterns observed in other mammals [7].

However, without enormous sample sizes, such methods can only investigate patterns at scales far greater than the scale of individual hotspots. In order to investigate fine-scale patterns on a genome-wide basis, one must turn to indirect statistical methods, and it is this approach that we have adopted in our study. First, we whole-genome sequenced a collection of 51 outbred dogs and used these data to call single nucleotide polymorphisms. Having done so, we used the statistical method LDhat, which infers historical recombination rates via analysis of patterns of linkage disequilibrium. This is similar to the approach adopted by Axelsson et al. [8], who used microarrays to gain strong insights into canine recombination, although our use of sequencing allows us to investigate patterns at a much finer scale.
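To make "patterns of linkage disequilibrium" concrete, here is a minimal sketch of the basic pairwise LD statistic r², computed from phased haplotypes. This is not LDhat's method (LDhat fits recombination rates with a coalescent-based likelihood approach); the function name and input layout are assumptions chosen for illustration only.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per
    phased haplotype (hypothetical input layout for this sketch).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("need two non-empty allele lists of equal length")
    p_a = sum(hap_a) / n                                   # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                                   # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Two perfectly associated SNPs, then two independent ones.
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

Roughly speaking, pairs of nearby SNPs with low r² are what one would expect where historical recombination has been frequent, which is the signal that LD-based methods exploit.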

Our results agree nicely with the broad-scale experimental estimates, but reveal a quite unusual landscape at the fine scale. In particular, we find that canine recombination is strongly enriched in regions with high CpG content. As such, recombination rates are very high around the CpG-rich regions associated with gene promoters, which contrasts with other mammalian species in which recombination hotspots do not show any particularly strong affinity for gene promoter regions. However, it is also reminiscent of patterns seen in Prdm9 knock-out mice which, although infertile, still produce double-strand breaks that cluster in gene promoter regions [9].

Interestingly, the dog genome is known to have very high CpG content. It has previously been suggested that one potential mechanism by which this may have occurred is biased gene conversion, which can result in the preferential transmission of G-C alleles over A-T alleles in the vicinity of recombination events. To investigate this phenomenon, we also sequenced a related fox species, which allowed us to see if G-C alleles are being gained or lost around recombination hotspots. We see that dog recombination hotspots do indeed appear to be acquiring GC content. This could imply a runaway process, by which CpG-rich regions have become recombinogenic, and hence have started to acquire more GC content, and hence become more recombinogenic.

As such, our results show that recombination in the dog genome appears to have some quite interesting properties. However, questions remain. The loss of PRDM9 in dogs appears to have resulted in some qualitative features that are consistent with knock-out mice, and yet dogs somehow avoid the associated infertility. Perhaps canine meiosis manages to complete without a PRDM9 ortholog, or perhaps an as-yet-unknown gene in the dog genome has adopted the role of PRDM9. In either case, the investigation of recombination in dogs provides a valuable means for building our understanding of how recombination occurs and its importance in shaping the genome.

1. Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327: 836-840.
2. Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327: 876-879.
3. Parvanov ED, Petkov PM, Paigen K (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327: 835.
4. Flachs P, Mihola O, Simecek P, Gregorova S, Schimenti JC, et al. (2012) Interallelic and intergenic incompatibilities of the Prdm9 (Hst1) gene in mouse hybrid sterility. PLoS Genet 8: e1003044.
5. Mihola O, Trachtulec Z, Vlcek C, Schimenti JC, Forejt J (2009) A mouse speciation gene encodes a meiotic histone H3 methyltransferase. Science 323: 373-375.
6. Oliver PL, Goodstadt L, Bayes JJ, Birtle Z, Roach KC, et al. (2009) Accelerated evolution of the Prdm9 speciation gene across diverse metazoan taxa. PLoS Genet 5: e1000753.
7. Wong AK, Ruhe AL, Dumont BL, Robertson KR, Guerrero G, et al. (2010) A comprehensive linkage map of the dog genome. Genetics 184: 595-605.
8. Axelsson E, Webster MT, Ratnakumar A, Ponting CP, Lindblad-Toh K (2012) Death of PRDM9 coincides with stabilization of the recombination landscape in the dog genome. Genome Res 22: 51-63.
9. Brick K, Smagulova F, Khil P, Camerini-Otero RD, Petukhova GV (2012) Genetic recombination is directed away from functional genomic elements in mice. Nature 485: 642-645.

Genetic recombination is targeted towards gene promoter regions in dogs

Adam Auton, Ying Rui Li, Jeffrey Kidd, Kyle Oliveira, Julie Nadel, J. Kim Holloway, Jessica J. Howard, Paula E. Cohen, John M. Greally, Jun Wang, Carlos D. Bustamante, Adam R. Boyko
(Submitted on 28 May 2013)

The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. Compared to genetic maps obtained in other mammalian species, the canine map is notably different at the fine-scale. While broad-scale patterns exhibit typical properties, our fine-scale estimates indicate that recombination is more uniformly distributed than has been observed in other mammalian species. In addition, highly elevated recombination rates are observed in the vicinity of CpG rich regions including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. Finally, by comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred.

Agreeing to disagree, some ironies, disappointing scientific practice and a call for better: reply to The poor performance of TMM on microRNA-Seq

Mark D. Robinson
(Submitted on 27 May 2013)

This letter is a response to a Divergent Views article entitled “The poor performance of TMM on microRNA-Seq” (Garmire and Subramaniam 2013), which was a response to our Divergent Views article entitled “miRNA-seq normalization comparisons need improvement” (Zhou et al. 2013). Using reproducible code examples, we showed that they incorrectly used our normalization method and highlighted additional concerns with their study. Here, I wish to debunk several untrue or misleading statements made by the authors (hereafter referred to as GS) in their response. Unlike those of GS, my claims are supported by R code, citations and email correspondence. I finish by making a call for better practice.

The common ancestor process revisited

Sandra Kluth, Thiemo Hustedt, Ellen Baake
(Submitted on 25 May 2013)

We consider the Moran model in continuous time with two types, mutation, and selection. We concentrate on the ancestral line and its stationary type distribution. Building on work by Fearnhead (J. Appl. Prob. 39 (2002), 38-54) and Taylor (Electron. J. Probab. 12 (2007), 808-847), we characterise this distribution via the fixation probability of the offspring of all individuals of favourable type (regardless of the offsprings’ types). We concentrate on a finite population and stay with the resulting discrete setting all the way through. This way, we extend previous results and gain new insight into the underlying particle picture.

Statistical properties of the site-frequency spectrum associated with Lambda-coalescents

Matthias Birkner, Jochen Blath, Bjarki Eldon
(Submitted on 26 May 2013)

Statistical properties of the site frequency spectrum associated with Lambda-coalescents are our objects of study. In particular, we derive recursions for the expected value, variance, and covariance of the spectrum, extending earlier results of Fu (1995) for the classical Kingman coalescent. Our focus is on estimating coalescent parameters introduced by certain Lambda-coalescents for datasets too large for full likelihood methods. The recursions for the expected values we obtain can be used to find the parameter values which give the best fit to the observed frequency spectrum. The expected values are also used to approximate the probability that a (derived) mutation arises on a branch subtending a given number of leaves (DNA sequences), allowing us to apply a pseudo-likelihood inference to estimate coalescence parameters associated with certain subclasses of Lambda-coalescents. The properties of the pseudo-likelihood approach are investigated on real and simulated datasets. Our results for two subclasses of Lambda-coalescents show that one can distinguish these subclasses from the Kingman coalescent, as well as between the Lambda-subclasses. In addition, our results yield further support for multiple merger coalescents as an appropriate ‘null’ model at the mitochondrial DNA level for high-fecundity Atlantic cod (Gadus morhua).
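For readers unfamiliar with the site frequency spectrum, the sketch below tabulates an unfolded spectrum from a small 0/1 haplotype matrix and compares it with the classical Kingman expectation E[ξ_i] = θ/i (the baseline result of Fu 1995 that the abstract says is being extended). The Lambda-coalescent recursions from the paper are not reproduced here; the function names and the input layout are assumptions for illustration.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry i-1 counts sites whose derived allele occurs i times.

    haplotypes: list of n equal-length lists of 0/1 values, where 1 marks
    the derived allele (hypothetical input layout for this sketch).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sequences")
    n_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(n_sites):
        count = sum(h[site] for h in haplotypes)
        if 0 < count < n:          # skip sites that are not segregating
            sfs[count - 1] += 1
    return sfs


def kingman_expected_sfs(n, theta):
    """Expected unfolded SFS under the Kingman coalescent: theta / i for i = 1..n-1."""
    return [theta / i for i in range(1, n)]


sample = [
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
print(site_frequency_spectrum(sample))   # observed counts by derived-allele frequency class
print(kingman_expected_sfs(4, 2.0))      # Kingman baseline for the same sample size
```

Comparing an observed spectrum against expected values like these, for different candidate coalescent models, is the general idea behind the fitting described in the abstract.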

Effect of Genetic Variation in a Drosophila Model of Diabetes-Associated Misfolded Human Proinsulin

Bin Z. He, Michael Z. Ludwig, Desiree A. Dickerson, Levi Barse, Bharath Arun, Soo Young Park, Natalia A. Tamarina, Scott B. Selleck, Patricia Wittkopp, Graeme I. Bell, Martin Kreitman
(Submitted on 23 May 2013)

The identification and validation of gene-gene interactions is a major challenge in human studies. Here, we explore an approach for studying epistasis in humans using a Drosophila melanogaster model of neonatal diabetes mellitus. Expression of mutant preproinsulin, hINSC96Y, in the eye imaginal disc mimics the human disease, activating conserved cell stress response pathways leading to cell death and reduction in eye area. Dominant-acting variants in wild-derived inbred lines from the Drosophila Genetics Reference Panel produce a continuous, highly heritable distribution of eye degeneration phenotypes. A genome-wide association study (GWAS) in 154 sequenced lines identified 29 candidate SNPs in 16 loci with P 7.62). RNAi knock-downs of sfl enhanced the eye degeneration phenotype in a mutant-hINS-dependent manner. sfl encodes a protein required for sulfation of the glycosaminoglycan, heparan sulfate. Two additional genes in the heparan sulfate (HS) biosynthetic pathway (tout velu, ttv and brother of tout velu, botv) also modified the eye phenotype, suggesting a link between HS-modified proteins and cellular responses to misfolded proteins. Finally, intronic variants marking the QTL were associated with decreased sfl expression, a result consistent with that predicted by RNAi studies. The ability to create a model of human genetic disease in the fly, map a QTL by GWAS to a specific gene (and noncoding variant), validate its contribution to disease with available genetic resources, and experimentally link the variant to a molecular mechanism demonstrates the many advantages Drosophila holds in determining the genetic underpinnings of human disease.

Narrowing the gap on heritability of common disease by direct estimation in case-control GWAS

David Golan, Saharon Rosset
(Submitted on 23 May 2013)

One of the major developments in recent years in the search for missing heritability of human phenotypes is the adoption of linear mixed-effects models (LMMs) to estimate heritability due to genetic variants which are not significantly associated with the phenotype. A variant of the LMM approach has been adapted to case-control studies and applied to many major diseases by Lee et al. (2011), successfully accounting for a considerable portion of the missing heritability. For example, for Crohn’s disease their estimated heritability was 22% compared to 50-60% from family studies. In this letter we propose to estimate heritability of disease directly by regression of phenotype similarities on genotype correlations, corrected to account for ascertainment. We refer to this method as genetic correlation regression (GCR). Using GCR we estimate the heritability of Crohn’s disease at 34% using the same data. We demonstrate through extensive simulation that our method yields unbiased heritability estimates, which are consistently higher than LMM estimates. Moreover, we develop a heuristic correction to LMM estimates, which can be applied to published LMM results. Applying our heuristic correction increases the estimated heritability of multiple sclerosis from 30% to 52.6%.
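The core idea of regressing phenotype similarities on genotype correlations can be sketched as follows. This is only the uncorrected, Haseman-Elston-style regression for standardized phenotypes; it omits the ascertainment correction that the abstract describes for case-control data, so it is not the authors' GCR estimator. The function name, the `grm` matrix layout, and the toy data are all assumptions for illustration.

```python
from itertools import combinations


def pairwise_regression_slope(grm, z):
    """OLS slope of phenotype products z_i * z_j on genetic correlations grm[i][j].

    grm: square matrix (list of lists) of pairwise genotype correlations.
    z:   standardized phenotypes, one per individual.
    Assumption: no ascertainment correction is applied in this sketch.
    """
    pairs = list(combinations(range(len(z)), 2))   # all unordered pairs i < j
    if len(pairs) < 2:
        raise ValueError("need at least three individuals")
    x = [grm[i][j] for i, j in pairs]              # genotype correlation per pair
    y = [z[i] * z[j] for i, j in pairs]            # phenotype similarity per pair
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("genotype correlations have no variance")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx                               # naive heritability-like estimate


# Toy data built so that each phenotype product is exactly half the genetic correlation.
z = [1.0, -1.0, 1.0, -1.0]
grm = [[2.0 * zi * zj for zj in z] for zi in z]
print(pairwise_regression_slope(grm, z))
```

In real data the genotype correlations between unrelated individuals are small and noisy, so many thousands of individuals are needed before a slope like this is stable.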

Our paper: The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

This guest post is by Detlef Weigel (@WeigelWorld) and Hernán A. Burbano on their arXived paper [with coauthors] Yoshida et al. The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine. arXived here and in press at eLife [to appear here].

This paper is the result of a great collaboration between a lab that specializes in ancient DNA (that of Johannes Krause from the University of Tübingen), an expert in pathogen systematics (the group of Marco Thines from the Senckenberg Museum and Goethe University in Frankfurt), two pathogen genomics labs (those of Sophien Kamoun from the Sainsbury Laboratory in Norwich and Frank Martin from the USDA in California), and our evolutionary genomics group at the Max Planck Institute in Tübingen (Hernán A. Burbano and Detlef Weigel).


Phytophthora infestans made history when it destroyed large parts of the European potato crop, beginning in 1845. Potato has its origin in the Andes, in the Southeast of modern Peru and Northwest of Bolivia, while the center of diversity of P. infestans is several thousand kilometers further north, in Mexico’s Toluca Valley. There, other Phytophthora species live on a broad range of host plants. At some point in its history, evolutionary events associated with repeat-driven genome expansion [1,2] endowed P. infestans with the genetic arsenal required to infect potato. The pathogen was introduced to Europe in 1845 via infected potato tuber from the United States, where potato blight had made its first appearance in 1843. In the ensuing European blight epidemic, Ireland was hit especially hard, because the virtual absence of independent farmers and a restrictive customs policy conspired with the disease caused by P. infestans, potato blight, to have disproportionately devastating effects. The Great Famine that struck Ireland was a decisive event in both European and American history. One million Irish died of starvation, and at least another million left the country – most of them to the USA.


This part of P. infestans history has been clear, but the relationship of the strain(s) that caused the nineteenth century epidemic to modern strains has been controversial. Before a range of genetically quite distinct P. infestans strains made their debut throughout the world some 40 years ago, the global population outside Mexico was dominated by a single strain, called US-1. Because of its prevalence, US-1 was long thought to have been the cause of the fatal outbreak in the nineteenth century. From the analysis of a single SNP in the mitochondrial genome, it was, however, concluded in 2001 that the nineteenth century strains were more closely related to the modern strains that prevail today [3].


In our new paper, we resolve this paradoxical view: While the historical pathogen strain, which we call HERB-1, indeed differs at this one position from US-1, which has a derived allele, HERB-1 is far more closely related to US-1 than to other modern strains. Molecular clock analyses show that both strains probably separated from each other only a few years before the major European outbreak. HERB-1 seems to have dominated the global population without many genetic changes, and only in the twentieth century, after new potato varieties were introduced, was HERB-1 replaced by US-1 as the most successful P. infestans strain. We do not know for sure why HERB-1 was replaced, but we noted that the modern strains tend to be polyploid, while HERB-1 was diploid. We speculate that the increased genetic diversity in polyploid lineages was important for the success of US-1 (and other modern strains).


Our conclusions are based on Illumina sequencing of 11 herbarium samples of infected potato and tomato leaves collected in Ireland, the UK, Continental Europe and North America and preserved in the herbaria of the Botanical State Collection Munich and the Kew Gardens in London. Both herbaria placed a great deal of confidence in our abilities and were very generous in providing the dried plants. The degree of DNA preservation in the herbarium samples was impressive, much higher than in other examples of ancient DNA, and the majority of recovered DNA was from the host plant, with some samples having in addition over 20% pathogen DNA. In contrast to recent studies of historic human pathogens, no target DNA enrichment was required. We compared the historic samples with modern strains from Europe, Africa and North and South America as well as two closely related Phytophthora species. Due to the 150-year long period over which the individual samples had been collected, we were able to estimate with great confidence when the various P. infestans strains had emerged during evolutionary time. Here, too, we found connections with historic events: the first contact between Europeans and Americans in Mexico falls exactly into the time window in which the genetic diversity of P. infestans experienced a remarkable increase. Presumably, the social upheaval following the arrival of the Europeans somehow led to a spread of the pathogen at the beginning of the sixteenth century, which in turn accelerated its evolution.


The historical HERB-1 type is so far not known from modern collections, but we now have many diagnostic markers with which we can type the hundreds of modern isolates to determine whether perhaps there is somewhere a reservoir of HERB-1. In addition, our work highlights that herbaria constitute a rich, so far untapped source for investigating real-time evolution.


Detlef Weigel,

Hernán A. Burbano,


Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany



1. Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393-398.
2. Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, et al. (2010) Genome evolution following host jumps in the Irish potato famine pathogen lineage. Science 330: 1540-1543.
3. Ristaino JB, Groves CT, Parra GR (2001) PCR amplification of the Irish potato famine pathogen from historic specimens. Nature 411: 695-697.