Our paper: Improving transcriptome assembly through error correction of high-throughput sequence reads

This guest post is by Matt MacManes on his preprint with Michael Eisen, “Improving transcriptome assembly through error correction of high-throughput sequence reads”, arXived here. This is cross-posted from his blog.

I am writing this blog post in support of a paper that I have just submitted to arXiv: Improving transcriptome assembly through error correction of high-throughput sequence reads. My goal is not to talk about the nuts and bolts of the paper so much as it is to ramble about its motivation and the writing process.

First, a little bit about me, as this is my first paper with my postdoctoral advisor, Mike Eisen. In short, I am an evolutionary biologist by training, having done my PhD on the relationship between mating systems and immunogenes in wild rodents. My postdoc work focuses on adaptation to desert life: I work on Peromyscus rodents in the Southern California deserts, combining field work and genomics. My overarching goal is to operate across multiple domains (genomics, field biology, evolutionary biology) in order to better understand basic questions: the links between genotype and phenotype, adaptation, and so on. OK, enough about me; on to the paper.

Abstract:

The study of functional genomics, particularly in non-model organisms, has been dramatically improved over the last few years by the use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure, the de novo construction of a reference transcriptome, must be completed as a prerequisite to further analyses. An accurate reference is critically important, as all downstream steps, including estimating transcript abundance, depend on it. Though a substantial amount of research has been done on assembly, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on, and while they have been shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, we show, using a simulated dataset, that applying error correction to sequencing reads has significant positive effects on assembly accuracy, reducing assembly error by nearly 50%, and therefore should be applied to all datasets.

For the past couple of years, I have had an interest in better understanding the dynamics of de novo transcriptome assembly. I had mostly selfish/practical reasons for wanting to understand it: a large amount of my work depends on getting these assemblies ‘right’. It quickly became evident that most of the computational research is directed at assembly itself, and very little at the pre- and post-assembly processes. We know these steps are important, but an understanding of their effects is often lacking.

How error correction of sequencing reads affects assembly accuracy is one of the specific questions I have been thinking about for the past several months. The idea of simulating RNAseq reads, applying various error correction tools, and then measuring their effects is so logical that I was really surprised it had not been done before. So off I went.
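One nice property of the simulation approach: because the true sequence of every simulated read is known, a corrector can be scored base by base. Here is a minimal sketch of that bookkeeping in Python; it is my own illustration rather than code from the paper, and the three read-for-read-ordered FASTA files (truth, raw, corrected) are assumed inputs.

```python
def read_fasta(path):
    """Yield sequences from a FASTA file, in file order."""
    seq = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            else:
                seq.append(line)
    if seq:
        yield "".join(seq)

def tally(truth_fa, raw_fa, corrected_fa):
    """Classify every base by what correction did to it. Assumes
    substitution-only errors, so all three copies of a read have
    the same length and stay aligned position by position."""
    counts = {"fixed": 0, "missed": 0, "introduced": 0, "correct": 0}
    for true_s, raw_s, cor_s in zip(read_fasta(truth_fa),
                                    read_fasta(raw_fa),
                                    read_fasta(corrected_fa)):
        for t, r, c in zip(true_s, raw_s, cor_s):
            if r != t and c == t:
                counts["fixed"] += 1       # sequencing error removed
            elif r != t:
                counts["missed"] += 1      # error survived correction
            elif c != t:
                counts["introduced"] += 1  # correction created a new error
            else:
                counts["correct"] += 1     # base was and stayed correct
    return counts

# Example (hypothetical file names):
# print(tally("truth.fasta", "raw_reads.fasta", "corrected_reads.fasta"))
```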

I wrote this paper over the course of a couple of weeks. It is a short and simple paper, and was quite easy to write. Of note, about 75% of the paper was written on the playground in the UC Berkeley University Village, while (loosely) supervising my two youngest daughters. How is that for work-life balance!

The read data will be available on Figshare, and I owe thanks to those guys for lifting the upload limit: the read file is 2.6 GB with .bz2 compression, so not huge, but not small either. The winning (AllPathsLG-corrected) assembly is there as well.

This type of work is inspired, in a very real sense, by C. Titus Brown, who is quickly becoming the go-to guy for understanding the nuts and bolts of genome assembly (and who also got tenure based on his Klout score, HA!). His post and paper on the challenges of mRNAseq analysis are the type of work that I aspire to.

Anyway, I’d be really interested in hearing what you all think of the paper, so read, enjoy, comment, and get to error correcting those reads!


2 thoughts on “Our paper: Improving transcriptome assembly through error correction of high-throughput sequence reads”


    Coolio. This is an interesting situation for me: I thought I knew enough about error correction to tell you that genomic error correction like Quake simply wouldn’t work on mRNAseq, so I never bothered running it. (We’re developing our own techniques to deal with both metagenomic and mRNAseq error correction, based on ivory.idyll.org/blog/error-detection-for-metagenome-and-mrnaseq.html.) So this is neat to read, because it at least seems to work at some useful level.

    I particularly like the discussion of the runtime constraints, because, as you know, I like things that can run in a reasonable amount of time!

    Three concerns that I think should be addressed in the paper:

    1) mRNAseq has variable coverage, which should throw off the coverage estimates critical to most error correction algorithms. I would expect to see both miscorrected reads and uncorrected reads resulting from this. Do you see high-coverage reads that were uncorrected or miscorrected, and do you see low-coverage reads that were corrected? (A sketch of the coverage-cutoff intuition follows this list.)

    2) Splice variants. I would expect error correction to mess with splice junctions where you have a high-coverage exon joined to both another high-coverage exon OR a low-coverage exon, because those junctions should look like errors. I haven’t thought about how to analyze this, so it would be a real contribution if you could figure that out for me 🙂

    3) There is an mRNAseq error correction algorithm (SEECER). What happens when you run that? It may not be enough to say “we couldn’t get it to run” since the authors of it are virtually guaranteed to be reviewers on your paper!
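    To unpack the coverage-cutoff intuition behind concern 1: most genomic error correctors treat k-mers observed fewer than some threshold number of times as sequencing errors, which works when coverage is roughly uniform. On mRNAseq, the true k-mers of a lowly expressed transcript can legitimately sit below that threshold. A minimal sketch in Python (my own illustration, not code from the paper or from any particular corrector; k and the cutoff are arbitrary):

    ```python
    from collections import Counter

    K = 17       # k-mer size; arbitrary choice for illustration
    CUTOFF = 3   # hypothetical "trusted k-mer" count threshold

    def kmers(seq, k=K):
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    def kmer_counts(reads):
        """Count every k-mer across a collection of reads."""
        counts = Counter()
        for read in reads:
            counts.update(kmers(read))
        return counts

    def looks_erroneous(read, counts, cutoff=CUTOFF):
        """A fixed-threshold corrector flags any read containing a rare
        k-mer. On mRNAseq, a read from a lowly expressed transcript is
        flagged the same way as a read with a genuine sequencing error."""
        return any(counts[km] < cutoff for km in kmers(read))
    ```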

    Finally, I would expect digital normalization to be a really effective preprocessor that might *improve* your results because it smooths out the coverage. Just a thought & not something I would ask you to run even if I were a reviewer, but maybe a good idea anyway.
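    For anyone unfamiliar with digital normalization: the idea (from the khmer work referenced above) is to stream through the reads and accept one only if the median count of its k-mers, among reads accepted so far, is still below a target coverage, which flattens coverage peaks before correction or assembly. A rough sketch with illustrative parameter values; a real implementation would use a compact counting structure rather than an exact Counter:

    ```python
    from collections import Counter
    from statistics import median

    def normalize(reads, k=17, target=20):
        """Streaming digital normalization: keep a read only while the
        median count of its k-mers (among reads kept so far) is below
        the target coverage; reads from already well-covered regions
        are dropped."""
        counts = Counter()
        for read in reads:
            kms = [read[i:i + k] for i in range(len(read) - k + 1)]
            if not kms:
                continue  # read shorter than k
            if median(counts[km] for km in kms) < target:
                counts.update(kms)  # record the accepted read's k-mers
                yield read
    ```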
