RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates
Alexis Black Pyrkosz, Hans Cheng, C. Titus Brown
(Submitted on 11 Mar 2013)
Whole transcriptome sequencing is increasingly being used as a functional genomics tool to study non-model organisms. However, when the reference transcriptome used to calculate differential expression is incomplete, significant error in the inferred expression levels can result. In this study, we use simulated reads generated from real transcriptomes to determine the accuracy of read mapping, and measure the error resulting from using an incomplete transcriptome. We show that the two primary sources of counting error are 1) alternative splice variants that share reads and 2) missing transcripts from the reference. Alternative splice variants increase the false positive rate of mapping while incomplete reference transcriptomes decrease the true positive rate, leading to inaccurate transcript expression levels. Grouping transcripts by gene or read sharing (similar to mapping to a reference genome) significantly decreases false positives, but only by improving the reference transcriptome itself can the missing transcript problem be addressed. We also demonstrate that employing different mapping software does not yield substantial increases in accuracy on simulated data. Finally, we show that read lengths or insert sizes must increase past 1kb to resolve mapping ambiguity.
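To make the two error sources in the abstract concrete, here is a toy sketch in Python. It is not the authors' pipeline: the transcript names, the reads, and the helper functions `count_hits` and `restrict_to_reference` are all hypothetical, and the "mapper" is just a hand-written set of hits per read. It only illustrates how shared reads between splice variants produce false positives at the transcript level, how grouping by gene removes them, and how a missing transcript lowers the true positive count.

```python
# Hypothetical toy data: which gene each transcript belongs to.
TRANSCRIPT_TO_GENE = {
    "geneA.t1": "geneA",
    "geneA.t2": "geneA",
    "geneB.t1": "geneB",
}

# Each simulated read: (true source transcript, transcripts a mapper reports).
# Reads falling in an exon shared by geneA.t1 and geneA.t2 hit both variants.
READS = [
    ("geneA.t1", {"geneA.t1", "geneA.t2"}),
    ("geneA.t1", {"geneA.t1"}),
    ("geneA.t2", {"geneA.t1", "geneA.t2"}),
    ("geneB.t1", {"geneB.t1"}),
]


def count_hits(reads, key=lambda transcript: transcript):
    """Count (true_positives, false_positives) over read-to-target assignments.

    `key` maps a transcript to the unit being counted (the transcript itself
    by default, or its gene when grouping by gene).
    """
    tp = fp = 0
    for source, hits in reads:
        true_target = key(source)
        for target in {key(t) for t in hits}:
            if target == true_target:
                tp += 1
            else:
                fp += 1
    return tp, fp


def restrict_to_reference(reads, reference):
    """Keep only hits to transcripts present in an (incomplete) reference."""
    return [(source, hits & reference) for source, hits in reads]


# Transcript level: shared reads are credited to the wrong splice variant too.
transcript_level = count_hits(READS)

# Gene level: both variants collapse to geneA, so the ambiguity disappears.
gene_level = count_hits(READS, key=TRANSCRIPT_TO_GENE.get)

# Incomplete reference: geneA.t2 is missing, so its read can only be
# assigned to the sibling transcript geneA.t1.
incomplete = restrict_to_reference(READS, {"geneA.t1", "geneB.t1"})
incomplete_transcript_level = count_hits(incomplete)

print("transcript level (TP, FP):", transcript_level)
print("gene level (TP, FP):", gene_level)
print("missing geneA.t2 (TP, FP):", incomplete_transcript_level)
```

In this toy setup, grouping by gene removes the false positives caused by the shared exon, but nothing short of adding geneA.t2 back to the reference recovers the read that came from it, which mirrors the abstract's point at a very small scale.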
My initial knee-jerk reaction to this paper is that you must be careful what you call the “reference transcriptome”. There are of course many – Ensembl gene sets are used in this paper, but other reference transcriptomes exist including RefSeq and UniGene. Just because a transcriptome is incomplete in Ensembl, it doesn’t mean it is incomplete everywhere else – in many cases, ESTs and full length cDNAs exist for genes that are not in the assembled genome, or that are not assembled well. A good example is the IFITM locus in pigs – IFITM1 is poorly assembled, and IFITM2 and IFITM3 are missing or fragmented in the genome – however, full length cDNAs exist in GenBank for all three genes.
The overall message of the paper is quite clear, and it is also a well recognised problem (by bioinformaticians) – many papers exist on methods for handling multi-mapped reads, for example http://bioinformatics.oxfordjournals.org/content/25/19/2613.full
If I were peer reviewing this paper, I’d definitely bring up the “but what about EST/cDNA evidence that is not in the genome” issue. I don’t actually think the overall message will change – IFITM1, 2 and 3 are 90%+ identical over long regions, so obtaining accurate measures of expression is difficult with short sequence reads – but the paper may need a section on using cDNA/EST evidence rather than annotated genome evidence.