RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates
Alexis Black Pyrkosz, Hans Cheng, C. Titus Brown
(Submitted on 11 Mar 2013)
Whole transcriptome sequencing is increasingly being used as a functional genomics tool to study non-model organisms. However, when the reference transcriptome used to calculate differential expression is incomplete, significant error in the inferred expression levels can result. In this study, we use simulated reads generated from real transcriptomes to determine the accuracy of read mapping, and measure the error resulting from using an incomplete transcriptome. We show that the two primary sources of counting error are 1) alternative splice variants that share reads and 2) missing transcripts from the reference. Alternative splice variants increase the false positive rate of mapping while incomplete reference transcriptomes decrease the true positive rate, leading to inaccurate transcript expression levels. Grouping transcripts by gene or read sharing (similar to mapping to a reference genome) significantly decreases false positives, but only by improving the reference transcriptome itself can the missing transcript problem be addressed. We also demonstrate that employing different mapping software does not yield substantial increases in accuracy on simulated data. Finally, we show that read lengths or insert sizes must increase past 1 kb to resolve mapping ambiguity.
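The two error sources named in the abstract can be illustrated with a toy tally. The sketch below is not the authors' pipeline: the transcript and gene names, the read list, and the `score` helper are all hypothetical. It assumes that a read assigned to the wrong feature counts as a false positive and that an unmapped read counts as a false negative.

```python
# Hypothetical toy data: two splice variants of gene G1 and one transcript of G2.
TRANSCRIPT_TO_GENE = {"T1a": "G1", "T1b": "G1", "T2": "G2"}

# Each simulated read: (true transcript of origin, transcript reported by the mapper).
# None means the read could not be mapped.
READS = [
    ("T1a", "T1a"),
    ("T1a", "T1b"),  # read shared between splice variants, assigned to the wrong one
    ("T1b", "T1b"),
    ("T1b", "T1a"),  # same ambiguity in the other direction
    ("T2", "T2"),
    ("T3", None),    # T3 is absent from the reference, so its read is lost
]


def gene_of(transcript):
    """Collapse a transcript to its gene; unknown transcripts stay as-is."""
    return TRANSCRIPT_TO_GENE.get(transcript, transcript)


def score(reads, key=lambda t: t):
    """Tally mapping outcomes after collapsing names with `key`."""
    counts = {"tp": 0, "fp": 0, "fn": 0}
    for true_t, mapped_t in reads:
        if mapped_t is None:
            counts["fn"] += 1          # unmapped read
        elif key(true_t) == key(mapped_t):
            counts["tp"] += 1          # assigned to the correct feature
        else:
            counts["fp"] += 1          # assigned to the wrong feature
    return counts


transcript_level = score(READS)
gene_level = score(READS, key=gene_of)
print("transcript level:", transcript_level)
print("gene level:      ", gene_level)
```

In this toy case, grouping by gene turns the two splice-variant misassignments into correct gene-level counts, while the read from the missing transcript stays unmapped under either scoring. This mirrors the abstract's point that only a better reference addresses the missing-transcript problem.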