RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates
Alexis Black Pyrkosz, Hans Cheng, C. Titus Brown
(Submitted on 11 Mar 2013)
Whole transcriptome sequencing is increasingly being used as a functional genomics tool to study non-model organisms. However, when the reference transcriptome used to calculate differential expression is incomplete, significant error in the inferred expression levels can result. In this study, we use simulated reads generated from real transcriptomes to determine the accuracy of read mapping, and measure the error resulting from using an incomplete transcriptome. We show that the two primary sources of counting error are 1) alternative splice variants that share reads and 2) missing transcripts from the reference. Alternative splice variants increase the false positive rate of mapping while incomplete reference transcriptomes decrease the true positive rate, leading to inaccurate transcript expression levels. Grouping transcripts by gene or read sharing (similar to mapping to a reference genome) significantly decreases false positives, but only by improving the reference transcriptome itself can the missing transcript problem be addressed. We also demonstrate that employing different mapping software does not yield substantial increases in accuracy on simulated data. Finally, we show that read lengths or insert sizes must increase past 1kb to resolve mapping ambiguity.
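The effect of grouping transcripts by gene can be illustrated with a toy calculation. The sketch below is not the authors' pipeline: the transcript names, the read assignments, and the per-read definitions of true and false positive rate are all assumptions made for illustration. It shows how reads swapped between splice variants of the same gene count as false positives at the transcript level but not at the gene level, while an unmapped read (standing in for a missing transcript) lowers the true positive rate either way.

```python
def mapping_rates(true_labels, mapped_labels):
    """Return (true_positive_rate, false_positive_rate) as fractions of all reads.

    Assumed definitions: a read is a true positive if it maps to the label it
    came from, a false positive if it maps to any other label, and neither
    if it is unmapped (None).
    """
    n = len(true_labels)
    pairs = list(zip(true_labels, mapped_labels))
    tp = sum(1 for t, m in pairs if m is not None and m == t)
    fp = sum(1 for t, m in pairs if m is not None and m != t)
    return tp / n, fp / n


# Hypothetical toy data: gene A has two splice variants, gene B has one.
transcript_to_gene = {"A.1": "A", "A.2": "A", "B.1": "B"}

# Origin of each simulated read and where a mapper placed it.
true_tx = ["A.1", "A.1", "A.2", "A.2", "B.1", "B.1"]
mapped_tx = ["A.1", "A.2", "A.2", "A.1", "B.1", None]


def to_gene(labels):
    """Collapse transcript labels to gene labels, keeping None for unmapped reads."""
    return [transcript_to_gene[x] if x is not None else None for x in labels]


tx_tpr, tx_fpr = mapping_rates(true_tx, mapped_tx)
gene_tpr, gene_fpr = mapping_rates(to_gene(true_tx), to_gene(mapped_tx))

print(f"transcript level: TPR={tx_tpr:.2f} FPR={tx_fpr:.2f}")
print(f"gene level:       TPR={gene_tpr:.2f} FPR={gene_fpr:.2f}")
```

In this toy case the two reads swapped between A.1 and A.2 stop being errors once labels are collapsed to genes, but the unmapped B.1 read remains lost, which matches the abstract's point that grouping cannot fix a missing transcript.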
Equitability, mutual information, and the maximal information coefficient
Justin B. Kinney, Gurinder S. Atwal
(Submitted on 31 Jan 2013)
Reshef et al. recently proposed a new statistical measure, the “maximal information coefficient” (MIC), for quantifying arbitrary dependencies between pairs of stochastic quantities. MIC is based on mutual information, a fundamental quantity in information theory that is widely understood to serve this need. MIC, however, is not an estimate of mutual information. Indeed, it was claimed that MIC possesses a desirable mathematical property called “equitability” that mutual information lacks. This was not proven; instead it was argued solely through the analysis of simulated data. Here we show that this claim, in fact, is incorrect. First we offer mathematical proof that no (non-trivial) dependence measure satisfies the definition of equitability proposed by Reshef et al. We then propose a self-consistent and more general definition of equitability that follows naturally from the Data Processing Inequality. Mutual information satisfies this new definition of equitability while MIC does not. Finally, we show that the simulation evidence offered by Reshef et al. was artifactual. We conclude that estimating mutual information is not only practical for many real-world applications, but also provides a natural solution to the problem of quantifying associations in large data sets.
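For readers unfamiliar with the quantities involved, here is a minimal sketch of mutual information and the Data Processing Inequality for a known discrete joint distribution. It is not the estimator used in the paper, which deals with continuous data and finite samples; the function names and the toy distributions are assumptions chosen for illustration. Mutual information is computed as the sum over (x, y) of p(x, y) log2[p(x, y) / (p(x) p(y))], and the example checks that coarse-graining Y does not increase it.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """Mutual information in bits for a joint distribution {(x, y): probability}."""
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def coarse_grain(joint, f):
    """Joint distribution of (X, f(Y)), obtained by processing Y through f."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x, f(y))] += p
    return dict(out)


# Hypothetical toy distributions.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}  # X and Y unrelated
four_level = {(k, k): 0.25 for k in range(4)}                 # Y = X, four levels
merged = coarse_grain(four_level, lambda y: y // 2)           # Y reduced to two levels

mi_independent = mutual_information(independent)
mi_four_level = mutual_information(four_level)
mi_merged = mutual_information(merged)

print(mi_independent, mi_four_level, mi_merged)
```

Independent variables share no information, a four-level copy shares two bits, and merging Y into two levels leaves one bit. The processed variable never carries more information about X than the original, which is the Data Processing Inequality that the abstract's revised definition of equitability builds on.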
Identifying and Mapping Cell-type Specific Chromatin Programming of Gene Expression
Troels T. Marstrand, John D. Storey
(Submitted on 11 Oct 2012)
A problem of substantial interest is to systematically map variation in chromatin structure to gene expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity (DHS) and genome-wide gene expression levels in 20 diverse human cell lines. We show that ~25% of genes show cell-type specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., +/- 200kb) capture more genes with this relationship than local regions (e.g., +/- 2.5kb), yet the local regions show a more pronounced effect. By exploiting variation across cell-types, we were capable of pinpointing the most likely hypersensitive sites related to cell-type specific expression, which we show have a range of contextual usages. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene expression variation.
Outside our usual remit, but likely of interest to many of our readers. See here for online peer review.
Best Practices for Scientific Computing
D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian Mitchell, Mark Plumbley, Ben Waugh, Ethan P. White, Greg Wilson, Paul Wilson
(Submitted on 1 Oct 2012)
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.