Biological Averaging in RNA-Seq
Surojit Biswas, Yash N. Agrawal, Tatiana S. Mucyn, Jeffery L. Dangl, Corbin D. Jones
(Submitted on 3 Sep 2013)
RNA-seq has become a de facto standard for measuring gene expression. Traditionally, RNA-seq experiments are mathematically averaged: the mRNA of individuals from different treatment groups is sequenced separately, in the hope of correlating phenotype with differences in arithmetic mean read counts at shared loci of interest. Alternatively, the tissue from the same individuals may be pooled prior to sequencing in what we refer to as a biologically averaged design. Because mathematical averaging sequences every individual, it controls for both biological and technical variation. But is the statistical resolution gained always worth the additional cost? To compare biological and mathematical averaging, we examined theoretical and empirical estimates of statistical efficiency and relative cost efficiency. We found that biological averaging, though less statistically efficient at a fixed sample size, can be more cost efficient than mathematical averaging. With this motivation, we developed a differential expression classifier, ICRBC, that can detect differentially expressed genes between biologically averaged samples. In simulation studies, we found that biological averaging and subsequent analysis with our classifier performed comparably to existing methods, such as ASC, edgeR, and DESeq, especially when individuals were pooled evenly and less than 20% of the regulome was expected to be differentially regulated. In two technically distinct mouse datasets and one plant dataset, we found that our method was over 87% concordant with edgeR for the 100 most significant features. We therefore conclude that biological averaging may control biological variation well enough for differences in gene expression to be detectable. In such situations, ICRBC can enable reliable exploratory analysis at a fraction of the cost, especially when interest lies in the most differentially expressed loci.
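
The trade-off between statistical efficiency and cost efficiency described above can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes a simple additive variance decomposition (a between-individual variance `sigma_b2` and a per-library technical variance `sigma_t2`), a per-individual tissue preparation cost `c_prep`, and a per-library sequencing cost `c_seq`. All function names, parameters, and numbers are hypothetical and chosen only for illustration.

```python
# Toy comparison of mathematical vs. biological averaging.
# ASSUMPTION: expression of one gene = group mean + biological deviation
# (variance sigma_b2 per individual) + technical noise (variance sigma_t2
# per sequenced library). This is an illustrative model, not the paper's.

def variance_mathematical(n, sigma_b2, sigma_t2):
    """Variance of the group-mean estimate when each of n individuals
    is sequenced in its own library and the results are averaged."""
    return (sigma_b2 + sigma_t2) / n

def variance_biological(n, sigma_b2, sigma_t2):
    """Variance of the group-mean estimate when tissue from n individuals
    is pooled evenly and sequenced as a single library."""
    return sigma_b2 / n + sigma_t2

def cost_mathematical(n, c_prep, c_seq):
    """n tissue preparations and n sequencing libraries."""
    return n * (c_prep + c_seq)

def cost_biological(n, c_prep, c_seq):
    """n tissue preparations but only one sequencing library."""
    return n * c_prep + c_seq

def precision_per_cost(variance, cost):
    """Hypothetical cost-efficiency score: precision (1/variance) per unit cost."""
    return 1.0 / (variance * cost)

if __name__ == "__main__":
    # Hypothetical values: biological variance dominates technical variance,
    # and sequencing is much more expensive than tissue preparation.
    n, sigma_b2, sigma_t2, c_prep, c_seq = 10, 1.0, 0.1, 1.0, 20.0
    v_math = variance_mathematical(n, sigma_b2, sigma_t2)
    v_bio = variance_biological(n, sigma_b2, sigma_t2)
    e_math = precision_per_cost(v_math, cost_mathematical(n, c_prep, c_seq))
    e_bio = precision_per_cost(v_bio, cost_biological(n, c_prep, c_seq))
    print("variance (mathematical, biological):", v_math, v_bio)
    print("precision per cost (mathematical, biological):", e_math, e_bio)
```

Under these assumed values the pooled design has the larger variance at the same number of individuals, yet the higher precision per unit cost, because it pays for a single library. The ordering depends on the assumed variances and costs: if technical variance is large relative to biological variance, the advantage can shrink or reverse.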