Michael Gilchrist, Wei-Chen Chen, Premal Shah, Russell Zaretzki
The time and cost of generating a genomic dataset are expected to continue declining dramatically in the coming years. As a result, extracting biologically meaningful information from this continuing flood of data is a major challenge in biology. In response, we present a powerful Bayesian MCMC method based on a nested model of protein synthesis and population genetics. By analyzing the patterns of codon usage observed within a genome, our algorithm extracts and decouples information on codon-specific translational efficiencies and mutation biases as well as gene-specific expression levels for all coding sequences. This information can be combined to generate gene- and codon-specific estimates of selection on synonymous substitutions. One major advance over previous work is that our method can be used without independent measurements of gene expression. Using the Saccharomyces cerevisiae S288c genome, we compare our model fits with and without independent gene expression measurements and observe an exceptionally high correlation between our codon-specific parameters and gene-specific expression levels (ρ > 0.99 in all cases). We also observe robust correlations between our predictions generated without independent expression measurements and previously published estimates of mutation bias, ribosome pausing time, and empirical estimates of mRNA abundance (ρ = 0.53–0.72). Our results indicate that failing to take mutation bias into account can lead to the misidentification of an amino acid’s ‘optimal’ codon. In conclusion, our method demonstrates that an enormous amount of biologically important information is encoded within genome-scale patterns of codon usage and that this information can be accessed through carefully formulated, biologically based models.
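As a rough illustration of why ignoring mutation bias can mislead the identification of an ‘optimal’ codon, the following minimal Python sketch treats codon choice as a softmax over a mutation term and a selection term scaled by expression. The functional form, the function name `codon_probs`, and the parameters `delta_m`, `delta_eta`, and `phi` are assumptions made for illustration only; they are not stated in the text above, and the numerical values are hypothetical.

```python
import math


def codon_probs(delta_m, delta_eta, phi):
    """Return codon probabilities for one amino acid (illustrative only).

    Assumed form: weight_i = exp(-delta_m[i] - delta_eta[i] * phi), where
    delta_m is a hypothetical mutation-bias term, delta_eta a hypothetical
    translational-cost term, and phi a hypothetical gene expression level.
    """
    logits = [-m - e * phi for m, e in zip(delta_m, delta_eta)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-codon amino acid; codon 0 is the reference codon.
delta_m = [0.0, -1.0]   # mutation favors codon 1
delta_eta = [0.0, 0.5]  # codon 1 is costlier, so selection favors codon 0

low = codon_probs(delta_m, delta_eta, phi=0.1)    # weakly expressed gene
high = codon_probs(delta_m, delta_eta, phi=10.0)  # highly expressed gene

print("low expression: ", low)
print("high expression:", high)
```

Under these assumed values, the mutation-favored codon is the more common one in the weakly expressed gene, while the selection-favored codon dominates in the highly expressed gene. If most genes are weakly expressed, simply picking the most frequent codon across the genome could therefore point to the wrong ‘optimal’ codon.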