This post is by Josh Schraiber on his paper (along with coauthors): Schraiber et al. Inferring non-neutral regulatory change in pathways from transcriptional profiling data arXived here.
We’ve known for a long time now that gene sequence alone does not determine phenotype. From the trivial example of differentiated cell types (which all have the same DNA) to now-common examples where species adapt to their environment by changing something other than protein-coding sequence, it’s clear that the expression level of a gene plays just as important a role in phenotypic development as does its sequence. Despite this fact, we still lack, at the level of gene expression, the kinds of tools that are widely available for detecting non-neutral evolution in sequences (in packages like PAML). Part of this problem lies in a fundamental lack of power. A single gene may have hundreds of sites, and the patterns that occur at all of those sites give us plenty of information to learn about accelerated substitution rates and the like. But a gene (in a given environment) has just one expression level per species, so the sample size is often small and power is reduced.
This same problem occurs, of course, in phylogenetic studies of quantitative characters at the organismal level. The difference is that in those cases, researchers typically have access to tens, if not hundreds, of species with good quality measurements. Unfortunately, transcriptome-wide gene expression data can be difficult and costly to collect, so large-scale studies are few and far between.
Instead of trying to leverage large collections of species, we sought to utilize one of the benefits of transcriptome-wide profiles: data from lots and lots of genes. A common practice in molecular evolution is to run tests for selection on a gene-by-gene basis and then look for functional groups that are overrepresented (e.g. Gene Ontology enrichment). We turned that around and instead started with a priori defined gene groups (in our case, from Gene Ontology), looking to detect signal for a history of lineage-specific gene expression evolution, by jointly analyzing all the genes in a group simultaneously.
Doing this would potentially run into a problem of overfitting: should we try to fit a separate rate of evolution for each gene in the group? Instead, we borrowed a page from Ziheng Yang’s book and assumed that the rate of evolution across genes was inverse-gamma distributed. We chose this distribution mostly for computational convenience, but it is important to note that it can cover a wide range of possibilities, from a model in which every gene evolves at the same rate to a distribution so fat-tailed that it has no finite mean, so there is no average rate of evolution across the group! By fitting a distribution of rates across genes in a group, we are able to look for examples of lineage-specific evolution without being confounded by outlying genes.
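To get a feel for the two extremes of the inverse-gamma distribution mentioned above, here is a minimal Python sketch. It is an illustration only, not the code from the paper or the forthcoming R package; the function names and the parameter values are made up for the example, and it uses the standard shape/scale parameterization, which may not match the one in the paper. With shape α and scale β, the mean is β/(α − 1) when α > 1 and does not exist when α ≤ 1.

```python
import random


def inverse_gamma_mean(shape, scale):
    """Mean of an inverse-gamma distribution, or None when it does not exist."""
    if shape <= 1:
        return None  # tail too heavy: no finite mean
    return scale / (shape - 1)


def inverse_gamma_variance(shape, scale):
    """Variance of an inverse-gamma distribution, or None when it does not exist."""
    if shape <= 2:
        return None
    return scale ** 2 / ((shape - 1) ** 2 * (shape - 2))


def sample_rates(shape, scale, n_genes, rng):
    """Draw one hypothetical evolutionary rate per gene.

    If X ~ Gamma(shape, scale=1/scale) then 1/X ~ InverseGamma(shape, scale).
    """
    return [1.0 / rng.gammavariate(shape, 1.0 / scale) for _ in range(n_genes)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable

    # Large shape with scale = shape - 1: mean 1, tiny variance, so every
    # gene in the group evolves at nearly the same rate.
    tight = sample_rates(1001.0, 1000.0, 200, rng)
    print("nearly equal rates:", min(tight), max(tight))

    # Shape below 1: no finite mean, and a few genes get enormous rates.
    heavy = sample_rates(0.8, 1.0, 200, rng)
    print("fat-tailed rates:  ", min(heavy), max(heavy))
    print("mean defined?      ", inverse_gamma_mean(0.8, 1.0) is not None)
```

In the first case the sampled rates cluster tightly around 1; in the second, the largest rates dwarf the typical ones, which is the kind of outlying gene that a single shared rate would fit poorly.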
We encourage you to check out our paper and let us know what you think of our approach. In addition, our method will soon be available as an R package (once I get around to doing all the documentation…) and we would love to see people using it. If you are interested in getting an early version of our package, please don’t hesitate to contact me: