Methods for Joint Imaging and RNA-seq Data Analysis

Junhai Jiang, Nan Lin, Shicheng Guo, Jinyun Chen, Momiao Xiong
(Submitted on 13 Sep 2014)

Emerging integrative analysis of genomic and anatomical imaging data, which has not been well developed, provides invaluable information for the holistic discovery of the genomic structure of disease and has the potential to open a new avenue for discovering novel disease susceptibility genes which cannot be identified if the two data types are analyzed separately. A key issue for the success of imaging and genomic data analysis is how to reduce their dimensions. Most previous methods for imaging information extraction and RNA-seq data reduction do not explore imaging spatial information and often ignore gene expression variation at the genomic positional level. To overcome these limitations, we extend functional principal component analysis from one dimension to two dimensions (2DFPCA) for representing imaging data and develop a multiple functional linear model (MFLM) in which functional principal scores of images are taken as multiple quantitative traits and the RNA-seq profile across a gene is taken as a function predictor for assessing the association of gene expression with images. The developed method has been applied to image and RNA-seq data of ovarian cancer and KIRC studies. We identified 24 and 84 genes whose expressions were associated with imaging variations in ovarian cancer and KIRC studies, respectively. Our results showed that many genes significantly associated with images were not differentially expressed, but revealed their morphological and metabolic functions. The results also demonstrated that the peaks of the estimated regression coefficient function in the MFLM often allowed the discovery of splicing sites and multiple isoforms of gene expression.

Characterization of the transcriptome, nucleotide sequence polymorphism, and natural selection in the desert adapted mouse Peromyscus eremicus

Matthew D MacManes, Michael B Eisen
doi: http://dx.doi.org/10.1101/009134

As a direct result of intense heat and aridity, deserts are thought to be among the most harsh of environments, particularly for their mammalian inhabitants. Given that osmoregulation can be challenging for these animals, with failure resulting in death, strong selection should be observed on genes related to the maintenance of water and solute balance. One such animal, Peromyscus eremicus, is native to the desert regions of the southwest United States and may live its entire life without oral fluid intake. As a first step toward understanding the genetics that underlie this phenotype, we present a characterization of the P. eremicus transcriptome. We assay four tissues (kidney, liver, brain, testes) from a single individual and supplement this with population level renal transcriptome sequencing from 15 additional animals. We identified a set of transcripts undergoing both purifying and balancing selection based on estimates of Tajima’s D. In addition, we used the branch-site test to identify a transcript – Slc2a9, likely related to desert osmoregulation – undergoing enhanced selection in P. eremicus relative to a set of related non-desert rodents.
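For readers unfamiliar with the statistic mentioned in the abstract, the following is a minimal sketch of the standard Tajima's D formula (Tajima 1989). It is not the authors' pipeline; the function name and arguments are hypothetical, and the inputs (sample size, number of segregating sites, mean pairwise differences) are assumed to have been computed elsewhere.

```python
import math


def tajimas_d(num_sequences, num_segregating_sites, mean_pairwise_diff):
    """Tajima's D from sample size n, segregating sites S, and mean pairwise differences pi."""
    n, s = num_sequences, num_segregating_sites
    if n < 4:
        # The variance term is zero for n < 4, so D is undefined there.
        raise ValueError("need at least 4 sequences")
    if s == 0:
        return float("nan")  # no variation, D is undefined

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    # Difference between the two estimators of theta, scaled by its standard deviation.
    numerator = mean_pairwise_diff - s / a1
    return numerator / math.sqrt(e1 * s + e2 * s * (s - 1))
```

Negative values indicate an excess of rare variants (often read as purifying selection or population expansion), and positive values indicate an excess of intermediate-frequency variants (often read as balancing selection), which is how the abstract uses the statistic.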

Author post: Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics

This guest post is by Irene Gallego Romero (@ee_reh_neh) on her paper Gallego Romero et al., “Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics”, bioRxived here.

Genetic divergence in protein coding regions between humans and chimpanzees cannot explain phenotypic differences between the two species, or, more broadly, between other closely related groups. Although we have known this since the early days of genetic sequencing, it has been very hard to formally test the hypothesis that follows logically – that it may be changes in gene expression and regulation that underlie the divergence in phenotypes. This is especially true in the great apes, where there are plenty of ethical and practical impediments to experimentation. For instance, our ability to carry out functional studies and really decode cellular mechanisms is restricted to tissues that can be sampled non-invasively. To date, this has mostly meant fibroblasts and immortalised lymphoblastoid cell lines. The rest of comparative work in primates tends to be done in tissue samples collected post-mortem, where experimental manipulation is not a possibility.

Together, these limitations provided the impetus for us to develop a panel of high-quality induced pluripotent stem cell (iPSC) lines from chimpanzees. The promise of this panel lies, of course, not just in insights into the pluripotent state in chimpanzees (although that is certainly a worthy subject) but in how it opens the door to a tantalizing number of previously inaccessible questions, when we combine it with any of the many protocols available for differentiating iPSCs into particular somatic cell types that have remained out of reach until now.

The amount of work that went into developing an effective reprogramming protocol is not readily apparent in our preprint, but it was exhaustive – and exhausting! We began by using retroviral vectors to deliver the four factors that are commonly used to reprogram somatic cells to pluripotency, but soon encountered two fairly sizable problems with that approach. First, these viral vectors are integrated into the host genome during the course of reprogramming, and one never knows what they’re going to disrupt. This is an issue that everyone using retro- or lentiviral vectors has to contend with, and indeed, when we began working on the project three and a half years ago they were the most reliable and established reprogramming method around, so we were prepared to take our chances and scan the resulting lines to determine insertion sites. Regardless, the thought of random insertions of pluripotency genes set us somewhat on edge!

However, for reasons that we never fully understood, those chimpanzee lines had a lot of trouble silencing the retroviral vectors and maintaining pluripotency solely through endogenous mechanisms, as we show in one of our supplemental figures. At the time, we were making human iPSC lines in tandem using exactly the same vector stocks. While the human lines would lose most exogenous vector expression after 12 to 15 passages, in chimpanzee iPSCs of the same age we would generally find that expression of at least one, if not more, exogenous genes was as high as it had been on day one. This did not bode well for the lines, or for our ability to do interesting things with them! So we scrapped the integrating approach, and began optimizing protocols all over again. Fortunately for us, Shinya Yamanaka’s group had just published a very thorough protocol on reprogramming cells using non-integrating episomal vectors, which ended up laying the foundations of the one we present in our preprint.

The lines we have generated with it are of fantastic quality, and they have passed every test we have thrown at them with flying colours. Pluripotency is being endogenously maintained, they’re karyotypically normal, and they differentiate into all three germ layers both spontaneously as embryoid bodies and teratomas when injected into mice, and when we use directed protocols to push them towards a particular fate.

We were very interested in quantifying how human and chimpanzee iPSC lines differ from each other. To this end, we collected RNA-sequencing and methylation data from the chimpanzee iPSCs and the fibroblast lines they were generated from, as well as from seven human iPSC lines from various ethnic and cellular origins and their precursors, and compared them to one another. We find large numbers of inter-species differences both before and after reprogramming, but crucially, most of them are not the same differences. Of all the genes with strong evidence for differential expression between species at the iPSC stage, only 38% are also differentially expressed before reprogramming, and the situation is quite similar with regards to methylation.

Another thing we have found very striking in the data is the very clear increase in homogeneity within (and possibly between, although our design makes that harder to effectively quantify) species at the iPSC level relative to the precursor cells, both in gene expression levels and in DNA methylation. This finding will be very interesting to keep in mind as we go forward and differentiate the iPSCs into a suite of somatic cell types and see how these measures fluctuate through differentiation.

Ultimately, however, where the biggest significance of this work lies for us is in the fact that the lines are not just for our own use. They’re available to other researchers, and this is something we have had in mind from the earliest stages of the work. There is no possible way for our lab to even begin to tackle all the questions that these lines can be used to answer. So if you want to work with our chimpanzee iPSC lines, get in touch.

Population genomic analysis uncovers African and European admixture in Drosophila melanogaster populations from the southeastern United States and Caribbean Islands

Joyce Y Kao, Asif Zubair, Matthew P Salomon, Sergey V Nuzhdin, Daniel Campo
doi: http://dx.doi.org/10.1101/009092

Genome sequences from North American Drosophila melanogaster populations have become available to the scientific community. Deciphering the underlying population structure of these resources is crucial to make the most of these population genomic resources. Accepted models of North American colonization generally purport that several hundred years ago, flies from Africa and Europe were transported to the east coast United States and the Caribbean Islands respectively and thus current east coast US and Caribbean populations are an admixture of African and European ancestry. These models have been constructed based on phenotypes and limited genetic data. In our study, we have sequenced individual whole genomes of flies from populations in the southeast US and Caribbean Islands and examined these populations in conjunction with population sequences from Winters, CA (USA); Raleigh, NC (USA); Cameroon (Africa); and Montpellier (France) to uncover the underlying population structure of North American populations. We find that west coast US populations are most like European populations, likely reflecting a rapid westward expansion upon first settlements into North America. We also find genomic evidence of African and European admixture in east coast US and Caribbean populations, with a clinal pattern of decreasing proportions of African ancestry with higher latitude further supporting the proposed demographic model of Caribbean flies being established by African ancestors. Our genomic analysis of Caribbean flies is the first study that exposes the source of previously reported novel African alleles found in east coast US populations.

Secondary contact and local adaptation contribute to genome-wide patterns of clinal variation in Drosophila melanogaster

Alan O. Bergland, Ray Tobler, Josefa Gonzalez, Paul Schmidt, Dmitri Petrov
doi: http://dx.doi.org/10.1101/009084

Populations arrayed along broad latitudinal gradients often show patterns of clinal variation in phenotype and genotype. Such population differentiation can be generated and maintained by a combination of demographic events and adaptive evolutionary processes. Here, we investigate the evolutionary forces that generated and maintain clinal variation genome-wide among populations of Drosophila melanogaster sampled in North America and Australia. We contrast patterns of clinal variation in these continents with patterns of differentiation among ancestral European and African populations. We show that recently derived North America and Australia populations were likely founded by both European and African lineages and that this admixture event generated genome-wide patterns of parallel clinal variation. The pervasive effects of admixture meant that only a handful of loci could be attributed to the operation of spatially varying selection using an FST outlier approach. Our results provide novel insight into a well-studied system of clinal differentiation and provide a context for future studies seeking to identify loci contributing to local adaptation in D. melanogaster.

Average genome size estimation enables accurate quantification of gene family abundance and sheds light on the functional ecology of the human microbiome

Stephen Nayfach, Katherine S Pollard
doi: http://dx.doi.org/10.1101/009001

Average genome size (AGS) is an important, yet often overlooked property of microbial communities. We developed MicrobeCensus to rapidly and accurately estimate AGS from short-read metagenomics data and applied our tool to over 1,300 human microbiome samples. We found that AGS differs significantly within and between body sites and tracks with major functional and taxonomic differences. For example, in the gut, AGS ranges from 2.5 to 5.8 megabases and is positively correlated with the abundance of Bacteroides and polysaccharide metabolism. Furthermore, we found that AGS variation can bias comparative analyses, and that normalization improves detection of differentially abundant genes.
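To see why AGS variation can bias comparisons, here is a minimal sketch of one plausible normalization: expressing a gene family's coverage per "genome equivalent" (total sequenced bases divided by AGS). This is an illustration under assumed definitions, not MicrobeCensus's actual code or API; the function names and the toy numbers are hypothetical.

```python
def genome_equivalents(total_bases, ags_bp):
    """Number of average-sized genomes' worth of sequence in a sample."""
    return total_bases / ags_bp


def copies_per_genome(family_bases, family_length_bp, total_bases, ags_bp):
    """Gene family coverage divided by the sample's genome equivalents."""
    coverage = family_bases / family_length_bp
    return coverage / genome_equivalents(total_bases, ags_bp)


# Toy numbers (hypothetical): two samples with identical sequencing depth and
# identical raw coverage of a 1 kb gene family, differing only in AGS.
small_ags = copies_per_genome(400_000, 1_000, 1_000_000_000, 2_500_000)
large_ags = copies_per_genome(400_000, 1_000, 1_000_000_000, 5_000_000)
```

In this toy case the raw coverage is the same in both samples, yet the family is present at about one copy per genome in the small-AGS sample and about two copies per genome in the large-AGS sample, a difference that unnormalized read proportions would hide.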

Increasing evolvability of local adaptation during range expansion.

Marleen M. P. Cobben, Alexander Kubisch
doi: http://dx.doi.org/10.1101/008979

Increasing dispersal under range expansion increases invasion speed, which implies that a species needs to adapt more rapidly to newly experienced local conditions. However, due to iterated founder effects, local genetic diversity under range expansion is low. Evolvability (the evolution of mutation rates) has been reported to possibly be an adaptive trait itself. Thus, we expect that increased dispersal during range expansion may raise the evolvability of local adaptation, and thus increase the survival of expanding populations. We have studied this phenomenon with a spatially explicit individual-based metapopulation model of a sexually reproducing species with discrete generations, expanding into an elevational gradient. Our results show that evolvability is likely to evolve as a result of spatial variation experienced under range expansion. In addition, we show that different spatial phenomena associated with range expansion, in this case spatial sorting / kin selection and priority effects, can enforce each other.

Behavioral individuality reveals genetic control of phenotypic variability

Julien F Ayroles, Sean M Buchanan, Chelsea Jenney, Kyobi Skutt-Kakaria, Jennifer Grenier, Andrew G Clark, Daniel L Hartl, Benjamin L de Bivort
doi: http://dx.doi.org/10.1101/009027

Variability is ubiquitous in nature and a fundamental feature of complex systems. Few studies, however, have investigated variance itself as a trait under genetic control. By focusing primarily on trait means and ignoring the effect of alternative alleles on trait variability, we may be missing an important axis of genetic variation contributing to phenotypic differences among individuals. To study genetic effects on individual-to-individual phenotypic variability (or intragenotypic variability), we used a panel of Drosophila inbred lines and focused on locomotor handedness, in an assay optimized to measure variability. We discovered that some lines had consistently high levels of intragenotypic variability among individuals while others had low levels. We demonstrate that the degree of variability is itself heritable. Using a genome-wide association study (GWAS) for the degree of intragenotypic variability as the phenotype across lines, we identified several genes expressed in the brain that affect variability in handedness without affecting the mean. One of these genes, Ten-a, implicated a neuropil in the central complex of the fly brain, a brain region involved in sensory integration and locomotor coordination, as influencing the magnitude of behavioral variability. We have validated these results using genetic deficiencies, null alleles, and inducible RNAi transgenes. This study reveals the constellation of phenotypes that can arise from a single genotype and it shows that different genetic backgrounds differ dramatically in their propensity for phenotypic variability. Because traditional mean-focused GWASs ignore the contribution of variability to overall phenotypic variation, current methods may miss important links between genotype and phenotype.

Author post: Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

This guest post is by Adam Siepel on his group’s paper Gulko et al. Probabilities of Fitness Consequences for Point Mutations Across the Human Genome, bioRxived here.

Four Genomicists in a Subaru

The idea for this paper emerged during a long drive across New York State, from Cold Spring Harbor to Ithaca, after the 2011 Biology of Genomes meeting. Two postdocs in my research group, Ilan Gronau and Leo Arbiza, were riding with me in my old Subaru, trying not to express too much alarm at my distracted driving. Also with us in the car was Ran Blekhman, who was at the time a postdoc with Andy Clark. (Ran is now an assistant professor at the University of Minnesota.)

Our conversation turned to important open questions in computational genomics, and in particular, to ways of making better use of the vast quantities of functional genomic data being pumped out of projects such as ENCODE. At the time, Ilan, Leo, and I were thinking a lot about how to use patterns of within-species polymorphism and between-species divergence to shed light on the influence of natural selection on regulatory sequences. Along these lines, we had spent much of the spring developing our new INSIGHT method [1,2], and we had just presented this work for the first time at Biology of Genomes. Ran, however, was pushing us to think less about abstract evolutionary questions and more about genomic function and disease association. He made a strong case that the biggest obstacle to progress in medically related human genomics was the absence of adequate functional annotations in noncoding regions of the human genome.

For a while the conversation went in circles, as we grasped for ways of measuring “functional potential” across the genome that would make use of genomic data yet be grounded in evolutionary theory. Then it suddenly dawned on us that we already had in hand a key piece of what we needed. The INSIGHT program was designed to estimate, for any collection of nucleotide positions across the genome, the fraction (denoted ρ) of those positions that were directly influenced by natural selection, in the sense that point mutations at those positions tended either to increase or to decrease the fitness of an organism. We realized that an alternative way of interpreting ρ was as a probability that the nucleotide at each position in the analyzed collection influenced fitness, assuming exchangeability of sites (as INSIGHT does).

All that we needed, therefore, was a general way of partitioning nucleotide positions from across the genome into distinct classes that were reasonably homogeneous in their functional roles. We could then estimate ρ for each class using INSIGHT, and assign to each genomic position the estimate of ρ for the partition to which that position belonged. This procedure would produce a “score” across the genome that looked roughly like widely used evolutionary conservation scores, but instead of representing local divergence patterns across the mammalian phylogeny, the score at each position would be estimated from groupwise patterns of polymorphism and divergence and would be directly interpretable as a probability of fitness consequences. Later Ilan would dub these “fitCons” scores, to emphasize this fitness-related interpretation. (“FitCons” also nicely parallels “phastCons,” our first conservation-scoring method.)
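The partition-then-assign procedure described above can be sketched in a few lines. This is only an illustration: the class labels, positions, and ρ values below are hypothetical placeholders, and the per-class ρ estimates are assumed to come from a separate INSIGHT run rather than being computed here.

```python
from collections import defaultdict


def group_sites_by_class(site_labels):
    """Map each functional class label to the list of positions carrying it."""
    classes = defaultdict(list)
    for position, label in site_labels.items():
        classes[label].append(position)
    return dict(classes)


def assign_class_scores(site_labels, rho_by_class):
    """Give every position the rho estimate of the class it belongs to."""
    return {pos: rho_by_class[label] for pos, label in site_labels.items()}


# Hypothetical functional-genomic labels for five genomic positions.
site_labels = {
    101: ("expressed", "open_chromatin"),
    102: ("expressed", "open_chromatin"),
    103: ("silent", "closed_chromatin"),
    104: ("silent", "closed_chromatin"),
    105: ("silent", "closed_chromatin"),
}

# Placeholder rho values standing in for per-class INSIGHT estimates.
rho_by_class = {
    ("expressed", "open_chromatin"): 0.30,
    ("silent", "closed_chromatin"): 0.05,
}

classes = group_sites_by_class(site_labels)
scores = assign_class_scores(site_labels, rho_by_class)
```

Because every site in a class receives the same value, the resulting track changes only where the class label changes along the genome.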

Because INSIGHT measures selection on recent time scales, fitCons scores would circumvent a major shortcoming of standard evolutionary conservation scores—that they require functional roles to have remained consistent over very long evolutionary time periods (tens to hundreds of millions of years) in order to be detectable in divergence patterns at individual sites or small loci. In principle, fitCons scores should be able to detect selection (hence potential function) at sites whose functional role had emerged quite recently, perhaps even along the human lineage.

The Problem of Grouping

The piece that was still missing in our plan was a particular scheme for grouping together similar genomic sites from across the genome. We did not get to the point of working this problem out in any detail during our revelatory drive to Ithaca, and, as it happened, it took several more months to settle on a solution. By this time, a Ph.D. student from Computer Science, Brad Gulko, had joined the project and assumed the lead in implementing a prototype of the scores.

At first, Brad, Ilan, Leo, and I spent some time thinking about fancy algorithms for clustering genomic sites that would consider functional and evolutionary information jointly. However, it did not take long to realize that this was a hard problem. Eventually, we decided to move forward with a simple grouping scheme, based on functional genomic data alone. This would allow us to cluster genomic sites in a pre-processing step and avoid the need for an iterative solution. Our hunch was that the scores would not be too sensitive to the grouping scheme as long as it was reasonable. As we discuss in our article, it may be worthwhile to revisit this clustering approach eventually, but it appears to be adequate for our current purposes. (I hope to convince Brad to discuss some of the technical issues with the clustering problem in an upcoming blog post.)

Relevance to the “Share Under Selection” in the Human Genome

By the fall of 2012, we had finally settled on an initial set of fitCons tracks and were beginning to observe decent prediction performance for cis-regulatory elements, when the human genomics community was thrown into a frenzy by a deluge of publications and accompanying press releases from the ENCODE Consortium. This event led to the now-famous controversy over what fraction of the genome is truly “functional” and whether ENCODE’s measures of “reproducible biochemical activity” (which apply to over 80% of the genome) were comparable in any meaningful way to the “share under selection” (SUS) estimated from comparative genomics (which generally came out to 5–10%).

I do not wish to rehash the familiar terms of this debate here, but I do want to focus on one aspect of it that was particularly relevant to our work. Many of the criticisms of ENCODE reminded readers that comparative genomic analyses pointed to a SUS of ~5–10%, suggesting that 80% might be a gross over-estimate of the functional content of the genome. However, others pointed out that these comparative-genomic estimates applied only to the fraction of the genome that had been under long-term selective constraint, because evolutionary turnover of functional elements—if it occurred at appreciable rates—could bias estimates based on long-term genomic divergence substantially downward. (For the latest chapter in this saga, see a recent paper by Gerton Lunter, Chris Ponting, and colleagues [3].)

We realized that the fitCons scores could help address aspects of this controversy, because they were based on patterns of variation over much more recent time scales and should therefore be much less sensitive to turnover than scores based on divergence patterns across the mammalian phylogeny. Moreover, the fitCons scores, by making use of INSIGHT to interpret patterns of polymorphism and divergence, might provide substantially better estimates of the quantities of interest than simple analyses of SNP densities or allele frequencies, a few of which had appeared among the ENCODE publications. Finally, the INSIGHT-based estimates are unique in that they directly predict the SUS, without the need for separate thresholding, mixture deconvolutions, or enrichment analyses.
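One way to read "directly predict the SUS" is that, if each score is a probability that a site is under selection, the expected share under selection is simply the size-weighted average of the per-class ρ values. The sketch below illustrates that arithmetic only; it assumes this interpretation, ignores the corrections used for the bounds reported in the paper, and uses hypothetical class names and numbers.

```python
def share_under_selection(class_sizes, rho_by_class):
    """Expected fraction of sites under selection: sum(n_k * rho_k) / sum(n_k)."""
    total_sites = sum(class_sizes.values())
    expected_selected = sum(
        class_sizes[label] * rho_by_class[label] for label in class_sizes
    )
    return expected_selected / total_sites


# Toy example (hypothetical): 90% of sites in a class with rho = 0,
# 10% of sites in a class with rho = 0.5.
toy_sus = share_under_selection(
    {"background": 900, "putatively_functional": 100},
    {"background": 0.0, "putatively_functional": 0.5},
)
```

Here `toy_sus` works out to 0.05, with no thresholding or mixture deconvolution required.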

Somewhat surprisingly, when we estimated the SUS based on fitCons scores, we obtained values (4.2–7.5%) that were quite similar to those based on conservation patterns in mammals. There are a number of tricky technical issues involved in this type of estimation (for example, concerning the corrections for local mutation rates and coalescence times), but violations of our modeling assumptions generally should tend to push our estimated upper bound (7.5%) to conservatively high values, implying that the true value is lower than 7.5%. In addition, the correction we have applied to obtain our lower bound (4.2%) is quite conservative, making it likely that the true value is higher than 4.2%. Therefore we have high confidence that the fraction of the genome under detectable selection from the available polymorphism and divergence data is indeed fairly close to 5%. As we discuss in the paper, it is important to bear in mind that the absolute values of these estimates reflect constraint on the identities of individual nucleotides only, and do not take into consideration higher order constraints, for example, on element lengths or spacing. Nevertheless, the similarity of the estimates based on mammalian divergence and human polymorphism suggests that evolutionary turnover has not produced a major downward bias in conservation-based estimates of the SUS.

Ilan and I soon realized that we could go a step further in this analysis and compare the fitCons-based estimates with parallel estimates based on the same functional categories but a measure of natural selection based on divergence only. The idea here was to perform a direct “apples to apples” comparison of the fraction of the genome under selection as measured on two different time scales: the 1–5 million-year time scale measured by fitCons and the ~30 million-year time scale measured by an analogous method based on divergence patterns in four primate genomes (human, chimpanzee, orangutan, and rhesus macaque), which we called “fitConsD” (the “D” is for “divergence”). I won’t attempt to describe this analysis in detail here, but our general conclusion is that the estimates of selection are highly similar on these two different time scales, suggesting further that evolutionary turnover has not had a dramatic effect on the functional content of the human genome over the past 30 million years or so. It is worth noting that Lunter and colleagues’ recent analysis is not strictly incompatible with ours (they estimate 7.1–9.2% constraint at present and focus on turnover over longer time scales) but their qualitative interpretation suggests large amounts of turnover, while ours suggests modest amounts.

Scooped by CADD… or Perhaps Not

As grant proposals, other manuscripts, and job searches led to delays in writing up our work through 2013, we began to hear rumblings on social media about a method called CADD, developed by Greg Cooper and Jay Shendure’s groups, that sounded alarmingly similar to fitCons. Then, in early 2014, a paper by Kircher, Witten et al. describing CADD appeared in Nature Genetics [4]. When we saw this paper, our initial impression was that we had been scooped by sitting on a good idea for too long. CADD was described as a method that integrated functional and evolutionary data and produced a measure of “relative pathogenicity” across the entire genome, and it was motivated, in part, by its potential usefulness in noncoding as well as coding regions. The paper included several impressive-looking ROC plots in which CADD apparently outperformed conservation based methods such as phastCons, phyloP, and GERP by a significant margin. In addition, CADD made use of a support vector machine (SVM), which was potentially a highly flexible and powerful means for considering large numbers of covariates with arbitrarily complex correlations.

We decided, with a certain amount of dread, that we needed to add CADD to our empirical performance comparisons on putative regulatory elements. At the time, fitCons was showing clear advantages in predictive power for cis regulatory elements compared with conservation-based methods and a functional annotation database called RegulomeDB. FitCons had several potential advantages over CADD—for example, it made direct use of polymorphism data for prediction, it considered covariates in a cell-type-specific manner, and it avoided a need for brute-force simulations through its use of INSIGHT for inference—but we thought that the use of the SVM in CADD made it unlikely that fitCons could compete with it in a pure classification task. Nevertheless, Brad dutifully downloaded the CADD scores, added them to his experiments, and displayed curves for CADD in his ROC plots for three types of regulatory elements (ChIP-seq-supported transcription factor binding sites, eQTLs, and enhancer predictions based on chromatin marks).

To our surprise, fitCons significantly outperformed CADD in all of these tests. This was true for three different types of putative regulatory elements, and true whether or not we considered cell-type-specific test sets. In fact, CADD performed essentially no better than conventional conservation scores in these tests, in apparent contradiction to the results presented in the CADD paper.

A closer reading of the CADD paper revealed a possible explanation for these observations. While the method was motivated, in part, by its applicability to the entire genome, the authors’ validation experiments heavily emphasized coding regions. In fact, it appears that even the ROC plot for “genome-wide” results (Figure 3a in the paper) is actually based almost exclusively (>92% by our interpretation of the paper) on missense variants in coding regions. The experiments that included substantial numbers of noncoding sites, in turn, were much more indirect—for example, by showing correlations with derived allele frequencies (Figure 2), known disease-causing status (Figure 4), and changes in expression in saturation mutagenesis experiments at two enhancers and one promoter. It is possible to have correlations of this kind without having substantial predictive power for regulatory variants.

When Greg Cooper saw our initial preprint on bioRxiv, he raised two major objections to our validation experiments. First, he pointed out that we were measuring the sensitivity of the fitCons scores in terms of bulk coverage of elements, when those elements actually consist of a mixture of sites at which mutations are deleterious and sites at which mutations are neutral or nearly neutral (such as degenerate positions in transcription factor binding sites). This approach to measuring sensitivity may be overly generous to the fitCons scores, which are relatively “blocky” along the genome, varying little from one site to the next, in comparison to higher-resolution prediction methods that properly distinguish between functionally important and neutral sites within elements. Second, Greg pointed out that we were using a naive genome-wide background set for our eQTL, which did not properly account for the ascertainment scheme used for eQTL identification.

We felt that these were fair and reasonable criticisms that needed to be addressed. Therefore, we revised our validation experiments to consider only high-information-content positions in transcription factor binding sites (a proxy for functionally important nucleotides), and to use a more appropriate control for eQTL. The details of these follow-up experiments are described in our revised preprint (now on bioRxiv), but the bottom line was that they had almost no effect on our ROC plots. In other words, the apparent performance advantages of fitCons over CADD and other divergence-based methods are not an artifact of our experimental design but appear to reflect real advantages of the method. While Greg is correct that the coarse, “blocky” nature of the fitCons scores is a limitation of our current methods, the method still appears to perform significantly better than any competing method in distinguishing putatively functional regulatory nucleotides from background sequence. Put differently, while scores that exhibit more variation from one nucleotide to the next (such as CADD, GERP, and phyloP) may appear on the surface to have higher predictive resolution, much of that variation is uninformative about regulatory function, and, on balance, the “blocky” fitCons scores are more useful in prediction.
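To make the idea of "high-information-content positions" concrete, here is a minimal sketch of how motif columns could be filtered by information content. It is not the procedure used in the preprint: the motif, the name `IC_THRESHOLD`, and the cutoff value are all hypothetical, and the standard small-sample correction is omitted for brevity.

```python
import math

def information_content(column):
    """Information content (bits) of one motif column of A/C/G/T probabilities."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return 2.0 - entropy  # 2 bits is the maximum for a 4-letter alphabet

# Hypothetical 4-column motif; each column lists P(A), P(C), P(G), P(T).
toy_motif = [
    [1.00, 0.00, 0.00, 0.00],  # fully constrained position
    [0.25, 0.25, 0.25, 0.25],  # fully degenerate position
    [0.50, 0.50, 0.00, 0.00],  # two-fold degenerate position
    [0.00, 0.00, 0.00, 1.00],  # fully constrained position
]

IC_THRESHOLD = 1.5  # assumed cutoff, chosen only for this example

ic = [information_content(col) for col in toy_motif]
high_ic_positions = [i for i, bits in enumerate(ic) if bits >= IC_THRESHOLD]
print(high_ic_positions)  # [0, 3]
```

Only the retained positions would then be scored against background sites, so that degenerate positions inside a binding site no longer count toward sensitivity.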

We have spent some time trying to understand the differences in performance between fitCons and CADD, and believe we have some insights into why fitCons performs significantly better on regulatory elements. (What follows are our conjectures only; the authors of CADD do not agree with our analysis.) While the SVM in CADD is potentially a strength, we believe that it is substantially limited in this case by the use of a linear kernel and by pooling features across cell types, rather than focusing separately on each cell type of interest. In addition, we think there is a fundamental problem with the optimization scheme used by CADD. The SVM in CADD is trained by a global strategy, in the sense that a single set of parameter values is selected (for a given choice of training set and generalization parameter) to obtain an optimal fit, on average, across all examples in the training set. Thus, if it is true that different covariates are relevant in coding and noncoding regions, as expected, then the method will have to make tradeoffs between these types of sites. If the “signal” for training (i.e., the contribution toward the SVM’s objective function) is stronger in one type of region than another, it is likely that the tradeoff will favor that type of region. Because constraint will be strongest, on average, in coding regions, leading to higher rates of difference between simulated and observed variants in these regions, it seems likely that these regions will indeed dominate in the training procedure, and this may explain CADD’s superior performance in coding regions and its weaker performance in noncoding regions. FitCons avoids this problem by applying INSIGHT separately to each class of sites.
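The tradeoff argument can be illustrated with a toy simulation. This is only a sketch of our conjecture under strong simplifying assumptions: it is not CADD, and it uses a least-squares linear scorer as a stand-in for a linear-kernel SVM. The two covariates, the effect sizes, and all function names are hypothetical. One covariate carries a strong signal in "coding" sites only, the other a weak signal in "noncoding" sites only.

```python
import numpy as np

def simulate(n, rng, informative, effect):
    """Toy sites with +/-1 labels; only the `informative` column carries signal."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, 2))
    X[:, informative] += effect * y
    return X, y

def fit_linear(X, y):
    """Least-squares linear scorer (assumed stand-in for a linear SVM)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def accuracy(w, X, y):
    """Fraction of sites whose score has the same sign as the label."""
    return float(np.mean(np.sign(X @ w) == y))

rng = np.random.default_rng(0)
X_cod, y_cod = simulate(5000, rng, informative=0, effect=2.0)  # strong signal
X_non, y_non = simulate(5000, rng, informative=1, effect=0.5)  # weak signal

# One global fit across both site classes versus a fit on noncoding sites only.
w_pooled = fit_linear(np.vstack([X_cod, X_non]), np.concatenate([y_cod, y_non]))
w_non = fit_linear(X_non, y_non)

acc_pooled_non = accuracy(w_pooled, X_non, y_non)
acc_sep_non = accuracy(w_non, X_non, y_non)
print("pooled weights:", w_pooled)
print("noncoding accuracy, pooled fit:  ", acc_pooled_non)
print("noncoding accuracy, separate fit:", acc_sep_non)
```

In this toy setting the pooled fit puts most of its weight on the covariate that matters in coding sites, which acts as pure noise in noncoding sites, so the separately fitted scorer does better there. Whether the same mechanism explains the real CADD results remains a conjecture.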

These observations raise the interesting possibility of a modified CADD that addresses some of these limitations. There is no reason why CADD couldn’t be trained separately on noncoding and coding regions, perhaps with a different set of covariates for each type of site. Moreover, regulation-associated covariates could be treated in a cell-type-specific manner. A modified CADD designed along these lines (regulatory CADD, or rCADD?) could provide an interesting alternative to fitCons.

Summary

When we first discussed the idea for the fitCons scores during our drive across New York State three years ago, I envisioned a quick spin-off project that could be completed in perhaps half a year. As so often happens in research, several unanticipated challenges arose in completing this work, but we also found unexpected opportunities to connect our analysis with important open questions in the field. In addition, we were stimulated to think about the problem of combining functional and evolutionary data in new and deeper ways by another paper that addressed a similar problem but in a fundamentally different way. The end result is a paper I am quite proud of—one that provides what I think will be a useful resource to the genomics community and that also offers new insights into longstanding evolutionary questions.

References

[1] Gronau, I., Arbiza, L., Mohammed, J., & Siepel, A. (2013). Inference of Natural Selection from Interspersed Genomic Elements Based on Polymorphism and Divergence. Molecular Biology and Evolution. doi:10.1093/molbev/mst019

[2] Arbiza, L., Gronau, I., Aksoy, B. A., Hubisz, M. J., Gulko, B., Keinan, A., & Siepel, A. (2013). Genome-wide inference of natural selection on human transcription factor binding sites. Nature Genetics, 45(7), 723–729. doi:10.1038/ng.2658

[3] Rands, C. M., Meader, S., Ponting, C. P., & Lunter, G. (2014). 8.2% of the Human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genetics, 10(7), e1004525. doi:10.1371/journal.pgen.1004525

[4] Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics. doi:10.1038/ng.2892

The Sea Lamprey Meiotic Map Resolves Ancient Vertebrate Genome Duplications

Jeramiah Smith
doi: http://dx.doi.org/10.1101/008953

Gene and genome duplications serve as an important reservoir of material for the evolution of new biological functions. It is generally accepted that many genes present in vertebrate genomes owe their origin to two whole genome duplications that occurred deep in the ancestry of the vertebrate lineage. However, details regarding the timing and outcome of these duplications are not well resolved. We present high-density meiotic and comparative genomic maps for the sea lamprey, a representative of an ancient lineage that diverged from all other vertebrates approximately 550 million years ago. Linkage analyses yielded a total of 95 linkage groups, similar to the estimated number of germline chromosomes (1N ~ 99), spanning a total of 5,570.25 cM. Comparative mapping data yield strong support for one ancient whole genome duplication but do not strongly support a hypothetical second event. Rather, these comparative maps reveal several evolutionarily independent segmental duplications occurring over the last 600+ million years of chordate evolution. This refined history of vertebrate genome duplication should permit more precise investigations into the evolution of vertebrate gene functions.