Our paper: RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates

This guest post is by Titus Brown on his group’s preprint, RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates, arXived here. This post is cross-posted from his blog

I’m worried about our current mRNAseq analysis strategies.

I recently posted a draft paper of ours to arXiv entitled RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates; the link is to the Haldane’s Sieve discussion of the paper. Graham Coop said I should write something quick, and so I am (sort of — I mean to write this post a few days ago, but teaching. family. etc.)

I decided to post the paper — although we haven’t submitted it yet — because I wanted to solicit feedback and gauge where the disagreements or misunderstandings were likely to pop up. I suspect reviewers are going to have a hate-hate relationship with this paper, too, so I want to get rid of any obvious mistakes by crowdsourcing some early reviews ;).

Before I talk about the paper, let me mention a few related factoids that crossed my ‘net firehose and that tie into the paper; I’ll weave them into the paper discussion below.

  1. Dan Zerbino (one of the Assembly Gods?) wrote a good e-mail about how Oases can’t really leverage coverage information when reconstructing transcripts
    on the oases-users mailing list.
  2. I read and reviewed the SASeq paper on mRNAseq quantification, which talks about how many of our transcript models are almost certainly not expressed.
  3. The lamprey genome paper came out, pointing out that the genome is probably missing ~20-30% of the germline content.
  4. A paper on evaluating mRNAseq expression analysis software hit arXiv.

OK, so what did our paper say?

The paper’s genesis is this: for years I’d been explaining to collaborators that mRNAseq was unreliable on any organism without a pretty good reference transcriptome, and that constructing the reference transcriptome relied on either good mRNAseq assembly tools OR on having a good reference genome. So that was what my lab was working on doing with their data.

Since we tend to work on weird organisms like chick (which has an incomplete genome and a notoriously poor gene annotation), lamprey (see above — missing 20-30% of its genome), and Molgulid ascidians (for which we are constructing the genome), we needed to go the de novo mRNAseq assembly route.

However, we also quickly realized that there were a few problems with de novo mRNAseq assembly and expression evaluation: lots of mRNAseq needed to be crunched in order to get good reconstruction, which was computationally intensive (see: diginorm); splice variants would confuse mapping (see: this paper); we were uncertain of how well splice variants could be assembled with or without a reference (see: this paper); and we weren’t sure how to evaluate mapping tools.

When a new postdoc (Alexis Black Pyrkosz, the first author) joined the lab and wanted a fairly simple project to get started, I suggested that she evaluate the extent to which an incomplete reference transcriptome would screw things up, and perhaps try out a few different mappers. This took her about a year to work through, in addition to all of her other projects. She built a straightforward simulation system, tried it out on chick and mouse, and got some purely computational results that place (in my view) fundamental limits on what you can accomplish with certainty using current mRNAseq technology.

Incidentally, one piece of feedback she got at GLBio from a prof was (paraphrased) "I hope this isn’t your real project, because it’s not very interesting."

The results, in short, are:

  1. What mapper you use doesn’t really matter, except to within a few percent; they all perform fine, and they all degrade fairly fast with sequencing errors.

  2. Incomplete reference transcriptomes matter, a lot. There are two entirely obvious reasons: if you have a splice variant A that is in your reference but not present in your mRNAseq, and a splice variant B that is not in your reference but is actually transcribed and in your mRNAseq, the reads for B will get mapped to the wrong transcript; and (the even more obvious one) you can’t measure the expression of something that’s not in your reference via mapping.

    The SASeq paper does a nice job of pointing out that there’s something rather seriously wrong with current mRNAseq references, even in mouse, and they provide a way to minimize misestimation in the context of the reference.

  3. Direct splice variant reconstruction and measurement is, technically, impossible for about 30% of the transcripts in mouse. For standard paired-end sequencing, it turns out that you cannot claim that exon A and exon Z are present in the same isoform for about 30% of the isoforms.

  4. The slightly surprising conclusion that we reached from this is that mRNAseq assembly is also, generally speaking, impossible: you cannot unambiguously construct a reasonably large proportion of observed isoforms via assembly, since the information to connect the exons is not there.

    Until recently, I had held out a forlorn hope that Clever Things were being done with coverage. Then I saw Dan Zerbino’s e-mail, point A, above.

And yet, judging by the Oases and Trinity publications, assembly works! What’s up?

There’s something a bit weird going on. Tools like Oases and Trinity can clearly construct a fair proportion of previously observed transcripts, even though the information to do so from direct observation isn’t there and they can’t necessarily use coverage inference reliably. My guess (see paper D, above) is that this is because biology is mostly cooperating with us by giving us one dominant isoform in many or most circumstances; this matches what Joe Pickrell et al. observed in their truly excellent noisy splicing paper. But I’d be interested in hearing alternate theories.

At this point, my friend and colleague Erich Schwarz tends to get unhappy with me and say "what would you have me do with my mRNAseq, then? Ignore it until you come up with a solution, which you claim is impossible anyway?" Good question! My answer is (a) "explore the extent to which we can place error bars or uncertainty on isoform abundance calculations", (b) "figure out where interesting isoform misestimation is likely to lie in the data sets", and (c) "look at exon-exon junctions and exon presence/absence instead of whole isoform abundance." But the tools to do this are still rather immature, I think, and people mostly ignore the issue or hope it doesn’t bugger up their results. (Please correct me if I’m wrong – I’d love pointers!)

In my lab, we are starting to explore ways to determine what mis- or un-assembled isoforms there might be in a given transcriptome. We’re also looking at non-reference-based ways of doing mRNAseq quantification and differential expression (technically, graph-based methods for mRNAseq). We are also deeply skeptical of many of the normalization approaches being used, largely because every time we evaluate them in the context of our actual data, our data seems to violate a number of their assumptions… Watch This Space.

Paper reactions

What’s the reaction to the paper been, so far?

Well, on Twitter, I’ve been getting a fair bit of "ahh, a good reference genome will solve your problems!" But I think points #3 and #4 above stand. Plus, invoking solving a harder and more expensive problem to solve what you think is a simpler problem is an interesting approach :). And since we don’t have good references (and won’t for a while) it’s not really a solution for us.

I’ve also been getting "this is a well known problem and not worth publishing!" from a few people. Well, OK, fair enough. I’ve been skimming papers with an eye to this for a while, but it’s entirely possible I’ve missed this part of the literature. I’d love to read and cite such a paper in this one (and even rely on it to make our points, if it truly has significant overlap). Please post links in the comments, I’d really appreciate it!

It is clear that the paper needs some reshaping in light of some of the comments, and I’d like to especially thank Mick Watson for his open comments.

Concluding thoughts

If our results are right, then our current approaches to mRNAseq have some potentially serious problems, especially in the area of isoform expression analysis. Worse, these problems aren’t readily addressible by doing qPCR confirmation or replicate sequencing. I’m not entirely sure what the ramifications are but it seems like a worthwhile thing that someone should point out.

–titus

p.s. Our simulations are fairly simple, BTW. We’ll put the code out there soon for you to play with.

p.p.s. Likit Preeyanon, a graduate student in my lab, was one of the first people in my lab to look really critically at mRNAseq. Watch for his paper soon.

Our paper: Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster

This guest post is by Nandita R. Garud, Philipp W. Messer, Erkan O. Buzbas, and Dmitri A. Petrov, on their paper
 Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster, arXived here

We typically think of adaptive events as arising from single de novo mutations that sweep through the population one at a time. In this scenario, one expects to observe the signatures of hard selective sweeps, where a single haplotype rises to very high frequencies, removing variation in linked genomic regions. It is also possible, however, that adaptation could lead to signatures of soft sweeps. Soft sweeps are generated by multiple adaptive haplotypes rising in frequency at the same time, either because (i) the adaptive mutation comes from standing variation and thus had time to recombine onto multiple haplotypes, or (ii) because multiple de novo mutations arise virtually simultaneously. The second mode is likely in large populations or when the adaptive mutation rate per locus is high.

Soft sweeps have generally been considered a mere curiosity and most scans for adaptation focus on the hard sweep scenario. Despite this prevailing view, the three best-studied cases of adaptation in Drosophila at the loci Ace, CHKov1, and Cyp6g1 all show signatures of soft sweeps. In two cases (Ace and Cyp6g1), soft sweeps were generated by de novo mutations indicating that the population size in D. melanogaster relevant to adaptation is on the order of billions or larger. In one case (CHKov1), soft sweeps arose from standing variation. Surprisingly, we do not have very convincing cases of recent adaptation in Drosophila that generated hard sweeps.

Nevertheless, it remained an open question of whether these three cases were the exception or the norm. They are all related to pesticide or viral resistance and it is entirely possible that much adaptation unrelated to human disturbance or immunity proceeds differently and might generate hard sweeps.

In this paper, we developed two haplotype statistics that allowed us to systematically identify hard and soft sweeps with similar power and then to differentiate them from each other. We applied these statistics to the Drosophila polymorphism data of ~150 fully sequenced, inbred strains available through the Drosophila Genetic Reference Panel (DGRP).

We found abundant signatures of recent and strong sweeps in the Drosophila genome with haplotype structure often extending over tens or even hundreds of kb. However, to our surprise, when we looked at the top 50 peaks, all of them showed signatures of soft sweeps, while we could not convincingly demonstrate the existence of any hard sweeps.

Our results suggest that hard sweeps might be exceedingly rare in Drosophila. Instead, it appears that adaptation in Drosophila primarily proceeds via soft sweeps and thus often involves standing genetic variation or recurrent de novo mutations. There are two caveats, however: One is that we were only able to study strong and recent adaptation. Such strong adaptation should “feel” recent population sizes that are close to the census size, whereas it should be insensitive to bottlenecks that have occurred in the distant past. Weaker adaptation, on the other hand, might take longer and thus would be sensitive to ancient bottlenecks or interference from other sweeps. Whether weak adaptation thus proceeds via hard sweeps remains to be seen. The second caveat is that much of adaptation might involve sweeps that are so soft and move so many haplotypes up in frequency that we cannot detect them. Similarly, adaptation could often be polygenic involving very subtle shifts in allele frequency at many loci. These modes would hardly leave any signatures of sweeps at all. Whichever way it is, it is becoming increasingly clear that adaptation in Drosophila and many other organisms is likely to be much more complex, much more common, and in many ways a much more turbulent process than we usually tend to think.

From Many, One: Genetic Control of Prolificacy during Maize Domestication

From Many, One: Genetic Control of Prolificacy during Maize Domestication
David M. Wills, Clinton Whipple, Shohei Takuno, Lisa E. Kursel, Laura M. Shannon, Jeffrey Ross-Ibarra, John F. Doebley
(Submitted on 4 Mar 2013)

A reduction in number and an increase in size of inflorescences is a common aspect of plant domestication. When maize was domesticated from teosinte, the number and arrangement of ears changed dramatically. Teosinte has long lateral branches that bear multiple small ears at their nodes and tassels at their tips. Maize has much shorter lateral branches that are tipped by a single large ear with no additional ears at the branch nodes. To investigate the genetic basis of this difference in prolificacy (the number of ears on a plant), we performed a genome-wide QTL scan. A large effect QTL for prolificacy (prol1.1) was detected on the short arm of chromosome one in a location that has previously been shown to influence multiple domestication traits. We fine-mapped prol1.1 to a 2.7 kb interval or causative region upstream of the grassy tillers1 gene, which encodes a homeodomain leucine zipper transcription factor. Tissue in situ hybridizations reveal that the maize allele of prol1.1 is associated with up-regulation of gt1 expression in the nodal plexus. Given that maize does not initiate secondary ear buds, the expression of gt1 in the nodal plexus in maize may suppress their initiation. Population genetic analyses indicate positive selection on the maize allele of prol1.1, causing a partial sweep that fixed the maize allele throughout most of domesticated maize. This work shows how a subtle cis-regulatory change in tissue specific gene expression altered plant architecture in a way that improved the harvestability of maize.

Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster

Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster
Nandita R. Garud, Philipp W. Messer, Erkan O. Buzbas, Dmitri A. Petrov
(Submitted on 5 Mar 2013)

Adaptation is often thought to leave the signature of a hard selective sweep, in which a single haplotype bearing the beneficial allele reaches high population frequency. However, an alternative and often-overlooked scenario is that of a soft selective sweep, in which multiple adaptive haplotypes sweep through the population simultaneously. Soft selective sweeps are likely either when adaptation proceeds from standing genetic variation or in large populations where adaptation is not mutation-limited. Current statistical methods are not well designed to test for soft sweeps, and thus are likely to miss these possibly numerous adaptive events because they look for characteristic reductions in heterozygosity. Here, we developed a statistical test based on a haplotype statistic, H12, capable of detecting both hard and soft sweeps with similar power. We used H12 to identify multiple genomic regions that have undergone recent and strong adaptation using a population sample of fully sequenced Drosophila melanogaster strains (DGRP). We then developed a second statistical test based on a statistic H2/H1 | H12, to test whether a given selective sweep detected by H12 is hard or soft. Surprisingly, when applying the test based on H2/H1 | H12 to the top 50 most extreme H12 candidates in the DGRP data, we reject the hard sweep hypothesis in every case. In contrast, all 50 cases show strong support (Bayes Factor >10) for a soft sweep model. Our results suggest that recent adaptation in North American populations of D. melanogaster has led primarily to soft sweeps either because it utilized standing genetic variation or because the short-term effective population size in D. melanogaster is on the order of billions or larger.

Comprehensive Detection of Genes Causing a Phenotype using Phenotype Sequencing and Pathway Analysis

Comprehensive Detection of Genes Causing a Phenotype using Phenotype Sequencing and Pathway Analysis
Marc Harper, Luisa Gronenberg, James Liao, Christopher Lee
(Submitted on 3 Mar 2013)

Discovering all the genetic causes of a phenotype is an important goal in functional genomics. In this paper we combine an experimental design for multiple independent detections of the genetic causes of a phenotype, with a high-throughput sequencing analysis that maximizes sensitivity for comprehensively identifying them. Testing this approach on a set of 24 mutant strains generated for a metabolic phenotype with many known genetic causes, we show that this pathway-based phenotype sequencing analysis greatly improves sensitivity of detection compared with previous methods, and reveals a wide range of pathways that can cause this phenotype. We demonstrate our approach on a metabolic re-engineering phenotype, the PEP/OAA metabolic node in E. coli, which is crucial to a substantial number of metabolic pathways and under renewed interest for biofuel research. Out of 2157 mutations in these strains, pathway-phenoseq discriminated just five gene groups (12 genes) as statistically significant causes of the phenotype. Experimentally, these five gene groups, and the next two high-scoring pathway-phenoseq groups, either have a clear connection to the PEP metabolite level or offer an alternative path of producing oxaloacetate (OAA), and thus clearly explain the phenotype. These high-scoring gene groups also show strong evidence of positive selection pressure, compared with strictly neutral selection in the rest of the genome.

Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees

Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees
Sha Zhu, James H Degnan, Bjarki Eldon
(Submitted on 4 Mar 2013)

Hybrid-Lambda is a software package that simulates gene trees under Kingman or two Lambda-coalescent processes within species networks or species trees. It is written in C++, and re- leased under GNU General Public License (GPL) version 3. Users can modify and make new dis- tribution under the terms of this license. For details of this license, visit this http URL. Hybrid Lambda is available at this https URL.

Deleterious synonymous mutations hitchhike to high frequency in HIV-1 env evolution

Deleterious synonymous mutations hitchhike to high frequency in HIV-1 env evolution
Fabio Zanini, Richard A. Neher
(Submitted on 4 Mar 2013)

Intrapatient HIV-1 evolution is dominated by selection on the protein level in the arms race with the adaptive immune system. When cytotoxic CD8+ T-cells or neutralizing antibodies target a new epitope, the virus often escapes via nonsynonymous mutations that impair recognition. Synonymous mutations do not affect this interplay and are often assumed to be neutral. We analyze longitudinal intrapatient data from the C2-V5 part of the envelope gene (env) and observe that synonymous derived alleles rarely fix even though they often reach high frequencies in the viral population. We find that synonymous mutations that disrupt base pairs in RNA stems flanking the variable loops of gp120 are more likely to be lost than other synonymous changes, hinting at a direct fitness effect of these stem-loop structures in the HIV-1 RNA. Computational modeling indicates that these synonymous mutations have a (Malthusian) selection coefficient of the order of -0.002 and that they are brought up to high frequency by hitchhiking on neighboring beneficial nonsynonymous alleles. The patterns of fixation of nonsynonymous mutations estimated from the longitudinal data and comparisons with computer models suggest that escape mutations in C2-V5 are only transiently beneficial, either because the immune system is catching up or because of competition between equivalent escapes.

Most viewed on Haldane’s Sieve: February 2013

The most viewed preprints on Haldane’s Sieve in February 2013 were:

SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner

SOAP3-dp: Fast, Accurate and Sensitive GPU-based Short Read Aligner
Ruibang Luo, Thomas Wong, Jianqiao Zhu, Chi-Man Liu, Edward Wu, Lap-Kei Lee, Haoxiang Lin, Wenjuan Zhu, David W. Cheung, Hing-Fung Ting, Siu-Ming Yiu, Chang Yu, Yingrui Li, Ruiqiang Li, Tak-Wah Lam
(Submitted on 22 Feb 2013)

To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed in trade of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, GEM and GPU-based aligners including BarraCUDA and CUSHAW, SOAP3-dp is two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60 percent. Real data evaluation using human genome demonstrates SOAP3-dp’s power to enable more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1 percent FDR on newly discovered deletions. SOAP3-dp natively supports BAM file format and provides a scoring scheme same as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.

Inferring ancestral states without assuming neutrality or gradualism using a stable model of continuous character evolution

Inferring ancestral states without assuming neutrality or gradualism using a stable model of continuous character evolution
Michael G. Elliot, Arne O. Mooers
(Submitted on 20 Feb 2013)

The value of a continuous character evolving on a phylogenetic tree is commonly modelled as the location of a particle moving under one-dimensional Brownian motion with constant rate. The Brownian motion model is best suited to characters evolving under neutral drift or tracking an optimum that drifts neutrally. We present a generalization of the Brownian motion model which relaxes assumptions of neutrality and gradualism by considering increments to evolving characters to be drawn from a heavy-tailed stable distribution (of which the normal distribution is a specialized form). We describe Markov chain Monte Carlo methods for fitting the model to biological data paying special attention to ancestral state reconstruction, and study the performance of the model in comparison with a selection of existing comparative methods, using both simulated data and a database of body mass in 1,679 mammalian species. We discuss hypothesis testing and model selection. The new model is well suited to a stochastic process with a volatile rate of change in which biological characters undergo a mixture of neutral drift and occasional evolutionary events of large magnitude.