Efficient Algorithms for de novo Assembly of Alternative Splicing Events from RNA-seq Data

Gustavo Sacomoto
(Submitted on 23 Jun 2014)

In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events. Finally, we show that it identifies more correct events than general-purpose transcriptome assemblers.
In order to deal with ever-increasing volumes of NGS data, we made an extra effort to make KisSplice as scalable as possible. First, to improve its running time, we propose a new polynomial delay algorithm to enumerate bubbles. We show that it is several orders of magnitude faster than previous approaches. Then, to reduce its memory consumption, we propose a new compact way to build and represent a de Bruijn graph. We show that our approach uses 30% to 40% less memory than the state of the art, with an insignificant impact on the construction time.
Additionally, we apply the same techniques developed to list bubbles in two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson’s algorithm. This is the first improvement to this problem in almost 40 years. We then consider a different parameterization of the classical K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms with the same time complexities but using exponentially less memory than previous approaches.
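To make the bubble idea in the abstract above concrete, here is a minimal Python sketch. It is not KisSplice's algorithm and not the thesis's polynomial delay enumeration; it only builds a toy de Bruijn graph from two made-up sequences that differ by one base and reports the simplest kind of bubble (two unbranched paths that leave one node and rejoin at another). The function names, the toy reads, and the `max_len` guard are all assumptions introduced for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def simple_bubbles(graph, max_len=50):
    """Report pairs of unbranched paths that share a source and a sink.

    Illustrative only: handles sources with exactly two successors and
    stops each walk after max_len steps to avoid looping on cycles.
    """
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    bubbles = []
    for source in sorted(graph):
        if len(graph[source]) != 2:
            continue
        paths = []
        for node in sorted(graph[source]):
            path = [source, node]
            # Follow the path while it is unambiguous (one way in, one way out).
            while (indegree[node] == 1
                   and len(graph.get(node, ())) == 1
                   and len(path) < max_len):
                node = next(iter(graph[node]))
                path.append(node)
            paths.append(path)
        # A bubble closes when both branches end on the same node.
        if paths[0][-1] == paths[1][-1]:
            bubbles.append((paths[0], paths[1]))
    return bubbles


# Two hypothetical sequences that differ at a single position.
reads = ["ATGGCATCAGA", "ATGGCTTCAGA"]
graph = build_de_bruijn(reads, k=4)
bubbles = simple_bubbles(graph)
for upper, lower in bubbles:
    print(" -> ".join(upper))
    print(" -> ".join(lower))
```

In this toy case the two branches have equal length, which is the pattern expected for a single-base difference; in the model described in the abstract, an alternative splicing event would instead tend to produce branches of unequal length.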

Assessing Technical Performance in Differential Gene Expression Experiments with External Spike-in RNA Control Ratio Mixtures

Sarah A. Munro, Steve P. Lund, P. Scott Pine, Hans Binder, Djork-Arné Clevert, Ana Conesa, Joaquin Dopazo, Mario Fasold, Sepp Hochreiter, Huixiao Hong, Nederah Jafari, David P. Kreil, Paweł P. Łabaj, Sheng Li, Yang Liao, Simon Lin, Joseph Meehan, Christopher E. Mason, Javier Santoyo, Robert A. Setterquist, Leming Shi, Wei Shi, Gordon K. Smyth, Nancy Stralis-Pavese, Zhenqiang Su, Weida Tong, Charles Wang, Jian Wang, Joshua Xu, Zhan Ye, Yong Yang, Ying Yu, Marc Salit
(Submitted on 18 Jun 2014)

There is a critical need for standard approaches to assess, report, and compare the technical performance of genome-scale differential gene expression experiments. We assess technical performance with a proposed “standard” dashboard of metrics derived from analysis of external spike-in RNA control ratio mixtures. These control ratio mixtures with defined abundance ratios enable assessment of diagnostic performance of differentially expressed transcript lists, limit of detection of ratio (LODR) estimates, and expression ratio variability and measurement bias. The performance metrics suite is applicable to analysis of a typical experiment, and here we also apply these metrics to evaluate technical performance among laboratories. An interlaboratory study using identical samples shared amongst 12 laboratories with three different measurement processes demonstrated generally consistent diagnostic power across 11 laboratories. Ratio measurement variability and bias were also comparable amongst laboratories for the same measurement process. Different biases were observed for measurement processes using different mRNA enrichment protocols.

Error correction and assembly complexity of single molecule sequencing reads.

Hayan Lee, James Gurtowski, Shinjae Yoo, Shoshana Marcus, W. Richard McCombie, Michael Schatz

Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.

Nanopore Sequencing of the phi X 174 genome

Andrew H. Laszlo, Ian M. Derrington, Brian C. Ross, Henry Brinkerhoff, Andrew Adey, Ian C. Nova, Jonathan M. Craig, Kyle W. Langford, Jenny Mae Samson, Riza Daza, Kenji Doering, Jay Shendure, Jens H. Gundlach
(Submitted on 17 Jun 2014)

Nanopore sequencing of DNA is a single-molecule technique that may achieve long reads, low cost, and high speed with minimal sample preparation and instrumentation. Here, we build on recent progress with respect to nanopore resolution and DNA control to interpret the procession of ion current levels observed during the translocation of DNA through the pore MspA. As approximately four nucleotides affect the ion current of each level, we measured the ion current corresponding to all 256 four-nucleotide combinations (quadromers). This quadromer map is highly predictive of ion current levels of previously unmeasured sequences derived from the bacteriophage phi X 174 genome. Furthermore, we show nanopore sequencing reads of phi X 174 up to 4,500 bases in length that can be unambiguously aligned to the phi X 174 reference genome, and demonstrate proof-of-concept utility with respect to hybrid genome assembly and polymorphism detection. All methods and data are made fully available.
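The quadromer map described above can be pictured as a lookup table from each of the 4^4 = 256 four-nucleotide combinations to an ion current level, slid along a sequence one base at a time. The Python sketch below illustrates only that lookup. The current values are placeholders (simply the index of each quadromer), not measurements from the paper, and the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"

# All 4**4 = 256 four-nucleotide combinations (quadromers).
QUADROMERS = ["".join(p) for p in product(BASES, repeat=4)]


def toy_quadromer_map():
    """Return a placeholder map: quadromer -> made-up current level.

    Assumption for illustration only; a real map would hold measured
    ion currents for the pore in question.
    """
    return {quad: float(index) for index, quad in enumerate(QUADROMERS)}


def predict_levels(sequence, quadromer_map):
    """Predict one current level per 4-base window along the sequence."""
    return [quadromer_map[sequence[i:i + 4]]
            for i in range(len(sequence) - 3)]


qmap = toy_quadromer_map()
levels = predict_levels("AAAAC", qmap)
print(len(QUADROMERS), levels)
```

A sequence of length n yields n - 3 predicted levels under this sliding-window reading, since each level depends on four consecutive nucleotides.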

Accounting for biases in riboprofiling data indicates a major role for proline and not positive amino acids in stalling translation

Carlo G. Artieri, Hunter B. Fraser

The recent advent of ribosome profiling (sequencing of short ribosome-bound fragments of mRNA) has offered an unprecedented opportunity to interrogate the sequence features responsible for modulating translational rates. Nevertheless, numerous analyses of the first riboprofiling dataset have produced equivocal and often incompatible results. Here we analyze three independent yeast riboprofiling data sets, including two with much higher coverage than previously available, and find that all three show substantial technical sequence biases that confound interpretations of ribosomal occupancy. After accounting for these biases, we find no effect of previously implicated factors on ribosomal pausing. Rather, we find that incorporation of proline, whose unique side-chain stalls peptide synthesis in vitro, also slows the ribosome in vivo. We also reanalyze a recent method that reported positively charged amino acids as the major determinant of ribosomal stalling and demonstrate that its assumptions lead to false signals of stalling in low-coverage data. Our results suggest that any analysis of riboprofiling data should account for sequencing biases and sparse coverage. To this end, we establish a robust methodology that enables analysis of ribosome profiling data without prior assumptions regarding which positions spanned by the ribosome cause stalling.

Reducing INDEL errors in whole-genome and exome sequencing

Han Fang, Giuseppe Narzisi, Jason A. O’Rawe, Yiyang Wu, Julie Rosenbaum, Michael Ronemus, Ivan Iossifov, Michael C. Schatz, Gholson J. Lyon

Background: INDELs, especially those disrupting protein-coding regions of the genome, have been associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts. We have recently developed a new INDEL-calling algorithm, Scalpel, with substantially improved accuracy.
Results: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate false-positive and false-negative INDEL errors. We developed a classification scheme utilizing validation data to define a class of low-quality INDELs with ~2.7-fold higher error rates than high-quality INDELs. The mean concordance of INDEL detection between WGS and WES data was ~52%, while WGS data uniquely identified ~10.8-fold more high-quality INDELs. Concordance of INDEL detection between standard and PCR-free sequencing data was ~71%, while PCR-free data uniquely yielded ~6.3-fold fewer low-quality INDELs. We demonstrate that these INDEL errors are significantly reduced with a PCR-free library protocol, implying that these errors are introduced with PCR amplification. We calculated that 60X WGS data from the HiSeq 2000 platform are needed to recover ~95% of INDELs, much higher than that for SNP detection. Accurate detection of heterozygous INDELs requires ~1.2-fold higher coverage than that for homozygous INDELs.
Conclusions: Homopolymer A/T INDELs are a major source of low quality and/or uncertain INDEL calls, and these are highly enriched in the WES data. We recommend WGS for human genomes at 60X mean coverage with PCR-free protocols, which can substantially improve the quality of personal genomes.

Transcriptomic analysis of the lesser spotted catshark (Scyliorhinus canicula) pancreas, liver and brain reveals molecular level conservation of vertebrate pancreas function

John F Mulley, Adam D Hargreaves, Matthew J Hegarty, R. Scott Heller, Martin T Swain

Background: Understanding the evolution of the vertebrate pancreas is key to understanding its functions. The chondrichthyes (cartilaginous fish such as sharks and rays) have been suggested to possess the most ancient example of a distinct pancreas with both hormonal (endocrine) and digestive (exocrine) roles, although the lack of genetic, genomic and transcriptomic data for cartilaginous fish has hindered a more thorough understanding of the molecular-level functions of the chondrichthyan pancreas, particularly with respect to their “unusual” energy metabolism (where ketone bodies and amino acids are the main oxidative fuel source) and their paradoxical ability to both maintain stable blood glucose levels and tolerate extensive periods of hypoglycemia. In order to shed light on some of these processes we have carried out the first large-scale comparative transcriptomic survey of multiple cartilaginous fish tissues: the pancreas, brain and liver of the lesser spotted catshark, Scyliorhinus canicula.
Results: We generated a multi-tissue assembly comprising 86,006 contigs, of which 44,794 were assigned to a particular tissue or combination of tissues based on mapping of sequencing reads. We have characterised transcripts encoding genes involved in insulin regulation, glucose sensing, transcriptional regulation, signaling and digestion, as well as many peptide hormone precursors and their receptors for the first time. Comparisons to published mammalian pancreas transcriptomes reveal that mechanisms of glucose sensing and insulin regulation used to establish and maintain a stable internal environment are conserved across jawed vertebrates and likely pre-date the vertebrate radiation. Conservation of pancreatic hormones and genes encoding digestive proteins supports the single, early evolution of a distinct pancreatic gland with endocrine and exocrine functions in vertebrates, although the peptide diversity of the early vertebrate pancreas has been overestimated as a result of the use of cross-reacting antisera in earlier studies. A three hormone islet organ is therefore the basal vertebrate condition, later elaborated upon only in the tetrapod lineage.
Conclusions: The cartilaginous fish are a great untapped resource for the reconstruction of patterns and processes of vertebrate evolution and new approaches such as those described in this paper will greatly facilitate their incorporation into the rank of “model organism”.

iRAP – an integrated RNA-seq Analysis Pipeline

Nuno A. Fonseca, Robert Petryszak, John Marioni, Alvis Brazma

RNA-sequencing (RNA-Seq) has become the technology of choice for whole-transcriptome profiling. However, processing the millions of sequence reads generated requires considerable bioinformatics skills and computational resources. At each step of the processing pipeline many tools are available, each with specific advantages and disadvantages. While using a specific combination of tools might be desirable, integrating the different tools can be time consuming, often due to specificities in the formats of input/output files required by the different programs. Here we present iRAP, an integrated RNA-seq analysis pipeline that allows the user to select and apply their preferred combination of existing tools for mapping reads, quantifying expression, and testing for differential expression. iRAP also includes multiple tools for gene set enrichment analysis and generates web browsable reports of the results obtained in the different stages of the pipeline. Depending upon the application, iRAP can be used to quantify expression at the gene, exon or transcript level. iRAP is aimed at a broad group of users with basic bioinformatics training and requires little experience with the command line. Despite this, it also provides more advanced users with the ability to customise the options used by their chosen tools.

Polyester: simulating RNA-seq datasets with differential transcript expression

Alyssa C Frazee, Andrew E Jaffe, Ben Langmead, Jeffrey Leek

Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially-constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. The main advantage of Polyester is the ability to simulate isoform-level differential expression across biological replicates for a variety of experimental designs at the read level. Differential expression signal can be simulated with either built-in or user-defined statistical models. Polyester is available on GitHub at https://github.com/alyssafrazee/polyester.

Simultaneous estimation of transcript abundances and transcript specific fragment distributions of RNA-Seq data with the Mix2 model

Andreas Tuerk, Gregor Wiktorin

Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragmentation bias, which is not represented appropriately by current statistical models of RNA-Seq data. Another, less investigated, source of error is the inaccuracy of transcript start and end annotations. This article introduces the Mix2 (rd. “mixquare”) model, which uses a mixture of probability distributions to model the transcript specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the EM algorithm and are tied between similar transcripts. Transcript specific shift and scale parameters allow the Mix2 model to automatically correct inaccurate transcript start and end annotations. Experiments are conducted on synthetic data covering 7 genes of different complexity, 4 types of fragment bias and correct as well as incorrect transcript start and end annotations. Abundance estimates obtained by Cufflinks 2.2.0, PennSeq and the Mix2 model show superior performance of the Mix2 model in the vast majority of test conditions.