Isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard

Tianyang Li, Rui Jiang, Xuegong Zhang
(Submitted on 4 May 2013)

Maximum likelihood is a popular technique for isoform reconstruction. Here, we show that isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard.

Adaptive reference-free compression of sequence quality scores

Lilian Janin, Giovanna Rosone, Anthony J. Cox
(Submitted on 1 May 2013)

Motivation:
Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full resolution. Since our approach relies directly on redundancy present in the reads, it does not need a reference sequence and is therefore applicable to data from metagenomics and de novo experiments as well as to resequencing data.
Results:
We show that a conservative smoothing strategy affecting 75% of the quality scores above Q2 leads to an overall quality score compression of 1 bit per value with a negligible effect on variant calling. A compression of 0.68 bit per quality value is achieved using a more aggressive smoothing strategy, again with a very small effect on variant calling.
Availability:
Code to construct the BWT and LCP-array on large genomic data sets is part of the BEETL library, available as a GitHub repository at this http URL.

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

Buhm Han, Jae Hoon Sul, Eleazar Eskin, Paul I. W. de Bakker, Soumya Raychaudhuri
(Submitted on 30 Apr 2013)

Meta-analysis of genome-wide association studies is increasingly popular and many meta-analytic methods have been recently proposed. A majority of meta-analytic methods combine information from multiple studies by assuming that studies are independent since individuals collected in one study are unlikely to be collected again by another study. However, it has become increasingly common to utilize the same control individuals among multiple studies to reduce genotyping or sequencing cost. This causes those studies that share the same individuals to be dependent, and spurious associations may arise if overlapping subjects are not taken into account in a meta-analysis. In this paper, we propose a general framework for meta-analyzing dependent studies with overlapping subjects. Given dependent studies, our approach “decouples” the studies into independent studies such that meta-analysis methods assuming independent studies can be applied. This enables many meta-analysis methods, such as the random effects model, to account for overlapping subjects. Another advantage is that one can continue to use preferred software in the analysis pipeline which may not support overlapping subjects. Using simulations and the Wellcome Trust Case Control Consortium data, we show that our decoupling approach allows both the fixed and the random effects models to account for overlapping subjects while retaining desirable false positive rate and power.
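The decoupling idea can be illustrated with a small numerical sketch. It assumes the covariance matrix of the per-study effect estimates is known and derives per-study "decoupled" variances from the weights of the generalized least squares (GLS) estimator. The abstract does not spell out the construction, so the formula and all function names below are assumptions for illustration, not necessarily the authors' exact method.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def gls_estimate(betas, cov):
    """Pooled effect and variance that account for correlated studies."""
    weights = solve(cov, [1.0] * len(betas))  # cov^{-1} e (cov is symmetric)
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    return pooled, 1.0 / total


def decoupled_variances(cov):
    """Assumed decoupling: variance_i = 1 / (cov^{-1} e)_i.

    Only meaningful when every weight is positive.
    """
    return [1.0 / w for w in solve(cov, [1.0] * len(cov))]


def fixed_effects(betas, variances):
    """Standard inverse-variance fixed effects model for independent studies."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    return pooled, 1.0 / total


# Hypothetical data: studies 1 and 2 share controls (correlation 0.3).
betas = [0.10, 0.20, 0.15]
cov = [
    [0.0025, 0.0009, 0.0],
    [0.0009, 0.0036, 0.0],
    [0.0, 0.0, 0.0016],
]
print(gls_estimate(betas, cov))
print(fixed_effects(betas, decoupled_variances(cov)))
```

Under this assumption, feeding the decoupled variances into an unmodified fixed effects routine reproduces the GLS result, which is the sense in which existing software could keep being used.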

Remote Homology Detection in Proteins Using Graphical Models

Noah M. Daniels
(Submitted on 24 Apr 2013)

Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily-related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence poses a challenge.
We first consider all proteins of known three-dimensional structure, and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology.
Next, we return to homology prediction, based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies.
We show two different approaches to approximate this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field, but approximates the solution through stochastic search. Both methods achieve improvements over the state of the art in remote homology detection for beta-structural protein folds.

Informed and Automated k-Mer Size Selection for Genome Assembly

Rayan Chikhi, Paul Medvedev
(Submitted on 20 Apr 2013)

Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision.
We develop a fast and accurate sampling method that constructs approximate abundance histograms with a performance improvement of several orders of magnitude over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies.
Our tool KmerGenie is freely available at: this http URL
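To make the notion of a sampled k-mer abundance histogram concrete, here is a minimal sketch. It keeps only k-mers whose hash falls in a fixed residue class and scales the resulting counts back up. The sampling rule, the use of CRC32 as the hash, and the function names are assumptions for illustration and are not taken from KmerGenie. Reverse complements are ignored for brevity.

```python
from collections import Counter
import zlib


def kmers(read, k):
    """Yield every substring of length k in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def abundance_histogram(reads, k, sample_rate=1):
    """Map abundance -> estimated number of distinct k-mers with that abundance.

    Roughly one k-mer in `sample_rate` is counted, chosen by its hash, so every
    occurrence of a sampled k-mer is seen and its abundance is exact.
    """
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if zlib.crc32(kmer.encode()) % sample_rate == 0:
                counts[kmer] += 1
    histogram = Counter(counts.values())
    # Scale up to estimate the number of distinct k-mers in the full data.
    return {a: n * sample_rate for a, n in sorted(histogram.items())}


reads = ["ACGTACGT", "CGTACG"]
print(abundance_histogram(reads, k=4))
```

With `sample_rate=1` the histogram is exact. Larger values trade accuracy for speed and memory, which is the trade-off a sampling approach relies on.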

Comparing DNA sequence collections by direct comparison of compressed text indexes

Anthony J. Cox, Tobias Jakobi, Giovanna Rosone, Ole B. Schulz-Trieglaff
(Submitted on 19 Apr 2013)

Popular sequence alignment tools such as BWA convert a reference genome to an indexing data structure based on the Burrows-Wheeler Transform (BWT), from which matches to individual query sequences can be rapidly determined. However, the utility of also indexing the query sequences themselves remains relatively unexplored.
Here we show that an all-against-all comparison of two sequence collections can be computed from the BWT of each collection with the BWTs held entirely in external memory, i.e. on disk and not in RAM. As an application of this technique, we show that BWTs of transcriptomic and genomic reads can be compared to obtain reference-free predictions of splice junctions that have high overlap with results from more standard reference-based methods.
Code to construct and compare the BWT of large genomic data sets is available at this http URL as part of the BEETL library.
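For readers unfamiliar with the BWT, the following is a minimal in-memory sketch of the transform and of counting pattern matches by backward search. It is a textbook illustration on a single short string, not the external-memory, multi-sequence algorithm used in BEETL, and the function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations ('$' marks the end)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # first_index[ch] = position of the first row starting with ch.
    first_index = {}
    for i, ch in enumerate(sorted(bwt_str)):
        first_index.setdefault(ch, i)

    def occ(ch, i):
        # Number of ch in bwt_str[:i]; a real index would precompute this.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in first_index:
            return 0
        lo = first_index[ch] + occ(ch, lo)
        hi = first_index[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
print(transformed)
print(count_occurrences(transformed, "ana"))
```

The naive rotation sort takes quadratic memory, so it only serves to show what the index contains. Practical tools build the same string with far more economical algorithms.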

GEMINI: integrative exploration of genetic variation and genome annotations

Uma Paila, Brad Chapman, Rory Kirchner, Aaron Quinlan
(Submitted on 17 Apr 2013)

Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or variant prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate the utility of GEMINI for exploring variation in personal genomes and family-based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility, and our goal is to provide researchers with a standard framework for medical genomics.

Improving transcriptome assembly through error correction of high-throughput sequence reads

Matthew D MacManes, Michael B Eisen
(Submitted on 3 Apr 2013)

The study of functional genomics, particularly in non-model organisms, has been dramatically improved over the last few years by use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure, the de novo construction of a reference transcriptome, must be completed as a prerequisite to further analyses. An accurate reference is critically important, as all downstream steps, including estimating transcript abundance, depend on it. Though a substantial amount of research has been done on assembly, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on, and while they have been shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, we show, using a simulated dataset, that applying error correction to sequencing reads has significant positive effects on assembly accuracy, reducing assembly error by nearly 50%, and therefore should be applied to all datasets.

Concurrent and Accurate RNA Sequencing on Multicore Platforms

Héctor Martínez (1), Joaquín Tárraga (2), Ignacio Medina (2), Sergio Barrachina (1), Maribel Castillo (1), Joaquín Dopazo (2), Enrique S. Quintana-Ortí (1) ((1) Dpto. de Ingeniería y Ciencia de los Computadores, Universidad Jaume I, Castellón, Spain, (2) Computational Genomics Institute, Centro de Investigación Príncipe Felipe, Valencia, Spain)
(Submitted on 2 Apr 2013)

In this paper we introduce a novel parallel pipeline for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, named HPG-Aligner, leverages the speed of the Burrows-Wheeler Transform to map a large number of RNA fragments (reads) rapidly, as well as the accuracy of the Smith-Waterman algorithm, which is employed to deal with conflictive reads. The aligner is complemented with a careful strategy to detect splice junctions based on the division of RNA reads into short segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing useful information for the successful alignment of the complete reads.
Experimental results on platforms with AMD and Intel multicore processors report the remarkable parallel performance of HPG-Aligner on short and long RNA reads; it outperforms a state-of-the-art aligner such as TopHat 2, built on top of Bowtie and Bowtie 2, in both execution time and sensitivity.
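As background on the Smith-Waterman step mentioned above, here is a minimal sketch of the local alignment score with a linear gap penalty. The scoring values and the function name are assumptions chosen for illustration. HPG-Aligner's actual implementation and parameters are not described in the abstract.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The table has one cell per pair of positions, so the cost grows with the product of the sequence lengths. That is why a pipeline would reserve this step for the reads that a faster index-based search cannot place.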

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Heng Li
(Submitted on 16 Mar 2013)

Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs split alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For short-read mapping, BWA-MEM shows better performance than several state-of-the-art read aligners to date.
Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at this http URL