Sashimi plots: Quantitative visualization of RNA sequencing read alignments

Yarden Katz, Eric T. Wang, Jacob Silterra, Schraga Schwartz, Bang Wong, Jill P. Mesirov, Edoardo M. Airoldi, Christopher B. Burge
(Submitted on 14 Jun 2013)

We introduce Sashimi plots, a quantitative multi-sample visualization of mRNA sequencing reads aligned to gene annotations. Sashimi plots are made using alignments (stored in the SAM/BAM format) and gene model annotations (in GFF format), which can be custom-made by the user or obtained from databases such as Ensembl or UCSC. We describe two implementations of Sashimi plots: (1) a stand-alone command-line implementation aimed at making customizable publication-quality figures, and (2) an implementation built into the Integrative Genomics Viewer (IGV) browser, which enables rapid and dynamic creation of Sashimi plots for any genomic region of interest, suitable for exploratory analysis of alternatively spliced regions of the transcriptome. Isoform expression estimates output by the MISO program can optionally be plotted along with Sashimi plots. Sashimi plots can be used to quickly screen differentially spliced exons along genomic regions of interest and can be used in publication-quality figures. The Sashimi plot software and documentation are available from: this http URL

Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads

Laurent Gautier, Ole Lund
(Submitted on 6 Jun 2013)

Cheap high-throughput DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples.
We propose a novel general approach to the analysis of sequencing data in which the reference genome does not have to be specified. Using a distributed architecture, we are able to query a remote server for hints about what the reference might be while transferring a relatively small amount of data; the hints can then be used for more computationally demanding work.
Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references known to the server. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment.
To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients, one of them running in a web browser, in order to demonstrate that gigabytes of raw sequencing reads of unknown origin could be identified without the need to transfer a very large volume of data, and on modestly powered computing devices.
Web access is available at this http URL. The source code for a Python command-line client, a server, and supplementary data are available at this http URL.
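The client-side step described above (sending only a sample of the unidentified reads) can be sketched with single-pass reservoir sampling. This is a minimal illustration, not the authors' client: the function names `parse_fastq` and `sample_reads` are hypothetical, the input is assumed to be well-formed four-line FASTQ records, and the abstract does not say how the real client chooses its sample.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (name, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def sample_reads(reads: Iterable[Read], k: int, seed: int = 0) -> List[Read]:
    """Reservoir sampling: a uniform sample of up to k reads in one pass."""
    rng = random.Random(seed)
    reservoir: List[Read] = []
    for i, read in enumerate(reads):
        if i < k:
            reservoir.append(read)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = read
    return reservoir
```

Because the reservoir never holds more than `k` reads, memory use stays constant regardless of file size, which fits the abstract's emphasis on modestly powered devices.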

SPATA: A Seeding and Patching Algorithm for Hybrid Transcriptome Assembly

Tin Chi Nguyen, Zhiyu Zhao, Dongxiao Zhu
(Submitted on 6 Jun 2013)

Transcriptome assembly from RNA-Seq reads is an active area of bioinformatics research. The ever-declining cost and the increasing depth of RNA-Seq have provided unprecedented opportunities to better identify expressed transcripts. However, the nonlinear transcript structures and the ultra-high throughput of RNA-Seq reads pose significant algorithmic and computational challenges to the existing transcriptome assembly approaches, either reference-guided or de novo. While reference-guided approaches offer good sensitivity, they rely on the alignment results of splice-aware aligners and are thus unsuitable for species with incomplete reference genomes. In contrast, de novo approaches do not depend on the reference genome but face a computationally daunting task arising from the complexity of the graph built for the whole transcriptome. In response to these challenges, we present a hybrid approach to exploit an incomplete reference genome without relying on splice-aware aligners. We have designed a split-and-align procedure to efficiently localize the reads to individual genomic loci, which is followed by an accurate de novo assembly of the reads falling into each locus. Using extensive simulation data, we demonstrate high accuracy and precision in transcriptome reconstruction in comparison with selected transcriptome assembly tools. Our method is implemented in assemblySAM, GUI software freely available at this http URL.

biobambam: tools for read pair collation based algorithms on BAM files

German Tischler, Steven Leonard
(Submitted on 4 Jun 2013)

Sequence alignment data is often ordered by coordinate (id of the reference sequence plus position on the sequence where the fragment was mapped) when stored in BAM files, as this simplifies the extraction of variants between the mapped data and the reference or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications, such as duplicate marking or conversion to the FastQ format, that require access to the full information of each pair. In this paper we introduce biobambam, an API for efficient BAM file reading that supports the efficient collation of alignments by read name without a complete resorting of the input file, together with some tools based on this API that perform tasks such as marking duplicate reads and conversion to the FastQ format. Compared with previous approaches to problems involving the collation of alignments by read name, such as the BAM to FastQ or duplicate marking utilities in the Picard suite, biobambam can often perform an equivalent task more efficiently in terms of required main memory and run-time.
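The idea of collating mates by read name without resorting the whole file can be sketched as a single streaming pass that parks each alignment until its mate arrives. This is only an illustration of the general technique: the `Alignment` tuple and the function `collate_pairs` are hypothetical, the unbounded in-memory dictionary is an assumption, and the abstract does not describe the data structures biobambam actually uses.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical simplified record: (read name, mate number 1 or 2, payload)
Alignment = Tuple[str, int, str]


def collate_pairs(
    alignments: Iterable[Alignment],
) -> Iterator[Tuple[Alignment, Optional[Alignment]]]:
    """Stream coordinate-ordered alignments and emit mates together.

    An alignment waits in `pending` until its mate is seen, so the input
    is never fully resorted. Reads whose mate never appears are emitted
    at the end as (alignment, None).
    """
    pending: Dict[str, Alignment] = {}
    for aln in alignments:
        name = aln[0]
        mate = pending.pop(name, None)
        if mate is None:
            pending[name] = aln  # first mate seen; wait for the other
        else:
            first, second = sorted((mate, aln), key=lambda a: a[1])
            yield first, second
    for orphan in pending.values():
        yield orphan, None
```

Memory use is bounded by the number of reads whose mate has not yet been seen, which is typically far smaller than the whole file when mates map close together.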

Agreeing to disagree, some ironies, disappointing scientific practice and a call for better: reply to The poor performance of TMM on microRNA-Seq

Mark D. Robinson
(Submitted on 27 May 2013)

This letter is a response to a Divergent Views article entitled “The poor performance of TMM on microRNA-Seq” (Garmire and Subramaniam 2013), which was a response to our Divergent Views article entitled “miRNA-seq normalization comparisons need improvement” (Zhou et al. 2013). Using reproducible code examples, we showed that they incorrectly used our normalization method and highlighted additional concerns with their study. Here, I wish to debunk several untrue or misleading statements made by the authors (hereafter referred to as GS) in their response. Unlike GS's, my claims are supported by R code, citations and email correspondence. I finish by making a call for better practice.

Narrowing the gap on heritability of common disease by direct estimation in case-control GWAS

David Golan, Saharon Rosset
(Submitted on 23 May 2013)

One of the major developments in recent years in the search for missing heritability of human phenotypes is the adoption of linear mixed-effects models (LMMs) to estimate heritability due to genetic variants which are not significantly associated with the phenotype. A variant of the LMM approach has been adapted to case-control studies and applied to many major diseases by Lee et al. (2011), successfully accounting for a considerable portion of the missing heritability. For example, for Crohn’s disease their estimated heritability was 22% compared to 50-60% from family studies. In this letter we propose to estimate heritability of disease directly by regression of phenotype similarities on genotype correlations, corrected to account for ascertainment. We refer to this method as genetic correlation regression (GCR). Using GCR we estimate the heritability of Crohn’s disease at 34% using the same data. We demonstrate through extensive simulation that our method yields unbiased heritability estimates, which are consistently higher than LMM estimates. Moreover, we develop a heuristic correction to LMM estimates, which can be applied to published LMM results. Applying our heuristic correction increases the estimated heritability of multiple sclerosis from 30% to 52.6%.
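The general idea of regressing phenotype similarity on genotype correlation can be illustrated for the simplest setting, a quantitative trait with no ascertainment, using a Haseman-Elston-style regression. This sketch is not GCR itself: the ascertainment correction for case-control data that the abstract describes is omitted, and the function name `pairwise_regression_h2` and the choice of ordinary least squares with an intercept are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Sequence


def pairwise_regression_h2(
    phenotypes: Sequence[float],
    kinship: Sequence[Sequence[float]],
) -> float:
    """Slope of the OLS regression of z_i * z_j on kinship[i][j], i < j.

    z is the standardized phenotype. Under a simple additive model for an
    unascertained quantitative trait, the slope estimates heritability.
    Assumption: no ascertainment correction is applied here.
    """
    mu, sd = mean(phenotypes), pstdev(phenotypes)
    z = [(p - mu) / sd for p in phenotypes]

    xs: List[float] = []
    ys: List[float] = []
    for i, j in combinations(range(len(z)), 2):
        xs.append(kinship[i][j])   # genotype correlation of the pair
        ys.append(z[i] * z[j])     # phenotype similarity of the pair

    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx
```

Only off-diagonal pairs enter the regression, so each individual's own variance does not contribute to the slope.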

Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies

Xiang Zhou, Matthew Stephens
(Submitted on 19 May 2013)

Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, existing methods for calculating the likelihood ratio test statistics in mvLMMs are time-consuming and, without approximations, cannot be directly applied to analyze even two traits jointly in a typical-size GWAS. Here, we present a novel algorithm for computing parameter estimates and test statistics (likelihood ratio and Wald) in mvLMMs that i) reduces per-iteration optimization complexity from cubic to linear in the number of samples; and ii) in GWAS analyses, reduces per-marker complexity from cubic to approximately quadratic (or linear if the relatedness matrix is of low rank) in the number of samples. The new method effectively generalizes both the EMMA (Efficient Mixed Model Association) algorithm and the GEMMA (Genome-wide EMMA) algorithm to the multivariate case, making likelihood ratio tests in GWASs with mvLMMs possible, for the first time, for tens of thousands of samples and a moderate number of phenotypes (<10). With real examples, we show that, as expected, the new method is orders of magnitude faster than competing methods both in variance component estimation in a single mvLMM and in GWAS applications. The method is implemented in the GEMMA software package, freely available at this http URL

Variable-length haplotype construction for gene-gene interaction studies

Anunchai Assawamakin, Nachol Chaiyaratana, Chanin Limwongse, Saravudh Sinsomros, Pa-thai Yenchitsomanus, Prakarnkiat Youngkong
(Submitted on 19 May 2013)

This paper presents a non-parametric classification technique for identifying a candidate bi-allelic genetic marker set that best describes disease susceptibility in gene-gene interaction studies. The developed technique functions by creating a mapping between inferred haplotypes and case/control status. The technique cycles through all possible marker combination models generated from the available marker set; the best interaction model is determined from prediction accuracy and two auxiliary criteria, namely low-to-high order haplotype propagation capability and model parsimony. Since variable-length haplotypes are created during the best model identification, the developed technique is referred to as the variable-length haplotype construction for gene-gene interaction (VarHAP) technique. VarHAP has been benchmarked against a multifactor dimensionality reduction (MDR) program and a haplotype interaction technique embedded in the FAMHAP program in various two-locus interaction problems. The results reveal that VarHAP is suitable for all interaction situations in the presence of weak and strong linkage disequilibrium among genetic markers.

SISRS: SNP Identification from Short Read Sequences

Rachel S. Schwartz, Kelly Harkins, Anne C. Stone, Reed A. Cartwright
(Submitted on 16 May 2013)

One of the important challenges in modern phylogenetics is to identify data that can be used to resolve species relationships accurately. Whole-genome shotgun sequencing provides large amounts of data from which to identify phylogenetically informative sites; however, previous studies have required genome assembly or alignment to a reference genome, which is difficult when species are not closely related.
We have developed a pipeline to extract potentially informative sites directly from raw short-read sequence data. Reads are assembled into conserved genome fragments, reads are then aligned to these fragments, and informative sites are identified. This pipeline produced >14000 informative sites from reads for 12 species of Leishmania and a reference genome. When analyzed using standard phylogenetic methods, these data resulted in a fully bifurcating tree with strongly supported nodes.
Our procedure is implemented in the software SISRS (pronounced “scissors”) which is freely available at this https URL.
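The final step of such a pipeline, picking informative sites out of aligned sequences, can be sketched with the standard parsimony-informative criterion: a column counts if at least two different bases each occur in at least two taxa. This is an assumption for illustration; the abstract does not state which criterion SISRS applies, and the function name `informative_sites` and the treatment of `N` and `-` as missing data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def informative_sites(alignment: Dict[str, str], missing: str = "N-") -> List[int]:
    """Return 0-based positions of parsimony-informative columns.

    `alignment` maps taxon name to an aligned sequence; all sequences
    must have the same length. Characters in `missing` are ignored.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_columns = lengths.pop()

    sites: List[int] = []
    for pos in range(n_columns):
        counts = Counter(
            seq[pos].upper()
            for seq in alignment.values()
            if seq[pos].upper() not in missing
        )
        # Bases supported by at least two taxa
        shared = [base for base, count in counts.items() if count >= 2]
        if len(shared) >= 2:
            sites.append(pos)
    return sites
```

A column where a variant appears in only one taxon is excluded because it cannot favor one tree topology over another under parsimony.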

Meta-Analysis of Gene Level Association Tests

Dajiang J. Liu, Gina M. Peloso, Xiaowei Zhan, Oddgeir Holmen, Matthew Zawistowski, Shuang Feng, Majid Nikpay, Paul L. Auer, Anuj Goel, He Zhang, Ulrike Peters, Martin Farrall, Marju Orho-Melander, Charles Kooperberg, Ruth McPherson, Hugh Watkins, Cristen J. Willer, Kristian Hveem, Olle Melander, Sekar Kathiresan, Gonçalo R. Abecasis
(Submitted on 6 May 2013)

The vast majority of connections between complex disease and common genetic variants were identified through meta-analysis, a powerful approach that enables large sample sizes while protecting against common artifacts due to population structure, repeated small-sample analyses, and/or limitations with sharing individual-level data. As the focus of genetic association studies shifts to rare variants, genes and other functional units are becoming the unit of analysis. Here, we propose and evaluate new approaches for meta-analysis of rare variant association. We show that our approach retains useful features of single-variant meta-analytic approaches and demonstrate its utility in a study of blood lipid levels in ~18,500 individuals genotyped with exome arrays.