Adaptive reference-free compression of sequence quality scores

Lilian Janin, Giovanna Rosone, Anthony J. Cox
(Submitted on 1 May 2013)

Motivation:
Rapid technological progress in DNA sequencing has stimulated interest in compressing the vast datasets that are now routinely produced. Relatively little attention has been paid to compressing the quality scores that are assigned to each sequence, even though these scores may be harder to compress than the sequences themselves. By aggregating a set of reads into a compressed index, we find that the majority of bases can be predicted from the sequence of bases that are adjacent to them and hence are likely to be less informative for variant calling or other applications. The quality scores for such bases are aggressively compressed, leaving a relatively small number at full resolution. Since our approach relies directly on redundancy present in the reads, it does not need a reference sequence and is therefore applicable to data from metagenomics and de novo experiments as well as to resequencing data.
Results:
We show that a conservative smoothing strategy affecting 75% of the quality scores above Q2 leads to an overall quality score compression of 1 bit per value with a negligible effect on variant calling. A compression of 0.68 bit per quality value is achieved using a more aggressive smoothing strategy, again with a very small effect on variant calling.
Availability:
Code to construct the BWT and LCP-array on large genomic data sets is part of the BEETL library, available as a GitHub repository at this http URL.
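
To make the smoothing idea concrete, here is a minimal sketch, not the authors' BWT/LCP-based implementation. It assumes that a base counts as "predictable" when the k bases before it occur at least `min_count` times across the reads and are always followed by the same base. The function name, the parameters `k`, `min_count` and `replacement`, and the Phred+33 encoding (in which `#` is Q2) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def smooth_qualities(reads, quals, k=3, min_count=2, replacement="I"):
    """Replace the quality of predictable bases with one constant symbol.

    Hypothetical simplification: a plain k-mer table stands in for the
    compressed index described in the abstract.
    """
    # Count which bases follow each k-mer context across all reads.
    followers = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read)):
            followers[read[i - k:i]][read[i]] += 1

    smoothed = []
    for read, qual in zip(reads, quals):
        out = list(qual)
        for i in range(k, len(read)):
            seen = followers[read[i - k:i]]
            # Predictable: the context recurs and only one base ever follows it.
            predictable = len(seen) == 1 and sum(seen.values()) >= min_count
            # Assumption: scores at or below Q2 ('#' in Phred+33) are left alone.
            if predictable and qual[i] > "#":
                out[i] = replacement
        smoothed.append("".join(out))
    return smoothed
```

Smoothed quality strings contain long runs of a single symbol, which is what lets a general-purpose compressor store them in far fewer bits per value.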

Distilled Single Cell Genome Sequencing and De Novo Assembly for Sparse Microbial Communities

Zeinab Taghavi, Narjes S. Movahedi, Sorin Draghici, Hamidreza Chitsaz
(Submitted on 1 May 2013)

Identification of all species in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier as the number of different species is usually much smaller than the number of cells. Here, we present a novel divide-and-conquer method to sequence and de novo assemble the genomes of all of the different species present in a microbial sample with a sequencing cost and computational complexity proportional to the number of species, not the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide-and-conquer method successfully reduces the cost of sequencing in comparison with the naive exhaustive approach.

Critical case stochastic phylogenetic tree model via the Laplace transform

Krzysztof Bartoszek, Michal Krzeminski
(Submitted on 30 Apr 2013)

Birth-and-death models are now a common mathematical tool to describe branching patterns observed in real-world phylogenetic trees. Liggett and Schinazi (2009) is one such example. The authors propose a simple birth-and-death model that is compatible with phylogenetic trees of both influenza and HIV, depending on the birth rate parameter. An interesting special case of this model is the critical case where the birth rate equals the death rate. This is a non-trivial situation and to study its asymptotic behaviour we employed the Laplace transform. With this we correct the proof of Liggett and Schinazi (2009) in the critical case.

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

Buhm Han, Jae Hoon Sul, Eleazar Eskin, Paul I. W. de Bakker, Soumya Raychaudhuri
(Submitted on 30 Apr 2013)

Meta-analysis of genome-wide association studies is increasingly popular and many meta-analytic methods have been recently proposed. A majority of meta-analytic methods combine information from multiple studies by assuming that studies are independent since individuals collected in one study are unlikely to be collected again by another study. However, it has become increasingly common to utilize the same control individuals among multiple studies to reduce genotyping or sequencing cost. This causes those studies that share the same individuals to be dependent, and spurious associations may arise if overlapping subjects are not taken into account in a meta-analysis. In this paper, we propose a general framework for meta-analyzing dependent studies with overlapping subjects. Given dependent studies, our approach “decouples” the studies into independent studies such that meta-analysis methods assuming independent studies can be applied. This enables many meta-analysis methods, such as the random effects model, to account for overlapping subjects. Another advantage is that one can continue to use preferred software in the analysis pipeline which may not support overlapping subjects. Using simulations and the Wellcome Trust Case Control Consortium data, we show that our decoupling approach allows both the fixed and the random effects models to account for overlapping subjects while retaining desirable false positive rate and power.
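
The effect of overlapping subjects can be illustrated with a small sketch. This is not the decoupling approach proposed in the paper. It is the standard generalized least squares pooling of two effect estimates, under the assumption that the correlation `r` between them is known. The function name and the example numbers are hypothetical.

```python
import math


def pooled_effect(x1, se1, x2, se2, r=0.0):
    """Fixed effects pooling of two estimates with correlation r.

    Uses the generalized least squares form for a 2x2 covariance matrix;
    with r = 0 it reduces to ordinary inverse-variance weighting.
    Undefined when r = 1 and se1 == se2 (the denominator is zero).
    """
    v1, v2 = se1 ** 2, se2 ** 2
    c = r * se1 * se2                      # covariance induced by shared subjects
    denom = v1 + v2 - 2 * c
    beta = ((v2 - c) * x1 + (v1 - c) * x2) / denom
    var = (v1 * v2 - c * c) / denom
    return beta, math.sqrt(var)


# Two studies with the same estimate; only the assumed correlation differs.
beta_ind, se_ind = pooled_effect(0.2, 0.1, 0.2, 0.1, r=0.0)
beta_dep, se_dep = pooled_effect(0.2, 0.1, 0.2, 0.1, r=0.5)
```

With a positive correlation the pooled standard error is larger than under independence, so treating dependent studies as independent overstates the evidence for association.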

Most viewed on Haldane’s Sieve: April 2013

Below are the most viewed posts on Haldane’s Sieve in April 2013. We’ve listed six instead of our usual five posts because the last two posts had identical numbers of views at the time of writing.