IQRray, a new method for Affymetrix microarray quality control, and the homologous organ conservation score, a new benchmark method for quality control metrics

Marta Rosikiewicz, Marc Robinson-Rechavi
(Submitted on 8 Oct 2013)

Motivation: Microarray results accumulated in public repositories are widely re-used in meta-analytical studies and secondary databases. The quality of the data obtained with this technology varies from experiment to experiment, and an efficient method for quality assessment is necessary to ensure their reliability. Results: The lack of a good benchmark has hampered evaluation of existing methods for quality control. In this study we propose a new independent quality metric that is based on evolutionary conservation of expression profiles. We show, using 11 large organ-specific datasets, that IQRray, a new quality metric developed by us, exhibits the highest correlation with this reference metric among 14 metrics tested. IQRray outperforms other methods in identifying poor-quality arrays in datasets composed of arrays from many independent experiments. In contrast, the performance of methods designed for detecting outliers in a single experiment, like NUSE and RLE, was low because of the inability of these methods to detect datasets containing only low-quality arrays, and the fact that the scores cannot be directly compared between experiments. Availability: The R implementation of IQRray is available at: this ftp URL

Some mathematical tools for the Lenski experiment

Bernard Ycart (LJK), Agnès Hamon (LJK), Joël Gaffé (LAPM), Dominique Schneider (LAPM)
(Submitted on 2 Oct 2013)

The Lenski experiment is a long-term daily reproduction of Escherichia coli, which has revealed phenotypic and genetic evolution over the years. Some mathematical models that could be useful in understanding the results of that experiment are reviewed here: stochastic and deterministic growth, mutation appearance and fixation, and competition of species.

Characterizing the infection-induced transcriptome of Nasonia vitripennis reveals a preponderance of taxonomically-restricted immune genes

Timothy B. Sackton, John H. Werren, Andrew G. Clark
(Submitted on 23 Sep 2013)

The innate immune system in insects consists of a conserved core signaling network and rapidly diversifying effector and recognition components, often containing a high proportion of taxonomically-restricted genes. In the absence of functional annotation, genes encoding immune system proteins can thus be difficult to identify, as homology-based approaches generally cannot detect lineage-specific genes. Here, we use RNA-seq to compare the uninfected and infection-induced transcriptome in the parasitoid wasp Nasonia vitripennis to identify genes regulated by infection. We identify 183 genes significantly up-regulated by infection and 61 genes significantly down-regulated by infection. We also produce a new homology-based immune catalog in N. vitripennis, and show that most infection-induced genes are not assigned an immune function from homology alone, suggesting the potential for substantial novel immune components in less-well-studied systems. Finally, we show that a high proportion of these novel induced genes are taxonomically-restricted, highlighting the rapid evolution of immune gene content. The combination of functional annotation using RNA-seq and homology-based annotation provides a robust method to characterize the innate immune response across a wide variety of insects, and reveals significant novel features of the Nasonia immune response.

Change point analysis of histone modifications reveals epigenetic blocks with distinct regulatory activity and biological functions

Mengjie Chen, Haifan Lin, Hongyu Zhao
(Submitted on 20 Sep 2013)

Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume that one combination of histone modifications invariantly translates to one transcriptional output regardless of the local chromatin environment. In this study we hypothesize that the genome is organized into local domains that manifest similar enrichment patterns of histone modifications, which leads to orchestrated regulation of expression of genes with relevant biological functions. We propose a multivariate Bayesian Change Point (BCP) model to segment the Drosophila melanogaster genome into consecutive blocks on the basis of combinatorial patterns of histone marks. By modeling the sparse distribution of histone marks across the chromosome with a zero-inflated Gaussian mixture, our partitions capture local BLOCKs that manifest relatively homogeneous enrichment patterns of histone modifications. We further characterized BLOCKs by their transcription levels, distribution of genes, binding profiles of a broad panel of chromatin proteins, degree of co-expression and GO enrichment. Our results demonstrate that these blocks, although inferred merely from histone modifications, are strongly associated with transcription events and chromatin organization, which suggests an important role in coordinated gene regulation.

A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

Sean Whalen, Gaurav Pandey
(Submitted on 19 Sep 2013)

The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown
(Submitted on 11 Sep 2013)

K-mer abundance analysis is widely used for many purposes in sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a CountMin Sketch. The CountMin Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support streaming k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a CountMin Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, and DSK. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer error rates. Khmer is implemented in C++ wrapped with a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
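
As a rough illustration of the data structure named in this abstract, the following is a minimal Python sketch of a CountMin Sketch applied to k-mer counting. It is not khmer's implementation or API: the class `CountMinSketch`, the helpers `kmers` and `count_kmers`, the table sizes, and the use of salted BLAKE2b hashes are all assumptions made for this example, and reverse-complement handling is omitted for brevity.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: `depth` hash rows of `width` counters each (hypothetical sizes)."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # Derive one hash per row by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item):
        # Online update: increment one counter in every row.
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate; it never undercounts.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k, width=1000, depth=4):
    """Stream reads into a sketch; only counts are stored, not the k-mers."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch


if __name__ == "__main__":
    sketch = count_kmers(["ACGTACGT", "TTTTACGT"], k=4)
    print(sketch.count("ACGT"))  # at least the true count of 3
```

The estimate returned by `count` is an upper bound on the true count, which corresponds to the systematic overcount described above; shrinking `width` relative to the number of distinct k-mers saves memory at the cost of more collisions and larger overcounts.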

A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

M. Cyrus Maher, Ryan D. Hernandez
(Submitted on 9 Sep 2013)

Ortholog detection (OD) is a critical step for comparative genomic analysis of protein-coding sequences. There is a range of methods available for OD. However, relative performance varies by application, stymying attempts to identify a single best method. In this paper, we present a novel tool, MOSAIC, which is capable of integrating the entire swath of OD methods. We analyze the results of applying MOSAIC over four methodologically diverse OD methods. Relative to component and competing methods, we demonstrate large gains in the number of detected orthologs while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.

Bayesian Model Selection in Complex Linear Systems, as Illustrated in Genetic Association Studies

Xiaoquan Wen
(Submitted on 3 Sep 2013)

Motivated by examples from genetic association studies, this paper considers the model selection problem in a general complex linear model system and in a Bayesian framework. We discuss formulating model selection problems and incorporating context-dependent a priori information through different levels of prior specifications. We also derive analytic Bayes factors and their approximations to facilitate model selection and discuss their theoretical and computational properties. We demonstrate our Bayesian approach based on an implemented Markov Chain Monte Carlo (MCMC) algorithm in simulations and a real data application of mapping tissue-specific eQTLs. Our novel results on Bayes factors provide a general framework to perform efficient model comparisons in complex linear model systems.

MOSAIK: A hash-based algorithm for accurate next-generation sequencing read mapping

Wan-Ping Lee (1), Michael Stromberg (1 and 2), Alistair Ward (1), Chip Stewart (1 and 3), Erik Garrison (1), Gabor T. Marth (1) ((1) Department of Biology, Boston College, Chestnut Hill, MA, (2) Illumina, Inc., San Diego, CA, (3) Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA)
(Submitted on 4 Sep 2013)

This paper presents an accurate short-read mapper for next-generation sequencing data that is widely used in the 1000 Genomes Project, in human clinical studies, and in genome studies of other species.

Biological Averaging in RNA-Seq

Surojit Biswas, Yash N. Agrawal, Tatiana S. Mucyn, Jeffery L. Dangl, Corbin D. Jones
(Submitted on 3 Sep 2013)

RNA-seq has become a de facto standard for measuring gene expression. Traditionally, RNA-seq experiments are mathematically averaged: they sequence the mRNA of individuals from different treatment groups, hoping to correlate phenotype with differences in arithmetic read count averages at shared loci of interest. Alternatively, the tissue from the same individuals may be pooled prior to sequencing in what we refer to as a biologically averaged design. As mathematical averaging sequences all individuals, it controls for both biological and technical variation; however, is the statistical resolution gained always worth the additional cost? To compare biological and mathematical averaging, we examined theoretical and empirical estimates of statistical efficiency and relative cost efficiency. Though less efficient at a fixed sample size, we found that biological averaging can be more cost efficient than mathematical averaging. With this motivation, we developed a differential expression classifier, ICRBC, to detect alternatively expressed genes between biologically averaged samples. In simulation studies, we found that biological averaging and subsequent analysis with our classifier performed comparably to existing methods, such as ASC, edgeR, and DESeq, especially when individuals were pooled evenly and less than 20% of the regulome was expected to be differentially regulated. In two technically distinct mouse datasets and one plant dataset, we found that our method was over 87% concordant with edgeR for the 100 most significant features. We therefore conclude that biological averaging may control biological variation sufficiently for differences in gene expression to be detectable. In such situations, ICRBC can enable reliable exploratory analysis at a fraction of the cost, especially when interest lies in the most differentially expressed loci.