Likelihood Estimation with Incomplete Array Variate Observations

Likelihood Estimation with Incomplete Array Variate Observations

Deniz Akdemir
doi: http://dx.doi.org/10.1101/012278

Missing data present an important challenge when dealing with high dimensional data arranged in the form of an array. In this paper, we propose methods for estimation of the parameters of array variate normal probability model from partially observed multi-way data. The methods developed here are useful for missing data imputation, estimation of mean and covariance parameters for multi-way data. A multi-way semi-parametric mixed effects model that allows separation of multi-way covariance effects is also defined and an efficient algorithm for estimation based on the spectral decompositions of the covariance parameters is recommended. We demonstrate our methods with simulations and with real life data involving the estimation of genotype and environment interaction effects on possibly correlated traits.

Revealing missing isoforms encoded in the human genome by integrating genomic, transcriptomic and proteomic data

Revealing missing isoforms encoded in the human genome by integrating genomic, transcriptomic and proteomic data

Zhiqiang Hu, Hamish S. Scott, Guangrong Qin, Guangyong Zheng, Xixia Chu, Lu Xie, David L. Adelson, Bergithe E. Oftedal, Parvalthy Venugopal, Milena Barbic, Christopher N. Hahn, Bing Zhang, Xiaojing Wang, Nan Li, Chaochun Wei
doi: http://dx.doi.org/10.1101/012112

Biological and biomedical research relies on comprehensive understanding of protein-coding transcripts. However, the total number of human proteins is still unknown due to the prevalence of alternative splicing and is much larger than the number of human genes. In this paper, we detected 31,566 novel transcripts with coding potential by filtering our ab initio predictions with 50 RNA-seq datasets from diverse tissues/cell lines. PCR followed by MiSeq sequencing showed that at least 84.1% of these predicted novel splice sites could be validated. In contrast to known transcripts, the expression of these novel transcripts were highly tissue-specific. Based on these novel transcripts, at least 36 novel proteins were detected from shotgun proteomics data of 41 breast samples. We also showed L1 retrotransposons have a more significant impact on the origin of new transcripts/genes than previously thought. Furthermore, we found that alternative splicing is extraordinarily widespread for genes involved in specific biological functions like protein binding, nucleoside binding, neuron projection, membrane organization and cell adhesion. In the end, the total number of human transcripts with protein-coding potential was estimated to be at least 204,950.

Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants

Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants

Sarah A Gagliano, Reena Ravji, Michael R Barnes, Michael E Weale, Jo Knight
doi: http://dx.doi.org/10.1101/011445

Although technology has triumphed in facilitating routine genome re-sequencing, new challenges have been created for the data analyst. Genome scale surveys of human disease variation generate volumes of data that far exceed capabilities for laboratory characterization, and importantly also create a substantial burden of type I error. By incorporating a variety of functional annotations as predictors, such as regulatory and protein coding elements, statistical learning has been widely investigated as a mechanism for the prioritization of genetic variants that are more likely to be associated with complex disease. These methods offer a hope of identification of sufficiently large numbers of truly associated variants, to make cost-effective the large-scale functional characterization necessary to progress genome scale experiments. We compared the results from three published prioritization procedures which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding of the functional annotations. In this paper we also explore different combinations of algorithm and annotation set. We train the models in 60% of the data and reserve the remainder for testing the accuracy. As an application, we tested which methodology performed the best for prioritizing sub-genome-wide-significant variants using data from the first and second rounds of a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71). However, predictive accuracy results obtained from the test set do not always reflect results obtained from the application to the schizophrenia meta-analysis. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved towards the step change in the risk variant prediction required to address the impending bottleneck of the new generation of genome re-sequencing studies.

How Well Can We Detect Shifts in Rates of Lineage Diversification? A Simulation Study of Sequential AIC Methods

How Well Can We Detect Shifts in Rates of Lineage Diversification? A Simulation Study of Sequential AIC Methods

Michael R May, Brian R Moore
doi: http://dx.doi.org/10.1101/011452

Evolutionary biologists have long been fascinated by the extreme differences in species numbers across branches of the Tree of Life. This has motivated the development of statistical phy- logenetic methods for detecting shifts in the rate of lineage diversification (speciation – extinction). One of the most frequently used methods—implemented in the program MEDUSA—explores a set of diversification-rate models, where each model uniquely assigns branches of the phylogeny to a set of one or more diversification-rate categories. Each candidate model is first fit to the data, and the Akaike Information Criterion (AIC) is then used to identify the optimal diversification model. Surprisingly, the statistical behavior of this popular method is completely unknown, which is a concern in light of the poor performance of the AIC as a means of choosing among models in other phylogenetic comparative contexts, and also because of the ad hoc algorithm used to visit models. Here, we perform an extensive simulation study demonstrating that, as implemented, MEDUSA (1) has an extremely high Type I error rate (on average, spurious diversification-rate shifts are identi- fied 42% of the time), and (2) provides severely biased parameter estimates (on average, estimated net-diversification and relative-extinction rates are 183% and 20% of their true values, respectively). We performed simulation experiments to reveal the source(s) of these pathologies, which include (1) the use of incorrect critical thresholds for model selection, and (2) errors in the likelihood function. Understanding the statistical behavior of MEDUSA is critical both to empirical researchers—in order to clarify whether these methods can reliably be applied to empirical datasets—and to theoretical biologists—in order to clarify whether new methods are required, and to reveal the specific problems that need to be solved in order to develop more reliable approaches for detecting shifts in the rate of lineage diversification.

R/qtlcharts: interactive graphics for quantitative trait locus mapping

R/qtlcharts: interactive graphics for quantitative trait locus mapping

Karl W Broman
doi: http://dx.doi.org/10.1101/011437

Every data visualization can be improved with some level of interactivity. Interactive graphics hold particular promise for the exploration of high-dimensional data. R/qtlcharts is an R package to create interactive graphics for experiments to map quantitative trait loci (QTL; genetic loci that influence quantitative traits). R/qtlcharts serves as a companion to the R/qtl package, providing interactive versions of R/qtl’s static graphs, as well as additional interactive graphs for the exploration of high-dimensional genotype and phenotype data.

What to compare and how: comparative transcriptomics for Evo-Devo

What to compare and how: comparative transcriptomics for Evo-Devo

Julien Roux, Marta Rosikiewicz, Marc Robinson-Rechavi
doi: http://dx.doi.org/10.1101/011213

Evolutionary developmental biology has grown historically from the capacity to relate patterns of evolution in anatomy to patterns of evolution of expression of specific genes, whether between very distantly related species, or very closely related species or populations. Scaling up such studies by taking advantage of modern transcriptomics brings promising improvements, allowing us to estimate the overall impact and molecular mechanisms of convergence, constraint or innovation in anatomy and development. But it also presents major challenges, including the computational definitions of anatomical homology and of organ function, the criteria for the comparison of developmental stages, the annotation of transcriptomics data to proper anatomical and developmental terms, and the statistical methods to compare transcriptomic data between species to highlight significant conservation or changes. In this article, we review these challenges, and the ongoing efforts to address them, which are emerging from bioinformatics work on ontologies, evolutionary statistics, and data curation, with a focus on their implementation in the context of the development of our database Bgee (http://bgee.org).

Network Methods for Pathway Analysis of Genomic Data (Review)

Network Methods for Pathway Analysis of Genomic Data (Review)

Rosemary Braun, Sahil Shah
(Submitted on 7 Nov 2014)

Rapid advances in high-throughput technologies have led to considerable interest in analyzing genome-scale data in the context of biological pathways, with the goal of identifying functional systems that are involved in a given phenotype. In the most common approaches, biological pathways are modeled as simple sets of genes, neglecting the network of interactions comprising the pathway and treating all genes as equally important to the pathway’s function. Recently, a number of new methods have been proposed to integrate pathway topology in the analyses, harnessing existing knowledge and enabling more nuanced models of complex biological systems. However, there is little guidance available to researches choosing between these methods. In this review, we discuss eight topology-based methods, comparing their methodological approaches and appropriate use cases. In addition, we present the results of the application of these methods to a curated set of ten gene expression profiling studies using a common set of pathway annotations. We report the computational efficiency of the methods and the consistency of the results across methods and studies to help guide users in choosing a method. We also discuss the challenges and future outlook for improved network analysis methodologies.

WASP: allele-specific software for robust discovery of molecular quantitative trait loci

WASP: allele-specific software for robust discovery of molecular quantitative trait loci

Bryce van de Geijn, Graham McVicker, Yoav Gilad, Jonathan Pritchard
doi: http://dx.doi.org/10.1101/011221

Allele-specific sequencing reads provide a powerful signal for identifying molecular quantitative trait loci (QTLs), however they are challenging to analyze and prone to technical artefacts. Here we describe WASP, a suite of tools for unbiased allele-specific read mapping and discovery of molecular QTLs. Using simulated reads, RNA-seq reads and ChIP-seq reads, we demonstrate that our approach has a low error rate and is far more powerful than existing QTL mapping approaches.

Differential gene co-expression networks via Bayesian biclustering models

Differential gene co-expression networks via Bayesian biclustering models

Chuan Gao, Shiwen Zhao, Ian C. McDowell, Christopher D. Brown, Barbara E. Engelhardt
(Submitted on 7 Nov 2014)

Identifying latent structure in large data matrices is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are locally co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-regulated genes whose covariation may be observed in only a subset of the samples. Our biclustering method, BicMix, has desirable properties, including allowing overcomplete representations of the data, computational tractability, and jointly modeling unknown confounders and biological signals. Compared with related biclustering methods, BicMix recovers latent structure with higher precision across diverse simulation scenarios. Further, we develop a method to recover gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and recover a gene co-expression network that is differential across ER+ and ER- samples.

Annotating RNA motifs in sequences and alignments

Annotating RNA motifs in sequences and alignments
Paul P Gardner, Hisham Eldai
doi: http://dx.doi.org/10.1101/011197

RNA performs a diverse array of important functions across all cellular life. These functions include important roles in translation, building translational machinery and maturing messenger RNA. More recent discoveries include the miRNAs and bacterial sRNAs that regulate gene expression, the thermosensors, riboswitches and other cis-regulatory elements that help prokaryotes sense their environment and eukaryotic piRNAs that suppress transposition. However, there can be a long period between the initial discovery of a RNA and determining its function. We present a bioinformatic approach to characterise RNA motifs, which are the central building blocks of RNA structure. These motifs can, in some instances, provide researchers with functional hypotheses for uncharacterised RNAs. Moreover, we introduce a new profile-based database of RNA motifs – RMfam – and illustrate its application for investigating the evolution and functional characterisation of RNA. All the data and scripts associated with this work is available from: https://github.com/ppgardne/RMfam