CauseMap: Fast inference of causality from complex time series

M. Cyrus Maher, Ryan D. Hernandez

Background: Establishing health-related causal relationships is a central pursuit in biomedical research. Yet, the interdependent non-linearity of biological systems renders causal dynamics laborious and at times impractical to disentangle. This pursuit is further impeded by the dearth of time series that are sufficiently long to observe and understand recurrent patterns of flux. However, as data generation costs plummet and technologies like wearable devices democratize data collection, we anticipate a coming surge in the availability of biomedically relevant time series data. Given the life-saving potential of these burgeoning resources, it is critical to invest in the development of open source software tools that are capable of drawing meaningful insight from vast amounts of time series data.

Results: Here we present CauseMap, the first open source implementation of convergent cross mapping (CCM), a method for establishing causality from long time series data (> ~25 observations). Compared to existing time series methods, CCM has the advantage of being model-free and robust to unmeasured confounding that could otherwise induce spurious associations. CCM builds on Takens’ Theorem, a well-established result from dynamical systems theory that requires only mild assumptions. This theorem allows us to reconstruct high dimensional system dynamics using a time series of only a single variable. These reconstructions can be thought of as shadows of the true causal system. If the reconstructed shadows can predict points from the opposing time series, we can infer that the corresponding variables are providing views of the same causal system, and so are causally related. Unlike traditional metrics, this test can establish the directionality of causation, even in the presence of feedback loops. Furthermore, since CCM can extract causal relationships from time series of, e.g., a single individual, it may be a valuable tool for personalized medicine. We implement CCM in Julia, a high-performance programming language designed for facile technical computing. Our software package, CauseMap, is platform-independent and freely available as an official Julia package.

Conclusions: CauseMap is an efficient implementation of a state-of-the-art algorithm for detecting causality from time series data. We believe this tool will be a valuable resource for biomedical research and personalized medicine.
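
The cross-mapping idea in the Results paragraph can be illustrated with a minimal sketch. This is not CauseMap's Julia code or API: the function names, the toy coupled logistic maps, the default embedding dimension `E` and lag `tau`, and the exponential neighbour weighting are all assumptions made for illustration. The sketch also omits the convergence check (skill as a function of library length) that gives the method its name.

```python
import math

def delay_embed(x, E, tau):
    """Shadow manifold: one vector (x[t], x[t-tau], ..., x[t-(E-1)*tau]) per usable t."""
    start = (E - 1) * tau
    return [tuple(x[t - i * tau] for i in range(E)) for t in range(start, len(x))]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

def cross_map_skill(source, target, E=2, tau=1):
    """Estimate `target` from neighbours on the shadow manifold of `source`.

    Returns the correlation between the estimates and the observed values.
    """
    manifold = delay_embed(source, E, tau)
    observed = list(target[(E - 1) * tau:])
    estimates = []
    for i, point in enumerate(manifold):
        # E + 1 nearest neighbours of this point, excluding the point itself.
        nearest = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(manifold) if j != i
        )[:E + 1]
        d_min = max(nearest[0][0], 1e-12)  # guard against division by zero
        weights = [math.exp(-d / d_min) for d, _ in nearest]
        estimate = sum(w * observed[j] for w, (_, j) in zip(weights, nearest))
        estimates.append(estimate / sum(weights))
    return pearson(estimates, observed)

def coupled_logistic(n, rx=3.8, ry=3.5, b_xy=0.0, b_yx=0.1, x0=0.4, y0=0.2):
    """Hypothetical toy system: X drives Y (b_yx > 0), Y does not drive X (b_xy = 0)."""
    xs, ys = [x0], [y0]
    for _ in range(n - 1):
        x, y = xs[-1], ys[-1]
        xs.append(x * (rx - rx * x - b_xy * y))
        ys.append(y * (ry - ry * y - b_yx * x))
    return xs, ys

x, y = coupled_logistic(400)
print("Y's shadow estimating X:", round(cross_map_skill(y, x), 3))
print("X's shadow estimating Y:", round(cross_map_skill(x, y), 3))
```

Note the direction of the test: if X drives Y, then Y's history carries an imprint of X, so it is the shadow manifold of Y that should be able to recover X.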

Genetic Studies of Physiological Traits with Their Application to Sleep Apnea

D.Y. Lee, C. Hanis, G.I. Bell, D.A. Aguilar, S. Redline, J. Below, M.M. Xiong
(Submitted on 27 Oct 2014)

Advances in modern sensing and sequencing technologies generate a deluge of high-dimensional spatiotemporal physiological and next-generation sequencing (NGS) data. Physiological traits are observed either as continuous random functions or on a dense grid, and are referred to as function-valued traits. Both physiological and NGS data are highly correlated, with an inherent order, spacing, and functional nature that are ignored by traditional summary-based univariate and multivariate regression methods designed for quantitative genetic analysis of scalar traits and common variants. To capture morphological and dynamic features of the data and utilize their dependence structure, we propose a functional linear model (FLM) in which a trait curve is modeled as a response function, the genetic variation in a genomic region or gene is modeled as a functional predictor, and the genetic effects are modeled as a function of both time and genomic position (FLMF), for genetic analysis of function-valued traits with both GWAS and NGS data. Through extensive simulations, we demonstrate that the FLMF has correct type 1 error rates and much higher power to detect association than existing methods. The FLMF is applied to sleep data from the Starr County health studies, in which oxygen saturation was measured over 22,670 seconds on average for 833 individuals. We found 65 genes that were significantly associated with the oxygen saturation functional trait, with P-values ranging from 2.40E-06 to 2.53E-21. The results clearly demonstrate that the FLMF substantially outperforms traditional genetic models with scalar traits.

Rapid Core-Genome Alignment and Visualization for Thousands of Intraspecific Microbial Genomes

Todd J. Treangen, Brian D. Ondov, Sergey Koren, Adam M. Phillippy
doi: http://dx.doi.org/10.1101/007351

Though many microbial species or clades now have hundreds of sequenced genomes, existing whole-genome alignment methods do not efficiently handle comparisons on this scale. Here we present the Harvest suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Combined, they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data, we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, Carson C. Chow, Laurent C.A.M. Tellier, Shashaank Vattikuti, Shaun M. Purcell, James J. Lee
Comments: 2 figures, 1 additional file
Subjects: Genomics (q-bio.GN); Computation (stat.CO)

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information.
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
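
To make "bit-level parallelism" concrete, here is a minimal sketch of counting alleles across many samples with a few bitwise operations instead of a per-sample loop. It is not PLINK's code: the 2-bit encoding below (00 = no alternate copies, 01 = one, 10 = two, 11 = missing) is a hypothetical one chosen for readability and is not claimed to match PLINK's file format, and PLINK itself is written in C/C++ rather than Python.

```python
def pack_genotypes(codes):
    """Pack 2-bit genotype codes into one integer; sample i uses bits 2i and 2i+1."""
    word = 0
    for i, code in enumerate(codes):
        word |= code << (2 * i)
    return word

def popcount(value):
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")

def count_alt_alleles(word, n_samples):
    """Return (alternate allele count, missing genotype count) for a packed word."""
    mask = int("01" * n_samples, 2)   # selects the low bit of every 2-bit slot
    lo = word & mask                  # low bit of each genotype
    hi = (word >> 1) & mask           # high bit of each genotype, shifted down
    het = lo & ~hi                    # slots equal to 01 (one alternate copy)
    hom_alt = hi & ~lo                # slots equal to 10 (two alternate copies)
    missing = lo & hi                 # slots equal to 11
    return popcount(het) + 2 * popcount(hom_alt), popcount(missing)

genotypes = [0, 1, 2, 3, 1, 2]        # hypothetical codes for six samples
packed = pack_genotypes(genotypes)
print(count_alt_alleles(packed, len(genotypes)))
```

Each bitwise operation here touches every sample in the word at once, which is the general idea behind the speedups the abstract attributes to bit-level parallelism.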

RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing

Vikas Gupta, April Dawn Estrada, Ivory Clabaugh Blakley, Rob Reid, Ketan Patel, Mason D. Meyer, Stig Uggerhoj Andersen, Allan F. Brown, Mary Ann Lila, Ann Loraine
doi: http://dx.doi.org/10.1101/010116

Background: Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable breeding berry varieties with enhanced health benefits.

Results: Toward this end, we annotated a draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was upregulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes, cell growth-related genes, and putative transcriptional regulators. Analysis of RNA-Seq alignments also identified developmentally regulated alternative splicing, promoter use, and 3′ end formation.

Conclusions: We report genome sequence, gene models, functional annotations, and RNA-Seq expression data, which provide an important new resource enabling high-throughput studies in blueberry. RNA-Seq data are freely available for visualization in Integrated Genome Browser, and analysis code is available from the git repository at http://bitbucket.org/lorainelab/blueberrygenome.

Leveraging ancestry to improve causal variant identification in exome sequencing for monogenic disorders

Robert P Brown, Hane Lee, Ascia Eskin, Gleb Kichaev, Kirk E Lohmueller, Bruno Reversade, Stanley F Nelson, Bogdan Pasaniuc
doi: http://dx.doi.org/10.1101/010017

Recent breakthroughs in exome sequencing technology have made possible the identification of many causal variants of monogenic disorders. Although extremely powerful when closely related individuals (e.g. child and parents) are simultaneously sequenced, exome sequencing of individual cases alone is often unsuccessful due to the large number of variants that need to be followed up for functional validation. Many approaches remove from consideration common variants above a given frequency threshold (e.g. 1%), and then prioritize the remaining variants according to their allele frequency and their functional, structural, and conservation properties. In this work, we present methods that leverage the genetic structure of different populations while accounting for the finite sample size of the reference panels to improve the variant filtering step. Using simulations and real exome data from individuals with monogenic disorders, we show that our methods significantly reduce the number of variants to be followed up (e.g. a 36% reduction from an average of 418 variants per exome when ancestry is ignored to 267 when ancestry is taken into account for case-only sequenced individuals). Most importantly, our proposed approaches are well calibrated with respect to the probability of filtering out a true causal variant (i.e. false negative rate, FNR), whereas existing approaches are susceptible to high FNR when reference panel sizes are limited.

XWAS: a toolset for genetic data analysis and association studies of the X chromosome

Diana Chang, Feng Gao, Alon Keinan
doi: http://dx.doi.org/10.1101/009795

Summary: We present XWAS (chromosome X-Wide Analysis tool-Set), a toolset specially designed for analysis of the X chromosome in association studies, both on the level of single markers and the level of entire genes. It further offers other X-specific analysis tools, including quality control (QC) procedures for X-linked data. We have applied and tested this software by carrying out several X-wide association studies of autoimmune diseases. Availability and Implementation: The XWAS software package, which includes scripts, the binary executable PLINK/XWAS, and all source code, is freely available for download from http://keinanlab.cb.bscb.cornell.edu/content/tools-data. PLINK/XWAS is implemented in C++; other features are implemented in shell scripts and Perl. This software package is designed for Linux systems.

Estimating gene expression and codon specific translational efficiencies, mutation biases, and selection coefficients from genomic data

Michael Gilchrist, Wei-Chen Chen, Premal Shah, Russell Zaretzki
doi: http://dx.doi.org/10.1101/009670

The time and cost of generating a genomic dataset are expected to continue to decline dramatically in the coming years. As a result, extracting biologically meaningful information from this continuing flood of data is a major challenge in biology. In response, we present a powerful Bayesian MCMC method based on a nested model of protein synthesis and population genetics. Analyzing the patterns of codon usage observed within a genome, our algorithm extracts and decouples information on codon-specific translational efficiencies and mutation biases as well as gene-specific expression levels for all coding sequences. This information can be combined to generate gene- and codon-specific estimates of selection on synonymous substitutions. One major advance over previous work is that our method can be used without independent measurements of gene expression. Using the Saccharomyces cerevisiae S288c genome, we compare our model fits with and without independent gene expression measurements and observe an exceptionally high correlation between our codon-specific parameters and gene-specific expression levels (ρ > 0.99 in all cases). We also observe robust correlations between our predictions generated without independent expression measurements and previously published estimates of mutation bias, ribosome pausing time, and empirical estimates of mRNA abundance (ρ = 0.53-0.72). Our results indicate that failing to take mutation bias into account can lead to the misidentification of an amino acid’s ‘optimal’ codon. In conclusion, our method demonstrates that an enormous amount of biologically important information is encoded within genome-scale patterns of codon usage and that this information can be accessed through carefully formulated, biologically based models.

Accounting for experimental noise reveals that transcription dominates control of steady-state protein levels in yeast

Gábor Csárdi, Alexander Franks, David S. Choi, Eduardo M. Airoldi, D. Allan Drummond
doi: http://dx.doi.org/10.1101/009472

Cells respond to their environment by modulating protein levels through mRNA transcription and post-transcriptional control. Modest correlations between global steady-state mRNA and protein measurements have been interpreted as evidence that transcript levels determine roughly 40% of the variation in protein levels, indicating dominant post-transcriptional effects. However, the techniques underlying these conclusions, such as correlation and regression, yield biased results when data are noisy, missing systematically, and collinear—properties of mRNA and protein measurements—which motivated us to revisit this subject. Noise-robust analyses of 25 studies of budding yeast reveal that mRNA levels explain roughly 80% of the variation in steady-state protein levels. Post-transcriptional regulation amplifies rather than competes with the transcriptional signal. Measurements are highly reproducible within but not between studies, and are distorted in part by between-study differences in gene expression. These results substantially revise current models of protein-level regulation and introduce multiple noise-aware approaches essential for proper analysis of many biological phenomena.
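
The claim that noisy measurements bias correlation downward can be checked with a small simulation. The sketch below is an illustration only, not the authors' analysis: the simulated data, the noise level, the sample size, and the use of Spearman's classical correction for attenuation (dividing the observed correlation by the square root of the product of replicate reliabilities) are all assumptions chosen to show the effect.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(0)      # fixed seed so the run is repeatable
n = 20000                   # hypothetical number of genes
true_r2 = 0.8               # assumed true variance explained

# Noise-free (unobservable) levels with the assumed true relationship.
mrna = [rng.gauss(0, 1) for _ in range(n)]
protein = [
    math.sqrt(true_r2) * m + math.sqrt(1 - true_r2) * rng.gauss(0, 1)
    for m in mrna
]

def measure(values):
    """One noisy replicate measurement (noise variance equal to signal variance)."""
    return [v + rng.gauss(0, 1) for v in values]

m1, m2 = measure(mrna), measure(mrna)
p1, p2 = measure(protein), measure(protein)

naive_r = pearson(m1, p1)
# Reliability of each measurement type, estimated from its two replicates.
reliability = pearson(m1, m2) * pearson(p1, p2)
corrected_r = naive_r / math.sqrt(reliability)

print("naive r^2:    ", round(naive_r ** 2, 3))
print("corrected r^2:", round(corrected_r ** 2, 3))
```

In this setup the naive estimate falls well below the assumed true value purely because of measurement noise, while the replicate-based correction moves it back toward that value.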

MUSiCC: Towards an accurate estimation of average genomic copy-numbers in the human microbiome

Ohad Manor, Elhanan Borenstein
doi: http://dx.doi.org/10.1101/009407

Functional metagenomic analyses commonly involve a normalization step, where measured levels of genes or pathways are converted into relative abundances. Here, we demonstrate that this normalization scheme introduces marked biases both across and within human microbiome samples and systematically identify various sample- and gene-specific properties that contribute to these biases. We introduce an alternative normalization paradigm, MUSiCC, which combines universal single-copy genes with machine learning methods to correct these biases and to obtain a more accurate and biologically meaningful measure of gene abundances. Finally, we demonstrate that MUSiCC significantly improves downstream discovery of functional shifts in the microbiome. MUSiCC is available at http://elbo.gs.washington.edu/software.html.
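
The difference between the two normalization schemes can be sketched as follows. This is not the MUSiCC implementation: the gene names and counts are invented, scaling by the median abundance of the universal single-copy genes is an assumed simplification, and the machine learning correction mentioned in the abstract is omitted entirely.

```python
from statistics import median

def relative_abundance(counts):
    """Conventional normalization: each gene as a fraction of the sample total."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}

def copies_per_genome(counts, single_copy_genes):
    """Scale by the median abundance of genes assumed to occur once per genome."""
    scale = median(counts[gene] for gene in single_copy_genes)
    return {gene: count / scale for gene, count in counts.items()}

# Hypothetical universal single-copy genes and two hypothetical samples.
USCG = ["uscg_1", "uscg_2", "uscg_3"]
sample_a = {"uscg_1": 100, "uscg_2": 110, "uscg_3": 90, "gene_a": 200, "gene_b": 50}
sample_b = {"uscg_1": 100, "uscg_2": 110, "uscg_3": 90, "gene_a": 200, "gene_b": 500}

for name, sample in (("A", sample_a), ("B", sample_b)):
    rel = relative_abundance(sample)["gene_a"]
    cpg = copies_per_genome(sample, USCG)["gene_a"]
    print(name, "relative:", round(rel, 3), "copies per genome:", cpg)
```

In this toy case gene_a is equally abundant per genome in both samples, yet its relative abundance differs because gene_b inflates the total in sample B. Anchoring to single-copy genes removes that dependence.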