Network Methods for Pathway Analysis of Genomic Data (Review)

Rosemary Braun, Sahil Shah
(Submitted on 7 Nov 2014)

Rapid advances in high-throughput technologies have led to considerable interest in analyzing genome-scale data in the context of biological pathways, with the goal of identifying functional systems that are involved in a given phenotype. In the most common approaches, biological pathways are modeled as simple sets of genes, neglecting the network of interactions comprising the pathway and treating all genes as equally important to the pathway’s function. Recently, a number of new methods have been proposed to integrate pathway topology in the analyses, harnessing existing knowledge and enabling more nuanced models of complex biological systems. However, there is little guidance available to researchers choosing between these methods. In this review, we discuss eight topology-based methods, comparing their methodological approaches and appropriate use cases. In addition, we present the results of the application of these methods to a curated set of ten gene expression profiling studies using a common set of pathway annotations. We report the computational efficiency of the methods and the consistency of the results across methods and studies to help guide users in choosing a method. We also discuss the challenges and future outlook for improved network analysis methodologies.
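
To make the "simple sets of genes" baseline concrete, here is a minimal sketch of an over-representation test, one common way set-based pathway analysis is carried out. It is not taken from the review: the function name and arguments are hypothetical, and the calculation is the standard one-sided hypergeometric tail. Note that every gene counts equally and no interaction between genes enters the calculation, which is the limitation the abstract points to.

```python
from math import comb


def enrichment_p_value(n_genes, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value for pathway over-representation.

    n_genes    : total number of genes measured (the background)
    n_pathway  : number of background genes annotated to the pathway
    n_selected : number of genes flagged as interesting (e.g. differentially expressed)
    n_overlap  : number of flagged genes that are also in the pathway

    Returns P(overlap >= n_overlap) when n_selected genes are drawn at random.
    All names here are illustrative, not part of any published tool.
    """
    total = comb(n_genes, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A call such as `enrichment_p_value(20000, 150, 400, 12)` would ask how surprising it is to see 12 of 150 pathway genes among 400 flagged genes; the numbers are made up for illustration. Topology-based methods differ in that they also use the edges between those genes.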

A multivariate genome-wide association analysis of 10 LDL subfractions, and their response to statin treatment, in 1868 Caucasians

Heejung Shim, Daniel I Chasman, Joshua D Smith, Samia Mora, Paul M Ridker, Deborah A Nickerson, Ronald M Krauss, Matthew Stephens
doi: http://dx.doi.org/10.1101/011270

We conducted a genome-wide association analysis of 7 subfractions of low density lipoproteins (LDLs) and 3 subfractions of intermediate density lipoproteins (IDLs) measured by gradient gel electrophoresis, and their response to statin treatment, in 1868 individuals of European ancestry from the Pharmacogenomics and Risk of Cardiovascular Disease study. Our analyses identified four previously-implicated loci (SORT1, APOE, LPA, and CETP) as containing variants that are very strongly associated with lipoprotein subfractions (log10 Bayes Factor > 15). Subsequent conditional analyses suggest that three of these (APOE, LPA and CETP) likely harbor multiple independently associated SNPs. Further, while different variants typically showed different characteristic patterns of association with combinations of subfractions, the two SNPs in CETP show strikingly similar patterns – both in our original data and in a replication cohort – consistent with a common underlying molecular mechanism. Notably, the CETP variants are very strongly associated with LDL subfractions, despite showing no association with total LDLs in our study, illustrating the potential value of the more detailed phenotypic measurements. In contrast with these strong subfraction associations, genetic association analysis of subfraction response to statins showed much weaker signals (none exceeding log10 Bayes Factor of 6). However, two SNPs (in APOE and LPA) previously-reported to be associated with LDL statin response do show some modest evidence for association in our data, and the subfraction response profiles at the LPA SNP are consistent with the LPA association, with response likely being due primarily to resistance of Lp(a) particles to statin therapy. An additional important feature of our analysis is that, unlike most previous analyses of multiple related phenotypes, we analyzed the subfractions jointly, rather than one at a time. 
Comparisons of our multivariate analyses with standard univariate analyses demonstrate that multivariate analyses can substantially increase power to detect associations. Software implementing our multivariate analysis methods is available at http://stephenslab.uchicago.edu/software.html.

WASP: allele-specific software for robust discovery of molecular quantitative trait loci

Bryce van de Geijn, Graham McVicker, Yoav Gilad, Jonathan Pritchard
doi: http://dx.doi.org/10.1101/011221

Allele-specific sequencing reads provide a powerful signal for identifying molecular quantitative trait loci (QTLs); however, they are challenging to analyze and prone to technical artefacts. Here we describe WASP, a suite of tools for unbiased allele-specific read mapping and discovery of molecular QTLs. Using simulated reads, RNA-seq reads and ChIP-seq reads, we demonstrate that our approach has a low error rate and is far more powerful than existing QTL mapping approaches.

Differential gene co-expression networks via Bayesian biclustering models

Chuan Gao, Shiwen Zhao, Ian C. McDowell, Christopher D. Brown, Barbara E. Engelhardt
(Submitted on 7 Nov 2014)

Identifying latent structure in large data matrices is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are locally co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-regulated genes whose covariation may be observed in only a subset of the samples. Our biclustering method, BicMix, has desirable properties, including allowing overcomplete representations of the data, computational tractability, and jointly modeling unknown confounders and biological signals. Compared with related biclustering methods, BicMix recovers latent structure with higher precision across diverse simulation scenarios. Further, we develop a method to recover gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and recover a gene co-expression network that is differential across ER+ and ER- samples.

A Hidden Markov Model for Investigating Recent Positive Selection through Haplotype Structure

Hua Chen, Jody Hey, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/011247

Recent positive selection can increase the frequency of an advantageous mutant rapidly enough that a relatively long ancestral haplotype remains intact around it. We present a hidden Markov model (HMM) to identify such haplotype structures. With HMM-identified haplotype structures, a population genetic model for the extent of ancestral haplotypes is then adopted for parameter inference of the selection intensity and the allele age. Simulations show that this method can detect selection under a wide range of conditions and has higher power than the existing frequency spectrum-based method. In addition, it provides good estimates of the selection coefficients and allele ages for strong selection. The method analyzes large data sets in a reasonable amount of running time. This method is applied to HapMap III data for a genome scan, and identifies a list of candidate regions putatively under recent positive selection. It is also applied to several genes known to be under recent positive selection, including the LCT, KITLG and TYRP1 genes in Northern Europeans, and OCA2 in East Asians, to estimate their allele ages and selection coefficients.

Estimating the Relative Rate of Recombination to Mutation in Bacteria from Single-Locus Variants using Composite Likelihood Methods

Paul Fearnhead, Shoukai Yu, Patrick Biggs, Barbara Holland, Nigel French
(Submitted on 5 Nov 2014)

A number of studies have suggested using comparisons between DNA sequences of closely related bacterial isolates to estimate the relative rate of recombination to mutation for that bacterial species. We consider such an approach which uses single locus variants: pairs of isolates whose DNA differ at a single gene locus. One way of deriving point estimates for the relative rate of recombination to mutation from such data is to use composite likelihood methods. We extend recent work in this area so as to be able to construct confidence intervals for our estimates, without needing to resort to computationally-intensive bootstrap procedures, and to develop a test for whether the relative rate varies across loci. Both our test and method for constructing confidence intervals are obtained by modelling the dependence structure in the data, and then applying asymptotic theory regarding the distribution of estimators obtained using a composite likelihood. We applied these methods to multi-locus sequence typing (MLST) data from eight bacteria, finding strong evidence for considerable rate variation in three of these: Bacillus cereus, Enterococcus faecium and Klebsiella pneumoniae.

CauseMap: Fast inference of causality from complex time series

M. Cyrus Maher, Ryan D. Hernandez

Background: Establishing health-related causal relationships is a central pursuit in biomedical research. Yet, the interdependent non-linearity of biological systems renders causal dynamics laborious and at times impractical to disentangle. This pursuit is further impeded by the dearth of time series that are sufficiently long to observe and understand recurrent patterns of flux. However, as data generation costs plummet and technologies like wearable devices democratize data collection, we anticipate a coming surge in the availability of biomedically-relevant time series data. Given the life-saving potential of these burgeoning resources, it is critical to invest in the development of open source software tools that are capable of drawing meaningful insight from vast amounts of time series data.

Results: Here we present CauseMap, the first open source implementation of convergent cross mapping (CCM), a method for establishing causality from long time series data (> ~25 observations). Compared to existing time series methods, CCM has the advantage of being model-free and robust to unmeasured confounding that could otherwise induce spurious associations. CCM builds on Takens’ Theorem, a well-established result from dynamical systems theory that requires only mild assumptions. This theorem allows us to reconstruct high dimensional system dynamics using a time series of only a single variable. These reconstructions can be thought of as shadows of the true causal system. If the reconstructed shadows can predict points from the opposing time series, we can infer that the corresponding variables are providing views of the same causal system, and so are causally related. Unlike traditional metrics, this test can establish the directionality of causation, even in the presence of feedback loops. Furthermore, since CCM can extract causal relationships from time series of, e.g. a single individual, it may be a valuable tool for personalized medicine. We implement CCM in Julia, a high-performance programming language designed for facile technical computing. Our software package, CauseMap, is platform-independent and freely available as an official Julia package.
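
The shadow-reconstruction idea described above can be sketched in a few lines. What follows is an illustrative Python/NumPy toy, not CauseMap's Julia code or API: the function names, the default embedding dimension `E=2`, the lag `tau=1`, and the coupled logistic maps used as test data are all assumptions made for this example. It builds a delay embedding (the "shadow") of one series, then uses each point's nearest neighbours on that shadow to predict the other series, and reports the correlation between prediction and truth as the cross-map skill.

```python
import numpy as np


def delay_embed(series, E, tau):
    """Delay embedding: row i is (s[t], s[t - tau], ..., s[t - (E - 1) * tau])."""
    series = np.asarray(series, dtype=float)
    start = (E - 1) * tau
    n = len(series)
    return np.column_stack(
        [series[start - j * tau: n - j * tau] for j in range(E)]
    )


def cross_map_skill(source, target, E=2, tau=1):
    """Predict `target` from the shadow manifold of `source`.

    Returns the Pearson correlation between predicted and observed target
    values. Hypothetical helper for illustration only.
    """
    shadow = delay_embed(source, E, tau)
    actual = np.asarray(target, dtype=float)[(E - 1) * tau:]

    # Pairwise distances between shadow points; a point may not be its own neighbour.
    dist = np.linalg.norm(shadow[:, None, :] - shadow[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)

    # E + 1 nearest neighbours, weighted by distance relative to the closest one.
    neighbours = np.argsort(dist, axis=1)[:, :E + 1]
    d = np.take_along_axis(dist, neighbours, axis=1)
    weights = np.exp(-d / np.maximum(d[:, :1], 1e-12))
    weights /= weights.sum(axis=1, keepdims=True)

    predicted = (weights * actual[neighbours]).sum(axis=1)
    return np.corrcoef(predicted, actual)[0, 1]


def simulate_coupled_logistic(n, coupling=0.1):
    """Toy data (an assumption for this sketch): x evolves alone, y is nudged by x."""
    x = np.empty(n)
    y = np.empty(n)
    x[0], y[0] = 0.4, 0.2
    for t in range(n - 1):
        x[t + 1] = x[t] * (3.8 - 3.8 * x[t])
        y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - coupling * x[t])
    return x, y
```

With `x, y = simulate_coupled_logistic(300)`, comparing `cross_map_skill(y, x)` against `cross_map_skill(x, y)` gives a feel for the directional test: in the CCM literature, skill in recovering x from the shadow of y is read as evidence that x influences y. A fuller analysis would also check that the skill converges as more of the series is used, which this sketch omits.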

Conclusions: CauseMap is an efficient implementation of a state-of-the-art algorithm for detecting causality from time series data. We believe this tool will be a valuable resource for biomedical research and personalized medicine.

Most viewed on Haldane’s Sieve: October 2014

The most viewed preprints this month were:

Analyses of Eurasian wild and domestic pig genomes reveals long-term gene-flow during domestication

Laurent A.F. Frantz, Joshua Schraiber, Ole Madsen, Hendrik-Jan Megens, Alex Cagan, Mirte Bosse, Yogesh Paudel, Richard P.M.A. Crooijmans, Greger Larson, Martien A.M. Groenen
doi: http://dx.doi.org/10.1101/010959

Traditionally, the process of domestication is assumed to be initiated by people, involve few individuals and rely on reproductive isolation between wild and domestic forms. However, an emerging zooarcheological consensus depicts animal domestication as a long-term process without reproductive isolation or strong intentional selection. Here, we ask whether pig domestication followed a traditional linear model, or a complex, reticulate model as predicted by zooarcheologists. To do so, we fit models of domestication to whole genome data from over 100 wild and domestic pigs. We found that the assumptions of traditional models, such as reproductive isolation and strong domestication bottlenecks, are incompatible with the genetic data and provide support for the zooarcheological theory of a complex domestication process. In particular, gene-flow from wild to domestic pigs was a ubiquitous feature of the domestication of pigs. In addition, we show that despite gene-flow, the genomes of domestic pigs show strong signatures of selection at loci that affect behaviour and morphology. Specifically, our results are consistent with independent parallel sweeps in two independent domestication areas (China and Anatolia) at loci linked to morphological traits. We argue that recurrent selection for domestic traits likely counteracted the homogenising effect of gene-flow from wild boars and created “islands of domestication” in the genome. Overall, our results suggest that genomic approaches that allow for more complex models of domestication to be embraced should be employed. The results from these studies will have significant ramifications for studies that attempt to infer the origin of domesticated animals.

E. coli populations in unpredictably fluctuating environments evolve to face novel stresses through enhanced efflux activity

Shraddha Madhav Karve, Sachit Daniel, Yashraj Chavhan, Abhishek Anand, Somendra Singh Kharola, Sutirth Dey
doi: http://dx.doi.org/10.1101/011007

There is considerable understanding about how laboratory populations respond to predictable (constant or deteriorating-environment) selection for single environmental variables like temperature or pH. However, such insights may not apply when selection environments comprise multiple variables that fluctuate unpredictably, as is common in nature. To address this issue, we grew replicate laboratory populations of E. coli in nutrient broth whose pH and concentrations of salt (NaCl) and hydrogen peroxide (H2O2) were randomly changed daily. After ~170 generations, the fitness of the selected populations had not increased in any of the three selection environments. However, these selected populations had significantly greater fitness in four novel environments which have no known fitness-correlation with tolerance to pH, NaCl or H2O2. Interestingly, contrary to expectations, hypermutators did not evolve. Instead, the selected populations evolved an increased ability for energy-dependent efflux activity that might enable them to expel toxins, including antibiotics, from the cell at a faster rate. This provides an alternate mechanism for how evolvability can evolve in bacteria and potentially lead to broad-spectrum antibiotic resistance, even in the absence of prior antibiotic exposure. Given that environmental variability is increasing in nature, this might have serious consequences for public health.