Graph-based data integration predicts long-range regulatory interactions across the human genome

Sofie Demeyer, Tom Michoel
(Submitted on 29 Apr 2014)

Transcriptional regulation of gene expression is one of the main processes that affect cell diversification from a single set of genes. Regulatory proteins often interact with DNA regions located distally from the transcription start sites (TSS) of the genes. We developed a computational method that combines open chromatin and gene expression information for a large number of cell types to identify these distal regulatory elements. Our method builds correlation graphs for publicly available DNase-seq and exon array datasets with matching samples and uses graph-based methods to filter findings supported by multiple datasets and remove indirect interactions. The resulting set of interactions was validated with both anecdotal information of known long-range interactions and unbiased experimental data deduced from Hi-C and CAGE experiments. Our results provide a novel set of high-confidence candidate open chromatin regions involved in gene regulation, often located several Mb away from the TSS of their target gene.
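As a rough illustration of the correlation-graph idea (this is not the authors' pipeline; the data are synthetic and the region/gene names and the 0.8 threshold are invented for the example), one can link open-chromatin regions to genes whenever their signals correlate across matched samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n_celltypes = 20

# Synthetic data: accessibility of 5 open-chromatin regions and
# expression of 3 genes measured across matched cell types.
accessibility = rng.normal(size=(5, n_celltypes))
expression = rng.normal(size=(3, n_celltypes))
# Make region 0 track gene 0 so that at least one edge appears.
accessibility[0] = expression[0] + 0.1 * rng.normal(size=n_celltypes)

# Correlation graph: an edge (region, gene) whenever the Pearson
# correlation across cell types exceeds a threshold.
edges = []
for i, a in enumerate(accessibility):
    for j, e in enumerate(expression):
        r = np.corrcoef(a, e)[0, 1]
        if abs(r) > 0.8:
            edges.append((f"region_{i}", f"gene_{j}", round(r, 2)))

print(edges)
```

A real analysis would build such graphs per dataset and then, as the abstract describes, intersect them and prune indirect interactions.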

Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments

Greg W. Clark, Sharon H. Ackerman, Elisabeth R. Tillier, Domenico L. Gatti
(Submitted on 26 Apr 2014)

Several methods are available for the detection of covarying positions from a multiple sequence alignment (MSA). If the MSA contains a large number of sequences, information about the proximities between residues derived from covariation maps can be sufficient to predict a protein fold. If the structure is already known, information on the covarying positions can be valuable to understand the protein mechanism.
In this study we have sought to determine whether a multivariate extension of traditional mutual information (MI) can be an additional tool to study covariation. The performance of two multidimensional MI (mdMI) methods, designed to remove the effect of ternary/quaternary interdependencies, was tested with a set of 9 MSAs each containing <400 sequences, and was shown to be comparable to that of methods based on maximum entropy/pseudolikelihood statistical models of protein sequences. However, while all the methods tested detected a similar number of covarying pairs among the residues separated by < 8 Å in the reference X-ray structures, there was on average less than 65% overlap between the top scoring pairs detected by methods that are based on different principles.
We have also attempted to identify whether the difference in performance among methods is due to different efficiency in removing covariation originating from chains of structural contacts. We found that the reason why methods that derive partial correlation between the columns of an MSA provide a better recognition of close contacts is not that they remove chaining effects, but that they filter out the correlation between distant residues that originates from general fitness constraints. In contrast, we found that true chaining effects are expressions of real physical perturbations that propagate inside proteins, and therefore are not removed by the derivation of partial correlation between variables.
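The basic (two-column) MI that the multidimensional methods extend can be computed from observed residue-pair frequencies. A minimal, self-contained sketch on a toy alignment (the alignment itself is made up for illustration):

```python
from collections import Counter
from math import log2

def column_mi(col_x, col_y):
    """Mutual information (in bits) between two alignment columns,
    estimated from observed residue-pair frequencies."""
    n = len(col_x)
    px = Counter(col_x)
    py = Counter(col_y)
    pxy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), with counts folded in
        mi += (c / n) * log2(c * n / (px[a] * py[b]))
    return mi

# Toy alignment of 6 sequences, 2 columns: the columns covary perfectly
# (A always pairs with R, L always with K).
msa = ["AR", "AR", "LK", "LK", "AR", "LK"]
col0 = [s[0] for s in msa]
col1 = [s[1] for s in msa]
print(round(column_mi(col0, col1), 3))  # perfect covariation -> 1.0 bit
```

The mdMI methods in the paper go further by conditioning on third and fourth positions to strip out ternary/quaternary interdependencies; that machinery is not reproduced here.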

Bridging scales in cancer progression: Mapping genotype to phenotype using neural networks

Philip Gerlee, Eunjung Kim, Alexander R.A. Anderson
(Submitted on 28 Apr 2014)

In this review we summarize our recent efforts to understand the role of heterogeneity in cancer progression by using neural networks to characterise different aspects of the mapping from a cancer cell's genotype and environment to its phenotype. Our central premise is that cancer is an evolving system subject to mutation and selection, and the primary conduit for these processes is the cancer cell, whose behaviour is regulated on multiple biological scales. The selection pressure is mainly driven by the microenvironment that the tumour is growing in, and this acts directly upon the cell phenotype. In turn, the phenotype is driven by the intracellular pathways that are regulated by the genotype. Integrating all of these processes is a massive undertaking and requires bridging many biological scales (i.e. genotype, pathway, phenotype and environment), of which we will only scratch the surface in this review. We will focus on models that use neural networks as a means of connecting these different biological scales, since they allow us to easily create heterogeneity for selection to act upon and, importantly, this heterogeneity can be implemented at different biological scales. More specifically, we consider three different neural networks that bridge different aspects of these scales and the dialogue with the micro-environment: (i) the impact of the micro-environment on evolutionary dynamics, (ii) the mapping from genotype to phenotype under drug-induced perturbations and (iii) pathway activity in both normal and cancer cells under different micro-environmental conditions.

Crowdsourced analysis of ash and ash dieback through the Open Ash Dieback project: A year 1 report on datasets and analyses contributed by a self-organising community.

Diane Saunders, Kentaro Yoshida, Christine Sambles, Rachel Glover, Bernardo Clavijo, Manuel Corpas, Daniel Bunting, Suomeng Dong, Matthew Clark, David Swarbreck, Sarah Ayling, Matthew Bashton, Steve Collin, Tsuyoshi Hosoya, Anne Edwards, Lisa Crossman, Graham Etherington, Joe Win, Liliana Cano, David Studholme, J Allan Downie, Mario Caccamo, Sophien Kamoun, Dan MacLean

Ash dieback is a fungal disease of ash trees caused by Hymenoscyphus pseudoalbidus that has swept across Europe in the last two decades and is a significant threat to the ash population. This emergent pathogen has been relatively poorly studied and little is known about its genetic make-up. In response to the arrival of this dangerous pathogen in the UK we took the unusual step of providing an open access database and initial sequence datasets to the scientific community for analysis prior to performing an analysis of our own. Our goal was to crowdsource genomic and other analyses and create a community analysing this pathogen. In this report on the first year of this activity, we describe how the community evolved, the nature and volume of the contributions, and some preliminary insights into the genome and biology of H. pseudoalbidus that emerged. In particular, our nascent community generated a first-pass genome assembly containing abundant collapsed AT-rich repeats, indicating a typically complex genome structure. Our open science and crowdsourcing effort has brought a wealth of new knowledge about this emergent pathogen within a short time-frame. Our community endeavour highlights the positive impact that open, collaborative approaches can have on fast, responsive modern science.

Predicting evolutionary site variability from structure in viral proteins: buriedness, flexibility, and design

Amir Shahmoradi, Dariya K. Sydykova, Stephanie J. Spielman, Eleisha L. Jackson, Eric T. Dawson, Austin G. Meyer, Claus O. Wilke

Several recent works have shown that protein structure can predict site-specific evolutionary sequence variation. In particular, sites that are buried and/or have many contacts with other sites in a structure have been shown to evolve more slowly, on average, than surface sites with few contacts. Here, we present a comprehensive study of the extent to which numerous structural properties can predict sequence variation. The structural properties we considered include buriedness (relative solvent accessibility and contact number), structural flexibility (B factors, root-mean-square fluctuations, and variation in dihedral angles), and variability in designed structures. We obtained structural flexibility measures both from molecular dynamics simulations performed on 9 non-homologous viral protein structures and from variation in homologous variants of those proteins, where available. We obtained measures of variability in designed structures from flexible-backbone design in the Rosetta software. We found that most of the structural properties correlate with site variation in the majority of structures, though the correlations are generally weak (correlation coefficients of 0.1 to 0.4). Moreover, we found that measures of buriedness were better predictors of evolutionary variation than were measures of structural flexibility. Finally, variability in designed structures was a weaker predictor of evolutionary variability than was buriedness, but was comparable in its predictive power to the best structural flexibility measures. We conclude that simple measures of buriedness are better predictors of evolutionary variation than are more complicated predictors obtained from dynamic simulations, ensembles of homologous structures, or computational protein design.
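The kind of weak correlation reported here (coefficients of 0.1 to 0.4) is easy to picture with synthetic data. The sketch below is illustrative only: the per-site values are simulated, and a real analysis would use measured contact numbers and a library routine such as scipy.stats.spearmanr, which handles ties with average ranks (the hand-rolled ranking here breaks ties arbitrarily).

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Ties are broken arbitrarily by position (good enough for a sketch)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(1)
n_sites = 200
# Synthetic per-site data: buried sites (high contact number) tend to
# be less evolutionarily variable, but the relationship is noisy.
contact_number = rng.poisson(8, size=n_sites).astype(float)
site_variability = -0.1 * contact_number + rng.normal(size=n_sites)

rho = spearman(contact_number, site_variability)
print(round(rho, 2))  # weakly negative
```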

Author post: Estimating transcription factor abundance and specificity from genome-wide binding profiles

This guest post is by Radu Zabet on his preprint (with Boris Adryan) “Estimating transcription factor abundance and specificity from genome-wide binding profiles“, arXived here.

Binding of transcription factors (TFs) to the genome controls gene activity by either increasing or reducing the rate of transcription. We previously used stochastic simulations of the TF search mechanism (the facilitated diffusion mechanism, which assumes both three-dimensional diffusion and one-dimensional random walk on the DNA) to investigate the binding of TFs to the genome; see http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0073714#pone-0073714-g006 and http://nar.oxfordjournals.org/content/42/7/4196, also covered on https://haldanessieve.org/2013/04/09/our-paper-the-effects-of-transcription-factor-competition-on-gene-regulation/ and https://haldanessieve.org/2014/01/10/author-post-physical-constraints-determine-the-logic-of-bacterial-promoter-architectures/. Our results confirmed that the binding profiles of TFs are mainly affected by the binding energy between the TF and the DNA (usually represented by the position weight matrix, or PWM) and by the number of molecules. This means that the binding profiles can be approximated by the equilibrium occupancy and, thus, instead of running computationally expensive stochastic simulations, one can use the statistical thermodynamics framework to predict them.

The statistical thermodynamics framework entails the computation of the statistical weight for each possible configuration of the system (the specific combination of locations on the DNA where TF molecules are bound). It immediately becomes clear that the number of possible configurations grows rapidly with increasing DNA segment size, making it impractical to compute genome-wide profiles directly. We addressed this using several approximations within the statistical thermodynamics framework and, based on these approximations, we derived an analytical solution. This allows the computation of genome-wide binding profiles by scanning the DNA, much as in more naïve PWM-based approaches. Our model takes four input parameters: (i) the PWM scores, (ii) DNA accessibility data, (iii) the number of bound molecules and (iv) a factor that controls the specificity of the TF by rescaling the PWM scores. The first two are usually known from experimental data, while the last two are difficult to estimate from experiments and are usually obtained by fitting the model to the data.
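A toy sketch of the equilibrium-occupancy idea with those four inputs. This is a schematic, not the authors' derivation: the Boltzmann-like form of the weights and the capping of expected occupancy at 1 are simplifying assumptions, and the numbers are invented.

```python
import numpy as np

def occupancy(pwm_scores, accessibility, n_molecules, lam):
    """Schematic equilibrium occupancy at each genomic position.

    pwm_scores   : PWM (log) score of the site starting at each position
    accessibility: 0..1 DNA accessibility at each position
    n_molecules  : number of bound TF molecules
    lam          : specificity factor rescaling the PWM scores
    """
    weights = accessibility * np.exp(lam * pwm_scores)
    probs = weights / weights.sum()              # Boltzmann-like distribution
    return np.minimum(n_molecules * probs, 1.0)  # expected occupancy, capped

scores = np.array([0.0, 2.0, 5.0, 1.0, 4.0])
access = np.array([1.0, 1.0, 0.0, 1.0, 1.0])  # third site: closed chromatin
occ = occupancy(scores, access, n_molecules=2, lam=1.0)
print(occ.round(3))
```

Even in this caricature, the two fitted parameters of the paper (molecule number and specificity) visibly reshape the profile: more molecules spread occupancy onto weaker sites, while a larger lam concentrates it on the strongest ones.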

To test our model, we applied it to five ChIP-seq data sets (for Drosophila Bicoid, Caudal, Giant, Hunchback and Kruppel). Our results confirmed that, when including DNA accessibility data, the model fits the ChIP-seq profile with high accuracy (correlation coefficient > 0.65 for 4/5 TFs). Interestingly, we found that most TFs display lower abundance (in the range of 10-1000) than previously estimated (10000-100000). In addition, we also observed that while Bicoid and Caudal display high specificity (and our model predicts with good accuracy their ChIP-seq profiles), Giant, Hunchback and Kruppel display a lower specificity. Finally, we would like to emphasize that our method is applicable to any eukaryotic system for which the required data is available and can be applied genome-wide.

Our paper is accompanied by a how-to and all raw data to replicate our results: http://logic.sysbiol.cam.ac.uk/nrz/ChIPprofile/.

The evolution of genetic diversity in changing environments

Oana Carja, Uri Liberman, Marcus W. Feldman

The production and maintenance of genetic and phenotypic diversity under temporally fluctuating selection and the signatures of environmental and selective volatility in the patterns of genetic and phenotypic variation have been important areas of focus in population genetics. On one hand, stretches of constant selection pull the genetic makeup of populations towards local fitness optima. On the other, in order to cope with changes in the selection regime, populations may evolve mechanisms that create a diversity of genotypes. By tuning the rates at which variability is produced, such as the rates of recombination, mutation or migration, populations may increase their long-term adaptability. Here we use theoretical models to gain insight into how the rates of these three evolutionary forces are shaped by fluctuating selection. We compare and contrast the evolution of recombination, mutation and migration under similar patterns of environmental change and show that these three sources of phenotypic variation are surprisingly similar in their response to changing selection. We show that knowing the shape, size, variance and asymmetry of environmental runs is essential for accurate prediction of genetic evolutionary dynamics.

A novel Bayesian method for inferring and interpreting the dynamics of adaptive landscapes from phylogenetic comparative data

Josef C Uyeda, Luke J Harmon

Our understanding of macroevolutionary patterns of adaptive evolution has greatly increased with the advent of large-scale phylogenetic comparative methods. Widely used Ornstein-Uhlenbeck (OU) models can describe an adaptive process of divergence and selection. However, inference of the dynamics of adaptive landscapes from comparative data is complicated by interpretational difficulties, lack of identifiability among parameter values and the common requirement that adaptive hypotheses be assigned a priori. Here we develop a reversible-jump Bayesian method of fitting multi-optima OU models to phylogenetic comparative data that estimates the placement and magnitude of adaptive shifts directly from the data. We show how biologically informed hypotheses can be tested against this inferred posterior of shift locations using Bayes factors to establish whether our a priori models adequately describe the dynamics of adaptive peak shifts. Furthermore, we show how the inclusion of informative priors can be used to restrict models to biologically realistic parameter space and to test particular biological interpretations of evolutionary models. We argue that Bayesian model-fitting of OU models to comparative data provides a framework for integrating multiple sources of biological data, such as microevolutionary estimates of selection parameters and paleontological time series, allowing inference of adaptive landscape dynamics with explicit, process-based biological interpretations.
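For readers unfamiliar with OU models, the sketch below simulates a single lineage tracking an optimum that shifts once (a toy analogue of one adaptive peak shift). It shows only the forward process, not the reversible-jump inference machinery; the parameter values and the Euler-Maruyama discretisation are illustrative choices.

```python
import numpy as np

def simulate_ou(theta_path, alpha, sigma, x0, dt=0.01):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck process
    dX = alpha * (theta(t) - X) dt + sigma dW, where the optimum
    theta can shift over time."""
    rng = np.random.default_rng(42)
    x = np.empty(len(theta_path))
    x[0] = x0
    for t in range(1, len(theta_path)):
        drift = alpha * (theta_path[t - 1] - x[t - 1]) * dt
        x[t] = x[t - 1] + drift + sigma * np.sqrt(dt) * rng.normal()
    return x

# Optimum shifts from 0 to 3 halfway through: the trait is pulled
# toward the new adaptive peak and then fluctuates around it.
steps = 2000
theta = np.concatenate([np.zeros(steps // 2), np.full(steps // 2, 3.0)])
x = simulate_ou(theta, alpha=2.0, sigma=0.5, x0=0.0)
print(round(x[-1], 2))
```

The inference problem the paper tackles is the reverse: given trait values at the tips of a phylogeny, recover where shifts like the one above occurred and how large they were.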

Soft selective sweeps in complex demographic scenarios

Benjamin A Wilson, Dmitri Petrov, Philipp W Messer

Recent studies have shown that adaptation from de novo mutation often produces so-called soft selective sweeps, where adaptive mutations of independent mutational origin sweep through the population at the same time. Population genetic theory predicts that soft sweeps should be likely if the product of the population size and the mutation rate towards the adaptive allele is sufficiently large, such that multiple adaptive mutations can establish before one has reached fixation; however, it remains unclear how demographic processes affect the probability of observing soft sweeps. Here we extend the theory of soft selective sweeps to realistic demographic scenarios that allow for changes in population size over time. We first show that population bottlenecks can lead to the removal of all but one adaptive lineage from an initially soft selective sweep. The parameter regime under which such ‘hardening’ of soft selective sweeps is likely is determined by a simple heuristic condition. We further develop a generalized analytical framework, based on an extension of the coalescent process, for calculating the probability of soft sweeps under arbitrary demographic scenarios. Two important limits emerge within this analytical framework: In the limit where population size fluctuations are fast compared to the duration of the sweep, the likelihood of soft sweeps is determined by the harmonic mean of the variance effective population size estimated over the duration of the sweep; in the opposing slow fluctuation limit, the likelihood of soft sweeps is determined by the instantaneous variance effective population size at the onset of the sweep. We show that as a consequence of this finding the probability of observing soft sweeps becomes a function of the strength of selection. Specifically, in species with sharply fluctuating population size, strong selection is more likely to produce soft sweeps than weak selection. 
Our results highlight the importance of accurate demographic estimates over short evolutionary timescales for understanding the population genetics of adaptation from de novo mutation.
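The fast-fluctuation limit described above reduces to a simple computation: the harmonic mean of the effective population size over the sweep. A minimal sketch (the Ne trajectory is invented for illustration):

```python
import numpy as np

def harmonic_mean_ne(ne_path):
    """Harmonic mean effective population size over a sweep's duration;
    in the fast-fluctuation limit this is the quantity that governs
    the likelihood of a soft sweep."""
    ne_path = np.asarray(ne_path, dtype=float)
    return len(ne_path) / np.sum(1.0 / ne_path)

# A bottleneck during the sweep: Ne drops from 1e6 to 1e4 for a
# tenth of the sweep's duration.
ne = np.full(100, 1e6)
ne[40:50] = 1e4
print(f"{harmonic_mean_ne(ne):.3g}")
```

Note how the brief bottleneck drags the harmonic mean roughly an order of magnitude below the census-like 1e6, which is why bottlenecks can "harden" an initially soft sweep.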

Author post: VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

This guest post is by Olly Burren and Chris Wallace on their preprint, VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes, arXived here.

The idea for this paper came from reading a study by Liu et al. (http://www.sciencedirect.com/science/article/pii/S0002929710003125) and from the fact that summary p values from genome-wide association studies are increasingly becoming publicly available. In the field of human disease, genome-wide association studies have been very successful in isolating regions of the genome that confer disease susceptibility. The next step, however, is to understand mechanistically exactly how variation in these loci gives rise to this susceptibility. There is a myriad of pre-existing methods for integrating genetic and genomic datasets; however, things are complicated by the high degree of linkage disequilibrium that exists, which causes substantial inflation in the variance of any test statistic. This inter-SNP correlation must be taken into account, classically by permuting case/control status and recomputing association, which requires access to raw genotyping data. Indeed, this approach was taken in our previously published method, Heinig et al. (http://www.nature.com/nature/journal/v467/n7314/full/nature09386.html), which uses a non-parametric test to compare the distributions of GWAS p values from two sets of SNPs (“test” and “control”). As most researchers working with GWAS know, gaining access to raw genotyping data is often difficult, and it is unclear how to handle meta-analysis and imputed data. Liu et al. got around this by estimating the inter-SNP correlation from public datasets and sampling from a multivariate normal to generate simulated p values, analogous to the permuted p values obtained by permuting phenotype status when raw data are available. Their method, VEGAS, uses genotype data publicly available through the International HapMap project and aims to integrate GWAS results with trans eQTLs to identify causal disease genes.
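The multivariate-normal trick is compact enough to sketch. The example below is illustrative, not VSEAMS code: the LD matrix is made up rather than estimated from a reference panel, and null Z-scores are converted to two-sided p values with the standard-normal tail formula.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(7)

# Illustrative LD (correlation) matrix for 4 SNPs; in practice this
# would be estimated from a reference panel such as 1000 Genomes.
ld = np.array([
    [1.0, 0.8, 0.2, 0.0],
    [0.8, 1.0, 0.3, 0.0],
    [0.2, 0.3, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])

# Sample null association Z-scores from a multivariate normal with the
# LD matrix as covariance, then convert to two-sided p values. This
# mimics permutation-based null p values without raw genotypes.
n_sims = 10000
z = rng.multivariate_normal(np.zeros(4), ld, size=n_sims)
p = np.vectorize(lambda v: erfc(abs(v) / sqrt(2)))(z)

# SNPs in strong LD yield correlated null p values; unlinked SNPs don't.
print(round(np.corrcoef(p[:, 0], p[:, 1])[0, 1], 2))
```

An observed enrichment statistic can then be compared against statistics computed on these simulated null p values, which is the correlation-aware null the post goes on to exploit.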

Our thought was that by combining our previously published method with the VEGAS approach, we could create a novel method that would allow the integration of genetic information from GWAS with functional information from, for example, a set of microarray experiments, crucially without the need for genotype information. The rationale was that this would help to prioritise future mechanistic studies, which can be costly and time-consuming to conduct. We also upped the stakes and decided to use 1000 Genomes Project genotyping information for our estimations, to allow application to dense-genotyping technologies. The result was a software pipeline that takes as input a gene set of interest, a matched ‘control’ set and a summary set of GWAS statistics, and computes an enrichment score.

Note that this approach differs from the Bayesian model suggested by Pickrell (https://haldanessieve.org/2013/12/16/author-post-joint-analysis-of-functional-genomic-data-and-genome-wide-association-studies-of-18-human-traits) in that it focuses on comparing broad regions rather than on considering more targeted genomic annotation, and in that sense is perhaps more akin to pathway analysis, although we suggest that functionally defined gene sets, such as those found by knock-down experiments in cell lines, may be more productive than manually annotated pathways, whose completeness can vary considerably.

To illustrate the method we applied it to a large meta-analysis GWAS of type 1 diabetes (8,000 cases vs 8,000 controls) and an interesting dataset examining the effect on gene expression of knocking down a series of 59 transcription factors in a lymphoblastoid cell line (Cusanovich et al., http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004226). We identified three transcription factors, IKZF3, BATF and ESRRA, whose putative targets are significantly enriched for variation associated with type 1 diabetes susceptibility. IKZF3 overlaps a known type 1 diabetes susceptibility region, whereas BATF and ESRRA overlap other autoimmune susceptibility regions, validating our approach. Of course there are caveats in interpreting results derived from cell lines, but we think it promising that our top hit lies in a region already associated with type 1 diabetes susceptibility.
Once enrichment is detected, we use quantities already computed to prioritise genes within the set, generating a succinct list of the genes responsible for the enrichment detected at the global level. Cross-referenced with other information, these can either be informative in their own right or be used to inform future studies.

This study is also an example of the preprint process speeding up scientific discovery. We knew about the Cusanovich dataset because they released a preprint on arXiv, which was caught by Haldane’s Sieve (https://haldanessieve.org/2013/10/22/the-functional-consequences-of-variation-in-transcription-factor-binding/) in October 2013. One email, and the authors kindly shared their complete results. Had we waited for it to be published in PLoS Genetics in March 2014, we’d have been five months behind where we are.

The major benefit is that all of the datasets employed are within the public domain. Our hope is that either this or other methods in the same vein will help to bridge the gap between GWAS and disease mechanisms, ultimately fuelling the development of new therapeutics.