Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing

Xiaoquan Wen
(Submitted on 29 Apr 2014)

We consider the problems of hypothesis testing and model comparison under a flexible Bayesian linear regression model whose formulation is closely connected with the linear mixed effect model and the parametric models for SNP set analysis in genetic association studies. We derive a class of analytic approximate Bayes factors and illustrate their connections with a variety of frequentist test statistics, including the Wald statistic and the variance component score statistic. Taking advantage of Bayesian model averaging and hierarchical modeling, we demonstrate some distinct advantages and flexibilities in the approaches utilizing the derived Bayes factors in the context of genetic association studies. We demonstrate our proposed methods using real or simulated numerical examples in applications of single SNP association testing, multi-locus fine-mapping and SNP set association testing.
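To make the link between a Bayes factor and the Wald statistic concrete, here is a minimal sketch of one well-known approximate Bayes factor for a single SNP (the form usually attributed to Wakefield). It is offered only as an illustration and is not necessarily the class of Bayes factors derived in the preprint. The function name, the argument names, and the normal prior on the effect size with standard deviation `prior_sd` are all assumptions made for this example.

```python
import math

def approx_bayes_factor(beta_hat, se, prior_sd):
    """Approximate Bayes factor (alternative vs. null) for one effect estimate.

    Assumes beta_hat ~ N(beta, se^2) and, under the alternative,
    beta ~ N(0, prior_sd^2). These modelling choices are illustrative
    assumptions, not taken from the preprint.
    """
    v = se ** 2                    # sampling variance of the estimate
    w = prior_sd ** 2              # prior variance of the effect
    wald = (beta_hat / se) ** 2    # squared Wald statistic
    shrink = w / (v + w)           # weight given to the data under the prior
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * wald * shrink)

# Hypothetical numbers: a strong signal and no signal at all.
bf_signal = approx_bayes_factor(beta_hat=0.3, se=0.1, prior_sd=0.2)
bf_null = approx_bayes_factor(beta_hat=0.0, se=0.1, prior_sd=0.2)
```

In this form the Bayes factor depends on the data only through the squared Wald statistic, which is one simple way to see how a Bayesian quantity and a frequentist test statistic can be connected.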


Characterizing a collective and dynamic component of chromatin immunoprecipitation enrichment profiles in yeast

Lucas D. Ward, Junbai Wang, Harmen J. Bussemaker

Recent chromatin immunoprecipitation (ChIP) experiments in fly, mouse, and human have revealed the existence of high-occupancy target (HOT) regions or “hotspots” that show enrichment across many assayed DNA-binding proteins. Similar co-enrichment observed in yeast so far has been treated as artifactual, and has not been fully characterized. Here we reanalyze ChIP data from both array-based and sequencing-based experiments to show that in the yeast S. cerevisiae, the collective enrichment phenomenon is strongly associated with proximity to noncoding RNA genes and with nucleosome depletion. DNA sequence motifs that confer binding affinity for the proteins are largely absent from these hotspots, suggesting that protein-protein interactions play a prominent role. The hotspots are condition-specific, suggesting that they reflect a chromatin state or protein state, and are not a static feature of underlying sequence. Additionally, only a subset of all assayed factors is associated with these loci, suggesting that the co-enrichment cannot be simply explained by a chromatin state that is universally more prone to immunoprecipitation. Together our results suggest that the co-enrichment patterns observed in yeast represent transcription factor co-occupancy. More generally, they make clear that great caution must be used when interpreting ChIP enrichment profiles for individual factors in isolation, as they will include factor-specific as well as collective contributions.

Graph-based data integration predicts long-range regulatory interactions across the human genome

Sofie Demeyer, Tom Michoel
(Submitted on 29 Apr 2014)

Transcriptional regulation of gene expression is one of the main processes that affect cell diversification from a single set of genes. Regulatory proteins often interact with DNA regions located distally from the transcription start sites (TSS) of the genes. We developed a computational method that combines open chromatin and gene expression information for a large number of cell types to identify these distal regulatory elements. Our method builds correlation graphs for publicly available DNase-seq and exon array datasets with matching samples and uses graph-based methods to filter findings supported by multiple datasets and remove indirect interactions. The resulting set of interactions was validated with both anecdotal information of known long-range interactions and unbiased experimental data deduced from Hi-C and CAGE experiments. Our results provide a novel set of high-confidence candidate open chromatin regions involved in gene regulation, often located several Mb away from the TSS of their target gene.

Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments

Greg W. Clark, Sharon H. Ackerman, Elisabeth R. Tillier, Domenico L. Gatti
(Submitted on 26 Apr 2014)

Several methods are available for the detection of covarying positions from a multiple sequence alignment (MSA). If the MSA contains a large number of sequences, information about the proximities between residues derived from covariation maps can be sufficient to predict a protein fold. If the structure is already known, information on the covarying positions can be valuable to understand the protein mechanism.
In this study we have sought to determine whether a multivariate extension of traditional mutual information (MI) can be an additional tool to study covariation. The performance of two multidimensional MI (mdMI) methods, designed to remove the effect of ternary/quaternary interdependencies, was tested with a set of 9 MSAs each containing <400 sequences, and was shown to be comparable to that of methods based on maximum entropy/pseudolikelihood statistical models of protein sequences. However, while all the methods tested detected a similar number of covarying pairs among the residues separated by < 8 Å in the reference X-ray structures, there was on average less than 65% overlap between the top scoring pairs detected by methods that are based on different principles.
We have also attempted to identify whether the difference in performance among methods is due to different efficiency in removing covariation originating from chains of structural contacts. We found that the reason why methods that derive partial correlation between the columns of a MSA provide a better recognition of close contacts is not because they remove chaining effects, but because they filter out the correlation between distant residues that originates from general fitness constraints. In contrast we found that true chaining effects are expression of real physical perturbations that propagate inside proteins, and therefore are not removed by the derivation of partial correlation between variables.
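As a point of reference for the traditional (pairwise) MI that the multidimensional methods extend, the following minimal sketch computes MI between two alignment columns from observed frequencies. It does not implement the mdMI methods of the preprint. The toy alignment and all function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed in a column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(col_a, col_b):
    """Pairwise MI: H(A) + H(B) - H(A, B), using observed frequencies."""
    joint = list(zip(col_a, col_b))
    return column_entropy(col_a) + column_entropy(col_b) - column_entropy(joint)

def msa_column(msa, index):
    """Extract one column from an alignment given as equal-length strings."""
    return [seq[index] for seq in msa]

# Hypothetical toy alignment: columns 0 and 1 covary perfectly,
# while column 2 varies independently of column 0.
toy_msa = ["ACDA", "ACEA", "GTDA", "GTEA"]
mi_01 = mutual_information(msa_column(toy_msa, 0), msa_column(toy_msa, 1))
mi_02 = mutual_information(msa_column(toy_msa, 0), msa_column(toy_msa, 2))
```

Plain pairwise MI of this kind cannot tell direct covariation from covariation mediated by a third position, which is the kind of interdependency the multidimensional variants are designed to address.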

Bridging scales in cancer progression: Mapping genotype to phenotype using neural networks

Philip Gerlee, Eunjung Kim, Alexander R.A. Anderson
(Submitted on 28 Apr 2014)

In this review we summarize our recent efforts in trying to understand the role of heterogeneity in cancer progression by using neural networks to characterise different aspects of the mapping from a cancer cell's genotype and environment to its phenotype. Our central premise is that cancer is an evolving system subject to mutation and selection, and the primary conduit for these processes to occur is the cancer cell whose behaviour is regulated on multiple biological scales. The selection pressure is mainly driven by the microenvironment that the tumour is growing in and this acts directly upon the cell phenotype. In turn, the phenotype is driven by the intracellular pathways that are regulated by the genotype. Integrating all of these processes is a massive undertaking and requires bridging many biological scales (i.e. genotype, pathway, phenotype and environment) that we will only scratch the surface of in this review. We will focus on models that use neural networks as a means of connecting these different biological scales, since they allow us to easily create heterogeneity for selection to act upon and importantly this heterogeneity can be implemented at different biological scales. More specifically, we consider three different neural networks that bridge different aspects of these scales and the dialogue with the microenvironment: (i) the impact of the microenvironment on evolutionary dynamics, (ii) the mapping from genotype to phenotype under drug-induced perturbations and (iii) pathway activity in both normal and cancer cells under different microenvironmental conditions.

Crowdsourced analysis of ash and ash dieback through the Open Ash Dieback project: A year 1 report on datasets and analyses contributed by a self-organising community.

Diane Saunders, Kentaro Yoshida, Christine Sambles, Rachel Glover, Bernardo Clavijo, Manuel Corpas, Daniel Bunting, Suomeng Dong, Matthew Clark, David Swarbreck, Sarah Ayling, Matthew Bashton, Steve Collin, Tsuyoshi Hosoya, Anne Edwards, Lisa Crossman, Graham Etherington, Joe Win, Liliana Cano, David Studholme, J Allan Downie, Mario Caccamo, Sophien Kamoun, Dan MacLean

Ash dieback is a fungal disease of ash trees caused by Hymenoscyphus pseudoalbidus that has swept across Europe in the last two decades and is a significant threat to the ash population. This emergent pathogen has been relatively poorly studied and little is known about its genetic make-up. In response to the arrival of this dangerous pathogen in the UK we took the unusual step of providing an open access database and initial sequence datasets to the scientific community for analysis prior to performing an analysis of our own. Our goal was to crowdsource genomic and other analyses and create a community analysing this pathogen. In this report on the evolution of the community and data and analysis obtained in the first year of this activity, we describe the nature and the volume of the contributions and reveal some preliminary insights into the genome and biology of H. pseudoalbidus that emerged. In particular our nascent community generated a first-pass genome assembly containing abundant collapsed AT-rich repeats indicating a typically complex genome structure. Our open science and crowdsourcing effort has brought a wealth of new knowledge about this emergent pathogen within a short time-frame. Our community endeavour highlights the positive impact that open, collaborative approaches can have on fast, responsive modern science.

Predicting evolutionary site variability from structure in viral proteins: buriedness, flexibility, and design

Amir Shahmoradi, Dariya K. Sydykova, Stephanie J. Spielman, Eleisha L. Jackson, Eric T. Dawson, Austin G. Meyer, Claus O. Wilke

Several recent works have shown that protein structure can predict site-specific evolutionary sequence variation. In particular, sites that are buried and/or have many contacts with other sites in a structure have been shown to evolve more slowly, on average, than surface sites with few contacts. Here, we present a comprehensive study of the extent to which numerous structural properties can predict sequence variation. The structural properties we considered include buriedness (relative solvent accessibility and contact number), structural flexibility (B factors, root-mean-square fluctuations, and variation in dihedral angles), and variability in designed structures. We obtained structural flexibility measures both from molecular dynamics simulations performed on 9 non-homologous viral protein structures and from variation in homologous variants of those proteins, where available. We obtained measures of variability in designed structures from flexible-backbone design in the Rosetta software. We found that most of the structural properties correlate with site variation in the majority of structures, though the correlations are generally weak (correlation coefficients of 0.1 to 0.4). Moreover, we found that measures of buriedness were better predictors of evolutionary variation than were measures of structural flexibility. Finally, variability in designed structures was a weaker predictor of evolutionary variability than was buriedness, but was comparable in its predictive power to the best structural flexibility measures. We conclude that simple measures of buriedness are better predictors of evolutionary variation than are more complicated predictors obtained from dynamic simulations, ensembles of homologous structures, or computational protein design.

Author post: Estimating transcription factor abundance and specificity from genome-wide binding profiles

This guest post is by Radu Zabet on his preprint (with Boris Adryan) “Estimating transcription factor abundance and specificity from genome-wide binding profiles”, arXived here.

Binding of transcription factors (TFs) to the genome controls gene activity by either increasing or reducing the rate of transcription. We previously used stochastic simulations of the TF search mechanism (the facilitated diffusion mechanism, which assumes both three-dimensional diffusion and a one-dimensional random walk on the DNA) to investigate the binding of TFs to the genome. Our results confirmed that the binding profiles of TFs are mainly affected by the binding energy between the TF and DNA (usually represented by the Position Weight Matrix, PWM) and by the number of molecules. This means that the binding profiles can be approximated by the equilibrium occupancy and, thus, instead of running computationally expensive stochastic simulations, one can use the statistical thermodynamics framework to predict these binding profiles.

The statistical thermodynamics framework entails the computation of the statistical weight for each possible configuration of the system (the specific combination of locations on the DNA where TF molecules are bound). It immediately becomes clear that the number of possible configurations grows with increasing DNA segment size, making it impossible to compute genome-wide profiles. We addressed this using several approximations within the statistical thermodynamics framework and, based on these approximations, we derived an analytical solution. This allows the computation of genome-wide binding profiles by scanning the DNA, in a manner quite similar to more naïve PWM-based approaches. Our model takes four parameters as inputs: (i) the PWM scores, (ii) DNA accessibility data, (iii) the number of bound molecules and (iv) a factor that controls the specificity of the TF by rescaling the PWM scores. The first two are usually known from experimental data, while the last two are difficult to estimate from experiments and are usually computed by fitting the model to the data.
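To illustrate how the four inputs could combine when scanning the DNA, here is a minimal sketch of a simplified, independent-site equilibrium occupancy. It is an assumption made for illustration and not the analytical solution derived in the preprint. The function name, the variable names, and the saturating form `N*s / (N*s + Z)` are all hypothetical.

```python
import math

def site_occupancy(pwm_scores, accessibility, n_molecules, lam):
    """Illustrative per-site occupancy from the four inputs named in the post.

    pwm_scores    : PWM score at each position (higher = stronger binding)
    accessibility : DNA accessibility at each position, between 0 and 1
    n_molecules   : number of bound TF molecules
    lam           : factor rescaling the PWM scores (controls specificity)

    Assumption: each site has Boltzmann-like weight a_j * exp(w_j / lam)
    and sites are treated independently. At least one site must be
    accessible, otherwise the normalising sum is zero.
    """
    weights = [a * math.exp(w / lam) for w, a in zip(pwm_scores, accessibility)]
    z = sum(weights)  # normalising sum over all scanned positions
    return [n_molecules * s / (n_molecules * s + z) for s in weights]

# Hypothetical toy locus: two strong sites, one of them inaccessible.
scores = [2.0, 8.0, 2.0, 8.0]
access = [1.0, 1.0, 1.0, 0.0]
occ = site_occupancy(scores, access, n_molecules=10, lam=1.0)
occ_flat = site_occupancy(scores, access, n_molecules=10, lam=100.0)
```

In this sketch an inaccessible site gets zero occupancy regardless of its PWM score, and a larger rescaling factor flattens the difference between strong and weak sites, which corresponds to lower specificity.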

To test our model, we applied it to five ChIP-seq data sets (for Drosophila Bicoid, Caudal, Giant, Hunchback and Kruppel). Our results confirmed that, when including DNA accessibility data, the model fits the ChIP-seq profile with high accuracy (correlation coefficient > 0.65 for 4/5 TFs). Interestingly, we found that most TFs display lower abundance (in the range of 10-1000) than previously estimated (10000-100000). In addition, we also observed that while Bicoid and Caudal display high specificity (and our model predicts with good accuracy their ChIP-seq profiles), Giant, Hunchback and Kruppel display a lower specificity. Finally, we would like to emphasize that our method is applicable to any eukaryotic system for which the required data is available and can be applied genome-wide.

Our paper is accompanied by a how-to and all raw data to replicate our results:

The evolution of genetic diversity in changing environments

Oana Carja, Uri Liberman, Marcus W. Feldman

The production and maintenance of genetic and phenotypic diversity under temporally fluctuating selection and the signatures of environmental and selective volatility in the patterns of genetic and phenotypic variation have been important areas of focus in population genetics. On one hand, stretches of constant selection pull the genetic makeup of populations towards local fitness optima. On the other, in order to cope with changes in the selection regime, populations may evolve mechanisms that create a diversity of genotypes. By tuning the rates at which variability is produced, such as the rates of recombination, mutation or migration, populations may increase their long-term adaptability. Here we use theoretical models to gain insight into how the rates of these three evolutionary forces are shaped by fluctuating selection. We compare and contrast the evolution of recombination, mutation and migration under similar patterns of environmental change and show that these three sources of phenotypic variation are surprisingly similar in their response to changing selection. We show that knowing the shape, size, variance and asymmetry of environmental runs is essential for accurate prediction of genetic evolutionary dynamics.

A novel Bayesian method for inferring and interpreting the dynamics of adaptive landscapes from phylogenetic comparative data

Josef C Uyeda, Luke J Harmon

Our understanding of macroevolutionary patterns of adaptive evolution has greatly increased with the advent of large-scale phylogenetic comparative methods. Widely used Ornstein-Uhlenbeck (OU) models can describe an adaptive process of divergence and selection. However, inference of the dynamics of adaptive landscapes from comparative data is complicated by interpretational difficulties, lack of identifiability among parameter values and the common requirement that adaptive hypotheses must be assigned a priori. Here we develop a reversible-jump Bayesian method of fitting multi-optima OU models to phylogenetic comparative data that estimates the placement and magnitude of adaptive shifts directly from the data. We show how biologically informed hypotheses can be tested against this inferred posterior of shift locations using Bayes Factors to establish whether our a priori models adequately describe the dynamics of adaptive peak shifts. Furthermore, we show how the inclusion of informative priors can be used to restrict models to biologically realistic parameter space and test particular biological interpretations of evolutionary models. We argue that Bayesian model-fitting of OU models to comparative data provides a framework for integrating multiple sources of biological data, such as microevolutionary estimates of selection parameters and paleontological timeseries, allowing inference of adaptive landscape dynamics with explicit, process-based biological interpretations.