Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing

Xiaoquan Wen
(Submitted on 29 Apr 2014)

We consider the problems of hypothesis testing and model comparison under a flexible Bayesian linear regression model whose formulation is closely connected with the linear mixed effect model and the parametric models for SNP set analysis in genetic association studies. We derive a class of analytic approximate Bayes factors and illustrate their connections with a variety of frequentist test statistics, including the Wald statistic and the variance component score statistic. Taking advantage of Bayesian model averaging and hierarchical modeling, we demonstrate some distinct advantages and flexibilities in the approaches utilizing the derived Bayes factors in the context of genetic association studies. We demonstrate our proposed methods using real or simulated numerical examples in applications of single SNP association testing, multi-locus fine-mapping and SNP set association testing.
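As a rough illustration of how an approximate Bayes factor can be tied to a Wald statistic, the sketch below uses the well-known single-SNP approximation due to Wakefield, not the class of Bayes factors derived in this paper. It assumes the effect estimate is normally distributed around the true effect with known standard error, and that the effect has a zero-mean normal prior under the alternative. The function name `wakefield_abf` and the parameter `prior_sd` are hypothetical names chosen for this example.

```python
import math


def wakefield_abf(beta_hat, se, prior_sd):
    """Approximate Bayes factor for H1 (effect != 0) against H0 (effect == 0).

    Assumptions (illustrative, not taken from the paper):
      beta_hat | beta ~ N(beta, se^2)
      beta | H1       ~ N(0, prior_sd^2)
    """
    v = se ** 2                  # sampling variance of the estimate
    w = prior_sd ** 2            # prior variance of the effect under H1
    z2 = (beta_hat / se) ** 2    # squared Wald statistic
    shrink = w / (v + w)         # weight the prior variance receives
    # Ratio of the marginal densities N(0, v + w) and N(0, v) at beta_hat.
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)


if __name__ == "__main__":
    # A null-looking estimate is penalised; a strong one is favoured.
    print(wakefield_abf(beta_hat=0.0, se=0.1, prior_sd=0.2))
    print(wakefield_abf(beta_hat=0.5, se=0.1, prior_sd=0.2))
```

The Bayes factor depends on the data only through the squared Wald statistic, which is the kind of link between Bayesian and frequentist quantities the abstract refers to.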

Characterizing a collective and dynamic component of chromatin immunoprecipitation enrichment profiles in yeast

Lucas D. Ward, Junbai Wang, Harmen J. Bussemaker

Recent chromatin immunoprecipitation (ChIP) experiments in fly, mouse, and human have revealed the existence of high-occupancy target (HOT) regions or “hotspots” that show enrichment across many assayed DNA-binding proteins. Similar co-enrichment observed in yeast so far has been treated as artifactual, and has not been fully characterized. Here we reanalyze ChIP data from both array-based and sequencing-based experiments to show that in the yeast S. cerevisiae, the collective enrichment phenomenon is strongly associated with proximity to noncoding RNA genes and with nucleosome depletion. DNA sequence motifs that confer binding affinity for the proteins are largely absent from these hotspots, suggesting that protein-protein interactions play a prominent role. The hotspots are condition-specific, suggesting that they reflect a chromatin state or protein state, and are not a static feature of underlying sequence. Additionally, only a subset of all assayed factors is associated with these loci, suggesting that the co-enrichment cannot be simply explained by a chromatin state that is universally more prone to immunoprecipitation. Together our results suggest that the co-enrichment patterns observed in yeast represent transcription factor co-occupancy. More generally, they make clear that great caution must be used when interpreting ChIP enrichment profiles for individual factors in isolation, as they will include factor-specific as well as collective contributions.

Graph-based data integration predicts long-range regulatory interactions across the human genome

Sofie Demeyer, Tom Michoel
(Submitted on 29 Apr 2014)

Transcriptional regulation of gene expression is one of the main processes that affect cell diversification from a single set of genes. Regulatory proteins often interact with DNA regions located distally from the transcription start sites (TSS) of the genes. We developed a computational method that combines open chromatin and gene expression information for a large number of cell types to identify these distal regulatory elements. Our method builds correlation graphs for publicly available DNase-seq and exon array datasets with matching samples and uses graph-based methods to filter findings supported by multiple datasets and remove indirect interactions. The resulting set of interactions was validated with both anecdotal information of known long-range interactions and unbiased experimental data deduced from Hi-C and CAGE experiments. Our results provide a novel set of high-confidence candidate open chromatin regions involved in gene regulation, often located several Mb away from the TSS of their target gene.
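The general idea of a correlation graph filtered by support across datasets can be sketched as follows. This is a toy illustration, not the authors' pipeline: the threshold, the use of Pearson correlation, the dictionary-based data layout, and the function names `correlation_edges` and `supported_edges` are all assumptions made for the example, and the step that removes indirect interactions is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(openness, expression, threshold=0.9):
    """Link each open-chromatin region to each gene whose expression
    correlates with the region's openness across matched samples.

    openness:   dict mapping region id -> per-sample signal
    expression: dict mapping gene id   -> per-sample expression
    """
    edges = []
    for region, acc in openness.items():
        for gene, expr in expression.items():
            r = pearson(acc, expr)
            if r >= threshold:
                edges.append((region, gene, r))
    return edges


def supported_edges(edge_lists, min_support=2):
    """Keep (region, gene) pairs found in at least min_support datasets."""
    counts = {}
    for edges in edge_lists:
        for region, gene, _ in edges:
            counts[(region, gene)] = counts.get((region, gene), 0) + 1
    return {pair for pair, c in counts.items() if c >= min_support}


if __name__ == "__main__":
    openness = {"r1": [1, 2, 3, 4], "r2": [4, 1, 3, 2]}
    expression = {"gA": [2, 4, 6, 8]}
    edges = correlation_edges(openness, expression)
    print(edges)
    print(supported_edges([edges, edges]))
```

Requiring support from several datasets is a simple way to discard correlations that arise by chance in a single experiment.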

Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments

Greg W. Clark, Sharon H. Ackerman, Elisabeth R. Tillier, Domenico L. Gatti
(Submitted on 26 Apr 2014)

Several methods are available for the detection of covarying positions from a multiple sequence alignment (MSA). If the MSA contains a large number of sequences, information about the proximities between residues derived from covariation maps can be sufficient to predict a protein fold. If the structure is already known, information on the covarying positions can be valuable to understand the protein mechanism.
In this study we have sought to determine whether a multivariate extension of traditional mutual information (MI) can be an additional tool to study covariation. The performance of two multidimensional MI (mdMI) methods, designed to remove the effect of ternary/quaternary interdependencies, was tested with a set of 9 MSAs each containing <400 sequences, and was shown to be comparable to that of methods based on maximum entropy/pseudolikelihood statistical models of protein sequences. However, while all the methods tested detected a similar number of covarying pairs among the residues separated by < 8 Å in the reference X-ray structures, there was on average less than 65% overlap between the top scoring pairs detected by methods that are based on different principles.
We have also attempted to identify whether the difference in performance among methods is due to different efficiency in removing covariation originating from chains of structural contacts. We found that the reason why methods that derive partial correlation between the columns of a MSA provide a better recognition of close contacts is not because they remove chaining effects, but because they filter out the correlation between distant residues that originates from general fitness constraints. In contrast, we found that true chaining effects are an expression of real physical perturbations that propagate inside proteins, and therefore are not removed by the derivation of partial correlation between variables.
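For readers unfamiliar with the baseline quantity, the traditional pairwise MI between two alignment columns can be computed as in the minimal sketch below. It covers only the standard two-column measure, not the multidimensional mdMI extension studied in the paper, and it applies none of the sequence weighting or background corrections commonly used in practice. The function names `column_mi` and `mi_matrix` are hypothetical.

```python
import math
from collections import Counter


def column_mi(col_a, col_b):
    """Mutual information (in bits) between two MSA columns.

    col_a and col_b hold one residue per sequence, in the same order.
    """
    n = len(col_a)
    pa = Counter(col_a)               # residue counts in column a
    pb = Counter(col_b)               # residue counts in column b
    pab = Counter(zip(col_a, col_b))  # joint counts of residue pairs
    mi = 0.0
    for (a, b), c in pab.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi


def mi_matrix(sequences):
    """Pairwise MI for every pair of columns in an aligned set of
    equal-length sequences; returns a dict keyed by (i, j) with i < j."""
    columns = list(zip(*sequences))
    result = {}
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            result[(i, j)] = column_mi(columns[i], columns[j])
    return result


if __name__ == "__main__":
    msa = ["AG", "AG", "CT", "CT"]   # two perfectly covarying columns
    print(mi_matrix(msa))
```

Pairwise MI of this kind cannot distinguish direct covariation from covariation mediated by a third position, which is the limitation that multivariate and partial-correlation methods aim to address.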

Bridging scales in cancer progression: Mapping genotype to phenotype using neural networks

Philip Gerlee, Eunjung Kim, Alexander R.A. Anderson
(Submitted on 28 Apr 2014)

In this review we summarize our recent efforts in trying to understand the role of heterogeneity in cancer progression by using neural networks to characterise different aspects of the mapping from a cancer cell's genotype and environment to its phenotype. Our central premise is that cancer is an evolving system subject to mutation and selection, and the primary conduit for these processes to occur is the cancer cell whose behaviour is regulated on multiple biological scales. The selection pressure is mainly driven by the microenvironment that the tumour is growing in and this acts directly upon the cell phenotype. In turn, the phenotype is driven by the intracellular pathways that are regulated by the genotype. Integrating all of these processes is a massive undertaking and requires bridging many biological scales (i.e. genotype, pathway, phenotype and environment) that we will only scratch the surface of in this review. We will focus on models that use neural networks as a means of connecting these different biological scales, since they allow us to easily create heterogeneity for selection to act upon and, importantly, this heterogeneity can be implemented at different biological scales. More specifically, we consider three different neural networks that bridge different aspects of these scales and the dialogue with the micro-environment: (i) the impact of the micro-environment on evolutionary dynamics, (ii) the mapping from genotype to phenotype under drug-induced perturbations and (iii) pathway activity in both normal and cancer cells under different micro-environmental conditions.

Crowdsourced analysis of ash and ash dieback through the Open Ash Dieback project: A year 1 report on datasets and analyses contributed by a self-organising community.

Diane Saunders, Kentaro Yoshida, Christine Sambles, Rachel Glover, Bernardo Clavijo, Manuel Corpas, Daniel Bunting, Suomeng Dong, Matthew Clark, David Swarbreck, Sarah Ayling, Matthew Bashton, Steve Collin, Tsuyoshi Hosoya, Anne Edwards, Lisa Crossman, Graham Etherington, Joe Win, Liliana Cano, David Studholme, J Allan Downie, Mario Caccamo, Sophien Kamoun, Dan MacLean

Ash dieback is a fungal disease of ash trees caused by Hymenoscyphus pseudoalbidus that has swept across Europe in the last two decades and is a significant threat to the ash population. This emergent pathogen has been relatively poorly studied and little is known about its genetic make-up. In response to the arrival of this dangerous pathogen in the UK we took the unusual step of providing an open access database and initial sequence datasets to the scientific community for analysis prior to performing an analysis of our own. Our goal was to crowdsource genomic and other analyses and create a community analysing this pathogen. In this report on the evolution of the community and data and analysis obtained in the first year of this activity, we describe the nature and the volume of the contributions and reveal some preliminary insights into the genome and biology of H. pseudoalbidus that emerged. In particular our nascent community generated a first-pass genome assembly containing abundant collapsed AT-rich repeats indicating a typically complex genome structure. Our open science and crowdsourcing effort has brought a wealth of new knowledge about this emergent pathogen within a short time-frame. Our community endeavour highlights the positive impact that open, collaborative approaches can have on fast, responsive modern science.

Predicting evolutionary site variability from structure in viral proteins: buriedness, flexibility, and design

Amir Shahmoradi, Dariya K. Sydykova, Stephanie J. Spielman, Eleisha L. Jackson, Eric T. Dawson, Austin G. Meyer, Claus O. Wilke

Several recent works have shown that protein structure can predict site-specific evolutionary sequence variation. In particular, sites that are buried and/or have many contacts with other sites in a structure have been shown to evolve more slowly, on average, than surface sites with few contacts. Here, we present a comprehensive study of the extent to which numerous structural properties can predict sequence variation. The structural properties we considered include buriedness (relative solvent accessibility and contact number), structural flexibility (B factors, root-mean-square fluctuations, and variation in dihedral angles), and variability in designed structures. We obtained structural flexibility measures both from molecular dynamics simulations performed on 9 non-homologous viral protein structures and from variation in homologous variants of those proteins, where available. We obtained measures of variability in designed structures from flexible-backbone design in the Rosetta software. We found that most of the structural properties correlate with site variation in the majority of structures, though the correlations are generally weak (correlation coefficients of 0.1 to 0.4). Moreover, we found that measures of buriedness were better predictors of evolutionary variation than were measures of structural flexibility. Finally, variability in designed structures was a weaker predictor of evolutionary variability than was buriedness, but was comparable in its predictive power to the best structural flexibility measures. We conclude that simple measures of buriedness are better predictors of evolutionary variation than are more complicated predictors obtained from dynamic simulations, ensembles of homologous structures, or computational protein design.