Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments

Greg W. Clark, Sharon H. Ackerman, Elisabeth R. Tillier, Domenico L. Gatti
(Submitted on 26 Apr 2014)

Several methods are available for the detection of covarying positions from a multiple sequence alignment (MSA). If the MSA contains a large number of sequences, information about the proximities between residues derived from covariation maps can be sufficient to predict a protein fold. If the structure is already known, information on the covarying positions can be valuable to understand the protein mechanism.
In this study we have sought to determine whether a multivariate extension of traditional mutual information (MI) can be an additional tool to study covariation. The performance of two multidimensional MI (mdMI) methods, designed to remove the effect of ternary/quaternary interdependencies, was tested with a set of 9 MSAs each containing <400 sequences, and was shown to be comparable to that of methods based on maximum entropy/pseudolikelihood statistical models of protein sequences. However, while all the methods tested detected a similar number of covarying pairs among the residues separated by < 8 Å in the reference X-ray structures, there was on average less than 65% overlap between the top scoring pairs detected by methods that are based on different principles.
We have also attempted to identify whether the difference in performance among methods is due to different efficiency in removing covariation originating from chains of structural contacts. We found that the reason why methods that derive partial correlation between the columns of a MSA provide a better recognition of close contacts is not because they remove chaining effects, but because they filter out the correlation between distant residues that originates from general fitness constraints. In contrast, we found that true chaining effects are an expression of real physical perturbations that propagate inside proteins, and therefore are not removed by the derivation of partial correlation between variables.
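
To make the starting point of the abstract above concrete, here is a minimal sketch of traditional pairwise MI between two alignment columns. It is not the authors' mdMI method: it applies no correction for ternary/quaternary interdependencies, no gap handling, and no sequence weighting. The function name `column_mutual_information` and the toy alignment are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Plain mutual information (in bits) between columns i and j.

    `msa` is a list of equal-length aligned sequences (an assumption of
    this sketch). Frequencies are raw counts with no pseudocounts.
    """
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (hypothetical): columns 0 and 1 covary perfectly,
# while column 2 varies independently of column 0.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]
mi_covarying = column_mutual_information(toy_msa, 0, 1)
mi_independent = column_mutual_information(toy_msa, 0, 2)
```

In the toy alignment the covarying pair scores higher than the independent pair, which is the signal that covariation maps rank. Methods such as those compared in the abstract go further by trying to separate direct from indirect dependencies.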

Crowdsourced analysis of ash and ash dieback through the Open Ash Dieback project: A year 1 report on datasets and analyses contributed by a self-organising community.

Diane Saunders, Kentaro Yoshida, Christine Sambles, Rachel Glover, Bernardo Clavijo, Manuel Corpas, Daniel Bunting, Suomeng Dong, Matthew Clark, David Swarbreck, Sarah Ayling, Matthew Bashton, Steve Collin, Tsuyoshi Hosoya, Anne Edwards, Lisa Crossman, Graham Etherington, Joe Win, Liliana Cano, David Studholme, J Allan Downie, Mario Caccamo, Sophien Kamoun, Dan MacLean

Ash dieback is a fungal disease of ash trees caused by Hymenoscyphus pseudoalbidus that has swept across Europe in the last two decades and is a significant threat to the ash population. This emergent pathogen has been relatively poorly studied and little is known about its genetic make-up. In response to the arrival of this dangerous pathogen in the UK we took the unusual step of providing an open access database and initial sequence datasets to the scientific community for analysis prior to performing an analysis of our own. Our goal was to crowdsource genomic and other analyses and create a community analysing this pathogen. In this report on the evolution of the community and data and analysis obtained in the first year of this activity, we describe the nature and the volume of the contributions and reveal some preliminary insights into the genome and biology of H. pseudoalbidus that emerged. In particular our nascent community generated a first-pass genome assembly containing abundant collapsed AT-rich repeats indicating a typically complex genome structure. Our open science and crowdsourcing effort has brought a wealth of new knowledge about this emergent pathogen within a short time-frame. Our community endeavour highlights the positive impact that open, collaborative approaches can have on fast, responsive modern science.

Predicting evolutionary site variability from structure in viral proteins: buriedness, flexibility, and design

Amir Shahmoradi, Dariya K. Sydykova, Stephanie J. Spielman, Eleisha L. Jackson, Eric T. Dawson, Austin G. Meyer, Claus O. Wilke

Several recent works have shown that protein structure can predict site-specific evolutionary sequence variation. In particular, sites that are buried and/or have many contacts with other sites in a structure have been shown to evolve more slowly, on average, than surface sites with few contacts. Here, we present a comprehensive study of the extent to which numerous structural properties can predict sequence variation. The structural properties we considered include buriedness (relative solvent accessibility and contact number), structural flexibility (B factors, root-mean-square fluctuations, and variation in dihedral angles), and variability in designed structures. We obtained structural flexibility measures both from molecular dynamics simulations performed on 9 non-homologous viral protein structures and from variation in homologous variants of those proteins, where available. We obtained measures of variability in designed structures from flexible-backbone design in the Rosetta software. We found that most of the structural properties correlate with site variation in the majority of structures, though the correlations are generally weak (correlation coefficients of 0.1 to 0.4). Moreover, we found that measures of buriedness were better predictors of evolutionary variation than were measures of structural flexibility. Finally, variability in designed structures was a weaker predictor of evolutionary variability than was buriedness, but was comparable in its predictive power to the best structural flexibility measures. We conclude that simple measures of buriedness are better predictors of evolutionary variation than are more complicated predictors obtained from dynamic simulations, ensembles of homologous structures, or computational protein design.

Mapping to a Reference Genome Structure

Benedict Paten, Adam Novak, David Haussler
Comments: 25 pages
Subjects: Genomics (q-bio.GN)

To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.

VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

Oliver S Burren, Hui Guo, Chris Wallace
(Submitted on 17 Apr 2014)

Motivation: Genome-wide association studies (GWAS) have identified many loci implicated in disease susceptibility. Integration of GWAS summary statistics (p values) and functional genomic datasets should help to elucidate mechanisms. Results: We describe the extension of a previously described non-parametric method to test whether GWAS signals are enriched in functionally defined loci to a situation where only GWAS p values are available. The approach is implemented in VSEAMS, a freely available software pipeline. We use VSEAMS to integrate functional gene sets defined via transcription factor knock down experiments with GWAS results for type 1 diabetes and find variant set enrichment in gene sets associated with IKZF3, BATF and ESRRA. IKZF3 lies in a known T1D susceptibility region, whilst BATF and ESRRA overlap other immune disease susceptibility regions, validating our approach and suggesting novel avenues of research for type 1 diabetes. Availability and implementation: VSEAMS is available for download this http URL

READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad Ulrich Förstner, Jörg Vogel, Cynthia Mira Sharma

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

Bayesian Neural Networks for Genetic Association Studies of Complex Disease

Andrew L. Beam, Alison Motsinger-Reif, Jon Doyle
(Submitted on 15 Apr 2014)

Discovering causal genetic variants from large genetic association studies poses many difficult challenges. Assessing which genetic markers are involved in determining trait status is a computationally demanding task, especially in the presence of gene-gene interactions. A non-parametric Bayesian approach in the form of a Bayesian neural network is proposed for use in analyzing genetic association studies. Demonstrations on synthetic and real data reveal that Bayesian neural networks are able to efficiently and accurately determine which variants are involved in determining case-control status. Using graphics processing units (GPUs), the time needed to build these models is decreased by several orders of magnitude. In comparison with commonly used approaches for detecting interactions, Bayesian neural networks perform very well across a broad spectrum of possible genetic relationships. The proposed framework is shown to be powerful at detecting causal SNPs while having the computational efficiency needed to handle large datasets.

Modeling DNA methylation dynamics with approaches from phylogenetics

John A. Capra, Dennis Kostka
(Submitted on 11 Apr 2014)

Methylation of CpG dinucleotides is a prevalent epigenetic modification that is required for proper development in vertebrates, and changes in CpG methylation are essential to cellular differentiation. Genome-wide DNA methylation assays have become increasingly common, and recently distinct stages across differentiating cellular lineages have been assayed. However, current methods for modeling methylation dynamics do not account for the dependency structure between precursor and dependent cell types. We developed a continuous-time Markov chain approach, based on the observation that changes in methylation state over tissue differentiation can be modeled similarly to DNA nucleotide changes over evolutionary time. This model explicitly takes precursor to descendant relationships into account and enables inference of CpG methylation dynamics. To illustrate our method, we analyzed a high-resolution methylation map of the differentiation of mouse stem cells into several blood cell types. Our model can successfully infer unobserved CpG methylation states from observations at the same sites in related cell types (90% correct), and this approach more accurately reconstructs missing data than imputation based on neighboring CpGs (84% correct). Additionally, the single CpG resolution of our methylation dynamics estimates enabled us to show that DNA sequence context of CpG sites is informative about methylation dynamics across tissue differentiation. Finally, we identified genomic regions with clusters of highly dynamic CpGs and present a likely functional example. Our work establishes a framework for inference and modeling that is well-suited to DNA methylation data, and our success suggests that other methods for analyzing DNA nucleotide substitutions will also translate to the modeling of epigenetic phenomena.
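
As a rough illustration of the continuous-time Markov chain idea in the abstract above, the sketch below computes transition probabilities along one lineage branch for a two-state chain (unmethylated, methylated). The two-state simplification, the rate values, and the name `transition_matrix` are assumptions made here for illustration; the abstract does not specify the authors' state space or parameterisation.

```python
from math import exp


def transition_matrix(gain_rate, loss_rate, t):
    """Transition probabilities of a two-state CTMC after time t.

    State 0 = unmethylated, state 1 = methylated (an assumed encoding).
    gain_rate: rate of 0 -> 1, loss_rate: rate of 1 -> 0; their sum
    must be positive. Returns P with P[x][y] = Pr(state y at time t,
    given state x at time 0).
    """
    total = gain_rate + loss_rate
    decay = exp(-total * t)
    p_gain = gain_rate / total * (1.0 - decay)   # 0 -> 1
    p_loss = loss_rate / total * (1.0 - decay)   # 1 -> 0
    return [[1.0 - p_gain, p_gain],
            [p_loss, 1.0 - p_loss]]


# Hypothetical rates and branch length, for illustration only.
P_branch = transition_matrix(gain_rate=0.3, loss_rate=0.1, t=2.0)
P_zero = transition_matrix(gain_rate=0.3, loss_rate=0.1, t=0.0)
P_long = transition_matrix(gain_rate=0.3, loss_rate=0.1, t=1000.0)
```

Multiplying such matrices along the branches of a lineage tree is what lets a phylogenetics-style model relate precursor and descendant cell types; over long branches each row converges to the stationary distribution set by the ratio of the two rates.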

Comparing Evolutionary Rates Using An Exact Test for 2×2 Tables with Continuous Cell Entries

A. Morgan Thompson, M. Cyrus Maher, Lawrence H. Uricchio, Zachary A. Szpiech, Ryan D. Hernandez
(Submitted on 11 Apr 2014)

Assessing the statistical significance of an observed 2×2 contingency table can easily be accomplished using Fisher’s exact test (FET). However, if the cell entries are continuous or represent values inferred from a continuous parametric model, then FET cannot be applied. Such tables arise frequently in areas of biostatistical research including population genetics and evolutionary genomics, where cell entries are estimated by computational methods and result in cell entries drawn from the non-negative real line R+. Simply rounding cell entries to conform to the assumptions of FET is an ill-suited approach that we show creates problems related to both type-I and type-II errors. Pearson’s χ² test for independence, while technically applicable, is not often effective for these tables, as the test has several limiting assumptions that make application of this method inadvisable in many common instances (particularly with small cell entries). Here we develop a novel method for tables with continuous entries, which we term continuous Fisher’s Exact Test (cFET). Through simulations, we show that cFET has a close-to-uniform distribution of p-values under the null hypothesis of independence, and more power when applied to tables where the null hypothesis is false (compared to FET applied to rounded cell entries). We apply cFET to an example from comparative genomics to confirm an overall increased evolutionary rate among primates compared to rodents, and identify several genes that show particularly elevated evolutionary rates in primates. Some of these genes exhibit signatures of continued positive selection along the human lineage since our divergence with chimpanzee 5-7 million years ago, as well as ongoing selection in modern humans.
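
For reference, the classical test that the abstract above starts from can be sketched as follows. This is the standard two-sided Fisher's exact test for integer counts, not the authors' cFET; it shows why continuous entries are a problem, since the hypergeometric probabilities are defined through binomial coefficients of integer counts. The function name and the example table are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    All four entries must be non-negative integers. The p-value sums the
    hypergeometric probabilities of every table with the same margins
    that is no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


# Hypothetical integer table, for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

Passing rounded versions of continuous entries to a function like this is the workaround the abstract argues against.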

The relationships among GC content, nucleosome occupancy, and exon size

Liya Wang, Lincoln Stein, Doreen Ware
(Submitted on 9 Apr 2014)

The average size of internal translated exons, ranging from 120 to 165 nt across metazoans, is approximately the size of the typical mononucleosome (147 nt). Genome-wide study has also shown that nucleosome occupancy is significantly higher in exons than in introns, which might indicate that the evolution of exon size is related to its nucleosome occupancy. By grouping exons by the GC contents of their flanking introns, we show that the average exon size is positively correlated with its GC content. Using the sequencing data from direct mapping of Homo sapiens nucleosomes with limited nuclease digestion, we show that the level of nucleosome occupancy is also positively correlated with the exon GC content in a similar fashion. We then demonstrate that exon size is positively correlated with nucleosome occupancy. The strong correlation between exon size and nucleosome occupancy suggests that chromatin organization may be related to the evolution of exon sizes.