Estimating transcription factor abundance and specificity from genome-wide binding profiles

Nicolae Radu Zabet, Boris Adryan
Comments: 39 pages, 25 figures, 10 tables
Subjects: Quantitative Methods (q-bio.QM)

The binding of transcription factors (TFs) is essential for gene expression. One important characteristic is the actual occupancy of a putative binding site in the genome. In this study, we propose an analytical model to predict genomic occupancy that incorporates the preferred target sequence of a TF in the form of a position weight matrix (PWM), DNA accessibility data (in the case of eukaryotes), the number of TF molecules expected to be bound to the DNA and a parameter that modulates the specificity of the TF. Given actual occupancy data in the form of ChIP-seq profiles, we worked backwards to infer copy number and specificity for five Drosophila TFs during early embryonic development: Bicoid, Caudal, Giant, Hunchback and Kruppel. Our results suggest that these TFs display a lower number of DNA-bound molecules than previously assumed (in the range of tens and hundreds) and that, while Bicoid and Caudal display a higher specificity, the other three transcription factors (Giant, Hunchback and Kruppel) display lower specificity in their binding (despite having PWMs with higher information content). This study gives further weight to earlier investigations into TF copy numbers that suggest a significant proportion of molecules are not bound to the DNA.
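
As a rough illustration of how an occupancy model of this kind can combine its four ingredients (PWM, accessibility, number of bound molecules, specificity), the following Python sketch scores each site with a toy PWM, weights it by accessibility, and converts the weights into occupancies. The functional form (a Boltzmann-style weight exp(score / specificity) fed into a saturating ratio), the toy PWM values, and all names (`PWM`, `pwm_scores`, `occupancy`, `n_bound`, `specificity`) are assumptions made for illustration; they are not taken from the paper.

```python
import math

# Hypothetical toy PWM (log-odds-like scores) for a 4-bp motif; not from the paper.
PWM = [
    {"A": 1.2, "C": -0.8, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": 1.1, "G": -0.7, "T": -0.4},
    {"A": -0.6, "C": -0.9, "G": 1.3, "T": -0.8},
    {"A": -0.9, "C": -0.5, "G": -1.1, "T": 1.0},
]


def pwm_scores(seq, pwm):
    """Return the PWM score of every window of len(pwm) along seq."""
    width = len(pwm)
    return [
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def occupancy(seq, pwm, accessibility, n_bound, specificity):
    """Assumed occupancy form (illustrative only).

    weight_j    = accessibility_j * exp(score_j / specificity)
    occupancy_j = n_bound * weight_j / (n_bound * weight_j + sum of all weights)
    """
    scores = pwm_scores(seq, pwm)
    if len(accessibility) != len(scores):
        raise ValueError("need one accessibility value per scored window")
    weights = [
        a * math.exp(s / specificity) for a, s in zip(accessibility, scores)
    ]
    background = sum(weights)
    if background == 0.0:
        # No accessible site at all: nothing can be occupied.
        return [0.0 for _ in weights]
    return [n_bound * w / (n_bound * w + background) for w in weights]
```

In this sketch a smaller `specificity` value sharpens the contrast between strong and weak sites, and a larger `n_bound` pushes every accessible site towards saturation. Fitting both parameters to a ChIP-seq profile would amount to searching for the pair that best reproduces the measured signal.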

Are we able to detect mass extinction events using phylogenies?

Sacha S.J. Laurent, Marc Robinson-Rechavi, Nicolas Salamin
Comments: 14 pages, 8 figures
Subjects: Populations and Evolution (q-bio.PE)

The estimation of the rates of speciation and extinction provides important information on the macro-evolutionary processes shaping biodiversity through time (Ricklefs 2007). Since the seminal paper by Nee et al. (1994), much work has been done to extend the applicability of the birth-death process, which now allows us to test a wide range of hypotheses on the dynamics of the diversification process. Several approaches have been developed to identify the changes in rates of diversification occurring along a phylogenetic tree. Among them, we can distinguish between lineage-dependent, trait-dependent, time-dependent and density-dependent changes. Lineage-specific methods identify changes in speciation and extinction rates (λ and μ, respectively) at inner nodes of a phylogenetic tree (Rabosky et al. 2007; Alfaro et al. 2009; Silvestro et al. 2011). We can also identify trait-dependence in macro-evolutionary rates if the states of the particular trait of interest are known for the species under study (Maddison et al. 2007; FitzJohn et al. 2009; Mayrose et al. 2011). It is also possible to look for concerted changes in rates on independent branches of the phylogenetic tree by dividing the tree into time slices (Stadler 2011a). Finally, density-dependent effects can be detected when changes in diversification are correlated with overall species number (Etienne et al. 2012). Most methods can correct for incomplete taxon sampling by assigning species numbers at tips of the phylogeny (Alfaro et al. 2009; Stadler and Bokma 2013), or by introducing a sampling parameter (Nee et al. 1994). By taking into account this sampling parameter at time points in the past, one can also look for events of mass extinction (Stadler 2011a).
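
To make the idea of a sampling parameter applied at a past time point concrete, here is a minimal Python sketch of a constant-rate birth-death process in which every lineage alive at one chosen time survives with a fixed probability. It tracks lineage counts only, not a tree, and it is not the inference method of any of the cited papers. The function name `simulate_lineages` and its parameters (`birth`, `death`, `mass_time`, `survival`, `seed`) are hypothetical.

```python
import random


def simulate_lineages(birth, death, t_max, mass_time=None, survival=1.0, seed=0):
    """Simulate lineage counts under a constant-rate birth-death process.

    birth, death : per-lineage speciation and extinction rates (birth + death > 0)
    mass_time    : optional time of a single mass extinction event
    survival     : probability that each lineage survives that event
    Returns a list of (time, number_of_lineages) pairs, starting from one lineage.
    """
    rng = random.Random(seed)
    t, n = 0.0, 1
    history = [(t, n)]
    pending = mass_time is not None and mass_time < t_max
    while n > 0:
        # Waiting time to the next speciation or extinction among n lineages.
        wait = rng.expovariate(n * (birth + death))
        if pending and t + wait >= mass_time:
            # Jump to the event and thin the lineages independently.
            # Redrawing the waiting time afterwards is valid because
            # exponential waiting times are memoryless.
            t = mass_time
            n = sum(1 for _ in range(n) if rng.random() < survival)
            history.append((t, n))
            pending = False
            continue
        t += wait
        if t >= t_max:
            break
        if rng.random() < birth / (birth + death):
            n += 1  # speciation
        else:
            n -= 1  # extinction
        history.append((t, n))
    return history
```

Running the function with and without `mass_time` for the same rates gives lineage-through-time curves whose difference is the kind of signal a mass-extinction test would have to detect.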

Mapping to a Reference Genome Structure

Benedict Paten, Adam Novak, David Haussler
Comments: 25 pages
Subjects: Genomics (q-bio.GN)

To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.

Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene

Thomas Turner, Christopher C Giauque, Daniel R Schrider, Andrew D Kern

Thirty-four years ago, it was postulated that natural populations of Drosophila melanogaster comprise two behavioral morphs termed “rover” and “sitter”, and that this variation is caused mainly by large-effect alleles at a single locus. Since that time, considerable data has been amassed that compares the behavior and physiology of these morphs. Contrary to common assertions, however, published support for the existence of common large-effect alleles in nature is quite limited. To further investigate, we quantified the foraging behavior of 36 natural strains, performed a genome-wide association study, and described patterns of molecular evolution at the foraging locus. Though there was significant variation in foraging behavior among genotypes, this variation was continuously distributed and not significantly associated with genetic variation at the foraging gene. Patterns of molecular population genetic variation at this gene also provide no support for the hypothesis that for is a target of long-term balancing selection. We propose that additional data is required to support a hypothesis of common alleles of large effect on foraging behavior in nature. Genome-wide association does support a role for natural variation at several other loci, including the sulfateless gene, though these associations should be considered preliminary until validated with a larger sample size.

Identifying adaptive and plastic gene expression levels using a unified model for expression variance between and within species

Rori Rohlfs, Rasmus Nielsen

Thanks to the reduced cost of RNA-Sequencing and other advanced methods for quantifying expression levels, accurate and expansive comparative expression data sets including data from multiple individuals per species are emerging. Comparative genomics has been greatly facilitated by the availability of statistical methods that consider both between- and within-species variation when testing hypotheses regarding the evolution of DNA sequences. Similar methods are now needed to fully leverage comparative expression data. In this paper, we describe the β model, which parameterizes the ratio of population to evolutionary expression variance, facilitating a wide variety of analyses, including a test for expression divergence or diversity for a single gene or a class of genes. The β model can also be used to test for lineage-specific shifts in expression level, amongst other applications. We use simulations to explore the functionality of these tests under a variety of circumstances. We then apply them to a mammalian phylogeny of 15 species typed in liver tissue. We identify genes with high expression divergence between species as candidates for expression level adaptation, and genes with high expression diversity within species as candidates for expression level conservation and plasticity. Using the test for lineage-specific expression shifts, we identify several candidate genes for expression level adaptation on the catarrhine and human lineages, including genes possibly related to dietary changes in humans. We compare these results to those reported previously using the species mean model, which ignores population expression variance, uncovering important differences in performance.

Regulatory variants explain much more heritability than coding variants across 11 common diseases

Alexander Gusev, S Hong Lee, Benjamin M Neale, Gosia Trynka, Bjarni J Vilhjalmsson, Hilary Finucane, Han Xu, Chongzhi Zang, Stephan Ripke, Eli Stahl, Schizophrenia Working Group of the PGC, SWE-SCZ Consortium, Anna K Kahler, Christina M Hultman, Shaun M Purcell, Steven A McCarroll, Mark Daly, Bogdan Pasaniuc, Patrick F Sullivan, Naomi R Wray, Soumya Raychaudhuri, Alkes L Price

Common variants implicated by genome-wide association studies (GWAS) of complex diseases are known to be enriched for coding and regulatory variants. We applied methods to partition the heritability explained by genotyped SNPs (h²g) across functional categories (while accounting for shared variance due to linkage disequilibrium) to genotyped and imputed data for 11 common diseases. DNaseI Hypersensitivity Sites (DHS) from 218 cell-types, spanning 16% of the genome, explained an average of 79% of h²g (5.1× enrichment; P < 10⁻²⁰); further enrichment was observed at enhancer and cell-type specific DHS elements. The enrichments were much smaller in analyses that did not use imputed data or were restricted to GWAS-associated SNPs. In contrast, coding variants, spanning 1% of the genome, explained only 8% of h²g (13.8× enrichment; P = 5 × 10⁻⁴). We replicated these findings but found no significant contribution from rare coding variants in an independent schizophrenia cohort genotyped on GWAS and exome chips.

READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad Ulrich Förstner, Jörg Vogel, Cynthia Mira Sharma

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are available under DOI:10.6084/m9.figshare.977849.

Model adequacy and the macroevolution of angiosperm functional traits

Matthew Pennell, Richard G FitzJohn, William K Cornwell, Luke J Harmon

All models are wrong and sometimes even the best of a set of models is useless. Modern phylogenetic comparative methods (PCMs) are almost exclusively model-based, and therefore making robust inferences from PCMs requires using a model of trait evolution that is a good explanation for the data. To date, researchers using PCMs have evaluated the explanatory power of a model only in terms of relative, not absolute, fit. Here we develop a general statistical framework for assessing the absolute fit, or adequacy, of phylogenetic models for the evolution of quantitative traits. We use our approach to test whether commonly used models are adequate descriptors of the macroevolutionary dynamics of real comparative data. We fit models of trait evolution to 337 comparative datasets covering three key Angiosperm functional traits and evaluated the absolute fit of the models to each dataset. Overall, the models we used are very inadequate for the evolution of these traits; this was true for many different groups and at many different scales. Furthermore, the relative support for a model had very little to do with its absolute adequacy. We argue that assessing model adequacy should be a key step in comparative analyses.

Population genetics of identity by descent

Pier Francesco Palamara, Ph.D. thesis

Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.

Principal component gene set enrichment (PCGSE)

H. Robert Frost, Zhigang Li, Jason H. Moore

Motivation: Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting. Most existing methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with few non-zero loadings. Although useful when just a few variables dominate the population PCs, these methods are often inadequate for characterizing the PCs of high-dimensional genomic data. For genomic data, reproducible and biologically meaningful PC interpretation requires methods based on the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs. Results: We describe a novel approach, principal component gene set enrichment (PCGSE), for computing the statistical association between gene sets and the PCs of genomic data. The PCGSE method performs a two-stage competitive gene set test using the correlation between each gene and each PC as the gene-level test statistic, with a flexible choice of both the gene set test statistic and the method used to compute the null distribution of the gene set statistic. Using simulated data with simulated gene sets and real gene expression data with curated gene sets, we demonstrate that biologically meaningful and computationally efficient results can be obtained from a simple parametric version of the PCGSE method that performs a correlation-adjusted two-sample t-test between the gene-level test statistics for gene set members and genes not in the set.
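
The two-stage structure described in this abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation. It takes the per-sample PC scores as given, uses a Fisher z-transform of each gene-PC correlation as the gene-level statistic (an assumption), and compares set members against all other genes with a plain Welch t-statistic. The correlation adjustment named in the abstract is deliberately omitted, so the statistic would be anti-conservative for correlated genes. The names `pearson`, `gene_level_stats` and `gene_set_t` are hypothetical.

```python
import math
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gene_level_stats(expr, pc_scores):
    """Stage 1: one statistic per gene (Fisher z of its correlation with the PC).

    expr      : dict mapping gene name -> expression values across samples
    pc_scores : the PC's score for each sample, in the same sample order
    """
    stats = {}
    for gene, values in expr.items():
        r = pearson(values, pc_scores)
        # Clamp so that atanh stays finite for perfectly correlated genes.
        r = max(min(r, 0.999999), -0.999999)
        stats[gene] = math.atanh(r)
    return stats


def gene_set_t(stats, gene_set):
    """Stage 2: competitive test, set members versus all remaining genes.

    Returns an unadjusted Welch t-statistic; it needs at least two genes
    inside and two genes outside the set.
    """
    inside = [z for g, z in stats.items() if g in gene_set]
    outside = [z for g, z in stats.items() if g not in gene_set]
    se = math.sqrt(
        variance(inside) / len(inside) + variance(outside) / len(outside)
    )
    return (mean(inside) - mean(outside)) / se
```

A large positive value means the genes in the set track the PC more closely than the rest of the genome does. Whether to test signed or absolute correlations is a design choice that this sketch leaves at signed values.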