Estimating transcription factor abundance and specificity from genome-wide binding profiles

Nicolae Radu Zabet, Boris Adryan
Comments: 39 pages, 25 figures, 10 tables
Subjects: Quantitative Methods (q-bio.QM)

The binding of transcription factors (TFs) is essential for gene expression. One important characteristic is the actual occupancy of a putative binding site in the genome. In this study, we propose an analytical model to predict genomic occupancy that incorporates the preferred target sequence of a TF in the form of a position weight matrix (PWM), DNA accessibility data (in the case of eukaryotes), the number of TF molecules expected to be bound to the DNA, and a parameter that modulates the specificity of the TF. Given actual occupancy data in the form of ChIP-seq profiles, we worked backwards to infer copy number and specificity for five Drosophila TFs during early embryonic development: Bicoid, Caudal, Giant, Hunchback and Kruppel. Our results suggest that these TFs have fewer DNA-bound molecules than previously assumed (in the range of tens to hundreds) and that, while Bicoid and Caudal display a higher specificity, the other three TFs (Giant, Hunchback and Kruppel) display lower specificity in their binding, despite having PWMs with higher information content. This study gives further weight to earlier investigations into TF copy numbers that suggest a significant proportion of molecules are not bound to the DNA.
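
As a rough illustration of how the four ingredients named in this abstract (PWM score, accessibility, number of bound molecules, specificity parameter) could combine into an occupancy value, here is a minimal Python sketch. It is not the authors' equation: the saturating functional form, the `background` normalising constant and all function and parameter names are assumptions made for illustration only.

```python
import math

def pwm_score(site, pwm):
    """Sum the per-position weights of a site under a PWM (list of dicts)."""
    return sum(column[base] for base, column in zip(site, pwm))

def occupancy(score, accessibility, n_bound, specificity_scale, background):
    """Toy saturating occupancy of a single site (hypothetical form).

    weight    = accessibility * exp(score / specificity_scale)
    occupancy = n_bound * weight / (n_bound * weight + background)

    `background` stands in for competition from all other sites in the
    genome; it is an assumed constant, not a quantity from the abstract.
    """
    weight = accessibility * math.exp(score / specificity_scale)
    return n_bound * weight / (n_bound * weight + background)

# Toy two-position PWM that prefers the sequence "AC".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]

strong = pwm_score("AC", toy_pwm)   # best-matching site
weak = pwm_score("GT", toy_pwm)     # worst-matching site
for scale in (1.0, 4.0):
    ratio = (occupancy(strong, 1.0, 10, scale, 100.0)
             / occupancy(weak, 1.0, 10, scale, 100.0))
    print(f"specificity_scale={scale}: strong/weak occupancy ratio = {ratio:.1f}")
```

In this sketch a larger `specificity_scale` flattens the difference between strong and weak sites, and an inaccessible site (`accessibility = 0`) is never occupied. Fitting `n_bound` and `specificity_scale` to a ChIP-seq profile would then amount to searching for the pair that best reproduces the measured signal.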

Are we able to detect mass extinction events using phylogenies?

Sacha S.J. Laurent, Marc Robinson-Rechavi, Nicolas Salamin
Comments: 14 pages, 8 figures
Subjects: Populations and Evolution (q-bio.PE)

The estimation of the rates of speciation and extinction provides important information on the macro-evolutionary processes shaping biodiversity through time (Ricklefs 2007). Since the seminal paper by Nee et al. (1994), much work has been done to extend the applicability of the birth-death process, which now allows us to test a wide range of hypotheses on the dynamics of the diversification process. Several approaches have been developed to identify the changes in rates of diversification occurring along a phylogenetic tree. Among them, we can distinguish between lineage-dependent, trait-dependent, time-dependent and density-dependent changes. Lineage-specific methods identify changes in the speciation rate λ and the extinction rate μ at inner nodes of a phylogenetic tree (Rabosky et al. 2007; Alfaro et al. 2009; Silvestro et al. 2011). We can also identify trait-dependence in macro-evolutionary rates if the states of the particular trait of interest are known for the species under study (Maddison et al. 2007; FitzJohn et al. 2009; Mayrose et al. 2011). It is also possible to look for concerted changes in rates on independent branches of the phylogenetic tree by dividing the tree into time slices (Stadler 2011a). Finally, density-dependent effects can be detected when changes in diversification are correlated with overall species number (Etienne et al. 2012). Most methods can correct for incomplete taxon sampling, either by assigning species numbers at the tips of the phylogeny (Alfaro et al. 2009; Stadler and Bokma 2013) or by introducing a sampling parameter (Nee et al. 1994). By taking this sampling parameter into account at time points in the past, one can also look for mass extinction events (Stadler 2011a).
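
To make the role of a past sampling parameter concrete, the following Python sketch computes the expected number of lineages alive under a constant-rate birth-death process, with a mass extinction treated as an instantaneous event in which each lineage survives independently with a fixed probability. The function name, argument names and example values are assumptions for illustration; the quantity computed is the expected count of living lineages, not the likelihood of a reconstructed phylogeny that the cited methods work with.

```python
import math

def expected_lineages(n0, speciation, extinction, t, mass_extinctions=()):
    """Expected number of lineages alive at time t, starting from n0 at time 0.

    Under constant rates the expectation grows as
    n0 * exp((speciation - extinction) * t).
    Each mass extinction is an assumed (event_time, survival_probability)
    pair and scales the expectation by its survival probability.
    """
    expected = n0 * math.exp((speciation - extinction) * t)
    for event_time, survival in mass_extinctions:
        if 0 <= event_time <= t:
            expected *= survival
    return expected

# Hypothetical rates: lambda = 0.3, mu = 0.1 per unit time, over 20 time units.
without_event = expected_lineages(2, 0.3, 0.1, 20)
with_event = expected_lineages(2, 0.3, 0.1, 20, mass_extinctions=[(10, 0.1)])
print(without_event, with_event)
```

Because the survival probability enters as a simple multiplier, a single event that removes 90% of lineages lowers the expected final diversity tenfold regardless of when it occurs; the timing only leaves a signature in the branching times of the tree, which is what makes detection from phylogenies non-trivial.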

Mapping to a Reference Genome Structure

Benedict Paten, Adam Novak, David Haussler
Comments: 25 pages
Subjects: Genomics (q-bio.GN)

To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.

Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene

Thomas Turner, Christopher C Giauque, Daniel R Schrider, Andrew D Kern

Thirty-four years ago, it was postulated that natural populations of Drosophila melanogaster are composed of two behavioral morphs termed “rover” and “sitter”, and that this variation is caused mainly by large-effect alleles at a single locus. Since that time, considerable data have been amassed comparing the behavior and physiology of these morphs. Contrary to common assertions, however, published support for the existence of common large-effect alleles in nature is quite limited. To investigate further, we quantified the foraging behavior of 36 natural strains, performed a genome-wide association study, and described patterns of molecular evolution at the foraging locus. Though there was significant variation in foraging behavior among genotypes, this variation was continuously distributed and not significantly associated with genetic variation at the foraging gene. Patterns of molecular population genetic variation at this gene also provide no support for the hypothesis that for is a target of long-term balancing selection. We propose that additional data are required to support a hypothesis of common alleles of large effect on foraging behavior in nature. Genome-wide association does support a role for natural variation at several other loci, including the sulfateless gene, though these associations should be considered preliminary until validated with a larger sample size.

Identifying adaptive and plastic gene expression levels using a unified model for expression variance between and within species

Rori Rohlfs, Rasmus Nielsen

Thanks to the reduced cost of RNA-Sequencing and other advanced methods for quantifying expression levels, accurate and expansive comparative expression data sets that include data from multiple individuals per species are emerging. Comparative genomics has been greatly facilitated by the availability of statistical methods that consider both between-species and within-species variation when testing hypotheses regarding the evolution of DNA sequences. Similar methods are now needed to fully leverage comparative expression data. In this paper, we describe the β model, which parameterizes the ratio of population to evolutionary expression variance, facilitating a wide variety of analyses, including a test for expression divergence or diversity for a single gene or a class of genes. The β model can also be used to test for lineage-specific shifts in expression level, amongst other applications. We use simulations to explore the functionality of these tests under a variety of circumstances. We then apply them to a mammalian phylogeny of 15 species typed in liver tissue. We identify genes with high expression divergence between species as candidates for expression level adaptation, and genes with high expression diversity within species as candidates for expression level conservation and plasticity. Using the test for lineage-specific expression shifts, we identify several candidate genes for expression level adaptation on the catarrhine and human lineages, including genes possibly related to dietary changes in humans. We compare these results to those reported previously using the species mean model, which ignores population expression variance, uncovering important differences in performance.

Regulatory variants explain much more heritability than coding variants across 11 common diseases

Alexander Gusev, S Hong Lee, Benjamin M Neale, Gosia Trynka, Bjarni J Vilhjalmsson, Hilary Finucane, Han Xu, Chongzhi Zang, Stephan Ripke, Eli Stahl, Schizophrenia Working Group of the PGC, SWE-SCZ Consortium, Anna K Kahler, Christina M Hultman, Shaun M Purcell, Steven A McCarroll, Mark Daly, Bogdan Pasaniuc, Patrick F Sullivan, Naomi R Wray, Soumya Raychaudhuri, Alkes L Price

Common variants implicated by genome-wide association studies (GWAS) of complex diseases are known to be enriched for coding and regulatory variants. We applied methods to partition the heritability explained by genotyped SNPs (h2g) across functional categories (while accounting for shared variance due to linkage disequilibrium) to genotype and imputed data for 11 common diseases. DNaseI Hypersensitivity Sites (DHS) from 218 cell types, spanning 16% of the genome, explained an average of 79% of h2g (5.1× enrichment; P < 10^-20); further enrichment was observed at enhancer and cell-type-specific DHS elements. The enrichments were much smaller in analyses that did not use imputed data or were restricted to GWAS-associated SNPs. In contrast, coding variants, spanning 1% of the genome, explained only 8% of h2g (13.8× enrichment; P = 5 × 10^-4). We replicated these findings but found no significant contribution from rare coding variants in an independent schizophrenia cohort genotyped on GWAS and exome chips.

VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

Oliver S Burren, Hui Guo, Chris Wallace
(Submitted on 17 Apr 2014)

Motivation: Genome-wide association studies (GWAS) have identified many loci implicated in disease susceptibility. Integration of GWAS summary statistics (p values) and functional genomic datasets should help to elucidate mechanisms. Results: We extend a previously described non-parametric method, which tests whether GWAS signals are enriched in functionally defined loci, to the situation where only GWAS p values are available. The approach is implemented in VSEAMS, a freely available software pipeline. We use VSEAMS to integrate functional gene sets defined via transcription factor knockdown experiments with GWAS results for type 1 diabetes (T1D) and find variant set enrichment in gene sets associated with IKZF3, BATF and ESRRA. IKZF3 lies in a known T1D susceptibility region, whilst BATF and ESRRA overlap other immune disease susceptibility regions, validating our approach and suggesting novel avenues of research for T1D. Availability and implementation: VSEAMS is available for download.
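
The general idea of testing a variant set for enrichment from p values alone can be sketched with a permutation version of the Wilcoxon rank-sum test. This is a swapped-in, simplified technique and not the VSEAMS implementation: the abstract does not specify the test statistic, the sketch treats SNPs as independent (it ignores linkage disequilibrium), and all function names and example p values below are made up for illustration.

```python
import random

def rank_sum(test_p, control_p):
    """Rank-sum of the test set in the pooled p values (rank 1 = smallest).

    Tied values receive the average of the ranks they span.
    """
    pooled = sorted(list(test_p) + list(control_p))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return sum(ranks[p] for p in test_p)

def enrichment_p(test_p, control_p, n_perm=2000, seed=0):
    """One-sided permutation p value: is the test set shifted to small p?"""
    rng = random.Random(seed)
    observed = rank_sum(test_p, control_p)
    pooled = list(test_p) + list(control_p)
    k = len(test_p)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if rank_sum(pooled[:k], pooled[k:]) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up GWAS p values for SNPs in a test gene set and a control set.
test_set = [0.001, 0.002, 0.003, 0.004, 0.005]
control_set = [0.20, 0.25, 0.31, 0.40, 0.47, 0.52, 0.66, 0.71, 0.85, 0.93]
print(enrichment_p(test_set, control_set))
```

A small returned value indicates that SNPs in the test set tend to have smaller GWAS p values than those in the control set. Because nearby SNPs are correlated in real data, a naive permutation like this one would overstate significance, which is why an adjustment for linkage disequilibrium is needed in practice.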