Comment on “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions” by Kim et al.
Alexander Dobin, Thomas R Gingeras
In the recent paper by Kim et al. (Genome Biology, 2013. 14(4): p. R36) the accuracy of TopHat2 was compared to other RNA-seq aligners. In this comment we re-examine the most important analyses from this paper and identify several deficiencies that significantly diminished the performance of some of the aligners, including incorrect choice of mapping parameters, unfair comparison metrics, and unrealistic simulated data. Using STAR (Dobin et al., Bioinformatics, 2013. 29(1): p. 15-21) as an exemplar, we demonstrate that correcting these deficiencies makes its accuracy equal to or better than that of TopHat2. Furthermore, this exercise highlighted some serious issues with the TopHat2 algorithms, such as poor recall of alignments with a moderate (>3) number of mismatches, low sensitivity and high false discovery rate for splice junction detection, loss of precision for the realignment algorithm, and a large number of false chimeric alignments.
Joint analysis of functional genomic data and genome-wide association studies of 18 human traits
Annotations of gene structures and regulatory elements can inform genome-wide association studies (GWAS). However, choosing the relevant annotations for interpreting an association study of a given trait remains challenging. We describe a statistical model that uses association statistics computed across the genome to identify classes of genomic element that are enriched or depleted for loci that influence a trait. The model naturally incorporates multiple types of annotations. We applied the model to GWAS of 18 human traits, including red blood cell traits, platelet traits, glucose levels, lipid levels, height, BMI, and Crohn’s disease. For each trait, we evaluated the relevance of 450 different genomic annotations, including protein-coding genes, enhancers, and DNase-I hypersensitive sites in over a hundred tissues and cell lines. We show that the fraction of phenotype-associated SNPs that influence protein sequence ranges from around 2% (for platelet volume) up to around 20% (for LDL cholesterol); that repressed chromatin is significantly depleted for SNPs associated with several traits; and that cell type-specific DNase-I hypersensitive sites are enriched for SNPs associated with several traits (for example, fibroblasts in Crohn’s disease and muscle tissue in bone density). Finally, by re-weighting each GWAS using information from functional genomics, we increase the number of loci with high-confidence associations by around 5%.
Drosophila embryogenesis scales uniformly across temperature and developmentally diverse species
Steven Gregory Kuntz, Michael B Eisen
Temperature affects both the timing and outcome of animal development, but the detailed effects of temperature on the progress of early development have been poorly characterized. To determine the impact of temperature on the order and timing of events during Drosophila melanogaster embryogenesis, we used time-lapse imaging to track the progress of embryos from shortly after egg laying through hatching at seven precisely maintained temperatures between 17.5°C and 32.5°C. We employed a combination of automated and manual annotation to determine when 36 milestones occurred in each embryo. D. melanogaster embryogenesis takes 33 hours at 17.5°C, and accelerates with increasing temperature to a low of 16 hours at 27.5°C, above which embryogenesis slows slightly. Remarkably, while the total time of embryogenesis varies more than twofold, the relative timing of events from cellularization through hatching is constant across temperatures. To further explore the relationship between temperature and embryogenesis, we expanded our analysis to cover ten additional Drosophila species of varying climatic origins. Six of these species, like D. melanogaster, are of tropical origin, and embryogenesis time at different temperatures was similar for them all. D. mojavensis, a sub-tropical fly, develops more slowly than the tropical species at lower temperatures, while D. virilis, a temperate fly, exhibits slower development at all temperatures. The alpine sister species D. persimilis and D. pseudoobscura develop as rapidly as tropical flies at cooler temperatures, but exhibit diminished acceleration above 22.5°C and have drastically slowed development by 30°C. Although total embryogenesis time ranges from 13 hours for D. erecta at 30°C to 46 hours for D. virilis at 17.5°C, the relative timing of events from cellularization through hatching is constant across all of the species and temperatures examined here, suggesting the existence of a previously unrecognized timer controlling the progress of embryogenesis that has been tuned by natural selection in response to the thermal environment in which each species lives.
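The idea of uniform scaling can be made concrete with a small sketch. It is not the authors' analysis: the function names and the intermediate milestone times are hypothetical, and only the 33-hour and 16-hour totals are taken from the abstract. Each milestone time is divided by the total time to hatching, and two conditions count as uniformly scaled when those fractions agree.

```python
import math

def relative_timing(milestones_h):
    """Express each milestone as a fraction of total embryogenesis time.

    Assumes the latest milestone in the dict is hatching (an assumption
    made for this sketch, not a statement about the paper's method).
    """
    total = max(milestones_h.values())
    return {name: t / total for name, t in milestones_h.items()}

def uniformly_scaled(a, b, rel_tol=1e-9):
    """True if two conditions share the same relative timing for every milestone."""
    ra, rb = relative_timing(a), relative_timing(b)
    return ra.keys() == rb.keys() and all(
        math.isclose(ra[k], rb[k], rel_tol=rel_tol) for k in ra
    )

# Hypothetical milestone times in hours; only the totals (33 h, 16 h) come from the text.
cool = {"cellularization": 3.3, "gastrulation": 6.6, "hatching": 33.0}
warm = {name: t * 16 / 33 for name, t in cool.items()}    # every event sped up equally
skewed = {"cellularization": 3.3, "gastrulation": 6.6, "hatching": 16.0}

print(uniformly_scaled(cool, warm))    # same fractions despite different totals
print(uniformly_scaled(cool, skewed))  # only hatching moved, so the fractions differ
```

In this toy setting the total duration changes by roughly a factor of two while the fractions stay fixed, which is the pattern the abstract reports across temperatures and species.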
Improved annotation of 3-prime untranslated regions and complex loci by combination of strand-specific Direct RNA Sequencing, RNA-seq and ESTs
Nick Schurch, Christian Cole, Alexander Sherstnev, Junfang Song, Céline Duc, Kate G. Storey, W. H. Irwin McLean, Sara J. Brown, Gordon G. Simpson, Geoffrey J. Barton
(Submitted on 11 Nov 2013)
The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct annotation is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3-prime untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences, which locates 3-prime polyadenylation sites to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression; and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data.
The Functional Consequences of Variation in Transcription Factor Binding
Darren A. Cusanovich, Bryan Pavlovic, Jonathan K. Pritchard, Yoav Gilad
(Submitted on 18 Oct 2013)
One goal of human genetics is to understand how the information for precise and dynamic gene expression programs is encoded in the genome. The interactions of transcription factors (TFs) with DNA regulatory elements clearly play an important role in determining gene expression outputs, yet the regulatory logic underlying functional transcription factor binding is poorly understood. Many studies have focused on characterizing the genomic locations of TF binding, yet it is unclear to what extent TF binding at any specific locus has functional consequences with respect to gene expression output. To evaluate the context of functional TF binding we knocked down 59 TFs and chromatin modifiers in one HapMap lymphoblastoid cell line. We then identified genes whose expression was affected by the knockdowns. We intersected the gene expression data with transcription factor binding data (based on ChIP-seq and DNase-seq) within 10 kb of the transcription start sites of expressed genes. This combination of data allowed us to infer functional TF binding. On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. We found that functional TF binding is enriched in regulatory elements that harbor a large number of TF binding sites, at sites with predicted higher binding affinity, and at sites that are enriched in genomic regions annotated as active enhancers.
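The intersection step described above can be sketched in a few lines. This is an illustrative assumption about how such a fraction might be computed, not the authors' pipeline: the function names, factor labels, and gene identifiers are all hypothetical. For each factor, the genes bound near their transcription start site are intersected with the genes differentially expressed after knockdown, and the per-factor fractions are then averaged.

```python
def functional_binding_fraction(bound_genes, de_genes):
    """Fraction of a factor's bound genes that are differentially expressed
    after that factor is knocked down."""
    bound = set(bound_genes)
    if not bound:
        return 0.0
    return len(bound & set(de_genes)) / len(bound)

def mean_fraction(per_factor):
    """Average the per-factor fractions; per_factor maps name -> (bound, de)."""
    fractions = [functional_binding_fraction(b, d) for b, d in per_factor.values()]
    return sum(fractions) / len(fractions) if fractions else 0.0

# Toy data with made-up identifiers, purely for illustration.
toy = {
    "TF_A": ({"g1", "g2", "g3", "g4"}, {"g2", "g9"}),   # 1 of 4 bound genes responds
    "TF_B": ({"g5", "g6"}, {"g5", "g6", "g7"}),         # 2 of 2 bound genes respond
}

for name, (bound, de) in toy.items():
    print(name, functional_binding_fraction(bound, de))
print("mean", mean_fraction(toy))
```

Genes that change expression without being bound (such as the made-up g9 and g7) do not enter the fraction, since the denominator is restricted to bound genes.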
A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects
Chuan Gao, Christopher D Brown, Barbara E Engelhardt
(Submitted on 17 Oct 2013)
One important problem in genome science is to determine sets of co-regulated genes based on measurements of gene expression levels across samples, where the quantification of expression levels includes substantial technical and biological noise. To address this problem, we developed a Bayesian sparse latent factor model that uses a three parameter beta prior to flexibly model shrinkage in the loading matrix. By applying three layers of shrinkage to the loading matrix (global, factor-specific, and element-wise), this model has non-parametric properties in that it estimates the appropriate number of factors from the data. We added a two-component mixture to model each factor loading as being generated from either a sparse or a dense mixture component; this allows dense factors that capture confounding noise, and sparse factors that capture local gene interactions. We developed two statistics to quantify the stability of the recovered matrices for both sparse and dense matrices. We tested our model on simulated data and found that we successfully recovered the true latent structure as compared to related models. We applied our model to a large gene expression study and found that we recovered known covariates and small groups of co-regulated genes. We validated these gene subsets by testing for associations between genotype data and these latent factors, and we found a substantial number of biologically important genetic regulators for the recovered gene subsets.
Integrating diverse datasets improves developmental enhancer prediction
Genevieve D. Erwin, Rebecca M. Truty, Dennis Kostka, Katherine S. Pollard, John A. Capra
(Submitted on 27 Sep 2013)
Gene-regulatory enhancers have been identified by many lines of evidence, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a novel method for predicting developmental enhancers and their tissue specificity. EnhancerFinder uses a two-step multiple-kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and thousands of diverse functional genomics datasets from a variety of cell types and developmental stages. We trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser, in contrast to histone mark or sequence-based enhancer definitions commonly used. We comprehensively evaluated EnhancerFinder, and found that our integrative approach improves enhancer prediction accuracy over previous approaches that consider a single type of data. Our evaluation highlights the importance of considering information from many tissues when predicting specific types of enhancers. We find that VISTA enhancers active in embryonic heart are easier to predict than enhancers active in several other tissues due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and hits from genome-wide association studies. We demonstrate the utility of our enhancer predictions by identifying and validating a novel cranial nerve enhancer in the ZEB2 locus. Our genome-wide developmental enhancer predictions will be freely available as a UCSC Genome Browser track.
The epigenome of evolving Drosophila neo-sex chromosomes: dosage compensation and heterochromatin formation
Qi Zhou, Christopher E. Ellison, Vera B. Kaiser, Artyom A. Alekseyenko, Andrey A. Gorchakov, Doris Bachtrog
(Submitted on 26 Sep 2013)
Drosophila Y chromosomes are composed entirely of silent heterochromatin, while male X chromosomes have highly accessible chromatin and are hypertranscribed due to dosage compensation. Here, we dissect the molecular mechanisms and functional pressures driving heterochromatin formation and dosage compensation of the recently formed neo-sex chromosomes of Drosophila miranda. We show that the onset of heterochromatin formation on the neo-Y is triggered by an accumulation of repetitive DNA. The neo-X has evolved partial dosage compensation and we find that diverse mutational paths have been utilized to establish several dozen novel binding consensus motifs for the dosage compensation complex on the neo-X, including simple point mutations at pre-binding sites, insertion and deletion mutations, microsatellite expansions, or tandem amplification of weak binding sites. Spreading of these silencing or activating chromatin modifications to adjacent regions results in massive mis-expression of neo-sex linked genes, and little correspondence between functionality of genes and their silencing on the neo-Y or dosage compensation on the neo-X. Intriguingly, the genomic regions being targeted by the dosage compensation complex on the neo-X and those becoming heterochromatic on the neo-Y show little overlap, possibly reflecting different propensities along the ancestral chromosome to adopt active or repressive chromatin configurations. Our findings have broad implications for current models of sex chromosome evolution, and demonstrate how mechanistic constraints can limit evolutionary adaptations. Our study also highlights how evolution can follow predictable genetic trajectories, by repeatedly acquiring the same 21-bp consensus motif for recruitment of the dosage compensation complex, yet utilizing a diverse array of random mutational changes to attain the same phenotypic outcome.
A computational model for histone mark propagation reproduces the distribution of heterochromatin in different human cell types
Veit Schwämmle, Ole Nørregaard Jensen
(Submitted on 27 Sep 2013)
Chromatin is a highly compact and dynamic nuclear structure that consists of DNA and associated proteins. The main organizational unit is the nucleosome, which consists of a histone octamer with DNA wrapped around it. Histone proteins are implicated in the regulation of eukaryotic genes and they carry numerous reversible post-translational modifications that control DNA-protein interactions and the recruitment of chromatin binding proteins. Heterochromatin, the transcriptionally inactive part of the genome, is densely packed and contains histone H3 that is methylated at Lys 9 (H3K9me). The propagation of H3K9me in nucleosomes along the DNA in chromatin is antagonized by methylation of H3 lysine 4 (H3K4me) and acetylation of several lysines, which are associated with euchromatin and active genes. We show that the related histone modifications form antagonistic domains on a coarse scale. These histone marks are assumed to be initiated within distinct nucleation sites in the DNA and to propagate bi-directionally. We propose a simple computer model that simulates the distribution of heterochromatin in human chromosomes. The simulations are in agreement with previously reported experimental observations from two different human cell lines. We reproduced different types of barriers between heterochromatin and euchromatin, providing a unified model for their function. The effects of changes in the nucleation site distribution and in propagation rates were studied. The former mainly serves to (de-)activate single genes or gene groups, whereas the latter has the power to control the transcriptional programs of entire chromosomes. Generally, the regulatory program of gene transcription is controlled by the distribution of nucleation sites along the DNA string.
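A toy version of the nucleation-and-spreading idea can be sketched as follows. This is not the authors' model: the one-dimensional lattice, the per-step spreading probabilities, and every name in the code are assumptions chosen for illustration. Two marks start at given nucleation sites and spread to unmodified neighbouring nucleosomes in both directions; because a mark never overwrites the other, a boundary forms where the two domains meet.

```python
import random

def propagate(n_sites, hetero_nucleation, eu_nucleation, steps,
              rate_h=0.5, rate_e=0.5, seed=0):
    """Spread two antagonistic marks along a line of nucleosomes.

    'H' stands for a heterochromatin mark, 'E' for a euchromatin mark,
    '-' for an unmodified nucleosome. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    state = ["-"] * n_sites
    for i in hetero_nucleation:
        state[i] = "H"
    for i in eu_nucleation:
        state[i] = "E"
    for _ in range(steps):
        new = state[:]
        for i, mark in enumerate(state):
            if mark == "-":
                continue
            rate = rate_h if mark == "H" else rate_e
            for j in (i - 1, i + 1):  # bi-directional spreading
                # Only unmodified sites can be converted, so opposing
                # domains block each other where they meet.
                if 0 <= j < n_sites and state[j] == "-" and rng.random() < rate:
                    new[j] = mark
        state = new
    return "".join(state)

# One nucleation site of each type at opposite ends of a short lattice.
print(propagate(40, hetero_nucleation=[0], eu_nucleation=[39], steps=60))
# Raising rate_h relative to rate_e tends to shift the boundary toward the E side.
print(propagate(40, [0], [39], steps=60, rate_h=0.9, rate_e=0.3))
```

Changing the nucleation site lists moves domains locally, while changing the rates shifts boundaries along the whole lattice, which loosely mirrors the distinction the abstract draws between the two kinds of change.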
Characterizing the infection-induced transcriptome of Nasonia vitripennis reveals a preponderance of taxonomically-restricted immune genes
Timothy B. Sackton, John H. Werren, Andrew G. Clark
(Submitted on 23 Sep 2013)
The innate immune system in insects consists of a conserved core signaling network and rapidly diversifying effector and recognition components, often containing a high proportion of taxonomically-restricted genes. In the absence of functional annotation, genes encoding immune system proteins can thus be difficult to identify, as homology-based approaches generally cannot detect lineage-specific genes. Here, we use RNA-seq to compare the uninfected and infection-induced transcriptome in the parasitoid wasp Nasonia vitripennis to identify genes regulated by infection. We identify 183 genes significantly up-regulated by infection and 61 genes significantly down-regulated by infection. We also produce a new homology-based immune catalog in N. vitripennis, and show that most infection-induced genes are not assigned an immune function from homology alone, suggesting the potential for substantial novel immune components in less-well-studied systems. Finally, we show that a high proportion of these novel induced genes are taxonomically-restricted, highlighting the rapid evolution of immune gene content. The combination of functional annotation using RNA-seq and homology-based annotation provides a robust method to characterize the innate immune response across a wide variety of insects, and reveals significant novel features of the Nasonia immune response.