Assessing allele specific expression across multiple tissues from RNA-seq read data
Matti Pirinen, Tuuli Lappalainen, Noah A Zaitlen, GTEx Consortium, Emmanouil T Dermitzakis, Peter Donnelly, Mark I McCarthy, Manuel A Rivas
Motivation: RNA sequencing enables allele specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression project (GTEx) is collecting RNA-seq data on multiple tissues from the same set of individuals, and novel methods are required for the analysis of these data. Results: We present a statistical method to compare different patterns of ASE across tissues and to classify genetic variants according to their impact on the tissue-wide expression profile. We focus on strong ASE effects that we expect to see for protein-truncating variants, but our method can also be adjusted for other types of ASE effects. We illustrate the method with a real data example on a tissue-wide expression profile of a variant causal for lipoid proteinosis, and with a simulation study to assess our method more generally. Availability: MAMBA software: http://birch.well.ox.ac.uk/~rivas/mamba/ R source code and data examples: http://www.iki.fi/mpirinen/ Contact: firstname.lastname@example.org email@example.com
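As a rough illustration of what an ASE signal looks like in read counts, the sketch below runs an exact two-sided binomial test per tissue against a null of balanced allelic expression. This is not the method of the paper (which compares ASE patterns across tissues jointly); it is a much simpler per-tissue test swapped in for illustration. The function name, the tissue labels, and the read counts are all hypothetical.

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic imbalance.

    Null hypothesis: each read carries the reference allele with
    probability p_ref (0.5 means no ASE). Intended for modest read
    counts only, since the probabilities are computed as plain floats.
    """
    n = ref_reads + alt_reads
    probs = [comb(n, k) * p_ref**k * (1 - p_ref)**(n - k) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)

# Hypothetical (reference, alternative) read counts at one heterozygous site.
counts_by_tissue = {"tissue_1": (30, 28), "tissue_2": (45, 5)}
p_values = {t: binomial_two_sided_p(r, a) for t, (r, a) in counts_by_tissue.items()}
```

With these toy counts, the second tissue shows a strong imbalance of the kind the abstract associates with protein-truncating variants, while the first does not. A per-tissue test like this ignores sharing of information across tissues, which is the gap the abstract's method is meant to address.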
RNA-seq gene profiling – a systematic empirical comparison
Nuno A Fonseca, John A Marioni, Alvis Brazma
Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNA-seq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNA-seq data, it still allows us to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
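The two kinds of summary mentioned in the abstract above (an overall correlation with the true values, and a per-gene error that can single out poorly estimated genes) can be sketched as follows. The abstract does not say which correlation or error measure was used; Pearson correlation on log2(x + 1) values and a pseudocount-stabilised relative error are assumptions made here, and the gene names and expression values are invented toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def relative_errors(true_expr, est_expr, pseudocount=1.0):
    """Per-gene |estimate - truth| / (truth + pseudocount)."""
    return {g: abs(est_expr[g] - true_expr[g]) / (true_expr[g] + pseudocount)
            for g in true_expr}

# Toy simulated truth and one pipeline's estimates (hypothetical values).
true_expr = {"geneA": 100.0, "geneB": 50.0, "geneC": 10.0, "geneD": 0.0}
est_expr = {"geneA": 110.0, "geneB": 45.0, "geneC": 30.0, "geneD": 1.0}

genes = sorted(true_expr)
r = pearson([math.log2(true_expr[g] + 1) for g in genes],
            [math.log2(est_expr[g] + 1) for g in genes])
errors = relative_errors(true_expr, est_expr)
worst_gene = max(errors, key=errors.get)
```

In this toy case the overall correlation is high even though one gene is badly misestimated, which mirrors findings i) and ii): a good global correlation does not rule out large errors for individual genes.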
THE GENETIC LANDSCAPE OF TRANSCRIPTIONAL NETWORKS IN A COMBINED HAPLOID/DIPLOID PLANT SYSTEM
Jukka-Pekka Verta, Christian R Landry, John J MacKay
Heritable variation in gene expression is a source of evolutionary change and our understanding of the genetic basis of expression variation remains incomplete. Here, we dissected the genetic basis of transcriptional variation in a wild, outbreeding gymnosperm (Picea glauca) according to linked and unlinked genetic variants, their allele-specific (cis) and allele non-specific (trans) effects, and their phenotypic additivity. We used a novel plant system that is based on the analysis of segregating alleles of a single self-fertilized plant in haploid and diploid seed tissues. We measured transcript abundance and identified transcribed SNPs in 66 seeds with RNA-seq. Linked and unlinked genetic effects that influenced expression levels were abundant in the haploid megagametophyte tissue, influencing 48% and 38% of analyzed genes, respectively. Analysis of these effects in diploid embryos revealed that while distant effects were acting in trans consistent with their hypothesized diffusible nature, local effects were associated with a complex mix of cis, trans and compensatory effects. Most cis effects were additive irrespective of their effect sizes, consistent with a hypothesis that they represent rate-limiting factors in transcript accumulation. We show that trans effects fulfilled a key prediction of Wright's physiological theory, in which variants with small effects tend to be additive and those with large effects tend to be dominant/recessive. Our haploid/diploid approach allows a comprehensive genetic dissection of expression variation and can be applied to a large number of wild plant species.
Benchmark Analysis of Algorithms for Determining and Quantifying Full-length mRNA Splice Forms from RNA-Seq Data
Katharina Hayer, Angel Pizzaro, Nicholas L Lahens, John B Hogenesch, Gregory R Grant
The advantages of RNA sequencing (RNA-Seq) suggest it will replace microarrays for highly parallel gene expression analysis. For example, in contrast to arrays, RNA-Seq is expected to be able to provide accurate identification and quantification of full-length transcripts. A number of methods have been developed for this purpose, but short, error-prone reads make it a difficult problem in practice. It is essential to determine which algorithms perform best, and where and why they fail. However, there is a dearth of independent and unbiased benchmarking studies of these algorithms. Here we take an approach using both simulated and experimental benchmark data to evaluate their accuracy. We conclude that most methods are inaccurate even using idealized data, and that no method is sufficiently accurate once complicating factors such as polymorphisms, intron signal, sequencing error, and multiple splice forms are present. These results point to the pressing need for further algorithm development.
Pervasive variation of transcription factor orthologs contributes to regulatory network evolution
Shilpa Nadimpalli, Anton V. Persikov, Mona Singh
Comments: 29 pages, 5 figures, 5 supplemental figures, 3 supplemental tables
Subjects: Genomics (q-bio.GN)
Differences in transcriptional regulatory networks underlie much of the phenotypic variation observed across organisms. Changes to cis-regulatory elements are widely believed to be the predominant means by which regulatory networks evolve, yet examples of regulatory network divergence due to transcription factor (TF) variation have also been observed. To systematically ascertain the extent to which TFs contribute to regulatory divergence, we analyzed the evolution of the largest class of metazoan TFs, Cys2-His2 zinc finger (C2H2-ZF) TFs, across 12 Drosophila species spanning ~45 million years of evolution. Remarkably, we uncovered that a significant fraction of all C2H2-ZF 1-to-1 orthologs in flies exhibit variations that can affect their DNA-binding specificities. In addition to loss and recruitment of C2H2-ZF domains, we found diverging DNA-contacting residues in ~47% of domains shared between D. melanogaster and the other fly species. These diverging DNA-contacting residues, found in ~66% of the D. melanogaster C2H2-ZF genes in our analysis and corresponding to ~24% of all annotated D. melanogaster TFs, show evidence of functional constraint: they tend to be conserved across phylogenetic clades and evolve slower than other diverging residues. These same variations were rarely found as polymorphisms within a population of D. melanogaster flies, indicating their rapid fixation. The predicted specificities of these dynamic domains gradually change across phylogenetic distances, suggesting stepwise evolutionary trajectories for TF divergence. Further, whereas proteins with conserved C2H2-ZF domains are enriched in developmental functions, those with varying domains exhibit no functional enrichments. Our work suggests that a subset of highly dynamic and largely unstudied TFs are a likely source of regulatory variation in Drosophila and other metazoans.
Genome-wide Identification of Zero Nucleotide Recursive Splicing in Drosophila
Michael O Duff, Sara Olson, Xintao Wei, Ahmad Osman, Alex Plocik, Mohan Bolisetty, Susan Celniker, Brenton Graveley
Recursive splicing is a process in which large introns are removed in multiple steps by resplicing at ratchet points – 5′ splice sites recreated after splicing. Recursive splicing was first identified in the Drosophila Ultrabithorax (Ubx) gene and only three additional Drosophila genes have since been experimentally shown to undergo recursive splicing. Here, we identify 196 zero nucleotide exon ratchet points in 130 introns of 115 Drosophila genes from total RNA sequencing data generated from developmental time points, dissected tissues, and cultured cells. Recursive splicing events were identified by splice junctions that map to annotated 5′ splice sites and unannotated intronic 3′ splice sites, the presence of the sequence AG/GT at the 3′ splice site, and a 5′ to 3′ gradient of decreasing RNA-Seq read density indicative of co-transcriptional splicing. The sequential nature of recursive splicing was confirmed by identification of lariat introns generated by splicing to and from the ratchet points. We also show that recursive splicing is a constitutive process, and that the sequence and function of ratchet points are evolutionarily conserved. Together these results indicate that recursive splicing is commonly used in Drosophila and provides insight into the mechanisms by which some introns are removed.
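One of the criteria listed above, the AG/GT sequence at the intronic 3′ splice site, is easy to illustrate in isolation. The sketch below scans an intron sequence for the tetramer AGGT and reports where the recreated 5′ splice site (the GT) would begin. It covers only the sequence-motif criterion; the junction-read and read-density evidence described in the abstract is not modelled. The function name and the example sequence are hypothetical.

```python
def candidate_ratchet_points(intron_seq):
    """Return 0-based positions of the G in GT for every AGGT in the sequence.

    Each hit marks a place where splicing to the upstream AG would leave
    a GT that could act as a new 5' splice site. Motif presence alone is
    a weak filter: AGGT occurs often by chance in long introns.
    """
    seq = intron_seq.upper()
    hits = []
    start = seq.find("AGGT")
    while start != -1:
        hits.append(start + 2)  # junction lies between AG and GT
        start = seq.find("AGGT", start + 1)
    return hits

# Made-up intron fragment containing two AGGT motifs.
intron = "GTAAGTTTTCCAGGTAAGTCCCTTTTTCAGGTGAGTTTTAG"
positions = candidate_ratchet_points(intron)
```

In practice, such candidates would only be kept if supported by the other lines of evidence named in the abstract.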
Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels.
Nicholas E Banovich, Xun Lan, Graham McVicker, Bryce Van de Geijn, Jacob F Degner, John D. Blischak, Jonathan K. Pritchard, Yoav Gilad
DNA methylation is an important epigenetic regulator of gene expression. Recent studies have revealed widespread associations between genetic variation and methylation levels. However, the mechanistic links between genetic variation and methylation remain unclear. To begin addressing this gap, we collected methylation data at ~300,000 loci in lymphoblastoid cell lines (LCLs) from 64 HapMap Yoruba individuals, and genome-wide bisulfite sequence data in ten of these individuals. We identified (at an FDR of 10%) 11,752 methylation QTLs (meQTLs), i.e., loci in which genetic variation is associated with changes in DNA methylation. We found that meQTLs are frequently associated with changes in methylation at multiple CpGs across regions of up to 3 kb. Interestingly, meQTLs are also frequently associated with variation in other properties of gene regulation, including histone modifications, DNase I accessibility, chromatin accessibility, and expression levels of nearby genes. These observations suggest that genetic variants may lead to coordinated molecular changes in all of these regulatory phenotypes. One plausible driver of coordinated changes in different regulatory mechanisms is variation in transcription factor (TF) binding. Indeed, we found that SNPs that change predicted TF binding affinities are significantly enriched for associations with DNA methylation at nearby CpGs. Taken together, our observations are consistent with a model whereby changes in TF binding may frequently drive coordinated changes in DNA methylation, histone modification, and gene expression levels.
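The abstract reports meQTLs "at an FDR of 10%" without naming the procedure used to control the false discovery rate. As an assumption for illustration only, the sketch below applies the Benjamini-Hochberg step-up procedure, a common choice, to a handful of invented p-values; the function name and the data are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a list of booleans marking tests significant at the given FDR.

    Benjamini-Hochberg step-up: sort the p-values, find the largest rank k
    with p_(k) <= (k / m) * fdr, and call the k smallest p-values significant.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            max_k = rank
    significant = [False] * m
    for idx in order[:max_k]:
        significant[idx] = True
    return significant

# Invented association p-values for five SNP-CpG pairs.
pvals = [0.001, 0.04, 0.03, 0.5, 0.012]
calls = benjamini_hochberg(pvals, fdr=0.10)
```

Note that the step-up rule can accept a p-value that misses its own threshold as long as a larger-ranked one passes, which is why the loop tracks the largest passing rank rather than stopping at the first failure.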
Validation of methods for Low-volume RNA-seq
Peter Acuña Combs, Michael B Eisen
Recently, a number of protocols extending RNA-sequencing to the single-cell regime have been published. However, we were concerned that the additional steps to deal with such minute quantities of input sample would introduce serious biases that would make analysis of the data using existing approaches invalid. In this study, we performed a critical evaluation of several of these low-volume RNA-seq protocols, and found that they performed slightly less well in metrics of interest to us than a more standard protocol, but with at least two orders of magnitude less sample required. We also explored a simple modification to one of these protocols that, for many samples, reduced the cost of library preparation to approximately $20/sample.
High-resolution transcriptome analysis with long-read RNA sequencing
Hyunghoon Cho, Joe Davis, Xin Li, Kevin S. Smith, Alexis Battle, Stephen B. Montgomery
Comments: 29 pages, 8 figures, 11 supplementary figures
Subjects: Genomics (q-bio.GN)
RNA sequencing (RNA-seq) enables characterization and quantification of individual transcriptomes as well as detection of patterns of allelic expression and alternative splicing. Current RNA-seq protocols depend on high-throughput short-read sequencing of cDNA. However, as ongoing advances are rapidly yielding increasing read lengths, a technical hurdle remains in identifying the degree to which differences in read length influence various transcriptome analyses. In this study, we generated two paired-end RNA-seq datasets of differing read lengths (2×75 bp and 2×262 bp) for lymphoblastoid cell line GM12878 and compared the effect of read length on transcriptome analyses, including read-mapping performance, gene and transcript quantification, and detection of allele-specific expression (ASE) and allele-specific alternative splicing (ASAS) patterns. Our results indicate that, while the current long-read protocol is considerably more expensive than short-read sequencing, there are important benefits that can only be achieved with longer read length, including lower mapping bias and reduced ambiguity in assigning reads to genomic elements, such as mRNA transcripts. We show that these benefits ultimately lead to improved detection of cis-acting regulatory and splicing variation effects within individuals.
Cis-regulatory elements and human evolution
Adam Siepel, Leonardo Arbiza
Modification of gene regulation has long been considered an important force in human evolution, particularly through changes to cis-regulatory elements (CREs) that function in transcriptional regulation. For decades, however, the study of cis-regulatory evolution was severely limited by the available data. New data sets describing the locations of CREs and genetic variation within and between species have now made it possible to study CRE evolution much more directly on a genome-wide scale. Here, we review recent research on the evolution of CREs in humans based on large-scale genomic data sets. We consider inferences based on primate divergence, human polymorphism, and combinations of divergence and polymorphism. We then consider “new frontiers” in this field stemming from recent research on transcriptional regulation.