Our paper: Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Our next guest post is by Mike Eisen [@mbeisen] on his paper with Peter Combs [@rflrob]:
Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression. arXived here.

This is cross posted from Mike’s blog.

It’s no secret to people who read this blog that I hate the way scientific publishing works today. Most of my efforts in this domain have focused on removing barriers to the access and reuse of published papers. But there are other things that are broken with the way scientists communicate with each other, and chief amongst them is pre-publication peer review. I’ve written about this before, and won’t rehash the arguments here, save to say that I think we should publish first, and then review. But one could argue that I haven’t really practiced what I preach, as all of my lab’s papers have gone through peer review before they were published.

No more. From now on we are going to post all of our papers online when we feel they’re ready to share – before they go to a journal. We’ll then solicit comments from our colleagues and use them to improve the work prior to formal publication. Physicists and mathematicians have been doing this for decades, as have an increasing number of biologists. It’s time for this to become standard practice.

Some ground rules. I will not filter comments except to remove obvious spam. You are welcome to post comments under your name or under a pseudonym – I will not reveal anyone’s identity – but I urge you to use your real name as I think we should have fully open peer review in science.

OK. Now for the paper, which is posted on arXiv, where it can be linked to and cited. We also have a copy here, in case you’re having trouble with figures on arXiv.

Peter A. Combs and Michael B. Eisen (2013). Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression.

Several years ago a postdoc in my lab, Susan Lott (now at UC Davis), developed methods to sequence the RNAs from single Drosophila embryos. She was interested in looking at expression differences between males and females in early embryogenesis, and published a beautiful paper on that topic.

Although we were initially worried that we wouldn’t be able to get enough RNA from single embryos to get reliable sequencing results, it turns out we got more than enough. Each embryo yielded around 100 ng of total RNA, and we would end up loading only ~10% of the sample onto the sequencer. So it occurred to us that maybe we could work with material from pieces of individual embryos and thereby get spatial expression information on a genomic scale in a single quick experiment – an alternative to highly informative but slow imaging-based methods.

I recruited a new biophysics student, Peter Combs, to work on slicing embryos with a microtome along the anterior-posterior axis and sequencing each of the sections to identify genes with patterned expression along the A-P axis. In typical PI fashion, I figured this would take a few weeks, but it ended up taking over a year to get right.

The major challenge was that, while a tenth of an embryo contains more than enough RNA to analyze by mRNA-seq, it turned out to be very difficult to shepherd that RNA successfully from a single cryosection to the sequencer. Peter was routinely failing to recover RNA and make libraries from these samples using methods that worked great for whole embryos. While there are various protocols out there claiming to analyze RNA from single cells, we were reluctant to use these amplification-based strategies.

The typical way people deal with loss of small quantities of nucleic acids during experimental manipulation is to add carrier RNA or DNA – something like tRNA or salmon sperm DNA. We didn’t want to do that, since we would just end up with tons of useless sequencing reads. So we came up with a different strategy – adding embryos from distantly related Drosophila species to each slice at an early stage in the process. This brought the total amount of RNA in each sample well above the threshold where our purification and library preparation worked robustly, and we could easily separate the D. melanogaster RNA we were interested in for this experiment from that of the “carrier” embryo. We could also avoid wasting sequencing reads by turning the carrier RNAs into an experiment of their own – in this case looking at expression variation between species.

With this trick, the method now works great, and the paper is really just a description of the method and a demonstration that accurate expression patterns can be recovered from individual cryosectioned embryos. The resolution here is not that great – we used 6 slices of ~60 µm each per embryo. But we’ve started to make smaller sections, and a back-of-the-envelope calculation suggests we can, with available sample handling and sequencing techniques, make up to 100 slices per embryo. This would be more than enough to see stripes and other subtle patterns missed in the current dataset.
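To put rough numbers on that resolution, here is a minimal sketch of the per-slice arithmetic. It uses only figures quoted in this post (~100 ng of total RNA per embryo, 6 slices of ~60 µm) and makes two simplifying assumptions that are not from the paper: that the sectioned length stays at 6 × 60 µm, and that RNA is spread evenly along the axis. The function name `per_slice` is made up for illustration.

```python
# Figures quoted in the post: ~100 ng total RNA per embryo, 6 slices of ~60 um each.
RNA_PER_EMBRYO_NG = 100.0
CURRENT_SLICES = 6
CURRENT_THICKNESS_UM = 60.0


def per_slice(n_slices, sectioned_length_um, total_rna_ng):
    """Return (thickness in um, RNA in ng) per slice.

    Assumes RNA is distributed evenly along the sectioned length,
    which is a simplification made only for this estimate.
    """
    return sectioned_length_um / n_slices, total_rna_ng / n_slices


# Assumption: the total sectioned length is simply 6 x 60 um.
sectioned_length_um = CURRENT_SLICES * CURRENT_THICKNESS_UM

for n in (6, 100):
    thickness, rna = per_slice(n, sectioned_length_um, RNA_PER_EMBRYO_NG)
    print(f"{n:>3} slices: {thickness:.1f} um thick, ~{rna:.1f} ng RNA each")
```

Under these assumptions, going from 6 to 100 slices cuts the RNA available per slice by more than an order of magnitude, which is why a carrier strategy of the kind described above matters for finer sectioning.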

Our near-term goals are to do a developmental time course, compare patterns in male and female embryos, look at other species and examine embryos from strains carrying various patterning defects. For those of you going to the fly meeting in DC in April, Peter’s talk will, I hope, have some of this new data.

Anyway, we would love comments on either the method or the manuscript.

Sequencing mRNA from cryo-sliced Drosophila embryos to determine genome-wide spatial patterns of gene expression

Peter A. Combs, Michael B. Eisen
(Submitted on 19 Feb 2013)

Complex spatial and temporal patterns of gene expression underlie embryo differentiation, yet methods do not yet exist for the efficient genome-wide determination of spatial patterns of gene expression. In situ imaging of transcripts and proteins is the gold-standard, but is difficult and time consuming to apply to an entire genome, even when highly automated. Sequencing, in contrast, is fast and genome-wide, but generally applied to homogenized tissues, thereby discarding spatial information. At some point, these methods will converge, and we will be able to sequence RNAs in situ, simultaneously determining their identity and location. As a step along this path, we developed methods to cryosection individual blastoderm stage Drosophila melanogaster embryos along the anterior-posterior axis and sequence the mRNA isolated from each 60 µm slice. The spatial patterns of gene expression we infer closely match patterns determined by in situ hybridization and microscopy, where such data exist, and thus we conclude that we have generated the first genome-wide map of spatial patterns in the Drosophila embryo. We identify numerous genes with spatial patterns that have not yet been screened in the several ongoing systematic in situ based projects, the majority of which are localized to the posterior end of the embryo, likely in the pole cells. This simple experiment demonstrates the potential for combining careful anatomical dissection with high-throughput sequencing to obtain spatially resolved gene expression on a genome-wide scale.

Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description

Marc Santolini, Thierry Mora, Vincent Hakim
(Submitted on 18 Feb 2013)

The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair independently contributes to the transcription factor (TF) binding, despite mounting evidence of interdependence between base pairs positions. The recent availability of genome-wide data on TF-bound DNA regions offers the possibility to revisit this question in detail for TF binding in vivo. Here, we use available fly and mouse ChIPseq data, and show that the independent model generally does not reproduce the observed statistics of TFBS, generalizing previous observations. We further show that TFBS description and predictability can be systematically improved by taking into account pairwise correlations in the TFBS via the principle of maximum entropy. The resulting pairwise interaction model is formally equivalent to the disordered Potts models of statistical mechanics and it generalizes previous approaches to interdependent positions. Its structure allows for co-variation of two or more base pairs, as well as secondary motifs. Although models consisting of mixtures of PWMs also have this last feature, we show that pairwise interaction models outperform them. The significant pairwise interactions are found to be sparse and found dominantly between consecutive base pairs. Finally, the use of a pairwise interaction model for the identification of TFBSs is shown to give significantly different predictions than a model based on independent positions.
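As a rough illustration of what it means for an independent-position model to miss the observed statistics, the sketch below fits per-position base frequencies (the independence assumption underlying a PWM) to a handful of toy sites and compares the predicted probability of a full site with its observed frequency. The sites are invented for illustration and the function names are hypothetical; this is not the authors' maximum-entropy method, only a demonstration of the discrepancy it is meant to address.

```python
from collections import Counter

# Toy aligned binding sites (invented): positions 0 and 1 always co-vary
# (A with C, T with G), while position 2 is constant.
sites = ["ACG", "ACG", "TGG", "TGG", "ACG", "TGG"]


def position_frequencies(sites):
    """Per-position base frequencies, i.e. the independent-site model."""
    freqs = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        freqs.append({base: counts[base] / len(sites) for base in "ACGT"})
    return freqs


def independent_probability(seq, freqs):
    """Probability of seq if every position contributes independently."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= freqs[i][base]
    return p


def observed_probability(seq, sites):
    """Fraction of the observed sites that are exactly seq."""
    return sites.count(seq) / len(sites)


freqs = position_frequencies(sites)
for seq in ("ACG", "AGG"):
    print(seq,
          "independent:", independent_probability(seq, freqs),
          "observed:", observed_probability(seq, sites))
```

In this toy set the independent model gives "ACG" and "AGG" the same probability, even though one is common and the other never occurs. A pairwise term coupling positions 0 and 1 is the kind of correction the abstract describes.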

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

Simon Anders, Davis J. McCarthy, Yunshun Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, Mark D. Robinson
(Submitted on 15 Feb 2013)

RNA sequencing (RNA-seq) has been rapidly adopted for the multilayered profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations), while optionally adjusting for other systematic factors that affect the data collection process. There are a number of subtle yet critical aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, thus there is a need for guidance on current best practices. This protocol presents a “state-of-the-art” computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and in particular, two widely-used tools DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 hour, with computation time <1 day, even with modest resources.

Phylogenetic analysis of gene expression


Casey W. Dunn, Xi Luo, Zhijin Wu
(Submitted on 13 Feb 2013)

Phylogenetic analyses of gene expression have great potential for addressing a wide range of questions. These analyses will, for example, identify genes that have evolutionary shifts in expression that are correlated with evolutionary changes in morphological, physiological, and developmental characters of interest. This will provide entirely new opportunities to identify genes related to particular phenotypes. There are, however, three key challenges that must be addressed for such studies to realize their potential. First, gene expression data must be measured from multiple species, some of which may be field collected, and parameterized in such a way that they can be compared across species. Second, it will be necessary to develop phylogenetic comparative methods suitable for large multidimensional datasets. In most phylogenetic comparative studies to date, the number n of independent observations (independent contrasts) has been greater than the number p of variables (characters). The behavior of comparative methods for these classic n>p problems are now well understood under a wide variety of conditions. In gene expression studies, and studies based on other high-throughput tools, the number n of samples is dwarfed by the number p of variables. The estimated covariance matrices will be singular, complicating their analysis and interpretation, and prone to spurious results. Third, new approaches are needed to investigate the expression of the many genes whose phylogenies are not congruent with species phylogenies due to gene loss, gene duplication, and incomplete lineage sorting. Here we outline general project design considerations for phylogenetic analyses of gene expression, and suggest solutions to these three categories of challenges. These topics are relevant to high-throughput phenotypic data well beyond gene expression.
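The point about singular covariance matrices when n is much smaller than p can be checked directly. The sketch below is only an illustration with invented expression values (3 samples, 5 genes) and hypothetical helper names; it uses exact rational arithmetic so the rank is not blurred by floating-point error. It shows that the sample covariance matrix has rank at most n − 1, so it cannot be inverted when p exceeds that.

```python
from fractions import Fraction


def covariance_matrix(samples):
    """Sample covariance (p x p) for n samples (rows) over p variables (columns)."""
    n, p = len(samples), len(samples[0])
    means = [sum(Fraction(row[j]) for row in samples) / n for j in range(p)]
    centered = [[Fraction(row[j]) - means[j] for j in range(p)] for row in samples]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
            for a in range(p)]


def matrix_rank(matrix):
    """Rank by Gaussian elimination; exact because entries are Fractions."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    pivots = 0
    for col in range(n_cols):
        pivot = next((r for r in range(pivots, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[pivots], m[pivot] = m[pivot], m[pivots]
        for r in range(pivots + 1, n_rows):
            factor = m[r][col] / m[pivots][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[pivots])]
        pivots += 1
    return pivots


# Invented data: n = 3 samples (e.g. species), p = 5 variables (e.g. genes).
expression = [
    [5, 1, 0, 2, 7],
    [3, 4, 1, 0, 2],
    [8, 2, 6, 1, 1],
]
cov = covariance_matrix(expression)
print(f"n = {len(expression)}, p = {len(cov)}, rank of covariance = {matrix_rank(cov)}")
```

Because the 5 × 5 matrix here has rank below 5, any method that needs its inverse (as many classical comparative methods do) requires regularization or dimension reduction first.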

Our Paper: Transcript length mediates developmental timing of gene expression across Drosophila.

This guest post is a commentary by Carlo Artieri on “Transcript length mediates developmental timing of gene expression across Drosophila” by Artieri, C.G. and H.B. Fraser. The preprint is arXived here.

We have recently posted a preprint manuscript to arXiv that tests a decades-old hypothesis about how biological aspects of development constrain gene structure, using several genome-scale transcriptional timecourses, and interprets its effects in the context of Drosophila evolution. The paper may be of particular interest to researchers using genomic data in evo-devo studies.

During the early stages of identification and characterization of homeobox domain (HOX) genes and their related regulators, it was noted that they were activated in a temporally sequential manner roughly correlated to their pre-mRNA transcript length (i.e., short genes express early, followed by longer genes). This led to the hypothesis that this pattern was produced by a purely physical mechanism (Gubb 1986): genes with long pre-mRNAs cannot complete transcription in the interval between the rapid cell cycles taking place during early insect development, leading to abortive, non-functional transcripts. As long pre-mRNAs result primarily from long introns, this was termed ‘Intron Delay’.
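The physical argument behind intron delay comes down to a simple comparison: the time needed to transcribe a pre-mRNA versus the time available between divisions. The sketch below spells that out. The elongation rate and the transcription window are placeholder values chosen only to make the arithmetic visible; they are assumptions, not measurements from this post or the paper, and the function names are hypothetical.

```python
# Placeholder parameters (assumptions for illustration, NOT measured values):
ELONGATION_KB_PER_MIN = 1.0   # assumed polymerase elongation rate
WINDOW_MIN = 5.0              # assumed time available between rapid divisions


def transcription_minutes(pre_mrna_kb, elongation_kb_per_min):
    """Time to transcribe a pre-mRNA of the given length end to end."""
    return pre_mrna_kb / elongation_kb_per_min


def completes_within_window(pre_mrna_kb, elongation_kb_per_min, window_min):
    """True if transcription can finish before the next division interrupts it."""
    return transcription_minutes(pre_mrna_kb, elongation_kb_per_min) <= window_min


for length_kb in (2.0, 20.0):
    minutes = transcription_minutes(length_kb, ELONGATION_KB_PER_MIN)
    ok = completes_within_window(length_kb, ELONGATION_KB_PER_MIN, WINDOW_MIN)
    print(f"{length_kb:>5.1f} kb pre-mRNA: {minutes:.1f} min to transcribe, "
          f"{'completes' if ok else 'aborted'} under these assumed values")
```

Whatever the true rate and cycle length are, the hypothesis predicts a length cutoff above which transcripts cannot be completed during the fastest cycles, and that cutoff rises as the cycles slow down.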

We explored patterns of expression of genes in D. melanogaster over two embryonic timescales: eight time points spanning the latter part of the early embryonic ‘syncytial cycles’, during which the most rapid cell cycles take place, and 12 time points spanning the ~24 hours of embryogenesis. Long genes (≥ 5 kb long pre-mRNA transcripts) expressed from the zygotic genome showed a lag in the time required to reach stable levels of expression relative to short genes (< 5 kb) in both timecourses; in fact, stable expression of long genes did not occur until ~12 hours into embryogenesis, or midway between fertilization and emergence of larva from the egg. No such pattern was observed among long or short genes that are maternally deposited in the embryo, as is expected if inability to terminate transcription is the driving mechanism behind this delay. Additional embryonic timecourse data from RNA-Seq libraries generated from non poly-A selected total RNA, and therefore not biased towards capture of processed RNAs, showed that only long zygotic genes expressed during the earliest developmental time points show a marked deficiency in 3’ relative to 5’ derived reads. This is consistent with their inability to terminate transcription, but not with transcriptional delay due to reduced transcriptional activation during early development.

The analysis was extended using developmental expression data from three additional Drosophila species spanning ~60 million years of evolution and showed that this pattern of delayed expression of long zygotically expressed genes is conserved across the phylogeny. This led us to predict that short zygotically expressed genes that are conserved in their ability to escape intron delay would be under substantial evolutionary pressure to maintain their compact lengths, and we found that this was the case when compared to long zygotic or either short or long maternally deposited genes.

We suggest that intron delay is an underappreciated mechanism affecting the expression level of a substantial fraction of the Drosophila embryonic transcriptome (~10%) and acts as a source of significant constraint on the structural evolution of important developmental genes.

References:
Gubb D. 1986. Intron‐delay and the precision of expression of homoeotic gene products in Drosophila. Developmental Genetics 7: 119–131

Comprehensive evaluation of differential expression analysis methods for RNA-seq data

Franck Rapaport, Raya Khanin, Yupu Liang, Azra Krek, Paul Zumbo, Christopher E. Mason, Nicholas D. Socci, Doron Betel
(Submitted on 22 Jan 2013)

High-throughput sequencing of RNA transcripts (RNA-seq) has become the method of choice for detection of differential expression (DE). Concurrent with the growing popularity of this technology there has been a significant research effort devoted towards understanding the statistical properties of this data and the development of analysis methods. We report on a comprehensive evaluation of the commonly used DE methods using the SEQC benchmark data set. We evaluate a number of key features including: assessment of normalization, accuracy of DE detection, modeling of genes expressed in only one condition, and the impact of sequencing depth and number of replications on identifying DE genes. We find significant differences among the methods with no single method consistently outperforming the others. Furthermore, the performance of array-based approach is comparable to methods customized for RNA-seq data. Perhaps most importantly, our results demonstrate that increasing the number of replicate samples provides significantly more detection power than increased sequencing depth.

Transcript length mediates developmental timing of gene expression across Drosophila

Carlo G. Artieri, Hunter B. Fraser
(Submitted on 18 Jan 2013)

The time required to transcribe genes with long primary transcripts may limit their ability to be expressed in cells with short mitotic cycles, a phenomenon termed intron delay. As such short cycles are a hallmark of the earliest stages of insect development, we used Drosophila developmental timecourse expression data to test whether intron delay affects gene expression genome-wide, and to determine its consequences for the evolution of gene structure. We find that long zygotically expressed, but not maternally deposited, genes show substantial delay in expression relative to their shorter counterparts and that this delay persists over a substantial portion of the ~24 hours of embryogenesis. Patterns of RNA-seq coverage from the 5′ and 3′ ends of transcripts show that this delay is consistent with their inability to terminate transcription, but not with transcriptional initiation-based regulatory control. Highly expressed zygotic genes are subject to purifying selection to maintain compact transcribed regions, allowing conservation of embryonic expression patterns across the Drosophila phylogeny. We propose that intron delay is an underappreciated physical mechanism affecting both patterns of expression as well as gene structure of many genes across Drosophila.

A comparative analysis of transcription factor expression during metazoan embryonic development

Alicia Schep, Boris Adryan
(Submitted on 8 Jan 2013)

During embryonic development, a complex organism is formed from a single starting cell. These processes of growth and differentiation are driven by large transcriptional changes, which follow the expression and activity of transcription factors (TFs). This study sought to compare TF expression during embryonic development in a diverse group of metazoan animals: representatives of vertebrates (Danio rerio, Xenopus tropicalis), a chordate (Ciona intestinalis) and invertebrate phyla such as insects (Drosophila melanogaster, Anopheles gambiae) and nematodes (Caenorhabditis elegans) were sampled. The different species showed overall very similar TF expression patterns, with TF expression increasing during the initial stages of development. C2H2 zinc finger TFs were over-represented and Homeobox TFs were under-represented in the early stages in all species. We further clustered TFs for each species based on their quantitative temporal expression profiles. This showed very similar TF expression trends in development in vertebrate and insect species. However, analysis of the expression of orthologous pairs between more closely related species showed that expression of most individual TFs is not conserved, following the general model of duplication and diversification. The degree of similarity between TF expression between Xenopus tropicalis and Danio rerio followed the hourglass model, with the greatest similarity occurring during the early tailbud stage in Xenopus tropicalis and the late segmentation stage in Danio rerio. However, for Drosophila melanogaster and Anopheles gambiae there were two periods of high TF transcriptome similarity, one during the Arthropod phylotypic stage at 8-10 hours into Drosophila development and the other later at 16-18 hours into Drosophila development.

Our paper: Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals

Our next “our paper” guest post is by Vincent Lynch [@VinJLynch] who’s just joined the UChicago faculty from a postdoc at Yale. He’s posting about his recently arXived paper:

Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals. ArXived here.
_________________________________________________________________________________
Explaining how morphology evolves is a major challenge in biology. While it’s clear that changes in gene regulation are ultimately responsible for the development and evolution of complex characters, we are only just beginning to understand the molecular mechanisms of gene regulatory evolution. This is largely due to the emergence of new technologies, such as mRNA-Seq and ChIP-Seq, which give biologists the tools to explore evolution across the genome and in non-model species.

We took advantage of these methods to explore the evolution of gene expression in the uterus during the origin of pregnancy in mammals. Using mRNA-Seq, we show that gene expression evolved extremely rapidly during major stages in the evolution of pregnancy, for example during the origin of maternal resource provisioning in the stem-lineage of Mammalia, placentation in the stem-lineage of Theria, and implantation in the stem-lineage of Eutheria. ChIP-Seq identification of the cis-regulatory elements of genes recruited into uterine expression in mammals suggests that the majority of these enhancers and promoters are derived from mammalian lineage-specific transposons.

While recent technological advances are changing the way we do biology (see Wagner 2013), as these emerging methods come into the mainstream we must collectively define our new standards of evidence. What experiments and methods build a convincing case for X? Is it sufficient, for example, to conclude that a transposon donated a novel promoter to a gene if a ChIP-Seq peak for a histone mark associated with promoters lies within the transposon? If we then expand that observation across the genome, can we reasonably conclude that transposons are causally responsible for gene regulatory change? For these reasons we chose to post our manuscript as a work-in-progress to arXiv, both as our contribution to the larger discussion of what constitutes the standards of evidence in this emerging field of biology and as an opportunity to receive feedback from our colleagues to complement formal peer-review.

Vincent Lynch