Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

Count-based differential expression analysis of RNA sequencing data using R and Bioconductor
Simon Anders, Davis J. McCarthy, Yunshen Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, Mark D. Robinson
(Submitted on 15 Feb 2013)

RNA sequencing (RNA-seq) has been rapidly adopted for the multilayered profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations), while optionally adjusting for other systematic factors that affect the data collection process. There are a number of subtle yet critical aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, thus there is a need for guidance on current best practices. This protocol presents a “state-of-the-art” computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and in particular, two widely-used tools DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 hour, with computation time <1 day, even with modest resources.


Muller’s ratchet with overlapping generations

Muller’s ratchet with overlapping generations
Jakob J. Metzger, Stephan Eule
(Submitted on 14 Feb 2013)

Muller’s ratchet is a paradigmatic model for the accumulation of deleterious mutations in a population of finite size. A click of the ratchet occurs when all individuals with the least number of deleterious mutations are lost irreversibly due to a stochastic fluctuation. In spite of the simplicity of the model, a quantitative understanding of the process remains an open challenge. In contrast to previous works, we here study a model of the ratchet with overlapping generations. Employing an approximation which describes the fittest individuals as one class and the rest as a second class, we obtain closed analytical expressions of the ratchet rate in the rare clicking regime. As a click in this regime is caused by a rare large fluctuation from a metastable state, we do not resort to a diffusion approximation but apply an approximation scheme which is especially well suited to describe extinction events from metastable states. This method also allows for a derivation of expressions for the quasi-stationary distribution of the fittest class. Additionally, we show numerically that the formulation with overlapping generations leads to similar results as the standard model with non-overlapping generations and the diffusion approximation in the regime where the ratchet clicks frequently. For parameter values closer to the rare clicking regime, however, we find that the rate of Muller’s ratchet strongly depends on the microscopic reproduction model.

Disentangling the effects of geographic and ecological isolation on genetic differentiation

Disentangling the effects of geographic and ecological isolation on genetic differentiation
Gideon Bradburd, Peter Ralph, Graham Coop
(Submitted on 13 Feb 2013)

Populations can be genetically isolated by geographic distance and by differences in their ecology or environment that decrease the rate of successful migration. Empirical studies often seek to investigate the relationship between genetic differentiation and some ecological variable(s) while accounting for geographic distance, but common approaches to this problem (e.g. the partial Mantel test) have a number of drawbacks. In this article, we present a Bayesian method that enables users to quantify the relative contributions of geographic distance and ecological distance to genetic differentiation between sampled populations or individuals. We model the allele frequencies in populations at a set of unlinked loci as spatial Gaussian processes, and model the covariance structure of pairs of populations as a decreasing function of both geographic and ecological distance between that pair. Parameters of the model are estimated using a Markov chain Monte Carlo algorithm. We have implemented this method, Bayesian Estimation of Differentiation in Alleles by Spatial Structure and Local Ecology (BEDASSLE), in a user-friendly format in the statistical platform R, and we demonstrate its utility with a simulation study and empirical applications to human and teosinte datasets.

Slow evolution of vertebrates with large genomes

Slow evolution of vertebrates with large genomes
Bianca Sclavi, John Herrick

Darwin introduced the concept of the “living fossil” to describe species belonging to lineages that have experienced little evolutionary change, and suggested that species in more slowly evolving lineages are more prone to extinction (1). Recent studies revealed that some living fossils such as the lungfish are indeed evolving more slowly than other vertebrates (2, 3). The reason for the slower rate of evolution in these lineages remains unclear, but the same observations suggest a possible genome size effect on rates of evolution. Genome size (C-value) in vertebrates varies over 200 fold ranging from pufferfish (0.4 pg) to lungfish (132.8 pg) (4). Variation in genome size and architecture is a fundamental cellular adaptation that remains poorly understood (5). C-value is correlated with several allometric traits such as body size and developmental rates in many, but not all, organisms (6, 7). To date, no consensus exists concerning the mechanisms driving genome size evolution or the effect that genome size has on species traits such as evolutionary rates (8-12). In the following we show that: 1) within the same range of divergence times, genetic diversity decreases as genome size increases and 2) average rates of molecular evolution decline with increasing genome size in vertebrates. Together, these observations indicate that genome size is an important factor influencing rates of speciation and extinction.

Phylogenetic analysis of gene expression

Phylogenetic analysis of gene expression

Casey W. Dunn, Xi Luo, Zhijin Wu
(Submitted on 13 Feb 2013)

Phylogenetic analyses of gene expression have great potential for addressing a wide range of questions. These analyses will, for example, identify genes that have evolutionary shifts in expression that are correlated with evolutionary changes in morphological, physiological, and developmental characters of interest. This will provide entirely new opportunities to identify genes related to particular phenotypes. There are, however, three key challenges that must be addressed for such studies to realize their potential. First, gene expression data must be measured from multiple species, some of which may be field collected, and parameterized in such a way that they can be compared across species. Second, it will be necessary to develop phylogenetic comparative methods suitable for large multidimensional datasets. In most phylogenetic comparative studies to date, the number n of independent observations (independent contrasts) has been greater than the number p of variables (characters). The behavior of comparative methods for these classic n>p problems are now well understood under a wide variety of conditions. In gene expression studies, and studies based on other high-throughput tools, the number n of samples is dwarfed by the number p of variables. The estimated covariance matrices will be singular, complicating their analysis and interpretation, and prone to spurious results. Third, new approaches are needed to investigate the expression of the many genes whose phylogenies are not congruent with species phylogenies due to gene loss, gene duplication, and incomplete lineage sorting. Here we outline general project design considerations for phylogenetic analyses of gene expression, and suggest solutions to these three categories of challenges. These topics are relevant to high-throughput phenotypic data well beyond gene expression.

Dualities in population genetics: a fresh look with new dualities

Dualities in population genetics: a fresh look with new dualities

Gioia Carinci, Cristian Giardina’, Claudio Giberti, Frank Redig
(Submitted on 13 Feb 2013)

We apply our general method of duality, introduced in [Giardina’, Kurchan, Redig, J. Math. Phys. 48, 033301 (2007)], to models of population dynamics. The classical dualities between forward and ancestral processes can be viewed as a change of representation in the classical creation and annihilation operators, both for diffusions dual to coalescents of Kingman’s type, as well as for models with finite population size. Next, using SU(1,1) raising and lowering operators, we find new dualities between the Wright-Fisher diffusion with $d$ types and the Moran model, both in presence and absence of mutations. These new dualities relates two forward evolutions. From our general scheme we also identify self-duality of the Moran model.

Population genetics of neutral mutations in exponentially growing cancer cell populations

Population genetics of neutral mutations in exponentially growing cancer cell populations
Rick Durrett
(Submitted on 12 Feb 2013)

In order to analyze data from cancer genome sequencing projects, we need to be able to distinguish causative, or “driver,” mutations from “passenger” mutations that have no selective effect. Toward this end, we prove results concerning the frequency of neutural mutations in exponentially growing multitype branching processes that have been widely used in cancer modeling. Our results yield a simple new population genetics result for the site frequency spectrum of a sample from an exponentially growing population.

Population Genetics of Rare Variants and Complex Diseases

Population Genetics of Rare Variants and Complex Diseases
M. Cyrus Maher, Lawrence H. Uricchio, Dara G. Torgerson, Ryan D. Hernandez
(Submitted on 12 Feb 2013)

Identifying drivers of complex traits from the noisy signals of genetic variation obtained from high throughput genome sequencing technologies is a central challenge faced by human geneticists today. We hypothesize that the variants involved in complex diseases are likely to exhibit non-neutral evolutionary signatures. Uncovering the evolutionary history of all variants is therefore of intrinsic interest for complex disease research. However, doing so necessitates the simultaneous elucidation of the targets of natural selection and population-specific demographic history. Here we characterize the action of natural selection operating across complex disease categories, and use population genetic simulations to evaluate the expected patterns of genetic variation in large samples. We focus on populations that have experienced historical bottlenecks followed by explosive growth (consistent with most human populations), and describe the differences between evolutionarily deleterious mutations and those that are neutral. Genes associated with several complex disease categories exhibit stronger signatures of purifying selection than non-disease genes. In addition, loci identified through genome-wide association studies of complex traits also exhibit signatures consistent with being in regions recurrently targeted by purifying selection. Through simulations, we show that population bottlenecks and rapid growth enables deleterious rare variants to persist at low frequencies just as long as neutral variants, but low frequency and common variants tend to be much younger than neutral variants. This has resulted in a large proportion of modern-day rare alleles that have a deleterious effect on function, and that potentially contribute to disease susceptibility.

Our Paper: Transcript length mediates developmental timing of gene expression across Drosophila.

This guest post is a commentary by Carlo Artieri on “Transcript length mediates developmental timing of gene expression across Drosophila” by Artieri, C.G. and H.B. Fraser. The preprint is arXived here.

We have recently posted a preprint manuscript to arXiv that tests a decades-old hypothesis about how biological aspects of development constraint gene structure using several genome-scale transcriptional timecourses and interpret its effects in the context of Drosophila evolution. The paper may be of particular interest to researchers using genomic data in evo-devo studies.

During the early stages of identification and characterization of homeobox
domain (HOX) genes and their related regulators, it was noted that they activated in a temporally sequential manner roughly correlated to their pre-mRNA transcript length (i.e., short genes express early, followed by longer genes.) This led to the hypothesis that this pattern was produced by a purely physical mechanism (Gubb 1986): genes with long pre-mRNAs cannot complete transcription in the interval between the rapid cell cycles taking place during early insect development, leading to abortive, non-functional transcripts. As long pre-mRNAs result primarily from long introns, this was termed ‘Intron Delay’.

We explored patterns of expression of genes in D. melanogaster over two embryonic timescales: eight time points spanning the latter part of the early embryonic ‘syncytial cycles’, during which the most rapid cell cycles take place, and 12 time points spanning the ~24 hours of embryogenesis. Long genes (≥ 5 kb long pre-mRNA transcripts) expressed from the zygotic genome showed a lag in the time required to reach stable levels of expression relative to short genes (< 5 kb) in both timecourses; in fact, stable expression of long genes did not occur until ~12 hours into embryogenesis, or midway between fertilization and emergence of larva from the egg. No such pattern was observed among long or short genes that are maternally deposited in the embryo, as is expected if inability to terminate transcription is the driving mechanism behind this delay. Additional embryonic timecourse data from RNA-Seq libraries generated from non poly-A selected total RNA, and therefore not biased towards capture of processed RNAs, showed that only long zygotic
genes expressed during the earliest developmental time points show a marked deficiency in 3’ relative to 5’ derived reads. This is consistent with their inability to terminate transcription, but not with transcriptional delay due to reduced transcriptional activation during early development.

The analysis was extended using developmental expression data from 3 additional Drosophila species spanning ~60 million years of evolution and showed that this pattern of delayed expression of long zygotically expressed genes is conserved across the phylogeny. This led us to predict that short zygotically expressed genes that are conserved in their ability to escape intron delay would be under substantial evolutionary pressure to maintain their compact lengths, and found that this was the case when compared to long zygotic or either short or long maternally deposited genes.

We suggest that intron delay is an underappreciated mechanism affecting the expression level of a substantial fraction of the Drosophila embryonic transcriptome (~10%) and acts as a source of significant constraint on the structural evolution of important developmental genes.

Gubb D. 1986. Intron‐delay and the precision of expression of homoeotic gene products in Drosophila. Developmental Genetics 7: 119–131

Total internal and external lengths of the Bolthausen-Sznitman coalescent

Total internal and external lengths of the Bolthausen-Sznitman coalescent
Götz Kersting, Juan Carlos Pardo, Arno Siri-Jégousse
(Submitted on 6 Feb 2013)

In this paper, we study a weak law of large numbers for the total internal length of the Bolthausen-Szmitman coalescent. As a consequence, we obtain the weak limit law of the centered and rescaled total external length. The latter extends results obtained by Dhersin & M\”ohle \cite{DM12}. An application to population genetics dealing with the total number of mutations in the genealogical tree is also given.