Butter: High-precision genomic alignment of small RNA-seq data
Michael J Axtell
Eukaryotes produce large numbers of small non-coding RNAs that act as specificity determinants for various gene-regulatory complexes. These include microRNAs (miRNAs), endogenous short interfering RNAs (siRNAs), and Piwi-associated RNAs (piRNAs). These RNAs can be discovered, annotated, and quantified using small RNA-seq, a variant RNA-seq method based on highly parallel sequencing. Alignment to a reference genome is a critical step in analysis of small RNA-seq data. Because of their small size (20-30 nts depending on the organism and sub-type) and tendency to originate from multi-gene families or repetitive regions, reads that align equally well to more than one genomic location are very common. Typical methods to deal with multi-mapped small RNA-seq reads sacrifice either precision or sensitivity. The tool ‘butter’ balances precision and sensitivity by placing multi-mapped reads using an iterative approach, where the decision between possible locations is dictated by the local densities of more confidently aligned reads. Butter displays superior performance relative to other small RNA-seq aligners. Treatment of multi-mapped small RNA-seq reads has substantial impacts on downstream analyses, including quantification of MIRNA paralogs, and discovery of endogenous siRNA loci. Butter is freely available under a GNU general public license.
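The density-guided placement idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Butter's actual implementation: the function name `place_multimappers`, the fixed-size binning via `window`, and the tie-handling rule are all hypothetical choices made for the example.

```python
from collections import defaultdict


def place_multimappers(unique_hits, multi_hits, window=50):
    """Assign each multi-mapped read to the candidate locus with the most nearby reads.

    unique_hits: list of (chrom, pos) for confidently (uniquely) aligned reads.
    multi_hits:  dict read_id -> list of candidate (chrom, pos) locations.
    Returns dict read_id -> chosen (chrom, pos), or None if candidates stay tied.
    NOTE: fixed-width bins are an assumption of this sketch.
    """
    # Local density = number of placed reads per (chromosome, bin).
    density = defaultdict(int)
    for chrom, pos in unique_hits:
        density[(chrom, pos // window)] += 1

    placed = {}
    pending = dict(multi_hits)
    progress = True
    # Iterate: each newly placed read raises local density, which may
    # break ties for reads that could not be placed in an earlier pass.
    while pending and progress:
        progress = False
        for read_id in sorted(pending):
            candidates = pending[read_id]
            scores = [density[(c, p // window)] for c, p in candidates]
            best = max(scores)
            if scores.count(best) == 1:  # exactly one best-supported location
                chosen = candidates[scores.index(best)]
                placed[read_id] = chosen
                density[(chosen[0], chosen[1] // window)] += 1
                del pending[read_id]
                progress = True

    # Reads whose candidates remain tied are left unplaced in this sketch.
    for read_id in pending:
        placed[read_id] = None
    return placed


unique = [("chr2", 60), ("chr3", 60)]
multi = {
    "a": [("chr2", 70), ("chr3", 70)],  # tied on the first pass
    "b": [("chr2", 80), ("chr9", 5)],   # placed first, breaking the tie for "a"
}
result = place_multimappers(unique, multi)
```

In this toy input, read "a" cannot be placed on the first pass because both candidates have equal support; once "b" is placed on chr2, the second pass resolves "a" to chr2 as well.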
Clonal interference and Muller’s ratchet in spatial habitats
Jakub Otwinowski, Joachim Krug
(Submitted on 18 Feb 2013 (v1), last revised 23 Jul 2014 (this version, v3))
Competition between independently arising beneficial mutations is enhanced in spatial populations due to the linear rather than exponential growth of clones. Recent theoretical studies have pointed out that the resulting fitness dynamics is analogous to a surface growth process, where new layers nucleate and spread stochastically, leading to the build-up of scale-invariant roughness. This scenario differs qualitatively from the standard view of adaptation in that the speed of adaptation becomes independent of population size while the fitness variance does not. Here we exploit recent progress in the understanding of surface growth processes to obtain precise predictions for the universal, non-Gaussian shape of the fitness distribution for one-dimensional habitats, which are verified by simulations. When the mutations are deleterious rather than beneficial, the problem becomes a spatial version of Muller’s ratchet. In contrast to the case of well-mixed populations, the rate of fitness decline remains finite even in the limit of an infinite habitat, provided the ratio U_d/s^2 between the deleterious mutation rate and the square of the (negative) selection coefficient is sufficiently large. Again using an analogy to surface growth models, we show that the transition between the stationary and the moving state of the ratchet is governed by directed percolation.
Concerning RNA-Guided Gene Drives for the Alteration of Wild Populations
Kevin M Esvelt, Andrea L Smidler, Flaminia Catteruccia, George M Church
Gene drives may be capable of addressing ecological problems by altering entire populations of wild organisms, but their use has remained largely theoretical due to technical constraints. Here we consider the potential for RNA-guided gene drives based on the CRISPR nuclease Cas9 to serve as a general method for spreading altered traits through wild populations over many generations. We detail likely capabilities, discuss limitations, and provide novel precautionary strategies to control the spread of gene drives and reverse genomic changes. The ability to edit populations of sexual species would offer substantial benefits to humanity and the environment. For example, RNA-guided gene drives could potentially prevent the spread of disease, support agriculture by reversing pesticide and herbicide resistance in insects and weeds, and control damaging invasive species. However, the possibility of unwanted ecological effects and near-certainty of spread across political borders demand careful assessment of each potential application. We call for thoughtful, inclusive, and well-informed public discussions to explore the responsible use of this currently theoretical technology.
Assessing allele specific expression across multiple tissues from RNA-seq read data
Matti Pirinen, Tuuli Lappalainen, Noah A Zaitlen, GTEx Consortium, Emmanouil T Dermitzakis, Peter Donnelly, Mark I McCarthy, Manuel A Rivas
Motivation: RNA sequencing enables allele specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression project (GTEx) is collecting RNA-seq data on multiple tissues of the same set of individuals, and novel methods are required for the analysis of these data. Results: We present a statistical method to compare different patterns of ASE across tissues and to classify genetic variants according to their impact on the tissue-wide expression profile. We focus on strong ASE effects that we expect to see for protein-truncating variants, but our method can also be adjusted for other types of ASE effects. We illustrate the method with a real data example on a tissue-wide expression profile of a variant causal for lipoid proteinosis, and with a simulation study to assess our method more generally. Availability: MAMBA software: http://birch.well.ox.ac.uk/~rivas/mamba/ R source code and data examples: http://www.iki.fi/mpirinen/
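To make the notion of an ASE effect concrete, the sketch below tests a single heterozygous site for allelic imbalance with a two-sided exact binomial test on reference and alternate read counts. This is an assumption-laden illustration of a common baseline, not the MAMBA method described above, which compares patterns across tissues; the function name `binomial_two_sided_p` and the tissue counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_count, alt_count, p_ref=0.5):
    """Two-sided exact binomial p-value for allelic imbalance at one site.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under the null hypothesis that a read carries the reference
    allele with probability p_ref (0.5 means no allele-specific expression).
    """
    n = ref_count + alt_count
    if n == 0:
        return 1.0  # no reads, no evidence either way
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_count]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Hypothetical (ref, alt) read counts for one variant in three tissues.
tissue_counts = {"liver": (10, 0), "muscle": (5, 5), "skin": (9, 1)}
p_values = {t: binomial_two_sided_p(r, a) for t, (r, a) in tissue_counts.items()}
```

A strong effect of the kind expected for protein-truncating variants would show up as a very small p-value in the affected tissues; testing each tissue separately, as here, ignores the sharing of information across tissues that a joint model provides.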
Fixation properties of subdivided populations with balancing selection
Pierangelo Lombardo, Andrea Gambassi, Luca Dall’Asta
Comments: 17 pages, 10 figures
Subjects: Populations and Evolution (q-bio.PE); Statistical Mechanics (cond-mat.stat-mech); Biological Physics (physics.bio-ph)
In subdivided populations, migration acts together with selection and genetic drift and determines their evolution. Building on a recently proposed method, which hinges on the emergence of a time scale separation between local and global dynamics, we study the fixation properties of subdivided populations in the presence of balancing selection. The approximation implied by the method is accurate when the effective selection strength is small and the number of subpopulations is large. In particular, it predicts a phase transition between species coexistence and biodiversity loss in the infinite-size limit and, in finite populations, a nonmonotonic dependence of the mean fixation time on the migration rate. In order to investigate the fixation properties of the subdivided population for stronger selection, we introduce an effective coarser description of the dynamics in terms of a voter model with intermediate states, which highlights the basic mechanisms driving the evolutionary process.
RNA-seq gene profiling – a systematic empirical comparison
Nuno A Fonseca, John A Marioni, Alvis Brazma
Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found, with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And how close are the inferred expression levels to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNA-seq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNA-seq data, it still allows us to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
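The two kinds of summary mentioned above, an overall correlation with the true values and a per-gene error, can be sketched as follows. This is an illustrative sketch only: the abstract does not state which correlation or error measure the authors used, so Pearson correlation and relative error are assumptions here, and the gene names and values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def relative_errors(true_expr, inferred_expr):
    """Per-gene relative error |inferred - true| / true, for genes with true > 0."""
    return {
        gene: abs(inferred_expr[gene] - true_value) / true_value
        for gene, true_value in true_expr.items()
        if true_value > 0
    }


# Hypothetical simulated ("true") and pipeline-inferred expression values.
true_expr = {"g1": 10.0, "g2": 20.0, "g3": 40.0}
inferred_expr = {"g1": 11.0, "g2": 20.0, "g3": 20.0}

genes = sorted(true_expr)
overall_r = pearson([true_expr[g] for g in genes], [inferred_expr[g] for g in genes])
errors = relative_errors(true_expr, inferred_expr)
worst_gene = max(errors, key=errors.get)
```

Reporting both numbers matters because a pipeline can correlate well with the truth overall while still estimating individual genes, such as `g3` in this toy data, badly.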
Reagent contamination can critically impact sequence-based microbiome analyses
Susannah Salter, Michael J Cox, Elena M Turek, Szymon T Calus, William O Cookson, Miriam F Moffatt, Paul Turner, Julian Parkhill, Nick Loman, Alan W Walker
The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture-independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. These results suggest that caution is needed when applying sequence-based techniques to the study of microbiota present in low biomass environments. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. Concurrent sequencing of negative control samples is strongly advised.
No evidence that sex and transposable elements drive genome size variation in evening primroses
J Arvid Agren, Stephan Greiner, Marc TJ Johnson, Stephen I Wright
Genome size varies dramatically across species, but despite an abundance of attention there is little agreement on the relative contributions of selective and neutral processes in governing this variation. The rate of sexual reproduction can potentially play an important role in genome size evolution because of its effect on the efficacy of selection and transmission of transposable elements. Here, we used a phylogenetic comparative approach and whole genome sequencing to investigate the contribution of sex and transposable element content to genome size variation in the evening primrose (Oenothera) genus. We determined genome size using flow cytometry from 30 Oenothera species with varying reproductive systems and found that variation in sexual/asexual reproduction cannot explain the almost two-fold variation in genome size. Moreover, using whole genome sequences of three species with varying genome sizes and reproductive systems, we found that genome size was not associated with transposable element abundance; instead the larger genomes had a higher abundance of simple sequence repeats. Although it has long been clear that sexual reproduction may affect various aspects of genome evolution in general and transposable element evolution in particular, it does not appear to have played a major role in the evening primroses.
Mitochondrial Genomes of Domestic Animals Need Scrutiny
Ni-Ni Shi, Long Fan, Yong-Gang Yao, Min-Sheng Peng, Ya-Ping Zhang
(Submitted on 16 Jul 2014)
More than 1000 complete or near-complete mitochondrial DNA (mtDNA) sequences have been deposited in GenBank for eight common domestic animals (i.e. cattle, dog, goat, horse, pig, sheep, yak and chicken) and their close wild ancestors or relatives. Nevertheless, few efforts have been made to evaluate the quality of the sequence data, which can heavily affect the original conclusions. Herein, we conducted a phylogenetic survey of these complete or near-complete mtDNA sequences based on mtDNA haplogroup trees for the eight animals. We show that errors due to artificial recombination, surplus mutations, and phantom mutations exist in 14.5% (194/1342) of the mtDNA sequences, which should be treated with great caution. We propose some caveats for future mtDNA studies of domestic animals.
The genetic landscape of transcriptional networks in a combined haploid/diploid plant system
Jukka-Pekka Verta, Christian R Landry, John J MacKay
Heritable variation in gene expression is a source of evolutionary change and our understanding of the genetic basis of expression variation remains incomplete. Here, we dissected the genetic basis of transcriptional variation in a wild, outbreeding gymnosperm (Picea glauca) according to linked and unlinked genetic variants, their allele-specific (cis) and allele non-specific (trans) effects, and their phenotypic additivity. We used a novel plant system that is based on the analysis of segregating alleles of a single self-fertilized plant in haploid and diploid seed tissues. We measured transcript abundance and identified transcribed SNPs in 66 seeds with RNA-seq. Linked and unlinked genetic effects that influenced expression levels were abundant in the haploid megagametophyte tissue, influencing 48% and 38% of analyzed genes, respectively. Analysis of these effects in diploid embryos revealed that while distant effects were acting in trans consistent with their hypothesized diffusible nature, local effects were associated with a complex mix of cis, trans and compensatory effects. Most cis effects were additive irrespective of their effect sizes, consistent with a hypothesis that they represent rate-limiting factors in transcript accumulation. We show that trans effects fulfilled a key prediction of Wright's physiological theory, in which variants with small effects tend to be additive and those with large effects tend to be dominant/recessive. Our haploid/diploid approach allows a comprehensive genetic dissection of expression variation and can be applied to a large number of wild plant species.