Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms

Rob Patro (1), Stephen M. Mount (2), Carl Kingsford (1) ((1) Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, (2) Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland)
(Submitted on 16 Aug 2013)

RNA-seq has rapidly become the de facto technique to measure gene expression. However, the time required for analysis has not kept up with the pace of data generation. Here we introduce Sailfish, a novel computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Sailfish entirely avoids mapping reads, which is a time-consuming step in all current methods. Sailfish provides quantification estimates much faster than existing approaches (typically 20 times faster) without loss of accuracy.

Realistic simulations reveal extensive sample-specificity of RNA-seq biases

Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman
(Submitted on 14 Aug 2013)

In line with the importance of RNA-seq, the bioinformatics community has produced numerous data analysis tools incorporating methods to correct sample-specific biases. However, few advanced simulation tools exist to enable benchmarking of competing correction methods. We introduce the first framework to reproduce the properties of individual RNA-seq runs and, by applying it to several datasets, we demonstrate the importance of accounting for sample-specificity in realistic simulations.

On the sympatric evolution of coexistence by relative nonlinearity of competition

Florian Hartig, Tamara Münkemüller, Karin Johst, Ulf Dieckmann
(Submitted on 14 Aug 2013)

If two species show different nonlinear responses to a single shared resource, and if each species modifies resource dynamics such that it favors its competitor, they may stably coexist. While the mechanism behind this phenomenon, known as relative nonlinearity of competition, is well understood, less is known about its evolutionary properties and its prevalence in real communities. We address this challenge by using the adaptive dynamics framework as well as individual-based simulations to compare dynamic and evolutionary stability of communities coexisting through relative nonlinearity. Evolution operates on the species’ density compensation strategies, and a trade-off between growth at high versus low resource availability (population density) is assumed. We confirm previous findings that, irrespective of the particular model of density-dependence, there are usually broad ranges of coexistence between overcompensating and undercompensating density-compensation strategies. We show that most of these strategies, however, are not evolutionarily stable and will be outcompeted by a single compensatory strategy. Only very specific evolutionary trade-offs allow evolutionary stability of strategies that coexist through relative nonlinearity. As we find no reason why these particular trade-offs should be abundant in nature, we conclude that sympatric evolution of relative nonlinearity seems possible, but rather unlikely. We speculate that this may explain why relative nonlinearity has seldom been observed, although we note that a low probability of sympatric evolution does not exclude the possibility that this mechanism of coexistence might still frequently occur when species with different evolutionary histories meet in the same community. Our results highlight the need for combining ecological and evolutionary perspectives for understanding community assembly and biogeographical patterns.

Cell-cycle regulated transcription associates with DNA replication timing in yeast and human

Hunter B. Fraser
(Submitted on 8 Aug 2013)

Eukaryotic DNA replication follows a specific temporal program, with some genomic regions consistently replicating earlier than others, yet what determines this program is largely unknown. Highly transcribed regions have been observed to replicate in early S-phase in all plant and animal species studied to date, but this relationship is thought to be absent from both budding yeast and fission yeast. No association between cell-cycle regulated transcription and replication timing has been reported for any species. Here I show that in budding yeast, fission yeast, and human, the genes most highly transcribed during S-phase replicate early, whereas those repressed in S-phase replicate late. Transcription during other cell-cycle phases shows either the opposite correlation with replication timing, or no relation. The relationship is strongest near late-firing origins of replication, which is not consistent with a previously proposed model — that replication timing may affect transcription — and instead suggests a potential mechanism involving the recruitment of limiting replication initiation factors during S-phase. These results suggest that S-phase transcription may be an important determinant of DNA replication timing across eukaryotes, which may explain the well-established association between transcription and replication timing.

Our Paper: The genomic impacts of drift and selection for hybrid performance in maize

This next post is by Jeff Ross-Ibarra (@jrossibarra) on his paper with coauthors, Gerke et al., The genomic impacts of drift and selection for hybrid performance in maize, arXived here.

Iowa recurrent selection as an evolutionary experiment in hybrid vigor

Maize is an outcrossing species, and was cultivated as such up through the first quarter of the 20th century. Starting in the 1920’s, however, breeders began to abandon open-pollinated maize in favor of hybrid varieties resulting from crosses between inbred lines. Hybrids are often more robust and higher yielding than either inbred parent, a phenomenon known as hybrid vigor or heterosis.

Breeding for hybrid varieties, and presumably for increased heterosis, has had a profound impact on diversity across the maize genome. There are at least two important differences from previous breeding efforts. First, breeders select on and work with inbred maize lines rather than practicing mass selection on open-pollinated populations. This results in much smaller effective population sizes, and has implications for recessive traits and deleterious alleles that could be masked in heterozygotes. Second, instead of selecting the best plants per se, breeders now select for inbreds that make high-yielding hybrids. This means a breeder might favor an inbred that itself is not high-yielding if it consistently makes good hybrids when paired with other inbreds.

We set out to study the effects of these breeding strategies on patterns of diversity across the maize genome. We took advantage of one of the longest-running ongoing experiments on selection for hybrid performance, started in the late 1940s by the US Dept. of Agriculture’s Agricultural Research Service. Two small sets of founder inbred lines (12 and 16 lines) were randomly mated to create two base populations: the Iowa Stiff Stalk Synthetic (BSSS) and the Iowa Corn Borer Synthetic No. 1 (BSCB1). In addition to its role as an important selection experiment, the BSSS population has given rise to multiple maize breeding lines, including the line used for the maize reference genome.

Diversity in the BSSS and BSCB1 is patterned predominantly by drift

Over the course of the experiment we studied, the two base populations underwent 16 cycles of recurrent selection, in which lines from each population were crossed to each other and evaluated for both hybrid and per-se performance. Selected lines were intermated within each population to form the next generation. To investigate the genomic impact of this selection scheme, we genotyped progenitor lines and over 600 individuals from multiple selection cycles using the Illumina MaizeSNP50 SNP array. And because we know the exact crossing and selection scheme used, we can compare the observed changes in genome-wide diversity with strictly neutral crossing simulations using the genotypes of the starting populations.

Both populations steadily lost genetic diversity as they diverged from one another, but diversity within BSSS and BSCB1 and divergence between them can be largely reproduced by simulation without any selection. In fact, principal component analysis clearly reveals changes in population structure and diversity that mirror the changes in rates of inbreeding and effective population size that occurred over the course of the experiment. This indicates that the structure is not necessarily related to the phenotypic improvement, but might be a by-product of the breeding scheme. Similar population structure is reflected in a recent broad comparison of US maize germplasm and suggests that much of the diversity and structure of modern maize germplasm has been shaped by genetic drift.
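
To give a feel for how much diversity drift alone can remove, here is a minimal sketch of a single-locus Wright-Fisher model. It is not the pedigree-based crossing simulation used in the paper, which follows the known crossing scheme and the genotypes of the founders. The effective size of 10, the starting allele frequency of 0.5, and the treatment of each selection cycle as one generation are all illustrative assumptions, and the function names are hypothetical.

```python
import random


def expected_heterozygosity(h0, ne, generations):
    """Expected heterozygosity under drift alone: H_t = H_0 * (1 - 1/(2*Ne))**t."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** generations


def simulate_drift(p0, ne, generations, rng):
    """Wright-Fisher drift at one biallelic locus.

    Returns the allele frequency in every generation, starting with p0.
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Sample 2*Ne gene copies for the next generation (binomial draw).
        copies = sum(1 for _ in range(2 * ne) if rng.random() < p)
        p = copies / (2 * ne)
        freqs.append(p)
    return freqs


def mean_simulated_heterozygosity(p0, ne, generations, n_loci, seed=0):
    """Average 2p(1-p) over independent loci after the given number of generations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_loci):
        p = simulate_drift(p0, ne, generations, rng)[-1]
        total += 2.0 * p * (1.0 - p)
    return total / n_loci


if __name__ == "__main__":
    # Hypothetical values: Ne = 10, 16 generations, starting frequency 0.5.
    print(expected_heterozygosity(0.5, 10, 16))
    print(mean_simulated_heterozygosity(0.5, 10, 16, n_loci=2000))
```

Comparing the simulated average with the analytical expectation is the same kind of check described above, in miniature: if observed diversity falls within what neutral simulation produces, selection is not needed to explain it.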

Selection efficacy and fixation in regions of low recombination

But genetic drift can’t be the whole story in these populations. Numerous experiments have shown that the later populations are superior to their progenitors in terms of hybrid yield and traits important to increased planting density (more plants per acre = more yield). These same trends are observed across North American maize as a whole, suggesting common themes in how maize has improved over time. Selection is difficult to detect in the face of strong genetic drift, especially when the selection has been on traits with complex genetic architectures. However, our simulations do detect regions of low heterozygosity in each population that are longer than expected given their genetic distance.

The most striking pattern in these regions is their lack of overlap between the two populations. In simple cases, classic overdominance models of heterosis predict that at a single locus, two distinct alleles confer heterozygote advantage when combined. In this case, selection should lead to decreased heterozygosity at a locus in both populations as complementary alleles rise in frequency. We don’t observe this, and neither did a different study that used other populations.

A popular alternative to the overdominance model is the dominance model, which predicts that heterosis is caused by the complementation of linked recessive deleterious alleles. In this case, multiple haplotypes in the other population may complement a fixed region if most deleterious alleles in maize are rare. Evidence from numerous studies supports a dominance model of heterosis, including findings of excess residual heterozygosity in low-recombination regions of a maize mapping population. In regions of low recombination, heterozygosity (and thus complementation) becomes important due to an inability to efficiently select for new recombinants in these regions, especially with low effective population sizes. And because of low rates of recombination, a small genetic interval in these regions becomes massive in physical space and encompasses the composite effects of many deleterious loci. We observe fixation in these regions in the BSSS and BSCB1 populations. They are short genetically (1-2 centimorgans), but make up very large fractions of the chromosome. We find that in many cases, these regions have been inherited largely intact from the original population founders, indicating that selection for new haplotype combinations in these regions has been ineffective. Large haplotypes in some cases may have fixed early on in the formation of many breeding programs, and the combination of limited exchange between breeding pools and small effective population sizes has provided little opportunity for selective removal of deleterious alleles. Complementation and the inefficiency of selection in these pericentromeric regions, which span a large portion of the physical genome, may thus explain the difference between hybrid and inbred yield and why it has remained fairly constant.

Predicting protein contact map using evolutionary and physical constraints by integer programming

Zhiyong Wang, Jinbo Xu
(Submitted on 8 Aug 2013)

Motivation. A protein contact map describes the pairwise spatial and functional relationships of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, contact map prediction using only sequence information remains very challenging. Most existing methods predict the contact map matrix element by element, ignoring correlation among contacts and the physical feasibility of the whole contact map. A couple of recent methods predict contact maps based upon residue co-evolution, taking into consideration contact correlation and enforcing a sparsity restraint, but these methods require a very large number of sequence homologs for the protein under consideration, and the resultant contact map may still be physically unfavorable.
Results. This paper presents a novel method, PhyCMAP, for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming (ILP). The evolutionary restraints include sequence profile, residue co-evolution and context-specific statistical potential. The physical restraints specify more concrete relationships among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and thus significantly improves prediction accuracy. Experimental results confirm that PhyCMAP outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration. PhyCMAP can predict contacts within minutes after the PSIBLAST search for sequence homologs is done, much faster than the two recent methods PSICOV and EvFold.
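
For readers unfamiliar with the object being predicted, the sketch below builds a contact map from known 3D coordinates. It illustrates only the target matrix, not PhyCMAP or any predictor. The 8 Å distance cutoff, the one-point-per-residue representation, and the function name are assumptions made for illustration.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Return a symmetric 0/1 matrix marking residue pairs closer than `threshold`.

    coords: one (x, y, z) tuple per residue, in angstroms.
    min_separation: minimum sequence distance |i - j| for a pair to be considered.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
                cmap[j][i] = 1  # contacts are symmetric
    return cmap


if __name__ == "__main__":
    # Three made-up residue positions on a line.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
    for row in contact_map(toy):
        print(row)
```

The matrix is symmetric and sparse, which is why restraints on the whole map, rather than on individual entries, can shrink the solution space.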

How Population Growth Affects Linkage Disequilibrium

Alan R. Rogers
(Submitted on 8 Aug 2013)

Linkage disequilibrium (LD) is often summarized using the “LD curve,” which relates the LD between pairs of sites to the distance that separates them along the chromosome. This paper shows how the LD curve responds to changes in population size. An expansion of population size generates an LD curve that declines steeply, especially if that expansion has followed a bottleneck. A reduction in size generates an LD curve that is high but relatively flat. In European data, the curve is steep, suggesting a history of population expansion.
These conclusions emerge from the study of $\sigma_d^2$, a measure of LD that has never played a central role. It has been seen merely as an approximation to another measure, $r^2$. Yet $\sigma_d^2$ has different dynamical behavior and provides deeper time depth. Furthermore, it is easily estimated from data and can be predicted from population history using a fast, deterministic algorithm.
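
For reference, the two measures are conventionally defined as follows. The symbols are not in the abstract and are introduced here as notation: $D$ is the coefficient of linkage disequilibrium between two biallelic loci, $p_A$ and $p_B$ are the allele frequencies at those loci, and $E[\cdot]$ denotes an expectation over evolutionary replicates.

$$
r^2 = \frac{D^2}{p_A(1-p_A)\,p_B(1-p_B)},
\qquad
\sigma_d^2 = \frac{E[D^2]}{E[\,p_A(1-p_A)\,p_B(1-p_B)\,]}
$$

The difference is that $\sigma_d^2$ is a ratio of expectations, whereas the expectation of $r^2$ is the expectation of a ratio. This is why the former has been treated as an approximation to the latter.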

The dynamics of alternative pathways to compensatory substitution

Chris A. Nasrallah
(Submitted on 9 Aug 2013)

The role of epistatic interactions among loci is a central question in evolutionary biology and is increasingly relevant in the genomic age. While the population genetics of compensatory substitution have received considerable attention, most studies have focused on the case when natural selection is very strong against deleterious intermediates. In the biologically plausible scenario of weak to moderate selection there exist two alternate pathways for compensatory substitution. In one pathway, a deleterious mutation becomes fixed prior to occurrence of the compensatory mutation. In the other, the two loci are simultaneously polymorphic. The rates of compensatory substitution along these two pathways and their relative probabilities are functions of the population size, selection strength, mutation rate, and recombination rate. In this paper these rates and path probabilities are derived analytically and verified using population genetic simulations. The expected time durations of these two paths are similar when selection is moderate, but not when selection is weak. The effect of recombination on the dynamics of the substitution process is explored using simulation. Using the derived rates, a phylogenetic substitution model of the compensatory evolution process is presented that could be used for inference of population genetic parameters from interspecific data.

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

Binghang Liu, Yujian Shi, Jianying Yuan, Xuesong Hu, Hao Zhang, Nan Li, Zhenyu Li, Yanxiang Chen, Desheng Mu, Wei Fan
(Submitted on 9 Aug 2013)

Background: With the fast development of next generation sequencing technologies, increasing numbers of genomes are being de novo sequenced and assembled. However, most are in fragmental and incomplete draft status, and thus it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems to current assemblers utilizing short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics.
Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and coverage bias, the estimation accuracy of our method is significantly improved over existing methods. We also studied how the various genomic and sequencing characteristics affect the estimation accuracy using simulated sequencing data, and discussed the limitations on applying our method to real sequencing data.
Conclusion: Based on this research, we show that k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve our understanding of a species’ genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms. The programs developed in this research are written using C/C++ and freely accessible at this ftp URL
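
As background, the simplest form of k-mer based genome size estimation divides the total number of k-mers observed in the reads by the depth at the main peak of the k-mer frequency histogram. The sketch below shows that textbook estimate only. It is not the authors' framework, which additionally models sequencing error, coverage bias, repeats and heterozygosity. The function names and the `min_depth` cutoff for discarding likely error k-mers are assumptions, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


def estimate_genome_size(hist, min_depth=2):
    """Naive estimate: (total k-mers at depth >= min_depth) / (peak depth).

    hist: dict mapping depth -> number of distinct k-mers at that depth.
    min_depth: depths below this are treated as sequencing errors and dropped.
    """
    kept = {d: n for d, n in hist.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Made-up histogram: a low-depth error bin and a main peak at depth 10.
    toy_hist = {1: 50, 9: 100, 10: 1000, 11: 100}
    print(estimate_genome_size(toy_hist))
```

On real data the histogram is rarely this clean: heterozygosity adds a second peak at half the main depth and repeats add a long right tail, which is where a fuller model such as the one described in the abstract becomes necessary.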

Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements

Weiwei Zhang, Tim D Spector, Panos Deloukas, Jordana T Bell, Barbara E Engelhardt
(Submitted on 9 Aug 2013)

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is important, but current approaches tackle average methylation within a genomic locus and are often limited to specific genomic regions.
Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict CpG site methylation levels using as features neighboring CpG site methylation levels and genomic distance, and co-localization with coding regions, CGIs, and regulatory elements from the ENCODE project, among others. Our approach achieves 91% to 94% prediction accuracy of genome-wide methylation levels at single CpG site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs. Our classifier outperforms state-of-the-art methylation classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation status, CpG island status, co-localized DNase I hypersensitive sites, and specific transcription factor binding sites were found to be most predictive of methylation levels.
Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict site-specific methylation levels that achieves the best DNA methylation predictive accuracy to date. Furthermore, our method identified genomic features that interact with DNA methylation, elucidating mechanisms involved in DNA methylation modification and regulation, and linking different epigenetic processes.