Methods for distinguishing between protein-coding and long noncoding RNAs and the elusive biological purpose of translation of long noncoding RNAs

Gali Housman, Igor Ulitsky
doi: http://dx.doi.org/10.1101/017889

Long noncoding RNAs (lncRNAs) are a diverse class of RNAs with increasingly appreciated functions in vertebrates, yet much of their biology remains poorly understood. In particular, it is unclear to what extent the current catalog of over 10,000 distinct annotated lncRNAs is indeed devoid of genes coding for proteins. Here we review the available computational and experimental schemes for distinguishing between coding and noncoding transcripts and assess the conclusions from their recent genome-wide applications. We conclude that the model most consistent with available data is that a large number of mammalian lncRNAs undergo translation, but only a very small minority of such translation events result in stable and functional peptides. The outcome of the majority of the translation events and their potential biological purposes remain an intriguing topic for future investigation.

Predicting Carriers of Ongoing Selective Sweeps Without Knowledge of the Favored Allele

Roy Ronen, Glenn Tesler, Ali Akbari, Shay Zakov, Noah A Rosenberg, Vineet Bafna

Methods for detecting the genomic signatures of natural selection have been heavily studied, and they have been successful in identifying many selective sweeps. For most of these sweeps, the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the evolutionary trajectory — for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical framework for computing expected HAF scores under different evolutionary scenarios, and we validate the theoretical predictions with simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.
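
The abstract above does not define the HAF score, so the following is only a minimal sketch under a stated assumption: that a haplotype's score is the sum, over the derived alleles it carries, of the number of haplotypes in the sample carrying that allele. The function name `haf_scores` and the toy sample are hypothetical and are not taken from the paper or from PreCIOSS.

```python
from typing import List


def haf_scores(haplotypes: List[List[int]]) -> List[int]:
    """Score each haplotype in a binary sample matrix.

    haplotypes[i][j] is 1 if haplotype i carries the derived allele at
    site j, else 0. ASSUMPTION: the score of a haplotype is the sum of
    the sample counts of the derived alleles it carries.
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    # Number of haplotypes carrying the derived allele at each site.
    counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    # Add up the counts of the derived alleles each haplotype carries.
    return [sum(c for allele, c in zip(h, counts) if allele == 1)
            for h in haplotypes]


# Hypothetical toy sample: 4 haplotypes, 4 segregating sites.
sample = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
scores = haf_scores(sample)
print(scores)
```

Under this assumed definition, haplotypes that share many common derived alleles score highest, which is the intuition the abstract gives for carriers of a favored allele.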

Adaptive evolution of anti-viral siRNAi genes in bumblebees

Sophie Helbing, Michael Lattorff
doi: http://dx.doi.org/10.1101/017681

The high density of frequently interacting and closely related individuals in social insects enhances pathogen transmission and establishment within colonies. Group-mediated behavior supporting immune defenses tends to decrease selection acting on immune genes. Along with low effective population sizes, this will result in relaxed constraint and rapid evolution of genes of the immune system. Here we show that sociality is the main driver of selection in antiviral siRNAi genes in social bumblebees compared to their socially parasitic cuckoo bumblebees that lack a worker caste. RNAi genes show frequent positive selection at the codon level, additionally supported by the occurrence of parallel evolution, and their evolutionary rate is linked to their pathway-specific position, with genes directly interacting with viruses showing the highest rates of molecular evolution. We suggest that higher pathogen load in social insects indeed drives adaptive evolution of immune genes, if not compensated by behavior.

Simple genetic models for autism spectrum disorder

Swagatam Mukhopadhyay, Michael Wigler, Dan Levy
doi: http://dx.doi.org/10.1101/017301

Exploring the interplay between new mutation, transmission, and gender bias in genetic disease requires formal quantitative modeling. Autism spectrum disorders offer an ideal case: they are genetic in origin, complex, and show a gender bias. The high reproductive costs of autism ensure that most strongly associated genetic mutations are short-lived, and indeed the disease exhibits both transmitted and de novo components. There is a large body of both epidemiologic and genomic data that greatly constrain the genetic mechanisms that may contribute to the disorder. We develop a computational framework that assumes classes of additive variants, each member of a class having equal effect. We restrict our initial exploration to single-class models, each having three parameters. Only one model matches epidemiological data. It also independently matches the incidence of de novo mutation in simplex families, the gender bias in unaffected siblings in simplex populations, and rates of mutation in target genes. This model makes strong and as yet not fully tested predictions, namely that females are the primary carriers in cases of genetic transmission, and that the incidence of de novo mutation in target genes for families at high risk for autism is not especially elevated. In its simplicity, this model does not account for MZ twin concordance or the distorted gender bias of high-functioning children with ASD, and does not accommodate all the known mechanisms contributing to ASD. We point to the next steps in applying the same computational framework to explore more complex models.

Genetic variability under the seed bank coalescent

Jochen Blath, Bjarki Eldon, Adrian Casanova, Noemi Kurt, Maite Wilke-Berenguer
doi: http://dx.doi.org/10.1101/017244

We analyse patterns of genetic variability of populations in the presence of a large seed bank with the help of a new coalescent structure called the seed bank coalescent. This ancestral process appears naturally as the scaling limit of the genealogy of large populations that sustain seed banks, if the seed bank size and individual dormancy times are of the same order as the active population. Mutations appear as Poisson processes on the active lineages, and potentially at a reduced rate also on the dormant lineages. The presence of 'dormant' lineages leads to qualitatively altered times to the most recent common ancestor and non-classical patterns of genetic diversity. To illustrate this we provide a Wright-Fisher model with seed bank component and mutation, motivated by recent models of microbial dormancy, whose genealogy can be described by the seed bank coalescent. Based on our coalescent model, we derive recursions for the expectation and variance of the time to most recent common ancestor, number of segregating sites, pairwise differences, and singletons. Estimates (obtained by simulations) of the distributions of commonly employed distance statistics, in the presence and absence of a seed bank, are compared. The effect of a seed bank on the expected site-frequency spectrum is also investigated using simulations. Our results indicate that the presence of a large seed bank considerably alters the distribution of some distance statistics, as well as the site-frequency spectrum. Thus, one should be able to detect the presence of a large seed bank in genetic data.

A simple biophysical model predicts more rapid accumulation of hybrid incompatibilities in small populations

Bhavin S. Khatri, Richard A. Goldstein
Comments: 13 pages, 6 figures
Subjects: Populations and Evolution (q-bio.PE)

Speciation is fundamental to the huge diversity of life on Earth. Evidence suggests reproductive isolation arises most commonly in allopatry, with a higher speciation rate in small populations. Current theory does not address this dependence in the important weak mutation regime. Here, we examine a biophysical model of speciation based on the binding of a protein transcription factor to a DNA binding site, and how their independent co-evolution in two allopatric lineages, in a stabilizing landscape, leads to incompatibilities. Our results give a new prediction for the monomorphic regime of evolution, consistent with data, that smaller populations should develop incompatibilities more quickly. This arises because 1) smaller populations have a greater initial drift load, as there are more sequences that bind poorly than well, so fewer substitutions are needed to reach incompatible regions of phenotype space; and 2) divergence is slower when the population size is larger than the inverse of discrete differences in fitness. Further, we find longer sequences develop incompatibilities more quickly at small population sizes, but more slowly at large population sizes. The biophysical model thus represents a robust mechanism of rapid reproductive isolation for small populations and large sequences that does not require peak-shifts or positive selection.

Analysis of adaptive walks on NK fitness landscapes with different interaction schemes

Stefan Nowak, Joachim Krug
Comments: 29 pages, 9 figures
Subjects: Populations and Evolution (q-bio.PE); Disordered Systems and Neural Networks (cond-mat.dis-nn)

Fitness landscapes are genotype-to-fitness mappings, commonly used in evolutionary biology and computer science, that are closely related to spin glass models. In this paper, we study the NK model for fitness landscapes where the interaction scheme between genes can be explicitly defined. The focus is on how this scheme influences the overall shape of the landscape. Our main tool for the analysis is the adaptive walk, an idealized dynamics by which the population moves uphill in fitness and terminates at a local fitness maximum. We use three different types of walks and investigate how their length (the number of steps required to reach a local peak) and height (the fitness at the endpoint of the walk) depend on the dimensionality and structure of the landscape. We find that the distribution of local maxima over the landscape is particularly sensitive to the choice of interaction pattern. Most quantities that we measure are simply correlated with the rank of the scheme, which is equal to the number of nonzero coefficients in the expansion of the fitness landscape in terms of Walsh functions.
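
To make the notions of walk length and height concrete, here is a minimal sketch of an adaptive walk on an NK landscape. It rests on assumptions not stated in the abstract: the interaction scheme is the cyclic "adjacent" neighbourhood (each locus interacts with the next K loci), fitness contributions are uniform random numbers, and the walk picks a fitter one-mutant neighbour uniformly at random. That walk is one common variant chosen for illustration and is not necessarily one of the three types the authors study. All function names are hypothetical.

```python
import itertools
import random


def make_nk_landscape(n, k, rng):
    """Build an NK fitness function (ASSUMED adjacent, cyclic scheme)."""
    # Locus i depends on itself and the next k loci, wrapping around.
    neighbourhoods = [[(i + d) % n for d in range(k + 1)] for i in range(n)]
    # One random contribution per locus and per state of its neighbourhood.
    tables = [
        {bits: rng.random()
         for bits in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(genotype):
        total = sum(
            tables[i][tuple(genotype[j] for j in neighbourhoods[i])]
            for i in range(n)
        )
        return total / n  # mean contribution, so fitness lies in [0, 1]

    return fitness


def random_adaptive_walk(fitness, start, rng):
    """Move to a random fitter one-mutant neighbour until none exists."""
    current = tuple(start)
    steps = 0
    while True:
        f = fitness(current)
        fitter = []
        for i in range(len(current)):
            neighbour = current[:i] + (1 - current[i],) + current[i + 1:]
            if fitness(neighbour) > f:
                fitter.append(neighbour)
        if not fitter:
            # Local maximum: return endpoint, walk length, walk height.
            return current, steps, f
        current = rng.choice(fitter)
        steps += 1


rng = random.Random(0)
fitness = make_nk_landscape(n=8, k=2, rng=rng)
peak, length, height = random_adaptive_walk(fitness, [0] * 8, rng)
print(peak, length, height)
```

Swapping the `neighbourhoods` list for a different interaction pattern is the only change needed to compare schemes in this sketch; the walk always terminates because fitness strictly increases at every step over a finite genotype space.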

Entire genome transcription across evolutionary time exposes non-coding DNA to de novo gene emergence

Rafik Neme, Diethard Tautz
doi: http://dx.doi.org/10.1101/017152

Even in the best-studied mammalian genomes, less than 5% of the total genome length is annotated as exonic. However, deep sequencing analysis in humans has shown that around 40% of the genome may be covered by poly-adenylated non-coding transcripts occurring at low levels. Their functional significance is unclear, and there has been a dispute over whether they should be considered noise of the transcriptional machinery. We propose that if such transcripts show some evolutionary stability they will serve as substrates for de novo gene evolution, i.e. gene emergence out of non-coding DNA. Here, we characterize the phylogenetic turnover of low-level poly-adenylated transcripts in a comprehensive sampling of populations, sub-species and species of the genus Mus, spanning a phylogenetic distance of about 10 Myr. We find evidence for more evolutionarily stable gains of transcription than losses among closely related taxa, balanced by a loss of older transcripts across the whole phylogeny. We show that adding taxa increases the genomic transcript coverage and that no major transcript-free islands exist over time. This suggests that the entire genome can be transcribed into poly-adenylated RNA when viewed at an evolutionary time scale. Thus, any part of the “non-coding” genome can become subject to evolutionary functionalization via de novo gene evolution.

MMR: A Tool for Read Multi-Mapper Resolution

Andre Kahles, Jonas Behr, Gunnar Rätsch
doi: http://dx.doi.org/10.1101/017103

Motivation: Mapping high-throughput sequencing data to a reference genome is an essential step for most analysis pipelines aiming at the computational analysis of genome and transcriptome sequencing data. Breaking ties between equally well mapping locations not only poses a severe problem during the alignment phase but also has a significant impact on the results of downstream analyses. We present the multimapper resolution (MMR) tool that infers optimal mapping locations from the coverage density of other mapped reads. Results: Filtering alignments with MMR can significantly improve the performance of downstream analyses like transcript quantitation and differential testing. We illustrate that the accuracy (Spearman correlation) of transcript quantification increases by 17% when using reads of length 51. In addition, MMR decreases the alignment file sizes by more than 50%, and this leads to a reduced running time of the quantification tool. Our efficient implementation of the MMR algorithm is easily applicable as a post-processing step to existing alignment files in BAM format. Its complexity scales linearly with the number of alignments, and it requires no further inputs. Supplementary Material: Source code and documentation are available for download at http://github.com/ratschlab/mmr. Supplementary text and figures, comprehensive testing results and further information can be found at http://bioweb.me/mmr.

How complexity originates: The evolution of animal eyes

Todd H Oakley, Daniel I Speiser
doi: http://dx.doi.org/10.1101/017129

Learning how complex traits like eyes originate is fundamental for understanding evolution. Here, we first sketch historical perspectives on trait origins and argue that new technologies offer key new insights. Next, we articulate four open questions about trait origins. To address them, we define a research program to break complex traits into components and study the individual evolutionary histories of those parts. By doing so, we can learn when the parts came together and perhaps understand why they stayed together. We apply the approach to five structural innovations critical for complex eyes, reviewing the history of the parts of each of those innovations. Photoreceptors evolved within animals by bricolage, recombining genes that originated far earlier. Multiple genes used in eyes today had ancestral roles in stress responses. We hypothesize that photo-stress could have increased the chance those genes were expressed together in places on animals where light was abundant.