# Clusters of microRNAs emerge by new hairpins in existing transcripts

Antonio Marco, Maria Ninova, Matthew Ronshaugen, Sam Griffiths-Jones
(Submitted on 9 Apr 2013)

Genetic linkage may result in the expression of multiple products from a single polycistronic transcript, under the control of a single promoter. In animals, protein-coding polycistronic transcripts are rare. However, microRNAs are frequently clustered in the genomes of animals and plants, and these clusters are often transcribed as a single unit. The evolution of microRNA clusters has been the subject of much speculation, and a selective advantage of clusters of functionally related microRNAs is often proposed. However, the origin of microRNA clusters has not been so far systematically explored. Here we study the evolution of all microRNA clusters in Drosophila melanogaster, and suggest a number of models for their emergence. We observed that a majority of microRNA clusters arose by the de novo formation of new microRNA-like hairpins in existing microRNA transcripts. Some clusters also emerged by tandem duplication of a single microRNA. Comparative genomics show that these clusters, once formed, are unlikely to split or undergo rearrangements. We did not find any instances of clusters appearing by rearrangement of pre-existing microRNA genes. We propose a model for microRNA cluster origin and evolution in which selection over one of the microRNAs in the cluster interferes with the evolution of the other tightly linked microRNAs. Our analysis suggests that the evolutionary study of microRNAs and other small RNAs must consider and account for linkage associations.

# An algebraic framework to sample the rearrangement histories of a cancer metagenome with double cut and join, duplication and deletion events

Daniel R. Zerbino, Benedict Paten, Glenn Hickey, David Haussler
(Submitted on 22 Mar 2013)

Algorithms to study structural variants (SV) in whole genome sequencing (WGS) cancer datasets are currently unable to sample the entire space of rearrangements while allowing for copy number variations (CNV). In addition, rearrangement theory has up to now focused on fully assembled genomes, not on fragmentary observations on mixed genome populations. This affects the applicability of current methods to actual cancer datasets, which are produced from short read sequencing of a heterogeneous population of cells. We show how basic linear algebra can be used to describe and sample the set of possible sequences of SVs, extending the double cut and join (DCJ) model into the analysis of metagenomes. We also describe a functional pipeline which was run on simulated as well as experimental cancer datasets.
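For readers unfamiliar with the double cut and join (DCJ) operation mentioned in the abstract, here is a minimal sketch of a single DCJ on a toy circular genome. It is not the authors' algebraic framework or pipeline. The adjacency-set representation, the `dcj` function, and the extremity names (`a_h` for the head of gene `a`, `a_t` for its tail) are illustrative assumptions. Telomeres, duplications, deletions and sampling are deliberately left out.

```python
from typing import FrozenSet, Iterable, Set

# An adjacency is an unordered pair of gene extremities, e.g. {"a_h", "b_t"}.
Adjacency = FrozenSet[str]


def dcj(adjacencies: Set[Adjacency],
        first: Iterable[str],
        second: Iterable[str],
        swap: bool = False) -> Set[Adjacency]:
    """Cut two adjacencies and rejoin the four free ends.

    With adjacencies {p, q} and {r, s} (each sorted by name), the result
    contains {p, r} and {q, s}, or {p, s} and {q, r} when swap is True.
    The input set is not modified.
    """
    a, b = frozenset(first), frozenset(second)
    if a == b or a not in adjacencies or b not in adjacencies:
        raise ValueError("need two distinct adjacencies present in the genome")
    p, q = sorted(a)
    r, s = sorted(b)
    joined = [{p, s}, {q, r}] if swap else [{p, r}, {q, s}]
    return (adjacencies - {a, b}) | {frozenset(j) for j in joined}


# Toy circular genome with gene order a, b, c (hypothetical example data).
genome = {
    frozenset({"a_h", "b_t"}),
    frozenset({"b_h", "c_t"}),
    frozenset({"c_h", "a_t"}),
}

# Rejoining one way inverts gene b in place.
inverted = dcj(genome, ("a_h", "b_t"), ("b_h", "c_t"))

# Rejoining the other way excises gene b as its own circle.
excised = dcj(genome, ("a_h", "b_t"), ("b_h", "c_t"), swap=True)
```

The same two cuts give either an inversion or an excision depending on how the ends are rejoined, which is why one operation can stand in for several classical rearrangement types.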

# Major changes in the core developmental pathways of nematodes: Romanomermis culicivorax reveals the derived status of the Caenorhabditis elegans model

Philipp H. Schiffer, Michael Kroiher, Christopher Kraus, Georgios D. Koutsovoulos, Sujai Kumar, Julia I. R. Camps, Ndifon A. Nsah, Dominik Stappert, Krystalynne Morris, Peter Heger, Janine Altmüller, Peter Frommolt, Peter Nürnberg, W. Kelley Thomas, Mark L. Blaxter, Einhard Schierenberg
(Submitted on 17 Mar 2013)

Background: Despite its status as a model organism, the development of Caenorhabditis elegans is not necessarily archetypical for nematodes. The phylum Nematoda is divided into the Chromadorea (includes C. elegans) and the Enoplea. Compared to C. elegans, enoplean nematodes have very different patterns of cell division and determination. Embryogenesis of the enoplean Romanomermis culicivorax has been studied in great detail, but the genetic circuitry underpinning development in this species is unknown.

Results: We created a draft genome of R. culicivorax and compared its developmental gene content with that of two nematodes, C. elegans and Trichinella spiralis (another enoplean), and a representative arthropod Tribolium castaneum. This genome evidence shows that R. culicivorax retains components of the conserved metazoan developmental toolkit lost in C. elegans. T. spiralis has independently lost even more of the toolkit than has C. elegans. However, the C. elegans toolkit is not simply depauperate, as many genes essential for embryogenesis in C. elegans are unique to this lineage, or have only extremely divergent homologues in R. culicivorax and T. spiralis. These data imply fundamental differences in the genetic programmes for early cell specification, inductive interactions, vulva formation and sex determination.

Conclusions: Thus nematodes, despite their apparent phylum-wide morphological conservatism, have evolved major differences in the molecular logic of their development. R. culicivorax serves as a tractable, contrasting model to C. elegans for understanding how divergent genomic and thus regulatory backgrounds can generate a conserved phenotype. The availability of the draft genome will promote use of R. culicivorax as a research model.

# A Unifying Parsimony Model of Genome Evolution

Benedict Paten, Daniel R. Zerbino, Glenn Hickey, David Haussler
(Submitted on 9 Mar 2013)

The study of molecular evolution rests on the classical fields of population genetics and systematics, but the increasing availability of DNA sequence data has broadened the field in the last decades, leading to new theories and methodologies. This includes parsimony and maximum likelihood methods of phylogenetic tree estimation, the theory of genome rearrangements, and the coalescent model with recombination. These all interact in the study of genome evolution, yet to date they have only been pursued in isolation. We present the first unified parsimony framework for the study of genome evolutionary histories that includes all of these aspects, proposing a graphical data structure called a history graph that is intended to form a practical basis for analysis. We define tractable upper and lower bound parsimony cost functions on history graphs that incorporate both substitutions and rearrangements. We demonstrate that these bounds become tight for a special unambiguous type of history graph called an ancestral variation graph (AVG), which captures in its combinatorial structure the operations required in an evolutionary history. For an input history graph G, we demonstrate that there exists a finite set of interpretations of G that contains all minimal (lacking extraneous elements) and most parsimonious AVG interpretations of G. We define a partial order over this set and an associated set of sampling moves that can be used to explore these DNA histories. These results generalise and conceptually simplify the problem so that we can sample evolutionary histories using parsimony cost functions that account for all substitutions and rearrangements in the presence of duplications.

# A Model-Based Analysis of GC-Biased Gene Conversion in the Human and Chimpanzee Genomes

John A. Capra, Melissa J. Hubisz, Dennis Kostka, Katherine S. Pollard, Adam Siepel
(Submitted on 9 Mar 2013)

GC-biased gene conversion (gBGC) is a recombination-associated process that favors the fixation of G/C alleles over A/T alleles. In mammals, gBGC is hypothesized to contribute to variation in GC content, rapidly evolving sequences, and the fixation of deleterious mutations, but its prevalence and general functional consequences remain poorly understood. gBGC is difficult to incorporate into models of molecular evolution and so far has primarily been studied using summary statistics from genomic comparisons. Here, we introduce a new probabilistic model that captures the joint effects of natural selection and gBGC on nucleotide substitution patterns, while allowing for correlations along the genome in these effects. We implemented our model in a computer program, called phastBias, that can accurately detect gBGC tracts ~1 kilobase or longer in simulated sequence alignments. When applied to real primate genome sequences, phastBias predicts gBGC tracts that cover roughly 0.3% of the human and chimpanzee genomes and account for 1.2% of human-chimpanzee nucleotide differences. These tracts fall in clusters, particularly in subtelomeric regions; they are enriched for recombination hotspots and fast-evolving sequences; and they display an ongoing fixation preference for G and C alleles. We also find some evidence that they contribute to the fixation of deleterious alleles, including an enrichment for disease-associated polymorphisms. These tracts provide a unique window into historical recombination processes along the human and chimpanzee lineages; they supply additional evidence of long-term conservation of megabase-scale recombination rates accompanied by rapid turnover of hotspots. Together, these findings shed new light on the evolutionary, functional, and disease implications of gBGC. The phastBias program and our predicted tracts are freely available.

# Slow evolution of vertebrates with large genomes

Bianca Sclavi, John Herrick

Darwin introduced the concept of the “living fossil” to describe species belonging to lineages that have experienced little evolutionary change, and suggested that species in more slowly evolving lineages are more prone to extinction (1). Recent studies revealed that some living fossils such as the lungfish are indeed evolving more slowly than other vertebrates (2, 3). The reason for the slower rate of evolution in these lineages remains unclear, but the same observations suggest a possible genome size effect on rates of evolution. Genome size (C-value) in vertebrates varies over 200 fold ranging from pufferfish (0.4 pg) to lungfish (132.8 pg) (4). Variation in genome size and architecture is a fundamental cellular adaptation that remains poorly understood (5). C-value is correlated with several allometric traits such as body size and developmental rates in many, but not all, organisms (6, 7). To date, no consensus exists concerning the mechanisms driving genome size evolution or the effect that genome size has on species traits such as evolutionary rates (8-12). In the following we show that: 1) within the same range of divergence times, genetic diversity decreases as genome size increases and 2) average rates of molecular evolution decline with increasing genome size in vertebrates. Together, these observations indicate that genome size is an important factor influencing rates of speciation and extinction.

# Our Paper: Transcript length mediates developmental timing of gene expression across Drosophila.

This guest post is a commentary by Carlo Artieri on “Transcript length mediates developmental timing of gene expression across Drosophila” by Artieri, C.G. and H.B. Fraser. The preprint is arXived here.

We have recently posted a preprint manuscript to arXiv that tests a decades-old hypothesis about how biological aspects of development constrain gene structure, using several genome-scale transcriptional timecourses, and interprets its effects in the context of Drosophila evolution. The paper may be of particular interest to researchers using genomic data in evo-devo studies.

During the early stages of identification and characterization of homeobox domain (HOX) genes and their related regulators, it was noted that they activated in a temporally sequential manner roughly correlated to their pre-mRNA transcript length (i.e., short genes express early, followed by longer genes). This led to the hypothesis that this pattern was produced by a purely physical mechanism (Gubb 1986): genes with long pre-mRNAs cannot complete transcription in the interval between the rapid cell cycles taking place during early insect development, leading to abortive, non-functional transcripts. As long pre-mRNAs result primarily from long introns, this was termed ‘Intron Delay’.
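The physical argument behind intron delay can be made concrete with a short back-of-the-envelope sketch. This is only an illustration of the hypothesis, not the analysis in the preprint. The elongation rate (1.5 kb/min) and the transcription window (5 minutes) are placeholder assumptions chosen for the example, and the function names are hypothetical. Only the 5 kb long/short threshold comes from the post.

```python
# Threshold used in the post to separate long from short pre-mRNAs.
LONG_GENE_THRESHOLD_KB = 5.0


def is_long(pre_mrna_kb: float) -> bool:
    """Classify a gene as 'long' using the post's >= 5 kb criterion."""
    return pre_mrna_kb >= LONG_GENE_THRESHOLD_KB


def transcription_minutes(pre_mrna_kb: float,
                          elongation_kb_per_min: float) -> float:
    """Time for one polymerase to traverse the whole pre-mRNA."""
    if elongation_kb_per_min <= 0:
        raise ValueError("elongation rate must be positive")
    return pre_mrna_kb / elongation_kb_per_min


def completes_within_cycle(pre_mrna_kb: float,
                           window_min: float,
                           elongation_kb_per_min: float) -> bool:
    """True if a full transcript can be made before the next mitosis."""
    needed = transcription_minutes(pre_mrna_kb, elongation_kb_per_min)
    return needed <= window_min


# Assumed, illustrative parameters (not taken from the preprint).
ASSUMED_RATE_KB_PER_MIN = 1.5
ASSUMED_WINDOW_MIN = 5.0

for length_kb in (2.0, 20.0):
    ok = completes_within_cycle(length_kb, ASSUMED_WINDOW_MIN,
                                ASSUMED_RATE_KB_PER_MIN)
    print(f"{length_kb} kb, long={is_long(length_kb)}, completes={ok}")
```

Under these assumed numbers a 2 kb gene finishes with time to spare, while a 20 kb gene would need more than twice the available window, so its transcripts would be aborted at each mitosis.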

We explored patterns of expression of genes in D. melanogaster over two embryonic timescales: eight time points spanning the latter part of the early embryonic ‘syncytial cycles’, during which the most rapid cell cycles take place, and 12 time points spanning the ~24 hours of embryogenesis. Long genes (≥ 5 kb long pre-mRNA transcripts) expressed from the zygotic genome showed a lag in the time required to reach stable levels of expression relative to short genes (< 5 kb) in both timecourses; in fact, stable expression of long genes did not occur until ~12 hours into embryogenesis, or midway between fertilization and emergence of the larva from the egg. No such pattern was observed among long or short genes that are maternally deposited in the embryo, as is expected if inability to terminate transcription is the driving mechanism behind this delay. Additional embryonic timecourse data from RNA-Seq libraries generated from non-poly-A-selected total RNA, and therefore not biased towards capture of processed RNAs, showed that only long zygotic genes expressed during the earliest developmental time points show a marked deficiency in 3’ relative to 5’ derived reads. This is consistent with their inability to terminate transcription, but not with transcriptional delay due to reduced transcriptional activation during early development.

The analysis was extended using developmental expression data from three additional Drosophila species spanning ~60 million years of evolution and showed that this pattern of delayed expression of long zygotically expressed genes is conserved across the phylogeny. This led us to predict that short zygotically expressed genes that are conserved in their ability to escape intron delay would be under substantial evolutionary pressure to maintain their compact lengths. We found that this was the case when they were compared to long zygotic genes or to either short or long maternally deposited genes.

We suggest that intron delay is an underappreciated mechanism affecting the expression level of a substantial fraction of the Drosophila embryonic transcriptome (~10%) and acts as a source of significant constraint on the structural evolution of important developmental genes.

References:
Gubb D. 1986. Intron‐delay and the precision of expression of homoeotic gene products in Drosophila. Developmental Genetics 7: 119–131

# Transcript length mediates developmental timing of gene expression across Drosophila

Carlo G. Artieri, Hunter B. Fraser
(Submitted on 18 Jan 2013)

The time required to transcribe genes with long primary transcripts may limit their ability to be expressed in cells with short mitotic cycles, a phenomenon termed intron delay. As such short cycles are a hallmark of the earliest stages of insect development, we used Drosophila developmental timecourse expression data to test whether intron delay affects gene expression genome-wide, and to determine its consequences for the evolution of gene structure. We find that long zygotically expressed, but not maternally deposited, genes show substantial delay in expression relative to their shorter counterparts and that this delay persists over a substantial portion of the ~24 hours of embryogenesis. Patterns of RNA-seq coverage from the 5′ and 3′ ends of transcripts show that this delay is consistent with their inability to terminate transcription, but not with transcriptional initiation-based regulatory control. Highly expressed zygotic genes are subject to purifying selection to maintain compact transcribed regions, allowing conservation of embryonic expression patterns across the Drosophila phylogeny. We propose that intron delay is an underappreciated physical mechanism affecting both patterns of expression as well as gene structure of many genes across Drosophila.

# Loss of amyloid disaggregases during the evolution of Metazoa

Albert Erives, Jan Fassler
(Submitted on 15 Jan 2013)

In yeast, phenotypic adaptations can evolve by natural selection of conformational variant prions and their variant amyloid fibers. This system requires the Hsp104 disaggregase, which fragments amyloid fibers into smaller seed prions that are passed on to mitotic descendants and meiotic spores. Interestingly, Hsp104 is found in diverse eukaryotes except metazoans. To investigate whether a prion-based transmission “genetics” was incompatible with the evolution of Metazoa, we identify genes conserved in fungi and choanoflagellates but lost in animals. We show that both eukaryotic clpB amyloid disaggregases, HSP104 and its nuclear-encoded mitochondrial endo-ortholog HSP78, were lost in the stem-metazoan lineage along with only a small number of other relevant genes. We show that these gene losses are not unrelated historical accidents because these loci comprise a very small regulon devoted to prion transmission in yeast. We propose that evolution of developmental asymmetric cell-specifications necessitated the evolutionary deprecation of the ancient clpB system.

# Horizontal gene transfer may explain variation in θs

Rohan Maddamsetti, Philip J. Hatcher, Stéphane Cruveiller, Claudine Médigue, Jeffrey E. Barrick, Richard E. Lenski
(Submitted on 28 Sep 2012)

Martincorena et al. estimated synonymous diversity ($\theta_s = 2N\mu$) across 2,930 orthologous gene alignments from 34 Escherichia coli genomes, and found substantial variation among genes in the density of synonymous polymorphisms. They argue that this pattern reflects variation in the mutation rate per nucleotide ($\mu$) among genes. However, the effective population size ($N$) is not necessarily constant across the genome. In particular, different genes may have different histories of horizontal gene transfer (HGT), whereas Martincorena et al. used a model with random recombination to calculate $\theta_s$. They did filter alignments in an effort to minimize the effects of HGT, but we doubt that any procedure can completely eliminate HGT among closely related genomes, such as E. coli living in the complex gut community.
Here we show that there is no significant variation among genes in rates of synonymous substitutions in a long-term evolution experiment with E. coli and that the per-gene rates are not correlated with $\theta_s$ estimates from genome comparisons. However, there is a significant association between $\theta_s$ and HGT events. Together, these findings imply that $\theta_s$ variation reflects different histories of HGT, not local optimization of mutation rates to reduce the risk of deleterious mutations as proposed by Martincorena et al.
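
The core of the argument is that θs is a product of two quantities, so a difference in θs between genes cannot by itself say which factor changed. The sketch below illustrates this with made-up numbers. The function name and all parameter values are hypothetical and are not estimates from either study.

```python
import math


def theta_s(effective_population_size: float, mutation_rate: float) -> float:
    """Synonymous diversity under the relation used in the abstract: 2 * N * mu."""
    return 2.0 * effective_population_size * mutation_rate


# Hypothetical baseline gene (illustrative values only).
baseline = theta_s(1e8, 1e-10)

# Same mutation rate, but a doubled effective population size for this gene,
# as might be caused by a different history of horizontal gene transfer.
doubled_n = theta_s(2e8, 1e-10)

# Same effective population size, but a doubled per-nucleotide mutation rate.
doubled_mu = theta_s(1e8, 2e-10)

# Both scenarios give the same diversity, twice the baseline.
same_diversity = math.isclose(doubled_n, doubled_mu)
print(baseline, doubled_n, doubled_mu, same_diversity)
```

Because the two scenarios are indistinguishable from θs alone, separating them needs independent evidence, such as per-gene substitution rates from an evolution experiment or inferred HGT events.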