Slow evolution of vertebrates with large genomes

Bianca Sclavi, John Herrick

Darwin introduced the concept of the “living fossil” to describe species belonging to lineages that have experienced little evolutionary change, and suggested that species in more slowly evolving lineages are more prone to extinction (1). Recent studies revealed that some living fossils such as the lungfish are indeed evolving more slowly than other vertebrates (2, 3). The reason for the slower rate of evolution in these lineages remains unclear, but the same observations suggest a possible genome size effect on rates of evolution. Genome size (C-value) in vertebrates varies more than 200-fold, ranging from pufferfish (0.4 pg) to lungfish (132.8 pg) (4). Variation in genome size and architecture is a fundamental cellular adaptation that remains poorly understood (5). C-value is correlated with several allometric traits such as body size and developmental rates in many, but not all, organisms (6, 7). To date, no consensus exists concerning the mechanisms driving genome size evolution or the effect that genome size has on species traits such as evolutionary rates (8-12). In the following we show that 1) within the same range of divergence times, genetic diversity decreases as genome size increases, and 2) average rates of molecular evolution decline with increasing genome size in vertebrates. Together, these observations indicate that genome size is an important factor influencing rates of speciation and extinction.

Our Paper: Transcript length mediates developmental timing of gene expression across Drosophila.

This guest post is a commentary by Carlo Artieri on “Transcript length mediates developmental timing of gene expression across Drosophila” by Artieri, C.G. and H.B. Fraser. The preprint is arXived here.

We have recently posted a preprint manuscript to arXiv that tests a decades-old hypothesis about how biological aspects of development constrain gene structure, using several genome-scale transcriptional timecourses, and interprets its effects in the context of Drosophila evolution. The paper may be of particular interest to researchers using genomic data in evo-devo studies.

During the early stages of identification and characterization of homeobox domain (HOX) genes and their related regulators, it was noted that they activated in a temporally sequential manner roughly correlated to their pre-mRNA transcript length (i.e., short genes express early, followed by longer genes). This led to the hypothesis that this pattern was produced by a purely physical mechanism (Gubb 1986): genes with long pre-mRNAs cannot complete transcription in the interval between the rapid cell cycles taking place during early insect development, leading to abortive, non-functional transcripts. As long pre-mRNAs result primarily from long introns, this was termed ‘Intron Delay’.
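
The arithmetic behind this hypothesis can be illustrated with a minimal sketch. The elongation rate (1.5 kb per minute) and the 10-minute interphase used below are illustrative assumptions chosen for the example, not values taken from the paper, and the function names are hypothetical.

```python
def transcription_minutes(pre_mrna_kb, elongation_kb_per_min=1.5):
    """Time needed to transcribe one full-length pre-mRNA.

    elongation_kb_per_min is an assumed, illustrative polymerase speed.
    """
    return pre_mrna_kb / elongation_kb_per_min


def can_complete(pre_mrna_kb, interphase_min, elongation_kb_per_min=1.5):
    """True if a transcript started at the beginning of interphase
    could be finished before the next mitosis aborts it."""
    return transcription_minutes(pre_mrna_kb, elongation_kb_per_min) <= interphase_min


# A 3 kb gene versus a 30 kb gene in an assumed 10-minute interphase.
short_ok = can_complete(3, interphase_min=10)
long_ok = can_complete(30, interphase_min=10)
```

Under these assumed numbers the short gene finishes in about 2 minutes, while the long gene would need about 20 minutes and so could not be completed within a single rapid cycle.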

We explored patterns of expression of genes in D. melanogaster over two embryonic timescales: eight time points spanning the latter part of the early embryonic ‘syncytial cycles’, during which the most rapid cell cycles take place, and 12 time points spanning the ~24 hours of embryogenesis. Long genes (pre-mRNA transcripts ≥ 5 kb) expressed from the zygotic genome showed a lag in the time required to reach stable levels of expression relative to short genes (< 5 kb) in both timecourses; in fact, stable expression of long genes did not occur until ~12 hours into embryogenesis, or midway between fertilization and emergence of the larva from the egg. No such pattern was observed among long or short genes that are maternally deposited in the embryo, as is expected if inability to terminate transcription is the driving mechanism behind this delay. Additional embryonic timecourse data from RNA-Seq libraries generated from non-poly-A-selected total RNA, and therefore not biased towards capture of processed RNAs, showed that only long zygotic genes expressed during the earliest developmental time points show a marked deficiency in 3’-derived relative to 5’-derived reads. This is consistent with their inability to terminate transcription, but not with transcriptional delay due to reduced transcriptional activation during early development.
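
One simple way to quantify a deficiency of 3’ relative to 5’ reads is a log ratio of coverage at the two ends of a gene. The sketch below is an illustration only and not necessarily the statistic used in the paper; the window size, the pseudocount, and the function name `end_bias` are assumptions.

```python
import math


def end_bias(coverage, window=200, pseudocount=1.0):
    """log2 ratio of mean 3' coverage to mean 5' coverage.

    coverage: per-base read depth along the transcript, ordered 5' -> 3'.
    window and pseudocount are assumed, illustrative parameters.
    Values well below zero indicate a shortage of 3'-derived reads.
    """
    if len(coverage) < 2 * window:
        raise ValueError("transcript shorter than two windows")
    five_prime = sum(coverage[:window]) / window
    three_prime = sum(coverage[-window:]) / window
    return math.log2((three_prime + pseudocount) / (five_prime + pseudocount))


# Uniform coverage gives no bias; coverage that stops halfway gives a negative score.
uniform = end_bias([10] * 400)
truncated = end_bias([20] * 200 + [0] * 200)
```

A gene whose transcripts are aborted before reaching the 3’ end would score strongly negative, whereas a gene that is simply not yet activated would show low coverage at both ends and a score near zero.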

The analysis was extended using developmental expression data from three additional Drosophila species spanning ~60 million years of evolution and showed that this pattern of delayed expression of long zygotically expressed genes is conserved across the phylogeny. This led us to predict that short zygotically expressed genes that are conserved in their ability to escape intron delay would be under substantial evolutionary pressure to maintain their compact lengths, and we found that this was the case when compared to long zygotic genes or to either short or long maternally deposited genes.

We suggest that intron delay is an underappreciated mechanism affecting the expression level of a substantial fraction of the Drosophila embryonic transcriptome (~10%) and acts as a source of significant constraint on the structural evolution of important developmental genes.

References:
Gubb D. 1986. Intron‐delay and the precision of expression of homoeotic gene products in Drosophila. Developmental Genetics 7: 119–131

Transcript length mediates developmental timing of gene expression across Drosophila

Carlo G. Artieri, Hunter B. Fraser
(Submitted on 18 Jan 2013)

The time required to transcribe genes with long primary transcripts may limit their ability to be expressed in cells with short mitotic cycles, a phenomenon termed intron delay. As such short cycles are a hallmark of the earliest stages of insect development, we used Drosophila developmental timecourse expression data to test whether intron delay affects gene expression genome-wide, and to determine its consequences for the evolution of gene structure. We find that long zygotically expressed, but not maternally deposited, genes show substantial delay in expression relative to their shorter counterparts and that this delay persists over a substantial portion of the ~24 hours of embryogenesis. Patterns of RNA-seq coverage from the 5′ and 3′ ends of transcripts show that this delay is consistent with their inability to terminate transcription, but not with transcriptional initiation-based regulatory control. Highly expressed zygotic genes are subject to purifying selection to maintain compact transcribed regions, allowing conservation of embryonic expression patterns across the Drosophila phylogeny. We propose that intron delay is an underappreciated physical mechanism affecting both patterns of expression as well as gene structure of many genes across Drosophila.

Loss of amyloid disaggregases during the evolution of Metazoa

Albert Erives, Jan Fassler
(Submitted on 15 Jan 2013)

In yeast, phenotypic adaptations can evolve by natural selection of conformational variant prions and their variant amyloid fibers. This system requires the Hsp104 disaggregase, which fragments amyloid fibers into smaller seed prions that are passed on to mitotic descendants and meiotic spores. Interestingly, Hsp104 is found in diverse eukaryotes except metazoans. To investigate whether a prion-based transmission “genetics” was incompatible with the evolution of Metazoa, we identify genes conserved in fungi and choanoflagellates but lost in animals. We show that both eukaryotic clpB amyloid disaggregases, HSP104 and its nuclear-encoded mitochondrial endo-ortholog HSP78, were lost in the stem-metazoan lineage along with only a small number of other relevant genes. We show that these gene losses are not unrelated historical accidents because these loci comprise a very small regulon devoted to prion transmission in yeast. We propose that evolution of developmental asymmetric cell-specifications necessitated the evolutionary deprecation of the ancient clpB system.

Horizontal gene transfer may explain variation in θs

Rohan Maddamsetti, Philip J. Hatcher, Stéphane Cruveiller, Claudine Médigue, Jeffrey E. Barrick, Richard E. Lenski
(Submitted on 28 Sep 2012)

Martincorena et al. estimated synonymous diversity (θs = 2Nμ) across 2,930 orthologous gene alignments from 34 Escherichia coli genomes, and found substantial variation among genes in the density of synonymous polymorphisms. They argue that this pattern reflects variation in the mutation rate per nucleotide (μ) among genes. However, the effective population size (N) is not necessarily constant across the genome. In particular, different genes may have different histories of horizontal gene transfer (HGT), whereas Martincorena et al. used a model with random recombination to calculate θs. They did filter alignments in an effort to minimize the effects of HGT, but we doubt that any procedure can completely eliminate HGT among closely related genomes, such as E. coli living in the complex gut community.
Here we show that there is no significant variation among genes in rates of synonymous substitutions in a long-term evolution experiment with E. coli and that the per-gene rates are not correlated with θs estimates from genome comparisons. However, there is a significant association between θs and HGT events. Together, these findings imply that θs variation reflects different histories of HGT, not local optimization of mutation rates to reduce the risk of deleterious mutations as proposed by Martincorena et al.
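
The core of the argument is that θs = 2Nμ is a product, so a difference in θs between two genes cannot by itself say which factor differs. A minimal numerical sketch follows; the values of N and μ are invented for illustration and the function name is hypothetical.

```python
import math


def theta_s(n_effective, mu):
    """Expected synonymous diversity under theta_s = 2 * N * mu."""
    return 2.0 * n_effective * mu


# Illustrative, made-up values: gene B has twice the diversity of gene A.
gene_a = theta_s(n_effective=1e8, mu=1e-10)

# Explanation 1: the mutation rate of gene B is doubled.
gene_b_mu = theta_s(n_effective=1e8, mu=2e-10)

# Explanation 2: the effective population size for gene B is doubled
# (for example through a different history of horizontal transfer).
gene_b_n = theta_s(n_effective=2e8, mu=1e-10)

indistinguishable = math.isclose(gene_b_mu, gene_b_n)
```

Both explanations give the same θs, which is why an independent estimate of per-gene mutation rates, such as the one from the long-term evolution experiment, is needed to separate them.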

Our paper: An age-of-allele test of neutrality for transposable element insertions not at equilibrium

[This author post is by Justin Blumenstiel and Casey Bergman on An age-of-allele test of neutrality for transposable element insertions not at equilibrium, available from the arXiv here]

Studies over the past several decades in Drosophila melanogaster have demonstrated that TE insertion alleles in natural populations tend to segregate at low frequency, particularly in regions of the genome that have a high recombination rate where natural selection is most effective. These results have largely supported a model where natural selection acts to remove deleterious TE insertions from the genome. The prevailing model of why TE insertions are deleterious is that they lead to chromosomal aberrations that occur when dispersed, non-allelic repeated sequences cross over with one another. This model is known as the ectopic recombination model and it has an important feature. Since each new insertion has the potential to recombine with all the other copies in the genome, fitness will go down faster and faster with each new copy. This yields a stable equilibrium in TE copy number.
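
Why an accelerating fitness cost produces a stable equilibrium can be seen in a small deterministic sketch. This is an illustrative toy model, not the model developed in the paper: it assumes a log-fitness of the form −a·n − b·n²/2, a constant transposition rate u per copy, no excision, and a Poisson-like distribution of copy number, and all parameter values are invented.

```python
def mean_copy_change(n, u, a, b):
    """Approximate per-generation change in mean copy number n.

    u: transposition rate per copy (assumed constant).
    a, b: coefficients of the assumed log-fitness -a*n - b*n**2/2, so the
    marginal cost of one more copy, a + b*n, grows with copy number.
    """
    return n * (u - a - b * n)


def equilibrium_copy_number(u, a, b):
    """Copy number at which transposition and selection balance."""
    return (u - a) / b


def iterate(n0, u, a, b, generations):
    """Follow the mean copy number forward from n0."""
    n = n0
    for _ in range(generations):
        n += mean_copy_change(n, u, a, b)
    return n


# Invented parameters: the equilibrium is (0.01 - 0.001) / 0.0001 = 90 copies.
n_star = equilibrium_copy_number(u=0.01, a=0.001, b=0.0001)
from_below = iterate(1.0, 0.01, 0.001, 0.0001, 5000)
from_above = iterate(200.0, 0.01, 0.001, 0.0001, 5000)
```

Starting from either 1 copy or 200 copies, the mean copy number moves toward the same value, which is the sense in which the equilibrium is stable. With b = 0 the marginal cost would be constant and no such interior equilibrium would exist in this toy model.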

But are TEs at equilibrium in natural populations? Genome sequencing studies have shown that the rate of TE proliferation can vary widely over time and any given TE family may demonstrate non-equilibrium “boom and bust” behavior. How do we reconcile studies that assume equilibrium with the fact that we know TE dynamics are not at equilibrium? To deal with this problem, I began developing this model out of a class project with John Wakeley while I was a graduate student over a decade ago. The model arose out of some work I published in 2002 with Hartl and Lozovsky on the age structure of non-LTR elements in D. melanogaster. I wrote this model up for my Ph.D. thesis and presented a preliminary version in a paper with Neafsey and Hartl in 2004, but it sat on the back burner until I reviewed a paper by Bergman and Bensasson in 2007 that showed many TE families in D. melanogaster have recently inserted in the genome and may not be at equilibrium.

Shortly after their paper came out I contacted Casey with the model from my thesis and we decided to push this idea forward as a collaboration, which has taken a few years to come to fruition (both of us being busy with other projects and starting our labs). Things started to really move ahead when Miaomiao He in Casey’s lab generated a crucial data set that could be specifically applied to the model – strain-specific presence/absence data for a very large number of TE insertions ascertained from the D. melanogaster genome sequence. After a few more years with it on simmer, working out several kinks in the meantime (e.g. incorporating host demography, trying many different methods for estimating the posterior distribution of TE ages), Casey and I finally wrapped it up just as Haldane’s Sieve is starting to hit its stride. I expect that all my papers in the future will be pre-released on arXiv.

I could speak at length on the specific results, but I would just be saying what is already in the abstract. So, I would like to bring up three points for potential conversation.

First, what does it mean for TEs to be at transposition-selection balance when we know different TE families show a signature of “boom and bust” in genome sequences? There may be one way to reconcile this apparent problem. Any particular TE family may in fact not be at transposition-selection balance. For example, the P element, which invaded Drosophila melanogaster only a few decades ago, is hardly at transposition-selection balance. Therefore, one must be careful in using insertion frequencies for P elements to describe general TE dynamics. However, by integrating over all TE families in the genome, one may in fact reach an approximation that might be reasonable for assuming equilibrium transposition-selection balance. But one must be careful of something I call “family ascertainment bias”. Sometimes the most recently activated TEs are the ones easiest to discover and annotate because these ones are easily cloned from insertion mutations or are most frequent in genome sequences.

Second, in this paper, we derive the probability distribution for each individual TE insertion frequency based on its age. We demonstrate that this provides a method for identifying TE insertions that are either positively or negatively selected. In the cases where we show allele frequencies are lower than expected (i.e. predicted to be negatively selected), many of these are copies that have zero substitutions. In principle, all of these could have inserted one generation before the reference strain was collected for genome sequencing. The inference that selection is acting against these TEs implicitly assumes either: 1) this wasn’t the case for many of these insertions, and the posterior distribution of ages is a good representation of the true age distribution, or 2) it may have been the case, but natural selection has already acted to remove slightly older TEs from the population, therefore making them absent from the genome sequence.

Third, when putting the finishing touches on our analysis of TE insertion data in North America, we ran up against the issue that nobody has yet published an explicit demographic scenario for North American populations of D. melanogaster, similar to those that have been developed by Wolfgang Stephan’s Lab and others for European and African populations. We found one paper by Yukilevich et al (2010) from John True’s Lab that generated similar findings to the demography of European populations, which is consistent with the idea that North American populations of D. melanogaster are mainly derived from European ancestors. However, Yukilevich et al (2010) didn’t explicitly model the admixture with African populations, which is known to occur in North American populations as shown by Caracristi and Schlötterer in 2003. We were surprised that an explicit admixture scenario has not been published yet, especially since this is crucial for interpreting the data from population genomic projects like the Drosophila Genetic Reference Panel. This should be an important line of work for someone to pursue (if it isn’t being done already), and if anyone has information about a demographic model for North American populations of D. melanogaster, we’d be keen to know more so we can see if it might improve our analysis.

Justin and Casey

Protein function influences frequency of encoded regions containing VNTRs and number of unique interactions

Suzanne Bowen
(Submitted on 25 Sep 2012)

Proteins encoded by genes containing regions of variable number tandem repeats (VNTRs) are known to be polymorphic within species but the influence of their instability in molecular interactions remains unclear. VNTRs are overrepresented in the coding sequence of particular functional groups where their presence could influence protein interactions. Using human consensus coding sequence, this work examines whether genomic instability, determined by regions of VNTRs, influences the number of protein interactions. Findings reveal that, in relation to protein function, the frequency of unique interactions in human proteins increases with the number of repeated regions. This supports experimental evidence that repeat expansion may lead to an increase in molecular interactions. Genetic diversity, estimated by Ka/Ks, appeared to decrease as the number of protein-protein interactions increased. Additionally, G+C and CpG content were negatively correlated with increasing occurrence of VNTRs. This may indicate that nucleotide composition along with selective processes can increase genomic stability and thereby restrict the expansion of repeated regions. Proteins involved in acetylation are associated with a high number of repeated regions and interactions but a low G+C and CpG content. In contrast, less interactive membrane proteins contain a lower number of repeated regions but higher levels of G+C and CpGs. This work provides further evidence that VNTRs may provide the genetic variability to generate unique interactions between proteins.

Our paper: Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution

This guest post is by Daniël Melters [@DPMelters] and Keith Bradnam [@kbradnam] on their paper [along with co-authors]: Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. ArXived here.

The centromere poses an interesting paradox; although its function is essential, its molecular components are fast evolving. Centromeres in many animal and plant genomes have been characterized by the presence of large tandem repeat arrays. Numerous studies have suggested that the composition and length of the repeat units that comprise these arrays vary between species.
In this paper we tried to answer three main questions:
1) Can we identify the candidate centromere repeat sequences in genomes from hundreds of different species?
2) Do candidate centromere repeat sequences from different species share any common properties (sequence composition, length, GC% etc)?
3) How do these tandem repeats evolve?
To answer these questions, we took advantage of the large number of species with publicly available whole genome shotgun sequence data from various sequencing platforms. In total we analyzed 282 animal and plant genomes for the presence of high copy tandem repeat sequences, with the assumption that the most abundant tandem repeat is a good candidate for the centromere repeat.

We found high copy tandem repeats in the vast majority of the 282 genomes that we analyzed. For the smaller number of species with published cytology data, we correctly identified the published repeat sequence in 38 out of 43 cases. This supports our assumption that the most abundant tandem repeat in any genome is likely to be the centromere repeat. In the five cases where we did not find the published centromere tandem repeats, we did not have data from sequencing platforms that would have allowed us to identify these repeats.

If an individual sequencing read contains at least four tandem repeats, then there is the possibility of detecting higher order repeat (HOR) structure, i.e. where a tandem array is made up of two alternating types of related sequence (A and B) to produce an A->B->A->B structure. In these cases, the AB dimer is more similar to other AB dimers than A is to B. We found that HOR structure was surprisingly common in the candidate centromere repeats of many different species. The very long reads from Pacific Biosciences (PacBio) sequencing allowed us to further characterize repeat structure in great detail (for a few selected species), and this revealed additional levels of HOR structure.
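
The idea behind detecting a dimeric HOR can be sketched by comparing consecutive monomers within one read: in an A->B->A->B array, monomers two steps apart should be more alike than adjacent ones. This is a simplified illustration and not the pipeline used in the paper; it assumes the read has already been cut into equal-length monomers, uses plain position-by-position identity instead of alignment, and the function names are hypothetical.

```python
def identity(x, y):
    """Fraction of matching positions between two equal-length monomers."""
    if len(x) != len(y):
        raise ValueError("monomers must have equal length in this sketch")
    return sum(c1 == c2 for c1, c2 in zip(x, y)) / len(x)


def hor_signal(monomers):
    """Mean identity at offset 2 minus mean identity at offset 1.

    A clearly positive value suggests an alternating A->B->A->B structure.
    Requires at least four monomers from the same read.
    """
    if len(monomers) < 4:
        raise ValueError("need at least four tandem monomers")
    adjacent = [identity(monomers[i], monomers[i + 1])
                for i in range(len(monomers) - 1)]
    skipped = [identity(monomers[i], monomers[i + 2])
               for i in range(len(monomers) - 2)]
    return sum(skipped) / len(skipped) - sum(adjacent) / len(adjacent)


# Toy monomers: A and B differ at two of eight positions.
A = "ACGTACGT"
B = "ACGAACGA"
alternating = hor_signal([A, B, A, B])
homogeneous = hor_signal([A, A, A, A])
```

In the toy example the alternating array gives a positive signal, because A matches A perfectly but matches B at only six of eight positions, while the homogeneous array gives a signal of zero. Real reads would need alignment-based identity to cope with indels.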

To address the important question of ‘how similar are centromere repeats in different species?’, we performed an all-vs-all comparison between the most abundant tandem repeat in every species. Surprisingly, we found only 26 groups of species that shared any significant sequence similarity in their candidate centromere repeat sequence. The species that make up these 26 groups were always closely related species which had diverged less than 50 million years ago. When comparing the repeat sequences in these groups of closely related species, we found that repeats evolve not only by accumulation of mutations, but also by the spread of indels or by repeat doubling.

These results are in line with the ‘library’ hypothesis, which aims to describe how ratios of repeat variants can change over time. In addition, PacBio sequencing found very long tandem repeats (~1,500 bp). Furthermore, in switchgrass (Panicum virgatum) we identified several centromere repeat variants, but PacBio sequences did not show any mixing of these repeat variants. In summary, tandem repeats are frequently associated with the centromere function and most probably evolve according to the “library” hypothesis (a.k.a. molecular drive).

This paper is dedicated to the late Simon Chan, who passed away on the 22nd of August 2012 at the young age of 38 (see here for more information).

Daniël Melters and Keith Bradnam
PS. Supplementary table can be provided upon email request.

Diversity and abundance of the Abnormal chromosome 10 meiotic drive complex in Zea mays

Lisa B. Kanizay, Tanja Pyhäjärvi, Elizabeth G. Lowry, Matthew B. Hufford, Daniel G. Peterson, Jeffrey Ross-Ibarra, R. Kelly Dawe
(Submitted on 25 Sep 2012)

Maize Abnormal chromosome 10 (Ab10) contains a classic meiotic drive system that exploits asymmetry of meiosis to preferentially transmit itself and other chromosomes containing specialized heterochromatic regions called knobs. The structure and diversity of the Ab10 meiotic drive haplotype are poorly understood. We developed a BAC library from an Ab10 line and used the data to develop sequence-based markers, focusing on the proximal portion of the haplotype that shows partial homology to normal chromosome 10. These molecular and additional cytological data demonstrate that two previously identified Ab10 variants (Ab10-I and Ab10-II) share a common origin. Dominant PCR markers were used with FISH to assay 160 diverse teosinte and maize landrace populations from across the Americas, resulting in the identification of a previously unknown but prevalent form of Ab10 (Ab10-III). We find that Ab10 occurs in at least 75% of teosinte populations at a mean frequency of 15%. Ab10 was also found in 13% of the maize landraces, but does not appear to be fixed in any wild or cultivated population. Quantitative analyses suggest that the abundance and distribution of Ab10 are governed by a complex combination of intrinsic fitness effects as well as extrinsic environmental variability.

Comparative Analysis of Tandem Repeats from Hundreds of Species Reveals Unique Insights into Centromere Evolution

Daniël P. Melters, Keith R. Bradnam, Hugh A. Young, Natalie Telis, Michael R. May, J. Graham Ruby, Robert Sebra, Paul Peluso, John Eid, David Rank, José Fernando Garcia, Joseph L. DeRisi, Timothy Smith, Christian Tobias, Jeffrey Ross-Ibarra, Ian F. Korf, Simon W.-L. Chan
(Submitted on 22 Sep 2012)

Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic methods to identify high-copy tandem repeats from 282 species using publicly available genomic sequence and our own data. The assumption that the most abundant tandem repeat is the centromere DNA was true for most species whose centromeres have been previously characterized, suggesting this is a general property of genomes. Our methods are compatible with all current sequencing technologies. Long Pacific Biosciences sequence reads allowed us to find tandem repeat monomers up to 1,419 bp. High-copy centromere tandem repeats were found in almost all animal and plant genomes, but repeat monomers were highly variable in sequence composition and in length. Furthermore, phylogenetic analysis of sequence homology showed little evidence of sequence conservation beyond ~50 million years of divergence. We find that despite an overall lack of sequence conservation, centromere tandem repeats from diverse species showed similar modes of evolution, including the appearance of higher order repeat structures in which several polymorphic monomers make up a larger repeating unit. While centromere position in most eukaryotes is epigenetically determined, our results indicate that tandem repeats are highly prevalent at centromeres of both animals and plants. This suggests a functional role for such repeats, perhaps in promoting concerted evolution of centromere DNA across chromosomes.