This guest post is by Daniël Melters [@DPMelters] and Keith Bradnam [@kbradnam] on their paper [along with co-authors]: Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. ArXived here.
The centromere poses an interesting paradox: although its function is essential, its molecular components are fast evolving. Centromeres in many animal and plant genomes have been characterized by the presence of large tandem repeat arrays. Numerous studies have suggested that the composition and length of the repeat units that comprise these arrays vary between species.
In this paper we tried to answer three main questions:
1) Can we identify the candidate centromere repeat sequences in genomes from hundreds of different species?
2) Do candidate centromere repeat sequences from different species share any common properties (sequence composition, length, GC%, etc.)?
3) How do these tandem repeats evolve?
To answer these questions, we took advantage of the large number of species with publicly available whole genome shotgun sequence data from various sequencing platforms. In total we analyzed 282 animal and plant genomes for the presence of high copy tandem repeat sequences, with the assumption that the most abundant tandem repeat is a good candidate for the centromere repeat.
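The "most abundant tandem repeat" idea can be illustrated with a minimal sketch. It is not the pipeline used in the paper. It assumes that some tandem repeat finder has already reported, for each read, a monomer sequence and a copy count; the `hits` list and all function names here are hypothetical. Because the same repeat can be reported in any phase and on either strand, the sketch first reduces each monomer to a canonical form (the alphabetically smallest rotation across both strands) and then sums copy counts. Grouping by exact match is a simplification; real repeat copies differ by mutations and would need clustering by similarity.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(monomer: str) -> str:
    """Return the smallest rotation of the monomer across both strands.

    Two reports of the same repeat in a different phase, or from the
    opposite strand, map to the same canonical string.
    """
    candidates = []
    for strand in (monomer, reverse_complement(monomer)):
        candidates.extend(strand[i:] + strand[:i] for i in range(len(strand)))
    return min(candidates)


def most_abundant_repeat(hits):
    """Sum copy counts per canonical monomer and return the top (monomer, copies).

    `hits` is an iterable of (monomer_sequence, copy_count) pairs.
    Returns None when there are no hits.
    """
    totals = Counter()
    for monomer, copies in hits:
        totals[canonical(monomer.upper())] += copies
    return totals.most_common(1)[0] if totals else None


# Hypothetical toy input: three reports of one repeat, one of another.
hits = [("AAGT", 10), ("AGTA", 5), ("ACTT", 3), ("CCG", 12)]
top = most_abundant_repeat(hits)
```

In this toy input, `AGTA` is a rotation of `AAGT` and `ACTT` is its reverse complement, so all three are counted as one repeat.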
We found high copy tandem repeats in the vast majority of the 282 genomes that we analyzed. For the smaller number of species with published cytology data, we correctly identified the published repeat sequence in 38 out of 43 cases (88%). This supports our assumption that the most abundant tandem repeat in a genome is likely to be the centromere repeat. In the five cases where we did not find the published centromere tandem repeats, we did not have data from sequencing platforms that would have allowed us to identify these repeats.
If an individual sequencing read contains at least four tandem repeats, then there is the possibility of detecting higher order repeat (HOR) structure, i.e. a tandem array made up of two alternating types of related sequence (A and B) that produce an A->B->A->B structure. In these cases, an AB dimer is more similar to other AB dimers than A is to B. We found that HOR structure was surprisingly common in the candidate centromere repeats of many different species. The very long reads from Pacific Biosciences (PacBio) sequencing allowed us to further characterize repeat structure in great detail (for a few selected species), and this revealed additional levels of HOR structure.
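The A->B->A->B test described above can be sketched in a few lines. This is an illustration under strong simplifying assumptions, not the method used in the paper: the read is assumed to start at a monomer boundary, every monomer is assumed to have exactly `unit_len` bases (no indels), and similarity is a plain position-by-position identity. The function names and the `margin` threshold are hypothetical. The idea is that, in a dimeric HOR, monomers two units apart (A vs A, B vs B) are more alike than neighbouring monomers (A vs B).

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def split_monomers(read: str, unit_len: int) -> list:
    """Cut a read into consecutive, non-overlapping monomers of unit_len bases."""
    count = len(read) // unit_len
    return [read[i * unit_len:(i + 1) * unit_len] for i in range(count)]


def mean_identity_at_offset(monomers: list, offset: int) -> float:
    """Average identity between each monomer and the one `offset` units later."""
    pairs = [(monomers[i], monomers[i + offset])
             for i in range(len(monomers) - offset)]
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


def looks_like_dimer_hor(read: str, unit_len: int, margin: float = 0.05) -> bool:
    """Return True if monomers two apart are clearly more alike than neighbours.

    At least four monomers are required, matching the requirement of four
    tandem repeats per read stated in the text.
    """
    monomers = split_monomers(read, unit_len)
    if len(monomers) < 4:
        return False
    adjacent = mean_identity_at_offset(monomers, 1)   # A vs B
    alternate = mean_identity_at_offset(monomers, 2)  # A vs A, B vs B
    return alternate - adjacent >= margin


# Hypothetical toy monomers that differ at 2 of 10 positions.
mono_a = "ACGTACGTAC"
mono_b = "ACGTTCGTAA"
hor_read = (mono_a + mono_b) * 2   # A->B->A->B
plain_read = mono_a * 4            # A->A->A->A
```

With these toy reads, `looks_like_dimer_hor(hor_read, 10)` is true and `looks_like_dimer_hor(plain_read, 10)` is false. Real reads would need alignment-based identity to tolerate indels and sequencing errors.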
To address the important question of ‘how similar are centromere repeats in different species?’, we performed an all-vs-all comparison between the most abundant tandem repeat in every species. Surprisingly, we found only 26 groups of species that shared any significant sequence similarity in their candidate centromere repeat sequence. The species that make up these 26 groups were always closely related species which had diverged less than 50 million years ago. When comparing the repeat sequences in these groups of closely related species, we found that repeats evolve not only by accumulation of mutations, but also by the spread of indels or by repeat doubling.
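As a rough illustration of an all-vs-all comparison followed by grouping, the sketch below scores every pair of candidate repeats and merges species whose repeats exceed a similarity threshold. It is an assumption-laden toy, not the comparison used in the paper: similarity is a Jaccard index over circular k-mers (chosen here because it ignores the phase at which a tandem repeat was reported), the opposite strand is not considered, and the threshold, `k`, species names and sequences are all hypothetical.

```python
from itertools import combinations


def circular_kmers(seq: str, k: int) -> set:
    """Return the set of k-mers of seq, treating seq as circular."""
    if len(seq) < k:
        raise ValueError("sequence must be at least k bases long")
    wrapped = seq + seq[:k - 1]
    return {wrapped[i:i + k] for i in range(len(seq))}


def jaccard(a: str, b: str, k: int = 4) -> float:
    """Jaccard index of the circular k-mer sets of two sequences."""
    kmers_a, kmers_b = circular_kmers(a, k), circular_kmers(b, k)
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


def group_species(repeats: dict, threshold: float = 0.5, k: int = 4) -> list:
    """Group species whose repeats score at or above the threshold.

    `repeats` maps a species name to its candidate repeat sequence.
    Groups are built with a small union-find, so similarity is transitive:
    if X matches Y and Y matches Z, all three land in one group.
    Species with no match are omitted from the result.
    """
    parent = {name: name for name in repeats}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(repeats, 2):
        if jaccard(repeats[a], repeats[b], k) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in repeats:
        groups.setdefault(find(name), []).append(name)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical toy data: sp2 is sp1 reported in a different phase.
repeats = {
    "sp1": "ACGTTGCAAGCT",
    "sp2": "GTTGCAAGCTAC",
    "sp3": "GGGGCCCCGGGG",
}
groups = group_species(repeats)
```

Here `sp1` and `sp2` end up in one group, while `sp3` shares no k-mers with either and is left out.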
These results are in line with the “library” hypothesis, which aims to describe how ratios of repeat variants can change over time. In addition, PacBio sequencing found very long tandem repeats (~1,500 bp). Furthermore, in switchgrass (Panicum virgatum) we identified several centromere repeat variants, but PacBio sequences did not show any mixing of these repeat variants. In summary, tandem repeats are frequently associated with centromere function and most probably evolve according to the “library” hypothesis (a.k.a. molecular drive).
This paper is dedicated to the late Simon Chan, who passed away on the 22nd of August 2012 at the young age of 38 (see here for more information).
Daniël Melters and Keith Bradnam
PS. The supplementary table can be provided upon email request.