bioRxiv doi: http://dx.doi.org/10.1101/030163
The Chlamydomonas genome has been sequenced, assembled and annotated to produce a rich resource for genetics and molecular biology in this well-studied model organism. However, the current reference genome contains ~1000 blocks of unknown sequence (‘N-islands’), which are frequently placed in introns of annotated gene models. We developed a strategy, based on careful bioinformatic analysis of short-read cDNA and genomic DNA sequences, to search for previously unknown exons hidden within such blocks and to determine the sequence and exon/intron boundaries of those exons. These methods rely on assembly and alignment that are completely independent of the prior reference assembly and reference annotation. Our evidence indicates that approximately one-quarter of the annotated intronic N-islands actually contain hidden exons. For most of these, our algorithm recovers the full exonic sequence, together with its splice junctions and exon-adjacent intron sequence, which can be joined to the reference genome assembly and annotated transcript models. These new exons represent de novo sequence that is generally present nowhere in the assembled genome, and in many cases the added sequence can be shown to greatly improve the evolutionary conservation of the predicted encoded peptides. At the same time, our results confirm the purely intronic status of a substantial majority of N-islands annotated as intronic in the annotated reference genome, increasing confidence in this valuable resource.
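To make the notion of an "intronic N-island" concrete, the following minimal Python sketch locates runs of N in a sequence and keeps those that lie entirely within an annotated intron. It is only an illustration and not the pipeline described here: the function names `find_n_islands` and `intronic_islands`, the 0-based half-open coordinate convention, and the toy sequence and intron coordinates are all assumptions made for the example.

```python
import re
from typing import List, Tuple

# Hypothetical convention for this sketch: 0-based, half-open [start, end).
Interval = Tuple[int, int]


def find_n_islands(seq: str, min_len: int = 1) -> List[Interval]:
    """Return intervals covering each run of N (or n) at least min_len long."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


def intronic_islands(islands: List[Interval], introns: List[Interval]) -> List[Interval]:
    """Keep the islands that are fully contained in at least one intron."""
    return [
        (start, end)
        for start, end in islands
        if any(i_start <= start and end <= i_end for i_start, i_end in introns)
    ]


# Toy example (invented data): exon, intron containing an N-island, exon,
# followed by a second N-island that lies outside any annotated intron.
seq = "ATGGCC" + "GTAAG" + "N" * 10 + "TTCAG" + "GGATAA" + "NNN"
introns = [(6, 26)]  # hypothetical annotated intron coordinates

islands = find_n_islands(seq)
print(islands)                             # [(11, 21), (32, 35)]
print(intronic_islands(islands, introns))  # [(11, 21)]
```

Only the islands retained by the second function would be candidates for a hidden-exon search; deciding whether one actually contains an exon requires the independent read assembly and alignment summarized above, which this sketch does not attempt.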