Triticeae resources in Ensembl Plants
Dan M Bolser, Arnaud Kerhornou, Brandon Walts, Paul Kersey
Recent developments in DNA sequencing have enabled the large and complex genomes of many crop species to be determined for the first time, even those previously intractable due to their polyploid nature. Indeed, over the course of the last two years, the genome sequences of several commercially important cereals, notably barley and bread wheat, have become available, as well as those of related wild species. While these assemblies are still incomplete, comparison with other, more completely assembled species suggests that coverage of genic regions is likely to be high. Ensembl Plants (http://plants.ensembl.org) is an integrative resource organising, analysing and visualising genome-scale information for important crop and model plants. Available data include reference genome sequence, variant loci, gene models and functional annotation. For variant loci, individual and population genotypes, linkage information and, where available, phenotypic information are shown. Comparative analyses are performed on DNA and protein sequence alignments. The resulting genome alignments and gene trees, representing the implied evolutionary history of the gene family, are made available for visualisation and analysis. Driven by the use case of bread wheat, specific extensions to the analysis pipelines and web interface have recently been developed to support polyploid genomes. Data in Ensembl Plants are accessible through a genome browser incorporating various specialist interfaces for different data types, and through a variety of additional methods for programmatic access and data mining. These interfaces are consistent with those offered through the Ensembl interface for the genomes of non-plant species, including those of plant pathogens, pests and pollinators, facilitating the study of the plant in its environment.
The genetic architecture of local adaptation I: The genomic landscape of foxtail pine (Pinus balfouriana Grev. & Balf.) as revealed from a high-density linkage map
Christopher J Friedline, Brandon M Lind, Erin M Hobson, Douglas E Harwood, Annette Delfino-Mix, Patricia E Maloney, Andrew J Eckert
Explaining the origin and evolutionary dynamics of the genetic architecture of adaptation is a major research goal of evolutionary genetics. Despite controversy surrounding the success of attempts to accomplish this goal, a full understanding of adaptive genetic variation necessitates knowledge about the genomic location and patterns of dispersion for the genetic components affecting fitness-related phenotypic traits. Even with advances in next generation sequencing technologies, the production of full genome sequences for non-model species is often cost prohibitive, especially for tree species such as pines where genome size often exceeds 20 to 30 Gbp. We address this need by constructing a dense linkage map for foxtail pine (Pinus balfouriana Grev. & Balf.), with the ultimate goal of uncovering and explaining the origin and evolutionary dynamics of adaptive genetic variation in natural populations of this forest tree species. We utilized megagametophyte arrays (n = 76–95 megagametophytes/tree) from four maternal trees in combination with double-digestion restriction site associated DNA sequencing (ddRADseq) to produce a consensus linkage map covering 98.58% of the foxtail pine genome, which was estimated to be 1276 cM in length (95% CI: 1174 cM to 1378 cM). A novel bioinformatic approach using iterative rounds of marker ordering and imputation was employed to produce single-tree linkage maps (507–17066 contigs/map; lengths: 1037.40–1572.80 cM). These linkage maps were collinear across maternal trees, with highly correlated marker orderings (Spearman’s ρ > 0.95). A consensus linkage map derived from these single-tree linkage maps contained 12 linkage groups along which 20 655 contigs were non-randomly distributed across 901 unique positions (n = 23 contigs/position), with an average spacing of 1.34 cM between adjacent positions.
Of the 20 655 contigs positioned on the consensus linkage map, 5627 had enough sequence similarity to contigs contained within the most recent build of the loblolly pine (P. taeda L.) genome to identify them as putative homologs containing both genic and non-genic loci. Importantly, all 901 unique positions on the consensus linkage map had at least one contig with putative homology to loblolly pine. When combined with the other biological signals that predominate in our data (e.g., correlations of recombination fractions across single trees), we show that dense linkage maps for non-model forest tree species can be efficiently constructed using next generation sequencing technologies. We subsequently discuss the usefulness of these maps as community-wide resources and as tools with which to test hypotheses about the genetic architecture of adaptation.
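The collinearity check described in the abstract above (Spearman's ρ between marker orderings of single-tree maps) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the dictionary layout (contig identifier mapped to a cM position) and the toy positions are all assumptions made for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(map_a, map_b):
    """Spearman's rho over the contigs present in both maps.

    map_a and map_b are hypothetical dicts: contig id -> position in cM.
    """
    shared = sorted(set(map_a) & set(map_b))
    ranks_a = average_ranks([map_a[c] for c in shared])
    ranks_b = average_ranks([map_b[c] for c in shared])
    return pearson(ranks_a, ranks_b)


# Toy positions (cM) on one linkage group for two maternal trees (made-up values).
tree_a = {"c1": 0.0, "c2": 1.3, "c3": 2.9, "c4": 5.0, "c5": 7.2}
tree_b = {"c1": 0.0, "c2": 1.1, "c3": 3.5, "c4": 3.0, "c5": 8.0, "c6": 9.0}
rho = spearman_rho(tree_a, tree_b)
```

In this toy case only contigs c3 and c4 swap order between the two trees, so ρ stays high but falls below 1. Restricting the comparison to shared contigs is a design choice of the sketch, since single-tree maps need not contain the same markers.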
Comparative genomics reveals the origins and diversity of arthropod immune systems
William J Palmer, Francis M Jiggins
While the innate immune system of insects is well-studied, comparatively little is known about how other arthropods defend themselves against infection. We have characterised key immune components in the genomes of five chelicerates, a myriapod and a crustacean. We found clear traces of an ancient origin of innate immunity, with some arthropods having Toll-like receptors and C3-complement factors that are more closely related in sequence or structure to those of vertebrates than to those of other arthropods. Across the arthropods some components of the immune system, like the Toll signalling pathway, are highly conserved. However, there is also remarkable diversity. The chelicerates apparently lack the Imd signalling pathway and BGRPs, a key class of pathogen recognition receptors. Many genes have large copy number variation across species, and this may sometimes be accompanied by changes in function. For example, peptidoglycan recognition proteins (PGRPs) have frequently lost their catalytic activity and switched between secreted and intracellular forms. There has been extensive duplication of the cellular immune receptor Dscam in several species, which may be an alternative way to generate the high diversity that is produced by alternative splicing in insects. Our results provide a detailed analysis of the immune systems of several important groups of animals and lay the foundations for functional work on these groups.
Landscape and evolutionary dynamics of terminal-repeat retrotransposons in miniature (TRIMs) in 48 whole plant genomes
Dongying Gao, Yupeng Li, Brian Abernathy, Scott Jackson
Terminal-repeat retrotransposons in miniature (TRIMs) are structurally similar to long terminal repeat (LTR) retrotransposons except that they are extremely small and difficult to identify. Thus far, only a few TRIMs have been characterized in the euphyllophytes, and the evolutionary and biological impacts and transposition mechanism of TRIMs are poorly understood. In this study, we combined de novo and homology-based methods to annotate TRIMs in 48 plant genome sequences, spanning land plants to algae. We found 156 TRIM families, 146 of them previously undescribed. Notably, we identified the first TRIMs in a lycophyte and non-vascular plants. The majority of the TRIM families were highly conserved and shared within and between plant families. Even though TRIMs contribute only a small fraction of any plant genome, they are enriched in or near genes and may play important roles in gene evolution. TRIMs were frequently organized into tandem arrays that we called TA-TRIMs, another unique feature distinguishing them from LTR retrotransposons. Importantly, we identified putative autonomous retrotransposons that may mobilize specific TRIM elements and detected very recent transpositions of a TRIM in O. sativa. Overall, this comprehensive analysis of TRIMs across the entire plant kingdom provides insight into the evolution and conservation of TRIMs and the functional roles they may play in gene evolution.
CNVkit: Copy number detection and visualization for targeted sequencing using off-target reads
Eric Talevich, A. Hunter Shain, Boris C. Bastian
Germline copy number variants (CNVs) and somatic copy number alterations (SCNAs) are of significant importance in syndromic conditions and cancer. Massively parallel sequencing is increasingly used to infer copy number information from variations in the read depth in sequencing data. However, this approach has limitations in the case of targeted re-sequencing, which leaves gaps in coverage between the regions chosen for enrichment and introduces biases related to the efficiency of target capture and library preparation. We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and sufficient resolution in the larger intronic and intergenic regions to identify copy number changes. In particular, we successfully inferred copy number at a resolution equivalent to 100 kilobases genome-wide from a platform targeting as few as 293 genes. After normalizing read counts to a pooled reference, we evaluated and corrected for three sources of bias that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences. We compared the performance of CNVkit to copy number changes identified by array comparative genomic hybridization. We packaged the components of CNVkit so that the software is straightforward to use and provides visualizations, detailed reporting of significant features, and export options for compatibility with other software. Availability: http://github.com/etal/cnvkit
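The step of normalizing read counts to a pooled reference can be sketched in a few lines. This is a simplified illustration under stated assumptions, not CNVkit's actual implementation or API: the function names, the per-bin depth lists and the pseudocount parameter are hypothetical, and the GC, footprint and repeat corrections described in the abstract are omitted.

```python
import math
from statistics import median


def centered_log2(depths, pseudocount=1.0):
    """Log2-transform per-bin read depths and subtract the sample median.

    Median-centering removes differences in overall sequencing depth.
    The pseudocount (an assumption of this sketch) avoids log2(0).
    """
    logs = [math.log2(d + pseudocount) for d in depths]
    center = median(logs)
    return [value - center for value in logs]


def pooled_reference(normal_samples, pseudocount=1.0):
    """Per-bin median of centered log2 depths across normal samples."""
    centered = [centered_log2(s, pseudocount) for s in normal_samples]
    return [median(bin_values) for bin_values in zip(*centered)]


def log2_ratios(sample_depths, reference, pseudocount=1.0):
    """Per-bin log2 ratio of a test sample relative to the pooled reference."""
    sample = centered_log2(sample_depths, pseudocount)
    return [s - r for s, r in zip(sample, reference)]


# Toy depths for three bins (made-up values); the third bin is gained in the test sample.
normals = [[100, 200, 400]]
tumor = [200, 400, 1600]
reference = pooled_reference(normals, pseudocount=0.0)
ratios = log2_ratios(tumor, reference, pseudocount=0.0)
```

In this toy input the test sample was sequenced twice as deeply as the normal, and median-centering cancels that difference, so only the third bin shows a positive log2 ratio. Using the median rather than the mean for both centering and pooling is a choice made here for robustness to outlier bins.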
RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing
Vikas Gupta, April Dawn Estrada, Ivory Clabaugh Blakley, Rob Reid, Ketan Patel, Mason D. Meyer, Stig Uggerhoj Andersen, Allan F. Brown, Mary Ann Lila, Ann Loraine
Background: Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable breeding berry varieties with enhanced health benefits. Results: Toward this end, we annotated a draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was upregulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes, cell growth-related genes, and putative transcriptional regulators. Analysis of RNA-Seq alignments also identified developmentally regulated alternative splicing, promoter use, and 3′ end formation. Conclusions: We report genome sequence, gene models, functional annotations, and RNA-Seq expression data, which provide an important new resource enabling high-throughput studies in blueberry. RNA-Seq data are freely available for visualization in Integrated Genome Browser, and analysis code is available from the git repository at http://bitbucket.org/lorainelab/blueberrygenome.
Origins and impacts of new exons
Jason Merkin*, Ping Chen*, Sampsa Hautaniemi, Christopher Burge
Mammalian genes are typically broken into several protein-coding and non-coding exons, but the evolutionary origins and functions of new exons are not well understood. Here, we analyzed patterns of exon gain using deep cDNA sequencing data from several mammals and one bird, identifying thousands of species- and lineage-specific exons. While exons conserved across mammals are mostly protein-coding and constitutively spliced, species-specific exons were mostly located in 5′ untranslated regions and alternatively spliced. New exons most often derived from unique intronic sequence rather than repetitive elements, and were associated with upstream intronic deletions, increased nucleosome occupancy and RNA polymerase II pausing. Surprisingly, exon gain was associated with increased gene expression, but only in tissues where the exon was included, suggesting that splicing enhances steady-state mRNA levels and that changes in splicing represent a major contributor to the evolution of gene expression.