Landscape and evolutionary dynamics of terminal-repeat retrotransposons in miniature (TRIMs) in 48 whole plant genomes
Dongying Gao, Yupeng Li, Brian Abernathy, Scott Jackson
Terminal-repeat retrotransposons in miniature (TRIMs) are structurally similar to long terminal repeat (LTR) retrotransposons except that they are extremely small and difficult to identify. Thus far, only a few TRIMs have been characterized in the euphyllophytes and the evolutionary and biological impacts and transposition mechanism of TRIMs are poorly understood. In this study, we combined de novo and homology-based methods to annotate TRIMs in 48 plant genome sequences, spanning land plants to algae. We found 156 TRIM families, 146 previously undescribed. Notably, we identified the first TRIMs in a lycophyte and non-vascular plants. The majority of the TRIM families were highly conserved and shared within and between plant families. Even though TRIMs contribute only a small fraction of any plant genome, they are enriched in or near genes and may play important roles in gene evolution. TRIMs were frequently organized into tandem arrays we called TA-TRIMs, another unique feature distinguishing them from LTR retrotransposons. Importantly, we identified putative autonomous retrotransposons that may mobilize specific TRIM elements and detected very recent transpositions of a TRIM in O. sativa. Overall, this comprehensive analysis of TRIMs across the entire plant kingdom provides insight into the evolution and conservation of TRIMs and the functional roles they may play in gene evolution.
Sharing and specificity of co-expression networks across 35 human tissues
Emma Pierson, GTEx Consortium, Daphne Koller, Alexis Battle, Sara Mostafavi
To understand the regulation of tissue-specific gene expression, the GTEx Consortium generated RNA-seq expression data for more than thirty distinct human tissues. This data provides an opportunity for deriving shared and tissue-specific gene regulatory networks on the basis of co-expression between genes. However, a small number of samples are available for a majority of the tissues, and therefore statistical inference of networks in this setting is highly underpowered. To address this problem, we infer tissue-specific gene co-expression networks for 35 tissues in the GTEx dataset using a novel algorithm, GNAT, that uses a hierarchy of tissues to share data between related tissues. We show that this transfer learning approach increases the accuracy with which networks are learned. Analysis of these networks reveals that tissue-specific transcription factors are hubs that preferentially connect to genes with tissue-specific functions. Additionally, we observe that genes with tissue-specific functions lie at the peripheries of our networks. We identify numerous modules enriched for Gene Ontology functions, and show that modules conserved across tissues are especially likely to have functions common to all tissues, while modules that are upregulated in a particular tissue are often instrumental to tissue-specific function. Finally, we provide a web tool which allows exploration of gene function and regulation in a tissue-specific manner.
Exploring the phenotypic space and the evolutionary history of a natural mutation in Drosophila melanogaster
Anna Ullastres, Natalia Petit, Josefa González
A major challenge of modern Biology is elucidating the functional consequences of natural mutations. While we have a good understanding of the effects of lab-induced mutations on the molecular- and organismal-level phenotypes, the study of natural mutations has lagged behind. In this work, we explore the phenotypic space and the evolutionary history of a previously identified adaptive transposable element insertion. We first combined several tests that capture different signatures of selection to show that there is evidence of positive selection in the regions flanking FBti0019386 insertion. We then explored several phenotypes related to known phenotypic effects of nearby genes, and having plausible connections to fitness variation in nature. We found that flies with FBti0019386 insertion had a shorter developmental time and were more sensitive to stress, which are likely to be the adaptive effect and the cost of selection of this mutation, respectively. Interestingly, these phenotypic effects are not consistent with a role of FBti0019386 in temperate adaptation as has been previously suggested. Indeed, a global analysis of the population frequency of FBti0019386 showed that clinal frequency patterns are found in North America and Australia but not in Europe. Finally, we showed that FBti0019386 is associated with down-regulation of sra most likely because it induces the formation of heterochromatin by recruiting HP1a protein. Overall, our integrative approach allowed us to shed light on the evolutionary history, the relevant fitness effects and the likely molecular mechanisms of an adaptive mutation and highlights the complexity of natural genetic variants.
Impacts of terraces on phylogenetic inference
Michael J Sanderson, Michelle M. McMahon, Alexandros Stamatakis, Derrick J. Zwickl, Mike Steel
Comments: 50 pages, 9 figures
Subjects: Populations and Evolution (q-bio.PE)
Terraces are potentially large sets of trees with precisely the same likelihood or parsimony score, which can be induced by missing sequences in partitioned multi-locus phylogenetic data matrices. The set of trees on a terrace can be characterized by enumeration algorithms or consensus methods that exploit the pattern of partial taxon coverage in the data, independent of the sequence data themselves. Terraces add ambiguity and complexity to phylogenetic inference particularly in settings where inference is already challenging: data sets with many taxa and relatively few loci. In this paper we present five new findings about terraces and their impacts on phylogenetic inference. First we clarify assumptions about model parameters that are necessary for the existence of terraces. Second, we explore the dependence of terrace size on partitioning scheme and indicate how to find the partitioning scheme associated with the largest terrace containing a given tree. Third, we highlight the impact of terraces on bootstrap estimates of confidence limits in clades, and characterize the surprising result that the bootstrap proportion for a clade can be entirely determined by the frequency of bipartitions on a terrace, with some bipartitions receiving high support even when incorrect. Fourth, we dissect some effects of prior distributions of edge lengths on the computed posterior probabilities of clades on terraces, to understand an example in which long edges “attract” each other in Bayesian inference. Fifth, we show that even if data are not partitioned, patterns of missing data studied in the terrace problem can lead to instances of apparent statistical inconsistency when even a small element of heterotachy is introduced to the model generating the sequence data. Finally, we discuss strategies for remediation of some of these problems.
CNVkit: Copy number detection and visualization for targeted sequencing using off-target reads
Eric Talevich, A. Hunter Shain, Boris C. Bastian
Germline copy number variants (CNVs) and somatic copy number alterations (SCNAs) are of significant importance in syndromic conditions and cancer. Massive parallel sequencing is increasingly used to infer copy number information from variations in the read depth in sequencing data. However, this approach has limitations in the case of targeted re-sequencing, which leaves gaps in coverage between the regions chosen for enrichment and introduces biases related to the efficiency of target capture and library preparation. We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and sufficient resolution in the larger intronic and intergenic regions to identify copy number changes. In particular, we successfully inferred copy number at equivalent to 100-kilobase resolution genome-wide from a platform targeting as few as 293 genes. After normalizing read counts to a pooled reference, we evaluated and corrected for three sources of bias that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences. We compared the performance of CNVkit to copy number changes identified by array comparative genomic hybridization. We packaged the components of CNVkit so that it is straightforward to use and provides visualizations, detailed reporting of significant features, and export options for compatibility with other software. Availability: http://github.com/etal/cnvkit
Genome-wide association study of carbon and nitrogen metabolism in the maize nested association mapping population
Nengyi Zhang, Yves Gibon, Nicholas Lepak, Pinghua Li, Lauren Dedow, Charles Chen, Yoon-Sup So, Jason Wallace, Karl Kremling, Peter Bradbury, Thomas Brutnell, Mark Stitt, Edward Buckler
Carbon (C) and nitrogen (N) metabolism are critical to plant growth and development and at the basis of yield and adaptation. We have applied high throughput metabolite analyses to over 12,000 diverse field grown samples from the maize nested association mapping population. This allowed us to identify natural variation controlling the levels of twelve key C and N metabolites, often with single gene resolution. In addition to expected genes like invertases, critical natural variation was identified in key C4 metabolism genes like carbonic anhydrases and a malate transporter. Unlike prior maize studies, extensive pleiotropy was found for C and N metabolites. This integration of field-derived metabolite data with powerful mapping and genomics resources allows dissection of key metabolic pathways, providing avenues for future genetic improvement.
Introns structure patterns of variation in nucleotide composition in Arabidopsis thaliana and rice protein-coding genes
Adrienne Ressayre, Sylvain Glemin, Pierre Montalent, Laurana Serres-Giardi, Christine Dillmann, Johann Joets
Plant genomes are large, intron-rich and present a wide range of variation in coding region G+C content. Concerning coding regions, a sort of syndrome can be described in plants: the increase in G+C content is associated with both the increase in heterogeneity among genes within a genome and the increase in variation across genes. Taking advantage of the large number of genes composing plant genomes and the wide range of variation in gene intron number, we performed a comprehensive survey of the patterns of variation in G+C content at different scales from the nucleotide level to the genome scale in two species Arabidopsis thaliana and Oryza sativa, comparing the patterns in genes with different intron numbers. In both species, we observed a pervasive effect of gene intron number and location along genes on G+C content, codon and amino acid frequencies suggesting that in both species, introns have a barrier effect structuring G+C content along genes. In external gene regions (located upstream first or downstream last intron), species-specific factors are shaping G+C content while in internal gene regions (surrounded by introns), G+C content is constrained to remain within a range common to both species. In rice, introns appear as a major determinant of gene G+C content while in A. thaliana introns have a weaker but significant effect. The structuring effect of introns in both species is susceptible to explain the G+C content syndrome observed in plants.