Plastids perform crucial cellular functions, including photosynthesis, across a wide variety of eukaryotes. Since endosymbiosis, plastids have maintained independent genomes that now display a wide diversity of gene content, genome structure, gene regulation mechanisms, and transmission modes. The evolution of plastid genomes depends on an input of de novo mutation, but our knowledge of mutation in the plastid is limited to indirect inference from patterns of DNA divergence between species. Here, we use a mutation accumulation experiment, where selection acting on mutations is rendered ineffective, combined with whole-plastid genome sequencing to directly characterize de novo mutation in Chlamydomonas reinhardtii. We show that the mutation rates of the plastid and nuclear genomes are similar, but that the base spectra of mutations differ significantly. We integrate our measure of the mutation rate with a population genomic dataset of 20 individuals, and show that the plastid genome is subject to substantially stronger genetic drift than the nuclear genome. We also show that high levels of linkage disequilibrium in the plastid genome are not due to restricted recombination, but are instead a consequence of increased genetic drift. One likely explanation for increased drift in the plastid genome is that there are stronger effects of genetic hitchhiking. The presence of recombination in the plastid is consistent with laboratory studies in C. reinhardtii and demonstrates that although the plastid genome is thought to be uniparentally inherited, it recombines in nature at a rate similar to the nuclear genome.
Genome size evolution is a fundamental problem in molecular evolution, and statistical analysis of genome sizes offers new insight into this process. Although variation in genome size is complex, it can be explained more clearly at the taxon level than at the species level. I find that the distribution of genome sizes for species within a taxon fits a log-normal distribution, and I identify a relationship between the phylogeny of life and the statistical features of genome size distributions among taxa, with animal taxa and plant taxa exhibiting different statistical features. A log-normal stochastic process model is developed to simulate genome size evolution. The simulated log-normal distributions of genome sizes and their statistical features agree with the observations.
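The log-normal outcome follows naturally from a multiplicative model of genome size change: if each evolutionary step multiplies genome size by a small random factor, the logarithm of size performs an additive random walk and, by the central limit theorem, the resulting sizes are approximately log-normal. The following minimal sketch illustrates this idea; the function name and parameter values are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_genome_sizes(n_species=10_000, n_steps=500,
                          start_size=1e8, sigma=0.02):
    """Multiplicative random walk: each step multiplies genome size by a
    random factor, so log(size) performs an additive walk and the final
    sizes are approximately log-normally distributed."""
    log_sizes = np.full(n_species, np.log(start_size))
    for _ in range(n_steps):
        # multiplying size by exp(eps) == adding eps to log(size)
        log_sizes += rng.normal(0.0, sigma, size=n_species)
    return np.exp(log_sizes)

sizes = simulate_genome_sizes()
ls = np.log(sizes)
# If the log-normal fit holds, log sizes should be nearly symmetric:
skew = ((ls - ls.mean()) ** 3).mean() / ls.std() ** 3
print(f"mean log size: {ls.mean():.2f}, skewness of log sizes: {skew:.3f}")
```

In a simulation like this, the skewness of the log-transformed sizes stays close to zero, which is the signature of a log-normal distribution on the original scale.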
The constantly decreasing cost and increasing output of current sequencing technologies enable large-scale metagenomic studies of microbial communities from diverse habitats. Therefore, fast and accurate methods for taxonomic classification are needed, which can operate on increasingly larger datasets and reference databases. Recently, several fast metagenomic classifiers have been developed, which are based on comparison of genomic k-mers. However, nucleotide comparison using a fixed k-mer length often lacks the sensitivity to overcome the evolutionary distance between sampled species and genomes in the reference database. Here, we present the novel metagenome classifier Kaiju for fast assignment of reads to taxa. Kaiju finds maximum exact matches on the protein level using the Burrows-Wheeler transform, and can optionally allow amino acid substitutions in the search using a greedy heuristic. We show in a genome exclusion study that Kaiju can classify more reads with higher sensitivity and similar precision compared to fast k-mer-based classifiers, especially in genera that are underrepresented in reference databases. We also demonstrate that Kaiju classifies more than twice as many reads in ten real metagenomes compared to programs based on genomic k-mers. Kaiju can process up to millions of reads per minute, and its memory footprint is below 5 GB of RAM, allowing analysis on a standard PC. The program is available under the GPL3 license at: github.com/bioinformatics-centre/kaiju
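Kaiju's search rests on backward search over the Burrows-Wheeler transform (BWT): a pattern is extended one character at a time from its right end, and each extension narrows an interval of the sorted rotations in constant work per character, which is what makes maximum-exact-match queries cheap. The sketch below is a conceptual, toy-scale illustration of that mechanism over a short protein string; it is not Kaiju's actual C/C++ implementation, which uses a compressed FM-index.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for small strings)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    """Count occurrences of `pattern` by backward search over the BWT."""
    # C[c]: number of characters in the text strictly smaller than c
    sorted_chars = sorted(bwt_str)
    C = {}
    for i, c in enumerate(sorted_chars):
        C.setdefault(c, i)

    def occ(c, i):  # occurrences of c in bwt_str[:i] (naive rank query)
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):  # extend pattern right-to-left
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("MKVLAAMKV")          # toy protein sequence
print(backward_search(b, "MKV"))  # → 2
print(backward_search(b, "AAM"))  # → 1
```

A real FM-index replaces the naive `occ` rank query with precomputed occurrence tables so each character extension costs O(1), independent of text length.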
With Next Generation Sequencing (NGS) data coming of age and being routinely used, evolutionary biology is transforming into a data-driven science. As a consequence, researchers have to rely on a growing number of increasingly complex software tools. All widely used tools in our field have grown considerably, in terms of the number of features as well as lines of code. In addition, analysis pipelines now include substantially more components than 5-10 years ago. A topic that has received little attention in this context is the code quality of this widely used software. Unfortunately, the majority of users tend to blindly trust software and the results it produces. To this end, we assessed the code quality of 15 highly cited tools (e.g., MrBayes, MAFFT, and SweepFinder) from the broader area of evolutionary biology that are used in current data analysis pipelines. We also discuss widely unknown problems associated with floating-point arithmetic for representing real numbers on computer systems. Since the software quality of the tools we analyzed is rather mediocre, we provide a list of best practices for improving the quality of existing tools, and also list techniques that can be deployed for developing reliable, high-quality scientific software from scratch. Finally, we discuss journal and science policy as well as funding issues that need to be addressed to improve software quality and ensure support for developing new and maintaining existing software. Our intention is to raise the awareness of the community regarding software quality issues and to emphasize the substantial lack of funding for scientific software development.
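The floating-point problems alluded to above are easy to demonstrate. Two classic pitfalls, shown here as a self-contained illustration (not an example drawn from the paper's analyzed tools), are that decimal fractions are not exactly representable in binary, and that the order of summation changes the result when terms of very different magnitude are accumulated, as happens routinely with log-likelihood sums:

```python
import math

# 1. Decimal fractions are not exactly representable in binary:
print(0.1 + 0.2 == 0.3)       # → False
print(f"{0.1 + 0.2:.17f}")    # → 0.30000000000000004

# 2. Summation order matters: adding 1.0 to 1e16 is a no-op because
# 1.0 is below the spacing between adjacent doubles at that magnitude,
# so a naive left-to-right sum silently drops every small term.
terms = [1e16, 1.0, -1e16] * 1000
print(sum(terms))        # naive accumulation → 0.0
print(math.fsum(terms))  # exactly rounded sum → 1000.0
```

Such errors are silent, which is precisely why code quality practices like numerical unit tests matter for scientific software.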
Next-generation sequencing of DNA provides an unprecedented opportunity to discover rare genetic variants associated with complex diseases and traits. However, when testing the association between rare variants and traits of interest, the current practice of first calling underlying genotypes and then treating the called values as known is prone to false positive findings, especially when genotyping errors are systematically different between cases and controls. This happens whenever cases and controls are sequenced at different depths or on different platforms. In this article, we provide a likelihood-based approach to testing rare variant associations that directly models sequencing reads without calling genotypes. We consider the (weighted) burden test statistic, which is the (weighted) sum of the score statistics for assessing effects of individual variants on the trait of interest. Because variant locations are unknown, we develop a simple, computationally efficient screening algorithm to estimate which loci are variant sites. Because our burden statistic may not have mean zero after screening, we develop a novel bootstrap procedure for assessing the significance of the burden statistic. We demonstrate through extensive simulation studies that the proposed tests are robust to a wide range of differential sequencing qualities between cases and controls, and are at least as powerful as the standard genotype calling approach when the latter controls type I error. An application to the UK10K data reveals novel rare variants in gene BTBD18 associated with childhood-onset obesity. The relevant software is freely available.
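The weighted burden statistic described above is simply a weighted sum, over variants, of per-variant score statistics. A minimal sketch of that aggregation is shown below; note that this sketch assumes hard genotype calls for clarity, whereas the paper's central contribution is to compute the score contributions directly from sequencing reads without calling genotypes. The function name and toy data are illustrative, not from the paper's software.

```python
import numpy as np

def weighted_burden_statistic(genotypes, phenotype, weights):
    """Weighted burden statistic: the weighted sum over variants of the
    per-variant score statistic U_j = sum_i g_ij * (y_i - ybar).
    `genotypes` is an (individuals x variants) dosage matrix."""
    y = np.asarray(phenotype, dtype=float)
    centered = y - y.mean()                # score contributions under H0
    scores = genotypes.T @ centered        # U_j for each variant j
    return float(np.dot(weights, scores))  # weighted sum across variants

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(200, 10))     # toy dosages in {0, 1, 2}
y = rng.integers(0, 2, size=200)           # toy case/control labels
w = np.ones(10)                            # flat weights for illustration
print(weighted_burden_statistic(G, y, w))
```

In practice the weights are often chosen to upweight rarer variants, and the paper's bootstrap procedure is then needed because screening loci before testing shifts the statistic's null mean away from zero.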
In the context of climate change and species invasions, range shifts increasingly gain attention because the rates at which they occur in the Anthropocene induce fast shifts in biological assemblages. During such range shifts, species experience multiple selection pressures. Especially for poleward expansions, a straightforward interpretation of the observed evolutionary dynamics is hampered by the joint action of evolutionary processes related to spatial selection and to adaptation to local climatic conditions. To disentangle the effects of these two processes, we integrated stochastic modeling and empirical approaches, using the spider mite Tetranychus urticae as a model species. We demonstrate considerable latitudinal quantitative genetic divergence in life-history traits in T. urticae that was shaped by both spatial selection and local adaptation. The former mainly affected dispersal behavior, while development was mainly shaped by adaptation to the local climate. Divergence in life-history traits in species shifting their range poleward can consequently be jointly determined by fast local adaptation to the environmental gradient and contemporary evolutionary dynamics resulting from spatial selection. The integration of modeling with common garden experiments provides a powerful tool to study the contribution of these two evolutionary processes to life-history evolution during range expansion.
Background Scaffolding is a crucial step in the genome assembly process. Current methods based on large-fragment paired-end reads or long reads allow an increase in contiguity but often lack consistency in repetitive regions, resulting in fragmented assemblies. Here, we describe a novel tool to link assemblies to a genome map to aid complex genome reconstruction by detecting assembly errors and allowing scaffold ordering and anchoring. Results We present MaGuS (map-guided scaffolding), a modular tool that uses a draft genome assembly, a genome map, and high-throughput paired-end sequencing data to estimate the quality and to enhance the contiguity of an assembly. We generated several assemblies of the Arabidopsis genome using different scaffolding programs and applied MaGuS to select the best assembly using quality metrics. Then, we used MaGuS to perform map-guided scaffolding to increase contiguity by creating new scaffold links in low-coverage and highly repetitive regions where other commonly used scaffolding methods lack consistency. Conclusions MaGuS is a powerful reference-free evaluator of assembly quality and a map-guided scaffolder that is freely available at https://github.com/institut-de-genomique/MaGuS. Its use can be extended to other high-throughput sequencing data (e.g., long-read data) and also to other map data (e.g., genetic maps) to improve the quality and the contiguity of large and complex genome assemblies.