Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

J. Dröge, I. Gregor, A. C. McHardy
(Submitted on 3 Apr 2014)

Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows us to identify the sequenced community members and to reconstruct taxonomic bins with sequence data for the individual taxa. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignments by fast approximate determination of evolutionary neighbors from sequence similarities. Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples because it identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with ten CPU cores and microbial RefSeq as the genomic reference data.

An experimentally informed evolutionary model improves phylogenetic fit to divergent lactamase homologs

Jesse D Bloom

Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are unknown, and so most phylogenetic models treat this selection crudely with a variety of free parameters designed to represent general features of mutation and selection. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than existing models. Here I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg et al, 2014) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much better than existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.

PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species

Paulo Bandiera-Paiva, Marcelo R.S. Briones
(Submitted on 2 Apr 2014)

The Phylogenetic Genome Annotator (PGA) is a computer program that enables real-time comparison of ‘gene trees’ versus ‘species trees’ obtained from predicted open reading frames of whole genome data. The gene phylogenies are inferred from the predicted proteins of each individual genome, whereas the species phylogenies are inferred from rDNA data. The correlated protein domains, defined by PFAM, are then displayed side-by-side with a phylogeny of the corresponding species. The statistical support of gene clusters (branches) is given by the quartet puzzling method. This analysis readily discriminates paralogs from orthologs, enabling the identification of proteins that originated by gene duplication and the prediction of possible functional divergence in groups of similar sequences.

Protected polymorphisms and evolutionary stability of patch-selection strategies in stochastic environments

Steve Evans, Alexandru Hening, Sebastian Schreiber

We consider a population living in a patchy environment that varies stochastically in space and time. The population is composed of two morphs (that is, individuals of the same species with different genotypes). In terms of survival and reproductive success, the associated phenotypes differ only in their habitat selection strategies. We compute invasion rates corresponding to the rates at which the abundance of an initially rare morph increases in the presence of the other morph established at equilibrium. If both morphs have positive invasion rates when rare, then there is an equilibrium distribution such that the two morphs coexist; that is, there is a protected polymorphism for habitat selection. Alternatively, if one morph has a negative invasion rate when rare, then it is asymptotically displaced by the other morph under all initial conditions where both morphs are present. We refine the characterization of an evolutionarily stable strategy (ESS) for habitat selection from [Schreiber, 2012] in a mathematically rigorous manner. We provide a necessary and sufficient condition for the existence of an ESS that uses all patches and determine when using a single patch is an ESS. We also provide an explicit formula for the ESS when there are two habitat types. We show that adding environmental stochasticity results in an ESS that, when compared to the ESS for the corresponding model without stochasticity, spends less time in patches with larger carrying capacities and possibly makes use of sink patches, thereby practicing a spatial form of bet hedging.

Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples

Heng Li
(Submitted on 3 Apr 2014)

Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear, despite the great efforts invested in evaluating variant calling methods.
Results: We made ten SNP and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15kb, but the error rate of post-filtered calls is reduced to 1 in 100-200kb without significant compromise on the sensitivity.
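To put the quoted error rates in perspective, the following minimal sketch converts "1 error per N kb" into an approximate number of erroneous calls per genome. The genome size of 3 Gb is an assumption made only for illustration (it is not stated in the abstract), and the function name `expected_errors` is hypothetical.

```python
# Assumption: a round 3 Gb genome, used only to illustrate the scale.
GENOME_SIZE_BP = 3_000_000_000


def expected_errors(genome_bp: int, kb_per_error: int) -> int:
    """Approximate number of erroneous calls given one error per `kb_per_error` kb."""
    if kb_per_error <= 0:
        raise ValueError("kb_per_error must be positive")
    return genome_bp // (kb_per_error * 1000)


# Raw genotype calls: 1 error in 10-15 kb (figures quoted in the abstract).
raw_high = expected_errors(GENOME_SIZE_BP, 10)
raw_low = expected_errors(GENOME_SIZE_BP, 15)

# Post-filtered calls: 1 error in 100-200 kb (figures quoted in the abstract).
filtered_high = expected_errors(GENOME_SIZE_BP, 100)
filtered_low = expected_errors(GENOME_SIZE_BP, 200)

print(f"raw calls:      {raw_low:,} to {raw_high:,} errors")
print(f"filtered calls: {filtered_low:,} to {filtered_high:,} errors")
```

Under this assumed genome size, filtering corresponds to roughly a tenfold drop in the number of erroneous calls.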

Comparison of the theoretical and real-world evolutionary potential of a genetic circuit.

Manuel Razo-Mejia, James Boedicker, Daniel Jones, Alexander de Luna, Justin Block Kinney, Rob Phillips

With the development of next-generation sequencing technologies, many large scale experimental efforts aim to map genotypic variability among individuals. This natural variability in populations fuels many fundamental biological processes, ranging from evolutionary adaptation and speciation to the spread of genetic diseases and drug resistance. An interesting and important component of this variability is present within the regulatory regions of genes. As these regions evolve, accumulated mutations lead to modulation of gene expression, which may have consequences for the phenotype. A simple model system where the link between genetic variability, gene regulation and function can be studied in detail is missing. In this article we develop a model to explore how the sequence of the wild-type lac promoter dictates the fold change in gene expression. The model combines single-base pair resolution maps of transcription factor and RNA polymerase binding energies with a comprehensive thermodynamic model of gene regulation. The model was validated by predicting and then measuring the variability of lac operon regulation in a collection of natural isolates. We then implement the model to analyze the sensitivity of the promoter sequence to the regulatory output, and predict the potential for regulation to evolve due to point mutations in the promoter region.

Multilocus Species Trees Show the Recent Adaptive Radiation of the Mimetic Heliconius Butterflies

Krzysztof M Kozak, Niklas Wahlberg, Andrew Neild, Kanchon K Dasmahapatra, James Mallet, Chris D Jiggins

Müllerian mimicry among Neotropical Heliconiini butterflies is an excellent example of natural selection, and is associated with the diversification of a large continental-scale radiation. Some of the processes driving the evolution of mimicry rings are likely to generate incongruent phylogenetic signals across the assemblage, and thus pose a challenge for systematics. We use a dataset of 22 mitochondrial and nuclear markers from 92% of species in the tribe to re-examine the phylogeny of Heliconiini with both supermatrix and multi-species coalescent approaches, characterise the patterns of conflicting signal and compare the performance of various methodological approaches to reflect the heterogeneity across the data. Despite the large extent of reticulate signal and strong conflict between markers, nearly identical topologies are consistently recovered by most of the analyses, although the supermatrix approach fails to reflect the underlying variation in the history of individual loci. The first comprehensive, time-calibrated phylogeny of this group is used to test the hypotheses of a diversification rate increase driven by the dramatic environmental changes in the Amazonia over the past 23 million years, or changes caused by diversity-dependent effects on the rate of diversification. We find that the tribe Heliconiini had doubled its rate of speciation around 11 Ma and that the presently most speciose genus Heliconius started diversifying rapidly at 10 Ma, likely in response to the recent drastic changes in topography of the region. Our study provides comprehensive evidence for a rapid adaptive radiation among an important insect radiation in the most biodiverse region of the planet.

New whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica

Michael C Schatz, Lyza G Maron, Joshua C Stein, Alejandro Hernandez Wences, James Gurtowski, Eric Biggers, Hayan Lee, Melissa Kramer, Eric Antonio, Elena Ghiban, Mark H Wright, Jer-ming Chia, Doreen Ware, Susan R McCouch, William Richard McCombie

The use of high throughput genome-sequencing technologies has uncovered a large extent of structural variation in eukaryotic genomes that makes important contributions to genomic diversity and phenotypic variation. Currently, when the genomes of different strains of a given organism are compared, whole genome resequencing data are aligned to an established reference sequence. However, when the reference differs in significant structural ways from the individuals under study, the analysis is often incomplete or inaccurate. Here, we use rice as a model to explore the extent of structural variation among strains adapted to different ecologies and geographies, and show that this variation can be significant, often matching or exceeding the variation present in closely related human populations or other mammals. We demonstrate how improvements in sequencing and assembly technology allow rapid and inexpensive de novo assembly of next generation sequence data into high-quality assemblies that can be directly compared to provide an unbiased assessment. Using this approach, we are able to accurately assess the ‘pan-genome’ of three divergent rice varieties and document several megabases of each genome absent in the other two. Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard resequencing approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species.

Group A Rotavirus NSP4 is Under Negative Selective Pressure

Jackson Cordeiro Lima, Paulo Bandiera-Paiva
(Submitted on 2 Apr 2014)

Rotavirus (RV) is the major etiologic agent of severe infantile gastroenteritis; its genome has 11 segments of double stranded RNA, encoding 12 proteins. The non-structural protein 4 (NSP4) encoded by segment 10 is multifunctional. The aim of this study is to analyze the selective pressure driving the NSP4 of RV, through the ratio of non-synonymous substitutions per synonymous substitutions (dN/dS). Our results show that NSP4 is under negative evolutionary pressure (84.57% of the amino acid sequence) and no site was found under positive selection. This may support other evolutionary studies of different RV proteins or viral agents.
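The dN/dS ratio mentioned above is conventionally read as follows: a ratio below 1 suggests negative (purifying) selection, a ratio near 1 suggests neutral evolution, and a ratio above 1 suggests positive selection. The sketch below illustrates that standard interpretation only; it is not the authors' pipeline, and the function name, the tolerance parameter, and the example values are hypothetical.

```python
def selection_regime(dn: float, ds: float, tol: float = 1e-9) -> str:
    """Classify a site or gene by the ratio omega = dN/dS.

    dn  : non-synonymous substitution rate (dN)
    ds  : synonymous substitution rate (dS), must be positive
    tol : hypothetical tolerance for treating omega as equal to 1
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio dN/dS")
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "negative" if omega < 1.0 else "positive"


# Made-up example values, not data from the study.
print(selection_regime(0.02, 0.20))  # omega = 0.1
print(selection_regime(0.30, 0.10))  # omega = 3.0
```

In practice, per-site estimates come with statistical uncertainty, so a real analysis would test whether the ratio differs significantly from 1 rather than compare point estimates.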

Most viewed on Haldane’s Sieve: March 2014

The most viewed preprints on Haldane’s Sieve in March 2014 were (note that there are six rather than the usual five because two posts had the exact same number of views at the time of this writing):