PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes

PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes

I. Gregor, J. Dröge, M. Schirmer, C. Quince, A. C. McHardy
Subjects: Quantitative Methods (q-bio.QM)

Metagenomics is an approach for characterizing environmental microbial communities in situ, it allows their functional and taxonomic characterization and to recover sequences from uncultured taxa. For communities of up to medium diversity, e.g. excluding environments such as soil, this is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community from which they originate. Assignment to low-ranking taxonomic bins is an important challenge for binning methods as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for the recovery of species bins from an individual metagenome sample is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in a composition-based taxonomic metagenome classifier and identifies the ‘training’ sequences using marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area may not have. With these challenges in mind, we have developed PhyloPythiaS+, a successor to our previously described method PhyloPythia(S). The newly developed + component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated k-mer counting 100-fold and reduced the overall execution time of the software by a factor of three. Our software allows to analyze Gb-sized metagenomes with inexpensive hardware, and to recover species or genera-level bins with low error rates in a fully automated fashion.

A novel method for the estimation of diversity in viral populations from next generation sequencing data

A novel method for the estimation of diversity in viral populations from next generation sequencing data
Jean P. Zukurov, Sieberth N. Brito, Luiz M. R. Janini, Fernando Antoneli
Comments: 17 pages, 6 figures, site: this http URL
Subjects: Quantitative Methods (q-bio.QM); Genomics (q-bio.GN)

In this paper we describe the structure and use of a computational tool for the analysis of viral genetic diversity on data generated by high- throughput sequencing. The main motivation for this work is to better understand the genetic diversity of viruses with high rates of nucleotide substitution, as HIV-1 and Influenza. This work focuses on two main fronts: the first is a novel alignment strategy that allows the recovery of the highest possible number of short-reads; the second is the estimation of the populational genetic diversity through a Bayesian approach based on Dirichlet distributions inspired by word count modeling. The software is available as an integrated platform capable of performing all operations described here, it is written in C# (Microsoft) and runs on Windows platforms. The executable, the documentation and the auxiliary files are freely available and may be obtained from:

Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

J. Dröge, I. Gregor, A. C. McHardy
(Submitted on 3 Apr 2014)

Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows us to identify the sequenced community members and to reconstruct taxonomic bins with sequence data for the individual taxa. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignments by fast approximate determination of evolutionary neighbors from sequence similarities. Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples becauseit identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with ten CPU cores and microbial RefSeq as the genomic reference data.

CONCOCT: Clustering cONtigs on COverage and ComposiTion

CONCOCT: Clustering cONtigs on COverage and ComposiTion
Johannes Alneberg, Brynjar Smari Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z. Ijaz, Nicholas J. Loman, Anders F. Andersson, Christopher Quince
(Submitted on 14 Dec 2013)

Metagenomics enables the reconstruction of microbial genomes in complex microbial communities without the need for culturing. Since assembly typically results in fragmented genomes the grouping of genome fragments (contigs) belonging to the same genome, a process referred to as binning, remains a major informatics challenge. Here we present CONCOCT, a computer program that combines three types of information – sequence composition, coverage across multiple sample, and read-pair linkage – to automatically bin contigs into genomes. We demonstrate high recall and precision rates of the program on artificial as well as real human gut metagenome datasets.

Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible

Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible
Paul J. McMurdie, Susan Holmes
(Submitted on 1 Oct 2013)

The interpretation of count data originating from the current generation of DNA sequencing platforms requires special attention. In particular, the per-sample library sizes often vary by orders of magnitude from the same sequencing run, and the counts are overdispersed relative to a simple Poisson model These challenges can be addressed using an appropriate mixture model that simultaneously accounts for library size differences and biological variability. This approach is already well-characterized and implemented for RNA-Seq data in R packages such as edgeR and DESeq.
We use statistical theory, extensive simulations, and empirical data to show that variance stabilizing normalization using a mixture model like the negative binomial is appropriate for microbiome count data. In simulations detecting differential abundance, normalization procedures based on a Gamma-Poisson mixture model provided systematic improvement in performance over crude proportions or rarefied counts — both of which led to a high rate of false positives. In simulations evaluating clustering accuracy, we found that the rarefying procedure discarded samples that were nevertheless accurately clustered by alternative methods, and that the choice of minimum library size threshold was critical in some settings, but with an optimum that is unknown in practice. Techniques that use variance stabilizing transformations by modeling microbiome count data with a mixture distribution, such as those implemented in edgeR and DESeq, substantially improved upon techniques that attempt to normalize by rarefying or crude proportions. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees

TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees
Dongying Wu, Ladan Doroud, Jonathan A. Eisen
(Submitted on 28 Aug 2013)

Our current understanding of the taxonomic and phylogenetic diversity of cellular organisms, especially the bacteria and archaea, is mostly based upon studies of sequences of the small- subunit rRNAs (ssu-rRNAs). To address the limitation of ssu-rRNA as a phylogenetic marker, such as copy number variation among organisms and complications introduced by horizontal gene transfer, convergent evolution, or evolution rate variations, we have identified protein- coding gene families as alternative Phylogenetic and Phylogenetic Ecology markers (PhyEco). Current nucleotide sequence similarity based Operational Taxonomic Unit (OTU) classification methods are not readily applicable to amino acid sequences of PhyEco markers. We report here the development of TreeOTU, a phylogenetic tree structure based OTU classification method that takes into account of differences in rates of evolution between taxa and between genes. OTU sets built by TreeOTU are more faithful to phylogenetic tree structures than sequence clustering (non phylogenetic) methods for ssu-rRNAs. OTUs built from phylogenetic trees of protein coding PhyEco markers are comparable to our current taxonomic classification at different levels. With the included OTU comparing tools, the TreeOTU is robust in phylogenetic referencing with different phylogenetic markers and trees.

Exploration and retrieval of whole-metagenome sequencing samples

Exploration and retrieval of whole-metagenome sequencing samples
Sohan Seth, Niko Välimäki, Samuel Kaski, Antti Honkela
(Submitted on 28 Aug 2013)

Over the recent years, the field of whole metagenome shotgun sequencing has witnessed significant growth due to the next generation sequencing technologies that allow sequencing genomic samples cheaper, faster, and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. In this paper, we develop a content-based retrieval method for whole metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples, and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome data sets and observe significant enrichment for diseased samples in results of queries with another diseased sample.

A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data

A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data
John O’Brien, Xavier Didelot, Zamin Iqbal, LucasAmenga-Etego, Bartu Ahiska, Daniel Falush
(Submitted on 26 Jun 2013)

Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of strong mixing among samples. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples.

Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads

Low-bandwidth and non-compute intensive remote identification of microbes from raw sequencing reads
Laurent Gautier, Ole Lund
(Submitted on 6 Jun 2013)

Cheap high-throughput DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples.
We propose a novel general approach to the analysis of sequencing data in which the reference genome does not have to be specified. Using a distributed architecture we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data, and the hints can be used for more computationally-demanding work.
Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references known to the server. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment.
To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients, one of them running in a web browser, in order to demonstrate that gigabytes of raw sequencing reads of unknown origin could be identified without the need to transfer a very large volume of data, and on modestly powered computing devices.
A web access is available at this http URL. The source code for a python command-line client, a server, and supplementary data is available at this http URL.

The new science of metagenomics and the challenges of its use in both developed and developing countries

The new science of metagenomics and the challenges of its use in both developed and developing countries
Edi Prifti (MICA), Jean-Daniel Zucker (MSI, UMMISCO, Nutriomique, Eq. 7)
(Submitted on 10 May 2013)

Our view of the microbial world and its impact on human health is changing radically with the ability to sequence uncultured or unculturable microbes sampled directly from their habitats, ability made possible by fast and cheap next generation sequencing technologies. Such recent developments represents a paradigmatic shift in the analysis of habitat biodiversity, be it the human, soil or ocean microbiome. We review here some research examples and results that indicate the importance of the microbiome in our lives and then discus some of the challenges faced by metagenomic experiments and the subsequent analysis of the generated data. We then analyze the economic and social impact on genomic-medicine and research in both developing and developed countries. We support the idea that there are significant benefits in building capacities for developing high-level scientific research in metagenomics in developing countries. Indeed, the notion that developing countries should wait for developed countries to make advances in science and technology that they later import at great cost has recently been challenged.