Reagent contamination can critically impact sequence-based microbiome analyses


Reagent contamination can critically impact sequence-based microbiome analyses

Susannah Salter, Michael J Cox, Elena M Turek, Szymon T Calus, William O Cookson, Miriam F Moffatt, Paul Turner, Julian Parkhill, Nick Loman, Alan W Walker

The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. Concurrent sequencing of negative control samples is strongly advised.

Phylogenetics and the human microbiome

Phylogenetics and the human microbiome
Frederick A Matsen IV
Comments: to appear in Systematic Biology
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN)

The human microbiome is the ensemble of genes in the microbes that live inside and on the surface of humans. Because microbial sequencing information is now much easier to come by than phenotypic information, there has been an explosion of sequencing and genetic analysis of microbiome samples. Much of the analytical work for these sequences involves phylogenetics, at least indirectly, but methodology has developed in a somewhat different direction than for other applications of phylogenetics. In this paper I review the field and its methods from the perspective of a phylogeneticist, as well as describing current challenges for phylogenetics coming from this type of work.

PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes


PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes

I. Gregor, J. Dröge, M. Schirmer, C. Quince, A. C. McHardy
Subjects: Quantitative Methods (q-bio.QM)

Metagenomics is an approach for characterizing environmental microbial communities in situ, it allows their functional and taxonomic characterization and to recover sequences from uncultured taxa. For communities of up to medium diversity, e.g. excluding environments such as soil, this is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community from which they originate. Assignment to low-ranking taxonomic bins is an important challenge for binning methods as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for the recovery of species bins from an individual metagenome sample is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in a composition-based taxonomic metagenome classifier and identifies the ‘training’ sequences using marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area may not have. With these challenges in mind, we have developed PhyloPythiaS+, a successor to our previously described method PhyloPythia(S). The newly developed + component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated k-mer counting 100-fold and reduced the overall execution time of the software by a factor of three. Our software allows to analyze Gb-sized metagenomes with inexpensive hardware, and to recover species or genera-level bins with low error rates in a fully automated fashion.

A novel method for the estimation of diversity in viral populations from next generation sequencing data

A novel method for the estimation of diversity in viral populations from next generation sequencing data
Jean P. Zukurov, Sieberth N. Brito, Luiz M. R. Janini, Fernando Antoneli
Comments: 17 pages, 6 figures, site: this http URL
Subjects: Quantitative Methods (q-bio.QM); Genomics (q-bio.GN)

In this paper we describe the structure and use of a computational tool for the analysis of viral genetic diversity on data generated by high- throughput sequencing. The main motivation for this work is to better understand the genetic diversity of viruses with high rates of nucleotide substitution, as HIV-1 and Influenza. This work focuses on two main fronts: the first is a novel alignment strategy that allows the recovery of the highest possible number of short-reads; the second is the estimation of the populational genetic diversity through a Bayesian approach based on Dirichlet distributions inspired by word count modeling. The software is available as an integrated platform capable of performing all operations described here, it is written in C# (Microsoft) and runs on Windows platforms. The executable, the documentation and the auxiliary files are freely available and may be obtained from: biocomp.epm.br/tanden.

Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

J. Dröge, I. Gregor, A. C. McHardy
(Submitted on 3 Apr 2014)

Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows us to identify the sequenced community members and to reconstruct taxonomic bins with sequence data for the individual taxa. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignments by fast approximate determination of evolutionary neighbors from sequence similarities. Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples becauseit identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with ten CPU cores and microbial RefSeq as the genomic reference data.

CONCOCT: Clustering cONtigs on COverage and ComposiTion

CONCOCT: Clustering cONtigs on COverage and ComposiTion
Johannes Alneberg, Brynjar Smari Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z. Ijaz, Nicholas J. Loman, Anders F. Andersson, Christopher Quince
(Submitted on 14 Dec 2013)

Metagenomics enables the reconstruction of microbial genomes in complex microbial communities without the need for culturing. Since assembly typically results in fragmented genomes the grouping of genome fragments (contigs) belonging to the same genome, a process referred to as binning, remains a major informatics challenge. Here we present CONCOCT, a computer program that combines three types of information – sequence composition, coverage across multiple sample, and read-pair linkage – to automatically bin contigs into genomes. We demonstrate high recall and precision rates of the program on artificial as well as real human gut metagenome datasets.

Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible

Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible
Paul J. McMurdie, Susan Holmes
(Submitted on 1 Oct 2013)

The interpretation of count data originating from the current generation of DNA sequencing platforms requires special attention. In particular, the per-sample library sizes often vary by orders of magnitude from the same sequencing run, and the counts are overdispersed relative to a simple Poisson model These challenges can be addressed using an appropriate mixture model that simultaneously accounts for library size differences and biological variability. This approach is already well-characterized and implemented for RNA-Seq data in R packages such as edgeR and DESeq.
We use statistical theory, extensive simulations, and empirical data to show that variance stabilizing normalization using a mixture model like the negative binomial is appropriate for microbiome count data. In simulations detecting differential abundance, normalization procedures based on a Gamma-Poisson mixture model provided systematic improvement in performance over crude proportions or rarefied counts — both of which led to a high rate of false positives. In simulations evaluating clustering accuracy, we found that the rarefying procedure discarded samples that were nevertheless accurately clustered by alternative methods, and that the choice of minimum library size threshold was critical in some settings, but with an optimum that is unknown in practice. Techniques that use variance stabilizing transformations by modeling microbiome count data with a mixture distribution, such as those implemented in edgeR and DESeq, substantially improved upon techniques that attempt to normalize by rarefying or crude proportions. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees

TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees
Dongying Wu, Ladan Doroud, Jonathan A. Eisen
(Submitted on 28 Aug 2013)

Our current understanding of the taxonomic and phylogenetic diversity of cellular organisms, especially the bacteria and archaea, is mostly based upon studies of sequences of the small- subunit rRNAs (ssu-rRNAs). To address the limitation of ssu-rRNA as a phylogenetic marker, such as copy number variation among organisms and complications introduced by horizontal gene transfer, convergent evolution, or evolution rate variations, we have identified protein- coding gene families as alternative Phylogenetic and Phylogenetic Ecology markers (PhyEco). Current nucleotide sequence similarity based Operational Taxonomic Unit (OTU) classification methods are not readily applicable to amino acid sequences of PhyEco markers. We report here the development of TreeOTU, a phylogenetic tree structure based OTU classification method that takes into account of differences in rates of evolution between taxa and between genes. OTU sets built by TreeOTU are more faithful to phylogenetic tree structures than sequence clustering (non phylogenetic) methods for ssu-rRNAs. OTUs built from phylogenetic trees of protein coding PhyEco markers are comparable to our current taxonomic classification at different levels. With the included OTU comparing tools, the TreeOTU is robust in phylogenetic referencing with different phylogenetic markers and trees.

Exploration and retrieval of whole-metagenome sequencing samples

Exploration and retrieval of whole-metagenome sequencing samples
Sohan Seth, Niko Välimäki, Samuel Kaski, Antti Honkela
(Submitted on 28 Aug 2013)

Over the recent years, the field of whole metagenome shotgun sequencing has witnessed significant growth due to the next generation sequencing technologies that allow sequencing genomic samples cheaper, faster, and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. In this paper, we develop a content-based retrieval method for whole metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples, and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome data sets and observe significant enrichment for diseased samples in results of queries with another diseased sample.

A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data

A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data
John O’Brien, Xavier Didelot, Zamin Iqbal, LucasAmenga-Etego, Bartu Ahiska, Daniel Falush
(Submitted on 26 Jun 2013)

Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of strong mixing among samples. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples.