Bayesian mixture analysis for metagenomic community profiling.

Bayesian mixture analysis for metagenomic community profiling.

Sofia Morfopoulou, Vincent Plagnol

Deep sequencing of clinical samples is now an established tool for the detection of infectious pathogens, with direct medical applications. The large amount of data generated provides an opportunity to detect species even at very low levels, provided that computational tools can effectively interpret potentially complex metagenomic mixtures. Data interpretation is complicated by the fact that short sequencing reads can match multiple organisms and by the lack of completeness of existing databases, in particular for viral pathogens. This interpretation problem can be formulated statistically as a mixture model, where the species of origin of each read is missing, but the complete knowledge of all species present in the mixture helps with the individual reads assignment. Several analytical tools have been proposed to approximately solve this computational problem. Here, we show that the use of parallel Monte Carlo Markov chains (MCMC) for the exploration of the species space enables the identification of the set of species most likely to contribute to the mixture. The added accuracy comes at a cost of increased computation time. Our approach is useful for solving complex mixtures involving several related species. We designed our method specifically for the analysis of deep transcriptome sequencing datasets and with a particular focus on viral pathogen detection, but the principles are applicable more generally to all types of metagenomics mixtures. The code is available on github (http://github.com/smorfopoulou/metaMix) and the process is currently being implemented in a user friendly R package (metaMix, to be submitted to CRAN).

Reagent contamination can critically impact sequence-based microbiome analyses


Reagent contamination can critically impact sequence-based microbiome analyses

Susannah Salter, Michael J Cox, Elena M Turek, Szymon T Calus, William O Cookson, Miriam F Moffatt, Paul Turner, Julian Parkhill, Nick Loman, Alan W Walker

The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. Concurrent sequencing of negative control samples is strongly advised.

Phylogenetics and the human microbiome

Phylogenetics and the human microbiome
Frederick A Matsen IV
Comments: to appear in Systematic Biology
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN)

The human microbiome is the ensemble of genes in the microbes that live inside and on the surface of humans. Because microbial sequencing information is now much easier to come by than phenotypic information, there has been an explosion of sequencing and genetic analysis of microbiome samples. Much of the analytical work for these sequences involves phylogenetics, at least indirectly, but methodology has developed in a somewhat different direction than for other applications of phylogenetics. In this paper I review the field and its methods from the perspective of a phylogeneticist, as well as describing current challenges for phylogenetics coming from this type of work.

PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes


PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes

I. Gregor, J. Dröge, M. Schirmer, C. Quince, A. C. McHardy
Subjects: Quantitative Methods (q-bio.QM)

Metagenomics is an approach for characterizing environmental microbial communities in situ, it allows their functional and taxonomic characterization and to recover sequences from uncultured taxa. For communities of up to medium diversity, e.g. excluding environments such as soil, this is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community from which they originate. Assignment to low-ranking taxonomic bins is an important challenge for binning methods as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for the recovery of species bins from an individual metagenome sample is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in a composition-based taxonomic metagenome classifier and identifies the ‘training’ sequences using marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area may not have. With these challenges in mind, we have developed PhyloPythiaS+, a successor to our previously described method PhyloPythia(S). The newly developed + component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated k-mer counting 100-fold and reduced the overall execution time of the software by a factor of three. Our software allows to analyze Gb-sized metagenomes with inexpensive hardware, and to recover species or genera-level bins with low error rates in a fully automated fashion.

A novel method for the estimation of diversity in viral populations from next generation sequencing data

A novel method for the estimation of diversity in viral populations from next generation sequencing data
Jean P. Zukurov, Sieberth N. Brito, Luiz M. R. Janini, Fernando Antoneli
Comments: 17 pages, 6 figures, site: this http URL
Subjects: Quantitative Methods (q-bio.QM); Genomics (q-bio.GN)

In this paper we describe the structure and use of a computational tool for the analysis of viral genetic diversity on data generated by high- throughput sequencing. The main motivation for this work is to better understand the genetic diversity of viruses with high rates of nucleotide substitution, as HIV-1 and Influenza. This work focuses on two main fronts: the first is a novel alignment strategy that allows the recovery of the highest possible number of short-reads; the second is the estimation of the populational genetic diversity through a Bayesian approach based on Dirichlet distributions inspired by word count modeling. The software is available as an integrated platform capable of performing all operations described here, it is written in C# (Microsoft) and runs on Windows platforms. The executable, the documentation and the auxiliary files are freely available and may be obtained from: biocomp.epm.br/tanden.

Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods

J. Dröge, I. Gregor, A. C. McHardy
(Submitted on 3 Apr 2014)

Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows us to identify the sequenced community members and to reconstruct taxonomic bins with sequence data for the individual taxa. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignments by fast approximate determination of evolutionary neighbors from sequence similarities. Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples becauseit identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with ten CPU cores and microbial RefSeq as the genomic reference data.

CONCOCT: Clustering cONtigs on COverage and ComposiTion

CONCOCT: Clustering cONtigs on COverage and ComposiTion
Johannes Alneberg, Brynjar Smari Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z. Ijaz, Nicholas J. Loman, Anders F. Andersson, Christopher Quince
(Submitted on 14 Dec 2013)

Metagenomics enables the reconstruction of microbial genomes in complex microbial communities without the need for culturing. Since assembly typically results in fragmented genomes the grouping of genome fragments (contigs) belonging to the same genome, a process referred to as binning, remains a major informatics challenge. Here we present CONCOCT, a computer program that combines three types of information – sequence composition, coverage across multiple sample, and read-pair linkage – to automatically bin contigs into genomes. We demonstrate high recall and precision rates of the program on artificial as well as real human gut metagenome datasets.