Machine learning for metagenomics: methods and tools
Hayssam Soueidan, Macha Nikolski
While genomics is the study of the genome of a given organism, metagenomics is the study of many genomes at once, as is typical in some areas of environmental research. Metagenomics recognizes the need to develop computational methods for understanding the genetic composition and activities of communities of species so complex that they can only be sampled, never completely characterized.
Machine learning currently offers some of the most computationally efficient tools for building predictive models that classify biological data. Biological applications span the entire spectrum of machine learning problems, including supervised learning, unsupervised learning (or clustering), and model construction. Moreover, most biological data, metagenomic data included, are both unbalanced and heterogeneous, and thus exemplify the current challenges of machine learning in the era of Big Data.
The goal of this review is to examine the contribution of machine learning techniques to metagenomics, that is, to answer the question “to what extent does machine learning contribute to the study of microbial communities and environmental samples?” We will first briefly introduce the scientific fundamentals of machine learning. In the following sections we will illustrate how these techniques help answer questions in metagenomic data analysis. We will describe a number of methods and tools to this end, though not exhaustively. Finally, we will speculate on possible future directions of this research.