Large-scale Machine Learning for Metagenomics Sequence Classification

Kévin Vervier (CBIO), Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras, Jean-Philippe Vert (CBIO)
(Submitted on 26 May 2015)

Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Due to the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions. In this work, we investigate the potential of modern, large-scale machine learning implementations for taxonomic assignment of next-generation sequencing reads based on their k-mer profile. We show that machine learning-based compositional approaches benefit from increasing the number of fragments sampled from reference genomes to tune their parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning these models involves training a machine learning model on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting models are competitive in terms of accuracy with well-established alignment tools for problems involving a small to moderate number of candidate species, and for reasonable amounts of sequencing errors. We show, however, that compositional approaches are still limited in their ability to deal with problems involving a greater number of species, and are more sensitive to sequencing errors. We finally confirm that compositional approaches achieve faster prediction times, running 3 to 15 times faster than the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise.
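To make the "k-mer profile" idea concrete, here is a minimal Python sketch of how a read can be turned into a fixed-length count vector. It is an illustration only, not the authors' pipeline: the function names `kmer_profile` and `profile_vector` and the choice to skip windows containing ambiguous bases are assumptions made for this example. With a four-letter alphabet the vector has 4^k entries, so k = 12 gives 4^12 = 16,777,216 dimensions, which matches the order of magnitude quoted in the abstract.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_profile(read, k):
    """Count overlapping k-mers in a read.

    Windows containing any character outside A/C/G/T (for example N)
    are skipped; this handling is an assumption for the sketch.
    """
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def profile_vector(read, k):
    """Dense count vector of length 4**k, in lexicographic k-mer order.

    Only practical for small k; for k around 12 a sparse representation
    would be needed.
    """
    counts = kmer_profile(read, k)
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]


if __name__ == "__main__":
    print(kmer_profile("ACGTAC", 2))
    print(len(profile_vector("ACGTAC", 2)))
```

A vector like this could then be passed to any linear classifier; which learner and feature encoding the paper actually uses is not stated in the abstract.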

On the equivalence of Maximum Parsimony and Maximum Likelihood on phylogenetic networks

Mareike Fischer, Parisa Bazargani
(Submitted on 26 May 2015)

Phylogenetic inference aims at reconstructing the evolutionary relationships of different species given some data (e.g. DNA, RNA or proteins). Traditionally, the relationships between species were assumed to be treelike, so the most frequently used phylogenetic inference methods, such as Maximum Parsimony and Maximum Likelihood, were originally introduced to reconstruct phylogenetic trees. However, it is well known that some evolutionary events like hybridization or horizontal gene transfer cannot be represented by a tree but rather require a phylogenetic network. Therefore, current research seeks to adapt tree inference methods to networks. In the present paper, we analyze Maximum Parsimony and Maximum Likelihood on networks for various network definitions which have recently been introduced, and we investigate the well-known Tuffley and Steel equivalence result concerning these methods under the setting of a phylogenetic network.
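For readers unfamiliar with Maximum Parsimony, the classical tree case can be sketched with Fitch's algorithm, which counts the minimum number of state changes needed to explain one character on a fixed rooted binary tree. This covers trees only and is not the network setting studied in the paper; the nested-tuple tree encoding and the function name `fitch` are assumptions made for this example.

```python
def fitch(tree):
    """Fitch's small-parsimony pass for one character.

    `tree` is either a leaf (a single-character string giving the
    observed state) or a 2-tuple (left_subtree, right_subtree).
    Returns (candidate_state_set, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and pay one change.
    return left_states | right_states, left_cost + right_cost + 1


if __name__ == "__main__":
    states, score = fitch((("A", "A"), ("C", "G")))
    print(states, score)
```

The parsimony score of an alignment on a tree is the sum of this per-character score over all sites; how such a score should be defined on a network depends on the network definition, which is the question the paper addresses.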

RAD sequencing enables unprecedented phylogenetic resolution and objective species delimitation in recalcitrant divergent taxa

Santiago Herrera, Timothy M. Shank
doi: http://dx.doi.org/10.1101/019745

Species delimitation is problematic in many taxa due to the difficulty of evaluating predictions from species delimitation hypotheses, which chiefly rely on subjective interpretations of morphological observations and/or DNA sequence data. This problem is exacerbated in recalcitrant taxa for which genetic resources are scarce and inadequate to resolve questions regarding evolutionary relationships and uniqueness. In this case study we demonstrate the empirical utility of restriction site associated DNA sequencing (RAD-seq) by unambiguously resolving phylogenetic relationships among recalcitrant octocoral taxa with divergences greater than 80 million years. We objectively infer robust species boundaries in the genus Paragorgia, which contains some of the most important ecosystem engineers in the deep sea, by testing alternative taxonomy-guided or unguided species delimitation hypotheses using the Bayes factors delimitation method (BFD*) with genome-wide single nucleotide polymorphism data. We present conclusive evidence rejecting the current morphological species delimitation model for the genus Paragorgia and indicating the presence of cryptic species boundaries associated with environmental variables. We argue that the suitability limits of RAD-seq for phylogenetic inferences in divergent taxa cannot be assessed in terms of absolute time, but depend on taxon-specific factors such as mutation rate, generation time and effective population size. We show that classic morphological taxonomy can greatly benefit from integrative approaches that provide objective tests of species delimitation hypotheses. Our results pave the way for addressing further questions in biogeography, species ranges, community ecology, population dynamics, conservation, and evolution in octocorals and other marine taxa.

Distance from Sub-Saharan Africa Predicts Mutational Load in Diverse Human Genomes

Brenna M. Henn, Laura R Botigue, Stephan Peischl, Isabelle Dupanloup, Mikhail Lipatov, Brian K Maples, Alicia R Martin, Shaila Musharoff, Howard Cann, Michael Snyder, Laurent Excoffier, Jeffrey Kidd, Carlos D Bustamante
doi: http://dx.doi.org/10.1101/019711

The Out-of-Africa (OOA) dispersal ~50,000 years ago is characterized by a series of founder events as modern humans expanded into multiple continents. Population genetics theory predicts an increase of mutational load in populations undergoing serial founder effects during range expansions. To test this hypothesis, we have sequenced full genomes and high-coverage exomes from 7 geographically divergent human populations from Namibia, Congo, Algeria, Pakistan, Cambodia, Siberia and Mexico. We find that individual genomes vary modestly in the overall number of predicted deleterious alleles. We show via spatially explicit simulations that the observed distribution of deleterious allele frequencies is consistent with the OOA dispersal, particularly under a model where deleterious mutations are recessive. We conclude that there is a strong signal of purifying selection at conserved genomic positions within Africa, but that many predicted deleterious mutations have evolved as if they were neutral during the expansion out of Africa. Under a model where selection is inversely related to dominance, we show that OOA populations are likely to have a higher mutation load due to increased allele frequencies of nearly neutral variants that are recessive or partially recessive.

Determining Exon Connectivity in Complex mRNAs by Nanopore Sequencing

Mohan Bolisetty, Gopinath Rajadinakaran, Brenton Graveley
doi: http://dx.doi.org/10.1101/019752

Though powerful, short-read high throughput RNA sequencing is limited in its ability to directly measure exon connectivity in mRNAs containing multiple alternative exons located farther apart than the maximum read lengths. Here, we use the Oxford Nanopore MinION™ sequencer to identify 7,899 ‘full-length’ isoforms expressed from four Drosophila genes, Dscam1, MRP, Mhc, and Rdl. These results demonstrate that nanopore sequencing can be used to deconvolute individual isoforms and that it has the potential to be an important method for comprehensive transcriptome characterization.

Genomic epidemiology of the current wave of artemisinin resistant malaria

Roberto Amato, Olivo Miotto, Charles Woodrow, Jacob Almagro-Garcia, Ipsita Sinha, Susana Campino, Daniel Mead, Eleanor Drury, Mihir Kekre, Mandy Sanders, Alfred Amambua-Ngwa, Chanaki Amaratunga, Lucas Amenga-Etego, Tim JC Anderson, Voahangy Andrianaranjaka, Tobias Apinjoh, Elizabeth Ashley, Sarah Auburn, Gordon A Awandare, Vito Baraka, Alyssa Barry, Maciej F Boni, Steffen Borrmann, Teun Bousema, Oralee Branch, Peter C Bull, Kesinee Chotivanich, David J Conway, Alister Craig, Nicholas P Day, Abdoulaye Djimdé, Christiane Dolecek, Arjen M Dondorp, Chris Drakeley, Patrick Duffy, Diego F Echeverri-Garcia, Thomas G Egwang, Rick M Fairhurst, Md. Abul Faiz, Caterina I Fanello, Tran Tinh Hien, Abraham Hodgson, Mallika Imwong, Deus Ishengoma, Pharath Lim, Chanthap Lon, Jutta Marfurt, Kevin Marsh, Mayfong Mayxay, Victor Mobegi, Olugbenga Mokuolu, Jacqui Montgomery, Ivo Mueller, Myat Phone Kyaw, Paul N Newton, Francois Nosten, Rintis Noviyanti, Alexis Nzila, Harold Ocholla, Abraham Oduro, Marie Onyamboko, Jean-Bosco Ouedraogo, Aung Pyae Phyo, Christopher V Plowe, Ric N Price, Sasithon Pukrittayakamee, Milijaona Randrianarivelojosia, Pascal Ringwald, Lastenia Ruiz, David Saunders, Alex Shayo, Peter Siba, Shannon Takala-Harrison, Thuy-Nhien Nguyen Thanh, Vandana Thathy, Federica Verra, Nicholas J White, Ye Htut, Victoria J Cornelius, Rachel Giacomantonio, Dawn Muddyman, Christa Henrichs, Cinzia Malangone, Dushyanth Jyothi, Richard D Pearson, Julian C Rayner, Gilean McVean, Kirk Rockett, Alistair Miles, Paul Vauterin, Ben Jeffery, Magnus Manske, Jim Stalker, Bronwyn MacInnis, Dominic P Kwiatkowski, for the MalariaGEN Plasmodium falciparum Community
doi: http://dx.doi.org/10.1101/019737

Artemisinin resistant Plasmodium falciparum is advancing across Southeast Asia in a soft selective sweep involving at least 20 independent kelch13 mutations. In a large global survey, we find that kelch13 mutations which cause resistance in Southeast Asia are present at low frequency in Africa. We show that African kelch13 mutations have originated locally, and that kelch13 shows a normal variation pattern relative to other genes in Africa, whereas in Southeast Asia there is a great excess of non‐synonymous mutations, many of which cause radical amino‐acid changes. Thus, kelch13 is not currently undergoing strong selection in Africa, despite a deep reservoir of standing variation that could potentially allow resistance to emerge rapidly. The practical implications are that public health surveillance for artemisinin resistance should not rely on kelch13 data alone, and interventions to prevent resistance must account for local evolutionary conditions, shown by genomic epidemiology to differ greatly between geographical regions.

Ancestral chromatin configuration constrains chromatin evolution on differentiating sex chromosomes in Drosophila

Qi Zhou, Doris Bachtrog
doi: http://dx.doi.org/10.1101/019786

Sex chromosomes evolve distinctive types of chromatin from a pair of ancestral autosomes that are usually euchromatic. In Drosophila, the dosage-compensated X becomes enriched for hyperactive chromatin in males (mediated by H4K16ac), while the Y chromosome acquires silencing heterochromatin (enriched for H3K9me2/3). Drosophila autosomes are typically mostly euchromatic but the small dot chromosome has evolved a heterochromatin-like milieu (enriched for H3K9me2/3) that permits the normal expression of dot-linked genes, but which is different from typical pericentric heterochromatin. In Drosophila busckii, the dot chromosomes have fused to the ancestral sex chromosomes, creating a pair of ‘neo-sex’ chromosomes. Here we collect genomic, transcriptomic and epigenomic data from D. busckii, to investigate the evolutionary trajectory of sex chromosomes from a largely heterochromatic ancestor. We show that the neo-sex chromosomes formed <1 million years ago, but nearly 60% of neo-Y linked genes have already become non-functional. Expression levels are generally lower for the neo-Y alleles relative to their neo-X homologs, and the silencing heterochromatin mark H3K9me2, but not H3K9me3, is significantly enriched on silenced neo-Y genes. Despite rampant neo-Y degeneration, we find that the neo-X is deficient for the canonical histone modification mark of dosage compensation (H4K16ac), relative to autosomes or the compensated ancestral X chromosome, possibly reflecting constraints imposed on evolving hyperactive chromatin in an originally heterochromatic environment. Yet, neo-X genes are transcriptionally more active in males, relative to females, suggesting the evolution of incipient dosage compensation on the neo-X. Our data show that Y degeneration proceeds quickly after sex chromosomes become established through genomic and epigenetic changes, and are consistent with the idea that the evolution of sex-linked chromatin is influenced by its ancestral configuration.