Stephen Nayfach, Katherine S Pollard
Deep sequencing has the potential to shed light on the functional and phylogenetic heterogeneity of microbial populations in the environment. Here we present PhyloCNV, an integrated computational pipeline for quantifying species abundance and strain-level genomic variation from shotgun metagenomes. Our method leverages a comprehensive database of >30,000 reference genomes which we accurately clustered into species groups using a panel of universal-single-copy genes. Given a shotgun metagenome, PhyloCNV will rapidly and automatically identify gene copy number variants and single-nucleotide variants present in abundant bacterial species. We applied PhyloCNV to >500 faecal metagenomes from the United States, Europe, China, Peru, and Tanzania and present the first global analysis of strain-level variation and biogeography in the human gut microbiome. On average there is 8.5x more nucleotide diversity of strains between different individuals than within individuals, with elevated strain-level diversity in hosts from Peru and Tanzania that live rural lifestyles. For many, but not all common gut species, a significant proportion of inter-sample strain-level genetic diversity is explained by host geography. Eubacterium rectale, for example, has a highly structured population that tracks with host country, while strains of Bacteroides uniformis and other species are structured independently of their hosts. Finally, we discovered that the gene content of some bacterial strains diverges at short evolutionary timescales during which few nucleotide variants accumulate. These findings shed light onto the recent evolutionary history of microbes in the human gut and highlight the extensive differences in the gene content of closely related bacterial strains. PhyloCNV is freely available at: https://github.com/snayfach/PhyloCNV.