Accurate identification of genotypes is critical in identifying de novo mutations, linking mutations with disease, and determining mutation rates. Calling genotypes correctly from short-read data requires modeling read counts for each base. True heterozygotes may be affected by mapping reference bias and library preparation, leading to a distribution of reads that does not fit a 1:1 binomial distribution and potentially resulting in a failure to call the alternate allele. Homozygous sites can be affected by the alignment of paralogous genes and by sequencing error, which could incorrectly suggest heterozygosity. Previous work has modeled increased variance and skewed allele ratios to some degree. Here, we were able to model reads for all data as a mixture of Dirichlet multinomial distributions. This model has a better fit to the data than previously used models. In most cases we observed two distributions: one corresponds to a large proportion of heterozygous sites with a low reference bias and close-to-binomial distribution, and the other to a small proportion of sites with a high bias and overdispersion. The sites with high reference bias have not been previously identified as SNPs in extensive human genome research; thus, we believe these sites are not heterozygous in the individuals studied here and are falsely identified as heterozygous sites. We propose that this approach to modeling the distribution of NGS data provides a better fit to the data, which should lead to improved genotyping. Furthermore, the mixture of distributions may be used to suggest true and false positive de novo mutations. This approach provides an expected distribution of reads that can be incorporated into a model to estimate de novo mutations using reads across a pedigree.
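To make the idea of a two-component Dirichlet multinomial mixture concrete, here is a minimal Python sketch for the two-allele case (reference and alternate read counts). It is not the authors' implementation. The parameterization by a mean reference proportion `pi_ref` and an overdispersion `phi`, the helper names, and all numeric values are assumptions chosen for illustration only.

```python
from math import lgamma, log, exp

def dm_logpmf(counts, alphas):
    """Log-probability of read counts under a Dirichlet multinomial."""
    n = sum(counts)
    a0 = sum(alphas)
    out = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for x, a in zip(counts, alphas):
        out += lgamma(x + a) - lgamma(a) - lgamma(x + 1)
    return out

def alphas_from(pi_ref, phi):
    """Hypothetical parameterization: mean reference proportion pi_ref and
    overdispersion phi in (0, 1); phi near 0 approaches a binomial."""
    scale = (1.0 - phi) / phi
    return [pi_ref * scale, (1.0 - pi_ref) * scale]

def responsibilities(counts, components):
    """Posterior probability that a site belongs to each mixture component.
    components is a list of (weight, alphas) pairs."""
    terms = [log(w) + dm_logpmf(counts, a) for w, a in components]
    m = max(terms)  # subtract the maximum for numerical stability
    probs = [exp(t - m) for t in terms]
    total = sum(probs)
    return [p / total for p in probs]

# Illustrative values only, not estimates from the paper:
# a large, nearly binomial component with slight reference bias, and
# a small, strongly biased and overdispersed component.
components = [
    (0.9, alphas_from(0.52, 0.01)),
    (0.1, alphas_from(0.80, 0.20)),
]

balanced = responsibilities([15, 15], components)  # 15 ref, 15 alt reads
skewed = responsibilities([28, 2], components)     # 28 ref, 2 alt reads
```

Under these assumed parameters, a balanced site is assigned mostly to the near-binomial component, while a strongly reference-skewed site is assigned mostly to the overdispersed one. That is the sense in which the mixture could flag candidate false heterozygotes.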
Population genetic analyses of metagenomes reveal extensive strain-level variation in prevalent human-associated bacteria
Stephen Nayfach, Katherine S Pollard
doi: http://dx.doi.org/10.1101/031757
Deep sequencing has the potential to shed light on the functional and phylogenetic heterogeneity of microbial populations in the environment. Here we present PhyloCNV, an integrated computational pipeline for quantifying species abundance and strain-level genomic variation from shotgun metagenomes. Our method leverages a comprehensive database of >30,000 reference genomes, which we accurately clustered into species groups using a panel of universal single-copy genes. Given a shotgun metagenome, PhyloCNV will rapidly and automatically identify gene copy number variants and single-nucleotide variants present in abundant bacterial species. We applied PhyloCNV to >500 faecal metagenomes from the United States, Europe, China, Peru, and Tanzania and present the first global analysis of strain-level variation and biogeography in the human gut microbiome. On average there is 8.5x more nucleotide diversity of strains between different individuals than within individuals, with elevated strain-level diversity in hosts from Peru and Tanzania who live rural lifestyles. For many, but not all, common gut species, a significant proportion of inter-sample strain-level genetic diversity is explained by host geography. Eubacterium rectale, for example, has a highly structured population that tracks with host country, while strains of Bacteroides uniformis and other species are structured independently of their hosts. Finally, we discovered that the gene content of some bacterial strains diverges at short evolutionary timescales during which few nucleotide variants accumulate. These findings shed light on the recent evolutionary history of microbes in the human gut and highlight the extensive differences in the gene content of closely related bacterial strains. PhyloCNV is freely available at: https://github.com/snayfach/PhyloCNV.
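The comparison of nucleotide diversity within and between individuals can be illustrated with a small Python sketch. This is not PhyloCNV code and does not use its API. It assumes per-site allele frequencies (A, C, G, T) have already been estimated for each sample and uses the standard expected-mismatch definition of diversity. The function names and toy frequencies are hypothetical.

```python
def within_diversity(freqs):
    """Mean per-site probability that two alleles drawn from the same
    sample differ: 1 - sum(p_i^2), averaged over sites."""
    per_site = [1.0 - sum(p * p for p in site) for site in freqs]
    return sum(per_site) / len(per_site)

def between_diversity(freqs_a, freqs_b):
    """Mean per-site probability that an allele drawn from sample A
    differs from one drawn from sample B: 1 - sum(p_i * q_i)."""
    per_site = [
        1.0 - sum(p * q for p, q in zip(site_a, site_b))
        for site_a, site_b in zip(freqs_a, freqs_b)
    ]
    return sum(per_site) / len(per_site)

# Toy allele frequencies at two sites for two samples (made-up values).
sample_a = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
sample_b = [[0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]

pi_within = within_diversity(sample_a)
pi_between = between_diversity(sample_a, sample_b)
ratio = pi_between / pi_within
```

In this toy case the first site is fixed for different alleles in the two samples, so diversity between samples exceeds diversity within either sample. A ratio such as the reported 8.5x is a comparison of this kind, averaged over many species and sample pairs.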