A novel method to model read counts in genomic data to reduce false positive identification of heterozygotes
Accurate identification of genotypes is critical for identifying de novo mutations, linking mutations with disease, and determining mutation rates. Calling genotypes correctly from short-read data requires modeling the read counts for each base. True heterozygotes may be affected by reference mapping bias and library preparation, leading to a distribution of reads that does not fit a 1:1 binomial distribution and potentially resulting in failure to call the alternate allele. Homozygous sites can be affected by the alignment of paralogous genes and by sequencing error, either of which could incorrectly suggest heterozygosity. Previous work has modeled increased variance and skewed allele ratios to some degree. Here, we model read counts for all data as a mixture of Dirichlet-multinomial distributions. This model fits the data better than previously used models. In most cases we observed two mixture components: one corresponds to a large proportion of heterozygous sites with low reference bias and a close-to-binomial distribution, and the other to a small proportion of sites with high reference bias and overdispersion. The sites with high reference bias have not previously been identified as SNPs in extensive human genome research; thus, we believe these sites are not heterozygous in the individuals studied here and were falsely identified as heterozygous. We propose that modeling next-generation sequencing (NGS) read counts in this way should lead to improved genotyping. Furthermore, the mixture components may help distinguish true from false-positive de novo mutations. This approach provides an expected distribution of reads that can be incorporated into a model to estimate de novo mutations using reads across a pedigree.
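To make the mixture model concrete, the following is a minimal sketch of how a site's read counts could be scored under a two-component Dirichlet-multinomial mixture, and how the posterior probability of each component could be computed. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the parameterization (expected allele proportions `pi` and an overdispersion parameter `phi` in (0, 1), with concentration `alpha = pi * (1 - phi) / phi`), and all numeric values are hypothetical.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of read counts under a Dirichlet-multinomial
    with concentration parameters alpha (same length as counts)."""
    n = sum(counts)
    a = sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        out += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return out


def alpha_from(pi, phi):
    """Assumed parameterization: pi are expected allele proportions and
    phi in (0, 1) is overdispersion (phi -> 0 approaches a multinomial)."""
    scale = (1.0 - phi) / phi
    return [p * scale for p in pi]


def component_log_terms(counts, weights, alphas):
    """log(weight_k) + log P(counts | component k) for each component."""
    return [math.log(w) + dm_log_pmf(counts, a)
            for w, a in zip(weights, alphas)]


def mixture_log_lik(counts, weights, alphas):
    """Log-likelihood of one site under the mixture (log-sum-exp)."""
    terms = component_log_terms(counts, weights, alphas)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def responsibilities(counts, weights, alphas):
    """Posterior probability that the site belongs to each component."""
    terms = component_log_terms(counts, weights, alphas)
    m = max(terms)
    unnorm = [math.exp(t - m) for t in terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical parameters, chosen only to mirror the two kinds of
# component described above (counts are ordered as reference, alternate).
WEIGHTS = [0.9, 0.1]
ALPHAS = [
    alpha_from([0.50, 0.50], 0.01),  # low bias, close to binomial
    alpha_from([0.85, 0.15], 0.20),  # high reference bias, overdispersed
]

if __name__ == "__main__":
    for site in ([20, 18], [35, 4]):
        print(site, responsibilities(site, WEIGHTS, ALPHAS))
```

In this sketch, a site with balanced reference and alternate counts is assigned mostly to the first component, whereas a site with a strongly skewed ratio is assigned mostly to the second. In practice the weights and component parameters would be estimated from the data (for example by expectation-maximization) rather than fixed by hand.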