Inferring the frequency spectrum of derived variants to quantify adaptive molecular evolution in protein-coding genes of Drosophila melanogaster
Many approaches for inferring adaptive molecular evolution analyze the unfolded site frequency spectrum (SFS), a vector of counts of sites with different numbers of copies of derived alleles in a sample of alleles from a population. Accurate inference of the high-copy-number elements of the SFS is difficult, however, because of misassignment of alleles as derived versus ancestral. This is a known problem with parsimony using outgroup species. Here, we show that the problem is particularly serious if there is variation in the substitution rate among sites brought about by variation in selective constraint levels. We present a new method for inferring the SFS using one or two outgroups, which attempts to overcome the problem of misassignment. We show that two outgroups are required for accurate estimation of the SFS if there is substantial variation in selective constraints, which is expected to be the case for nonsynonymous sites of protein-coding genes. We apply the method to estimate unfolded SFSs for synonymous and nonsynonymous sites from Phase 2 of the Drosophila Population Genomics Project. We use the unfolded spectra to estimate the frequency and strength of advantageous and deleterious mutations, and estimate that ~50% of amino acid substitutions are positively selected, but that less than 0.5% of new amino acid mutations are beneficial, with a scaled selection strength of N_e s ≈ 12.
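To make the definition of the unfolded SFS and the effect of misassignment concrete, the following is a minimal illustrative sketch, not the inference method presented here. The function names `unfolded_sfs` and `misassign`, the toy counts, and the single uniform error rate are all assumptions introduced for illustration. A site with i derived copies in a sample of n alleles is recorded as having n - i copies when its ancestral and derived states are swapped, so even a small error rate can inflate the high-copy-number elements.

```python
def unfolded_sfs(derived_counts, n):
    """Build the unfolded SFS for a sample of n alleles.

    derived_counts holds, for each site, the number of copies of the
    derived allele (0..n). Element i of the result is the number of
    sites carrying exactly i derived copies.
    """
    sfs = [0] * (n + 1)
    for count in derived_counts:
        if not 0 <= count <= n:
            raise ValueError(f"derived count {count} outside 0..{n}")
        sfs[count] += 1
    return sfs


def misassign(sfs, error_rate):
    """Expected SFS when a fraction error_rate of sites is mispolarized.

    Assumption (for illustration only): every site has the same
    probability of having its ancestral and derived states swapped.
    A swapped site with i derived copies is counted in class n - i.
    """
    n = len(sfs) - 1
    return [
        (1 - error_rate) * sfs[i] + error_rate * sfs[n - i]
        for i in range(n + 1)
    ]


# Hypothetical toy data: 10 sites sampled in n = 4 alleles.
toy_counts = [1, 1, 1, 2, 1, 3, 0, 4, 1, 2]
true_sfs = unfolded_sfs(toy_counts, n=4)
observed_sfs = misassign(true_sfs, error_rate=0.1)
print(true_sfs)
print(observed_sfs)
```

In this toy case the class with three derived copies rises from 1 to an expected 1.4 sites, because the much larger singleton class leaks into it. The uniform error rate is a deliberate simplification: when misassignment varies among sites with the level of selective constraint, a single correction of this kind would not be expected to recover the true spectrum.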