Thoughts on: Probabilities of Fitness Consequences for Point Mutations Across the Human Genome

This guest post is by Greg Cooper, Martin Kircher, Daniela Witten, and Jay Shendure, and is a comment on Probabilities of Fitness Consequences for Point Mutations Across the Human Genome by Gulko et al.


Recently, Gulko et al. (2014) described an approach, FitCons, to estimate fitness consequences for point mutations using a combination of functional genomic annotations and inferences of selection based on human variant frequency spectra. On the basis of comparisons with several maps of regulatory element features, they concluded that FitCons is substantially better at inferring deleterious regulatory effects of variants than other metrics, including an annotation we developed named Combined Annotation Dependent Depletion (CADD, Kircher et al. 2014). However, we believe that the comparisons of FitCons and CADD for detecting deleterious regulatory variation are misleading, and that methods to predict fitness effects of point mutations should evaluate variants with demonstrable effects rather than variants assumed to have an effect by virtue of being within a functional element. We find that FitCons is substantially less effective than CADD at separating variants, both coding and regulatory, with functional and/or phenotypic effects from functionally inert and/or organismally benign variants. For example, CADD is much more effective at enriching for mutations in two enhancers and one promoter that have been experimentally shown to have large transcriptional effects. Further, in contrast with CADD, FitCons does not separate highly deleterious variants that cause Mendelian disease from high-frequency benign variants, nor does it separate complex-trait associated variants, which are enriched for deleterious regulatory effects, from matched control variants. We believe that it would be more appropriate to characterize FitCons as a predictor of cell-type specific regulatory elements, and to compare it to other tools directed specifically at this task, rather than variant fitness consequences.

Main text

FitCons, recently developed by Gulko et al (2014), is a method to estimate fitness effects of point mutations at both coding and non-coding positions in human genomes. FitCons works by first defining regional boundaries on the basis of clusters of functional genomic signals (“fingerprints”) and then estimating selective effects, inferred from allele frequency distributions within human populations, for variants with the same fingerprint. On the basis of comparisons with enhancers, transcription factor binding sites (TFBSs), and expression quantitative trait loci (eQTLs), Gulko et al. concluded that FitCons is substantially better at inferring variant regulatory effects than other metrics, including an annotation we developed named Combined Annotation Dependent Depletion (CADD, Kircher et al. 2014).

While FitCons is an interesting approach with potentially useful attributes, we believe that the comparisons of FitCons and CADD for detecting deleterious regulatory variation are misleading. Clarification is needed as to the purposes and performances of these metrics. Below, we first describe what we believe to be important general distinctions between CADD and FitCons and then detail their relative effectiveness at differentiating several sets of functional and/or pathogenic variants from inert and/or benign variants. Finally, we consider how correlations between the bin definitions and validation datasets used in Gulko et al., rather than fitness per se, may underlie the performance of FitCons for cell-type specific regulatory element prediction.

CADD is an allele-specific measure of variant deleteriousness that accounts for a wide variety of discrete and quantitative information at the allelic, site, and regional levels (Kircher et al. 2014). CADD scores can vary among the possible alleles at a given site, across sites within a given region or functional element, and across differing classes of functional elements. FitCons, on the other hand, is driven by a small number of cell-type-specific, regional features with reduced or absent variation within regions. As a result, FitCons is in practice a segmentation method: the median length of uniformly scored segments is 72 bases, the average segment length is 196 bases, and 50% of all scored bases in hg19 lie within a segment over 950 bases long. Furthermore, 30% of all bases in hg19 are assigned the mode value (0.062), 60% are assigned one of two FitCons values, and over 80% are assigned one of 10 possible values. Thus, FitCons is in practice a regional annotation of cell-type-specific molecular activity, not a site- or allele-specific metric of variant deleteriousness.

The basic structures of FitCons and CADD are crucial to interpreting the data presented by Gulko et al. In particular, Gulko et al. measure utility by assessing coverage of bases within functional elements, namely TFBSs and enhancers, relative to genomic background. While such an approach is reasonable for evaluating a method that annotates functional elements, it is not informative for a method that estimates organismal deleteriousness, since many mutations within functional elements are evolutionarily neutral, including many that lack even a molecular effect. Indeed, by FitCons/INSIGHT estimates, most sites within the enhancers and TFBSs evaluated have fitness effect probabilities below 0.2 (Gulko et al. 2014). While that fraction is likely somewhat higher among high-information TF-binding motif positions and lower among the enhancers used (mean size of 888 bp), a decisive majority of positions in these nucleotide groups are mutable without consequence. Performance evaluations that reward uniformly high coverage of bases in these regions, rather than of the particular subset of variants therein that actually have deleterious molecular effects, are therefore not meaningful for estimating the fitness consequences of point mutations.

We firmly believe that methods to predict functional or fitness effects of mutations should be evaluated on mutations for which we have data relevant to function and fitness, not large aggregates of genomic regions or bases within which mutations are simply assumed to be phenotypically relevant. When tested on such mutation sets, we find that FitCons fails to capture a considerable amount of site- and allele-specific information that is captured by CADD (and between-species conservation metrics to a lesser extent). This loss of information, in turn, has profound effects on FitCons’ ability to identify variants with functional, pathogenic, or deleterious effects, including for regulatory changes.

First, FitCons has no predictive power for separating pathogenic variants in ClinVar (Landrum et al. 2014) from benign, high-frequency polymorphisms matched for genic effect category (e.g., missense, nonsense): the distributions of FitCons scores for pathogenic and benign variants are nearly identical (Figure 1). While most of these variants are protein-altering, the same pattern holds for the subset of pathogenic/benign variants that do not directly alter proteins (Figure 1, right). In contrast, CADD and conservation measures like GERP (Cooper et al. 2005) strongly differentiate pathogenic from high-frequency variants and, although more weakly, also differentiate non-protein-altering pathogenic from benign variants (for further details, see Kircher et al. 2014). The inability of FitCons to distinguish these highly pathogenic/deleterious variants from clearly benign variants runs counter to the general narrative in Gulko et al., in which FitCons scores are claimed to correlate with mutational fitness effect probabilities.
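The rank-based flavor of the comparison in Figure 1 can be sketched in a few lines. The snippet below (with invented scores, not the actual ClinVar, CADD, or FitCons data) computes the probability that a randomly chosen pathogenic variant outscores a randomly chosen benign one, i.e. the Mann-Whitney U statistic scaled to [0, 1]: 0.5 means no separation, 1.0 means perfect separation.

```python
# Minimal sketch of the kind of comparison behind Figure 1: estimate the
# probability that a random pathogenic variant outscores a random benign
# one (the Mann-Whitney U statistic divided by n*m). All scores below are
# hypothetical illustrations, not real data.

def auc_separation(pathogenic, benign):
    """P(random pathogenic score > random benign score), ties count 0.5."""
    wins = 0.0
    for p in pathogenic:
        for b in benign:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(pathogenic) * len(benign))

# A metric that assigns nearly identical scores to both classes yields a
# value near 0.5 (no separation):
flat_path = [0.062, 0.062, 0.08, 0.062]
flat_benign = [0.062, 0.08, 0.062, 0.062]
print(auc_separation(flat_path, flat_benign))

# A metric that shifts pathogenic variants upward yields a value near 1:
shifted_path = [25.1, 23.4, 28.0, 21.7]
shifted_benign = [2.3, 5.1, 0.8, 3.9]
print(auc_separation(shifted_path, shifted_benign))
```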

Figure 1. Boxplots showing the score distributions for CADD (top), FitCons (middle), and GERP (bottom), for pathogenic SNVs (red) vs. benign, high-frequency SNVs (blue) chosen to match one-to-one the genic consequence profile of the pathogenic variants. Score distributions for all SNVs are plotted on the left, while the subset of SNVs that are not missense, canonical splice, or nonsense (i.e., “non-protein-altering”) are on the right.


Second, as Gulko et al. emphasize the detection of regulatory variation (the title of the manuscript notwithstanding), we performed a detailed examination of three regulatory elements for which saturation mutagenesis data exist (Patwardhan et al. 2009; 2012). While not global, these data comprise directly measured, not assumed, regulatory effects.

In the 70-bp promoter region of HBB (Patwardhan et al. 2009), FitCons assigns all bases to the genome-wide mode (0.062). However, mutations in this region exhibit substantial variation in both transcriptional and disease consequences. Mutational effects on in vitro promoter activity range from no effect to a >2-fold change in transcription, and some of the mutations with strong in vitro effects cause beta-thalassemia by disrupting normal transcript regulation. Both CADD (Spearman's rho=0.23) and GERP (rho=0.11) correlate significantly with the regulatory and disease consequences of these mutations (details in Kircher et al. 2014).

Within each of two enhancers tested by saturation mutagenesis (Patwardhan et al. 2012), FitCons scores correlate with mutational effect (ECR11 rho=0.32, ALDOB rho=0.26) at magnitudes similar to CADD (ECR11 rho=0.25, ALDOB rho=0.36). However, in both elements, the FitCons correlation is due to a higher-scoring segment overlapping a more transcriptionally active region (Figure 2); there is no predictive power within the active regions. For example, most of the mutations with regulatory effects in the ECR11 enhancer reside in the last ~100 bases, which in turn reside within a single 168-bp FitCons segment. Within this segment, considerable mutational effect variation exists: 209 of 504 possible SNVs, distributed across 110 of the 168 sites, have no discernible effect on transcription (p >= 0.1). Concordantly, these inert mutations have significantly lower CADD scores (Wilcoxon test p=5.9 x 10^-25) than their interdigitated SNV neighbors with at least some evidence of functional effect. Furthermore, within the set of mutations that have at least some evidence of effect (p < 0.1; other arbitrary thresholds yield similar results), transcriptional effect sizes vary considerably and correlate with CADD (rho=0.33).
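The within-segment correlations reported above are ordinary Spearman rank correlations. As a self-contained sketch (all effect sizes and scores below are invented for illustration), rho can be computed by ranking both variables, with ties sharing their average rank, and taking the Pearson correlation of the ranks. Note that a score track that is constant across a segment, as FitCons is here, has zero rank variance, so the correlation is undefined.

```python
# Spearman's rho from scratch: rank both variables (ties share their
# average rank), then compute the Pearson correlation of the ranks.

def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined for a constant score track
    return cov / (vx * vy) ** 0.5

effects = [0.0, 0.1, 0.4, 0.9, 1.3]              # hypothetical log-fold changes
per_variant_score = [1.2, 2.0, 8.5, 14.1, 22.3]  # hypothetical per-variant scores
print(spearman(per_variant_score, effects))      # perfectly monotone: 1.0
```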

Figure 2. Transcriptional effects of individual mutations (log-fold change, y-axis on the left) are plotted (black) against genomic position (hg19, chromosome 2, x-axis).  FitCons scores are plotted along the same region in red (y-axis on the right).  Within this region there are three FitCons segments, with a segment spanning the last ~168 bases harboring the most active region.


Next, as suggested by Gulko et al., we quantified coverage of discretely thresholded regulatory variants to evaluate the extent to which FitCons and CADD could enrich for "large-effect" regulatory mutations. Specifically, there are 108 mutations that alter transcriptional activity by at least two-fold within the three elements tested (29 mutations across 19 bases in ECR11, 76 mutations across 41 bases in ALDOB, and 3 mutations across 3 sites in HBB). We compared coverage of these 108 mutations at various thresholds relative to coverage of hg19, and found that CADD is much more effective at enriching for them than is FitCons (Figure 3). For example, 95 (88%) of the large-effect regulatory variants have a scaled CADD score above 10, a threshold that includes 10% of all possible hg19 SNVs (~9-fold enrichment above genomic background). Enrichment peaks at a CADD score of 15, a threshold that includes 53.7% of large-effect regulatory variants but only 3.2% of hg19 SNVs (~17-fold enrichment). In contrast, FitCons enrichment peaks at a threshold of 0.08, wherein only ~27% of all large-effect mutations are covered (~4-fold enrichment above background).
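The enrichment calculation used in Figure 3 reduces to a ratio of coverage fractions at each score threshold. A minimal sketch, with made-up scores rather than the real CADD or FitCons distributions:

```python
# Fold enrichment at a threshold: (fraction of large-effect mutations at or
# above the threshold) / (fraction of background SNVs at or above it).
# Scores here are invented for illustration.

def fold_enrichment(positives, background, threshold):
    pos_frac = sum(s >= threshold for s in positives) / len(positives)
    bg_frac = sum(s >= threshold for s in background) / len(background)
    if bg_frac == 0:
        return float("inf") if pos_frac > 0 else 0.0
    return pos_frac / bg_frac

positives = [20.1, 23.4, 18.7, 5.2]          # hypothetical large-effect scores
background = [i / 10.0 for i in range(250)]  # hypothetical genome-wide scores 0.0-24.9

# Scanning thresholds shows where enrichment peaks for this toy data:
for t in (10, 15, 20):
    print(t, fold_enrichment(positives, background, t))
```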

Figure 3. Comparison of coverage levels of large-effect regulatory mutations (y-axis) in two enhancers and one promoter relative to genomic background coverage levels (x-axis, log-scaled) for CADD (red) and FitCons (blue).


We next evaluated the ability of FitCons to distinguish trait-associated SNPs identified in genome-wide association studies (GWAS). Such SNPs are as a group enriched for regulatory variants with pathogenic, likely deleterious effects, a hypothesis supported by numerous anecdotal and systematic studies (e.g., Hindorff et al. 2009; Musunuru et al. 2010; Nicolae et al. 2010; ENCODE Project Consortium et al. 2012). These variants are overwhelmingly non-protein-altering (98%), with ~83% being intronic or intergenic SNPs not near an exon. We previously showed that CADD scores significantly separate lead GWAS SNPs from matched neighboring control SNPs (Wilcoxon p=5.3 x 10^-12). This separation remains highly significant for the 83% that are intronic or intergenic (p=1.26 x 10^-9), indicating that it is not driven solely by coding or near-coding variation. In contrast, FitCons scores do not separate lead GWAS SNPs from controls, whether considering all variants (p=0.32) or intronic/intergenic variants only (p=0.57).

With respect to the separation of eQTL SNPs from controls, Gulko et al. used all common variants that were part of the study as a background/control set. We believe the results from such a test are difficult to interpret. They do not control for the effects of minor allele frequency (MAF), for example, a key property that correlates with both technical (e.g., eQTL discovery power) and biological effects (e.g., eQTL deleteriousness). Additionally, they do not control for genomic location. By virtue of the annotations it uses, FitCons scores will tend to be higher near genes transcribed in the cell-type of choice, which are in turn the only genes for which eQTLs can be identified. This analysis therefore confounds the information that comes from a focus on cis, rather than trans, eQTLs with the information intrinsic to the scores themselves. While this may be an advantage, relative to a more general predictor like CADD, for predicting cell-type-specific function, it likely comes at the cost of reduced accuracy in predicting deleteriousness per se (see below). Furthermore, in practical terms it is not likely to be useful given that cis-effects are the first and major focus of most eQTL discovery efforts, and it is unclear that FitCons would outperform other cell-type-specific regulatory element annotations, such as those from integrative predictions of enhancers and promoters (Ernst and Kellis 2012; Hoffman et al. 2012). In any case, an analysis in which eQTL SNPs are matched to controls for MAF, genomic location, distance to TSS, and other confounders would provide a more meaningful evaluation of the utility of FitCons in a realistic eQTL discovery/fine-mapping analysis.
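One possible shape for such a matched analysis, sketched with hypothetical field names and arbitrary bin widths (neither is from Gulko et al. nor from our own pipeline), is to bucket SNPs by MAF and distance to TSS and draw each control from the eQTL SNP's own bucket:

```python
# Sketch of confounder-matched control selection: pair each eQTL SNP with
# a control SNP from the same MAF bin and the same distance-to-TSS bin.
# Field names and bin widths are illustrative choices.
import random

def bin_key(snp, maf_width=0.05, dist_width=10_000):
    return (int(snp["maf"] / maf_width), int(snp["tss_dist"] / dist_width))

def match_controls(eqtls, candidates, seed=0):
    """Return one matched control per eQTL SNP, or None when a bin is empty."""
    rng = random.Random(seed)
    pool = {}
    for c in candidates:
        pool.setdefault(bin_key(c), []).append(c)
    matched = []
    for snp in eqtls:
        bucket = pool.get(bin_key(snp), [])
        matched.append(rng.choice(bucket) if bucket else None)
    return matched

eqtls = [{"maf": 0.12, "tss_dist": 4_500}]
candidates = [{"maf": 0.13, "tss_dist": 2_000},   # same bins: matchable
              {"maf": 0.40, "tss_dist": 95_000}]  # different bins
print(match_controls(eqtls, candidates))
```

Score comparisons between the eQTL set and the matched set then isolate the metric's information from frequency and location effects.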

Finally, we believe that the discrepancies between the performance metrics defined here and those within Gulko et al. are also influenced by a potential circularity of cell-type-specific information within both the model definition and the validation. Indeed, while INSIGHT adds value by scoring the 624 potential feature-defined bins, the correlations between the bin definitions (expression, DNase, and ChromHMM states) and the validation data (TFBSs, enhancers, eQTL candidates) are quite likely the primary drivers of the performance of FitCons as measured in Gulko et al. In fact, the strong correlation between INSIGHT and PhyloP as bin scoring methods suggests that metrics of evolutionary constraint could replace INSIGHT in the FitCons framework.

More generally, the use of cell-type-specific features in both the metric definition and validation obscures a crucial trade-off in analyses of this sort. Evolutionary constraint-driven metrics (including CADD, which is strongly influenced by evolutionary constraint) emphasize variant effects on organismal fitness, which often have nothing to do with molecular function in any given cell-type; this means that constraint-driven metrics may be sub-optimal when trying to predict molecular function within that cell-type. However, the converse is also true: efforts to predict deleteriousness that emphasize cell-type-specific molecular functionality will often be misleading when a variant has no molecular effect in that cell-type but strong fitness consequences due to functional effects elsewhere, or has a molecular effect with no fitness consequence. Obviously, optimal choices in this trade-off depend greatly on analytical context and goals, but in our opinion the goal of predicting "fitness consequences for point mutations" dictates that performance metrics focused on organismal deleteriousness are more appropriate.

As a final illustrative example, Weedon et al. (2014) identified a set of noncoding SNVs that disrupt the function of an enhancer of PTF1A and cause pancreatic agenesis. CADD scores these variants between 23.2 and 24.5, higher than 99.5% of all possible hg19 SNVs and higher than 56% of pathogenic SNVs in ClinVar (most of which are protein-altering); much of the CADD signal for these variants results from measures of mammalian constraint (not shown). FitCons, on the other hand, places these variants in a 5-kb block of sites all scored at the genome-wide mode (0.062). This is in part a result of not having functional genomic data from cells in which the enhancer is active; however, the absence of such data in disease studies is common given that the relevant cell-types are frequently unknown or inaccessible. Further, even if DNase, RNA, and ChromHMM data were all generated for this cell type, given the general distributions of FitCons scores within regulatory elements observed in other cell types and lack of inter-species conservation information, it is unlikely that FitCons would have ranked these variants within the top 0.5% of all possible SNVs.

In any case, Gulko et al. demonstrate that FitCons is reasonably effective, and more so than CADD, at predicting the approximate boundaries of regulatory elements in cell types on which it is trained. However, claims that it better predicts functional or fitness effects of variants in either coding or non-coding regions are unsupported. Indeed, when challenged to separate point mutations with demonstrable effects from appropriate sets of control SNVs, CADD and other metrics that include evolutionary constraint information are substantially better as predictors of both coding and non-coding variant impact. We suggest that it would be more appropriate to characterize FitCons as a predictor of cell-type specific regulatory elements rather than variant fitness consequences, and to compare it to other tools directed at this task, such as ChromHMM (Ernst and Kellis 2012) or Segway (Hoffman et al. 2012).


We wish to thank Brad Gulko and Adam Siepel for readily sharing data and engaging in productive dialogue.


Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. 2005. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 15: 901–913.

ENCODE Project Consortium, Dunham I, Kundaje A, Aldred SF, Collins PJ, et al. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74.

Ernst J, Kellis M. 2012. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 9: 215–216.

Gulko B, Hubisz MJ, Gronau I, Siepel A. 2014. Probabilities of fitness consequences for point mutations across the human genome. bioRxiv.

Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362–9367.

Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS. 2012. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 9: 473–476.

Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. 2014. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46: 310–315.

Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. 2014. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42: D980–D985.

Musunuru K, Strong A, Frank-Kamenetsky M, Lee NE, Ahfeldt T, Sachs KV, Li X, Li H, Kuperwasser N, Ruda VM, et al. 2010. From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466: 714–719.

Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, Cox NJ. 2010. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6: e1000888.

Patwardhan RP, Hiatt JB, Witten DM, Kim MJ, Smith RP, May D, Lee C, Andrie JM, Lee S-I, Cooper GM, et al. 2012. Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotechnol 30: 265–270.

Patwardhan RP, Lee C, Litvin O, Young DL, Pe'er D, Shendure J. 2009. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat Biotechnol 27: 1173–1175.

Weedon MN, Cebola I, Patch A-M, Flanagan SE, De Franco E, Caswell R, Rodríguez-Seguí SA, Shaw-Smith C, Cho CH-H, Allen HL, et al. 2014. Recessive mutations in a distal PTF1A enhancer cause isolated pancreatic agenesis. Nat Genet 46: 61–64.


Thoughts on: Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

I (@joe_pickrell) was recently asked to review a preprint by Decker et al., Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle for a journal. Below are the comments I sent the journal.

In this paper, the authors apply a suite of population genetics analyses to a set of cattle breeds. The basic data consists of around 1,500 individuals from 143 breeds typed at around 40,000 SNPs. The authors use this data to build population trees/graphs using TreeMix and visualize population structure with PCA/ADMIXTURE. They then interpret the results of these programs in light of their knowledge of the history of cattle domestication. I had no knowledge of cattle history prior to reading this manuscript, so I enjoyed reading it. I have first a few comments on the manuscript as a whole, then on individual points.

Overall comments:

1. A lot of interpretation depends on the robustness of the inferred population graph from TreeMix. It would be extremely helpful to see that the estimated graph is consistent across different random starting points. The authors could run TreeMix, say, five different times, and compare the results across runs. I expect that many of the inferred migration edges will be consistent, but a subset will not. It’s probably most interesting to focus interpretation on the edges that are consistent.

2. Throughout the manuscript, inference from genetics is mixed in with evidence from other sources, and at points it becomes unclear which claims are made strictly from genetics and which are not. For example, the authors write, "Anatolian breeds are admixed between European, African, and Asian cattle, and do not represent the populations originally domesticated in the region". It seems possible that the first part of that statement (about admixture) could be their conclusion from the genetic data, but it's difficult to make the second statement (about the original populations in the region) from genetics, so presumably this is based on other sources. In general, I would suggest separating the results internal to this paper from the other statements and making a clear firewall between their results and the historical interpretation of those results (right now the authors have a combined "Results and Discussion" section; it might be easiest to do this by splitting the "Results" from the "Discussion," though this is up to the authors).

3. Related to the above point, could the authors add subsection headings to the results/discussion section? Right now the topic of the paper jumps around considerably from paragraph to paragraph, and at points I had difficulty following. One possibility would be to organize the subheadings by the claims made in the abstract, e.g. "Cline of indicine introgression into Africa", "Wild African auroch ancestry", etc.

Specific comments:

There are quite a few results claimed in this paper, so I'm going to split my comments apart by the results reported in the abstract. As mentioned above, it would be nice if the authors clearly stated exactly which pieces of evidence they view as supporting each of these, perhaps in subheadings in the Results section. Each relevant sentence from the abstract is given first, followed by my thoughts:

Using 19 breeds, we map the cline of indicine introgression into Africa.

This claim is based on interpretation of the ADMIXTURE plot in Figure 5. I wonder if a map might make this point more clearly than Figure 5, however; the three-letter population labels in Figure 5 are not very easy to read, especially since most readers will have no knowledge of the geographic locations of these breeds.

We infer that African taurine possess a large portion of wild African auroch ancestry, causing their divergence from Eurasian taurine.

This claim appears to be largely based on the interpretation of the TreeMix plot in Figure 4. This figure shows an admixture edge from the ancestors of the European breeds into the African breeds. As noted above, it seems important that this migration edge be robust across different TreeMix runs. Also, labeling this ancestry as "wild African auroch ancestry" seems like an interpretation of the data rather than something that has been explicitly tested, since the authors don't have wild African aurochs in their data.

Additionally, the authors claim that this result shows “there was not a third domestication process, rather there was a single origin of domesticated taurine…”. I may be missing something, but it seems that genetic data cannot distinguish whether a population was “domesticated” or “wild”. That is, it seems plausible that the source population tentatively identified in Figure 4 may have been independently domesticated. There may be other sources of evidence that refute this interpretation, but this is another example of where it would be useful to have a firewall between the genetic results and the interpretation in light of other evidence. The speculation about the role of disease resistance in introgression is similarly not based on evidence from this paper and should probably be set apart.
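For what it's worth, the robustness check suggested above (here and in overall comment 1) is easy to script once the migration edges from each TreeMix run are parsed into donor/recipient pairs. The sketch below uses invented edge labels and glosses over the real-world complication that internal node names can differ between runs:

```python
# Keep only the migration edges reported by every TreeMix run. Each run is
# represented as a set of (donor, recipient) pairs; the labels below are
# invented placeholders standing in for parsed TreeMix output.

def consistent_edges(runs):
    """Migration edges shared by all runs."""
    shared = set(runs[0])
    for run in runs[1:]:
        shared &= set(run)
    return shared

run1 = {("EuropeanAncestor", "AfricanTaurine"), ("Indicine", "Madura")}
run2 = {("EuropeanAncestor", "AfricanTaurine"), ("Indicine", "Brebes")}
run3 = {("EuropeanAncestor", "AfricanTaurine"), ("Indicine", "Madura")}

# Only the European -> African taurine edge survives all three runs:
print(consistent_edges([run1, run2, run3]))
```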

We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia.

The cline of taurine/indicine hybridization is based on interpretation of ADMIXTURE plots and some follow-up f4 statistics. I found this difficult to follow, especially since a significant f4 statistic can have multiple interpretations. Perhaps the authors could draw out the proposed phylogeny for these breeds and explain the reasons they chose particular f4 statistics to highlight.

We also identify the influence of species other than Bos taurus in the formation of Asian breeds.

The conclusion that species other than Bos taurus have introgressed into Asian breeds seems to be based on interpretation of branch lengths in the trees in Figures 2-3 and some f3 statistics. The interpretation of branch lengths is extremely weak evidence for introgression, probably not even worth mentioning. The f3 statistics are potentially quite informative, though. For the breeds in question (Brebes and Madura), which pairs of populations give the most negative f3 statistics? This information is difficult to extract from Supplementary Table 2, where the populations appear to be sorted alphabetically; a table showing the (for example) five most negative f3 statistics could be quite useful here. In general, if the SNP ascertainment scheme is not extremely complicated (can the authors describe the ascertainment scheme for this array?), a negative f3 statistic is very strong evidence that the target population is admixed, whereas a significant f4 statistic only means that at least one of the four populations in the statistic is admixed. This might be a useful property for the authors.
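For readers unfamiliar with the statistic, f3(C; A, B) is the average across SNPs of (c - a)(c - b), where a, b, and c are allele frequencies in the two reference populations and the target; a significantly negative value indicates admixture in C. A toy sketch with synthetic frequencies (a real analysis would also correct for finite sample sizes and assess significance, e.g. by block jackknife):

```python
# f3(C; A, B): mean over SNPs of (c - a)(c - b). If C's frequencies sit
# between A's and B's at many SNPs, as admixture pushes them to do, each
# term is negative. Frequencies below are synthetic.

def f3(target, ref_a, ref_b):
    n = len(target)
    return sum((c - a) * (c - b) for c, a, b in zip(target, ref_a, ref_b)) / n

ref_a = [0.1, 0.8, 0.5, 0.3, 0.9]
ref_b = [0.9, 0.2, 0.4, 0.7, 0.1]

# Target constructed as a 50/50 mixture of the two references: its
# frequency is between a and b at every SNP, forcing (c-a)(c-b) <= 0.
admixed = [(a + b) / 2 for a, b in zip(ref_a, ref_b)]
print(f3(admixed, ref_a, ref_b))  # negative
```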

We detect the pronounced influence of Shorthorn cattle in the formation of European breeds.

This conclusion appears to be based on interpretation of ADMIXTURE plots in Figures S6-S9. Interpreting these types of plots is notoriously difficult. I wonder if the f3 statistics might be useful here: do the authors get negative f3 statistics in the populations they write “share ancestry with Shorthorn cattle” when using the Durham shorthorns as one reference?

Iberian and Italian cattle possess introgression from African taurine.

This conclusion is based on ADMIXTURE plots and TreeMix; it would be interesting to see the results from f3 statistics as well.

American Criollo cattle are shown to be of Iberian, and not African, descent.

I found this difficult to follow: the authors write that these breeds "derive 7.5% of their ancestry from African taurine introgression", so presumably they are in fact partially of African descent?

Indicine introgression into American cattle occurred in the Americas, and not Europe

This conclusion seems difficult to make from genetic data. The authors identify “indicine” ancestry in American cattle, so I don’t see how they can determine whether this happened before or after a migration without temporal information. It would be helpful if the authors walk the reader through each logical step they’re making so that the reader can decide whether they believe the evidence for each step.

Some background on Bhaskar and Song’s paper: “The identifiability of piecewise demographic models from the sample frequency spectrum”

This post is by Graham Coop [@Graham_Coop], and is an introduction to the background of Anand Bhaskar and Yun Song's preprint "The identifiability of piecewise demographic models from the sample frequency spectrum", arXived here.

Anand and Yun’s preprint focuses on what we can learn about population demographic history from the site frequency spectrum. It takes as its starting point an article by Myers et al (Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.). I wrote about Myers et al’s article back in 2008, when it came out (see here). I thought I’d post an edited version of that post by way of additional background to Anand and Yun’s preprint and guest post.

Edited version of original post:
The best way to learn about demography from population genetic data is to look at patterns of diversity across many unlinked regions. The distribution of allele frequencies at unlinked, neutral SNPs in a population (the site frequency spectrum) is potentially very informative about population history. For example, an excess of low-frequency mutations is consistent with recent population growth: the increase in population size introduces new mutations, but these mutations have not yet had time to drift to higher frequencies. Many authors have made use of the frequency spectrum of unlinked, putatively neutral SNPs to learn about demography.
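For concreteness, the neutral constant-size baseline that such inferences are compared against has a simple closed form: the expected number of sites where the derived allele appears i times in a sample of n chromosomes is proportional to 1/i. The sketch below just computes that normalized baseline; an observed excess of singletons relative to it is the classic signature of recent growth.

```python
# Expected site frequency spectrum under the standard neutral coalescent
# with constant population size: the class of sites where the derived
# allele appears i times (i = 1, ..., n-1) has expected count
# proportional to 1/i. Normalizing gives the expected shape.

def expected_sfs(n):
    """Normalized expected SFS for a sample of n chromosomes, constant size."""
    raw = [1.0 / i for i in range(1, n)]
    total = sum(raw)
    return [x / total for x in raw]

sfs = expected_sfs(10)
print(round(sfs[0], 3))   # singleton class carries the largest share
print(round(sfs[-1], 3))  # the (n-1)-ton class carries the smallest
```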

Back in 2008, a technical but elegant article by Myers et al showed that while the site frequency spectrum at unlinked SNPs is informative about demography, it cannot help you choose between certain demographic histories. This is not a question of imperfect knowledge of the site frequency spectrum (which more data would solve): as Myers et al formally show, for any particular demographic model there is a large family of demographic histories that give rise to the same site frequency spectrum. They explained: ‘Informally, changes in population size at some past time are canceled out by other changes in the opposite direction’. I think that this lack of information comes from the fact that each unlinked SNP only tells you about the placement of a single mutation on the genealogy of the population at that site, so across sites you learn only about the expected amount of time in different parts of the genealogy. By fluctuating the population size in just the right way, i.e. speeding up and slowing down the rate of coalescence, we can get very different population histories to give us the same expected coalescent times.

For example, Myers et al showed that the population size history below gives the same site frequency spectrum as a constant-size population.
Figure taken from Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.

That’s somewhat worrying as it says that when we fit a model of population size changes over time there are actually a family of quite different looking population histories that would give us just as good a fit to the data. However, these alternative histories do look quite strange and may not be biologically reasonable.

Anand and Yun’s article takes this as its starting place. They show that if we write our population history as a piecewise function, then the parameters of that function are identifiable, and they provide simple estimates of the sample size needed. You can read more about their results in their guest post here at Haldane’s sieve.

Thoughts on: Loss and Recovery of Genetic Diversity in Adapting Populations of HIV

This guest post is an exchange between Richard Neher and the authors of the preprint “Loss and Recovery of Genetic Diversity in Adapting Populations of HIV” (Pleuni Pennings, Sergey Kryazhimskiy, and John Wakeley). Below is the comment from Richard Neher, and then appended is the response from the authors of the study.

In this paper, the authors use a data set from a study of the anti-HIV drug efavirenz. This drug has a fairly stereotypic resistance profile which in most cases involves a mutation at amino acid 103 of the reverse transcriptase of HIV (K103N). The authors examine sequences from patients after treatment failure (drug resistant virus) and observe that in a large fraction of the cases, the drug resistance mutation K103N is present on multiple genetic backgrounds or in both of the possible codons for asparagine. This suggests frequent soft sweeps, i.e., evolution of drug resistance is not limited by the waiting time for a point mutation.

The observation of frequent soft sweeps allows one to put a lower bound on the product of population size and mutation rate. Since the mutation rate is on the order of 1e-5, the lower bound for the population size is around N > 1e5. The authors suggest that the fact that not all patients exhibit obvious soft sweeps can be used to deduce an upper bound on N. However, one has to realize that the patient sample is heterogeneous, that additional drugs are used along with efavirenz, and that most likely additional mutations have swept through the population. Multiple soft sweeps in rapid succession will look like hard sweeps. The lower bound makes a lot of sense and does away with a long-held erroneous belief that the “effective” HIV population within an infected individual is small.
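The logic of the bound can be made concrete with the Ewens sampling formula, which underlies the soft-sweep theory of Pennings and Hermisson: the probability that all n sampled resistant lineages trace back to a single mutational origin (i.e. the sweep looks “hard”) depends on θ = 2Nμ. A sketch with illustrative numbers (the sample size and population sizes below are mine, not the paper’s):

```python
from math import prod

def p_single_origin(n, theta):
    """Ewens sampling formula: probability that n sampled lineages
    carrying the beneficial allele all descend from one mutational
    origin, with theta = 2*N*mu. For n=2 this is 1/(1+theta)."""
    # (n-1)! * theta / [theta * (theta+1) * ... * (theta+n-1)]
    num = prod(range(1, n)) * theta
    den = prod(theta + i for i in range(n))
    return num / den

# Illustrative: mu ~ 1e-5 per site per generation.
for N in (1e3, 1e5, 1e7):
    theta = 2 * N * 1e-5
    p_soft = 1 - p_single_origin(10, theta)
    print(f"N = {N:.0e}: P(soft sweep visible in sample of 10) = {p_soft:.2f}")
```

Soft sweeps only become common once N·μ is no longer small, which is why routinely observing multiple origins of K103N bounds the population size from below.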

The debate about the size of the HIV population has some interesting history. In the mid 90s, it was estimated that roughly 1e7 cells are infected by HIV within a chronically infected patient every day. Virologists studying HIV evolution concluded that every point mutation is explored many times every day (see Coffin, Science, 1995), which was consistent with the frequent failure of mono-therapy, i.e., therapy with only one drug. Around the same time, it was observed that HIV sequences within a patient typically have a common ancestor about 3 years in the past, which translates into roughly 500-1000 generations. Population geneticists then started to convince people that this rapid coalescence corresponds to an “effective” population size on the order of a 1000, and that this explains the observed stochasticity of HIV evolution. Not everybody was convinced, and some went to great trouble to show that very rare alleles matter and that the population size is large: see for example this paper, in which failure of efavirenz therapy is studied in monkeys. Despite the fact that the resistance mutations were at frequencies below 1e-5 before treatment, both codons for asparagine at position 103 are observed a few days after treatment. Via a similar argument as in the above paper, the authors conclude that the population size is large.

There is very little reason to believe that coalescence in HIV is driven by short-term offspring number fluctuations (drift). Instead, the coalescence is most likely driven by selection, in which case the time scale of coalescence depends only weakly on the population size (see e.g. …).

The tendency of population geneticists to map everything onto a neutral model has, in the case of HIV, produced much confusion. This confusion would be easily avoided if people were willing to give up the concept of effective population size and simply call the time scale of coalescence what it is.

Response from Pennings et al.

Hi Richard,

Thanks a lot for your detailed comments.
We agree with most of your analysis, but let us explain why we believe that our estimate of the effective population size is not just a lower bound.
We use the observation of “soft” and “hard” sweeps to estimate the effective population size of the virus in a patient.
In 7 patients, a mixture of AAC and AAT alleles at the 103rd codon of RT replaced the wildtype, whereas in 10 patients either the AAT or the AAC allele replaced the wildtype.
When we only observe the AAC allele, it is still possible that this allele has several origins (the mutation from AAA to AAC occurred multiple times). This possibility is included in our analysis.
In addition, it is possible that originally, a mixture of AAC and AAT replaced the wildtype, but a subsequent sweep (at another site) removed one of the two alleles from the population (or reduced its frequency so that it doesn’t appear in the sample). You suggest that this process can explain our observation of hard sweeps.
We agree that this is a theoretical possibility, but we believe that our original interpretation is more parsimonious.

First, sweeps may be occurring regularly in all patients. In this case we would not expect any difference in diversity reduction between patients in which the last sweep happened at a drug resistance codon and patients in which another sweep was the last. Our data do not support this picture, because ongoing sweeps in all patients are not compatible with the significant and substantial reduction in diversity in the patients whose virus fixed a resistance mutation. Hence non-drug-resistance-related sweeps with a strong effect on diversity must be relatively rare in the viral populations.

We have plotted the reduction in diversity in intervals without the fixation of a resistance mutation and in intervals with the fixation of a resistance mutation. The 10 patients in which only the AAC or AAT allele was observed are highlighted in red. The reduction of diversity in these intervals is quite severe, and such severe reductions are not observed in the intervals without the fixation of a resistance mutation.


Second, it may be possible that, for some reason, the patients in which we see a hard sweep at site 103 actually had two or more sweeps (with the sweep at site 103 not being the last one), while patients in which we see a soft sweep had only one soft sweep at site 103. Then, indeed, the former set of patients would have a larger reduction in diversity than the latter, and this difference would NOT be due to the fact that only a single origin of the resistance mutation fixed in the former patients. One potential scenario in which this could happen is if the time intervals during which the sweep occurred are systematically longer in patients in which we observe hard sweeps. However, this is not the case; see figure.


Another scenario is a specific structure of epistasis among mutations in HIV: in particular, after the 103 mutation has fixed, another mutation or mutations become available which were not available before K103N swept. These could be compensatory mutations, for example. In this case, in all patients there was a soft sweep at site 103. Following that, in some patients the secondary mutation occurred and swept quickly, but in others it didn’t (just by chance). In those patients where it did occur and sweep, we see a larger reduction in diversity (including at site 103) due to this secondary sweep. However, this would mean that the populations are limited by the supply of this secondary mutation rather than the K103N mutation, which seems unlikely (especially considering that after the K103N mutation the population size would likely have gone up). Also, if this were the case, the mutations that lead to the second sweep must occur relatively far away from the K103N site, otherwise they would likely have been discovered.

Finally, it may be that what look like hard sweeps are indeed hard sweeps. We believe that this is, with our current knowledge, the most parsimonious explanation of our observations. Hence, the effective population size of the virus cannot be very large. This explanation is also compatible with the observation that resistance does not evolve in all patients.

Pleuni and Sergey (John is on vacation)

Thoughts on “Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes”

Our next guest post is by Pleuni Pennings [@pleunipennings] with her thoughts on:
Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes, Duncan Palmer, John Frater, Rodney Philips, Angela McLean, Gil McVean, arXived here


Last week, a group of people from Oxford University posted an interesting paper on the arXiv. The paper is about using genealogical data (from HIV sequences) in combination with cross-sectional data (on patient and HIV phenotypes) to infer rates of evolution in HIV.

My conclusion: the approach is very interesting, and it makes total sense to use genealogical data to improve the inference from cross-sectional data. In fact, it is quite surprising to me that inferring rates from cross-sectional data works at all. However, in a previous paper, (partly) the same authors showed that it is possible to infer rates from cross-sectional data only, and the estimates they get are very similar to the estimates from longitudinal data. The current paper provides a new and improved method, whose results are consistent with the previous paper’s.

The biological conclusion of the paper is that HIV adaptation is slower than many previous studies suggested. Case studies of fast evolution of the virus suffer from extreme publication bias and give the impression that evolution in HIV is always fast, whereas cross-sectional and longitudinal data show that evolution is often slow. Waiting times for CTL-escape and reversion are on the order of years.

1. What rates are they interested in?

The rates of interest here are the rate of escape from CTL pressure and the rate of reversion if there is no CTL pressure.

When someone is infected with HIV, the CTL response by the immune system of the patient can reduce the amount of virus in the patient. CTL stands for cytotoxic T lymphocytes. Which amino-acid sequences (epitopes) can be recognized by the host’s CTL response depends on the HLA genotype of the host.
Suppose I have a certain HLA genotype X, such that my CTLs can recognize virus with a specific sequence of about 9 amino acids, let’s call this sequence Y. To escape from the pressure of these CTLs, the virus can mutate sequence Y to sequence Y’. A virus with sequence Y’ is called an escape mutant. The host (patient) with HLA X is referred to as a “matched host” and hosts without HLA X are referred to as “unmatched.” The escape mutations are thought to be costly for the virus.
So, for each CTL epitope there are 4 possible combinations of host and virus:
1. matched host and wildtype virus (there is selection pressure on the virus to “escape”)
2. matched host and escape mutant virus
3. unmatched host and wildtype virus
4. unmatched host and escape mutant virus (there is selection pressure on the virus to revert)

The question is “how fast does the virus escape if it is in a matched host and how fast does it revert if it is in an unmatched host?”

2. Why do we want to know these rates?

First of all, out of simple curiosity, it is interesting to study how fast things evolve – it is surprising how little we know about rates of adaptive evolution. Secondly, escape rates are relevant for the success of a potential HIV vaccine: if escape rates are high, then vaccines will probably not be very successful.

3. What are cross-sectional data and how can we infer rates from them?

Cross-sectional data are snap-shots of the population, with information on hosts and their virus. Here, it is the number of matched and unmatched hosts with wildtype and escape virus at a given point in time.

So how do these data tell us what escape rates and reversion rates are? Intuitively, it is easy to see how very high or very low rates would shape the data. For example, if escape and reversion happened very fast, then the virus would always be perfectly adapted: we’d only find wildtype virus in unmatched hosts and only escape mutant virus in matched hosts. Conversely, if escape and reversion were extremely slow, then the fraction of escape mutant virus would not differ between matched and unmatched hosts: everyone would be infected with a random virus, and this would never change.
The real situation is somewhere in between: the fraction of escape mutant virus is higher in matched hosts than in unmatched hosts. With the help of a standard epidemiological SI-model (an ODE model) and an estimate of the age of the epidemic, the fraction of escape mutant virus in the two types of hosts translates into estimates of the rates of escape and reversion. In the earlier paper, this is exactly what the authors did, and the results make a lot of sense. Rates range from months to years, reversion is always slower than escape, and there are large differences between CTL epitopes. The results also matched well with data from longitudinal studies. In a longitudinal study, the patients are followed over time and evolution of the virus can be observed more directly. This is much more costly, but a much better way to estimate rates.
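The intuition can be turned into a toy calculation. If we ignore transmission dynamics entirely and treat escape and reversion as simple per-host rates (a deliberate simplification of the paper’s SI-model; the rate values below are invented for illustration), the escape-mutant fraction in each host class follows two one-way ODEs:

```python
# Toy within-host dynamics (NOT the paper's full epidemiological model):
# in matched hosts the escape fraction p grows, dp/dt = phi*(1-p);
# in unmatched hosts it decays, dp/dt = -psi*p.

def escape_fraction(p0, rate, t, matched, dt=0.01):
    """Euler-integrate the one-way escape (matched) or reversion
    (unmatched) dynamics for time t, starting from fraction p0."""
    p = p0
    for _ in range(int(t / dt)):
        dp = rate * (1 - p) if matched else -rate * p
        p += dp * dt
    return p

phi, psi = 0.5, 0.1   # per-year escape / reversion rates (illustrative)
p_matched = escape_fraction(0.2, phi, t=10, matched=True)
p_unmatched = escape_fraction(0.2, psi, t=10, matched=False)
# After 10 years matched hosts carry mostly escape virus and unmatched
# hosts mostly wildtype; comparing these two fractions in a single
# cross-sectional snapshot is what lets the rates be estimated.
print(round(p_matched, 2), round(p_unmatched, 2))
```

The real method additionally models transmission and the age of the epidemic, which is what converts observed fractions into calendar-time rates.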

4. Why are the estimates from cross-sectional data not good enough?

Unfortunately, the estimates from cross-sectional data are only point estimates, and maybe not very good ones. The problem is that the method (implicitly) assumes that each virus is independently derived from an ancestor at the beginning of the epidemic. For example, if there are a lot of escape mutant viruses in the dataset, then the estimated rate of escape will be high. However, the high number of escape mutant virus may be due to one or a few escape events early on in the epidemic that got transmitted to a lot of other patients. It is a classical case of non-independence of data. It could lead us to believe that we can have more confidence in the estimates than we should have.

5. Genealogical data to the rescue!

Fortunately, the authors have viral sequences that provide much more information than just whether or not the virus is an escape mutant. The sequences of the virus can inform us about the underlying genealogical tree and can tell us how non-independent the data really are (two escape mutants that are very close to each other in the tree are not very independent). The goal of the current paper is to use the genealogical data to get better estimates of the escape and reversion rates.

A large part of the paper deals with the nuts and bolts of how to combine all the data, but in essence, this is what they do: They first estimate the genealogical tree for the viruses of the patients for which they have data (while allowing for uncertainty in the estimated tree). Then they add information on the states of the tips (wildtype vs escape for the virus, and matched vs unmatched for the patient), and use the tree with the tip labels to estimate the rates. This seems to be a very useful new method, one that may give better estimates and a natural way to obtain credible intervals for them.

The results they obtain with the new method are similar to the previous results for three CTL epitopes and slower rates for one CTL epitope. The credible intervals are quite wide, which shows that the data (from 84 patients) really don’t contain a whole lot of information about the rates, possibly because the trees are rather star-shaped, due to the exponential growth of the epidemic. Interestingly, the fact that the tree is rather star-shaped could explain why the older approach (based only on cross-sectional data) worked quite well. However, this will not necessarily be the case for other datasets.

Question for the authors

Do you use the information about the specific escape mutations in the data? Surely not all sequences that are considered “escape mutants” carry exactly the same nucleotide changes? Whenever they carry different mutations, you know they must have arisen independently.

Thoughts on: Polygenic modeling with Bayesian sparse linear mixed models

[This post is a commentary by Alkes L. Price on “Polygenic modeling with Bayesian sparse linear mixed models” by Zhou, Carbonetto, and Stephens. The preprint is available on the arXiv here.]

Linear mixed models (LMMs) are widely used by geneticists, both for estimating the heritability explained by genotyped markers (h2g) and for phenotypic prediction (Best Linear Unbiased Prediction, BLUP); their application to computing association statistics is outside the focus of the current paper. LMMs assume that effect sizes are normally distributed, but this assumption may not hold in practice. Improved modeling of the distribution of effect sizes may lead to more precise estimates of h2g and more accurate phenotypic predictions.

Previous work (nicely summarized by the authors in Table 1) has used various mixture distributions to model effect sizes. In the current paper, the authors advocate a mixture of two normal distributions (with independently parametrized variances), and provide a prior distribution for the hyper-parameters of this mixture distribution. This approach has the advantage of generalizing LMM, so that the method produces results similar to LMM when the effect sizes roughly follow a normal distribution. Posterior estimates of the hyper-parameters and effect sizes are obtained via MCMC.
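As a concrete picture of the prior: each SNP effect is drawn from a mixture of two zero-mean normals, a small “polygenic” variance shared by all SNPs plus, with some probability π, an extra “sparse” variance. A sketch of drawing effects from such a prior (all parameter values here are invented for illustration; the actual method places hyper-priors on them and fits everything by MCMC):

```python
import random

def sample_effects(p, pi, sigma_poly, sigma_sparse, seed=1):
    """Draw p SNP effects from a BSLMM-style mixture prior:
    beta_i ~ N(0, sigma_poly^2)                  with prob 1 - pi,
    beta_i ~ N(0, sigma_poly^2 + sigma_sparse^2) with prob pi."""
    rng = random.Random(seed)
    effects = []
    for _ in range(p):
        sd = sigma_poly
        if rng.random() < pi:                    # a "large-effect" SNP
            sd = (sigma_poly**2 + sigma_sparse**2) ** 0.5
        effects.append(rng.gauss(0.0, sd))
    return effects

betas = sample_effects(p=10_000, pi=0.01, sigma_poly=0.01, sigma_sparse=0.5)
# With pi -> 0 this collapses to the all-normal (standard LMM) prior,
# which is why the method behaves like an LMM when effect sizes are
# roughly normally distributed.
print(sum(abs(b) > 0.1 for b in betas))  # a handful of large-effect SNPs
```

The attraction of this parametrization is exactly that collapse: the sparse component only “switches on” when the data support large-effect loci, as for the autoimmune diseases discussed below.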

The authors show via simulations and application to real phenotypes (e.g. WTCCC) that the method performs as well or better than other methods, both for estimating h2g and for predicting phenotype, under a range of genetic architectures. For diseases with large-effect loci (e.g. autoimmune diseases), results superior to LMM are obtained. When effect sizes are close to normally distributed, results are similar to LMM — and superior to a previous Bayesian method developed by the authors based on a mixture of normally distributed and zero effect sizes, with priors specifying a small mixing weight for non-zero effects.

Have methods for estimating h2g and building phenotypic predictions reached a stage of perfection that obviates the need for further research? The authors report a running time of 77 hours to analyze data from 3,925 individuals, so computational tractability on the much larger data sets of the future is a key area for possible improvement. I wonder whether it might be possible for a simpler method to achieve similar performance.

Alkes Price

Thoughts on: Finding the sources of missing heritability in a yeast cross

[This commentary post is by Joe Pickrell on Finding the sources of missing heritability in a yeast cross by Bloom et al., available from the arXiv here]

For decades, human geneticists have used twin studies and related methods to show that a considerable amount of the phenotypic variation in humans is driven by genetic variation (without any knowledge of the underlying loci). More recently, genome-wide association studies have made incredible progress in identifying specific genetic variants that are important in various traits. However, the loci identified in the latter studies often have small effects, and the sums of their effects rarely come close to the genetic effects known from the former. The difference between the genetic effects on a trait known from heritability studies and the effects estimated from individual loci has come to be known as the “missing heritability”, and much ink has been spilled on speculation as to its cause.

Bloom et al. take an elegant and straightforward approach to this question using a model system, the budding yeast Saccharomyces cerevisiae. The insight is that in order to make progress, one needs an experimental design that isolates some of the possible causes of the “missing heritability”. To achieve this, Bloom et al. use a cross between two different yeast strains and grow the segregants in identical conditions. They thus remove much of the environmental variation in the phenotypes, while also removing the effect of allele frequency (since all alleles are at frequency 0.5 in the cross). While this means that they cannot address some controversies about potential sources of heritability in humans (for example, rare versus common variants), they are able to estimate how much phenotypic variation is due to detectable additive effects. The authors additionally develop high-throughput assays to measure phenotypes (in their case, growth rates in 46 different conditions) and genotypes in the segregants from the cross, so that they can perform high-powered mapping studies.

The main result relevant to the issue of “missing heritability” is presented in their Figure 3a (reproduced above). After performing a well-powered mapping study, the authors compare the effects from their identified loci to the narrow-sense heritability of each trait. As it turns out, the heritability is not missing! In this cross, the identified loci, though of small effect, add up to a substantial fraction of the overall (narrow-sense) heritability for most traits. The authors additionally identify some gene-gene interactions that contribute to the broad-sense heritability (but, by definition, not the narrow-sense heritability) of many traits.

The authors provocatively interpret their results as supporting a model in which the majority of the “missing heritability” lies in a large number of variants with small effect sizes (in line with the model proposed most notably by Peter Visscher and colleagues, though the authors here make no claims about the allele frequencies of the relevant variants). While this seems to be true in yeast, it remains to be seen in humans. It’s of course easy to come up with reasons why this might not hold in humans–our species is a special snowflake and so forth–but this paper should be in the back of the mind of anyone who is thinking about this problem.

Thoughts on: The date of interbreeding between Neandertals and modern humans.

The following are my (Graham Coop, @graham_coop) brief thoughts on Sriram Sankararaman et al.’s arXived article: “The date of interbreeding between Neandertals and modern humans.”. You can read the authors’ guest post here, along with comments by Sriram and others.

Overall it’s a great article, so I thought I’d spend some time talking about the interpretation of the results. Please feel free to comment; our main reason for doing these posts is to facilitate early discussion of preprints.

The authors’ analysis relies on measuring the correlation along the genome between alleles that may have been inherited from the putative admixture event [so-called admixture LD]. The idea is that if there was in fact no admixture, and these alleles have just been inherited from the common ancestral population (>300kya), then these correlations should be very weak, as there has been plenty of time for recombination to break down the correlation between these markers. If there has been a single admixture event, the rate at which the correlation decays with the genetic distance between the markers is proportional to the admixture time [i.e. slower decay for a more recent event, as there is less time for recombination]. These ideas for testing for admixture have been around in the literature for some time [e.g. Machado et al]; it is the genome-wide application that is novel.

As you can tell from the title and abstract of the paper, the authors find pretty robust evidence that this curve decays more slowly than we’d expect if there had been no gene flow, and estimate this “admixture time” to be 37k-86k years ago. However, as the authors are careful to note in their discussion, this is not a definitive answer to whether modern humans and Neandertals interbred, nor is this number a definite time of admixture. Obviously the biological implications of the admixture result will get a lot of discussion, so I thought I’d instead spend a moment on these caveats. [This post has run long, so I’ll only get to the first point here, and perhaps return to write another post on the rest later.]
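The dating step itself is conceptually simple: for a single admixture pulse t generations ago, the correlation between introgressed alleles decays roughly as exp(-t·d) with genetic distance d in Morgans, so t can be read off as the slope of log-correlation against distance. A minimal sketch on synthetic data (the true decay constant, noise level, and the ~29-year generation time below are my assumptions for illustration, not the paper’s fitted values):

```python
import math, random

def fit_admixture_time(distances_M, correlations):
    """Least-squares fit of log(corr) = log(A) - t*d; returns t in
    generations (distances in Morgans)."""
    xs, ys = distances_M, [math.log(c) for c in correlations]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic decay curve from a pulse 2000 generations ago, with noise.
rng = random.Random(0)
t_true = 2000
ds = [i * 1e-4 for i in range(1, 51)]            # 0.01 to 0.5 cM, in Morgans
corrs = [0.8 * math.exp(-t_true * d) * math.exp(rng.gauss(0, 0.05))
         for d in ds]

t_hat = fit_admixture_time(ds, corrs)
years = t_hat * 29                               # ~29 years per generation
print(round(t_hat), round(years))                # recovers ~2000 generations
```

Note that the fit recovers the time of the admixture pulse but says nothing by itself about *which* population contributed the introgressed alleles, which is exactly the caveat discussed next.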

Okay so did Neandertals actually mate with humans?

The difficulty [as briefly discussed by the authors] is that we cannot know for sure from this analysis that the time estimated is the time of gene flow from Neandertals, and not from some [now extinct] population that is somewhat closer to Neandertals than any modern human population is.

Consider the figure below. We would like to say that the cartoon history on the left is true, where gene flow has happened directly from Neandertals into some subset of humans. The difficulty is that the same decay curve could be generated by the scenario on the right, where gene flow has occurred from some other population that shares more of its population history with Neandertals than any current day human population does.

Why is this? Well, allele frequency change that occurred on the red branch [e.g. due to genetic drift] means that the frequencies in population X and in Neandertals are correlated. This means that when we ask questions about correlations along the genome between alleles shared between Neandertals and humans, we are also asking questions about correlations along the genome between population X and modern humans. So under scenario B, I think the rate of decay of the correlation calculated in the paper is a function only of the admixture time of population X with Europeans, and so there may have been no direct admixture from Neandertals into Eurasians*.

First things first: that doesn’t diminish how interesting the result is. If the interpretation of the decay as a signal of admixture is correct, then it still means that Eurasians interbred with some ancient human population that was closer to Neandertals than other modern humans are. That seems pretty awesome, regardless of whether that population is Neandertals or some yet-undetermined group.

At this point you are likely saying: well, we know that Neandertals existed as a [somewhat] separate population/species; who is this population X you keep talking about, and where are its remains? Population X could easily be a subset of what we call Neandertals, in which case you’ve been reading all this for no reason [if you only want to know whether we interbred with Neandertals]. However, my view is that in the next decade of ancient human population history things are going to get really interesting. We have already seen this from the Denisovan papers [1,2], and from the work on ancient admixture in Africa (e.g. Hammer et al. 2011, Lachance et al. 2012). We will likely discover a bunch of cryptic, somewhat distinct ancient populations that we’ve previously [rightly] grouped under a relatively small number of labels based on their morphology and timing in the fossil record. We are not going to have names for many of these groups, but with large amounts of genomic data [ancient and modern] we are going to find all sorts of population structure. The question then becomes not an issue of naming these populations, but of understanding the divergence and population genetic relationships among them.

There’s a huge range of (likely more plausible) scenarios that are hybrids between A and B, and that I think would pose the same difficulties of interpretation. For example, ongoing low levels of gene flow from population X into the ancestral “population” of modern humans, consistent with us calling population X modern humans [see figure below, **]. But all of these scenarios likely involve something pretty interesting happening in the past 100,000 years, with some form of contact between Eurasians and a somewhat diverged population.

As I say, the authors, to their credit, take the time in the discussion to point out this caveat. I thought some clarification of why this is the case would be helpful. The tools to address this problem more thoroughly are under development by some of the authors of this paper [Patterson et al 2012] and by others [Lawson et al.]. These tools, along with more sequencing of ancient remains, will help clarify all of this. It is an exciting time for human population genomics!

* I think I’m right in saying that the intercept of the curve with zero is the only thing that changes between Fig 1A and Fig 1B.

** Note that in the case shown in Figure 2, I think Sriram et al are mostly dating the red arrow, not any of the earlier arrows. This is because they condition their subset of alleles to represent introgression into Europeans and to be at low frequency in Africa. We would likely not be able to date the deeper admixture arrow into the ancestor of Eurasians/Africans using the authors’ approach, as [I think] it relies on having a relatively non-admixed population to use as a control.