This guest post is by Benjamin Peter on his preprint Trees, Population Structure, F-statistics!
I began thinking about this paper more than a year ago, when Joe Pickrell and David Reich posted their perspective paper on human genetic history on biorxiv. In that paper, they presented a very critical perspective of the serial founder model, the model I happened to be working on at the time. Needless to say, my perspective on the use (and usefulness) of the model was, and still is, quite different.
Part of their argument was based on the usage of the F3-statistic, and the fact that it is negative for many human populations, indicating admixture. Now, at that time, I was familiar with the basic idea of the statistic and had convinced myself – following the algebraic argument in Patterson et al. (2012) – that it should be positive under models of no admixture. However, I still had many open questions that this paper did not answer. Why should we use F2 as a measure of genetic drift to begin with? Why does F3 have this positivity property? How robust is this to other structure models? The ‘path’-diagrams that Patterson et al. (2012) used personally did not help me, because I am not familiar with Feynman diagrams, and I did not understand how drift could have ‘opposite’ directions.
The other primary sources did not help me, partly because they are buried in supplements and repetitive. Unfortunately, I initially missed what I now find the most comprehensive resource – the Supplementary Material of Reich et al. (2009), which did not help my understanding. However at that time – early summer last year – I had a thesis to finish, and so the F-statistics left my mind.
I finished my Ph. D. in July, moved to Chicago in October 2014 and forgot about F-statistics in the meantime. When I started my postdoc, John Novembre proposed that I have a look at EEMS, a program one of Matthew’s former students, Desi Petkova, had developed to visualize migration patterns. Strikingly, Desi also used a matrix of squared difference in allele frequency, but she did so in a coalescence framework and for diploid samples, as opposed to the diffusion framework and population sample used for the F-statistics. However, the connection is immediately obvious, and it took only a few pages of algebra to figure out what is now Equation 5 in the paper; namely that F2 has a very easy interpretation under the coalescent.
This was a very useful result, and was what eventually made me decide to start writing a paper, and research the other issues I did not understand about F-statistics. It takes very little algebra (or some digging through supplementary materials) to figure out that F3 and F4 can be written in terms of F2. The interesting bit, however, is the form of these expressions – they immediately reminded me of quantities that are used in distance-based phylogenetics – the Gromov product and tree splits, and made it obvious, that the statistics should be interpreted in that context as tests of treeness, with admixture as the alternative model, and that F3 and F4 are just lengths of external and internal branches on a tree, and that the workings of the tests can be neatly explained using that phylogenetic theory.
Now, essentially a year later, I finished a version of my paper that I am comfortable with sharing. Because of my initial difficulties with the subject – and my suspicion I might not be the only one that only has a vague understanding of the statistics – I kept the first part as basic as possible, starting with how drift is measured as decay in heterozygosity, as increase in uncertainty or relatedness, then explore in depth the phylogenetic theory underlying the null model of the admixture tests, and briefly talk about the path interpretation of the admixture model. Only then I present my main result, the interpretation in terms of coalescent times and internal branch lengths, some small simulations as sanity checks and some applications and population structure models.
A big challenge has been to attribute ideas correctly, sometimes because sources were sometimes difficult to find, and sometimes because key ideas were only implicitly stated. So if parts are unclear, or if I misattributed anything, please let me know, and I am happy to fix it. Similarly, if there are parts of the manuscript that are hard to understand, please contact me, the aim of this paper is meant to serve both as an useful introduction to the topic, and to present some interesting results.
Thanks for the post and the preprint, Ben, I agree it’s a problem when we don’t get enough crosstalk between phylogenetic and population genetic theory and keep reinventing the wheel instead. Along similar lines, I think the D statistic is really a test of whether a dataset satisfies the four point condition from phylogenetics. This means that the D test can also be viewed as a simple application of phylogenetic invariant theory, as introduced by Cavender and Felsenstein (1987) and studied over the past few years in the context of algebraic statistics (e.g. Allman and Rhodes 2003).
I completely agree, Kelley. In fact, there is a simple proof of that in the paper that F4 != 0 implies that the 4PC is violated, and F4 has the same null hypothesis as D. Not novel results, but interesting nontheless.
I really like Daniel Huson’s 2005 paper on this (“Reconstruction of Reticulate Networks from Gene Trees”), and their phylogenetic approach to the D test. Great work here, too!
Pingback: Trees, Population Structure, F-statistics! | Benjamin Peter