Validity of covariance models for the analysis of geographical variation

Gilles Guillot, René Schilling, Emilio Porcu, Moreno Bevilacqua

(Submitted on 17 Nov 2013)

Due to the availability of large molecular data-sets, covariance models are increasingly used to describe the structure of genetic variation as an alternative to more heavily parametrised biological models. We focus here on a class of parametric covariance models that received sustained attention lately and show that the conditions under which they are valid mathematical models have been overlooked so far. We provide rigorous results for the construction of valid covariance models in this family. We also outline how to construct alternative covariance models for the analysis of geographical variation that are both mathematically well behaved and easily implementable.

Guillot et al bring up an important point here, one that we overlooked in the paper describing BEDASSLE (me, Ralph, with Bradburd and Coop). The points they brought up made me think more deeply about the issue, but don’t think I have the whole picture; hence, this comment. There are two basic issues: one basically technical, and the other more fundamental.

As general framework: say we have allele frequencies at N locations; the NxN matrix of distances between these is D; and we model the transformed allele frequencies as Gaussian with NxN covariance matrix Sigma, whose (i,j)th entry is a function of the corresponding distance D[i,j] and some parameters q to be estimated: Sigma[i,j] = F( D[i,j], q ).

First, if one is not careful, one could arrive at a set of estimated parameters that produces a matrix Sigma that is not positive definite. This doesn’t make any sense, and can wreak havoc with simulation procedures, likelihood computations, etcetera. In BEDASSLE this is not an issue because we prohibit any moves of the MCMC that produce such Sigma; effectively, this places zero prior mass on parameters q that give nonsensical Sigma. (but, this wasn’t stated in the paper, as it should have been)

Second, however, it might happen that you estimate parameters q with one sampling scheme (i.e. D), but then when you compute the covariance matrix for another set of distances from a different sampling scheme, you get an invalid covariance matrix. In other words, we’d like to imagine our allele frequencies as observations of some underlying continuous random field whose covariance function is given by F( . , q ); but such a random field might not exist. Guillot et al give a range of parameter values for which the functional form used by Wasser et al (and, BEDASSLE) is always valid for any matrix D of distances in flat Euclidean space and on the surface of the sphere.

But, this latter caveat is what I’m hung up on: habitats are not billiard tables or billiard balls, and so the “true” set of parameters could in principle fall outside the range that give valid covariance structures here. For instance, using the notation of the paper, alpha_2 should be between 0 and 2 for Euclidean space; but for great-circle distances on the surface of the sphere, alpha_2 should be no more than 1. Since real populations live on a bumpy portion of the surface of a large sphere, it is not clear at all if some alpha2 strictly larger than 1 might be appropriate. Furthermore, the appropriate notion of distance is not straight-line, but something related to “resistance distance”, with ecological distances thrown in, and (as Guillot et al note) it’s anyone’s guess how to determine the set of parameters that give valid covariance structures on such spaces.

In summary: the practical question is whether to restrict to parameters that give valid covariance structures on the sphere or not. It is my inclination to allow a wider range of parameters, and to take care when interpreting the results; but this wouldn’t be the appropriate thing for all applications.

[cross-posted to the bioRxiv]

Pingback: What we’re reading: Covariance in geographic variation, adaptation to altitude, and the ivory frat house | The Molecular Ecologist