LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies

Brendan Bulik-Sullivan, Po-Ru Loh, Hilary Finucane, Stephan Ripke, Jian Yang, Schizophrenia Working Group Psychiatric Genomics Consortium, Nick Patterson, Mark J Daly, Alkes L Price, Benjamin M Neale

Both polygenicity (i.e. many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield inflated distributions of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from bias and true signal from polygenicity. We have developed an approach that quantifies the contributions of each by examining the relationship between test statistics and linkage disequilibrium (LD). We term this approach LD Score regression. LD Score regression provides an upper bound on the contribution of confounding bias to the observed inflation in test statistics and can be used to estimate a more powerful correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of test statistic inflation in many GWAS of large sample size.

I’ve only had a quick read of the article so far, but it seems like an interesting approach.

In the online methods (page 10) there is a discussion of the relationship between LD and genetic drift, which is arguing that the LD score is apriori uncorrelated with drift (e.g. because LD doesn’t appear in the variance in allele frequencies due to drift). I might be missing something but it seems like this argument is not quite right. The rate of drift does not depend on LD, however, LD is created by genetic drift. As you know, different loci have different realizations of the process of genetic drift, with some loci having larger deviations in allele frequencies by chance due to drift that others. Loci that by chance have large allele frequency differences will have more LD than those that have smaller deviations. So apriori I do expect population differentiation to be correlated with LD across the genome. It may be that this is not a big effect, and it may already be included in your simulations (e.g, as they are based on real data). Have I missed something?

Another (potentially) minor point, on page 10 you state that selection likely isn’t a concern because there is only a “small contribution from selection”. However, the two references you cite are concerned by detecting large allele frequency changes between populations. I don’t think they rule out that background selection (or other linked selection) plays a role in increasing the rate of drift (in fact Hernandez et al shows that background selection is systematically lowering pi around genes). Once again this may not be big effect here, and perhaps this is already factored into your simulations.

I’m looking forward to delving into the paper more, is the supplement posted somewhere?

Thanks for your comments, Graham. I’ve responded point-by-point below. I already mentioned this to you on twitter, but for other readers, the supplement is already available on the biorxiv (follow the link above then click on “data supplements”)

TL;DR

(1,2) estimating LD Score using an external reference panel and a 1 centiMorgan (cM) window solves a lot of problems with population structure and population stratification.

(3) In GWAS with pure population stratification (e.g., if phenotype is ancestry), lambda_GC will approximately equal the LD Score regression intercept, and both will tend to be a few percent less than mean chi-square, for reasons to do with natural selection.

Long version:

(1) re: adjusting for population structure when estimating LD Score — we looked into this and it actually winds up not making very much of a difference, at least for estimating LD Score from the 1000 Genomes Europeans. An even simpler alternative to regressing population structure out of your sample would be to compute LD in each subpopulation in the reference panel separately then take the mean. We tried this, and it turns out to give almost exactly the same result as estimating LD Score using all the 1000 Genomes Europeans naively lumped together (R^2 > 99%). The explanation is that for small values of FST (e.g., within Europe) the amount of LD created by population structure per SNP is quite small. The key is to estimate LD Score using a window size that is big enough to capture almost all real LD but small enough that (# of SNPs in window) * (LD from population structure per SNP) makes only a small contribution to your LD Score estimates. A 1cM window seems to do the job.

(2) re: drift creating LD — NB this is addressed perhaps a little more clearly in sections 2.2-2.4 of the supplementary note. I think the issue here is that the term LD is overloaded. There is an important distinction between genome-wide LD in the target (for GWAS) sample and short-range LD in the reference (for estimating LD Score) sample. Suppose you draw a structured target sample. For the reasons you describe (also ref section 2.2 of the supplementary note), there will be genome-wide LD in this target sample (i.e., LD between SNPs on different chromosomes). If you estimate LD Scores using genotypes from your target sample and an infinite window size (i.e., you include LD between SNPs on different chromosomes), then use these LD Scores to estimate the inflation in test statistics from confounding via LD Score regression, your estimate will be biased downwards, because the most differentiated SNPs will have the most genome-wide LD *in your structured sample*. Genome-wide LD in a structured sample is *not* a useful quantity for estimating the inflation in test statistics due to population stratification in that sample. We get around this problem by using LD Scores estimated with a 1cM window in a separate reference panel for LD Score regression. Modulo linked selection, the magnitude of allele frequency differences due to drift in your target sample should not be correlated with short-range LD in the reference sample.

(3) re: background selection/hitchhiking — results from our simulations with pure population stratification (using real genotypes from various European control cohorts from the schizophrenia GWAS) are presented in supplementary tables 1-3. The take-home points are (a) lambda_GC is approximately equal to the LD Score regression intercept and (b) both lambda_GC and the LD Score regression intercept are slightly less than the mean chi-square. I believe that this ordering of statistics can be explained by selection: mean chi-square will get pulled up by big outlier allele frequency differences from selection; lambda_GC will be not be so affected by outliers, because it is a median; and the LD Score regression intercept could be pushed slightly downwards by correlation between the background selection coefficient and LD Score. Based off these simulation results, there is probably some correlation between LD Score and the background selection coefficient (as one would expect), but this correlation cannot be large, else the LD Score regression intercept would perform very badly instead of being ~95% of the way to the mean chi-square. However, I agree with you that it would improve the paper if we were to directly examine the correlation between the background selection coefficient and LD Score, rather than simply concluding that the LD Score regression intercept performs about the same as lambda_GC. We will do this for the next draft. Thanks for the suggestion.

Thanks Brendan. I’ll give this some thought. The McVicker et al B values offers a way to use for looking for correlations with background-selection measures.

Initial thought: regarding your point 2. I was not so much worried about genome-wide LD created among populations by drift, but rather the fact that SNPs that drift more (by chance) will have higher LD with linked (i.e. local) SNPs within populations (i.e. higher LD scores). This seems to potentially not be a problem given you simulations, perhaps due to the low levels of drift, but it might be nice to flesh out the argument given in the online methods (and supplement) a bit more.

I also wonder (having not given it much thought) whether trying to remove the effect of population structure from LD, before computing the LD score, might be helpful. E.g. these two papers http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3499835/ and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3282397/ look into this. I’ve not read them closely but I think they are just regressing structure out of allele freq. before calcing LD. This could potentially lead to a cleaner signal. Not sure if this makes sense or just complicates the issue.

Pingback: Most viewed on Haldane’s Sieve: March 2014 | Haldane's Sieve

Thanks to the authors for posting this preprint, it seems quite useful. A few quick comments:

1. I still don’t have a great intuition for why this works–ie. why there is a correlation between LD score and the chi^2 statistic under a polygenic model but not under a stratification model. Specifically, you write that “the more genetic variation an index variant tags, the higher the probability that this index variant will tag a causal variant”. This makes sense, but I feel like I could just as easily write “the more genetic variation an index variant tags, the higher the probability that this index variant will tag a variant that is stratified between the cases and controls for non-causal reasons”. Apparently this isn’t an issue in the simulations, but I wonder if this is generally true, or just true when doing GWAS in closely-related human populations.

2. I’m having a bit of trouble following the math in the supplementary note, though this may be because I’m generally terrible at following math written by other people. For example, the set of steps starting at Eq. 1.3 are totally opaque to me. In the second step there, what are the dimensions of the vectors being multiplied? I feel like something is missing, but again this is probably just me.

Can you show the derivation of the bias in your estimator of r^2 (Eq 1.6)?

re: Eq 1.3 — \hat{beta}_j is a real number; X is NxM X_j is Nx1. \phi is Nx1, so Var[\phi] is NxN. The third line of Eq 1.3 follows from substituting X\beta + \epsilon for \phi.

re: Eq 1.6 — the take home point here is that the squared sample correlation coefficient in a sample of size N is biased upwards by about 1/N if the true correlation coefficient is zero, and less if the true correlation coefficient is not zero. The “less if the true correlation coefficient is not zero” part winds up not mattering for the derivation; it makes only a very small difference and is ignored after Eq 1.8.

The sentence you point out in your point (1) is really a statement about the model of phenotype and the model of drift described in the supplement, so it’s hard to provide a counterargument without either waving my hands or pointing to the derivation of Eq 2.14.

Thanks, think I’ve worked through it now. Key seems to be that changes in allele frequency over time should be evenly distributed across SNPs with different LD scores (where LD scores are calculated in the ancestral population). I suspect this will not be true as a general principle, since changes in allele frequency can depend on things like background selection, etc. But this is probably a small effect in a well-designed association study.

Bunch of neat ideas in the paper, thanks again for posting.

Pingback: Most viewed on Haldane’s Sieve: April 2014 | Haldane's Sieve

Pingback: Sifting through 2014 on Haldane’s Sieve | Haldane's Sieve