The Landscape of Human STR Variation

Thomas F. Willems, Melissa Gymrek, Gareth Highnam, The 1000 Genomes Project, David Mittelman, Yaniv Erlich

Short Tandem Repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human population. Here, we report the largest-scale analysis of human STR variation to date. We collected information for nearly 700,000 STR loci across over 1,000 individuals in phase 1 of the 1000 Genomes Project. This process nearly saturated common STR variations. After employing a series of quality controls, we utilize this call set to analyze determinants of STR variation, assess the human reference genome's representation of STR alleles, find STR loci with common loss-of-function alleles, and obtain initial estimates of the linkage disequilibrium between STRs and common SNPs. Overall, these analyses further elucidate the scale of genetic variation beyond classical point mutations. The resource is publicly available at http://strcat.teamerlich.org/ both in raw format and via a graphical interface.

5 thoughts on “The Landscape of Human STR Variation”

  1. The authors used low-coverage next-gen sequencing data on the 1000 Genomes subjects to identify short tandem repeats (STRs) in the human genome, and to infer the genotypes of each subject at each STR. They use these data to characterize STR variation and the pattern of LD between SNPs and STRPs.

    This is a really interesting effort, but I am concerned about the quality of the genotype calls and the effect that genotyping errors have on the later inferences. Supp Table 4 is really important and deserves to be a proper table included with the main text, rather than left in the supplement.

    1. The authors make little comparison to what was previously known about STR variation. For example, I was surprised that there was no mention of Weber and Wong (Mutation of human short tandem repeats. Hum Mol Genet 2:1123-1128, 1993). I’m inclined to think that, while the detailed characterizations are extremely interesting, there are few if any real surprises here. Di- and tri-STRs are more variable than tetra-STRs and higher, variability increases with allele length, and STRs are in lower LD with surrounding markers, in comparison to SNPs. It would be nice to see more connections to previous understanding of STR variation.

    2. The comparison of calls between lobSTR and Marshfield data is really quite terrible. Most concerning is Supplemental Table 4: that 76% of Marshfield A/B calls were called by lobSTR as A/A. The authors present a number of arguments to justify the value of the data in light of this high error rate, but really if Supp Table 4 is reflective of the general pattern of errors, it’s hard to see that the later inferences can be at all trustworthy. For example, the point (pg 10) that 30% of STRs have a common polymorphism: couldn’t that be a gross underestimate if lots of alleles are being missed?

    3. Given the high rate of A/B -> A/A errors in the lobSTR calls, I’m rather surprised that the heterozygosities in Fig 2C and 3A are as high as they are. Am I missing something? If you look at Supp Table 4, the heterozygosity in the Marshfield calls is 3527/5164 = 68%, while for the lobSTR calls it is 609/5164 = 12%.

    4. There is considerable (perhaps half?) missing data in the lobSTR calls (top of pg 7 and Figure 1). I expect that these are not missing at random, but relate to the underlying genotypes? Or is it related to low coverage, which might be basically random? This is an important point deserving discussion, as if the data are not missing at random, this could be another source of bias in conclusions.

    5. The quality assessment of STR loci (pgs 7-9) contains some really strange analyses that the authors should reconsider.

    a. First, the business of “dosages”: adding the two alleles within an individual. I don’t see how this has any useful biological meaning. A “dosage” of 1+5=6 is close to a dosage of 3+3=6, but the actual calls are totally different.

    b. R^2 is an ill-chosen measure of concordance. It’s a measure of linear association, and doesn’t take into account whether the numbers are actually the same. And what we care about is proportion of mismatches. R^2=0.71 for autosomal genotypes is really terrible anyway.

    c. “Heterozygosity rates were significantly correlated.” This is rather silly. We shouldn’t be thinking, “Is the correlation non-zero?” Rather, the question is: how close is it to 1? p < 10^-30 makes it sound good, but really R=0.68 is terrible. And as in point b, we shouldn't be thinking about association, but rather are the heterozygosities actually the same? Look at the RMS difference, or the average absolute difference.

    d. Regarding potential bias towards shorter alleles: "However, only 3-4% of the STRs in our catalog have a reference allele exceeding the lengths of these loci. Therefore we do not expect this bias to affect the allelic spectra of most loci." (pg 9) But don't we expect that the reference alleles will be similarly biased?

    e. The difference in heterozygosity between African and non-African subjects (Fig 3a) is really quite subtle. That the difference is in the correct direction is taken as evidence that the genotypes are useful, but does the quantitative shift match expectation? Similarly, the ability to infer ancestry from the genotype calls indicates that there is some signal within the noise, but not that noise is at a tolerable level for the other analyses.

    f. "The experiments above suggest that valuable summary statistics can be extracted from the call set" To me, the evidence for this is quite weak.
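Points 5a to 5c above can be illustrated with a small sketch. The genotypes and heterozygosity values below are made-up toy numbers, not data from the manuscript, and the helper names (`dosage`, `pearson_r2`, `mismatch_rate`, `rms_difference`, `mean_abs_difference`) are hypothetical. The sketch only shows that a perfect R^2 on dosages or heterozygosities can coexist with calls that never match.

```python
from math import sqrt


def dosage(genotype):
    """Sum of the two allele lengths, the 'dosage' questioned in point 5a."""
    return sum(genotype)


def pearson_r2(xs, ys):
    """Squared Pearson correlation: measures linear association only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def mismatch_rate(calls_a, calls_b):
    """Proportion of unordered genotype calls that differ."""
    mismatches = sum(sorted(a) != sorted(b) for a, b in zip(calls_a, calls_b))
    return mismatches / len(calls_a)


def rms_difference(xs, ys):
    """Root-mean-square difference between paired values."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def mean_abs_difference(xs, ys):
    """Average absolute difference between paired values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Toy genotypes (assumed repeat counts): every heterozygote is miscalled
# as a homozygote with the same dosage, e.g. 1+5 -> 3+3.
truth = [(1, 5), (2, 6), (3, 7), (4, 8)]
calls = [(3, 3), (4, 4), (5, 5), (6, 6)]

r2_dosage = pearson_r2([dosage(g) for g in truth], [dosage(g) for g in calls])
mismatch = mismatch_rate(truth, calls)
print(r2_dosage, mismatch)  # dosage R^2 is 1.0 although every call is wrong

# Toy per-locus heterozygosities (assumed): perfectly correlated,
# yet the calls are systematically lower by 0.1 at every locus.
het_reference = [0.2, 0.4, 0.6, 0.8]
het_called = [0.1, 0.3, 0.5, 0.7]

r2_het = pearson_r2(het_reference, het_called)
rms = rms_difference(het_reference, het_called)
mad = mean_abs_difference(het_reference, het_called)
print(r2_het, rms, mad)
```

On real data one would report the mismatch proportion and the RMS or average absolute difference alongside any correlation, since only the former respond to a systematic shift.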

    6. The authors measured LD in terms of r^2. They might discuss the possibility that the SNP:SNP to SNP:STRP comparisons may have a calibration problem, as the STRPs are multi-allelic while SNPs are diallelic. How much would the LD results change if the STRPs were converted to diallelic markers, by taking common allele vs other alleles?
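The diallelic conversion suggested in point 6 could be sketched as follows. This is an illustration under stated assumptions, not the authors' method: it assumes phased haplotypes, codes the SNP as 0/1, and uses invented toy alleles; `collapse_to_diallelic` and `ld_r2` are hypothetical helper names. The r^2 here is the standard two-locus statistic D^2 / (pA(1-pA) pB(1-pB)).

```python
from collections import Counter


def collapse_to_diallelic(str_alleles):
    """Recode STR alleles as 0 (most common allele) or 1 (any other allele)."""
    most_common, _ = Counter(str_alleles).most_common(1)[0]
    return [0 if allele == most_common else 1 for allele in str_alleles]


def ld_r2(a, b):
    """r^2 between two diallelic markers coded 0/1 on phased haplotypes."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    if pa in (0.0, 1.0) or pb in (0.0, 1.0):
        raise ValueError("r^2 is undefined for a monomorphic marker")
    d = pab - pa * pb  # linkage disequilibrium coefficient D
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


# Toy phased haplotypes (assumed): one SNP allele and one STR repeat
# count per haplotype.
snp = [0, 0, 0, 0, 1, 1, 1, 1]
str_lengths = [10, 10, 10, 10, 11, 12, 11, 13]

str_diallelic = collapse_to_diallelic(str_lengths)
r2 = ld_r2(snp, str_diallelic)
print(str_diallelic, r2)
```

In this toy case the rarer STR alleles all sit on the SNP's alternate allele, so the collapsed marker is in complete LD with the SNP. Comparing such collapsed values against the multi-allelic r^2 would show how much of the SNP:STRP gap is a calibration effect.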

  2. Karl, thanks a lot for reading our manuscript and providing your comments. The leading author of the manuscript is abroad and I am also traveling next week. We will follow up, but please give us a few days.

