Routes for breaching and protecting genetic privacy

Yaniv Erlich, Arvind Narayanan
(Submitted on 11 Oct 2013)

We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats to genetic privacy and discuss potential mitigation strategies for privacy-preserving dissemination of genetic data.

Our paper: The influence of relatives on the efficiency and error rate of familial searching

This guest post is by Rori Rohlfs on her paper (along with coauthors): Rohlfs et al. The influence of relatives on the efficiency and error rate of familial searching. arXived here.

One of the ways we in the U.S. (and elsewhere) are likely to encounter genetic technologies in our lives is through forensic DNA identification. I don't know of a specific count, but clearly a huge number of us encounter forensic uses of DNA: through court cases using genetic evidence (as survivors, defendants, jury members, etc.), through DNA sample seizure during a stop or arrest (currently being considered by the U.S. Supreme Court), or by being genetically related to someone in an offender or arrestee DNA database (>11 million profiles in the U.S. national database). Despite the social relevance of forensic uses of DNA, it seems to me that forensic genetics isn’t much discussed by the population and evolutionary genetics crowd these days.

A while back, I became interested in a newer forensic technique known as familial searching, particularly in how some pop gen assumptions affect outcomes. Familial searching is performed in cases where police have some DNA evidence from an unknown individual they want to identify but have no leads. First, they’ll search offender/arrestee DNA database(s) for someone with a matching genetic profile (which is verrrry unlikely between unrelated individuals with complete profiles), who they’d then investigate. In some jurisdictions (where familial searching is legal or practiced without explicit policy), if there’s no complete profile match, they’ll search the database again for a partially matching profile. The idea is that the partial match may be due to a close genetic relationship. (Of course, two unrelated individuals could reasonably have partially matching profiles by chance. More on that later.) Again, depending on the policies of the jurisdiction, the relatives of some number of partially matching individuals are investigated. In the most high-profile case of familial searching in the U.S., the suspected genetic relative was subject to surreptitious DNA collection (i.e. being followed until leaving a DNA sample (in that case, a pizza crust)). This sample was then tested directly against the original unknown sample; it showed a complete profile match, and a suspect was identified.

Because familial searching effectively extends offender/arrestee databases to the genetic kin of people in the databases, it raises important questions like:

For a population geneticist, attempting to identify unknown genetic relatives of individuals in the database (rather than the known individuals in the database) introduces more uncertainty, and some additional questions come up, like:

  • With the genetic information in forensic profiles (typically 13-15 autosomal STRs, sometimes with 17 Y-chromosome STRs), what’s the chance that an unrelated individual coincidentally has a partially matching profile resembling a genetic relative?
  • What background allele and haplotype frequencies are considered in profile likelihood calculations?
  • What statistical methodology will be used to identify [specific?  non-specific?] genetic relatives?
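As a rough illustration of the first question, here is a minimal sketch of how often two unrelated people would share at least one allele at every locus purely by chance. It assumes Hardy-Weinberg proportions, independent loci, and made-up allele frequencies (13 loci with 8 equally frequent alleles each); the function names are hypothetical, and this simple sharing rule is not the test used by any particular jurisdiction.

```python
from itertools import combinations_with_replacement


def genotype_probs(freqs):
    """Hardy-Weinberg probabilities for each unordered genotype at one locus."""
    probs = {}
    for a, b in combinations_with_replacement(range(len(freqs)), 2):
        probs[(a, b)] = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return probs


def p_share_allele(freqs):
    """Chance that two unrelated people share at least one allele at one locus."""
    genotypes = genotype_probs(freqs)
    return sum(
        p1 * p2
        for g1, p1 in genotypes.items()
        for g2, p2 in genotypes.items()
        if set(g1) & set(g2)  # at least one allele in common
    )


def p_share_all_loci(per_locus_freqs):
    """Chance of sharing at least one allele at every locus (loci independent)."""
    result = 1.0
    for freqs in per_locus_freqs:
        result *= p_share_allele(freqs)
    return result


# Hypothetical frequencies, NOT real STR data: 13 loci, 8 equally common alleles.
hypothetical_loci = [[1 / 8] * 8 for _ in range(13)]
print(p_share_all_loci(hypothetical_loci))
```

Real STR loci have uneven allele frequencies that differ between populations, which is exactly why the choice of background frequencies in the second question matters.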

All these questions are especially relevant when considering the intense multiple testing introduced by the relevant databases (>1.4 million profiles in the California offender database). It can be challenging to get a handle on these questions because of widely varying policies and methodology between jurisdictions. In New York City, it seems that an error-prone ‘allelic matching’ technique has been used to attempt to identify relatives in at least one case of robbery, leading to investigations of unrelated individuals. In California, by contrast, familial searching is used specifically in cold cases of violent crimes with a continuing threat to public safety, and in 2011 Myers et al. published the likelihood ratio-based test statistic and procedure used in practice.
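To give a feel for what a likelihood ratio-based statistic looks like, here is a minimal sketch of the textbook kinship likelihood ratio at autosomal loci, built from IBD-sharing coefficients. This is not the Myers et al. procedure: it ignores mutation, Y-chromosome haplotypes, and population substructure, and the names (`locus_lr`, `profile_lr`, `KAPPA`) and the allele frequencies in the example are assumptions made for illustration.

```python
from math import prod

# IBD coefficients (k0, k1, k2): chance that two people share 0, 1, or 2
# alleles identical by descent at an autosomal locus.
KAPPA = {
    "unrelated": (1.0, 0.0, 0.0),
    "parent_child": (0.0, 1.0, 0.0),
    "full_siblings": (0.25, 0.5, 0.25),
}


def genotype_prob(g, freqs):
    """Hardy-Weinberg probability of an unordered genotype (a, b)."""
    a, b = g
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]


def prob_given_shared(g, shared, freqs):
    """P(genotype g | one allele copies `shared`, the other is a random draw)."""
    a, b = g
    if a == b:
        return freqs[a] if shared == a else 0.0
    if shared == a:
        return freqs[b]
    if shared == b:
        return freqs[a]
    return 0.0


def locus_lr(g1, g2, freqs, kappa):
    """P(g2 | g1, claimed relationship) / P(g2 | unrelated) at one locus."""
    k0, k1, k2 = kappa
    p2 = genotype_prob(g2, freqs)
    # One allele IBD: either allele of g1 is equally likely to be the shared one.
    ibd1 = 0.5 * (prob_given_shared(g2, g1[0], freqs)
                  + prob_given_shared(g2, g1[1], freqs))
    # Two alleles IBD: the genotypes must be identical.
    ibd2 = 1.0 if sorted(g1) == sorted(g2) else 0.0
    return k0 + (k1 * ibd1 + k2 * ibd2) / p2


def profile_lr(profile1, profile2, freqs_by_locus, kappa):
    """Multiply per-locus ratios, assuming the loci are independent."""
    return prod(
        locus_lr(g1, g2, freqs, kappa)
        for g1, g2, freqs in zip(profile1, profile2, freqs_by_locus)
    )


# Hypothetical single-locus frequencies, not real STR data.
freqs = {"A": 0.1, "B": 0.2, "C": 0.7}
print(locus_lr(("A", "B"), ("A", "B"), freqs, KAPPA["full_siblings"]))
```

Because each per-locus ratio depends directly on the allele frequencies, a search has to commit to some background population, and a ratio above a chosen threshold is what flags a database profile as a candidate relative.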

When I arrived at U.C. Berkeley for my postdoc, I met Monty Slatkin and Yun Song who, along with Erin Murphy, had attempted to estimate some error rates of familial searching, but were stymied by a lack of well-described methods currently used in practice. When the statistical procedure used by California was published, we were excited to collaborate using practically relevant methodology. Specifically, we estimated the false positive rate and power of familial searching using the California state procedure. Generally, we found high power to detect a specified first-degree relationship (0.79 to 0.99) and low (but still substantial in a multiple testing context) false positive rates of calling unrelated individuals as first-degree relatives (<5e-9 to 1e-5). We got to thinking about more distant Y-chromosome-sharing relatives (half-siblings, cousins, second cousins) who (barring mutation) share Y-haplotypes and some portion of their autosomal STRs IBD. We estimated that these distant relatives could be mistaken for close relatives fairly often: in our simulations, 14-42% of half-sibs and 3-18% of first cousins were misidentified as siblings.

These rates are non-trivial, especially if you consider the size of databases and the fact that there are more distant relatives than near (so distant relatives are more likely to be present in databases). Further, some of these genetic relationships are not known (even to the individuals themselves) so are not useful to an investigation, but may still be interpreted as evidence of familial involvement, leading to investigation of uninvolved individuals. Lucky for us, our collaborator Erin Murphy has a background in law and thoughtfully outlined some of the practical ramifications in the introduction and discussion of our paper, not the least of which is how extended families and communities in groups which are over-represented in databases (perhaps most obviously African Americans and Latinos) would be disproportionately impacted by misidentification of distant relatives as near relatives.
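To put the multiple-testing point in numbers, here is a back-of-the-envelope sketch that scales a per-comparison false positive rate up to a whole database. It takes the figures quoted in this post (about 1.4 million California profiles, false positive rates up to 5e-9 and 1e-5) and assumes, as a simplification, that every profile is unrelated to the unknown source and that comparisons are independent. The function names are hypothetical.

```python
import math


def expected_false_positives(n_profiles, fpr):
    """Expected number of unrelated profiles wrongly flagged in one search."""
    return n_profiles * fpr


def prob_at_least_one(n_profiles, fpr):
    """Chance that one search flags at least one unrelated profile.

    Computes 1 - (1 - fpr)**n_profiles via log1p/expm1 to stay accurate
    when fpr is tiny.
    """
    return -math.expm1(n_profiles * math.log1p(-fpr))


database_size = 1_400_000  # approximate California offender database size
for fpr in (5e-9, 1e-5):
    print(
        fpr,
        expected_false_positives(database_size, fpr),
        prob_at_least_one(database_size, fpr),
    )
```

Under these assumptions, the upper-end rate of 1e-5 works out to roughly 14 expected false positives per search, while a rate of 5e-9 gives about 0.007. A per-comparison rate that looks negligible can still matter once it is multiplied by more than a million profiles.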

We hope that this interdisciplinary manuscript broadens sorely needed technical and policy discussions of familial searching.

The influence of relatives on the efficiency and error rate of familial searching

Rori V. Rohlfs, Erin Murphy, Yun S. Song, Montgomery Slatkin
(Submitted on 10 Apr 2013)

We investigate the consequences of adopting the criteria used by the state of California, as described by Myers et al. (2011), for conducting familial searches. We carried out a simulation study of randomly generated profiles of related and unrelated individuals with 13-locus CODIS genotypes and YFiler Y-chromosome haplotypes, on which the Myers protocol for relative identification was carried out. For Y-chromosome sharing first degree relatives, the Myers protocol has a high probability (80 – 99%) of identifying their relationship. For unrelated individuals, there is a low probability that an unrelated person in the database will be identified as a first-degree relative. For more distant Y-haplotype sharing relatives (half-siblings, first cousins, half-first cousins or second cousins) there is a substantial probability that the more distant relative will be incorrectly identified as a first-degree relative. For example, there is a 3 – 18% probability that a first cousin will be identified as a full sibling, with the probability depending on the population background. Although the California familial search policy is likely to identify a first degree relative if his profile is in the database, and it poses little risk of falsely identifying an unrelated individual in a database as a first-degree relative, there is a substantial risk of falsely identifying a more distant Y-haplotype sharing relative in the database as a first-degree relative, with the consequence that their immediate family may become the target for further investigation. This risk falls disproportionately on those ethnic groups that are currently overrepresented in state and federal databases.