In a previous post, I went over the definitions of anonymization, de-identification, redaction, pseudonymization, and tokenization. It’s a good place to start if you don’t know the difference between those terms.

Most relevant here is the definition of anonymization: “[t]he process in which individually identifiable data is altered in such a way that it no longer can be related back to a given individual” (IAPP Glossary of Privacy Terms). In practice, anonymization really means that there is a *negligible* chance that the altered data can be related back to a given individual.

When I first started working with privacy enhancing technologies, I lived and breathed homomorphic encryption. I had tremendous mistrust of anonymization and thought anyone who believed otherwise simply did not know the facts. After all, there’s the famous quote “anonymized data isn’t” from Cynthia Dwork, known for her breakthrough research in differential privacy.

How many times have you seen headlines like the one below (Dark Daily)?

So anonymization doesn’t work, right?

I genuinely thought so. I believed the headlines and experts suggesting that anonymization was ineffective. As part of my PhD, I decided to look in detail at some of these re-identification attacks, at the data that had been re-identified, and at the percentage of records re-identified in the datasets:

“We found that patients can be re-identified, without decryption, through a process of linking the unencrypted parts of the record with known information about the individual such as medical procedures and year of birth” (Dark Daily)


“A 2016 study found a 42.8% risk of matching anesthesia records to the Texas Inpatient Public Use Data File for 2013 using data such as Age, Sex, Hospital, and Year.” (Re-Identification Risk in HIPAA De-Identified Datasets: The MVA Attack)


“[Sweeney] was then able to uniquely identify the medical record of Governor Weld. She did this by noting that only 6 voters in Cambridge had the same birth date as the governor, only 3 of those were male, and only one of those lived in the same ZIP code as him.” (Re-Identification Risk in HIPAA De-Identified Datasets: The MVA Attack)
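To make the linking process in these quotes concrete, here is a minimal sketch of a linkage attack. Everything in it is an assumption for illustration: the voter list, the medical records, the field names, and the ZIP codes are made up, and the attacks cited above worked on far larger files. The idea is simply to join a named public list against a "de-identified" release on the quasi-identifiers they share, and to keep the records that match exactly one person.

```python
from collections import defaultdict

# Hypothetical public voter list: names alongside quasi-identifiers.
voters = [
    {"name": "Voter A", "birth_date": "1950-01-01", "sex": "M", "zip": "00001"},
    {"name": "Voter B", "birth_date": "1950-01-01", "sex": "M", "zip": "00002"},
    {"name": "Voter C", "birth_date": "1950-01-01", "sex": "F", "zip": "00001"},
    {"name": "Voter D", "birth_date": "1962-03-15", "sex": "F", "zip": "00001"},
    {"name": "Voter E", "birth_date": "1962-03-15", "sex": "F", "zip": "00001"},
]

# Hypothetical "de-identified" medical records: names removed,
# quasi-identifiers left in place.
medical = [
    {"birth_date": "1950-01-01", "sex": "M", "zip": "00001", "diagnosis": "X"},
    {"birth_date": "1962-03-15", "sex": "F", "zip": "00001", "diagnosis": "Y"},
]

QUASI_IDENTIFIERS = ("birth_date", "sex", "zip")


def key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(medical_records, voter_list):
    """Return (name, medical record) pairs where exactly one voter matches."""
    names_by_key = defaultdict(list)
    for voter in voter_list:
        names_by_key[key(voter)].append(voter["name"])

    matches = []
    for record in medical_records:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # a unique match is a re-identification
            matches.append((candidates[0], record))
    return matches


reidentified = link(medical, voters)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy data the first medical record matches a single voter, while the second matches two voters and so stays ambiguous. No encryption is broken at any point; the quasi-identifiers do all the work.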

I found that the true problem with the datasets mentioned in these headlines is that they were never truly anonymous in the first place: they included quasi-identifiers (e.g., year of birth, age, sex, approximate location) which were published without regard for population and dataset statistics. So it’s not that fully anonymized datasets turned out to be easy to re-identify. The truth is that these datasets were mislabeled as anonymous. They never deserved that designation in the first place.

This is partially due to the learning curve the community went through to understand what needs to be done in order to completely anonymize a dataset. At first, it was believed that simply removing direct identifiers (full names, social security numbers, etc.) would do the trick. It took some stumbling to discover exactly how quasi-identifiers affect the likelihood of re-identifying an individual within a dataset.
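As a rough illustration of that effect, the sketch below measures what fraction of records become unique as more quasi-identifiers are published together. The records, column names, and the `unique_fraction` helper are all hypothetical; a real assessment would also need population statistics, not just the dataset itself.

```python
from collections import Counter

# Hypothetical records with direct identifiers already removed.
records = [
    {"birth_year": 1950, "sex": "M", "region": "North"},
    {"birth_year": 1950, "sex": "M", "region": "South"},
    {"birth_year": 1950, "sex": "F", "region": "North"},
    {"birth_year": 1950, "sex": "F", "region": "North"},
    {"birth_year": 1962, "sex": "M", "region": "North"},
    {"birth_year": 1962, "sex": "F", "region": "South"},
]


def unique_fraction(rows, columns):
    """Fraction of rows whose combination of values in `columns` is unique."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[c] for c in columns)] == 1)
    return unique / len(rows)


# Each step publishes one more quasi-identifier alongside the previous ones.
for columns in (
    ["birth_year"],
    ["birth_year", "sex"],
    ["birth_year", "sex", "region"],
):
    print(columns, round(unique_fraction(records, columns), 2))
```

With this toy data, no record is unique on birth year alone, a third are unique once sex is added, and two thirds are unique once region is added as well. None of the columns is a direct identifier, yet together they single people out.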

These points are confirmed in “A Systematic Review of Re-Identification Attacks on Health Data” by El Emam, K., Jonker, E., Arbuckle, L. and Malin, B., which covers misinterpretations of the severity or validity of re-identification attacks. In it, the authors point out that only one re-identification attack has been successfully performed on data truly anonymized according to existing standards (see K in the table below), and even this one attack had a re-identification risk of 2/15,000, well below HIPAA Safe Harbor’s acceptable risk threshold for re-identification of 0.04%.
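The comparison is easier to see once both numbers are in the same units. The short check below uses only the two figures quoted above (2 out of 15,000 records, and the 0.04% threshold); the variable names are mine.

```python
from fractions import Fraction

reidentified = 2
total_records = 15_000

# Observed re-identification risk for attack K, as an exact fraction.
observed_risk = Fraction(reidentified, total_records)

# 0.04% threshold as cited in the text, i.e. 4 in 10,000.
threshold = Fraction(4, 10_000)

print(f"observed risk: {float(observed_risk):.4%}")
print(f"threshold:     {float(threshold):.4%}")
print("below threshold:", observed_risk < threshold)
print("threshold / observed:", threshold / observed_risk)
```

In other words, 2/15,000 is roughly 0.013%, about a third of the 0.04% threshold.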

They summarize their findings in the following table:



Image: List of re-identification attacks over de-identified datasets. Only two out of the 14 attacks were on datasets that were properly anonymized. Only one of those attacks (K) has the re-identification verified. 2 out of 15,000 records were re-identified. Thank you to Professor Khaled El Emam for allowing me to reproduce this table.

Sources 57 and 58 from that table are hard to find online and are not listed in the University of Toronto’s library catalogue, so it is difficult to dig into the weakness that led to the re-identification of those 2 out of 15,000 records. For reference, those two papers are:

  • 57. Kwok P, Davern M, Hair E, Lafky D. Harder than you think: a case study of re-identification risk of HIPAA-compliant records. Chicago: NORC at The University of Chicago. Abstract #302255; 2011.
  • 58. Lafky D. The Safe Harbor method of de-identification: An empirical test. 2010. Fourth National HIPAA Summit West.

It’s thanks to re-identification attacks by researchers and journalists, as well as concepts like k-anonymity, l-diversity, and t-closeness, that the privacy community began to see anonymization through a different lens. These concepts account for the likelihood of re-identification given an individual’s quasi-identifiers, and for how those quasi-identifiers can be aggregated to minimize re-identification risk. A good paper to read in order to understand these three concepts, including their limitations and strengths, is the 2007 paper titled t-Closeness: Privacy Beyond k-anonymity and l-diversity, by Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian.
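For readers who prefer code to definitions, here is a small sketch of the first two concepts. It assumes a made-up table with two quasi-identifiers (`age`, `zip`) and one sensitive attribute (`diagnosis`), and a hand-picked generalization (ten-year age bands, three-digit ZIP prefixes). It uses the simplest "distinct values" reading of l-diversity and does not attempt t-closeness, so treat it as an illustration rather than a reference implementation.

```python
from collections import defaultdict

# Hypothetical records: age and zip are quasi-identifiers, diagnosis is sensitive.
records = [
    {"age": 23, "zip": "12301", "diagnosis": "flu"},
    {"age": 27, "zip": "12302", "diagnosis": "asthma"},
    {"age": 29, "zip": "12303", "diagnosis": "flu"},
    {"age": 34, "zip": "12304", "diagnosis": "diabetes"},
    {"age": 36, "zip": "12305", "diagnosis": "diabetes"},
    {"age": 38, "zip": "12306", "diagnosis": "diabetes"},
]

QUASI_IDENTIFIERS = ("age", "zip")


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifiers)].append(row)
    return classes


def k_anonymity(rows, quasi_identifiers):
    """k is the size of the smallest equivalence class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in classes.values())


def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """l is the smallest number of distinct sensitive values in any class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group}) for group in classes.values())


def generalize(row):
    """Coarsen age to a ten-year band and zip to its first three digits."""
    decade = row["age"] // 10 * 10
    return {
        "age": f"{decade}-{decade + 9}",
        "zip": row["zip"][:3] + "**",
        "diagnosis": row["diagnosis"],
    }


generalized = [generalize(row) for row in records]

print("raw:         k =", k_anonymity(records, QUASI_IDENTIFIERS),
      " l =", distinct_l_diversity(records, QUASI_IDENTIFIERS, "diagnosis"))
print("generalized: k =", k_anonymity(generalized, QUASI_IDENTIFIERS),
      " l =", distinct_l_diversity(generalized, QUASI_IDENTIFIERS, "diagnosis"))
```

On this toy table every raw record is unique (k = 1). After generalization each record shares its quasi-identifiers with two others (k = 3), but l is still 1, because everyone in the 30-39 band has the same diagnosis. That gap between k-anonymity and what an attacker can still learn is exactly the kind of limitation the paper above discusses.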

Anonymization methods have been used successfully on a number of clinical trial datasets with the purpose of sharing data for research. Four recent examples are cited in the 2015 paper titled Anonymising and sharing individual patient data, by Khaled El Emam, Sam Rodgers, and Bradley Malin.

In sum, anonymization is far from easy. It takes years of expertise and experience to even properly conceptualize how quasi-identifiers can affect re-identification risk. And the term is often completely misused by journalists and companies alike.


Thank you to John Stocks and Pieter Luitjens for their feedback on earlier drafts of this post.