De-identification is a lie

Published on April 27, 2022

De-identification of personal data (sometimes called anonymization) is commonly used as cover for sharing and transferring extremely sensitive data: everything from the time you told your doctor you had symptoms of depression and anxiety during COVID to that STD test you once had. The most sensitive facts and test results you share with your physician or psychiatrist, or enter into online apps, could end up shared, packaged and sold.

The problem is that de-identification is a lie. Re-identifying nearly any record is remarkably trivial, so much so that there really is no such thing as ‘de-identifying’ or ‘anonymizing’ personal data. As John Oliver noted on the April 10 episode of Last Week Tonight, a vast, poorly regulated industry has grown up to exploit this reality and, in turn, each of us. This ‘gray market’ earns, or extracts, depending on your view, tens of billions each year by selling and using our most sensitive medical and personal data. This is data they didn’t create and, ethically, should have no claim on.

That this happens may surprise you. That it is not only allowed but encouraged by regulations such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) may surprise you even more. In fairness, HIPAA was written before the era of the modern internet, genomics and big data, when de-identification was assumed, however naively, to impart anonymity and provide solid privacy protection. Under HIPAA's Safe Harbor method, data stripped of 18 types of identifiers, such as name and Social Security number, can be freely shared. Once ‘de-identified’ in this way, the data is no longer covered under HIPAA: “De-identified health information created following these methods is no longer protected by the Privacy Rule because it does not fall within the definition of PHI.” This means de-identified data can be copied, shared and even bought and sold. Witness Truveta, a startup that the very largest hospital systems in the country have helped create to monetize patient data. That the participating hospital systems know this data to be fully re-identifiable is not something they advertise. As health data broker efforts such as this expand, a large proportion of Americans will see their data sold by entities beyond their control.

That de-identified data is problematic is widely known. As far back as 2000, just four years after HIPAA became law, it was already known that 87% of Americans can be uniquely identified by just their zip code, gender and date of birth. And it gets worse from here. A recent study using machine learning found that a staggering “99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes”, even in incomplete and de-identified datasets. Keep in mind that a medical record may have dozens of demographic attributes. The authors go on to note that “even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.”
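To make the zip code, gender and date of birth point concrete, here is a minimal Python sketch of a linkage attack. Everything in it is an assumption made for illustration: the records, the names, the "public roster" and the helper functions are synthetic and hypothetical, not taken from the studies cited above. It only shows the mechanism: once a combination of ordinary attributes is unique, a second dataset that carries names can be joined back to the "de-identified" one.

```python
from collections import Counter

# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
deidentified = [
    {"zip": "02139", "gender": "F", "dob": "1961-07-14", "diagnosis": "depression"},
    {"zip": "02139", "gender": "M", "dob": "1985-03-02", "diagnosis": "asthma"},
    {"zip": "02139", "gender": "M", "dob": "1985-03-02", "diagnosis": "diabetes"},
    {"zip": "94110", "gender": "F", "dob": "1990-11-30", "diagnosis": "anxiety"},
]

# Hypothetical public roster (for example, a voter list) that includes names.
public_roster = [
    {"name": "Alice Example", "zip": "02139", "gender": "F", "dob": "1961-07-14"},
    {"name": "Bob Example", "zip": "02139", "gender": "M", "dob": "1985-03-02"},
    {"name": "Carol Example", "zip": "94110", "gender": "F", "dob": "1990-11-30"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "dob")


def key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[f] for f in QUASI_IDENTIFIERS)


def unique_fraction(records):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = Counter(key(r) for r in records)
    unique = sum(1 for r in records if counts[key(r)] == 1)
    return unique / len(records)


def link(records, roster):
    """Attach a name to every record that matches exactly one roster entry
    and is itself unique on the quasi-identifiers."""
    counts = Counter(key(r) for r in records)
    names_by_key = {}
    for person in roster:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = []
    for r in records:
        names = names_by_key.get(key(r), [])
        if counts[key(r)] == 1 and len(names) == 1:
            matches.append((names[0], r["diagnosis"]))
    return matches


print(unique_fraction(deidentified))  # 0.5
print(link(deidentified, public_roster))
# [('Alice Example', 'depression'), ('Carol Example', 'anxiety')]
```

In this toy dataset only half of the records are unique on three attributes. Adding more attributes to `QUASI_IDENTIFIERS` can only split groups further, which is why richer records are easier, not harder, to re-identify.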

There are, or at least there were, good reasons why sharing de-identified data was allowed as a loophole in HIPAA as well as in other privacy regulations such as the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). The de-identification of protected health information enables HIPAA covered entities to share health data for large-scale medical research studies, policy assessments, comparative effectiveness studies, and other studies and assessments without violating the privacy of patients or requiring authorizations to be obtained from each patient before data is disclosed. In the early days, shared data was mostly sparse claims data. But increasingly the data is much deeper: everything from excerpts of psychiatric reports to cancer pathology, the medications you are taking and genomic data. The lie of de-identification is a slap in the face if one considers genomic data to be de-identifiable. Your genome, after all, is you. It is a unique instruction set that describes your physical characteristics and even predicts your risk of developing diseases such as various neuropsychiatric disorders or cancers.

What can be done if we are to balance the use of data to drive research and find new cures with respect for the basic human right to medical privacy? Actually, a lot, given recent advances in technology. The very technologies that make re-identification so trivial can also be used to enforce privacy and prevent endless duplication of data. In decades past, sharing data was the only way for partners to use it. That simply is not true any longer. Just as Apple uses machine learning to analyze and use your personal data on your iPhone without ever aggregating it at Apple, any data can now be used in place without transfer. And if data is used in place by algorithms in a double-blind manner, de-identification may not even be necessary in the future. That means faster discovery using better data at the source.

This solution is known as federated data use, and it should be applied to health data of all forms, whether identified or identifiable (aka ‘de-identified’). We can fix data sharing by using data without ever sharing it at all. It works like this:

1) Store the data where it lives: in hospitals, under their control and governance, or on the patient's own devices. Hospitals and patients have a relationship involving revocable consent, and data is necessary for treatment. Just as you may share things with a lawyer without fear of exposure, you should be able to share with your doctor without fear that anything you say may end up in hundreds of locations, used for purposes you neither know about nor approved.

2) Use the data by distributing code, algorithms and AI to the data rather than moving the data to the code.

This technology already exists; it just isn’t being applied because it requires rethinking the paradigm of data use. As William Gibson said, “The future has arrived; it’s just not evenly distributed yet.”
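The two steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of any particular federated platform: the `Site` class, the `min_cohort` threshold, the `systolic_bp` field and the sample values are all hypothetical. The point is only the direction of travel: the query goes to each site, and only aggregates come back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]
Aggregate = Dict[str, float]


@dataclass
class Site:
    """A hypothetical data holder (e.g. a hospital). Records never leave it."""
    name: str
    records: List[Record] = field(repr=False)
    min_cohort: int = 3  # assumed policy: refuse queries over very small groups

    def run(self, query: Callable[[List[Record]], Aggregate]) -> Optional[Aggregate]:
        """Run visiting code locally and return only its aggregate result."""
        if len(self.records) < self.min_cohort:
            return None  # too few patients; an aggregate could expose individuals
        return query(self.records)


def sum_and_count(records: List[Record]) -> Aggregate:
    """The code that travels to the data: returns a count and a sum, no rows."""
    return {"n": float(len(records)), "total": sum(r["systolic_bp"] for r in records)}


def federated_mean(sites: List[Site]) -> Optional[float]:
    """Combine per-site aggregates into a global mean without pooling records."""
    results = [site.run(sum_and_count) for site in sites]
    results = [r for r in results if r is not None]
    n = sum(r["n"] for r in results)
    total = sum(r["total"] for r in results)
    return total / n if n else None


sites = [
    Site("hospital_a", [{"systolic_bp": v} for v in (120, 130, 140)]),
    Site("hospital_b", [{"systolic_bp": v} for v in (110, 150, 125, 135)]),
    Site("clinic_c", [{"systolic_bp": 200}]),  # below min_cohort, so it declines
]

print(federated_mean(sites))  # 130.0
```

A real deployment would need far more than this sketch, in particular controls on which code a site agrees to run, since an arbitrary query could simply return the raw records. The design choice it illustrates is that the coordinator only ever sees per-site totals and counts.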

As technology that allows data to be used in situ gains traction, HIPAA, GDPR and state-level privacy laws may move to close the egregious loophole that permits the transfer of fully re-identifiable data. This needn’t be restricted to hospital patient data either. The same mechanisms could be applied to all holders of personal data. And once data is not transferred, revocable consent becomes a reality. Ethical data sharing means not sharing the data at all in the old-fashioned sense. Ethical data use means bringing the code to the data and no longer exposing private and sensitive data to the world. The technology already exists if people demand it and if regulators choose to act.


