Thoughts and Theory

Anonymized Data Is Useless: Fact or Fiction?

Digging into unstructured data

Patricia Thaine
Towards Data Science
5 min read · Aug 17, 2021

Source: https://unsplash.com/photos/qm1zk-2oWD4

“When is anonymization useful?” is a tricky question, because the answer is highly data-type- and task-dependent. Anonymized datasets are being used for academic research, industrial research, and real-world products in numerous areas, with clinical research often at the vanguard due to the high level of sensitivity and utility of the data. A 2016 NIST presentation mentions several other use cases in which anonymized data are useful, including:

  • Improving driving solutions for directions and traffic data.
  • Pothole alerts.
  • Releasing educational records.
  • Voluntary safety reports submitted to the Federal Aviation Administration.

While there have been years of research on proper methods for structured data anonymization (especially in the medical domain), research on unstructured data anonymization is just starting to ramp up. In this post we’ll dive into the research happening in the speech, image/video, and text anonymization spaces.

Speech

For speech, anonymization means:

(1) Making a speaker’s voice unrecognizable (e.g., using the methodology proposed in Speaker Anonymization Using X-vector and Neural Waveform Models) and

(2) Removing direct identifiers and quasi-identifiers from the speech, either by bleeping them out or by replacing them (i.e., pseudonymization).

A quick reminder, in case you haven’t read “Demystifying De-identification” or “Data Anonymization: Perspectives from a Former Skeptic”: direct identifiers are entities that directly identify an individual (full name, exact location, social security number, etc.), while quasi-identifiers are entities that can identify an individual with exponentially increasing likelihood when combined (age, approximate location, spoken languages, etc.).
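To make the point about combining quasi-identifiers concrete, here is a minimal sketch that counts how many people in a toy dataset share the same combination of values. The records, the column names, and the `smallest_group` helper are all made up for illustration; the group size it reports is the "k" from k-anonymity, a standard measure for structured data rather than anything specific to this post.

```python
from collections import Counter

# Toy records, entirely made up for illustration.
records = [
    {"age": 34, "city": "Toronto", "language": "French"},
    {"age": 34, "city": "Toronto", "language": "English"},
    {"age": 34, "city": "Ottawa", "language": "English"},
    {"age": 34, "city": "Ottawa", "language": "English"},
    {"age": 51, "city": "Toronto", "language": "English"},
    {"age": 51, "city": "Toronto", "language": "English"},
    {"age": 51, "city": "Ottawa", "language": "English"},
    {"age": 51, "city": "Ottawa", "language": "English"},
]


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows that share the same values for
    the given quasi-identifier columns (hypothetical helper)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Each added quasi-identifier shrinks the crowd a person can hide in.
k_age = smallest_group(records, ["age"])
k_age_city = smallest_group(records, ["age", "city"])
k_all = smallest_group(records, ["age", "city", "language"])
print(k_age, k_age_city, k_all)
```

In this toy data, age alone leaves everyone in a group of four, age plus city leaves groups of two, and adding spoken language singles out one person.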

If speech technology and privacy is your thing, take a look at the VoicePrivacy initiative and at the ISCA Special Interest Group on Security and Privacy in Speech Communication, which brings together professionals from a variety of backgrounds (from signal processing to law) to discuss privacy in speech technologies.

Images & Video

Anonymization in images and video is a complicated task, given the variance in identifiable information. While fully and properly blurring out whole human bodies in the pictures might do the trick for certain constrained use cases, there can still be re-identification risk from name tags on backpacks, differentiated lunchboxes, a house in the background, etc. Nevertheless, anonymization for these media has often just meant removing or replacing faces (see, for instance, CIAGAN: Conditional Identity Anonymization Generative Adversarial Networks), which limits it to one part of the body rather than reducing re-identification risk to almost zero. This is a start, but considering that companies like Palantir Technologies can recognize people by their tattoos, removing or replacing a part of the body can often only really be called redaction, not anonymization.

That said, there are numerous machine learning tasks that make use of images and video without personal data, or in which personal data is superfluous and could be removed or replaced without detriment to the task.

Vehicle counting is one such task: just take the example provided in the vehicle counting GitHub repo credited below.

Image source: https://github.com/ahmetozlu/vehicle_counting_tensorflow (MIT license)

It’s clear that neither license plates nor people’s faces play a role in the task. And if we’re concerned about unique car colours being too telling, even a black and white video could do nicely, as would counting the vehicles on the edge (e.g., directly on the camera, before the data hits any servers).
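As a rough sketch of those two ideas, the snippet below drops colour from each frame and then emits only a count per frame, so the frames themselves never need to leave the device. It assumes frames arrive as rows of (R, G, B) tuples; `to_grayscale`, `count_on_edge`, and the stand-in detector are hypothetical names for illustration and are not taken from the linked repo. The grayscale weights are the standard ITU-R BT.601 luma coefficients.

```python
def to_grayscale(frame):
    """Convert rows of (R, G, B) tuples (0-255) to single luma values
    using the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in frame
    ]


def count_on_edge(frames, detect_vehicles):
    """Yield one vehicle count per frame. Only the counts are produced;
    the frames are neither stored nor transmitted."""
    for frame in frames:
        yield len(detect_vehicles(to_grayscale(frame)))


def fake_detector(gray_frame):
    """Stand-in for a real detector (assumption for the demo only):
    pretends every very bright pixel is one vehicle."""
    return [value for row in gray_frame for value in row if value > 200]


# A tiny 2x2 "frame": red, blue, white, and black pixels.
frame = [[(255, 0, 0), (0, 0, 255)], [(255, 255, 255), (0, 0, 0)]]
gray = to_grayscale(frame)
counts = list(count_on_edge([frame], fake_detector))
print(gray, counts)
```

A real deployment would swap `fake_detector` for an actual object detection model; the point of the design is only that the output of the pipeline is a number, not footage.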

Take this other image as an example:

Source: https://unsplash.com/photos/EJnhyLTbPH8

What can you detect in the image?

  • Number of houses
  • Type of farmland
  • Relief
  • Weather

There is a lot of information available about the terrain and there are plenty of similar images that can be used for determining ecosystem health, whether there are weeds growing in a crop, etc. Not to mention that lots of anonymous video feeds can be used as a partial training set for self-driving cars.

Text

Last but not least, let’s consider text anonymization. There has been some initial research on re-identification risk scores for text, including our work in the Journal of Data Protection and Privacy titled Reasoning about unstructured data de-identification (email me if you have a hard time accessing the paper). While proper anonymization for the purposes of data publication has required an expert to go over the data and calculate the risk of re-identification, we can say that automatically redacting text has a huge role to play in greatly improving data security through data minimization (i.e., reducing the amount of personal data you collect to just the essentials). Note that there have been tests conducted on the effectiveness of statistical and rule-based systems to automatically de-identify medical text corpora (three of these studies are summarized here). These tests have to be redone to account for the vast improvements in statistical natural language processing systems over the past three years.
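To give a feel for what a rule-based redactor looks like, here is a deliberately small sketch using Python's standard `re` module. The patterns, the `redact` function, and the `known_names` list are assumptions made for this illustration; they are not the systems evaluated in the studies above, and a production system would rely on far more robust entity recognition.

```python
import re

# Simple illustrative patterns; real-world text needs far broader coverage.
TIME_RE = re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text, known_names=()):
    """Replace email addresses, clock times, and listed names with
    placeholder tags (hypothetical helper)."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = TIME_RE.sub("[TIME]", text)
    for name in known_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    return text


message = (
    "Hi Alex, I'm booked at 2pm tomorrow, but 3:30 pm would work. "
    "Reach me at sam@example.com. Thanks, Sam"
)
redacted = redact(message, known_names=["Alex", "Sam"])
print(redacted)
```

Rules like these are easy to audit but miss anything they were not written for, which is part of why the evaluations mentioned above matter.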

Anecdotally, let me give you a quick example of how much information a single anonymized email can carry:

“Hi [NAME],

Apologies, it had ended up in my spam!

I’m booked at [TIME] tomorrow, but [TIME] would work. I’ll send an updated invite for that time. Please let me know if that doesn’t work for you.

Thank you,

[NAME]”

Any idea who wrote that? It’s impossible to know unless you were the recipient or author.

But what useful information can you gather from this email?

  • A call is being rescheduled to tomorrow
  • The sender is polite (says please and thank you)
  • The recipient’s previous email ended up in a spam folder when it shouldn’t have!

What can, say, an email service provider use this information for? Well, it would be great if they could make sure that this recipient’s emails never end up in a spam folder again.
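A toy sketch of how such signals could be pulled out of the redacted email follows. The keyword rules and the `extract_signals` function are assumptions made purely for illustration; a real provider would more likely use trained classifiers than string matching.

```python
redacted_email = """Hi [NAME],

Apologies, it had ended up in my spam!

I'm booked at [TIME] tomorrow, but [TIME] would work. I'll send an updated
invite for that time. Please let me know if that doesn't work for you.

Thank you,

[NAME]"""


def extract_signals(text):
    """Derive a few coarse signals from already-redacted text using
    naive keyword checks (hypothetical helper)."""
    lowered = text.lower()
    return {
        "mentions_spam_folder": "spam" in lowered,
        "is_polite": "please" in lowered and "thank you" in lowered,
        "is_rescheduling": "[time]" in lowered and "invite" in lowered,
    }


signals = extract_signals(redacted_email)
print(signals)
```

None of these checks needs the names or times that were removed, which is the point: the useful signal survives redaction.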

I have many more examples like this, from identifying how a person feels about a particular product to determining which topics were covered in a conversation or what a consumer’s sentiment was over a chat or phone call.

Anonymous Data Is

It has taken time and ample research for the community to gain a greater understanding of what it means for data to be anonymous and useful. The process of iterating over a technology and understanding its limitations was felt in the field of cryptography just as it is in differential privacy and in anonymization. We no longer use DES to encrypt our data, but rather AES. And chances are that, in the next decade, we will have to rely more on lattice-based cryptography rather than RSA. As we find limitations in a technology, we do not throw the baby out with the bathwater, but rather look to gain a deeper understanding of what went wrong, innovate upon it, and make it stronger, more useful, and more accessible.

Acknowledgements
Thank you to John Stocks and Pieter Luitjens for their feedback on earlier drafts of this post.

Co-Founder and CEO of Private AI (www.private-ai.ca). PhD Candidate in Comp Sci at the University of Toronto working on privacy/security applications for NLP.