
Applications of Differential Privacy to European Privacy Law (GDPR) and Machine Learning

How differential privacy can protect your data, and help you comply with government regulations

Photo by Christina Morillo from Pexels

What is Differential Privacy?

Differential privacy is a data anonymization technique that is used by major technology companies such as Apple and Google. The goal of differential privacy is simple: allow data analysts to build accurate models without sacrificing the privacy of the individual data points.

But what does "sacrificing the privacy of the data points" mean? Well, suppose I have a dataset that contains information (age, gender, treatment, marital status, other medical conditions, etc.) about every person who was treated for breast cancer at Hospital X. In this case, a data point is a row of a spreadsheet (or DataFrame, or database, etc.) containing information about a person who was treated for breast cancer.

I want to build a machine learning model that predicts the odds of surviving breast cancer at Hospital X. What should I do if I want to preserve the privacy of the patients?

This is where differential privacy comes in. Differential privacy helps analysts build accurate models without sacrificing the privacy of individual data points by introducing randomness into the process of data retrieval.

In the framework of differential privacy, there are two actors: a curator and a data analyst. The curator owns all of the data, with all of the original (true) values, and the data analyst wants to interpret that data. The data analyst can "query" the database; that is, the data analyst can ask the curator for a subset of the data. Differential privacy doesn’t give any information away directly; instead, for every variable the data analyst is interested in, such as age, diagnosis, treatment, etc., they might get the true value, or they might not. The likelihood that the analyst actually receives the true value varies based on how much noise is introduced.

The more times you query the database, the less private the data becomes. There is a lot more complexity to differential privacy than I just covered, but this is the basic idea.
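To make this concrete, here is a minimal sketch in Python of randomized response, one of the simplest differentially private mechanisms. This is an illustration, not the exact mechanism any particular company uses; the function name, the `p_truth` parameter, and the toy data are all invented for the example.

```python
import random

def randomized_response(true_value: bool, p_truth: float = 0.75) -> bool:
    """Report the true value with probability p_truth; otherwise
    report a fair coin flip. No single reported answer reveals a
    data subject's true value with certainty."""
    if random.random() < p_truth:
        return true_value              # report the truth
    return random.random() < 0.5       # report a random answer instead

# The curator applies the mechanism to each row before releasing it.
true_answers = [True, False, True, True, False]   # e.g. "had complication X"
released = [randomized_response(answer) for answer in true_answers]
print(released)
```

The analyst can still estimate the population rate because the bias of the coin flips is known (the observed rate is roughly `p_truth * true_rate + (1 - p_truth) * 0.5`, which can be inverted), but repeatedly querying the same record averages the noise away, which is why the number of queries matters for privacy.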

More resources on Differential Privacy

  • I would encourage you to read this article if you want to learn more about differential privacy. It explains the concept at five-year-old, intermediate, and professional levels of understanding.
  • For a good in-depth technical explanation, read this article.

But for now, let us move on to a quick explanation of European privacy legislation.

Background on European Data Privacy Legislation, i.e. the GDPR

The General Data Protection Regulation, abbreviated as the GDPR, is a piece of data privacy legislation that applies to all companies based in or operating in the European Union. The GDPR outlines how companies are allowed to use, store, and process data about people. The goal of the legislation is to give people more control over what companies know about them, and to standardize those practices across Europe.

Key Terms

In order to talk coherently about the GDPR, we need to quickly define three terms: personal data, data subject, and processor. While we all know the colloquial definitions of these terms, the GDPR defines them precisely, so we want to be familiar with their definitions.

  • Personal data is any data that could possibly be related to a person, whether directly or indirectly. It could be an identifier for your phone, your literal fingerprint, or a record of a payment for gasoline. This definition is important because it covers most of the data that is collected, period. For better or worse, humans are kind of self-obsessed.
  • A data subject is a person, specifically a person that resides in the European Union. Personal data is collected about data subjects.
  • Finally, a processor is anyone (person, corporation, algorithm) that does anything with personal data. They could be storing it, analyzing it, changing it, even taking a quick selfie with it – if they do anything at all with personal data, they’re a processor.¹

And as they say, with great processing comes great responsibility.

Responsibilities of Companies (Processors)

The GDPR makes it very clear that the processor is completely responsible for ensuring and maintaining the privacy of the personal data it handles. Not only that, but the processor actually needs to care about the privacy of that data. The legislation requires companies to design their products and processes with privacy in mind, sets limits on how long the processor can hold data, and gives data subjects many rights concerning how their data is used, stored, and disclosed.

Privacy Goals of Individuals and Companies

In general, we can assume that the data subject wants their personal data to be maximally private while the processor wants the data to be maximally public.

The processor wants the data to be as public as possible because the fewer restrictions there are on the data, the easier it is for the processor to use. However, it’s important to note that the interests of the data subject and the processor are not entirely in opposition – after all, data subjects want the products and services they use to work as well as possible, and many are willing to give up some privacy in exchange for that.

Additionally, most companies are not evil by nature, and they care (at least a little bit) about maintaining the privacy of the people they provide services to, regardless of the law.

How does differential privacy relate to the GDPR?

You may have noticed that the GDPR and differential privacy share a word in common: privacy. Privacy is one of those words that you can come up with a million contradictory definitions for, but for now:

Privacy – the right of individual people to selectively disclose information about themselves to the world.

Combining all the terms we just introduced:

Differential privacy is a property of an algorithm (or set of algorithms) that processes data, used by one or more processors to protect the personal data of data subjects. In other words:

Differential privacy can be thought of as a technique that an algorithm uses for maintaining an individual’s GDPR-mandated privacy.

It’s important to understand that both the curator and the data analyst of the differential privacy world are, by definition, processors in the GDPR world, and that they could be either the same entity or different ones. For example, if my Android phone collects location data about me and then a Google engineer analyzes it, Google is both the curator and the data analyst. But if Google decides to give a third-party company like Tinder access to my data, then Google is the curator and Tinder is the data analyst. Both Google and Tinder are processors.

Advantages and Disadvantages of Using Differential Privacy, in terms of the GDPR

In this section, I’m going to go over the pros and cons of using differential privacy for the different stakeholders, specifically companies and users.

Benefits for Companies (Processors)

  • Differential privacy could give the processor significantly more freedom by transforming personal data into aggregate data. The rules governing aggregate data are far less strict than those governing personal data.
  • Differential privacy gives companies a way to directly quantify data privacy, increasing the ability of companies to prove that they are complying with the GDPR.

Benefits for Users (Data Subjects)

  • Differential privacy ensures that it is very difficult to recreate an individual data subject’s personal data, given enough noise and a small enough number of queries.
  • If the curator and the data analyst are different processors, then the curator could set a maximum number of queries to increase the privacy of the personal data in its possession (sketched below).
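As an illustration of that last point, a curator could track a total privacy budget and refuse to answer once it is spent. The sketch below is a toy Python example under my own assumptions (the class name, budget values, and the restriction to count queries are all illustrative); a real deployment would rely on a vetted library rather than hand-rolled noise.

```python
import numpy as np

class ToyCurator:
    """Toy curator: answers count queries with Laplace noise and
    stops answering once a total privacy budget is exhausted."""

    def __init__(self, rows, total_epsilon=1.0, epsilon_per_query=0.1):
        self.rows = rows
        self.remaining_budget = total_epsilon
        self.epsilon_per_query = epsilon_per_query
        self.rng = np.random.default_rng()

    def noisy_count(self, predicate):
        if self.remaining_budget < self.epsilon_per_query:
            raise RuntimeError("Privacy budget exhausted; no more queries allowed.")
        self.remaining_budget -= self.epsilon_per_query
        true_count = sum(1 for row in self.rows if predicate(row))
        # A count changes by at most 1 when one row changes (sensitivity 1),
        # so Laplace(1/epsilon) noise gives epsilon-differential privacy
        # for this single query.
        noise = self.rng.laplace(loc=0.0, scale=1.0 / self.epsilon_per_query)
        return true_count + noise

# With total_epsilon=0.5 and epsilon_per_query=0.1, the analyst gets at
# most five noisy answers before the curator refuses further queries.
curator = ToyCurator(rows=[{"survived": True}, {"survived": False}], total_epsilon=0.5)
print(curator.noisy_count(lambda row: row["survived"]))
```

This sketch uses basic composition, which simply adds up the epsilon spent per query; real systems often use tighter accounting, but the key point is that the budget is finite and the curator enforces it.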

Benefits for Both Companies and Users

  • Differential privacy could allow companies to build better models by increasing the generalizability of the data they already have. This could work because differential privacy forces the model to rely on the distribution of values in a dataset rather than the exact true values. This means they can have the best of both worlds – the data subject’s personal data is less compromised, and the data analyst/processor may end up with a better-generalizing model than it would have produced with the original data.
  • Companies could access large amounts of sensitive data for research without breaching privacy.
  • Differential privacy provides a mathematically provable guarantee of privacy protection against a wide range of privacy attacks. This means that the data is less susceptible to privacy attacks, which benefits both the processor and data subject.²

However, there are also definite disadvantages to using differential privacy related to security and quality of analysis.

Disadvantages of Using Differential Privacy

Security Challenges of Differential Privacy

  • Differential privacy is susceptible to a number of misuse problems, including: a) the company stores the original data insecurely despite using differential privacy during analysis, opening it up to a different class of privacy attacks; b) the data analyst queries the data too many times for differential privacy to effectively protect the data subjects; and c) the data analyst adds too little noise to sufficiently protect the privacy of the data subjects.

  • Differential privacy is also susceptible to the unintended inference problem: only a very small amount of accurate information is needed to precisely identify an individual, and since, statistically, many of the individual reported values will still be accurate, this presents a significant security problem.
  • Additionally, the successful use of differential privacy could inadvertently remove companies’ incentive to collect less data about people in the first place, because a seemingly secure process for analyzing that data exists. The less data that is collected, the more control the data subject retains over their personal data. The GDPR endeavors to encourage companies to collect less data about individuals overall, and a false sense of (computer) security generated by the use of differential privacy in analysis could work against that goal.

Additional Challenges When Using Differential Privacy

  • Differential privacy could lead to dangerously inaccurate models: because there is a direct tradeoff between privacy and accuracy, it is easy to produce models that are very private but hazardously inaccurate (see the sketch after this list).
  • It could be challenging for data analysts to work with data mediated by differential privacy. Many data analysts "describe their interactions with the data in fundamentally non-algorithmic terms, such as, ‘First, I look at the data,’" as Dwork and Roth noted. However, the authors added that "in general, it seems plausible that on high-dimensional and on internet-scale datasets non-algorithmic interactions will be the exception."³
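To see the privacy–accuracy tradeoff from the first bullet above in numbers, here is a small illustrative experiment; the dataset, bounds, and epsilon values are invented for demonstration. As epsilon shrinks (more privacy), the error of a Laplace-noised mean grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: ages of 1,000 patients, known to lie in [0, 100].
ages = rng.integers(20, 90, size=1_000).astype(float)
true_mean = ages.mean()

def dp_mean(values, epsilon, lower=0.0, upper=100.0):
    """Differentially private mean via the Laplace mechanism.
    The mean of n values bounded in [lower, upper] changes by at most
    (upper - lower) / n when one value changes, so we add
    Laplace(sensitivity / epsilon) noise."""
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

for eps in (1.0, 0.1, 0.01, 0.001):
    estimate = dp_mean(ages, eps)
    print(f"epsilon={eps:<6} estimate={estimate:8.2f} "
          f"error={abs(estimate - true_mean):8.2f}")
```

With epsilon around 1 the estimate is essentially exact for a dataset of this size, while at epsilon = 0.001 the noise can swamp the signal – exactly the kind of "very private but hazardously inaccurate" result the bullet warns about.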

Conclusion

Overall, I believe that the use of differential privacy within companies, particularly companies that share data with others, can be an excellent tool for both increasing privacy and complying with the GDPR, as long as it is used appropriately.

If you have any questions, comments, or want more resources about anything I discussed, please reach out.

References

[1] GDPR, Article 4 (2016), GDPR Info website

[2] A. Nguyen, Understanding Differential Privacy (2019), Towards Data Science

[3] C. Dwork and A. Roth, The Algorithmic Foundations of Differential Privacy (2014), p.254
