A High-level Introduction to Differential Privacy

Striking a balance between privacy and utility

Ria Cheruvu
Towards Data Science



In the eyes of a data scientist, every moment of your life is a data point. From the brand of your toothpaste to the number of times you wave your hand, details that we often take for granted are crucial factors that can be used to infer our behavior and intentions. These insights, mined by multinational organizations, can be used to make aspects of our collective lives more convenient and interesting at the price of our private information being exposed or even exploited.

Consider Target, the second-largest department store retailer in the US, which in 2012 decided to implement a seemingly intuitive approach to tailoring customers’ shopping experiences to their specific needs. The goal was to figure out which customers were most likely to be expecting a child, and then send those customers maternity-related advertisements based on the calculated likelihood. We can interpret Target’s idea as a two-pronged approach: 1. Store or amalgamate data to analyze the buying trends of expecting customers. 2. Apply a technique or algorithm that correlates a new customer’s data points with the buying patterns of previous customers to estimate the likelihood that the person is expecting a child. Target’s initiative became the center of much discussion at the intersection of privacy and machine learning algorithms when Target sent maternity-related advertisements (mixed with advertisements from other departments) to a teen girl before her family knew she was expecting a child (Source: Forbes’ article “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”). These privacy concerns apply not just to data collection and storage for marketing purposes, but also to applications ranging from census data to social media.

Consequently, the advent of big data, machine learning, and data science necessitates a reconsideration of privacy.

Differential Privacy is considered one of the state-of-the-art concepts that can help us achieve this goal, and it is already being used by Apple, Uber, the US Census Bureau, and other organizations.

In this post, I’ll give a high-level overview of what Differential Privacy is, why it’s relevant, how it’s implemented, and how it’s used.

What is Differential Privacy?

The primary premise of Differential Privacy (DP) is to ensure that a data subject is not affected (e.g. not harmed) by their entry into or participation in a database, while maximizing the utility/accuracy of query results (as opposed to returning random or empty outputs).

DP guarantees that:

  • The raw data will not be viewed (and does not need to be modified).
  • Maintaining the subject’s privacy will be valued over mining important insights from data.
  • Resilience to post-processing: post-processing the output of a differentially private algorithm will not affect its differential privacy. In other words, a data analyst who does not have additional knowledge about the database cannot increase the privacy loss simply by reasoning about the output of the DP algorithm (Source: “The Algorithmic Foundations of Differential Privacy”).

In practice, these guarantees mean that if a data analyst/adversary queried two databases that differ by only one data entry, the probability of any particular result would be essentially unaffected by the presence, absence, or modification of that entry (the output distribution changes by at most a small multiplicative factor). In other words, when DP is used, the analyst or adversary cannot tell the two databases apart based on the output of the query.

Taken from Ian Goodfellow and Nicholas Papernot’s blog post “Privacy and machine learning: two unexpected allies”

However, DP does not protect an individual from harm or gain that follows from the conclusions mined from the data as a whole; how those conclusions affect the individual depends on his/her own perception of them. The formal (mathematical) definition of DP is given below:

Taken from Differential Privacy: A Survey of Results
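In words (paraphrasing that survey): a randomized mechanism K provides ε-differential privacy if, for every pair of databases D1 and D2 that differ in at most one element, and for every set S of possible outputs,

Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S].

The parameter ε, discussed further below, controls how close the two probabilities must be; this is the “multiplicative factor” mentioned above.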

Cynthia Dwork demonstrates DP with a simple example in the paper “The Promise of Differential Privacy: A Tutorial on Algorithmic Techniques”:

Data is being collected from individuals for a study (Mary, who is a smoker, is one of the study participants). The study leads to the conclusion that smoking has an extremely high likelihood of causing cancer. Differential Privacy guarantees that:

  • Data scientists or database managers analyzing the trends cannot directly access the raw data.
  • Services such as Mary’s insurance premiums will not change based on her participation in the dataset.

However, the utility of the study’s conclusive analysis (“Smoking can cause cancer”) for harm/good is left to Mary’s perception; Mary can choose to believe “Based on this study’s findings, I can stop smoking to prevent potential cancer”, the exact opposite of this statement, or something else.

Why is this important?

“Data is the pollution problem of the information age, and protecting privacy is the environmental challenge.” — Bruce Schneier

Ensuring privacy is crucial for numerous applications, from maintaining the integrity of sensitive information to eliminating the opportunity for adversaries to track users based on personally identifiable information (PII). DP is appealing for a variety of applications because it addresses the following privacy “needs”:

DP eliminates the methods a data analyst might otherwise use to distinguish a particular individual from other participants, or to associate particular behaviors with a previously identified individual in order to tailor content/services accordingly. Consequently, DP ensures that the privacy risk associated with participating in a study is not significantly increased (“Differential Privacy” by Cynthia Dwork).

This is because the raw data cannot be accessed directly, and because using findings unique to an individual to specifically target or influence that individual would violate the rule that the individual’s presence in the dataset should not impact their chances of receiving a service. Furthermore, you do not need to be a privacy or DP expert (i.e. have a deep understanding of the inner mechanics of DP) to use DP, and, as mentioned above, DP is resilient to post-processing. DP also extends to group privacy, although increasing the size of the group weakens the DP bounds (“Differential Privacy: A Survey of Results” by Cynthia Dwork).

We can now see that the example of Target’s prediction of maternity from buying behavior may be a violation of the differential privacy concept. In the case of the prediction of the teen girl’s pregnancy (as outlined above), Target supposedly correlated the individual’s buying trends with the patterns of other customers expecting a child and began tailoring advertisements to her specifically based on her participation in the dataset. If she had been removed from the dataset, Target would not have sent her maternity-related advertisements, which runs against the concept of differential privacy.

As an aside, it’s important to note that reasoning about ethical scenarios like these is not this simple. The goal of sales is to motivate a customer to buy a product by convincing the customer that he/she needs the item. By providing targeted advertisements to individuals who had bought items similar to those bought by customers who were expecting, Target may have violated differential privacy while using the data it collected to inform customers about potential benefits the company could offer and to tailor the shopping experience to their personal needs.

In addition, there are counterarguments to the stance I mentioned. One argument emphasizes that even if Target did use such an algorithm, it would not be completely accurate (no algorithm is) and would thus make many mistakes (i.e. false positives). Another perspective (explained in The Financial Times’ article “Big data: are we making a big mistake?”) states that everyone on Target’s mailing list receives maternity-related and other types of advertisements, not just customers who are expecting.

How does this work?

While DP is not an algorithm itself, to understand how it works it is helpful to think of an application of DP as a piece of software, a “guard”, that sits between the data analyst and the database:

Taken from Microsoft’s “Differential Privacy for Everyone”, page 4

Queries are the inputs passed by the user to the DP guard; an example query might be “Count the number of rows in the database”. There are many variants of differential privacy — this section addresses ε-differential privacy, which is distinct from (ε, δ)-differential privacy and other (weaker) versions of DP (please see “Our Data, Ourselves: Privacy via Distributed Noise Generation” if you would like to look into the other versions of DP).

One of the primary goals of DP is to balance privacy loss (measured by ε, for which a value, such as 0.01, needs to be chosen) against utility (which can be measured through many parameters, such as data accuracy). A value of ε = 0 gives complete privacy but no utility, whereas higher values of ε give weaker privacy but better utility. (Taken from “The Promise of Differential Privacy: A Tutorial on Algorithmic Techniques”.)
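To get a rough feel for this trade-off, consider the Laplace mechanism described below, where the noise added to a query answer has scale Δf/ε. For a simple counting query with sensitivity Δf = 1, ε = 1 means the noise is typically on the order of ±1, while ε = 0.01 means the noise is typically on the order of ±100; the smaller ε buys much stronger privacy at the cost of much less accurate answers.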

As described in Microsoft’s white paper “Differential Privacy for Everyone”, the DP technique takes the raw data as input, finds the answers from the original input data, and then introduces distortion based on a variety of factors (e.g. the size of the database, the type of query, etc.).

On query function f the privacy mechanism K responds with f(X) + (Lap(Δf/ε))^k adding noise with distribution Lap(Δf/ε) independently to each of the k components of f(X).

– Taken from “Differential Privacy: A Survey of Results” by Cynthia Dwork

In other words, the role of the DP “guard” is to add noise to the true answer f(X) of a query f, independently for each of the k components, yielding f(X) + (Lap(Δf/ε))^k; the output of the DP guard (the end result) is called the response.

Let’s break this down a bit.

Noise is added through the Laplace distribution (represented as Lap in the above equation), which is similar to the normal distribution/bell curve: both the normal and Laplace distributions are probability distributions that are unimodal (have one peak) and symmetrical (the left and right sides of the distribution are equally balanced). The primary difference is that the Laplace distribution has a sharper peak, as Statistics How To explains in their article.

A picture of the Laplace Distribution from Frank McSherry’s blog post “Differential Privacy for Dummies”
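As a concrete illustration, here is a minimal sketch (in Python, assuming NumPy is available) of the Laplace mechanism applied to a counting query; the function name and the example numbers are hypothetical, not taken from any of the cited papers:

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    # Add noise drawn from Lap(sensitivity / epsilon) to the true answer,
    # mirroring the f(X) + Lap(Δf/ε) formula quoted above.
    rng = np.random.default_rng() if rng is None else rng
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example query: "How many participants in the study are smokers?"
true_count = 38    # hypothetical raw answer, never released directly
response = laplace_mechanism(true_count, sensitivity=1, epsilon=0.1)
print(response)    # the noisy response, not the raw count, is what the analyst sees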

It is worth noting that other mechanisms can be used in place of the Laplace distribution. For example, while the Laplace Mechanism works for any function whose output is a real number, the Exponential Mechanism can be used in the opposite case (functions whose output is not a real number) and in situations where the addition of noise is not practical. The Exponential Mechanism works roughly as follows: given a dataset, produce an output/object using a utility function that measures the quality of each candidate output (Source: “Differential Privacy: A Survey of Results” and “Differential Privacy in the Wild: A Tutorial on Current Practices and Open Challenges”).
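For intuition, here is a minimal sketch of the Exponential Mechanism under common textbook assumptions; the function and the toy sales data are hypothetical, and a real deployment would need a carefully justified utility function and sensitivity bound:

import numpy as np

def exponential_mechanism(data, candidates, utility, sensitivity, epsilon, rng=None):
    # Score every candidate output, then sample one with probability
    # proportional to exp(epsilon * utility / (2 * sensitivity)).
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([utility(data, c) for c in candidates], dtype=float)
    # Subtract the max score before exponentiating for numerical stability.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    return rng.choice(candidates, p=weights / weights.sum())

# Toy example: privately pick the best-selling item, where utility = sales count
# and sensitivity = 1 (assuming each customer buys a given item at most once).
sales = {"toothpaste": 120, "shampoo": 95, "soap": 40}
winner = exponential_mechanism(sales, list(sales), lambda d, item: d[item],
                               sensitivity=1, epsilon=1.0)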

When using DP, we need to calculate the sensitivity of the query (the sensitivity is represented as Δf, and the query is represented as the function f):

Taken from Differential Privacy: A Survey of Results
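Paraphrasing that survey: the sensitivity Δf of a query f is the largest amount by which the answer can change when a single database entry is added, removed, or modified; formally, Δf = max ||f(D1) − f(D2)||₁ (an L1 norm), taken over all pairs of databases D1 and D2 differing in at most one element.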

The sensitivity of a query helps us understand how much an individual’s data can influence the result, and consequently how much noise needs to be added. The sensitivity of f is normally small, so in most scenarios the DP algorithm does not need to add much noise; for a fixed value of ε (privacy loss), a large sensitivity signals that more noise must be added to mask the data (Source: “Building Blocks of Privacy: Differentially Private Mechanisms” by Graham Cormode).
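For example, a counting query (“how many participants are smokers?”) has sensitivity 1, since adding or removing one person changes the count by at most 1, so noise of scale 1/ε suffices. A query like “what is the sum of participants’ incomes?” can have a sensitivity as large as the largest possible income, so either much more noise is needed or the values must first be clipped to a bounded range (a common practical workaround, not something prescribed by the papers cited here).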

An important point to note is that DP’s privacy guarantee weakens as more questions are asked. This brings us to the formal definitions of sequential and parallel composition:

“Sequential composition: If you perform two (or more) differentially private computations, they collectively provide differential privacy with parameter bounded by the sum of the parameters you used” (“Differential privacy: An illustrated primer”).

Based on the definition of sequential composition, if you ran two differentially private computations with parameters ε1 and ε2, the combined result would be (ε1 + ε2)-differentially private (taken from “Building Blocks of Privacy: Differentially Private Mechanisms” by Graham Cormode). Parallel composition, which applies when the computations run on disjoint parts of the data, involves “releasing each cell with ε-differential privacy” instead of summing the parameters as sequential composition does (as stated in “Building Blocks of Privacy: Differentially Private Mechanisms” by Graham Cormode). As stated in “Differential Privacy: A Survey of Results,” differential privacy can be ensured for any query sequence by running a differentially private algorithm with Laplace noise on each query. Another point mentioned in the survey is that sensitivity and privacy loss (the parameters needed for the addition of noise) are independent of the size of the database and of the database itself; a larger database therefore yields good accuracy for a differentially private algorithm on common queries. Further discussion of this point, details on how to maximize DP’s potential on insensitive queries, and the mathematical proofs/theorems that yield differential privacy can be found in the survey.
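Here is a minimal sketch of what sequential composition implies in practice, assuming the simple sum rule quoted above (real systems use more sophisticated accounting); the class and names are hypothetical:

class PrivacyBudget:
    # Track cumulative privacy loss under sequential composition:
    # total loss is bounded by the sum of the epsilons spent.
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("Privacy budget exhausted; no further queries should be answered.")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.3)   # first differentially private query
budget.spend(0.3)   # second query; cumulative privacy loss is now 0.6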

This section can be summarized with this theorem:

Taken from Differential Privacy: A Survey of Results

How is this being used?

  • The U.S. Census Bureau implemented DP in their OnTheMap application in 2008 to ensure privacy for residential population data. Read more details in the post “Protecting the Confidentiality of America’s Statistics: Adopting Modern Disclosure Avoidance Methods at the Census Bureau”, written by Dr. John M. Abowd.
  • Apple describes three use-cases of leveraging DP in their article “Learning with Privacy at Scale” in the Apple Machine Learning Journal: discovering popular emojis, identifying resource-intensive websites accessed using Safari, and discovering the use of new words. A novel aspect of Apple’s implementation is their use of three differentially private algorithms — Private Count Mean Sketch algorithm, Private Hadamard Count Mean Sketch, and Private Sequence Fragment Puzzle.
Taken from the article “Learning with Privacy at Scale” from Apple’s Machine Learning Journal
  • Microsoft applied DP to mask the location of individuals in their geolocation databases; the approach involved randomly removing, adding, and shuffling individual data points. A novel aspect of their implementation is the creation of PrivTree, which, given the original data and a few other parameters (the scale of Laplacian noise to be used, a threshold used to decide whether the splitting of a node should occur, etc.), can implement a differentially private algorithm and output noisy data for almost any kind of location data.
Pictures taken from Microsoft’s blog post “Project PrivTree: Blurring your “where” for location privacy”
  • Uber uses differential privacy as part of their data analysis pipeline and other development workflows. A novel aspect of their implementation is the use of Elastic Sensitivity, a technique for computing the sensitivity of a query that met Uber’s demanding performance and scalability requirements.
Taken from Uber Security’s Medium article “Uber Releases Open Source Project for Differential Privacy”
  • Google implemented DP as part of their Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) technology, which allows data analysts to study clients’ data without needing to look at the individual data points. A novel aspect of their implementation is that DP is applied in two important steps of RAPPOR’s execution: the permanent randomized response ensures privacy through the generated noise, and the instantaneous randomized response protects against an attacker observing the permanent randomized response. (A minimal sketch of the classic randomized-response idea that RAPPOR builds on appears after this list.)
Taken from the paper “RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response” by Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova
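The following is not Google’s production code; it is a minimal sketch of classic randomized response, the decades-old survey technique that RAPPOR builds on, using hypothetical data:

import numpy as np

def randomized_response(true_bit, rng):
    # With probability 1/2 report the truth; otherwise report a fair coin flip.
    # Any single report therefore reveals little about that user.
    if rng.random() < 0.5:
        return int(true_bit)
    return int(rng.integers(0, 2))

rng = np.random.default_rng(seed=0)
true_bits = rng.integers(0, 2, size=10_000)   # hypothetical sensitive bits, one per user
reports = np.array([randomized_response(b, rng) for b in true_bits])

# De-bias the aggregate: P(report = 1) = 0.5 * true_rate + 0.25,
# so the population rate can be estimated as 2 * mean(reports) - 0.5.
estimated_rate = 2 * reports.mean() - 0.5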

Additional Resources

This post only scratched the surface of differential privacy. There are plenty of great resources if you’d like to explore the inner mechanics and applications of DP further, starting with the papers and articles cited throughout this post.

Thank you for reading!

Originally published at demystifymachinelearning.wordpress.com on November 20, 2018.
