Navigating the Data Privacy Maze: Tips for Data Scientists

Katharine Jarmul
Towards Data Science
8 min read · Aug 2, 2018


Determining how to best secure data and preserve privacy is tough. If you are a data scientist working with potentially sensitive data, that means figuring out which data you can use and how best to protect it. This can lead you to throw up your hands in defeat, giving up completely and opting to use raw data, which in turn exposes your data analysis and models to information leakage and makes them vulnerable to attack.

Our goal at KIProtect is to try and make this process easier for you — how can you spend time focusing on the data science but still ensure your data is protected? I gave a recent talk at PyData Edinburgh on just this — which inspired me to write up this post (full slides on GitHub).

We’ll begin by outlining your plight: you are a data scientist focused on analyzing or building models with customer data. You know privacy and security are important (in theory), but you haven’t spent much time thinking about security and you generally leave that to your IT team. That said, you know you are managing customer data and you want to Do The Right Thing.

So how do you begin? We’ll explore your travels via a flowchart, because — well — they are fun. And very informative, of course.

Your first choice is about your data. The best policy is to not use any sensitive data to begin with. Let’s start there:

So, you may be saying — well YES I need the data! Obviously! But I really want you to think about it. So, I’ll ask again — this time in a slightly different way.

Because… well, I really want you to think about it. If you didn’t already have the data, would you actually need it enough to go through the hassle of collecting it? A lot of times it might be *nice* to have some extra data or variables, but you don’t really need them, or you can’t make the case for collecting them.

Now, I get a lot of questions at this step involving the unknown. It goes something like this: “but kjam, I just don’t know yet if I need it! What if it turns out later that the last character of a person’s first name really does determine their creditworthiness?” (We can get into a whole lot of conversations on ethics of this type of feature engineering, but I will leave that to another piece).

What I am asking you to do is treat your initial analysis like a baseline, the same way you would test out a new model or technique. First, establish a baseline: a data analysis run on the data that poses the least privacy risk. Minimal exposure, if you will. I promise you can always add data back in later; depending on what you are building or using the analysis for, though, it can be tricky to remove data later. The optimal outcome is establishing a low-privacy-risk baseline and finding out that, yes, your data has enough information while still protecting privacy. So the case for removing sensitive data is… far less risk!
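If you want to see what such a baseline comparison might look like in code, here is a minimal sketch: train on the low-risk features first, then check how much the sensitive columns actually add. The dataset, column names, and model here are hypothetical, purely to illustrate the workflow.

```python
# Hypothetical sketch: compare a privacy-friendly baseline model against one
# that also uses sensitive attributes. Column names are made up for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("customers.csv")  # hypothetical dataset

low_risk_features = ["num_purchases", "days_since_signup", "avg_order_value"]
sensitive_features = ["age", "postal_code_region", "gender"]
target = "churned"

def score(feature_cols):
    """Cross-validated accuracy for a simple model on the given columns."""
    X = pd.get_dummies(df[feature_cols], drop_first=True)
    y = df[target]
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

baseline = score(low_risk_features)
with_sensitive = score(low_risk_features + sensitive_features)
print(f"baseline (low-risk only): {baseline:.3f}")
print(f"with sensitive columns:   {with_sensitive:.3f}")
# If the lift from the sensitive columns is negligible, leave them out.
```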

Let’s move onto the next step of our flowchart — let’s say you determine you really do need the private data. Let’s think about ways we can protect it!

Hopefully, you don’t need clearly private information like people’s names, addresses or identification numbers. Perhaps you want to derive some features from these attributes (say, geographic location, gender or level of education). Whenever possible, try removing them first and see what your privacy baseline says about the other, less sensitive features. But if you’ve determined you need them, move along to the next step of our flowchart…

Perhaps you must retain sensitive data and you need to use true values (e.g. postal code, city or level of education). For this data, determine whether k-anonymity can help preserve privacy. The concept was first presented by Dr. Latanya Sweeney in her paper k-anonymity: A model for protecting privacy. She proposed the approach after demonstrating that Massachusetts Governor William Weld’s “anonymized” hospital records could be re-identified by linking them with voter registration records (this was after he claimed the health record release would be fully anonymized).

K-anonymity protects against identification attacks by creating buckets in which any single person is represented within a group of at least k individuals. This means you are essentially trying to zoom out of the data: using city instead of postal code, or an age range (35–45) instead of the actual age. This allows for some plausible deniability (it wasn’t my record exactly; it could be any of the other k−1 individuals in my group!). When looking into k-anonymity, it is important to also review l-diversity and t-closeness, which allow for enhanced privacy, particularly if your groups have fairly imbalanced targets or features (e.g. all individuals in a group share the same target variable, so knowing the group exposes the target).
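To make this concrete, here is a rough sketch of generalizing quasi-identifiers and then checking that every resulting group contains at least k records. The column names and dataset are hypothetical, and this is not a production-grade anonymizer.

```python
# Hypothetical sketch: generalize quasi-identifiers, then verify k-anonymity.
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical dataset
K = 5

# Generalize: replace precise values with coarser buckets ("zoom out").
df["age_range"] = pd.cut(df["age"], bins=[0, 25, 35, 45, 55, 120],
                         labels=["<25", "25-34", "35-44", "45-54", "55+"])
df["region"] = df["postal_code"].str[:2]  # keep only a coarse prefix

quasi_identifiers = ["age_range", "region", "gender"]
group_sizes = df.groupby(quasi_identifiers, observed=True).size()

if (group_sizes < K).any():
    print("Not k-anonymous yet; these groups are too small:")
    print(group_sizes[group_sizes < K])
else:
    print(f"Every quasi-identifier group has at least {K} records.")
```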

If you don’t need to retain the sensitive variables, using homomorphic pseudonymization or synthetic data would allow you to retain valid values in your dataset without exposing the true values. Synthetic data is usually fictitious data which is still valid, so a fictitious name replaces a real name, and so on. This is usually not reversible; when it is reversible, it is often referred to as “tokenization” and requires a very large lookup table in which names or other data are mapped to different values and can be mapped back. Of course, this means the real names are stored in a big lookup table, which is not a great idea in terms of security and which compounds in size and complexity as your input space grows.
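For a quick sense of what synthetic replacement can look like, here is a toy sketch using the Faker library; the dataset and column names are hypothetical.

```python
# Hypothetical sketch: replace real identifying values with fictitious but
# valid-looking ones using the Faker library (pip install faker).
import pandas as pd
from faker import Faker

fake = Faker()
df = pd.read_csv("customers.csv")  # hypothetical dataset

# One fictitious replacement per distinct real value keeps the column
# internally consistent without exposing the true names or emails.
name_map = {name: fake.name() for name in df["full_name"].unique()}
email_map = {email: fake.email() for email in df["email"].unique()}

df["full_name"] = df["full_name"].map(name_map)
df["email"] = df["email"].map(email_map)

# Note: keeping name_map/email_map around turns this into "tokenization",
# i.e. a reversible lookup table that itself becomes sensitive data to protect.
```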

At KIProtect, we have released our homomorphic pseudonymization API, which allows you to retain certain validity and structural properties of your data while still preserving privacy. Because the method is cryptographic, you can also de-pseudonymize the data to reveal the true values via a key (this is great if you are unsure whether you will need the true values later, as you can recover them from the pseudonymized value). The pseudonymization method allows for prefix-preserving data types, such that, for example, all occurrences of the same year in the dataset map to the same new year pseudonym, retaining information for dates or other data where the relationship with nearby points is an important feature. (Interested to try it out? Sign up for our Beta API.)

But say you aren’t yet sure about whether you need to retain true values or not… Let’s continue to explore via our flowchart!

(I tease because I love)

Again, let’s focus on the baseline use case. If this is just for research or exploration, please use pseudonymized or synthetic data first! You can, again, always add more information later — but if you get useful results without the sensitive data then you can keep that data protected.

If you aren’t testing or researching and you still need to use private data, your next step is to begin investigating how to safely release this data analysis or model. Let us continue…

If you are releasing a machine learning model which has been built on sensitive data, you should be aware of privacy-preserving machine learning, an active field of research that aims to build models which preserve individual privacy (or privacy of the data from the machine learning service or platform). There are many active researchers on the topic, including Nicholas Papernot (currently at Google), the SAP Leonardo team and Professor Reza Shokri at the National University of Singapore, among many others. However, if you haven’t trained your model on private data (or you have properly anonymized the data beforehand), then you are good to go here! (+1 for privacy-preserving baselines!)
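To give a flavor of what this looks like in practice, here is a toy numpy sketch of one common ingredient, DP-SGD-style per-example gradient clipping and noising, applied to a simple logistic regression. The data, clipping norm and noise scale are made up for illustration; for real work you would reach for an audited library rather than this sketch.

```python
# Toy sketch of DP-SGD-style training for logistic regression:
# clip each example's gradient, then add Gaussian noise to the averaged update.
# Hyperparameters and data here are arbitrary and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.1 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
clip_norm, noise_scale, lr, batch_size = 1.0, 1.0, 0.1, 100

for step in range(200):
    idx = rng.choice(n, batch_size, replace=False)
    preds = 1 / (1 + np.exp(-X[idx] @ w))
    # Per-example gradients of the logistic loss: shape (batch_size, d).
    per_example_grads = (preds - y[idx])[:, None] * X[idx]
    # Clip each example's gradient to bound any individual's influence.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Add calibrated Gaussian noise to the summed gradient, then average.
    noisy_grad = (clipped.sum(axis=0)
                  + rng.normal(scale=noise_scale * clip_norm, size=d)) / batch_size
    w -= lr * noisy_grad
```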

If you aren’t releasing a model, but instead showing data analysis results, take a look at aggregated anonymization — like the emoji study by Apple’s differential privacy team. What they were able to show is that differentially-private aggregated results still gave enough information to make product changes — for example, predicting the kissy-face emoji for French keyboard users and the sobbing face for their English keyboard counterparts.

Apple Differential Privacy Team: Top Emoji Use by Keyboard Language
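To illustrate the aggregated, differentially private flavor of release, here is a toy sketch of the Laplace mechanism applied to made-up emoji counts. The counts, epsilon and sensitivity are invented for illustration, and Apple’s actual system uses local differential privacy, which is more involved than this.

```python
# Toy sketch of the Laplace mechanism: release noisy aggregate counts instead
# of raw ones. Counts and parameters below are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)
emoji_counts = {"😂": 10452, "❤️": 8310, "😭": 7205, "😘": 6120}

epsilon = 1.0      # privacy budget (smaller = more privacy, more noise)
sensitivity = 1.0  # assume one user contributes at most 1 to each count

noisy_counts = {
    emoji: count + rng.laplace(scale=sensitivity / epsilon)
    for emoji, count in emoji_counts.items()
}
# The ranking of popular emoji usually survives the noise, so product
# decisions can still be made without exposing exact per-user tallies.
print(sorted(noisy_counts.items(), key=lambda kv: -kv[1]))
```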

Finally, let’s say you are working on an internal use only model or data analysis which now is popular and the company wants to make it public or share it with third parties. How can you go about protecting it?

Your best strategy is to try to fix the model by employing those same methods (privacy-preserving ML or aggregated anonymization). This also points back to why having a baseline of properly anonymized (or pseudonymized, when that is applicable) data is a great starting point. If the information doesn’t lie in the private or sensitive attributes, then you can release it publicly without fear of an individual privacy breach. (You should still consider whether the data itself may pose security risks, such as the recent Strava running map, which revealed secret military bases even though it did not necessarily pose an individual privacy risk.)

If you somehow already released the model trained on the private data into production, PLEASE lock down your API and chat with your security and engineering teams about preventing malicious access to the model. A machine learning model with an open API is basically like having a nice database with default credentials sitting out there on the public network — so act accordingly (Want to learn more about privacy risks and information gain from black box attacks? See Shokri’s membership inference attacks paper).

Conclusion

I hope I’ve made you laugh, cry and think about how your current data use practices may need some tweaking with regard to data privacy and security. My goal here is not to publicly shame you, but instead to give you some navigation tips and helpful pokes in the proper direction.

At the end of the day, sensitive data is like radiation. You don’t want to have long exposures (or any if possible), you want to do what you can to mitigate the adverse effects and, in general, less *really is* more. Treat your sensitive data like radiation and you will avoid potentially embarrassing leaks and security breaches from the start.

For more information on the data security and privacy solutions we are working on at KIProtect, check us out at https://kiprotect.com. Our Beta API is available for structured pseudonymization and we have a few other privacy-preserving data science tools up our sleeves, so feel free to sign up and find out more.


Head of Product at Cape Privacy. Helping make data privacy easier and more accessible for real world data science and machine learning. 😁