Where “Privacy First” Fails

On the challenges of data minimization, granularity, and decentralized processing

Markus Lampinen
Towards Data Science


“We are a privacy-first company.” You hear that all the time today. “We use privacy-by-design principles.” “We practice data minimization.” These are well-meaning statements, and often their implementation is correct to the letter. Yet the approach often leaves open issues such as loss of data integrity or a lack of true de-identification, and that is where “privacy first” fails. Let’s break these concepts down and explore some alternatives.

For the purposes of this article, let’s use a tangible use case: a fatigue application. The application takes 24-hour time series heart rate data and generates a fatigue score from it. That score can be used to manage fatigue in, for example, professional drivers, in order to reduce accidents.


“We anonymize all our data”

OK, so let’s consider this. You have wearable data, for example activity data, on your individual customers. You can strip out unique identifiers such as personally identifiable information and refer to the user by a pseudonym, “Mary Adams”, but what does this mean at the attribute level?

As an example, consider the following output, which is entirely plausible yet entirely synthetic.

{
  "data": {
    "person": {
      "sex": "Female",
      "device": "Android"
    },
    "deviceLocation": {
      "coordinates": {
        "latitude": 51.5072682,
        "longitude": -0.1657303
      },
      "altitude": 16,
      "accuracy": 95,
      "verticalAccuracy": 90,
      "velocity": 4
    },
    "activity": {
      "timestamp": "2021-01-02T23:37:00+00:00",
      "type": "Sleeping"
    },
    "heartRate": {
      "currentHeartRate": 58,
      "minHeartRate": 45,
      "maxHeartRate": 83
    }
  }
}

You may have pseudonymized the user’s name, but how do you anonymize GPS coordinates? The data above clearly shows where your user sleeps. Only a handful of people spend significant time at both your home address and your work address, so combining those locations to identify the person after the fact (a process called re-identification) is really kindergarten level.
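To make that concrete, here is a minimal sketch of such a re-identification step. It assumes a list of pseudonymized records shaped like the synthetic payload above; simply clustering the night-time coordinates into coarse grid cells surfaces the likely home address.

from collections import Counter
from datetime import datetime

# Pseudonymized records, shaped like the synthetic payload above.
records = [
    {"timestamp": "2021-01-02T23:37:00+00:00", "lat": 51.5072682, "lon": -0.1657303},
    {"timestamp": "2021-01-03T02:10:00+00:00", "lat": 51.5072650, "lon": -0.1657290},
    {"timestamp": "2021-01-03T14:05:00+00:00", "lat": 51.5155000, "lon": -0.1419000},
]

def likely_home(records, decimals=4):
    """Most frequent night-time location, rounded to roughly 10 m cells."""
    cells = Counter()
    for r in records:
        hour = datetime.fromisoformat(r["timestamp"]).hour
        if hour >= 22 or hour < 6:  # typical sleeping hours
            cells[(round(r["lat"], decimals), round(r["lon"], decimals))] += 1
    return cells.most_common(1)[0][0] if cells else None

# One address lookup later, "Mary Adams" has a real name again.
print(likely_home(records))  # -> (51.5073, -0.1657)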

The NY Times ran a great article with accompanying examples on this topic, showing how anonymous location data could easily identify military officials working at the most secure locations. If the Pentagon can’t get this right, would you trust your average company?

The Solution: Let’s Delete the Sensitive Data Objects

In cleaning the data, one could take the common approach of stripping out the data objects and attributes, such as location history, that are easily re-identified later. This seems like a solution, yet it has an obvious disadvantage: we lose valuable data for both the user and the application.

Your output looks like the following. It is indeed far less invasive, but it is also far less rich. To put it bluntly, you lost your “Mary Adams” on the map.

{
  "data": {
    "person": {
      "sex": "Female",
      "device": "Android"
    },
    "deviceLocation": {
      "altitude": 16,
      "accuracy": 95,
      "verticalAccuracy": 90,
      "velocity": 4
    },
    "activity": {
      "timestamp": "2021-01-02T23:37:00+00:00",
      "type": "Sleeping"
    },
    "heartRate": {
      "currentHeartRate": 58,
      "minHeartRate": 45,
      "maxHeartRate": 83
    }
  }
}

Wait: now we’ve lost all GPS location data! For more sophisticated applications, or use cases that require rich data sets and contextual information, deleting data is not the optimal solution. So we may take another approach: adjusting the data granularity.

A Note on Data Granularity

Data granularity is a spectrum, ranging from the absolute, detailed measurement all the way up to an abstraction so coarse it carries no information value. Because it is a spectrum, you can scale granularity up and down.

[Figure: the data granularity spectrum, from precise measurement to information-free abstraction. Image by author]

A set of GPS coordinates is a good example of one extreme: very precise and rich. A ZIP code is still information about location, but it is far less precise and far less unique to any one individual.

"deviceLocation": {
  "coordinates": {
    "place": "94105"
  }
}

This same logic can be applied to practically all data objects: take the attribute level and create a grouping, such as an income range or an age group. There are, however, areas where this becomes more binary, such as whether the individual belongs to a minority or has reported income. Some of these binary groupings can also become disproportionate signals, for instance whether a user has any illnesses or reported offenses, and they are strong identifiers in their own right.
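This kind of generalization is straightforward to sketch. The functions below are a minimal illustration, with made-up band boundaries, of bucketing exact attributes into groups:

def bucket_age(age):
    """Generalize an exact age into a decade band, e.g. 34 -> '30-39'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def bucket_income(income):
    """Generalize exact income into coarse, illustrative ranges."""
    if income < 30_000:
        return "<30k"
    if income < 70_000:
        return "30k-70k"
    return "70k+"

print(bucket_age(34), bucket_income(52_000))  # -> 30-39 30k-70k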

For our fatigue application, this time series data is critical: determining fatigue means judging the variance throughout the past 24 hours as well as the relative deviation in heart rate intensity. Less granular data, say a summary of the past 24 hours, would not produce meaningful or accurate fatigue scores. Heart rates are admittedly difficult to link to an individual, but either way, lowering the data quality is not an option.
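The fatigue model itself is not specified here, so purely to illustrate why the full series matters, consider a toy score built on variance and deviation from the resting rate; every threshold in it is made up.

from statistics import mean, pstdev

def toy_fatigue_score(heart_rates):
    """Toy model: low variability and low deviation from the resting
    rate across the 24h series read as fatigue. Illustrative only."""
    resting = min(heart_rates)
    variability = pstdev(heart_rates)  # needs the full series
    deviation = mean(hr - resting for hr in heart_rates)
    # Lower variability and deviation -> higher fatigue, clamped to [0, 1].
    return round(1.0 - min(1.0, (variability + deviation) / 40.0), 2)

# A pre-aggregated summary (min/avg/max) could not support this:
# pstdev needs every sample, which is exactly the granularity argument.
print(toy_fatigue_score([58, 57, 59, 61, 60, 58, 57, 56]))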

Data Results Shouldn’t Reveal Cohorts Where N<50

Different regulations and industries have different rules. One frequently used rule limits the size of the resulting cohorts in order to guarantee de-identification. The limit could be a population of 50, meaning your algorithm cannot reveal any population smaller than that in its results.

The reality, however, is not so simple. If you are utilizing location data, how many people have comparable location patterns? You then have to abstract or aggregate attributes into groups, e.g. rolling coordinates up to ZIP codes. But even that may not be enough to guarantee N≥50.

What about other data points, such as health or wellness data? Individual patterns there are far more specific, especially when taking into account conditions such as diabetes or impairments. Abstracting upward, for example by simply noting whether a ‘condition’ exists or not, may help keep the group size above the threshold.
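As a minimal sketch of such a threshold rule, assuming results have already been grouped into cohorts, one could suppress anything below the limit before releasing results:

MIN_COHORT = 50  # the de-identification threshold from the heading above

def safe_results(cohort_counts):
    """Suppress any cohort that the results would reveal with fewer
    than MIN_COHORT members."""
    return {
        cohort: (count if count >= MIN_COHORT else "suppressed")
        for cohort, count in cohort_counts.items()
    }

print(safe_results({"94105/condition": 12, "94105/no-condition": 830}))
# -> {'94105/condition': 'suppressed', '94105/no-condition': 830}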

Using Subjective Attributes over Absolute Attributes

Absolute attributes such as a heart rate in bpm are easy to understand, and the same goes for a location in latitude and longitude coordinates. However, a relative heart rate of ±15% of resting heart rate, or a relative location such as 5 km from 1 Market Street in San Francisco, is both meaningful and relatively difficult to tie to a particular individual from the outside.

"deviceLocation": {
  "coordinates": {
    "distanceFromHome": "1-5km"
  }
}
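Deriving such attributes is simple enough to sketch. The example below assumes a known resting heart rate and home coordinates (both hypothetical) and uses the haversine formula for distance:

import math

def relative_heart_rate(current, resting):
    """Express heart rate as a percentage above resting, not absolute bpm."""
    pct = round(100 * (current - resting) / resting)
    return f"+{pct}%" if pct >= 0 else f"{pct}%"

def distance_from_home_band(lat, lon, home_lat, home_lon):
    """Bucket the haversine distance from home into coarse bands."""
    r = 6371.0  # Earth radius in km
    p1, p2 = math.radians(lat), math.radians(home_lat)
    dp, dl = p2 - p1, math.radians(home_lon - lon)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    km = 2 * r * math.asin(math.sqrt(a))
    if km < 1:
        return "0-1km"
    return "1-5km" if km < 5 else "5km+"

print(relative_heart_rate(67, 58))  # -> +16%
print(distance_from_home_band(51.53, -0.19, 51.507, -0.166))  # -> 1-5km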

Abstracting data to subjective attributes may therefore be quite appealing, yet it increases overhead significantly when managing large datasets, and it is non-trivial to get right: combinations of subjective measurements will still increasingly converge on a particular individual.

Managing Cohorts is Tricky With Combined Data

Even if you abstract the underlying attributes into groups, cohorts easily become identifiable once there are enough data points. For example, consider a time series of location history (abstracted to the ZIP code level) combined with activity data (abstracted to whether the user is active or not). Even this rather vague data set could easily create a population smaller than 50.
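To see how quickly combined dimensions shrink cohorts, consider the hypothetical abstracted traces below: one (ZIP, active) pair per period of the day, and already almost every trace is unique.

from collections import Counter

# Hypothetical per-user day traces, already abstracted as in the text:
# one (zip, active) pair per period of the day.
traces = {
    "user1": (("94105", False), ("94107", True), ("94105", False)),
    "user2": (("94105", False), ("94107", True), ("94103", True)),
    "user3": (("94105", False), ("94107", True), ("94105", False)),
}

# Every extra time step multiplies the number of possible cohorts,
# so member counts per cohort shrink correspondingly.
cohorts = Counter(traces.values())
print({trace: n for trace, n in cohorts.items() if n < 50})
# -> both distinct traces show up, each far below N=50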

What you easily end up with is data minimization in practice: in order to limit the sensitivity of the data, we limit the quality of the data itself, or even the data dimensions we include, potentially lowering the value we can deliver.

Using Synthetic Data Sets to Guarantee Privacy

One of the exciting approaches in the data market is synthetic data, which allows developers and data engineers to create synthetic datasets based on real datasets while retaining data privacy. An example of a company pioneering this is Gretel.

This approach deals with a more fundamental aspect of data privacy: the underlying dataset may be impossible to anonymize or strip of personally identifiable information. Creating a comparable synthetic data set (with the caveat that data integrity depends on understanding the use case in question) is therefore a conceptually appealing option.

Now, this option has its limitations, mainly linked to which types of use cases it actually lends itself to and produces adequate results for. It is also often met with skepticism about retaining the statistical integrity of the data, as modifying data even slightly may destroy many of the links and much of the significance.

Synthetic data can serve as a starting point; however, there is no substitute for real data that is closely linked to the question to be answered. Understanding how synthetic data can be generated depends on understanding the real data, which increases complexity by orders of magnitude for each additional data object and source.
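As a deliberately naive illustration (real tools such as Gretel use far more sophisticated generative models), sampling a synthetic heart rate series from fitted summary statistics shows exactly the failure mode above: the marginals survive, the temporal links do not.

import random
from statistics import mean, pstdev

real_series = [58, 57, 59, 61, 60, 58, 57, 56]

def naive_synthetic(series, n):
    """Sample independently from the fitted mean and stddev. The
    marginal distribution is preserved, but the temporal structure
    a fatigue score depends on is destroyed."""
    mu, sigma = mean(series), pstdev(series)
    return [round(random.gauss(mu, sigma)) for _ in range(n)]

print(naive_synthetic(real_series, 8))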

As a partner stated, “we wouldn’t like to predict a heart attack using synthetic data”, emphasizing that in health-related use cases, real data is the only real option.

Decentralized Data Processing As An Option

Decentralized data processing refers to processing data outside of a central server, for example on the user’s own device: computer, phone or browser. Decentralized processing can also benefit from having no external sharing, meaning there are no shared logs beyond the user’s own device. The data, and the use of that data, stays with the individual.

We can already observe data being processed outside the central server (“on the user’s side”) in some personal data use cases. Being able to process data without ever removing it from the user makes the data privacy argument a very different one. The operation is individual-first: the individual themselves gains agency and utility from the data.

Decentralized Apps Offer More Privacy by Default

It’s also important to keep in mind that not all privacy comes from the handling of data; some of it comes from more mundane aspects, such as how your user connects to your app or server. However well you deconstruct the data, if your end user is communicating directly with you from their own IP address, that anonymity may be lost. Yes, VPNs solve this problem in part; however, VPN usage is a far cry from the norm, and most users can be identified from their direct network connection.

Removing Data From the User Requires Justification

There are various data engineering methods for processing data, and for deploying and training algorithms, on nodes, such as federated learning. Given that one can process high definition data in a decentralized manner on the user’s side, transferring data from the user to a centralized server requires justification: not only does it inevitably expose the user to more risk, it will also require reducing the data granularity due to privacy and ethical concerns.
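In one gist, federated learning has each node train on its local data and ship only model parameters; the raw data never leaves the device. Below is a minimal sketch under toy assumptions (a single scalar weight and a made-up local update rule), not a production protocol:

def local_update(weight, local_data, lr=0.1):
    """Hypothetical on-device step: nudge the weight toward the local
    mean, standing in for a real gradient step on local data."""
    grad = weight - (sum(local_data) / len(local_data))
    return weight - lr * grad

def federated_average(updates):
    """The server only ever sees parameters, never the heart rates."""
    return sum(updates) / len(updates)

global_w = 60.0
device_data = [[58, 57, 59], [72, 70, 75], [64, 66, 63]]  # stays on-device
updates = [local_update(global_w, d) for d in device_data]
global_w = federated_average(updates)
print(round(global_w, 2))  # updated model, no raw series transmitted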

We’ve discussed data engineering in various forums, such as with Dr. Ann Cavoukian, the inventor of Privacy by Design, on the Liberty Equality Data podcast, and we often come to the conclusion that many methods exist, and that defaulting to centralized processing may simply reflect a lack of familiarity with the alternatives.

For our fatigue application, we show the user a green light / red light type of result (too tired or not) depending on their fatigue score. However, utilizing the time series heart rate data for the past 24 hours, we can not only determine fatigue, we can also identify alcohol and substance use.

Sharing the green light / red light result effectively shares only the individual’s fatigue score. Sharing the entire data set (the time series heart rate data), however, also exposes those other factors embedded in it, such as alcohol use.

This means that transmitting the original data away from the user may incur additional risks for both parties, whereas the end result, a fatigue score or a red light / green light signal, can be delivered without revealing the underlying data.
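Putting the example together, here is a sketch of on-device evaluation that transmits only the verdict, never the 24-hour series (the variability threshold is a stand-in for a real fatigue model):

from statistics import pstdev

def on_device_check(heart_rates, threshold=5.0):
    """Runs entirely on the user's device; only the verdict leaves it.
    The raw series, and whatever else it encodes, never does."""
    variability = pstdev(heart_rates)  # stand-in fatigue signal
    return "red" if variability < threshold else "green"

# The server receives a single token with nothing to re-identify.
verdict = on_device_check([58, 57, 59, 61, 60, 58, 57, 56])
print(verdict)  # -> red (low variability reads as fatigued here)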

Sophisticated Apps Require Data Integrity

Of course, we cannot conclude that any one approach is right for all use cases.

Sophisticated applications rely on the availability of high definition, contextual data, and stripping out sensitive data may damage the overall data integrity. Data engineering on synthetic data sets to establish the initial assumptions is a highly valuable option for some use cases and for building models; however, it is no replacement for real data. When deploying the model or launching the application, taking a different approach to utilizing data in the first place, by limiting sharing, can have tremendous applicability.
