Data for Change
Every new technology has (almost always) been introduced with a purpose, usually as a solution to a problem its creator identified. But when we talk about Artificial Intelligence, a largely unexplored yet constantly evolving field of computing, we often get distracted by spruced-up object detection projects.
We all know that correctly classifying handwritten digits or distinguishing between animals are not the applications that researchers like Geoffrey Hinton and Alexey Ivakhnenko had in mind while dedicating their lives to the field. AI has the potential to change the way we live our lives, and the proof lies in your pocket. From your favorite video streaming platforms to your online cab-hailing services, basic principles of AI are being used everywhere.
But what about those who don't have the equivalent of a 20th-century supercomputer in their pockets, or anything in their pockets for that matter? What about those who go to sleep with raging hunger every night and wake up to the monstrosities of life?
We need to understand that Artificial Intelligence and its subsets are not just about math or algorithms that magically predict something. The data is the real deal. These algorithms without data are just like your phone without the internet. Absolutely useless. (Unless you have offline games installed)
But that's where the problem arises. If we want to use AI to diagnose and treat diseases, prevent the drastic impacts of climate change, or track incoming pandemics, we need real, private data. If only there were a way to train a deep learning model on data that the engineers themselves can never retrieve or see.
In recent years, there has been considerable research into preserving user privacy while solving real-world problems that require private user data. Companies like Google and Apple have been investing heavily in decentralized, privacy-preserving tools and approaches to extract the new "oil" without actually "extracting" it from its owner. We will explore a few of these approaches below.
Federated Learning
In essence, federated learning is an approach to answering questions with data that resides with users across the world, on their own devices. Secure Multiparty Computation (SMPC) is a cryptographic protocol that lets different workers (or data holders) train over these decentralized datasets while protecting the privacy of each user. Federated learning uses this protocol to make sure that the private data residing on a user's device cannot be seen by the deep learning practitioners trying to derive insights and solve global issues.
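To make the SMPC idea more concrete, here is a minimal sketch of additive secret sharing, the building block behind secure aggregation. This is a toy illustration in plain Python, not a production protocol, and the modulus and values are chosen purely for demonstration: each user splits their private value into random shares, and any single worker only ever sees shares and sums of shares, never the raw value.

```python
import random

Q = 2**61 - 1  # a large prime modulus, chosen here purely for illustration

def share(secret, n_workers=3):
    """Split an integer secret into n additive shares modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_workers - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares modulo Q."""
    return sum(shares) % Q

# Each user secret-shares a private value (e.g. a gradient scaled to an integer).
alice_shares = share(25)
bob_shares = share(17)

# Each worker adds the shares it holds; no single worker ever sees 25 or 17.
summed_shares = [(a + b) % Q for a, b in zip(alice_shares, bob_shares)]

print(reconstruct(summed_shares))  # 42 == 25 + 17
```

Only the aggregate (here, the sum) is ever reconstructed, which is exactly the property secure aggregation relies on.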
However, this technique on its own is prone to reverse engineering, as noted in the paper "A Generic Framework for Privacy-Preserving Deep Learning" by members of an organization called OpenMined. They have been working to popularize the concept of privacy in AI by building such frameworks on top of the tools researchers and practitioners already use every day.
The best example of federated learning is your Gboard (if you use an Android device). In simple terms, it looks at the way you type your sentences, analyzes the kinds of words you use, and then tries to predict the next word you are likely to type. The fun fact is that the training happens on the device itself (an edge device) without revealing your data to the engineers who designed the algorithm. So complex, yet snazzy, eh?
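As a toy illustration of the idea (not Gboard's actual pipeline), here is a federated averaging sketch in NumPy: each simulated device trains a small linear model on its own local data, and the server only ever receives model weights, never the raw data.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One device trains a tiny linear model on its own data and returns new weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def federated_average(weight_list):
    """The server averages the weights it receives; it never sees the raw data."""
    return np.mean(weight_list, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Simulate three devices, each holding its own private dataset.
devices = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    devices.append((X, y))

global_w = np.zeros(2)
for _ in range(10):
    local_weights = [local_update(global_w, X, y) for X, y in devices]
    global_w = federated_average(local_weights)

print(global_w)  # approaches [2.0, -1.0] without raw data ever leaving a "device"
```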
Differential Privacy
Imagine you have a dataset with certain entries, perhaps something like whether or not a person had a genetic disadvantage while they suffered from COVID-19. The question differential privacy asks is: without looking at the information the data carries about any particular individual, how much does a single entry influence the result computed over the entire dataset? If we can bound that influence, the published result tells us how the population behaves while revealing almost nothing about any one person.
As stated in the book The Algorithmic Foundations of Differential Privacy, differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population. To quote Cynthia Dwork's exact definition of differential privacy:
"Differential Privacy" describes a promise, made by a data holder, or curator, to a data subject and the promise is like this:
"You will not be affected, adversely or otherwise, by allowing your data to be used in any study, or analysis, no matter what other studies, datasets, or information sources, are available"
However, differential privacy has its drawbacks, since its usefulness depends heavily on the nature of the dataset. If the individual entries are very diverse (which is entirely possible), the overall result can be skewed towards certain entries, which can leak information about those outliers. And if too much noise has to be added to protect them, the result may be too inaccurate to yield useful insights.
Implementing differential privacy schemes from scratch is very difficult. Just like cryptography, these schemes are complicated and hard to build correctly. Hence, it is advisable to use well-tested privacy-preserving libraries like TF Privacy and Google's Differential Privacy library while training your deep learning models.
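To make the role of noise concrete, here is a toy sketch (for illustration only, not a production implementation) of the Laplace mechanism applied to a simple counting query on a hypothetical dataset: the noise scale is the query's sensitivity divided by the privacy budget epsilon, so a smaller epsilon means stronger privacy but a noisier answer.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 1 = person had the genetic disadvantage, 0 = did not.
records = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

def dp_count(data, epsilon):
    """Release a counting query via the Laplace mechanism.

    A counting query changes by at most 1 when a single record is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return data.sum() + noise

print("true count   :", records.sum())
print("epsilon = 1.0:", dp_count(records, epsilon=1.0))  # modest noise
print("epsilon = 0.1:", dp_count(records, epsilon=0.1))  # stronger privacy, noisier
```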
Differential privacy can be used in many settings where we want to protect the privacy of users and the dataset isn't too diverse. One example use case is genomics, where machine learning can help determine a personalized treatment plan for a patient based on their genetic features. Many solutions proposed for this application relied on k-anonymity, but anonymizing data is not an optimal way to preserve privacy, since it is prone to linkage attacks. Differential privacy instead adds a calibrated amount of uncertainty to the training process, which makes an attacker's life difficult and thereby ensures privacy.
Homomorphic Encryption
Data encryption is a basic security standard for organizations these days, one they can even get blacklisted for ignoring, especially if they deal with private user data. But if these organizations want to use the encrypted data, perhaps to improve the user experience of the products they offer, they first have to decrypt it before they can work with it. This raises serious concerns, as information can be leaked in the process of decryption.
Homomorphic encryption is a type of encryption that lets us perform certain operations directly on encrypted data, so we can derive relevant insights and answer the questions that really matter without compromising the privacy of a user. Schemes of this kind are even believed to resist attacks by quantum computers.
This looks very convenient, since the data stays encrypted and engineers never need to look at the information it contains. But it comes at a cost: these computations demand a lot of computational power and take considerably longer than their plaintext counterparts.
There are two broad types of homomorphic encryption, depending on the computations they support: Partially Homomorphic Encryption and Fully Homomorphic Encryption. The former supports only a limited set of operations on ciphertexts (for example, addition in the Paillier scheme), whereas the latter can in theory evaluate arbitrary computations, though with significant overhead.
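As a small illustration of the partially homomorphic case, the sketch below assumes the open-source python-paillier package (`pip install phe`), which implements the Paillier scheme: you can add two ciphertexts, or multiply a ciphertext by a plain number, without ever decrypting anything.

```python
# Requires the python-paillier package: pip install phe
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# A user encrypts their private values before sharing them.
enc_a = public_key.encrypt(25)
enc_b = public_key.encrypt(17)

# The server computes on ciphertexts only; it never holds the private key.
enc_sum = enc_a + enc_b   # ciphertext + ciphertext
enc_scaled = enc_a * 3    # ciphertext * plaintext scalar

# Only the key owner can decrypt the results.
print(private_key.decrypt(enc_sum))     # 42
print(private_key.decrypt(enc_scaled))  # 75
```

Fully homomorphic schemes extend this idea to arbitrary circuits, which is where the heavy computational overhead mentioned above comes in.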
A hypothetical but very important use case of homomorphic encryption is in general elections, where it could help ensure that the voting system cannot be tampered with. Many significant (and factually correct) insights could also be derived from homomorphically encrypted data without leaking any information.
These are just a few of the many ways in which one can gain useful insights to help solve global crises without compromising user privacy. I will be exploring the different frameworks and tools that put these concepts into practice, so that we can build something that actually has an impact on people's lives. I'll be sharing my insights in upcoming articles, so stay tuned!
References
[1] OpenMined: What is Homomorphic Encryption? https://www.youtube.com/watch?v=2TVqFGu1vhw&feature=emb_title
[2] Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series https://www.youtube.com/watch?v=4zrU54VIK6k
[3] Google Federated Learning: https://federated.withgoogle.com/
[4] The Algorithmic Foundations of Differential Privacy: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
[5] What Is Homomorphic Encryption? And Why Is It So Transformative?: https://www.forbes.com/sites/bernardmarr/2019/11/15/what-is-homomorphic-encryption-and-why-is-it-so-transformative/#3f69895b7e93
[6] Why differential privacy is awesome? (Ted is writing things): https://desfontain.es/privacy/differential-privacy-awesomeness.html
[7] Introducing TensorFlow Privacy: Learning with Differential Privacy for Training Data: https://blog.tensorflow.org/2019/03/introducing-tensorflow-privacy-learning.html
[8] Azencott, C.-A. (2018). Machine learning and genomics: precision medicine versus patient privacy. Phil. Trans. R. Soc. A, 376: 20170350.