
Federated Learning: Your Data Stays with You

With federated learning, we can improve a centralized machine learning model's performance without ever sharing users' data.

Photo by John Salvino on Unsplash

This is the 21st century and "Data is the new gold". With the emergence of new technologies, higher computational power, and of course huge amounts of data, Artificial Intelligence has gained a lot of momentum. According to MarketsandMarkets, the global artificial intelligence (AI) market is expected to grow from USD 58.3 billion in 2021 to USD 309.6 billion by 2026, at a Compound Annual Growth Rate (CAGR) of 39.7% during the forecast period.

While Artificial Intelligence brings a lot of comfort to our daily lives without us even realizing it, it also comes with challenges, and data privacy is one of them. Apps gather your data (of course not to sell it to third parties, or do they?) to give you more personalized recommendations and results (along with ads 😐).


SO THE QUESTION ARISES: DOES THE DATA LEAVE YOUR PHONE?

The answer is yes. For example, when you grant an application access to your location, it collects your location data. It is then up to the application how its AI algorithm uses that data. There are two options:

  1. On server: The machine learning/deep learning model is deployed on the server, where it is trained on the data received from billions of smartphones.
  2. On-device: The ML/DL model is deployed on the phone, where the user's data is used to train and improve the model for better recommendations.

Both have their own advantages and disadvantages. Training on the server needs a huge amount of storage for the data and world-class security to safeguard it from data breaches, whereas on-device training only sees a limited amount of data, so model performance is compromised.

Solution: Training a centralized model on decentralized data. Boom!

Okay, let me explain.

For a better user experience, a company would want data from billions of smartphones to train a centralized model sitting on its server. But for that, the data has to leave the smartphones, and we don't want that, right? Instead, if a copy of the centralized model is present on every device where training happens, we have already solved the performance problem. Now we somehow have to aggregate the results from all the smartphones into a single model. The training results (what we machine learning engineers call weights) can be sent to the server, where they are merged. The weights are encrypted, and the key stays with the model on the device.
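
To make this concrete, here is a minimal NumPy sketch of one such round. Everything in it is an assumption for illustration: the linear model, the single gradient step standing in for real on-device training, and all the function names; it is not taken from any federated learning library.

```python
import numpy as np

def local_update(global_weights, features, labels, lr=0.1):
    """On-device step: start from the global weights and take one gradient
    step of plain linear regression on the local data (a stand-in for real
    on-device training). The raw data never leaves this function."""
    preds = features @ global_weights
    grad = features.T @ (preds - labels) / len(labels)
    return global_weights - lr * grad

def server_aggregate(client_weights, client_sizes):
    """Server step: merge the received weight vectors, weighted by how much
    data each client trained on. The server never sees any raw data."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# One round with three simulated "phones".
rng = np.random.default_rng(0)
global_weights = np.zeros(5)
clients = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(3)]

local_results = [local_update(global_weights, X, y) for X, y in clients]
global_weights = server_aggregate(local_results, [len(y) for _, y in clients])
print("new global weights:", global_weights)
```

Only the weight vectors travel to the server; the per-client data stays inside `local_update`.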

Ohhhhh yeahh!!!

And to further improve user privacy, a secure aggregation protocol is used: each device adds zero-sum masks to its encrypted results, so the server can combine them and decrypt only the aggregate, never an individual update. To read more on this, please refer to this paper.
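
Here is a toy sketch of the zero-sum masking idea, heavily simplified: in the real protocol the masks are derived from keys the devices agree on and the scheme tolerates devices dropping out, but the cancellation trick is the same.

```python
import numpy as np

rng = np.random.default_rng(42)
updates = [rng.normal(size=3) for _ in range(3)]  # the true per-device updates

# Every pair of devices (i, j) agrees on a random mask; device i adds it and
# device j subtracts it, so the masks cancel only when ALL updates are summed.
masked = [u.copy() for u in updates]
for i in range(len(updates)):
    for j in range(i + 1, len(updates)):
        pair_mask = rng.normal(size=3)
        masked[i] += pair_mask
        masked[j] -= pair_mask

print("one masked update (looks like noise):", masked[0])
print("sum of masked updates:", sum(masked))   # equals ...
print("sum of true updates:  ", sum(updates))  # ... the true aggregate
```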

Finally, the aggregated weights are sent back to the model on each device, and we now have a new, improved model.


With every new start come new challenges. And trust the beauty of new challenges: they are here to help us grow.

So let's discuss the challenges.

  1. Some data is very specific to a particular user, and that can bring down the overall model performance. We don't want the model to memorize rare data from a particular user.

Solution: a) Devise a mechanism to control how much an individual user can contribute to the overall result. b) Add noise to data that is too specific; this is also referred to as Differential Privacy. I found this article quite intuitive.
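
Below is a minimal sketch of both ideas, with made-up parameter values: (a) clip each user's update so no single user can dominate the aggregate, and (b) add Gaussian noise so individual contributions cannot be recovered exactly.

```python
import numpy as np

def clip_update(update, clip_norm=1.0):
    """(a) Bound how much a single user can contribute to the aggregate."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / max(norm, 1e-12))

def private_average(client_updates, clip_norm=1.0, noise_std=0.5, seed=0):
    """(b) Clip every update, sum them, add Gaussian noise, then average."""
    rng = np.random.default_rng(seed)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(scale=noise_std, size=clipped[0].shape)
    return noisy_sum / len(client_updates)

# One user's update is an extreme outlier; clipping keeps it from dominating.
updates = [np.array([0.2, -0.1]), np.array([50.0, 40.0]), np.array([0.1, 0.3])]
print(private_average(updates))
```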

  2. Now we have our new model formed from the aggregated results. But how can we see how the model performs on new data before rolling out the update?

Solution: Simple! We can apply the same concept of a train-validation split that we all know, except that here we split the users themselves. Sounds interesting, huh? From the universal set of smartphone users, a small proportion will validate the result and the rest will train the model, so the new model is tested on real user data before it is rolled out.
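
A tiny sketch of that split, with hypothetical user IDs and a made-up 10% validation fraction; the point is only that we split users, not samples.

```python
import random

# Hypothetical population of users.
user_ids = [f"user_{i}" for i in range(1000)]
random.seed(7)
random.shuffle(user_ids)

n_val = int(0.10 * len(user_ids))
validation_users = set(user_ids[:n_val])   # these devices only evaluate the candidate model
training_users = set(user_ids[n_val:])     # these devices contribute training updates

print(len(training_users), "training users,", len(validation_users), "validation users")
```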

  3. Can simple averaging work as the aggregation step for all algorithms? Let's explore this with two examples:

a) Let's take Normal Bayes (OpenCV). The mean vector and the covariance matrix are highly influenced by the number of samples per class. Now let's say we have two smartphones and a binary classification problem with classes A and B.

User 1:

Image by author: contains three samples of class A and one of class B

User 2:

Image by author: contains three samples of class B and one of class A

Here xki(j) represents the value of the i-th feature attribute of the j-th training sample belonging to class k, and the final n-dimensional mean vector μk of class k (there are n feature attributes in total) is estimated as the per-class sample mean:

μk = (1/Nk) · Σj xk(j), summing over the Nk training samples of class k, where xk(j) = (xk1(j), …, xkn(j)).

So the mean vector's values are highly influenced by the number of samples per class: μA(1), the class-A mean of user 1, is estimated from three samples (75% of user 1's data), while μA(2), the class-A mean of user 2, comes from a single sample (25% of user 2's data). So when we merge them by taking the plain average 1/2 (μA(1) + μA(2)), are we able to retain the information specific to the classes?
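
A small numeric sketch with made-up feature values shows the problem: the plain average treats user 2's single class-A sample as equal to user 1's three, whereas weighting by sample counts recovers the mean we would get from the pooled data.

```python
import numpy as np

# Feature values are made up. User 1 has three class-A samples, user 2 has one.
user1_A = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])  # 3 class-A samples
user2_A = np.array([[5.0, 6.0]])                          # 1 class-A sample

mu_A1 = user1_A.mean(axis=0)   # μA(1), estimated from 3 samples
mu_A2 = user2_A.mean(axis=0)   # μA(2), estimated from 1 sample

plain_avg = 0.5 * (mu_A1 + mu_A2)                         # 1/2 (μA(1) + μA(2))
weighted_avg = (3 * mu_A1 + 1 * mu_A2) / 4                # weighted by sample counts
pooled_mean = np.vstack([user1_A, user2_A]).mean(axis=0)  # mean over all class-A data

print("plain average:   ", plain_avg)     # over-weights user 2's single sample
print("weighted average:", weighted_avg)  # identical to the pooled mean
print("pooled mean:     ", pooled_mean)
```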

b) For algorithms like SVM, whose "weights" are nothing but the support vectors, and whose number depends on the samples in the dataset, can we even have a weight matrix of constant size across all the results? We might need to devise aggregation algorithms specific to the machine learning task at hand.
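
To see the size mismatch, here is a quick sketch using scikit-learn (an assumption; the article itself does not mention it): two users' kernel SVMs end up with different numbers of support vectors, so there is nothing of fixed shape to average element-wise.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def local_svm(n_samples):
    """Train a kernel SVM on one user's synthetic local data."""
    X = rng.normal(size=(n_samples, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return SVC(kernel="rbf").fit(X, y)

svm_user1 = local_svm(40)
svm_user2 = local_svm(100)

# The two "models" are sets of support vectors of different sizes, so there is
# no fixed-size weight matrix to average across users.
print("user 1 support vectors:", svm_user1.support_vectors_.shape)
print("user 2 support vectors:", svm_user2.support_vectors_.shape)
```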

  4. Trade-off between privacy and accuracy:

Sometimes, to increase the privacy of users' data, noise is added. This makes the data deviate from its actual behavior and results in some drop in accuracy.

Conclusion:

Federated Learning can solve a lot of problems related to users' privacy while improving model performance for better recommendations. This is a fairly new domain, and more research can address many of the challenges faced by what we call collaborative learning.

References:

Federated Learning: Collaborative Machine Learning without Centralized Training Data

Federated Learning: Strategies for Improving Communication Efficiency

TensorFlow Federated

