
Scaling Flower with Multiprocessing

Learn how to scale your Federated Learning experiments using the Flower framework and multiprocessing with PyTorch.


Photo by Mitchel Lensink on Unsplash

A short Introduction to Federated Learning:

Recent technologies produce more and more data as they evolve, and collecting it in large quantities to train accurate models is becoming increasingly accessible. However, this raises privacy concerns, and people are now protected by a number of laws depending on where they live (for instance the GDPR in Europe). The traditional machine learning approach of accumulating data in a single place to train models can’t be blindly applied when personal data is involved.

To tackle this issue, Google released in 2016 a new paradigm for training models called Federated Learning and applied it to its Google Keyboard app [1a] [1b]. It was introduced to address the domain gap between the publicly available datasets their models were trained on and the private data users actually produce.

As specified in the Federated Learning book [2], in order for this paradigm to work, it needs to respect four main principles:

• At least two entities want to train a model, each owning data they are ready to use.

• During training, the data doesn’t leave its original owner.

• The model can be transferred from one entity to another through protected means.

• The performance of the resulting model is a good approximation of the ideal model trained with all data gathered by a single entity.

The last point is also telling us that Federated Learning can’t always be applied. Its biggest drawbacks are that, at least for now, Federated Learning is sensitive to attacks from the inside [3], is not guaranteed to converge [4] and needs enough clients to achieve its results [5]. However, when applied correctly, it can produce models that wouldn’t have been obtained through regular means as shown by Google and their Google Keyboard.

As of now, only a few frameworks exist to implement it, since it’s a fairly new concept. TensorFlow has developed its own version called TensorFlow Federated. PyTorch has yet to get its own implementation, but compatible frameworks do exist, such as PySyft, developed by OpenMined, and Flower, which will be the main focus of this post.

Why use Flower:

Flower is a recent framework for Federated Learning, created in 2020. Contrary to TensorFlow Federated and PySyft, which are each tied to a single framework, Flower can be used with all of them by design. It focuses on providing the tools to apply Federated Learning efficiently and lets you focus on the training itself. Implementing a basic federated setting with Flower is really simple (about 20 lines of code is enough) and the rewriting needed to adapt centralized code to a federated setting is minimal.

The range of compatible devices is also quite large: from mobile devices to Raspberry Pis, servers and more. Its architecture also allows scaling up to thousands of clients, as shown in their paper [6]. It is overall a really great framework to experiment with.

The GPU problem:

If you want to emulate a federated learning setting locally, it’s really easy to scale to as many clients as your CPU allows. For basic models, the CPU is more than enough and there is no need to move training to the GPU. However, when using bigger models or bigger datasets, you might want to move to the GPU to greatly improve training speed.

This is where you can encounter a problem in scaling your federated setting. Unlike some other frameworks, Flower aims to allow easy deployment from research/prototype to production, so it treats clients as independent processes. Additionally, when accessing the GPU, CUDA will automatically allocate a fixed amount of memory so that it has enough room to work with before asking for more.

However, this memory can’t be freed until the process exits. This means that if you launch 100 clients, sample 10 of them per round and use the GPU, every client that accesses it leaves behind memory that can’t be released, and the total keeps growing as new clients are sampled. In the long run, your GPU needs as much memory as there are clients launched.

Here is a short code snippet that shows the problem, run on Google Colaboratory:
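The original snippet isn’t embedded here, so below is a minimal sketch of the same experiment (assuming a CUDA-enabled runtime; the exact numbers will differ from the figure):

```python
import subprocess
import torch

# Allocating a tensor on the GPU forces CUDA to create a context and
# reserve memory for this process.
x = torch.randn(1000, 1000, device="cuda")
print(torch.cuda.memory_allocated())  # memory held by live tensors

# Free everything PyTorch can track.
del x
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # back to 0

# ...yet nvidia-smi still reports memory in use by this process: the CUDA
# context itself is only released when the process exits.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
    capture_output=True, text=True,
).stdout)
```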

Monitoring memory shows that there is a leftover of 7% even after clearing out memory used by PyTorch. Image by author.

How to solve the issue:

This problem, which you might have encountered yourself, can be solved quite easily. Since the memory is not released until the process accessing it exits, we simply need to encapsulate the part of our code that needs to access the GPU in a sub-process and wait for it to terminate before continuing to execute our program. Multiprocessing is the solution, and I will show you how to do it using PyTorch and Flower.

Since this example is based on the Quickstart PyTorch tutorial from the Flower documentation, I highly recommend checking it out before continuing, since it covers the basics.

Helper file

First we’re going to build a flower_helpers.py file where we’ll put some functions and classes that will come in handy later. Starting with the imports, we have:
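The gists from the original post aren’t embedded here, so the code in this section is a minimal sketch of what flower_helpers.py can look like, assuming a 2021-era Flower release (flwr), torchvision for CIFAR-10, and illustrative names such as load_data:

```python
# flower_helpers.py (sketch)
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
# We will derive our own strategy from FedAvg further down.
from flwr.server.strategy import FedAvg
```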

Basic imports: torch imports for working on CIFAR-10 and a Flower strategy import, since we will need to slightly change the FedAvg strategy for our use case. We then define the device on which we want to run the training and test steps:
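For instance:

```python
# Use the GPU when one is available, otherwise fall back to the CPU.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```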

Next, we need to define how we’re going to load the data:
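In this sketch, every client simply loads the full CIFAR-10 dataset (building proper federated partitions is left for later, as noted at the end of the post):

```python
def load_data(batch_size=32):
    """Download CIFAR-10 and wrap it in train and test DataLoaders."""
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    trainset = torchvision.datasets.CIFAR10(
        "./data", train=True, download=True, transform=transform)
    testset = torchvision.datasets.CIFAR10(
        "./data", train=False, download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=batch_size, shuffle=True)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size)
    return trainloader, testloader
```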

A simple CNN model from "PyTorch: A 60 Minute Blitz":
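Reproduced here from that tutorial:

```python
class Net(nn.Module):
    """Small CNN from the PyTorch 60 Minute Blitz tutorial."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```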

So far, nothing has changed from the original Flower tutorial; from now on things are going to be different. Since we can’t keep the model in the client’s memory, we need to define a way to get our model’s weights so that the client can keep track of them. For that we take the get_parameters and set_parameters functions from the Flower tutorial and put them in our helper file:
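A sketch of the two helpers, taking the model as an explicit argument since no model lives permanently inside the client:

```python
def get_parameters(model):
    """Return the model weights as a list of NumPy arrays."""
    return [val.cpu().numpy() for _, val in model.state_dict().items()]


def set_parameters(model, parameters):
    """Load a list of NumPy arrays back into the model."""
    params_dict = zip(model.state_dict().keys(), parameters)
    state_dict = OrderedDict({k: torch.tensor(v) for k, v in params_dict})
    model.load_state_dict(state_dict, strict=True)
```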

Now we can define the training function that will be called every time a client wants to train its model:
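A sketch of what train can look like; everything GPU-related happens inside this function, which will run in its own process (the optimizer settings are illustrative):

```python
def train(epochs, parameters, return_dict):
    """Train the model; meant to be run inside a dedicated sub-process."""
    net = Net().to(DEVICE)
    set_parameters(net, parameters)
    trainloader, _ = load_data()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    for _ in range(epochs):
        for images, labels in trainloader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            optimizer.zero_grad()
            loss = criterion(net(images), labels)
            loss.backward()
            optimizer.step()
    # Hand the results back to the parent process through the shared dict.
    return_dict["parameters"] = get_parameters(net)
    return_dict["data_size"] = len(trainloader.dataset)
```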

This function takes three parameters: the number of local epochs we want to train for, the new parameters of the global model, and a return dictionary that acts as our return value, giving back to the client the updated model, the size of the local dataset and any other metrics we’d like to include, like the loss or accuracy. The same goes for the testing function:
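A matching sketch for test, which fills the return dictionary with the loss, accuracy and dataset size:

```python
def test(parameters, return_dict):
    """Evaluate the model; also meant to be run inside a sub-process."""
    net = Net().to(DEVICE)
    set_parameters(net, parameters)
    _, testloader = load_data()
    criterion = nn.CrossEntropyLoss()
    correct, total, loss = 0, 0, 0.0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            outputs = net(images)
            loss += criterion(outputs, labels).item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return_dict["loss"] = loss / len(testloader)
    return_dict["accuracy"] = correct / total
    return_dict["data_size"] = total
```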

Finally, we need to define our custom strategy. Strategies are not mentioned in the Quickstart tutorial, but they are the classes that determine how the server aggregates the new weights, how it evaluates clients, and so on. The most basic strategy is FedAvg (Federated Averaging [1b]), which we will use as the basis for our own. Flower already gives you a way to define the number of clients used to evaluate your model through the initial parameters of the FedAvg strategy; however, this only applies to the evaluations carried out between rounds.

Indeed, after the final round, the Flower server performs one last evaluation step, sampling all available clients to verify the model’s performance. That wouldn’t be a problem in a real-world scenario, but it could backfire in ours: we especially want to avoid a step that overflows the GPU’s memory.

This is why, in this tutorial, we will only perform evaluation on the server side and remove this functionality. This is done by overriding the configure_evaluate method of the strategy:
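A minimal sketch of such a strategy; the configure_evaluate signature shown here matches 2021-era Flower (newer releases use a server_round argument), and the simplest fix is to never sample clients for evaluation at all:

```python
class FedAvgMp(FedAvg):
    """FedAvg variant that never asks clients to evaluate the model.

    By default, FedAvg samples every available client for a final
    evaluation after the last round, which would pile up GPU memory in
    our setting, so federated evaluation is disabled entirely.
    """

    def configure_evaluate(self, rnd, parameters, client_manager):
        # Evaluation happens only on the server side in this tutorial.
        return []
```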

Client file

We’re done with the helper file, we can now switch to the client side and create client.py. Starting with imports:
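Again a sketch, assuming the helper file above is named flower_helpers.py:

```python
# client.py (sketch)
import flwr as fl
import torch.multiprocessing as mp

from flower_helpers import train, test
```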

The next step is to implement our own client class so it can connect to a Flower server. We need to derive from Flower’s NumPyClient class and implement four methods, namely get_parameters, set_parameters, fit and evaluate. We will also add an attribute called parameters, where we will keep track of the model’s weights:
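A possible skeleton for the class (the name CifarClient is illustrative):

```python
class CifarClient(fl.client.NumPyClient):
    """Flower client that only keeps the model weights in memory."""

    def __init__(self):
        # No model object is kept here, only its weights as NumPy arrays,
        # so nothing stays allocated on the GPU between rounds.
        self.parameters = None
```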

get_parameters and set_parameters are straightforward, as they are just a getter and a setter:
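Continuing the class (note that newer Flower versions pass an extra config argument to get_parameters):

```python
    # (still inside CifarClient)
    def get_parameters(self):
        return self.parameters

    def set_parameters(self, parameters):
        self.parameters = parameters
```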

Then comes the fit method, where the model is trained. It receives two parameters: the new parameters from the global model and a config dictionary containing the configuration for the current round. It’s inside fit that we launch a sub-process, so we can use the GPU without worrying about lingering memory:
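A sketch of fit; the number of local epochs is hardcoded to 1 here for simplicity, but it could just as well come from the config dictionary:

```python
    # (still inside CifarClient)
    def fit(self, parameters, config):
        self.set_parameters(parameters)
        # Train in a fresh process: all the CUDA memory it grabs is handed
        # back to the OS when that process exits.
        manager = mp.Manager()
        return_dict = manager.dict()
        p = mp.Process(target=train, args=(1, self.parameters, return_dict))
        p.start()
        p.join()
        self.parameters = return_dict["parameters"]
        return self.parameters, return_dict["data_size"], {}
```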

As you can see, the function returns the newly updated parameters, the size of the local dataset and a dictionary (here empty) which could contain different metrics. Last, we have the evaluate method, similar to fit but used for evaluation. In our case we could choose to implement only the minimum required, since we won’t evaluate our clients, but I will give the full implementation here:
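The same pattern applies, this time calling test inside the sub-process:

```python
    # (still inside CifarClient)
    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        manager = mp.Manager()
        return_dict = manager.dict()
        p = mp.Process(target=test, args=(self.parameters, return_dict))
        p.start()
        p.join()
        return (
            float(return_dict["loss"]),
            int(return_dict["data_size"]),
            {"accuracy": float(return_dict["accuracy"])},
        )
```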

We only have to wrap all of this in a main block, set the start method used to create new sub-processes to spawn (not the default with Python under 3.8) and start our client on local port 8080:
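A sketch of the main block; the exact start_numpy_client signature depends on your Flower version:

```python
if __name__ == "__main__":
    # spawn (rather than fork) so every sub-process gets a clean CUDA context.
    mp.set_start_method("spawn")
    fl.client.start_numpy_client("127.0.0.1:8080", client=CifarClient())
```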

Server file

With the client done, we can now work our way toward the server, which we will simply call server.py! In the original tutorial, starting the server takes as little as a single line of code! But here we will perform server-side evaluation and use a custom strategy, so things are slightly different. Beginning with the imports:
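A sketch of the imports (argparse is only there to make the number of rounds and clients configurable, which is an assumption of this sketch):

```python
# server.py (sketch)
import argparse

import flwr as fl
import torch.multiprocessing as mp

from flower_helpers import FedAvgMp, Net, get_parameters, test
```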

First, we need to define the way we are going to evaluate our model on the server side and encapsulate that function in get_eval_fn, which tells the server how to retrieve it. The evaluation is almost identical to the one I gave for the client side, and you could actually merge parts of them:
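A sketch assuming the eval_fn contract of 2021-era Flower, where the function receives the global weights and returns a loss plus a metrics dictionary:

```python
def get_eval_fn():
    """Return the function Flower calls for server-side evaluation."""

    def evaluate(weights):
        # Same trick as on the clients: evaluate in a sub-process so the
        # server's CUDA memory is released after every round.
        manager = mp.Manager()
        return_dict = manager.dict()
        p = mp.Process(target=test, args=(weights, return_dict))
        p.start()
        p.join()
        return (
            float(return_dict["loss"]),
            {"accuracy": float(return_dict["accuracy"])},
        )

    return evaluate
```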

Then we can start the __main__ block, load the arguments and set the spawn method:
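For instance, with two hypothetical command-line arguments:

```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--rounds", type=int, default=3)
    parser.add_argument("--min_clients", type=int, default=2)
    args = parser.parse_args()

    # The server also spawns sub-processes for its own evaluation step.
    mp.set_start_method("spawn")
```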

We then get a fresh network so we can initialize the weights for the federated loop:
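Depending on your Flower version, the initial weights may need to be wrapped before being handed to the strategy (weights_to_parameters in older releases, ndarrays_to_parameters in newer ones); a sketch:

```python
    # A throwaway model gives us the initial global weights.
    net = Net()
    init_weights = get_parameters(net)
    init_parameters = fl.common.weights_to_parameters(init_weights)
    del net
```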

Finally, define the strategy and launch the server on port 8080:
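A sketch using the keyword arguments of 2021-era FedAvg (eval_fn and a plain config dictionary; newer releases renamed these):

```python
    strategy = FedAvgMp(
        fraction_fit=0.5,
        min_fit_clients=args.min_clients,
        min_available_clients=args.min_clients,
        eval_fn=get_eval_fn(),
        initial_parameters=init_parameters,
    )
    fl.server.start_server(
        "127.0.0.1:8080",
        config={"num_rounds": args.rounds},
        strategy=strategy,
    )
```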

Bash file

The only thing left to do is to launch our server and clients! We write a nice bash file so we only have to run it to start experimenting:
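A minimal run.sh sketch (the number of clients and the sleep duration are arbitrary):

```bash
#!/bin/bash
# Start the server, give it a head start, then launch the clients.
python server.py &
sleep 5  # make sure the server is up before clients try to connect
for i in $(seq 1 2); do
    python client.py &
done
wait  # keep the script alive until the server and clients are done
```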

Running

Now, by simply running ./run.sh in your terminal once you have turned it into an executable (chmod u+x run.sh), you should see the following output:

Terminal output from running the script. Image by author.

Opening a new terminal and using the nvtop command, we can monitor our GPU usage in real-time:

GPU memory usage with nvtop, in blue we have the GPU computational usage and in yellow the memory usage. Image by author.

We can see that our clients are correctly spawning sub-processes and that the memory is freed every time they finish their training.

If you get an error caused by "no module named backports.lzma" you can add the package with the poetry add backports.lzma command.

If for some reason you get an error telling you the clients can’t connect, make sure the server has enough time to set up before clients try to connect to it. Another reason might be a known bug between gRPC and Python; you can try adding the following lines to your server and client files:
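The original lines aren’t embedded here; one commonly cited workaround (not necessarily the author’s exact snippet) is to set gRPC’s fork-support environment variables before flwr, and therefore grpc, is imported:

```python
# Put this at the very top of server.py and client.py, before importing flwr.
import os

os.environ["GRPC_ENABLE_FORK_SUPPORT"] = "true"
os.environ["GRPC_POLL_STRATEGY"] = "epoll1"
```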

All the code is available on GitHub. You can now launch as many clients as your CPU allows and manage your GPU memory to fit your needs. This concludes this tutorial. I hope it will come in handy; don’t hesitate to leave feedback!

To go further

Of course, this was just a demonstration of a workaround and it’s not yet ready for real federated experiments. If you need to go further, you can try making your own federated datasets (right now we load the same data for all clients) or use a benchmark like LEAF. You can wrap the training and testing steps with tqdm to get better feedback, include a more detailed report (precision, recall, F1 score…), or add a way to protect your model with encryption or Differential Privacy. You can also check out more Flower tutorials to get a better grasp of the framework’s possibilities.

Finally, Flower’s team addressed scaling in their latest Summit, and it seems that the release of the Virtual Client Manager will resolve this issue and even improve scaling further by allowing thousands of clients per round, while still taking available resources into account.

References

[1a] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated Learning: Strategies for Improving Communication Efficiency (2017), arXiv:1610.05492 [cs]
[1b] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data (2017), arXiv:1602.05629 [cs]
[2] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, Federated Learning (2019), Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 13, no. 3, pp. 1–207
[3] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, Analyzing Federated Learning through an Adversarial Lens (2019) arXiv:1811.12470 [cs, stat]
[4] C. Yu et al., Distributed Learning over Unreliable Networks (2019), arXiv:1810.07766 [cs]
[5] K. Bonawitz et al., Towards Federated Learning at Scale: System Design (2019), arXiv:1902.01046 [cs, stat]
[6] D. J. Beutel et al., Flower: A Friendly Federated Learning Research Framework (2021), arXiv:2007.14390 [cs, stat]
