Go Federated with OpenFL

Put your Deep Learning pipeline on Federated rails

Igor Davidyuk
Towards Data Science


Introduction

OpenFL is an open-source framework for Federated Learning (FL) developed at Intel. FL is a technique for training statistical models on sharded datasets distributed across several nodes. The data may not be identically distributed between shards and cannot be moved between nodes due to privacy or legal concerns (laws such as HIPAA or GDPR), the size of the dataset, or other reasons. OpenFL is designed to solve so-called cross-silo federated learning problems, where data is split between organizations or remote data centers.

OpenFL aims to provide an effective and secure infrastructure for data scientists. The extensible nature of OpenFL should ease and democratize research in the field of Federated Learning.

With the v1.2 update, the OpenFL team endeavors to make the framework easier to learn and to decouple the procedure of setting up a Federation from using it to run FL experiments. We worked on minimizing user entry points, simplifying the process of establishing a Federation, and streamlining the registration and execution of FL experiments.

The v1.2 update introduces user sessions, long-living components that allow running several subsequent experiments while reusing existing connections, and an interactive Python API designed to let data scientists adapt their single-node training code with minimal changes, preserving the single-node experience.

Working With Remote Data Easily

Consider a group of nodes, each equipped with compute and holding a unique dataset. Such a group may form a Federation if the data they hold allows solving a particular data science problem and the people controlling these nodes are willing to work on this problem collaboratively (or at least allow local training on their data).

Let’s imagine several node owners agreed to create a Federation, thus making their unique datasets shards of a virtual global dataset. Now let us take a closer look at the most important part: the data shards. Chances are the shards are kept in different formats, especially if the nodes belong to different organizations. Data shards are heterogeneous (this is the reason we want to combine them in the first place): they differ in their origin, on-disk structure, labeling schema, and so on. To accommodate this diversity and describe a single experiment that uses all the shards, we have to choose one of the following:

  1. The data loading procedure in an experiment should include switch-case logic to load different data shards. This not only breaks the single-node experience of defining an experiment but also implies that the data scientist knows the structure of every dataset shard.
  2. Dataset shard owners agree on a common data structure and prepare their data accordingly. Although the data loading procedure in experiments would then be the same for all dataset shards, their owners must write data reading and dumping scripts and keep an extra copy of their dataset for every agreed data interface.

OpenFL tries to take the best from both approaches by including the data preparation procedure in the federation setup pipeline. OpenFL now provides the Shard Descriptor interface for dataset shard owners. It allows defining a data reading procedure and describing a data acquisition method compliant with the unified data interface. Instead of dumping a formatted copy of the dataset shards, the Shard Descriptor provides a single data access method for experiments on all the nodes in the Federation. In this scheme, data samples are loaded only at runtime.

To sum up, the Shard Descriptor is introduced to tackle data heterogeneity and provide a unified data interface for data scientists defining FL experiments. Dataset shard owners can implement their own data reading and preprocessing routines and even apply differential privacy techniques if needed.

Main Usage Scenario

In this section, we will walk through the main steps required to set up a Federation and conduct an FL experiment using OpenFL to understand the workflow.

Setting up a Federation

Suppose a group of data owners has agreed to work together on a data science problem and their labeled datasets fit this purpose. The first step is to install OpenFL on all the machines that will be used for Federated model training; we will refer to these machines as ‘collaborator nodes’ from here on.

OpenFL may be installed from PyPI (a conda package and a Docker image are also available):

$ pip install openfl

Then the data owners need to implement Shard Descriptor Python classes.
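As a rough sketch, a Shard Descriptor is a class that exposes a sample-access method plus a few properties describing the shard. The snippet below mimics the shape of the interface from the OpenFL v1.2 tutorials with a standalone class over in-memory data, without importing openfl; the method and property names (`__getitem__`, `__len__`, `sample_shape`, `target_shape`, `dataset_description`) follow those tutorials, but treat the exact base class and signatures as assumptions and check the docs for your version.

```python
# Standalone mimic of an OpenFL v1.2-style Shard Descriptor.
# A real implementation would subclass the ShardDescriptor base class
# shipped with openfl instead of being a plain class.

import numpy as np


class RandomShardDescriptor:
    """Serves random 8x8 'images' with binary labels, read at access time."""

    def __init__(self, num_samples: int = 100, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.samples = rng.random((num_samples, 8, 8))
        self.targets = rng.integers(0, 2, num_samples)

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int):
        # Samples are produced only when requested (at runtime);
        # no reformatted copy of the shard is ever dumped to disk.
        return self.samples[index], self.targets[index]

    @property
    def sample_shape(self):
        # Shapes are reported as lists of strings in the OpenFL tutorials.
        return ['8', '8']

    @property
    def target_shape(self):
        return ['1']

    @property
    def dataset_description(self) -> str:
        return 'Random 8x8 images with binary labels'


sd = RandomShardDescriptor(num_samples=10)
sample, target = sd[0]
print(len(sd), sample.shape, sd.sample_shape)
```

The shape and description properties are what lets the Director check that a shard conforms to the federation’s unified data interface before accepting its Envoy.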

At this point, the Federation participants must choose a central node that will serve as the experiment server and aggregate model updates from all the collaborator nodes.

The experiment server, called the Director service, should now be started on the central node using the OpenFL command-line interface (CLI). It takes the federation’s unified dataset format and an open port on the central node as parameters.

Once the Director is running, its ‘clients’, named Envoys, may be started on the collaborator nodes. Envoys are also started using the OpenFL CLI, with a config file and the Director’s network address as parameters. The config file should contain the import path of the local Shard Descriptor class and, if needed, its parameters. When started, an Envoy tries to establish a gRPC connection with the Director, which accepts the Envoy as a federation participant if its Shard Descriptor conforms to the unified data interface.
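For illustration, an Envoy config of the kind described above might look like the fragment below, which points the Envoy at a local Shard Descriptor class. The keys (`shard_descriptor`, `template`, `params`) and the module name are taken from the OpenFL v1.2 tutorials and should be treated as assumptions to verify against the docs for your version; the file would then be passed to the OpenFL CLI (`fx envoy start ...`) together with the Director’s address.

```yaml
# envoy_config.yaml -- hypothetical example
shard_descriptor:
  template: medical_shard_descriptor.MedicalShardDescriptor  # import path of the local class
  params:                                                    # passed to its constructor
    data_folder: ./data
```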

At this point, we have a star-like network of nodes: several Envoys connected to the Director and waiting for incoming experiments. We call such a network a Federation. The Federation may host several experiments and ceases to exist when the Director is shut down.

Registering an FL experiment

At this point, data scientists may register their experiments to be executed in the federation. OpenFL provides a separate frontend client for the Director as part of its Python API to register experiments. A data scientist may connect to the Director from another machine (including a laptop with a limited amount of computational resources) and define their experiment in an interactive environment such as a Jupyter notebook or a Python script. We will take a closer look at the frontend Python API in the next section.

Several users may be connected to the same Director, but registered experiments are executed one by one in the Federation (at least for OpenFL v1.2). When an experiment ends, the user may retrieve training artifacts: checkpoints and training logs.

What happens when an experiment is accepted by the Director

When a user submits an FL experiment to the Director, the experiment workspace and the information needed to reproduce the Python environment are packed into an archive and sent to the Director, which broadcasts the experiment archive to the Envoys involved. The Director then starts an Aggregator service that orchestrates the training process, while the Envoys start Collaborator processes that train the model on local data.

Green blocks are long-existing components. Yellow blocks are short-living components spawned for a particular experiment. Bidirectional arrows are gRPC connections. (Image by Author)

Check out the detailed diagrams in the OpenFL documentation.

Interactive Frontend API

In the previous section we surveyed interactions with OpenFL from the data holder’s side; now let us look at how data scientists may utilize the created infrastructure.

As stated above, OpenFL provides a distinct Interactive Python API designed to simplify FL experiments. With this update, we attempt to decouple the researcher’s interface from the process of establishing the network, so the effort of defining an FL experiment does not scale with the number of collaborators in a Federation.

Efforts to define an FL experiment do not scale with the number of collaborators in a Federation

Generally speaking, the frontend Python API allows users to register a statistical model and training logic so that the model may be trained in a Federated manner. Below we highlight the three main parts of an FL experiment as perceived in OpenFL:

  • Model. A model and an optimizer may be created and initialized in any way the user prefers. The OpenFL frontend API provides the Model Interface class to register these objects. OpenFL supports PyTorch and TensorFlow 2 as computational backends; the Model Interface is the place to choose one of them or provide a plugin supporting another DL framework. OpenFL itself is DL framework-agnostic.
  • FL tasks are the main units containing training logic; each describes a separate part of the training procedure, independent of the other parts, for instance ‘train’ or ‘validate’. The OpenFL Python API provides the Task Interface class for adapting standalone functions as FL tasks. A task must take a model, an optimizer, and a data loader as parameters and may optionally return calculated metrics; beyond these constraints, it is a regular Python function.
  • Data Loader. The final part of the FL experiment definition is preparing data. The OpenFL API contains a subclassable Data Interface that adapts Shard Descriptors on collaborator nodes and feeds tasks with local data. The difference between the Shard Descriptor and the data loader is worth spelling out: the Shard Descriptor handles data reading and preprocessing and may be unique to every Envoy, while the data loader is defined once per FL experiment in the data scientist’s environment and contains augmentation and batching logic.
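The task-registration pattern at the heart of this design can be sketched with tiny stand-in classes. The real `ModelInterface` and `TaskInterface` live in OpenFL’s interactive API and take more parameters (for example, a framework plugin); the stand-ins below only reproduce the overall shape, registering a plain Python function as a named FL task via a decorator, so treat all names and signatures here as assumptions rather than OpenFL’s actual API.

```python
# Toy stand-ins illustrating the three-part experiment definition.
# Not OpenFL's real classes: only the registration pattern is mimicked.

class ModelInterface:
    """Registers the user's model and optimizer objects."""
    def __init__(self, model, optimizer):
        self.model, self.optimizer = model, optimizer


class TaskInterface:
    """Adapts standalone functions as named FL tasks via a decorator."""
    def __init__(self):
        self.tasks = {}

    def register_fl_task(self, **arg_binding):
        # arg_binding maps the function's parameters to the runtime
        # objects (model, optimizer, data loader) injected per round.
        def decorator(fn):
            self.tasks[fn.__name__] = (fn, arg_binding)
            return fn
        return decorator


TI = TaskInterface()


@TI.register_fl_task(model='model', data_loader='train_loader',
                     optimizer='optimizer')
def train(model, train_loader, optimizer):
    # A task takes model, optimizer, and data loader as parameters
    # and may return metrics; otherwise it is a regular function.
    total = sum(x * model['weight'] for x in train_loader)
    return {'train_sum': total}


MI = ModelInterface(model={'weight': 2}, optimizer=None)
fn, binding = TI.tasks['train']
metrics = fn(MI.model, [1, 2, 3], MI.optimizer)
print(metrics)
```

Because the task body is just a regular function, a user can move existing single-node training code into it mostly unchanged; the framework supplies the registered objects at run time.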

OpenFL is a Deep Learning framework-agnostic library

When all three parts of an FL experiment are implemented, the user employs two controlling objects, the Federation and the Experiment, to register the experiment and oversee its execution.

The Federation object is a wrapper around the Director’s client; it connects a particular notebook to a Federation. A connected Federation object allows the user to examine the set of connected Envoys and can also provide a dummy Shard Descriptor that mimics the remote ones. This enables local debugging of the experiment code, providing a single-node experience as if the data were accessible from the data scientist’s machine.

The Experiment object wraps the Model, Data, and Task interfaces and reports an experiment to a given Federation. It packs the local code together with the list of Python packages used and sends them to the Director. The Experiment object also provides methods to monitor the status of accepted experiments and to retrieve trained models.
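A typical frontend session built from these two objects might look like the sketch below. It is illustrative only: it requires openfl installed and a running Director, the class and method names are drawn from the v1.2 interactive-API tutorials and may differ in other versions, and `MI`, `TI`, and `fed_dataset` stand for the Model, Task, and Data interface objects described above.

```python
# Illustrative session sketch; names follow the v1.2 interactive-API
# tutorials and should be verified against the docs for your version.
from openfl.interface.interactive_api.federation import Federation
from openfl.interface.interactive_api.experiment import FLExperiment

# Connect this notebook to an existing Federation.
federation = Federation(client_id='frontend',
                        director_node_fqdn='director.example.com',
                        director_port='50051', tls=False)
federation.get_shard_registry()                  # examine connected Envoys
federation.get_dummy_shard_descriptor(size=10)   # mimic remote data locally

# Pack the workspace and submit it to the Director.
fl_experiment = FLExperiment(federation=federation,
                             experiment_name='unet_kvasir')
fl_experiment.start(model_provider=MI, task_keeper=TI,
                    data_loader=fed_dataset, rounds_to_train=5)
fl_experiment.stream_metrics()                   # monitor training progress
best_model = fl_experiment.get_best_model()      # retrieve a trained model
```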

The general intention behind bringing the Interactive API to OpenFL is to allow data scientists to wrap their existing one-node training pipelines and start FL experiments with minimal effort.

Summary

OpenFL development is moving towards a flexible and handy tool for data scientists, aiming to ease and accelerate research in the Federated Learning field.

You can check out a practical example of training a UNet model on the Kvasir dataset in a Federated manner with OpenFL, along with a manual on how to do that.
