Distributed Deep Learning Training with Horovod on Kubernetes

Yifeng Jiang
Towards Data Science
7 min read · Sep 16, 2020


You may have noticed that even a powerful machine like the Nvidia DGX is not fast enough to train a deep learning model quickly, not to mention the long wait just to copy data into the DGX. Datasets are getting larger, GPUs are disaggregated from storage, and workers with GPUs need to coordinate for model checkpointing and log saving. Your system may grow beyond a single server, and teams want to share both GPU hardware and data easily.

Enter distributed training with Horovod on Kubernetes. In this blog, I will walk through the setup to train a deep learning model in a multi-worker distributed environment with Horovod on Kubernetes.

Horovod with TensorFlow on Kubernetes

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Open sourced by Uber, Horovod has shown that, with only small code changes, it scales single-GPU training to run across many GPUs in parallel.
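The required changes to a Keras training script are indeed small. As a rough sketch of the standard Horovod pattern (not the exact code used later in this post, and with a toy model for illustration):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; each worker process learns its rank and the world size.
hvd.init()

# Build the model as usual, scale the learning rate by the number of workers,
# and wrap the optimizer so gradients are averaged across workers.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

# Broadcast the initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]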

Horovod scaling efficiency (image from Horovod website)

As an example, I will train a movie review sentiment model using Horovod with TensorFlow and Keras. Although Keras itself supports distributed training natively, I found it a little more complex and less stable compared to Horovod.

Customers often ask me how to allocate and schedule GPUs among team members in such an environment. This becomes even more important in a multi-server environment. I have heard of solutions ranging from a timetable in Excel (awful, but still surprisingly common) to Python scripts, Kubernetes and commercial software. I will use Kubernetes because it provides a nice interface to run many application containers, including deep learning, on top of a cluster.

A fast shared storage/filesystem is critical to simplify distributed training. It is the glue that holds together the different stages of your machine learning workflows, and it enables teams to share both GPU hardware and data. I will use FlashBlade S3 for hosting the dataset, and FlashBlade NFS for checkpointing and storing TensorBoard logs.

Below is the architecture of this setup:

Distributed training with Horovod for TensorFlow on Kubernetes and FlashBlade

Deploy Horovod on Kubernetes

In a multi-worker Horovod setup, a single primary node and multiple worker nodes coordinate to train the model in parallel. Horovod uses MPI and SSH to exchange and update model parameters. One way to run Horovod on Kubernetes is to use Kubeflow and its mpi-job library, but I found it overkill to introduce Kubeflow just for this purpose, as Kubeflow itself is a big project. For now, let's keep it simple.

We need to install MPI and SSH first. Horovod provides an official Dockerfile for this, which I have customised to fit my needs. While the MPI and SSH setup can be baked into the Docker image, we still need to configure passwordless SSH authentication between the Horovod pods. This is not a hard requirement, but to keep the example concise, I use a Kubernetes persistent volume (PV) to store my SSH configuration and mount it on all containers at /root/.ssh.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: horovod-ssh-shared
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: pure-file

Note the volume uses the pure-file storage class (backed by FlashBlade NFS) with the ReadWriteMany access mode. In the same way, I also create another PV called tf-shared for checkpoints and TensorBoard logs. I mount these PVs on all the containers:

volumeMounts:
  - name: horovod-ssh-vol
    mountPath: /root/.ssh
  - name: tf-shared-vol
    mountPath: /tf/models

I use a Kubernetes init container to run an init-ssh.sh script that generates the passwordless SSH authentication configuration before the Horovod primary container starts.

initContainers:
  - name: init-ssh
    image: uprush/horovod-cpu:latest
    volumeMounts:
      - name: horovod-ssh-vol
        mountPath: /root/.ssh
    command: ['/bin/bash', '/root/init-ssh.sh']

The content of init-ssh.sh is something like this:

# Generate an SSH key pair and authorize it once; skip if already configured.
if [ -f /root/.ssh/authorized_keys ]
then
  echo "SSH already configured."
else
  ssh-keygen -t rsa -b 2048 -N '' -f /root/.ssh/id_rsa
  cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
  chmod 700 /root/.ssh
  chmod 600 /root/.ssh/authorized_keys
fi

I then declare two Kubernetes Deployments: one for the primary, another for the workers. While the primary does nothing, the workers start an SSH server in their pods.

- name: horovod-cpu
  image: "uprush/horovod-cpu:latest"
  command: ["sh", "-c", "/usr/sbin/sshd -p 2222; sleep infinity"]

With these, the root user on the primary pod can SSH to the workers without a password. The Horovod setup is ready.

Accessing Datasets on S3 in TensorFlow

My dataset is stored on FlashBlade S3 as TensorFlow record (TFRecord) files. I want my TensorFlow script to access it directly instead of downloading it to a local directory, so I added several environment variables, backed by a Kubernetes Secret, to the deployments:

env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: tf-s3
        key: access-key
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: tf-s3
        key: secret-key
  - name: S3_ENDPOINT
    value: 192.168.170.11
  - name: S3_USE_HTTPS
    value: "0"

Later in my TensorFlow script, I will use these variables for S3 authentication:

import os
import s3fs
import tensorflow as tf

AUTO = tf.data.experimental.AUTOTUNE

endpoint_url = f"http://{os.environ['S3_ENDPOINT']}"
kwargs = {'endpoint_url': endpoint_url}
s3 = s3fs.S3FileSystem(anon=False, client_kwargs=kwargs)
# List all training tfrecord files
training_files_list = s3.ls("s3://datasets/aclImdb/train/")
training_files = [f"s3://{f}" for f in training_files_list]
# Now let's create tf datasets
training_ds = tf.data.TFRecordDataset(training_files, num_parallel_reads=AUTO)
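Before training, the TFRecords still need to be parsed into tensors. A minimal sketch, assuming a hypothetical record layout with a text field and an integer label (the real feature spec depends on how the files were written):

# Hypothetical feature names; adjust to match the actual TFRecord schema.
feature_spec = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    return example['text'], example['label']

# Parse, shuffle, batch and prefetch the dataset for training.
training_ds = (training_ds
               .map(parse_example, num_parallel_calls=AUTO)
               .shuffle(10000)
               .batch(64)
               .prefetch(AUTO))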

FlashBlade S3 is very fast: a minimum deployment can consistently deliver up to 7GB/s read throughput at around 3ms latency. This should be good enough for many DL training workloads.

GPU Scheduling on Kubernetes

To let Kubernetes schedule pods based on GPU resource requests, we need to install the Nvidia k8s device plugin. This requires the nvidia-docker2 package to be configured as the default container runtime instead of regular Docker; follow the plugin's README to prepare your GPU nodes. The device plugin installation itself is straightforward with Helm. In my lab, I only install the plugin on nodes with Tesla GPUs, so I added a node label to my GPU nodes:

kubectl label nodes fb-ubuntu01 nvidia.com/gpu.family=tesla

helm install \
  --version=0.6.0 \
  --generate-name \
  --set compatWithCPUManager=true \
  --set nodeSelector."nvidia\.com/gpu\.family"=tesla \
  nvdp/nvidia-device-plugin

The plugin will be installed as a DaemonSet in the kube-system namespace. If everything went well, the GPU nodes should now show GPU capacity:

kubectl describe node fb-ubuntu01

Capacity:
  cpu:                32
  ephemeral-storage:  292889880Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             264092356Ki
  nvidia.com/gpu:     1
  pods:               110

We can then request GPU resources for the Horovod pods:

resources:
  limits:
    nvidia.com/gpu: 2 # requesting 2 GPUs
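Inside the training script, each Horovod process then typically pins itself to one of the allocated GPUs based on its local rank. A sketch of the standard Horovod pattern for TensorFlow 2:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU, selected by its local rank on the node.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')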

Prepare for Training

Next, I use a pre-train script to prepare the environment for training. The script uses the Kubernetes CLI to select the Horovod pods and then does the following:

  • Generate a pip-install.sh script to install Python dependencies on all pods.
  • Generate a horovod-run.sh script to start the Horovod job.
  • Copy source code and generated scripts from my workstation to the shared PV of the Horovod primary pod.

After running the pre-train.sh script, my primary pod will have these files in the shared PV:

root@horovod-primary-84fcd7bdfd-2j8tc:/tf/models/examples# ls
horovod-run.sh imdb-sentiment.py pip-install.sh pre-train.sh requirements.txt

Here is an example of a generated horovod-run.sh:

mkdir -p /tf/models/aclImdb/checkpoints

AWS_LOG_LEVEL=3 horovodrun -np 3 \
  -H localhost:1,10-244-1-129.default.pod:1,10-244-0-145.default.pod:1 \
  -p 2222 \
  python imdb-sentiment.py

This script runs the training job on three pods in parallel, with each pod using 1 CPU. Here we don’t use GPUs because the model is very small.

Because everything is automated, each time I change the training code in VS Code (I use the Remote extension to write code on the server over SSH), I run the following to start the training job:

  1. Run the pre-train.sh script to regenerate and copy the source code.
  2. Exec into the Horovod primary pod.
  3. Run pip-install.sh to install dependencies on all pods.
  4. Run horovod-run.sh to start the Horovod training job.

So far this workflow works well for me.

Horovod with TensorFlow

The modifications required to use Horovod with TensorFlow in the training script are well documented here.

My example code is an end-to-end runnable script to train a movie review sentiment model. It is similar to single-node training except:

  • The code runs on all Horovod pods in parallel.
  • Each pod only processes part of the total training and validation batches, so shard the dataset (use tf.data.Dataset.shard()) and set steps_per_epoch and validation_steps accordingly when calling model.fit.
  • Some tasks, such as saving checkpoints, TensorBoard logs and the model, should only run on the primary pod (hvd.rank() == 0) to prevent the other workers from corrupting them (see the sketch after this list).
  • Because the pods can run on any server in the Kubernetes cluster (GPU nodes only if requesting GPU resources), we should save checkpoints, TensorBoard logs and the model on a persistent volume (FlashBlade NFS in my example) or in object storage (e.g., FlashBlade S3).
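A minimal sketch of how these pieces fit together. Here model, validation_ds, train_batches and valid_batches are assumed to be defined elsewhere in the script; the checkpoint path matches the directory created by horovod-run.sh, while the TensorBoard log path is illustrative:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Each worker trains on its own shard of the data.
training_ds = training_ds.shard(hvd.size(), hvd.rank())
validation_ds = validation_ds.shard(hvd.size(), hvd.rank())

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Only rank 0 writes checkpoints and TensorBoard logs to the shared volume,
# so the workers do not overwrite each other's files.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint(
        '/tf/models/aclImdb/checkpoints/ckpt-{epoch}'))
    callbacks.append(tf.keras.callbacks.TensorBoard(
        log_dir='/tf/models/aclImdb/logs'))  # illustrative log path on the shared PV

# Each worker only steps through its share of the batches per epoch.
model.fit(training_ds,
          validation_data=validation_ds,
          steps_per_epoch=train_batches // hvd.size(),
          validation_steps=valid_batches // hvd.size(),
          callbacks=callbacks,
          epochs=10)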

I will skip the details of the training code here. Please refer to my example code.

Below is a sample of the running output:

Example running output

Looking at my Kubernetes monitoring UI, I can see the CPU usage of all the Horovod pods jump up, which indicates the training job is running on all the pods in parallel.

Pod resource usage during Horovod training

Summary

Distributed training is the future of deep learning. Using Horovod and Kubernetes, we walked through the steps to quickly spin up a dynamic distributed deep learning training environment. This enables deep learning engineers and researchers to easily share, schedule and fully leverage expensive GPUs and the data.

Shared storage like FlashBlade plays an important role in this setup: it makes it possible to share both resources and data, and it relieves me from manually saving and aggregating checkpoints, TensorBoard logs and models. Horovod with Kubernetes and FlashBlade just makes my deep learning life much easier.

No more timetables in Excel!
