SUPERCOMPUTING FOR ARTIFICIAL INTELLIGENCE — 05

Distributed Deep Learning with Horovod

Scaling Deep Learning on a Supercomputer using Horovod

Jordi TORRES.AI
Towards Data Science
12 min read · Dec 4, 2020


MareNostrum supercomputer, Barcelona Supercomputing Center (image from BSC)

[This post will be used in the master course Supercomputers Architecture at UPC Barcelona Tech with the support of the BSC]

In the previous post we explored how to scale training to multiple GPUs in a single server with TensorFlow using tf.distribute.MirroredStrategy(). Now, in this post, we will use the Horovod API to scale training across multiple servers following a data parallelism strategy.
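As a quick reminder of that single-server approach, here is a minimal sketch; the tiny model and the synthetic data are placeholders for illustration only, not the example from the previous post:

```python
import numpy as np
import tensorflow as tf

# Single-server, multi-GPU data parallelism (covered in the previous post):
# MirroredStrategy replicates the model on every GPU of this machine and
# averages the gradients across the replicas after each batch.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model: any Keras model built inside the scope is replicated.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# Synthetic data, just to keep the sketch self-contained.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=128, epochs=2)
```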

1. Horovod

Uber Engineering introduced Michelangelo, an internal ML-as-a-service platform that makes it easy to build and deploy machine learning systems at scale. Horovod, a component of Michelangelo, is an open-source distributed training framework for TensorFlow, PyTorch, and MXNet. Its goal is to make distributed Deep Learning fast and easy to use: it relies on a ring-allreduce algorithm and requires only a few lines of modification to user code. Horovod is available under the Apache 2.0 license.
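To give an idea of what those few lines of modification look like, the sketch below follows the usage pattern documented for Horovod's Keras API; the tiny model and the synthetic data are placeholders:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# 1) Initialize Horovod (the usual setup is one process per GPU).
hvd.init()

# 2) Pin each process to a single GPU, identified by its local rank.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model, just for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# 3) Scale the learning rate by the number of workers and wrap the optimizer
#    so that gradients are averaged across workers with ring-allreduce.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 4) Broadcast the initial variables from rank 0 so that every worker
#    starts training from the same state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# Synthetic data, just to keep the sketch self-contained.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=128, epochs=2,
          callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

A script like this would typically be started with Horovod's launcher, for example `horovodrun -np 4 python train.py` on a single node, or with a host list to spread the processes over several servers (the script name is just an example).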

A data-parallel distributed training paradigm

Conceptually, the data-parallel distributed training paradigm under Horovod is straightforward (a code sketch of these steps follows the list):

1. Run multiple copies of the training script; each copy reads a chunk of the data, runs it through the model, and computes the model updates (gradients).
2. Average the gradients among those multiple copies.
3. Update the model with the averaged gradients.
4. Repeat from step 1.
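These steps map almost one-to-one onto a custom training loop. The sketch below follows the pattern documented for horovod.tensorflow; the model and the data are again synthetic placeholders, and in practice each process would also be pinned to its own GPU as in the previous sketch:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Every worker builds the same tiny placeholder model ...
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
opt = tf.keras.optimizers.SGD(0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

# ... and reads its own chunk of the data (here just synthetic numbers).
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

@tf.function
def train_step(features, labels, first_batch):
    # Step 1: each copy runs its chunk of data through the model
    # and computes local gradients.
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(features, training=True))
    # Step 2: DistributedGradientTape averages the gradients across
    # all workers with ring-allreduce.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    # Step 3: every worker applies the same averaged update.
    opt.apply_gradients(zip(grads, model.trainable_variables))
    # After the very first step, broadcast rank 0's variables so that
    # all workers continue from identical weights and optimizer state.
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss

# Step 4: repeat.
for batch in range(4):
    loss = train_step(x, y, batch == 0)
```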

Professor at UPC Barcelona Tech & Barcelona Supercomputing Center. Research focuses on Supercomputing & Artificial Intelligence https://torres.ai @JordiTorresAI