Making Sense of Big Data

Scalable Machine Learning with Tensorflow 2.X

Large-scale ML training on single or multiple machines with no code changes, using Tensorflow 2.X features

Marcos Novaes
Towards Data Science
8 min read · Jan 20, 2021

One of the amazing facts about Machine Learning is that even the simplest models can yield good results if given enough computing power. This property has quickly popularized the adoption of distributed computing for Machine Learning workloads, and many distributed computing frameworks have emerged to meet the demand. These nascent ML frameworks had to work around a fundamental difficulty in distributed computing: the leap from a shared memory model (SMM) to a distributed memory model (DMM). This leap is a big deal, and it has caused the failure of many distributed frameworks. This article will focus on the approach to distributed computing taken by the Tensorflow platform, and in particular on the enhancements available with version 2.X. This new approach has succeeded, to a large extent, in bridging the gap between the shared and distributed memory models. Further, it has introduced a unified approach to heterogeneous hardware accelerators, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs). This article will discuss these recent advances, using the development of a distributed Generative Adversarial Network (GAN) as an example.

Introduction

The focus of this paper is to propose a seamless transition from a Shared Memory Model to a Distributed Memory Model when developing Machine Learning models. But let’s first discuss why this transition is so important and why it should be made in the first place. Most of us are only familiar with the Shared Memory Model. This is the model where you only have to deal with one machine, and you can pass variables freely among computing threads because all the cores in the machine have access to the same memory, hence the name “shared memory”. Now consider the Distributed Memory Model (DMM). In the distributed paradigm you have to be aware that the computing threads live on different machines, you often need to know their network addresses, and you also have to know how to move data among them. You have to consider data serialization, machine-native numerical formats, and so forth. So, the DMM is clearly more complex, and one can pose the question: why go there if it is so much harder?

The simple answer to this question is that, although the Shared Memory Model is much easier for developers, it comes with limitations. You have the advantage of the single-machine abstraction, but you are also limited by the number of cores and the amount of memory of a single machine. What if your model grows larger than the RAM currently available in a single machine? What if your model is compute bound and requires hundreds or thousands of cores? This is when you enter the realm of scalable machine learning, and when you need to endure the complexity of a distributed memory model in order to reap the benefits of virtually unlimited scalability.

A Tale of Two Accelerators

In the area of Machine Learning, the difference between the distributed and shared memory models becomes quite evident when developing for hardware accelerators such as NVIDIA’s Graphics Processing Units (GPUs) and Google’s Tensor Processing Units (TPUs). The recent introduction of high-end GPUs allows a good amount of memory and compute power to be available within the abstraction of a single machine, which benefits from the shared memory model. But if your model requires accelerator memory above 50 GB and hundreds of accelerator cores, then you will have to deal with the distributed memory model. Google Cloud TPUs, on the other hand, were designed for the distributed memory model from the start: each TPU accelerator has its own IP address and communication protocol (gRPC). Fortunately, the developer does not need to deal with the communication layer, as it is taken care of by the Accelerated Linear Algebra (XLA) layer and exposed to developers through the Tensorflow API. There is also XLA support for PyTorch, but this article will focus solely on the Tensorflow interface.

Still, the distributed nature of TPUs can surprise a developer who is used to the shared memory model. For example, in the TPU architecture you typically don’t pass data back and forth using variables in shared memory; instead, you serialize the training data as batches of records in a format called TFRecord. This difference deters many developers who are not familiar with data serialization. However, adopting TFRecord serialization brings many benefits, as it is based on the simple but effective “protobuf” format that has been used at scale at Google for many years. The data serialization step would be needed for large-scale deployment in any case, whether using GPU or TPU clusters, so it is a “necessary evil” of adopting a distributed computing architecture. The following illustration shows the basic differences between the shared and distributed architectures when using GPU or TPU accelerators.

Image by the author
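To make the serialization step concrete, the sketch below writes a couple of records to a TFRecord file and reads them back as a tf.data pipeline. The file name, feature names, and image size are illustrative assumptions, not details from the article:

```python
import tensorflow as tf

def serialize_example(image_bytes, label):
    # Wrap raw image bytes and an integer label in a tf.train.Example protobuf.
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# Write two toy records (28x28 uint8 "images") to a hypothetical file.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for image_bytes, label in [(b"\x00" * 28 * 28, 0), (b"\x01" * 28 * 28, 1)]:
        writer.write(serialize_example(image_bytes, label))

def parse_example(record):
    # Declare the expected features and decode the raw bytes back into a tensor.
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, spec)
    image = tf.reshape(tf.io.decode_raw(parsed["image"], tf.uint8), [28, 28])
    return image, parsed["label"]

# The resulting tf.data pipeline is what gets fed to the accelerator in batches.
dataset = tf.data.TFRecordDataset("train.tfrecord").map(parse_example).batch(2)
```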

Other than the data serialization step, there are other unique aspects to watch out for when adopting distributed computing for machine learning. The Tensorflow API has succeeded, to some extent, in hiding the complexity of the interprocess communication that happens during model training. However, in Tensorflow version 1.X access to the TPU accelerator was possible only through the TPUEstimator interface, and this severely limited the types of machine learning applications that could be developed for TPUs. The TPU interface in TF 1.X basically expects to be passed exactly one model (network), which is then compiled into TPU format and shipped to the TPU for training. This is shown in the figure below:

Image by the author

This architecture greatly reduces communication with the CPU and thus improves performance. But in version 1.X this feature was limited to a single model, which made it very hard to develop multi-model ML applications such as reinforcement learning, Generative Adversarial Networks (GANs), and many others. For example, implementing a GAN required an ingenious workaround to pack the generator and discriminator models into a single convolutional network, as is done in TF-GAN, but such a framework makes it hard for developers to build their own customizations. In TF-GAN the generator and discriminator networks are wrapped into a CNN, which can then be trained as a single model with TPUEstimator. TF-GAN hides the implementation details behind an interface called GANEstimator; this is how it manages to run on TPUs under TF 1.X.

However, developers frequently need to alter the network logic to suit their applications. For example, recent advances in GAN modeling utilize a custom “encoder” component, which is actually a third model trained alongside the discriminator and generator models. There is also a lot of research on optimizing the “latent space” that is used to generate the image space. Such customizations are not possible in TF-GAN. Fortunately, Tensorflow 2.X introduces custom training loops, which make it possible to train any number of models synchronously using the GradientTape construct, as sketched below.
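As an illustration of a custom training loop, the sketch below updates a generator and a discriminator in the same step with two GradientTape instances. The models, optimizers, loss, and latent dimension are illustrative assumptions, not code from the article:

```python
import tensorflow as tf

# Hypothetical loss; any GAN loss formulation could be substituted.
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(generator, discriminator, gen_opt, disc_opt, real_images, latent_dim=128):
    noise = tf.random.normal([tf.shape(real_images)[0], latent_dim])
    # Two tapes record the forward pass so each model gets its own gradients.
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        gen_loss = cross_entropy(tf.ones_like(fake_logits), fake_logits)
        disc_loss = (cross_entropy(tf.ones_like(real_logits), real_logits) +
                     cross_entropy(tf.zeros_like(fake_logits), fake_logits))
    # Each model is updated by its own optimizer, synchronously in one step.
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    gen_opt.apply_gradients(zip(gen_grads, generator.trainable_variables))
    disc_opt.apply_gradients(zip(disc_grads, discriminator.trainable_variables))
    return gen_loss, disc_loss
```

Nothing prevents a third model, such as an encoder, from being added to the same step with its own tape and optimizer.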

Tensorflow 2.X also introduces the concept of compiled functions, defined using the @tf.function decorator. When Tensorflow encounters such a function, it uses the remarkable “AutoGraph” capability to automatically derive a computational graph, which can then be compiled by XLA and shipped to the TPU for execution. This feature is shown below:

Image by the author
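As a minimal sketch of AutoGraph at work (the function name and logic are illustrative), ordinary Python control flow inside a @tf.function is rewritten into graph operations when the function is traced:

```python
import tensorflow as tf

@tf.function
def scaled_mean(x):
    m = tf.reduce_mean(x)
    # The Python `if` on a tensor condition is rewritten into tf.cond by AutoGraph,
    # so the whole function body can be compiled as a single graph.
    if m > 0.0:
        result = m * 2.0
    else:
        result = m
    return result

print(scaled_mean(tf.constant([1.0, 2.0, 3.0])))  # traced once, then runs as a graph
```

The first call traces the function into a graph; subsequent calls with compatible inputs reuse the compiled graph, which is what allows the body to be shipped to an accelerator.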

The introduction of compiled functions has drastically enhanced the flexibility of TPUs, making them comparable to GPUs in flexibility and ease of development. There are several other new features in Tensorflow 2.X that make it possible to develop ML models that run transparently on GPUs or TPUs without significant modification. The most important of these is the tf.distribute.Strategy concept. A strategy object defines the distributed computing paradigm in use and makes it possible to run compiled functions in parallel across replicas, making it fairly easy for the developer to implement distributed training techniques such as model parallelism and data parallelism. Tensorflow 2.X also extends tf.data to make it easy to specify how data is distributed among the cores. With these additions, the developer can specify how both the computation and the data are distributed. Perhaps most importantly, the tf.distribute.Strategy and tf.data objects work the same way for GPUs and TPUs, enabling a “single code path” for model development, which is extremely valuable, as discussed in the next section. A minimal sketch of this pattern follows.
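The sketch below shows one possible shape of that single code path: a toy model and dataset (both illustrative assumptions) trained with a custom step that the strategy dispatches to every replica. Only the strategy construction would change between a GPU and a TPU run:

```python
import tensorflow as tf

# Choose the distribution paradigm. MirroredStrategy uses the local GPU(s);
# a TPUStrategy (shown later) would plug in here without touching the rest.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables (weights, optimizer slots) are created under the strategy.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(0.01)

# A toy in-memory dataset, distributed so each replica receives a shard of every batch.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 8]), tf.random.normal([256, 1]))).batch(32)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(features, labels):
    def step_fn(features, labels):
        with tf.GradientTape() as tape:
            # Per-replica loss; scaling by the global batch size is omitted for brevity.
            loss = tf.reduce_mean(tf.square(model(features, training=True) - labels))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    # strategy.run executes step_fn in parallel on every replica (GPU or TPU core).
    per_replica_loss = strategy.run(step_fn, args=(features, labels))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)

for features, labels in dist_dataset:
    train_step(features, labels)
```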

The Eager Beaver

Another great enhancement in Tensorflow 2.X is that “eager mode” execution is now the default. That means you can step through your model in an interactive environment, such as a Jupyter Notebook, and your code will actually execute step by step as expected. Eager mode is the way you expect things to behave in a shared memory model, so having it as the default greatly contributes to ease of use. Developers typically rely on eager mode to observe their code and variables during execution, such as when debugging the model. However, when Tensorflow encounters a function marked with @tf.function, it reverts to a delayed (graph) execution model, because a @tf.function needs to be compiled for remote execution. The delayed execution model is harder to debug, since you have to rely on tracing rather than interactive execution.

This is where the “single code path” strategy can help. By developing a model along the common GPU/TPU code path made possible by Tensorflow 2.X, the developer can initially run in eager mode on a GPU in the shared memory model, and then, just by switching the tf.distribute.Strategy object, run the same code in distributed mode on TPUs. This practice results in an efficient methodology for model development, which leverages both the development ease of the shared memory model and the scalability of the distributed memory model. The method is as follows:

  1. Develop your model along the common GPU/TPU code path using the tf.distribute.Strategy object
  2. Test and debug your model in eager mode using the GPU path
  3. Deploy your model to TPUs in delayed (graph) execution mode simply by switching the tf.distribute.Strategy object

This method also allows developers to optimize the use of accelerator resources. There is no need for a high-powered GPU during the development phase; a low-end GPU such as an NVIDIA K80 should suffice for basic model development and testing. Once tested, the model can be deployed to TPUs by changing one line of code (the tf.distribute.Strategy object). This method is illustrated below:

Image by the author
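As a concrete sketch of that switch (the debug flag and TPU name are illustrative assumptions, not from the article), only the strategy construction differs between the GPU debugging run and the TPU deployment; everything written against the strategy stays the same:

```python
import tensorflow as tf

DEBUG_ON_GPU = True  # steps 1-2: develop and test on a GPU in eager mode

if DEBUG_ON_GPU:
    # Optionally force @tf.function bodies to run eagerly for interactive debugging.
    tf.config.run_functions_eagerly(True)
    strategy = tf.distribute.MirroredStrategy()  # local GPU(s), shared memory model
else:
    # Step 3: same code, distributed memory model on a Cloud TPU ("my-tpu" is hypothetical).
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

# Model building, data distribution, and the training loop are written once
# against `strategy` and do not change between the two branches.
```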

Conclusion

There are important trade-offs to consider when choosing between a Shared and a Distributed Memory Model for ML workloads. While it is easier to develop for a Shared Memory Model, the Distributed Memory Model scales better in both performance and memory capacity. While it has traditionally been harder to program in a Distributed Memory Model, the recent advances in Tensorflow 2.X make it much easier to train ML models this way, and the extra effort required to program for the Distributed Model is richly rewarded by the gains in performance.

Dr. Novaes advises Google’s customers on the design of large-scale solutions using the Google Cloud Platform. All opinions are his own.