
Motivated by the desire to accelerate training, a common practice in the world of deep learning today is to distribute the training workload across multiple workers (e.g. GPUs). In data distributed training, each of the workers performs a training step on a different subset (local batch) of the training data in parallel, broadcasts its resultant gradients to all of the other workers, and updates its model weights based on the gradients calculated by all of the workers.
In a previous post we expanded on some of the intricacies and potential challenges of data distributed training. In this post we will focus on the challenge of performing Hyperparameter Tuning in a data distributed training setting.
I would like to thank Dennis Kreinovich whose work on distributed training and mixed precision contributed to this post.
Introduction
An essential part of any deep learning project is Hyperparameter Tuning. The model hyperparameters are all of the control settings that we fix before running our training algorithm. They are distinct from the model parameters which are learned during training. Examples of common hyperparameters include:
- Optimizer settings: The choice of optimizer and the settings that control it, such as the learning rate.
- Model architecture: The choice of model layers, how they are combined together, and how they are configured. This includes settings such as the number of channels per layer, the sizes of convolutional kernels, regularization settings, dropout rates, batch normalization settings, the choice of activation functions, and more.
- Loss function: The choice of the loss function and its respective configuration settings.
Needless to say, the choice of hyperparameters can have a decisive impact on the success of our training algorithm. Hyperparameter Tuning (HPT) refers to the art of finding optimal values for these hyperparameters, that is, the values that will ultimately lead to an optimal training result. It should be noted that a complete search over all feasible combinations of hyperparameters would be virtually impossible. In practice, most teams will fix a large number of the hyperparameters and limit the HPT to a subset of the training controls. They may further simplify the problem by limiting the search space of the remaining hyperparameters.
Every successful deep learning project includes some form of HPT, even if it is not explicitly called that in name. The moment you try to train your model with an additional convolutional layer or channel, or with a different learning rate, regularization value, dropout rate or activation layer, just to see how it impacts the quality of your results, you are performing HPT. Of course the more you invest in your HPT strategy, the more likely you are to succeed at it. Here are some aspects that impact the strength of an HPT strategy.
1. Advanced Parameter Search Algorithms
Given a parameter search space, there are a number of ways one could go about performing the search. Trivial algorithms include random search and grid search. In random search we choose parameter combinations at random. In grid search we break the search space up into a grid and test the combinations at the intersection points of the grid. However, there are many more advanced algorithms, based on Bayesian optimization, that have been shown to perform much better than random search and grid search (e.g. see here). As of the time of this writing, HPT search algorithms for deep learning remain an active area of research.
2. Experiment Parallelization
Obviously, running experiments in parallel will speed up the HPT stage and enable you to reach conclusions regarding the optimal hyperparameters faster. However, the potential benefit can be much greater than the linear (times N) speed up due to running N experiments in parallel. This is due to the fact that parallelization increases the potential for early termination of experiments. Let us demonstrate this through a simple example. Suppose that we have defined our HPT algorithm as follows:
- Each experiment trains for at most 8 epochs (or fixed number of steps) and evaluates the accuracy after each epoch.
- After each epoch we compare the accuracy of each experiment to the accuracy results of all other experiments that have reached the same step. Any experiment with accuracy that is more than 10% worse than the highest calculated accuracy is terminated. When an experiment is stopped early its associated resources can be used for subsequent experiments.
- An experiment is declared the winner if it is the only remaining experiment or if it has completed the 8 epochs with the highest accuracy.
Now suppose that at a given stage in our HPT we are running 4 experiments that will result in the accuracies shown in the table below:

Were we to run all four trials in parallel we would terminate trial 3 after 1 epoch, trial 2 after 2 epochs, and trials 1 and 4 after 4 epochs, at which point trial 4 would be declared the winner. The given stage would run for a total of 11 epochs. On the other hand, were we to choose a random order for the experiments and run them sequentially, applying simple laws of probability we would find that the expected value of the total number of epochs is 19.6. In this case the parallelization led to expected savings of 44%. While this example is contrived, the possibility of savings of this magnitude is not far fetched. And the potential for savings only increases with the degree of parallelization.
3. Automation
Conceivably, one could define the parameter search space, choose an HPT algorithm, manually instantiate all experiments, and monitor their progress. However, manual implementation of HPT algorithms can be slow, tedious, and extremely error prone. It is far better to automate HPT. There are a number of tools and libraries that support HPT automation. These differ based on the support they provide for different training environments, search algorithms, parallelization, early stopping, and more. One of the most popular libraries (at the time of this writing) is Ray Tune, which supports a wide variety of training frameworks, cutting-edge search algorithms, and experiment parallelization. I hope to expand on HPT frameworks in a future post.
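To give a taste of what automation looks like, here is a minimal Ray Tune sketch with a function-based trainable, a small search space, and an ASHA scheduler for early termination. The training function, the search space, and the reported accuracy values are placeholders, not a recommendation for any particular model:
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    # Placeholder training loop: build and train your model here using
    # config["lr"] and config["dropout"], reporting accuracy after each epoch.
    for epoch in range(8):
        accuracy = 0.5 + 0.05 * epoch  # stand-in for a real evaluation
        tune.report(accuracy=accuracy)

analysis = tune.run(
    train_fn,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "dropout": tune.uniform(0.1, 0.5),
    },
    num_samples=8,  # number of experiments
    scheduler=ASHAScheduler(metric="accuracy", mode="max"),  # early termination
)
print(analysis.get_best_config(metric="accuracy", mode="max"))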
The focus of this blog post is on experiment parallelization or, more specifically, on the challenge of running parallel experiments in a distributed training setting. We will introduce this challenge in the following section.
The Challenge – HPT in Distributed Training Settings
As important as a strong HPT strategy is for general training, it is all the more important for distributed training. This is due to the following two factors:
- Complexity of Tuning: The net effect of data distributed training is a significant increase in the training batch size. Denoting the number of workers by k and the local batch size by b, the result of performing distributed training on k workers is that at each training step the model trains on a global batch of k*b samples. As discussed in our previous post, HPT can become more difficult as the batch size increases. More advanced optimizers (such as LAMB and LARS) may be required, as well as more delicate tuning of the optimizer settings. Moreover, the optimal hyperparameters may be highly dependent on the size of the global batch. The results of tuning for a global batch of size X may be very different from the results of tuning for a global batch of size Y.
- Cost of Tuning: The cost of training in a highly distributed setting can be quite high. Inefficiencies in your HPT strategy (such as using a primitive search algorithm) that might be tolerated in a single GPU setting are simply inexcusable in a high cost distributed setting. The cost we refer to here and in the rest of the post can be the price of using cloud based training services or the opportunity cost of overusing your company's in-house training resources.
As we saw above, one of the techniques of a strong HPT strategy is to parallelize the tuning experiments. Parallelization can both speed up the overall process and facilitate cost savings by enabling early termination of experiments. Such cost savings can be significant in a multi-worker setting. However, when trying to parallelize highly distributed training experiments you might find yourself up against a major challenge that we will demonstrate through the following simple example:
Suppose that, as in this paper, you have chosen to train on 256 GPUs and that a given stage of your HPT algorithm calls for running a meager 8 experiments in parallel. By simple calculation we find that you will need 2048 GPUs to run your HPT in parallel!! You may not have access to this number of GPUs, and even if you do, the cost per hour of running your HPT might be prohibitively large. For example, 2048 GPUs in the cloud could easily cost several thousands of dollars per hour.
Round Robin Experimentation
One option you could consider in place of experiment parallelization, is to perform each experiment one epoch at a time in round robin fashion. For example, in the case above we would run the first experiment on 256 GPUs for one epoch, then the second, then the third, up until the eighth experiment. At that point we would compare the interim results, decide which experiments to terminate, and then proceed to run the remaining experiments for a second epoch, then a third epoch, and so on. While this method would take 8 times as long as full parallelization, it enables the same type of early termination of experiments and results in running the same overall number of epochs. The problem with this approach is that the context switches between experiments, which includes capturing and saving the model state and rebuilding the computation graph, are likely to introduce significant overhead, especially in a distributed training setting. Thus the overall cost savings resulting from the ability to terminate experiments early will be lower than in the case of parallelization.
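Schematically, the round-robin loop might look like the sketch below. The run_one_epoch helper is a placeholder for the restore-checkpoint / train / save-checkpoint logic, and the 10% rule is interpreted here as a relative threshold:
NUM_EXPERIMENTS = 8
MAX_EPOCHS = 8

def run_one_epoch(experiment_id):
    # Placeholder: restore the experiment's state, train it for one epoch on
    # all available GPUs, save the state, and return the evaluation accuracy.
    return 0.9  # dummy value

experiments = list(range(NUM_EXPERIMENTS))
results = {e: [] for e in experiments}

for epoch in range(MAX_EPOCHS):
    for e in experiments:
        results[e].append(run_one_epoch(e))
    best = max(results[e][epoch] for e in experiments)
    # Terminate any experiment more than 10% worse than the current best
    experiments = [e for e in experiments if results[e][epoch] >= 0.9 * best]
    if len(experiments) == 1:
        break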
The solution we put forth in this post is to run parallel experiments on a smaller number of workers by using Large Batch Simulation.
Large Batch Simulation
In Large Batch Simulation (LBS) we simulate training on X workers using only Y (<X) workers, where Y is a divisor of X, by running with the same global batch size on the Y workers as we would on X workers. For example, suppose that in the example we saw in the previous section the local (per worker) batch size was 32 and that the global batch size was 8192 (32×256). Using LBS we could run, in parallel, each of the 8 HPT experiments on 32 GPUs (instead of 256) in such a way that each run would use a local batch of size 256 (instead of 32). The resultant size of that global batch would be 8192, the same as when we ran on 256 GPUs. In this way, when running 8 experiments in parallel (each on 32 GPUs), we would be using the same total number of 256 GPUs during HPT as during training. Here too the 8 experiments would take a lot longer than full parallelization (over 2048 GPUs). However, the savings in cost from early experiment termination would equal the savings in full parallelization. (In fact, it might even be a bit better due to the reduction in the number of gradient sharing messages.)
Were it possible to simply increase the local batch size on a training worker at will, implementing LBS would be trivial. However, in reality the size of the batch is limited by the available memory on the training worker. To overcome this we will use a technique called Gradient Aggregation.
Gradient Aggregation
In a typical training step we perform a forward pass to calculate the loss on the local batch, perform the backward pass to calculate the gradients, broadcast the results to the other workers, and then update the model weights taking into account the gradients calculated by all of the workers. When we perform gradient aggregation, rather than sharing and applying the gradients immediately, we collect them over a predefined number of steps and only then broadcast their average and update the weights. Denoting the local batch size by B, the effect of performing gradient accumulation over N steps would be the same as training with a local batch of size N×B. And running gradient aggregation over N steps with a local batch of size B on Y workers would have the same effect as training with a local batch of size B on X=N×Y workers.
This technique for simulating a large batch size relies on the linearity of the gradient calculation, that is, on the equivalence between the gradient of a batch of size K=N×B and the average of the gradients of N batches of size B.
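The following small sketch verifies this equivalence numerically on a toy model: the gradient of one large batch is compared to the average of the gradients of its equal-sized sub-batches (any loss with mean reduction over the batch behaves the same way):
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()  # mean reduction over the batch

x = np.random.rand(8, 4).astype('float32')  # one "large" batch of N*B = 8 samples
y = np.random.rand(8, 1).astype('float32')
model(x)  # build the model variables

def grads(xb, yb):
    with tf.GradientTape() as tape:
        loss = loss_fn(yb, model(xb))
    return tape.gradient(loss, model.trainable_variables)

full = grads(x, y)  # gradient of the full batch of size N*B

# Average of the gradients of N=4 sub-batches of size B=2
subs = [grads(x[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)]
avg = [tf.add_n(g) / 4.0 for g in zip(*subs)]

for f, a in zip(full, avg):
    print(np.allclose(f.numpy(), a.numpy(), atol=1e-6))  # expect True (up to float32 rounding)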

In summary, the solution we propose is to use Y workers to simulate a training session with N×Y workers, by performing gradient aggregation over N steps on each worker.
Large Batch Simulation Using Horovod
Horovod is a popular library for performing distributed training with wide support for TensorFlow, Keras, PyTorch, and Apache MXNet. The way Horovod works is by introducing gradient sharing into the gradient calculation step. While there are various ways to instantiate Horovod, one of the common ways is to wrap your training optimizer with a Horovod optimizer using the DistributedOptimizer API, as in the TensorFlow code snippet below. For more details see the Horovod documentation.
import horovod.tensorflow.keras as hvd
# Initialize Horovod
hvd.init()
# Pin GPU, build model, and configure optimizer
opt = ...
# Wrap optimizer with Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
Starting from version 0.21, the DistributedOptimizer API includes two new flags for programming gradient aggregation: backward_passes_per_step and average_aggregated_gradients. When backward_passes_per_step is set to a value greater than 1, Horovod will employ a gradient aggregator behind the scenes that accumulates gradients over the chosen number of steps before sharing them and using them to apply updates to the model weights. The average_aggregated_gradients flag determines whether to average the accumulated gradients before sharing them. In the code block below we show how to modify the Horovod setup code so as to increase the effective global batch size by a factor of 8:
import horovod.tensorflow.keras as hvd
# Initialize Horovod
hvd.init()
# Pin GPU, build model, and configure optimizer
opt = ...
# Wrap optimizer with Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(
    opt,
    backward_passes_per_step=8,
    average_aggregated_gradients=True)
Use with Caution
While modern day deep learning frameworks offer high level APIs for virtually everything you may want to do, it is always a good idea to develop a deep level understanding of what is going on under the hood, e.g. how the model layers work, what operations your chosen optimizer is running, and what is actually happening during a training step. This is especially true when you are trying to do something out of the ordinary such as LBS. Otherwise, you might find your program to be failing and have no idea why. Here are some examples of what to look out for:
Periodic updates to hyperparameters: When simulating a training environment with X=N×Y workers using Y workers, it is important to note that each step in the X-worker environment is equivalent to N steps in the Y-worker environment. Some training algorithms call for making periodic updates to the optimizer hyperparameters. For example, a common practice is to use a Learning Rate Scheduler to adjust the optimizer learning rate at fixed iteration steps. For the simulation to work properly we would need to modify the scheduling of hyperparameter changes by multiplying the predetermined steps at which to update by a factor of N.
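For example, a piecewise-constant learning rate schedule could be adapted as in the sketch below. The boundary and rate values are illustrative, and whether the stretching is needed depends on how your training loop counts steps relative to weight updates:
import tensorflow as tf

N = 8  # backward_passes_per_step in the simulation

# Original schedule, defined in X-worker steps: drop the LR at steps 1000 and 2000
boundaries = [1000, 2000]
values = [0.1, 0.01, 0.001]

# In the Y-worker simulation each weight update spans N optimizer steps,
# so the boundaries are stretched by the same factor.
simulated_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[b * N for b in boundaries],
    values=values)

opt = tf.keras.optimizers.SGD(learning_rate=simulated_schedule)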
Non-linear operations on the gradients: The LBS technique we have described relies on the linearity of the gradient calculation, that is, on the equivalence between the gradient of a batch of size N×B and the average of the gradients of N batches of size B. Some training algorithms may call for performing a non-linear operation on the gradients. One example of a non-linear operation is gradient clipping. So long as such non-linearities are performed only on the fully accumulated gradients (i.e. only on steps when the gradients are applied) you should have no issue. Note that the default behavior in TensorFlow 2.5 is to perform the clipping on the aggregated gradients (see here).
Updates to model parameters during the forward pass: The gradient aggregation technique we have described delays application of the gradient updates to the model weights by backward_passes_per_step steps. However, other than gradient updates, there may be additional updates to the model parameters that we need to consider. One example of a model parameter update that occurs during the forward pass is the recalculation of the moving statistics in Batch Normalization layers. Assuming that these statistics are taken from the primary (rank 0) worker (i.e. that we do not synchronize the batch normalization statistics across workers), we need to take into account that in the simulation environment the rank 0 worker will see backward_passes_per_step times the amount of data that the rank 0 worker would see in the original environment. Although batch normalization statistics tend to get sorted out rather quickly, this discrepancy might lead to differences in the evaluation metrics.
In the Appendix to this post we describe how to combine LBS with Mixed Precision training, a use case that further demonstrates the need for caution.
In the next section we will demonstrate the use of LBS in action on a number of simple experiments.
Experiments
The following experiments were run on Amazon EC2 G4 instances. For the single GPU experiments we used a g4dn.2xlarge instance and for the four GPU experiments we used a g4dn.12xlarge instance. The experiments were run on TensorFlow version 2.4.1 and Horovod version 0.21.0. For all experiments we ran the official Horovod tf.keras mnist example with the following changes:
- We applied the required changes described in the appendix to support LBS with mixed precision.
- For the mixed precision tests we made the following adjustments to the model, namely setting the mixed_float16 policy and keeping the final softmax activation in float32 (see the mixed precision documentation to understand the changes):
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Activation('softmax',
                               dtype='float32',
                               name='predictions')
])
- To run LBS we set the backward_passes_per_step setting to 4 and updated the learning rate accordingly, as in the sketch below.
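The optimizer setup for the LBS runs looked roughly like the following sketch. The choice of Adam, the base learning rate, and the linear scaling with the effective number of workers are assumptions here and should be adjusted to your own model:
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

backward_passes = 4  # each GPU simulates 4 workers

# Scale the learning rate by the effective number of workers, i.e. the real
# number of workers times the number of aggregation steps (linear scaling is
# an assumption, not a rule).
base_lr = 0.001
opt = tf.keras.optimizers.Adam(base_lr * hvd.size() * backward_passes)

opt = hvd.DistributedOptimizer(
    opt,
    backward_passes_per_step=backward_passes,
    average_aggregated_gradients=True)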
In the chart below we plotted the accuracy as a function of the number of epochs for four experiments: distributed training on 4 GPUs with and without the mixed precision setting, and the corresponding large batch simulation runs on a single GPU.

In all cases we see that the accuracy follows a similar convergence pattern. Although this was demonstrated on a simple model, we can see the potential of using both mixed precision training and large batch simulation.
Concluding Remarks
In this post we have focused on using LBS to solve the challenge of performing HPT in a highly distributed training environment. There are many additional scenarios in which LBS may be useful. Here are some examples.
Batch Size Tuning
Throughout the post we have assumed that both the number of training workers and the global batch size are fixed, and addressed the challenge of tuning other hyperparameters accordingly. Of course, in reality both can be viewed as hyperparameters that require tuning as well. The LBS technique we have discussed offers a creative way of tuning the batch size and assessing how the choice of the number of workers can impact the quality of training. This can be done by running parallel experiments on multiple machines, each with a different value for the backward_passes_per_step setting, as illustrated below.
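For instance, on identical machines with 8 GPUs each, a batch size sweep might enumerate configurations like the following. The numbers are illustrative and the linear learning rate scaling is an assumption:
gpus_per_trial = 8
local_batch = 32
base_lr = 0.001

# Each trial simulates a different number of workers by changing only the
# backward_passes_per_step value (here the learning rate is scaled linearly).
trial_configs = [
    {
        "backward_passes_per_step": passes,
        "effective_global_batch": gpus_per_trial * local_batch * passes,
        "learning_rate": base_lr * gpus_per_trial * passes,
    }
    for passes in [1, 2, 4, 8]
]

for cfg in trial_configs:
    print(cfg)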
Increasing Batch Size on Single Instance
There may be scenarios in which running with a large batch size may not be a forced side effect of data distributed training but rather a desired tool for improving training performance. In such cases LBS can be used to work around the batch size limitations imposed by memory constraints and run training on batch sizes that are as large as we want.
Dynamic GPU Orchestration
One of the challenges faced by modern day AI companies is how to most efficiently divide training resources among multiple teams and projects. Training resources are expensive and there is strong financial motivation to maximize their utilization. One common strategy is to allocate GPUs among teams dynamically based on teams’ needs and project priorities. The effect of such a strategy is that at different stages of training you may have access to a different number of workers.
A similar situation can arise if you rely on low cost "spot" instances for training. Many cloud service providers offer significant discounts for surplus compute instances. In AWS these are called Amazon EC2 Spot Instances, in Google Cloud they are called Preemptible VM Instances, and in Microsoft Azure they are called Low-Priority VMs. The tradeoff is that such instances can be terminated mid-use if demand for them increases. Once again you could find the number of workers allocated changing over the course of training.
Given the likelihood of such changes in resource allocation, it would be wise to design your training to account for this possibility. Specifically, at any given time you would want to take advantage of as many of the available workers as possible, but also be impervious to reallocation of a subset of them. One strategy could be to naively train on whatever number of workers are available. However, the global batch size will change based on the number of workers and as we have already seen, successful training in each scenario is not always a simple matter of adjusting the learning rate. An alternative strategy that is based on LBS, would be to maintain the same effective batch size across scenarios. Here is an example of how this would work:
Suppose you design your model to train on a maximum of 256 workers and with a local batch of size B. However, when you start your training you find that there are only 128 workers available. Using LBS you can begin your training on 128 workers with an effective local batch of size 2×B. Now suppose that an hour into training the number of workers allocated to you drops below 128. Once again, using LBS you can reset your training on 64 workers with an effective local batch of size 4×B. Later on, resources may begin to free up and you might be able to reset your training to run on 256 workers.
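The bookkeeping for this strategy amounts to keeping the global batch fixed and deriving the number of aggregation steps from the number of currently available workers, as in this small sketch (assuming the target worker count is divisible by the available count):
TARGET_WORKERS = 256  # the configuration the model was tuned for
LOCAL_BATCH = 32      # per-worker batch size B in the target configuration

def aggregation_steps(available_workers):
    # Keep the effective global batch at TARGET_WORKERS * LOCAL_BATCH by
    # aggregating gradients over more steps when fewer workers are available.
    assert TARGET_WORKERS % available_workers == 0
    return TARGET_WORKERS // available_workers

print(aggregation_steps(128))  # 2 -> effective local batch of size 2*B
print(aggregation_steps(64))   # 4 -> effective local batch of size 4*B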
In theory we might even be able to implement this strategy without ever needing to stop and restart the training by using Elastic Horovod, a compelling feature of Horovod that allows for uninterrupted training even in the event of changes in the number of allocated resources. However, at the time of this writing the Horovod APIs do not support dynamic updates to the backward_passes_per_step setting. For more information on the benefits of Elastic Horovod check out this recent post.
Summary
Data distributed training comes with a number of unique challenges. Fortunately, these challenges are also opportunities for innovation and creativity. In this post we have taken on the challenge of Hyperparameter Tuning in highly distributed training settings. We have shown how we can use Horovod’s gradient aggregation support to enable parallel experimentation and the potential cost efficiencies that come with it. Highly distributed training can be expensive and any opportunity to reduce cost should be fully explored.
Appendix: Combining Large Batch Simulation with Mixed Precision Learning
In Mixed Precision training we mix both 16-bit and 32-bit floating point types during training as opposed to using only 32-bit floating point types. This feature enables significant savings in both runtime and memory utilization.
In this post we will focus on the mixed precision support offered by TensorFlow, although most of what we will say should apply to other frameworks as well. The TensorFlow guide on mixed precision provides a good overview of how mixed precision works and how to configure it. We will refer to some of the elements of mixed precision training that pertain to our discussion.
- The main concern when programming with 16-bit floats is numerical underflow or overflow due to the lower bit precision. In practice, when performing mixed precision training, the concern is mostly with the possibility of underflow during the gradient calculation.
- TensorFlow supports two types of 16 bit precision floating point types, float16 and bfloat16. The bfloat16 type has just as many exponent bits as float32 and is able to represent the same range of numbers. As a result, there is no concern of underflow with the bfloat16 type. While there are a number of processors that support bfloat16, as of the time of this writing, the official version of TensorFlow only supports bfloat16 on Google TPUs. Check out this post for an interesting overview on how floating point representations work.
- The way to overcome the possibility of underflow when using float16 is to perform loss scaling. When we employ loss scaling we multiply the loss by a scaling factor before calculating the gradients, and then divide the resultant gradients by the same factor before applying them to the model weights. By choosing an appropriate scaling factor we can reduce the likelihood of underflow. In TensorFlow this process is implemented in the LossScaleOptimizer. The steps performed are (sketched in code after this list):
- Scale the loss
- Calculate the gradients
- Unscale the gradients
- If all of the gradients are finite apply them to the weights
- If required, update the scaling factor based on the statistics of the current gradients
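In a custom TensorFlow training loop, these steps follow the pattern shown in the TensorFlow mixed precision guide, roughly as in this sketch (without the Horovod wrapping discussed below):
import tensorflow as tf

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD(0.01))

@tf.function
def train_step(model, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
        scaled_loss = opt.get_scaled_loss(loss)          # 1. scale the loss
    scaled_grads = tape.gradient(
        scaled_loss, model.trainable_variables)          # 2. calculate the gradients
    grads = opt.get_unscaled_gradients(scaled_grads)     # 3. unscale the gradients
    # 4-5. apply_gradients skips the update if any gradient is non-finite
    # and updates the loss scale based on the gradient statistics
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss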
Since both distributed training and mixed precision training share the common goal of accelerating training, it is only natural that we would be interested in combining the two. In particular, we would need our LBS based solution for performing HPT to work with mixed precision. Sadly, the current default implementation for LBS in Horovod does not work with the default implementation of the LossScaleOptimizer in TensorFlow. Fortunately, with small changes to each, these features can be programmed to work in harmony. Here are the items that need to be addressed when combining LBS with mixed precision:
- Dealing with NaN values in the gradients: One of the side effects of loss scaling is that we may occasionally end up with NaN or Inf values in the gradients due to numerical overflow. If the LossScaleOptimizer encounters such values it simply drops the gradients and proceeds to the next step. However, the current gradient aggregation code in Horovod does not account for the possibility of NaN values. The details of this issue as well as a suggested modification can be found here, but it will require overriding the default Horovod behavior. Hopefully this will be addressed in a future version of Horovod.
- When to update the scaling coefficient: As we described above, the [LossScaleOptimizer](https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/LossScaleOptimizer) applies periodic updates to the scaling coefficient based on the gradient statistics. In the previous section we recommended caution when it came to applying periodic updates of hyperparameters during LBS. In this case, if we fail to align the updates to the scaling coefficient with the steps during which we apply (and reset) the aggregated gradients, not only will we fail in our large batch simulation but we will also likely fail to train altogether. If the gradients are accumulated in their scaled state (e.g. TensorFlow 2.3), and we reset the scaling during accumulation, the aggregated gradients will have non-matching scale coefficients. As a result, the resultant, unscaled, gradients will be incorrect. Even if the gradients are accumulated in their unscaled state (e.g. TensorFlow 2.4 and 2.5) it is likely that the LossScaleOptimizer will fail to find an appropriate scale factor. In order to fix this you will need to override the loss scale update function that is called by the LossScaleOptimizer such that loss scale updates are only applied on steps in which the gradient aggregator is reset. See here for one way to fix this.
- Non-linearity of gradient quantization: One thing to keep in mind is that the quantization inherent in the use of low precision constitutes a non-linear operation on the gradients. In fact, you are likely to find small numerical differences between the gradient of a batch of size N×B and the average of the gradients of N batches of size B. Of course quantization exists even when using 32-bit floats, but the numerical impact is far less pronounced. Nevertheless, we have not found the numerical differences due to the non-linearity to impact training, but it is good to be aware of them.
Note that a few additional changes might be required in order to combine the two features depending on the TensorFlow version you are using. See here for more details.