How to get 4x speedup and better generalization using the right batch size

Daniel Huynh
Towards Data Science
13 min read · Nov 4, 2019


Once upon a Tweet, I came across this conversation from Jeremy Howard quoting Yann LeCun about batch sizes:

Batch size discussion on Twitter

As this subject has been in the back of my mind ever since I came across the very nice learning rate finder from Fastai, I always wondered whether there could be a useful batch size finder that people could use to quickly start training their model with a good batch size.

As a reminder, the learning rate finder used in Fastai helps find a good learning rate by trying a range of learning rates and seeing which one gives the sharpest decrease in loss. A more detailed explanation can be found here: https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html

This idea of a batch size finder had been in my mind for a long time, and after getting that nudge from Jeremy, I decided to go on the journey of implementing a batch size finder for training neural networks.

Today I want to share both the journey and the destination of implementing a paper, as they are both interesting in my opinion, and they might motivate you to try more things as well!

I. A story of size

OC meme about batch size

One common perception is that you should not use large batch sizes, because they will only cause the model to overfit, and you might run out of memory. While the latter is obviously true, the former is more complicated than that, and to answer it we will take a little dive into the OpenAI paper “An Empirical Model of Large-Batch Training”.

This paper, which I recommend reading, explains many simple ideas that are good to remember.

First, our goal is to minimize a loss through a Stochastic Gradient Descent approach, and there is a true underlying landscape upon which we will minimize it. Nonetheless, we do not have access to the true gradient over the entire dataset (or more precisely over the distribution it is drawn from), therefore we must approximate the gradient with a finite batch size.

Because we average the gradient over a batch, if our batch size is small there will be a lot of noise present, and we might train our model mostly on that noise. Applying several successive updates will still push us in the right direction, but we might as well use a larger batch size directly, which is more computationally efficient and averages out the noise. That said, beyond a certain size, if your gradient is already accurate there is no point in making the batch size even bigger: it is just a computational waste, as the gain in accuracy will be negligible.

Moreover, by using bigger batch sizes (up to a reasonable amount allowed by the GPU), we speed up training, as it is equivalent to taking a few big steps instead of many little steps. Therefore, with bigger batch sizes and the same number of epochs, we can sometimes get a 2x gain in computational time!
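As a quick back-of-the-envelope illustration (the dataset size below is a made-up figure), taking bigger steps simply means fewer optimizer updates per epoch:

    # Toy arithmetic: optimizer steps per epoch for a hypothetical dataset of 50,000 examples
    n_examples = 50_000
    for batch_size in (64, 128, 256, 512):
        steps_per_epoch = -(-n_examples // batch_size)   # ceiling division
        print(f"batch size {batch_size:>3}: {steps_per_epoch} steps per epoch")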

Second, there is a statistic called the “Simple Noise Scale”, which helps us determine what a good batch size would be, and it is defined as:

B_simple = tr(Σ) / |G|²

with G being the true gradient of our loss L over the n parameters, and Σ the covariance matrix of the per-example gradients.

Without going too much into the details of the paper, as it is thoroughly explained there, the idea is that if we use a batch size smaller than the Simple Noise Scale, we can speed up training by increasing the batch size; conversely, if we use a batch size bigger than the Simple Noise Scale, we will just waste computational power.

To understand better what this statistic means, let us study each term:

  • The numerator is the sum of variances of each variable of our gradient. This is a measure of the noise present in our gradient.
  • The denominator is the squared norm of the gradient, which we call the scale, and gives a measure of how close we are to a local minimum, where the gradient would be close to zero.

Therefore, the noisier our gradient is, the bigger the batch size we want, which is natural, as we want to take gradient steps in the right direction. Conversely, if the gradient is not noisy, we will benefit more from taking smaller steps, as we do not need to average a lot of observations together and can use them separately.

On the other hand, the closer we are to the minimum, the bigger the batch size we want, as we are expected to take more careful steps as we approach a local minimum: we do not want to overshoot it and miss the right direction.
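To make the two terms concrete, here is a minimal sketch of the Simple Noise Scale computation, assuming we had direct access to per-example gradients (random numbers stand in for real gradients here):

    import numpy as np

    # Hypothetical per-example gradients: one row per example, one column per parameter
    rng = np.random.default_rng(0)
    per_example_grads = rng.normal(loc=0.1, scale=1.0, size=(1024, 100))

    mean_grad = per_example_grads.mean(axis=0)       # approximation of the true gradient G
    noise = per_example_grads.var(axis=0).sum()      # numerator: sum of per-parameter variances
    scale = (mean_grad ** 2).sum()                   # denominator: squared norm of the gradient

    simple_noise_scale = noise / scale
    print(simple_noise_scale)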

Finally, the Simple Noise Scale gives us a tool to answer the common claim that “bigger batch sizes will make us overfit, while smaller batch sizes help regularize”:

Not necessarily! If your task is already complex and the approximate gradient is noisy, it might be in your interest to use a bigger batch size to make sure your model is not training on too much noise. It is not that a bigger batch size will make you overfit; it is more that a smaller batch size adds regularization through noise injection. But do you want to add regularization if you cannot even fit the data properly?

II. Implementing the paper

OC meme of my journey

So now that we have seen why choosing the right batch size matters, and how we could find a good one through the Simple Noise Scale statistic, it's now time to implement it! Yummy!

As a reminder, the Simple Noise Scale equation is:

B_simple = tr(Σ) / |G|²

The problem is that not only do we need to know the real gradient, but we also need to know its variance, which makes things more difficult. To tackle this issue, the authors propose two different statistics to approximate the numerator and the denominator of the Simple Noise Scale.

Estimator for the scale: |G|²_est = (B_big * |G_B_big|² - B_small * |G_B_small|²) / (B_big - B_small)
Estimator for the noise: S_est = (|G_B_small|² - |G_B_big|²) / (1/B_small - 1/B_big)

Here, we use two different batch sizes, B_big and B_small, to compute two different estimates of the real gradient using the formula:

Approximate gradient for a batch of size B: G_B = (1/B) * Σ_{i=1..B} ∇L(x_i)

Once we have those two estimates, we can finally compute the Simple Noise Scale with the formula:

Approximation of the Simple Noise Scale: B_simple ≈ S_est / |G|²_est

To make sure this estimate has low variance, the authors compute these statistics several times throughout training and average them.
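Here is a minimal sketch of those two estimators and their ratio, assuming we already have two flattened gradient estimates g_big and g_small computed on batches of sizes B_big and B_small (the function and variable names are mine, not the paper's):

    import torch

    def noise_scale_estimates(g_big, g_small, b_big, b_small):
        # Unbiased estimates of the scale |G|^2 and the noise S from two gradient estimates
        g_big_sq = g_big.pow(2).sum()
        g_small_sq = g_small.pow(2).sum()
        scale = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
        noise = (g_small_sq - g_big_sq) / (1 / b_small - 1 / b_big)
        return scale, noise

    # Toy usage with random stand-ins for the real gradients:
    g_small = torch.randn(1000) + 0.1
    g_big = torch.randn(1000) * 0.5 + 0.1            # a bigger batch gives a less noisy estimate
    scale, noise = noise_scale_estimates(g_big, g_small, b_big=512, b_small=64)
    print((noise / scale).item())                    # one (noisy) Simple Noise Scale estimate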

As explained in the paper, one natural way to do this is to make use of several GPUs: compute the local gradient on each GPU, which gives the small-batch gradient, and compare it with the average gradient across the GPUs, which gives the big-batch gradient. Nonetheless, this method assumes we have a multi-GPU setup, which is not the case for most of us.

Therefore, an efficient way to implement this on a single GPU had to be found, and it was not described in the original paper. So that is where I started, and I will now share my reasoning on how to solve this issue!

The code used in the rest of the article can be found here: https://colab.research.google.com/drive/15lTG_r03yqSwShZ0JO4XaoWixLMXMEmv

In the first lines of code, I set up a Fastai environment to run a model on MNIST, as this dataset was already tested in the paper, where the authors report an average Simple Noise Scale of 900.

I will not explain the code in too much detail, as it would take a whole article to explain how Fastai puts everything together with its API, but the code should be a good start. If you want more help understanding it, tell me in the comments and I can explain it, or even write an article on the coding part.

A. First approach using an exponential moving average

Given that I found the proposed statistics not really practical without a multi-GPU setup, I thought I could skip them and directly compute the sum of variances, using a couple of approximations:

First I have approximated the real gradient with the estimated gradient for a given batch.

Then, as the computation of the covariance matrix can be seen as two averages, I tried to approximate it with exponential moving averages, as I did not want to store many gradients across training.
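A rough sketch of this first attempt (my own simplification): keep exponential moving averages of the flattened gradient and of its elementwise square, and derive the variance from them.

    import torch

    def ema(prev, value, beta=0.99):
        # Exponential moving average; returns the value itself on the first update
        return value if prev is None else beta * prev + (1 - beta) * value

    ema_grad, ema_grad_sq = None, None

    # In the real training loop, grad would be the flattened gradient of one batch, e.g.
    # torch.cat([p.grad.flatten() for p in model.parameters()]); here a random stand-in is used.
    for step in range(100):
        grad = torch.randn(1000) + 0.1
        ema_grad = ema(ema_grad, grad)
        ema_grad_sq = ema(ema_grad_sq, grad ** 2)

    noise = (ema_grad_sq - ema_grad ** 2).sum()      # crude estimate of the sum of variances
    scale = (ema_grad ** 2).sum()                    # crude estimate of the squared gradient norm
    print((noise / scale).item())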

Running average of noise, scale, and Simple Noise Scale over batches computed

As you can see here, the results are weird: the Simple Noise Scale is far too bumpy, and the noise estimate even goes negative at times, which gives a very negative Simple Noise Scale that does not make sense.

B. Storing gradients

We saw that using an exponential moving average is not a good way to approximate the covariance matrix.

Another way to tackle this issue is to set in advance a number N of gradients to keep: we compute N different gradients and use them to approximate the covariance matrix.
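A minimal sketch of this second attempt, again with a random tensor standing in for the real flattened gradient: store N gradients and take the empirical variance across them.

    import torch

    N = 200                                          # number of gradients to store
    stored = []
    for step in range(N):
        # Stand-in for torch.cat([p.grad.flatten() for p in model.parameters()])
        stored.append(torch.randn(1000) + 0.1)

    grads = torch.stack(stored)                      # shape (N, n_parameters): this is what gets expensive
    mean_grad = grads.mean(dim=0)
    noise = grads.var(dim=0).sum()                   # sum of per-parameter variances
    scale = (mean_grad ** 2).sum()                   # squared norm of the averaged gradient
    print((noise / scale).item())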

This starts to show results, but the way it is computed is tricky: the x-axis is the number of batches stored to compute the Simple Noise Scale in this fashion. Though it provides some kind of result, it is not usable in practice, as I had to store hundreds of gradients!

C. Run two trainings

After failing another time, I decided to follow the idea of the paper and compute its two statistics. Nonetheless, I needed a way to get two batches of different sizes during training while having only one GPU.

Then I thought: why do a single training epoch, when I could do two training epochs with two different batch sizes and compute the statistics afterwards?

So I went with this idea and used B_big = 2 * B_small, which allowed me to compute their respective gradients and use them to compute G and S in an exponential-moving-average manner, as described in the paper.

Ouch! Just like the first approach, it yielded weird results! Moreover, when I think about it, the batches might not match between the two runs, as nothing forces the small batch to be included in the big batch. In addition, I would need to run two training epochs to compute this, so it was not good either.

D. Sequential batches

Finally, I realized that the most promising approach was the second one, but something had to be modified, because I did not want to keep tons of gradients to compute the statistic.

Then a very simple but effective thought came to mind: what if, instead of averaging several batches in a parallel fashion as they do in the paper, I averaged consecutive batches in a sequential way?

This simply means that I just need to set a parameter, which I call n_batch: the number of batches to store before computing the big and small gradients. Then I can compute the statistics of the paper in a sequential way!
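Here is a simplified sketch of this sequential idea (my own reconstruction, not the actual implementation, which lives in the repository linked in the conclusion): every n_batch steps, treat the latest gradient as the small-batch gradient and the average of the last n_batch gradients as the big-batch gradient, then plug both into the paper's estimators.

    import torch

    n_batch = 20                                     # consecutive batches to accumulate
    batch_size = 64                                  # B_small; the accumulated gradient plays the role of B_big = n_batch * B_small
    acc, count = None, 0
    running_scale, running_noise = None, None

    def ema(prev, value, beta=0.99):
        return value if prev is None else beta * prev + (1 - beta) * value

    for step in range(200):
        grad_small = torch.randn(1000) + 0.1         # stand-in for one batch's flattened gradient
        acc = grad_small.clone() if acc is None else acc + grad_small
        count += 1
        if count == n_batch:
            grad_big = acc / n_batch                 # average of n_batch consecutive gradients
            b_big, b_small = n_batch * batch_size, batch_size
            g_big_sq, g_small_sq = grad_big.pow(2).sum(), grad_small.pow(2).sum()
            scale = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
            noise = (g_small_sq - g_big_sq) / (1 / b_small - 1 / b_big)
            running_scale = ema(running_scale, scale)
            running_noise = ema(running_noise, noise)
            acc, count = None, 0

    # With real gradients (not this random stand-in), the ratio tracks the Simple Noise Scale
    print((running_noise / running_scale).item())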

After implementing it this way, I obtained the following result:

Now ain't she a beauty! The paper explains that this growing trend is to be expected: the noise is likely to stay roughly the same, while the scale of the gradient decreases as we approach a minimum, which leads to a growing Simple Noise Scale.

Because we most likely did not have the same setup, and I did not have access to their code, our results diverge slightly, but the authors mention a Simple Noise Scale starting at around 50 and reaching 900, which is the order of magnitude that matters. Given the many approximations at play, both in theory and in practice, results can vary, but as explained in the paper, there should not be variations of more than an order of magnitude.

So after this long journey, there is an implementation that seems to work, even though the paper gave little help on how to do it. And the best part is that, to use it in practice, you only need one line of code!
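Roughly, the call looks like the following; the function name and exact signature are my paraphrase based on the parameter list below, so check https://github.com/DanyWind/fastai_bs_finder for the actual API:

    # Hypothetical usage, assuming a Fastai Learner named `learn` and a function
    # `bs_find` exposing the parameters described below (the name is an assumption):
    bs_find(learn, lr=1e-3, num_it=None, n_batch=20, beta=0.99)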

Here the parameters correspond to:

  • learn: a Fastai Learner.
  • lr: a learning rate for the training loop; it can be found using lr_find().
  • num_it: the number of batches you want to process; it can be set to None and the finder will automatically run for one epoch.
  • n_batch: the number of batches you want to store before computing the Simple Noise Scale. 20 seems to work well across different tasks.
  • beta: the beta parameter of the exponential moving average used to compute the sum of variances and the scale of the gradient. If the plot is too irregular, try increasing beta to 0.999 or more, or increase the n_batch parameter.

III. Testing the batch size finder on different tasks

Time to take big steps!

Now that we have a working implementation, it is interesting to look at how it helps find a good batch size in practice.

First, we will study the Rossmann dataset. This dataset is explored in the Fastai course v3, which you can find here: https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson6-rossmann.ipynb

Here I will simply run my batch size finder and do the exact same training as the original notebook, but with a batch size that takes the Simple Noise Scale into account.

Now, how do we interpret this? It means that, for this given learning rate, training seems to converge to a Simple Noise Scale of around 500, i.e. the noise and the scale stabilize later in training. Therefore, the best tradeoff between computing time and efficiency seems to be a batch size of 512.

After running the same training with batch sizes 512 and 64, there are a few things we can observe.

First one-cycle training with batch size 512
First one-cycle training with batch size 64

With a batch size of 512, training is nearly 4x faster than with a batch size of 64! Moreover, even though the run with batch size 512 took fewer steps, in the end it reached a better training loss and only a slightly worse validation loss.

Then, if we look at the losses of the second training cycle for each batch size:

Second one-cycle training losses with batch size 512
Second one-cycle training losses with batch size 64

We can see here that training is much bumpier with a batch size of 64 than with 512, and the larger batch size is not overfitting, as its validation loss continues to decrease.

Finally, we can observe the following results for the last training cycle:

Last one-cycle training losses with batch size 512
Last one-cycle training losses with batch size 64

So in the end, if we sum up the results on Rossmann, using a batch size of 512 instead of 64:

  • reduces the training time by a factor of 4
  • provides a better training and validation loss, as well as a better value of the metric of interest, here exp_rmspe

I also looked into text and image data, but given that those models are much heavier, especially pretrained models with huge bodies, I ran into CUDA out-of-memory errors when trying to train with the suggested batch sizes, so I will not show the results here; you can have a look at them in the Colab notebook.

Conclusion

We have seen a lot of things throughout this article! I hope you enjoyed the ride, and if there are a few things to remember, they would be:

  • There is no magic batch size number, such as 32; it depends on the complexity of your data and the GPU constraints you have. We saw that small batch sizes can help regularize through noise injection, but that can be detrimental if the task you want to learn is hard, and it takes more time to run many small steps. Conversely, a big batch size can really speed up your training, and even give better generalization performance.
  • A good way to know which batch size would be good is to use the Simple Noise Scale metric introduced in “An Empirical Model of Large-Batch Training”. I have provided a first, quick implementation here: https://github.com/DanyWind/fastai_bs_finder. You can try it on your own datasets, especially on recommender systems or tabular models, where you are less likely to run into CUDA out-of-memory errors.
  • Don't hesitate to try things; a little nudge can sometimes push you to do good things! I saw the paper maybe 6 months ago, and I did not really pay attention to it until I actually tried (and failed many times) to implement it. But now, not only can I share those results with a large community, it has also helped me better understand how batch size works, and how the common conception of it might be wrong. So don't hesitate to implement cool stuff, and it does not matter if it doesn't work right away: the journey is more valuable than the destination!

I hope you enjoyed reading this article; it would be great to have your feedback, and I will try to post more articles in the future.

If you have any questions, do not hesitate to contact me on LinkedIn; you can also find me on Twitter!
