Photo by Julian Hochgesang on Unsplash

The Underlying Dangers Behind Large Batch Training Schemes

The hows and whys behind the generalization gap and how to minimize it

Andy Wang
Towards Data Science
10 min read · Nov 12, 2022


In recent years, Deep Learning has stormed the field of Machine Learning with its versatility, wide range of applications, and ability to parallelize training. Deep Learning algorithms are typically optimized with gradient-based methods, referred to as “Optimizers” in Neural Networks. Optimizers use the gradients of the loss function to determine an optimal adjustment to the parameter values of the network. Most modern optimizers deviate from the original Gradient Descent algorithm and instead compute an approximation of the gradient from a batch of samples drawn from the entire dataset.
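To make this concrete, below is a minimal sketch of a single mini-batch update in TensorFlow. The model, optimizer, and loss function are assumed to be supplied by the caller; the names are purely illustrative and not tied to any specific library recipe.

```python
import tensorflow as tf

# A single mini-batch update: the gradient is estimated from one batch of
# samples instead of the entire dataset, then applied by the optimizer.
def train_step(model, optimizer, loss_fn, x_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)
        loss = loss_fn(y_batch, predictions)            # loss on this batch only
    # Batch-based estimate of the gradient of the loss w.r.t. the parameters.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```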

The nature of Neural Networks and their optimization technique allows for parallelization, or training in batches. Large batch sizes are often adopted when the available computational power allows it, to significantly speed up the training of Neural Networks with up to millions of parameters. Intuitively, a larger batch size increases the “effectiveness” of each gradient update, since a relatively significant portion of the dataset is taken into account. On the other hand, a smaller batch size means updating the model parameters based on gradients estimated from a smaller portion of the dataset. Logically, a smaller “chunk” of the dataset will be less representative of the overall relationship between the features and the labels. This could lead one to conclude that large batch sizes are always beneficial to training.

Large vs. Small Batch Sizes. Image by the author.

However, the assumptions above are made without considering the model’s ability to generalize to unseen data points and the non-convex nature of modern Neural Network optimization. Specifically, it has been empirically observed across various research studies that increasing the batch size of a model typically decreases its ability to generalize to unseen datasets, regardless of the type of Neural Network. The term “Generalization Gap” was coined for this phenomenon.

In a convex optimization scheme, having access to a larger portion of the dataset would directly translate to better results (as depicted by the diagram above). In contrast, having access to less data per update, i.e. a smaller batch size, would slow training down, but decent results could still be obtained. In non-convex optimization, which is the case for most Neural Networks, the exact shape of the loss landscape is unknown, and matters become more complicated. Specifically, two research studies have attempted to investigate and model the “Generalization Gap” caused by differences in batch size.

The How and the Why of Generalization Gap

In the research paper “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima” from Keskar et al. (2017), the authors made several observations surrounding large-batch training regimes:

  1. Large Batch Training methods tend to overfit compared to the same network trained with a smaller batch size.
  2. Large Batch Training methods tend to get trapped in, or even attracted to, potential saddle points in the loss landscape.
  3. Large Batch Training methods tend to zoom in on the closest relative minimum they find, whereas networks trained with a smaller batch size tend to “explore” the loss landscape before settling on a promising minimum.
  4. Large Batch Training methods tend to converge to completely “different” minima than networks trained with smaller batch sizes.

Furthermore, the authors tackled the Generalization Gap from the perspective of how Neural Networks navigate the loss landscape during training. Training with a relatively large batch size tends to converge to sharp minimizers, while reducing the batch size usually leads to falling into flat minimizers. A sharp minimizer can be thought of as a narrow, steep ravine, whereas a flat minimizer is analogous to a valley in a vast landscape of low, gentle hills. To phrase it in more rigorous terms:

Sharp minimizers are characterized by a significant number of large positive eigenvalues of the Hessian Matrix of f(x), while flat minimizers are characterized by a considerable number of smaller positive eigenvalues of the Hessian Matrix of f(x).

“Falling” into a sharp minimizer may produce a seemingly better loss than a flat minimizer, but it’s more prone to generalizing poorly to unseen datasets. The diagram below illustrates a simple 2-dimensional loss landscape from Keskar et al.

A sharp minimum compared to a flat minimum. From Keskar et al.

We assume that the relationship between features and labels of unseen data points is similar to, but not exactly the same as, that of the data points used for training. As in the example shown above, the “difference” between train and test can be a slight horizontal shift. Parameter values that sit in a sharp minimum can land near a relative maximum when evaluated on unseen data points, because the minimum is so narrow. With a flat minimum, as shown in the diagram above, a slight shift in the “Testing Function” still leaves the model at a relatively low point in the loss landscape.
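Keskar et al. define their own sharpness metric, but as a rough, hedged illustration of the Hessian-based characterization quoted above, one could estimate the largest Hessian eigenvalue of the training loss with Hessian-vector products and power iteration. The sketch below assumes a Keras model, a loss function, and a fixed batch (x, y); it illustrates the idea and is not the paper’s measure.

```python
import tensorflow as tf

def top_hessian_eigenvalue(model, loss_fn, x, y, num_iters=20):
    """Rough power-iteration estimate of the largest Hessian eigenvalue."""
    params = model.trainable_variables
    # Start from a random direction with the same shapes as the parameters.
    v = [tf.random.normal(tf.shape(p)) for p in params]
    norm = tf.sqrt(tf.add_n([tf.reduce_sum(tf.square(u)) for u in v]))
    v = [u / norm for u in v]

    eigenvalue = tf.constant(0.0)
    for _ in range(num_iters):
        # Hessian-vector product H·v via double backpropagation.
        with tf.GradientTape() as outer:
            with tf.GradientTape() as inner:
                loss = loss_fn(y, model(x, training=True))
            grads = inner.gradient(loss, params)
            grad_dot_v = tf.add_n(
                [tf.reduce_sum(g * u) for g, u in zip(grads, v)])
        hv = outer.gradient(grad_dot_v, params)
        # Parameters with no second-order contribution come back as None.
        hv = [h if h is not None else tf.zeros_like(u) for h, u in zip(hv, v)]
        # Rayleigh quotient vᵀHv approximates the dominant eigenvalue.
        eigenvalue = tf.add_n([tf.reduce_sum(h * u) for h, u in zip(hv, v)])
        norm = tf.sqrt(tf.add_n([tf.reduce_sum(tf.square(h)) for h in hv]))
        v = [h / (norm + 1e-12) for h in hv]
    return eigenvalue
```

A large estimated eigenvalue suggests a narrow, steep ravine around the current parameters, while a small one suggests a flatter region.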

Typically, adopting a small batch size adds noise to training compared to using a bigger batch size. Since the gradients are estimated from a smaller number of samples, each batch update will be rather “noisy” relative to the loss landscape of the entire dataset. Noisy training in the early stages is helpful to the model, as it encourages exploration of the loss landscape. Keskar et al. also stated that…

“We have observed that the loss function landscape of deep Neural Networks is such that large-batch methods are attracted to regions with sharp minimizers and that, unlike small-batch methods, are unable to escape basins of attraction of these minimizers.”

Although larger batch sizes are generally considered to bring more stability to training, the noisiness of small-batch training is actually beneficial for exploring the loss landscape and avoiding sharp minimizers. We can exploit this fact to design a “batch size scheduler”: start with a small batch size to allow for exploration of the loss landscape, and once a general direction has been found and the model is (hopefully) honing in on a flat minimum, increase the batch size to stabilize training. The details of how one can increase the batch size during training to obtain faster and better results are described in the following article.
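As a rough illustration of the idea, and not the specific recipe from the article linked above, one could simply re-batch the dataset between epochs according to a hand-picked schedule; the epoch thresholds and batch sizes below are made up.

```python
import tensorflow as tf

# Hypothetical schedule: (first epoch it applies to, batch size).
BATCH_SCHEDULE = [(0, 64), (10, 256), (20, 1024)]

def batch_size_for_epoch(epoch):
    size = BATCH_SCHEDULE[0][1]
    for start_epoch, batch_size in BATCH_SCHEDULE:
        if epoch >= start_epoch:
            size = batch_size
    return size

def train_with_batch_schedule(model, x_train, y_train, epochs=30):
    for epoch in range(epochs):
        batch_size = batch_size_for_epoch(epoch)
        dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                   .shuffle(buffer_size=len(x_train))
                   .batch(batch_size))
        # One epoch at the current batch size; optimizer state carries over
        # between calls, so this behaves like one continuous training run.
        model.fit(dataset, epochs=1, verbose=0)
```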

In a more recent study, Hoffer et al. (2018), in their paper “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”, expanded on the ideas explored by Keskar et al. and proposed a simple yet elegant way to reduce the generalization gap. Unlike Keskar et al., they approached the Generalization Gap from a different angle: the number of weight updates and its correlation with the network’s loss.

Hoffer et al. offer a somewhat different explanation for the Generalization Gap phenomenon. Note that, for a fixed number of epochs, the batch size is inversely proportional to the number of weight updates; that is, the larger the batch size, the fewer updates there are. Based on their empirical and theoretical analysis, with fewer weight/parameter updates the chances of the model approaching a minimum are much lower.
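To make the inverse relationship concrete, here is a quick back-of-the-envelope calculation with illustrative numbers:

```python
# For a fixed number of epochs, a larger batch size means fewer weight updates.
num_samples, num_epochs = 1_000_000, 100

for batch_size in (64, 512, 4096):
    updates_per_epoch = num_samples // batch_size
    total_updates = updates_per_epoch * num_epochs
    print(f"batch size {batch_size:5d} -> {total_updates:,} weight updates")
```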

To start, one needs to understand that the optimization of Neural Networks through batch-based gradient descent is stochastic in nature. Technically speaking, the term “loss landscape” refers to a high-dimensional surface on which every possible set of parameter values is plotted against the loss those parameters produce. Note that this loss is defined over all possible data samples for the scenario, not just the ones available in the training dataset. Each time a batch is sampled from the dataset and the gradient is computed, an update is made, and that update is “stochastic” on the scale of the entire loss landscape.

An example of a possible loss landscape. Here, the z-axis would be the loss value while the x and y axes would be possible parameter values. Image by the author.

Hoffer et al. make the analogy that optimizing a Neural Network with stochastic gradient-based methods is like a particle performing a random walk on a random potential. One can picture the particle as a “walker” blindly exploring an unknown high-dimensional surface of hills and valleys. On the scale of the entire surface, each move the particle takes is random, and it could go in any direction, whether towards a local minimum, a saddle point, or a flat area. Based on previous studies of random walks on random potentials, the number of steps the walker needs grows exponentially with the distance it travels from its starting position. For example, to climb over a hill of height d, the particle will need on the order of eᵈ random steps to reach the top.

An illustration of the exponential relationship between the number of “walks” and the distance walked.

The particle walking on the random high-dimensional surface can be interpreted as the weight matrix, and each “random” step, i.e. each update, can be seen as one random step taken by the “particle”. Then, following the travelling-particle intuition built above, at update step t the distance of the weight matrix from its initial values can be modeled by

‖wₜ − w₀‖ ~ (log t)^(2/α)

where wₜ is the weight matrix after t updates, w₀ is its initialization, and α depends on the statistics of the loss surface. This asymptotic behavior of the “particle” walking on a random potential is referred to as “ultra-slow diffusion”. From this rather statistical analysis, combined with Keskar et al.’s conclusion that flat minimizers are typically better to “converge into” than sharp minimizers, the following conclusion can be made:

During the initial phase of training, to search for a flat minimum with “width” d, the weight vector, or the particle in our analogy, has to travel a distance of roughly d, thus taking at least eᵈ iterations. To achieve this, a high diffusion rate is needed (while retaining numerical stability), along with a high total number of iterations.

The behavior described by the “random walk on a random potential” model is supported empirically by the experiments performed by Hoffer et al. The graph below plots the number of iterations against the Euclidean distance of the weight matrix from its initialization for different batch sizes. A clearly logarithmic (at least asymptotically) relationship can be seen.

The number of iterations plotted against the distance of the weight matrix from initialization. From Hoffer et al.

Methods to Reduce the Generalization Gap

There is no inherent “Generalization Gap” in Neural Network training: adaptations can be made to the learning rate, batch size, and training method to (theoretically) eliminate it entirely. Based on the conclusions of Hoffer et al., to increase the diffusion rate during the initial steps of training, the learning rate can be set to a relatively high value. This allows the model to take rather “daring”, large steps and explore more of the loss landscape, which helps it eventually reach a flat minimizer.
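As a small sketch of what this can look like in Keras (the phase length and learning rates below are placeholders, not values from either paper):

```python
import tensorflow as tf

# Keep the learning rate high during the initial "exploration" phase, then
# decay it once the model should be settling into a (hopefully flat) minimum.
def lr_schedule(epoch, lr):
    if epoch < 10:                        # exploration phase
        return 0.1
    return 0.1 * (0.95 ** (epoch - 10))   # decay phase

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# model.fit(x_train, y_train, batch_size=1024, epochs=60, callbacks=[lr_callback])
```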

Hoffer et al. also proposed an algorithm that reduces the effects of the Generalization Gap while keeping a relatively large batch size. They examined Batch Normalization and proposed a modification, Ghost Batch Normalization. Batch Normalization reduces overfitting, increases generalization ability, and speeds up convergence by standardizing the outputs of the previous network layer, essentially putting values “on the same scale” for the next layer to process. Statistics are calculated over the entire batch, and after standardization a learned transformation is applied to accommodate the specific needs of each layer. A typical Batch Normalization computation looks something like this:

μ = (1/m) Σᵢ xᵢ,   σ² = (1/m) Σᵢ (xᵢ − μ)²,   x̂ᵢ = (xᵢ − μ) / √(σ² + ε),   yᵢ = γ·x̂ᵢ + β

where γ and β represent the learned transformation, the xᵢ are the outputs from the previous layer for one batch of m training samples, and ε is a small constant for numerical stability. During inference, Batch Normalization uses precomputed statistics and the learned transformation from the training phase. In most standard implementations, the mean and the variance are stored as exponential moving averages across the entire training process, and a momentum term controls how much each new update changes the current moving average.

Hoffer et al. propose that computing statistics and performing Batch Normalization over “ghost batches” reduces the Generalization Gap. With “ghost batches”, small chunks of samples are taken from the full batch, and the normalization statistics are computed over those small chunks. This carries the concept of increasing the number of updates over to Batch Normalization, without modifying the overall training scheme as much as reducing the batch size itself would. During inference, however, the full-batch statistics are used.

The Ghost Batch Normalization algorithm. From Hoffer et al.
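As a rough sketch of the core idea only, i.e. per-ghost-batch statistics without the moving-average bookkeeping of the full algorithm above, ghost-batch standardization could be written as:

```python
import tensorflow as tf

def ghost_batch_normalize(x, ghost_size, gamma, beta, eps=1e-5):
    """Standardize each ghost batch with its own mean and variance.

    x: [batch, features]; the batch size must be divisible by ghost_size.
    """
    features = x.shape[-1]
    chunks = tf.reshape(x, [-1, ghost_size, features])     # [n_ghosts, ghost_size, features]
    mean = tf.reduce_mean(chunks, axis=1, keepdims=True)   # per-ghost-batch statistics
    var = tf.math.reduce_variance(chunks, axis=1, keepdims=True)
    normalized = (chunks - mean) / tf.sqrt(var + eps)
    return gamma * tf.reshape(normalized, [-1, features]) + beta
```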

In TensorFlow/Keras, Ghost Batch Normalization can be used by setting the virtual_batch_size parameter of the BatchNormalization layer to the ghost batch size.
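For example, a minimal, untuned model just to show the layer configuration might look like this (the layer sizes and batch sizes are illustrative):

```python
import tensorflow as tf

# Each "real" training batch of 4096 samples is normalized in ghost batches
# of 128 samples via the virtual_batch_size argument.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, input_shape=(784,)),
    tf.keras.layers.BatchNormalization(virtual_batch_size=128),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, batch_size=4096, epochs=10)
```

Note that the training batch size must be divisible by virtual_batch_size for the ghost batches to be formed cleanly.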

Conclusion

In real-world practice, the Generalization Gap is a rather overlooked topic, but its importance in Deep Learning cannot be ignored. There are simple tricks to reduce or even eliminate the gap, such as:

  • Ghost Batch Normalization
  • Using a relatively large learning rate during the initial phases of training
  • Starting from a small batch size and increasing it as training progresses

As research progresses and Neural Network interpretability improves, the Generalization Gap can hopefully become a thing of the past entirely.

