What distinguishes a neural network that generalizes well from one that doesn’t?

Understanding Deep Learning Requires Rethinking Generalization

Ratul Ghosh
Towards Data Science



I recently came across the widely discussed paper “Understanding Deep Learning Requires Rethinking Generalization” (Zhang et al., 2016), which won a Best Paper Award at ICLR 2017. It raises a very important question: why does a deep neural network generalize, despite having enough capacity to simply memorize its inputs?

In this article, I will share my understanding of the paper and walk through its experiments, their results, and the implications.

Before jumping into the paper, I will explain a few concepts that will be useful:

  • Generalization Error
  • Model Capacity
  • Explicit and Implicit Regularization

Generalization Error

Generalization Error is defined as the difference between the training error and testing error.

In the usual learning-curve picture, the training error (green curve) keeps decreasing while the test error (red curve) eventually starts to rise: the model is overfitting and the generalization error grows steadily.
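As a quick illustration (my own sketch, not from the paper), here is one way to measure the generalization gap of a deliberately over-flexible model; the data, model, and polynomial degree are arbitrary choices:

```python
# A minimal sketch: the generalization gap of an intentionally over-flexible model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * x).ravel() + 0.3 * rng.normal(size=40)
x_train, y_train, x_test, y_test = x[:25], y[:25], x[25:], y[25:]

# A degree-15 polynomial has more than enough capacity to overfit 25 noisy points.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(x_train, y_train)

train_err = mean_squared_error(y_train, model.predict(x_train))
test_err = mean_squared_error(y_test, model.predict(x_test))
# Typically the training error is far below the test error: that gap is the generalization error.
print(f"train error={train_err:.3f}, test error={test_err:.3f}, gap={test_err - train_err:.3f}")
```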

Model Capacity

Model capacity refers to how flexibly a model can fit a wide variety of functions.

Universal Approximation Theorem

A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly.

— Ian Goodfellow, DLB

Vapnik–Chervonenkis (VC) dimension

A classification model f with some parameter vector θ is said to shatter a set of data points (x1, x2, …, xn) if, for all assignments of labels to those points, there exists a θ such that the model f makes no errors when evaluating that set of data points.

The VC dimension of a model f is the maximum number of points that can be arranged so that f shatters them. More formally, it is the maximum cardinality D such that some set of D data points can be shattered by f.

Put simply, the largest number of data points the model can label perfectly under every possible labeling is its VC dimension.

As an example, take a simple linear classifier in two dimensions separating two groups (blue pluses and red minuses). Any three points in general position can be shattered, but for any four points there is always a labeling (an XOR-like pattern) that no straight line can separate. Therefore, the VC dimension of the linear classifier is three.
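The following sketch (my own, not from the paper) checks this numerically with scikit-learn: a linear SVM with a very large C (approximating a hard margin) is fit to every labeling of three points and of four points, where the latter includes the XOR labeling it cannot realize:

```python
# A minimal sketch: a 2-D linear classifier shatters 3 points but not 4, so its VC dimension is 3.
import itertools
import numpy as np
from sklearn.svm import LinearSVC

def can_shatter(points):
    """Return True if a linear classifier fits every +/- labelling of `points` perfectly."""
    for labels in itertools.product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:                 # skip single-class labellings
            continue
        clf = LinearSVC(C=1e6, max_iter=100_000)  # very large C ~ hard margin
        clf.fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])         # three points in general position
four = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])  # admits the XOR labelling
print("3 points shattered:", can_shatter(three))   # expected: True
print("4 points shattered:", can_shatter(four))    # expected: False
```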

The VC dimension predicts a probabilistic upper bound on the test error of a classification model.
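From standard statistical learning theory, the classical bound (the exact constants are not essential to the argument here) holds with probability at least 1 − η:

$$
E_{\text{test}} \;\le\; E_{\text{train}} + \sqrt{\frac{D\left(\ln\frac{2N}{D} + 1\right) + \ln\frac{4}{\eta}}{N}}
$$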

Here, D is the VC dimension of the model (which, for many simple models, grows with the number of parameters) and N is the size of the training set.

This bound is only meaningful when D << N. It tells us little about deep neural networks, where the number of parameters (and hence the capacity) typically far exceeds the number of data points (N << D).
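A quick back-of-the-envelope check (my own illustrative numbers, plugged into the bound above) shows how fast the confidence term blows up; with a CIFAR-10-sized training set the bound is already vacuous long before D reaches the millions of parameters of a modern network:

```python
# A rough sanity check of the VC confidence term above (illustrative numbers only).
import math

def vc_gap(D, N, eta=0.05):
    """sqrt((D * (ln(2N/D) + 1) + ln(4/eta)) / N) -- meaningful only for D well below N."""
    return math.sqrt((D * (math.log(2 * N / D) + 1) + math.log(4 / eta)) / N)

print(vc_gap(D=100, N=50_000))     # ~0.13: an informative guarantee
print(vc_gap(D=40_000, N=50_000))  # ~1.24: already vacuous, since error is trivially <= 1
```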

Explicit and Implicit Regularization

The paper draws a distinction between explicit and implicit regularization. Dropout, data augmentation, weight sharing, and conventional regularization (L1 and L2) are explicit regularizers; early stopping, batch normalization, and SGD are implicit ones. Although the paper does not spell out how the distinction is made, my reading is that implicit regularizers are those where regularization is achieved as a side effect of some other process. For example, L1 is used exclusively for regularization, hence it is explicit, whereas batch normalization is used to normalize the activations across different inputs and, as a side effect, also happens to perform some regularization, so it is implicit.

Note: The above part is my understanding, please correct me if I’m wrong.
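As a concrete illustration of where each kind shows up in practice, here is a minimal PyTorch sketch (my own, not from the paper; the architecture and hyperparameters are arbitrary):

```python
# A minimal sketch of explicit vs. implicit regularizers in a typical PyTorch setup.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # implicit: added to ease optimization, regularizes as a side effect
    nn.ReLU(),
    nn.Dropout(p=0.5),    # explicit: its only purpose is regularization
    nn.Linear(256, 10),
)

# Explicit: an L2 penalty applied deliberately through weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

class EarlyStopping:
    """Implicit: stop when validation loss stops improving; no penalty term is added."""
    def __init__(self, patience=5):
        self.best, self.patience, self.bad = float("inf"), patience, 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience  # True -> stop training
```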

Different Randomization Tests

In the paper, the authors performed the following randomization tests (a small code sketch of these corruptions follows the list):

  • True labels: the original dataset without modification.
  • Partially corrupted labels: independently, with probability p, the label of each image is replaced with a uniformly random class.
  • Random labels: all the labels are replaced with random ones.
  • Shuffled pixels: a random permutation of the pixels is chosen and then the same permutation is applied to all the images in both training and test set.
  • Random pixels: a different random permutation is applied to each image independently.
  • Gaussian: A Gaussian distribution (with matching mean and variance to the original image dataset) is used to generate random pixels for each image.
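Here is a small NumPy sketch (my own; it assumes images `x` of shape (n, h, w, c) and integer labels `y`, with random stand-ins for the real dataset) of how these corruptions can be generated:

```python
# A minimal sketch of the corruptions used in the randomization tests.
import numpy as np

rng = np.random.default_rng(0)
n, h, w, c, num_classes = 1000, 32, 32, 3, 10
x = rng.random((n, h, w, c)).astype(np.float32)   # stand-in for CIFAR-10 images
y = rng.integers(0, num_classes, size=n)

# Partially corrupted labels: with probability p, replace a label uniformly at random.
p = 0.5
mask = rng.random(n) < p
y_corrupt = np.where(mask, rng.integers(0, num_classes, size=n), y)

# Random labels: every label replaced.
y_random = rng.integers(0, num_classes, size=n)

# Shuffled pixels: one fixed permutation applied to every image.
perm = rng.permutation(h * w)
x_shuffled = x.reshape(n, h * w, c)[:, perm, :].reshape(n, h, w, c)

# Random pixels: a fresh permutation per image.
x_random = np.stack([img.reshape(h * w, c)[rng.permutation(h * w)].reshape(h, w, c)
                     for img in x])

# Gaussian: pure noise with the dataset's mean and standard deviation.
x_gauss = rng.normal(x.mean(), x.std(), size=x.shape).astype(np.float32)
```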

Results of Randomization Tests

The results are quite interesting: the model fits the noisy Gaussian samples perfectly, and it also fits the training data with completely random labels perfectly, although this takes somewhat longer. In other words, a deep neural network with enough parameters can completely memorize random inputs. This is counter-intuitive, because the widely accepted view is that deep learning discovers a hierarchy of lower-, middle-, and higher-level features. If a model can memorize arbitrary random inputs, what guarantees that it will learn useful features rather than simply memorizing the training data?

Results of Regularization Tests

(Figure source: Zhang et al., 2017, https://bengio.abracadoudou.com/cv/publications/pdf/zhang_2017_iclr.pdf)

The first diagram shows the effect of different explicit regularizers on training and test accuracy. The key takeaway is that there is no significant difference in generalization performance with and without explicit regularization.

The second diagram shows the effect of batch normalization (an implicit regularizer) on training and test accuracy. Training with batch normalization is noticeably smoother, but it does not improve test accuracy.

From the experiments, the authors concluded that both explicit and implicit regularizers could help to improve generalization performance. However, it is unlikely that regularizers are the fundamental reason for generalization.

Finite-Sample Expressivity

The authors also proved the following theorem:

There exists a two-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions.

which is essentially a finite-sample counterpart of the Universal Approximation Theorem. The proof is fairly involved; if you are interested, refer to Section C in the appendix of the paper.
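The paper's construction is explicit, but the claim is easy to check empirically. Below is a small sketch (my own, with arbitrary sizes and optimizer settings, not the paper's construction) showing a wide two-layer ReLU network driving training accuracy on purely random labels to 100%:

```python
# A minimal sketch: a sufficiently wide two-layer ReLU network can fit random labels exactly.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 200, 20
x = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float()        # completely random binary labels

net = nn.Sequential(nn.Linear(d, 2 * n), nn.ReLU(), nn.Linear(2 * n, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(x).squeeze(1), y)
    loss.backward()
    opt.step()

acc = ((net(x).squeeze(1) > 0).float() == y).float().mean()
print(f"training accuracy on random labels: {acc.item():.3f}")   # typically 1.000
```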

Implicit Regularization: An Appeal to Linear Models

In the final section, the authors show that SGD itself imparts a regularization effect: in the underdetermined linear (least-squares) setting, SGD converges to the solution with minimum L2 norm. Their experiments also show, however, that a minimum-norm solution does not guarantee better generalization performance.
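Here is a minimal sketch of the linear-model argument (my own illustration, with full-batch gradient descent standing in for SGD): started from zero, gradient descent on an underdetermined least-squares problem lands on the minimum-norm (pseudoinverse) solution.

```python
# A minimal sketch: gradient descent from zero on an underdetermined least-squares
# problem converges to the minimum-L2-norm (pseudoinverse) solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                     # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                    # zero initialization keeps iterates in the row space of X
lr = 1e-2
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y
print("fit error:", np.linalg.norm(X @ w - y))                           # ~0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```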

Final Conclusion

  • The effective capacity of several successful neural network architectures is large enough to shatter the training data.
  • Traditional measures of model complexity are not sufficient for a deep neural network.
  • Optimization continues to be easy even when generalization is poor.
  • SGD may be performing implicit regularization by converging to solutions with minimum L2-norm.

A subsequent paper, “A Closer Look at Memorization in Deep Networks” (Arpit et al., 2017), has challenged some of the views put forward in this paper, convincingly demonstrating qualitative differences between learning random noise and learning actual data.

(Figure source: Arpit et al., 2017, https://dl.acm.org/doi/pdf/10.5555/3305381.3305406?download=true)

The above experiment shows that a deep neural network takes significantly longer to memorize random noise than to learn the actual dataset. It also shows that fitting random noise produces a more complex function (requiring more hidden units per layer).

A second experiment in the same paper shows that regularizers do control the speed at which DNNs memorize.

To conclude: when fitting real data, a deep neural network first tries to discover patterns rather than resorting to brute-force memorization. However, if it finds no patterns (as with random noise), the network is perfectly capable of optimizing in a way that simply memorizes the training data. As both papers suggest, we need better tools to control the degree of generalization versus memorization; tools like regularization, batch normalization, and dropout are not perfect.

If you have any thoughts, comments, or questions, please leave a comment below or contact me on LinkedIn. Happy reading 🙂
