Perturbation Theory in Deep Neural Network (DNN) Training

Prem Prakash
Towards Data Science
7 min read · Mar 23, 2020


Vanishing Gradient, Saddle Point, Adversarial Training

Figure 1: Perturbation can help the model approach the correctness-attraction point (source).

Prerequisite: this post assumes the reader has an introductory-level understanding of neural network architectures and has trained some form of deep network, during which they might have faced issues related to training or the robustness of a model.

A small perturbation, or nudge, to the various components involved in training, such as gradients, weights, or inputs, can help DNN training overcome some of the issues one might bump into, for example the vanishing gradient problem, the saddle point trap, or the need for a robust model that resists malicious attacks (via adversarial training).

Typically, perturbation theory is the study of a small change in a system, often caused by a third object interacting with it. For example, the motion of a celestial object (a planet, a moon, etc.) around the sun is affected by the other planets and moons, even though the sun accounts for about 99.8% of the solar system's mass. In much the same spirit, in DNN training a small perturbation of its components (gradients, weights, inputs) is used to solve some of the issues one might encounter during training or with a trained model.

Disclaimer: it should be made crystal clear that there is no formal perturbation theory in deep learning or machine learning. Nevertheless, the term ‘perturbation’ is not alien to the machine learning literature; it is often used to describe a nudge applied to some component of the training process. This blog is an accumulation of perturbation-related techniques from that literature.

Vanishing Gradient

A neural network is simply a way to approximate a function that takes an input and produces an output. The final trained black-box model is actually a composition of many functions (f(g(x))): each hidden layer represents a function in the composition, followed by a non-linear activation function, then another hidden layer, and so on. The non-linear activation is there to induce non-linearity; otherwise the network is simply a series of matrix multiplications that can be reduced to a single matrix, which can only model linear functions, for which one layer would suffice.
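To see the point about linearity concretely, here is a small NumPy sketch (a toy example, not taken from any paper) showing that two stacked linear layers collapse into one matrix, while adding a non-linearity between them does not:

```python
import numpy as np

# Toy illustration: two linear layers with no activation between them
# are equivalent to a single merged linear layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))       # toy input
W1 = rng.normal(size=(5, 4))      # first layer weights
W2 = rng.normal(size=(3, 5))      # second layer weights

two_linear_layers = W2 @ (W1 @ x)
single_merged_layer = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, single_merged_layer))   # True

# Inserting a non-linearity (here tanh) breaks the equivalence,
# which is what lets additional layers add expressive power.
with_activation = W2 @ np.tanh(W1 @ x)
print(np.allclose(with_activation, single_merged_layer))     # False
```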

Neural network parameters (weights and biases) are initialized with some set of values and then updated based on the training data. The updates follow the gradient in the descent direction to find a minimum (in practice a local minimum, since the loss is non-convex and reaching the global minimum is not guaranteed), and this works in practice regardless of the depth or complexity of the network.

To compute the gradients we start from the last layer’s parameters and propagate backward (see a solved illustration of back-propagation) to the first layer’s parameters. As we move from the last layers toward the initial ones, the gradient computation for each layer’s parameters accumulates more multiplication terms (the gradient of a composition of functions, via the chain rule). The deeper the network, the more terms are multiplied for the early layers, and if these terms lie in the range (0, 1), which is the case for the derivatives of common activation functions (ignoring for the moment the hidden-layer weight contributions, which may or may not be in that range), for example the sigmoid (derivative in (0, 1/4]) or the hyperbolic tangent (derivative in (0, 1]), the product can vanish for the initial layers. Said differently, with vanishing gradients the last layer’s parameters learn just fine, but as we move toward the initial layers the gradient may vanish or become small enough to significantly increase training time. To avoid this, one can use ReLU or its variants, or batch normalization, but here we will discuss how a small perturbation of the gradient (from the paper by Neelakantan et al. (2015)) can also help assuage the problem.
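To get a feel for the numbers, the short sketch below (a rough illustration, not taken from the paper) shows how quickly a product of sigmoid-derivative terms shrinks with depth:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid's derivative is s * (1 - s), which never exceeds 1/4.
z = np.linspace(-5.0, 5.0, 1001)
print((sigmoid(z) * (1 - sigmoid(z))).max())   # ~0.25

# Back-propagating through n sigmoid layers multiplies n such factors
# (ignoring the weight terms), so even in the best case the gradient
# reaching the first layers shrinks geometrically with depth.
for n_layers in (5, 10, 20):
    print(n_layers, 0.25 ** n_layers)   # ~9.8e-04, ~9.5e-07, ~9.1e-13
```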

The gradients are perturbed by adding Gaussian-distributed noise with zero mean and a decaying variance; this works better than a fixed variance, because we do not want a constant-sized perturbation once the cost function (J) is close to converging. This process also helps avoid overfitting and can further result in lower training loss. The perturbed gradient at every training step t is computed as follows:

g_t ← g_t + N(0, σ_t²)

where the decaying variance σ_t² at training step t is given as:

σ_t² = η / (1 + t)^γ

where the value of η (eta) is typically taken to be 1 (it can be tuned, but should lie between zero and one), and the γ parameter is set to 0.55.

On an additional note, this process adds more stochasticity to the training process, which can further help avoid the plateau phase early in learning.
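Below is a minimal PyTorch sketch of this annealed gradient noise; the model, data, and training loop are dummies chosen only to illustrate where the perturbation fits:

```python
import torch
import torch.nn as nn

# A minimal sketch (assumed setup, not the paper's code) of annealed Gaussian
# gradient noise: g_t <- g_t + N(0, sigma_t^2), with sigma_t^2 = eta / (1 + t)^gamma.

def add_gradient_noise(model, step, eta=1.0, gamma=0.55):
    std = (eta / (1.0 + step) ** gamma) ** 0.5      # decaying standard deviation
    for p in model.parameters():
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * std)

# Dummy model and data, purely for illustration.
model = nn.Sequential(nn.Linear(10, 32), nn.Sigmoid(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

for t in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    add_gradient_noise(model, t)    # perturb the gradients before the update
    optimizer.step()
```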

Saddle Point

A saddle point is a stationary point on a surface whose shape is that of a saddle (as in horseback riding, shown in Figure 2). For the loss surface being minimized, a stationary point is a saddle point if it is a local minimum when one of the axes is fixed and a local maximum when the other is fixed (without loss of generality, we consider a three-dimensional surface for the purpose of explanation). Moving away from a saddle point along one axis, the function value increases, while along the other it decreases; by contrast, at a minimum the value increases in every direction, and at a maximum it decreases in every direction. A classic example is f(x, y) = x² − y², which has a saddle point at the origin: a minimum along the x-axis and a maximum along the y-axis.

Figure 2: Saddle shape curve (source)

Sometimes the training process can get stuck at a saddle point, since the gradient evaluates to zero and so produces no update to the weight parameters (w). To escape saddle points efficiently, one can perturb the weights. The perturbation is conditioned on the gradient of the weights: for example, when the L2-norm of the gradient falls below some constant c, apply the perturbation. The perturbation is given by:

w_t ← w_t + ξ_t

where w_t is the weight at the t’th training iteration and ξ_t is sampled uniformly from a ball centred at zero with a suitably small radius. It is not necessary to add uniform noise; one can also use Gaussian noise, although empirically this offers no additional advantage, and uniform noise is used for analytical convenience. Nor is it essential to add noise only when the gradient is small; it is up to us to decide when and how to perturb the weights. For example, an intermittent perturbation (every few iterations, with no condition) also works and retains a polynomial-time guarantee.
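Here is a toy NumPy sketch of this idea; the loss surface, learning rate, threshold c, and perturbation radius are all made up for illustration and are not the settings analysed by Jin et al.:

```python
import numpy as np

# Toy sketch of escaping a saddle point by perturbing the weights.
# The loss f(w) = w0^2 + 0.25*w1^4 - 0.5*w1^2 is chosen so that
# (0, 0) is a saddle point and (0, +/-1) are the minima.
rng = np.random.default_rng(0)

def grad(w):
    return np.array([2.0 * w[0], w[1] ** 3 - w[1]])

def sample_ball(dim, radius):
    """Uniform sample from a ball of the given radius centred at zero."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    return radius * rng.uniform() ** (1.0 / dim) * direction

w = np.array([1.0, 0.0])      # plain gradient descent from here converges to the saddle
lr, c = 0.1, 1e-3
for t in range(300):
    g = grad(w)
    if np.linalg.norm(g) < c:                     # gradient nearly zero: likely stuck
        w = w + sample_ball(w.size, radius=1e-2)  # perturb the weights
    else:
        w = w - lr * g

print(w)   # w1 has escaped the saddle and settled near one of the minima at +/-1
```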

Adversarial Training

The dictionary defines an adversary as a force that opposes or attacks; an opponent; an enemy; a foe. On a similar note, an adversarial example for a deep learning model is a malicious, spammy, or poisoned input that can fool the model into malfunctioning, predicting an incorrect output with high confidence.

To make a neural network model robust to such adversarial examples, Goodfellow et al. proposed perturbing the inputs. The perturbation produces a plurality of inputs, obtained by adding to each input the sign of the gradient of the loss (computed with respect to that training input) multiplied by a constant. The perturbed image is created as:

x̂ = x + ϵ · sign(∇x(J))

where x is the training input, x̂ is the new perturbed image, ∇x(J) is the gradient of the loss function (J) with respect to the training input x, ϵ is a predetermined constant, small enough to be discarded by a data storage apparatus due to its limited precision, and the sign function outputs 1 if its input is positive, -1 if negative, and 0 if zero. This technique has recently been patented under US patent law; one can read the patent in detail at this link. I highly recommend reading it, both for better clarification and as an example of how a software-algorithm patent is written.
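Below is a minimal PyTorch sketch of generating such perturbed inputs; the model, data, and ϵ value are dummies for illustration only:

```python
import torch
import torch.nn as nn

# A minimal sketch of the fast gradient sign method described above:
# x_hat = x + epsilon * sign(grad_x J).

def fgsm_perturb(model, loss_fn, x, y, epsilon=0.1):
    x = x.clone().detach().requires_grad_(True)   # track gradients w.r.t. the input
    loss = loss_fn(model(x), y)
    loss.backward()
    x_hat = x + epsilon * x.grad.sign()           # step in the direction of the gradient's sign
    return x_hat.clamp(0.0, 1.0).detach()         # keep pixel values in a valid range

# Dummy model and data, purely for illustration.
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()
x = torch.rand(32, 784)                           # dummy "images" in [0, 1]
y = torch.randint(0, 10, (32,))                   # dummy labels

x_hat = fgsm_perturb(model, loss_fn, x, y)
# For adversarial training, the model is then fit on both (x, y) and (x_hat, y).
```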

Training the network with the perturbed inputs extends the input distribution with this plurality. It makes the network more robust against malicious inputs; for example, it can help defend against pixel attacks in an image classification task.

Conclusion

We have seen how perturbation helps solve various issues related to neural network training and trained models. Here, we covered perturbation of three components (gradients, weights, inputs) associated with neural network training: perturbing gradients tackles the vanishing gradient problem, perturbing weights helps escape saddle points, and perturbing inputs guards against malicious attacks. Overall, perturbation in its different forms strengthens the model against various instabilities; for example, it can keep the model from settling at a correctness-wreckage point (Figure 1), since such a position will be tested by perturbations (of inputs, weights, or gradients) that push the model toward a correctness-attraction point.

As of now, perturbation is mainly a matter of empirical experimentation, designed from intuition to solve the problems one encounters. One needs to check whether perturbing a component of the training process makes sense intuitively, and then verify empirically whether it helps mitigate the problem. Nevertheless, in the future we will likely see more perturbation theory in deep learning, or machine learning in general, perhaps backed by theoretical guarantees.

References

[1]. Neelakantan, Arvind, et al. “Adding gradient noise improves learning for very deep networks.” arXiv preprint arXiv:1511.06807 (2015).

[2]. Jin, Chi, et al. “How to escape saddle points efficiently.” Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.

[3]. Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. “Explaining and harnessing adversarial examples.” arXiv preprint arXiv:1412.6572 (2014).

