Neural Networks: Problems & Solutions

Sayan Sinha
Towards Data Science
8 min read · Jul 28, 2017

Though the concept of the artificial neural network has been around since the 1950s, it is only recently that we have had hardware capable of turning theory into practice. Neural networks are supposed to be able to approximate any continuous function. Yet we are often stuck with networks that do not perform up to the mark, or that take a very long time to produce decent results. One should approach the problem statistically rather than going with gut feelings about the changes that should be made to the architecture of the network. One of the first steps should be proper preprocessing of the data. Besides mean normalisation and feature scaling, Principal Component Analysis (PCA) may be useful in speeding up training. If the dimension of the data is reduced to an extent where a sufficient amount of variance is still retained, one can save on space without compromising much on the quality of the data. Neural networks also train faster when they are fed lower-dimensional data.
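
As a starting point, mean normalisation and feature scaling take only a few lines of Octave. The following is a minimal sketch of the Normalise helper that the PCA snippet further down relies on; the author's actual implementation may differ in its details:

function [X_norm, mu, sigma] = Normalise(X)
  % Subtract the per-feature mean and divide by the per-feature standard
  % deviation, so that every feature has zero mean and unit variance.
  mu = mean(X);
  sigma = std(X);
  X_norm = (X - mu) ./ sigma;
end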

Reduction in dimension can be achieved by decomposing the covariance matrix of the training data into three matrices using singular value decomposition. The first of these matrices contains the eigenvectors. Moreover, the vectors in this matrix are orthonormal, so they may be treated as basis vectors. We pick the first few vectors out of this matrix, the number being equal to the number of dimensions we wish to reduce the data to. Transforming the original matrix (with the original dimensions) with the matrix obtained in the previous step gives us a new matrix which is both reduced in dimension and linearly transformed.
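
In symbols, if X is the normalised m × n data matrix, the steps above amount to

$$\Sigma = \tfrac{1}{m} X^{\top} X = U S V^{\top}, \qquad Z = X\,U_{:,\,1:K}$$

where the columns of U are the eigenvectors of the covariance matrix, S holds the corresponding singular values, and Z is the K-dimensional representation of the data.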

The sum of the lengths of the blue lines is to be minimised (2D to 1D)

The above steps are mathematical in nature, but essentially we have simply “projected” the data from the higher dimension onto a lower one, similar to projecting points in a plane onto a well-fitting line such that the total distance the points have to “travel” is minimised.

An Octave implementation of PCA would be:

function [U, S] = pca(X)
  % Compute the covariance matrix and decompose it with SVD.
  % U holds the eigenvectors (principal directions), S the singular values.
  [m, n] = size(X);
  Sigma = (1 / m) * (X' * X);
  [U, S, V] = svd(Sigma);
end

function Z = projectData(X, U, K)
  % Project the data onto the first K eigenvectors.
  U_reduce = U(:, 1:K);
  Z = X * U_reduce;
end

load('<data>');                       % loads the dataset into variable X
[X_norm, mu, sigma] = Normalise(X);   % mean normalisation and feature scaling
[U, S] = pca(X_norm);                 % principal component analysis
K = input("Enter reduced dimension: ");
Z = projectData(X_norm, U, K);        % K-dimensional representation of the data
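
How small can K be made? As mentioned earlier, a common rule of thumb is to keep just enough components to retain a large fraction of the variance, say 99%. Since svd also returns the singular values in S, this can be checked directly. The following is a small sketch; the 0.99 threshold is a typical choice rather than something prescribed here:

function K = chooseK(S, retain)
  % Pick the smallest K whose leading singular values account for
  % at least the requested fraction of the total variance.
  s = diag(S);
  retained = cumsum(s) / sum(s);
  K = find(retained >= retain, 1);
end

K = chooseK(S, 0.99);   % retain 99% of the variance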

One may use PCA for visualising data by reducing it to 3D or 2D. A more recommended method, however, is t-distributed stochastic neighbour embedding (t-SNE), which, unlike PCA, is based on a probability distribution. t-SNE tries to minimise the difference between the conditional probabilities in the higher and the reduced dimensions.

Conditional probability in the higher dimension
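
For reference, the formula in question, from van der Maaten and Hinton's original t-SNE paper, is

$$p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

where sigma_i is the bandwidth of the Gaussian centred on the point x_i.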

The conditional probability is high for points that are close together (measured by their Euclidean distance) and low for the ones that are far apart. Points are grouped according to the obtained distribution. The variance of the Gaussian is chosen per point, such that points in dense areas get a smaller variance than points in sparse areas.

Though George Cybenko proved in 1989 that a neural network with even a single hidden layer can approximate any continuous function, it may be desirable to introduce higher-degree polynomial features into the network in order to obtain better predictions. One way to do this is to increase the number of hidden layers: roughly speaking, each additional layer composes the features of the previous one, so deeper networks can represent higher-degree interactions between the inputs. This could also be achieved by raising the number of neurons in the existing layers, but that would require far more neurons (and hence increased computational time) than adding hidden layers would, for approximating a function with a similar amount of error. On the other hand, making neural nets “deep” results in unstable gradients. This problem can be divided into two parts, namely the vanishing and the exploding gradient problems.

The weights of a neural network are generally initialised with random values drawn roughly from a Gaussian distribution with mean 0 and standard deviation 1. This ensures that most of the weights lie between -1 and 1. The sigmoid function has a maximum derivative of 0.25 (attained when its input is zero). This, combined with the fact that the weights lie in a limited range, ensures that the absolute value of their product is also less than 0.25. The gradient with respect to an early-layer weight is a product of many such terms, each less than 0.25 in magnitude. The deeper the network, the more such terms appear in the product, and the result is the vanishing gradient problem.
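
To see where the 0.25 comes from: for the sigmoid sigma(z) = 1/(1 + e^{-z}),

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4},$$

so a gradient backpropagated through L layers picks up a factor of roughly $\prod_{l=1}^{L} w_l\,\sigma'(z_l)$. With every |w_l| < 1, each factor is smaller than 0.25 in magnitude and the product shrinks exponentially with depth.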

Backpropagation of weights

Essentially, the gradient of a perceptron in an early hidden layer (closer to the input layer) is given by a sum of products of the gradients of the deeper layers and the weights assigned to each of the links between them. Hence it is apparent that the early layers receive very little gradient. Their weights therefore change very little during learning and become almost stagnant in due course. The first layers are supposed to carry most of the information, yet they are the ones that get trained the least. The vanishing gradient problem can thus effectively stall the training of the network.

There might be circumstances in which the weights grow beyond one during training. In that case, one might wonder how vanishing gradients could still create problems. Well, this may instead lead to the exploding gradient problem, in which the gradients in the earlier layers become huge. This happens when the weights are large and the bias is such that the product of the weight and the derivative of the sigmoid activation also stays on the higher side. On the other hand, that is somewhat difficult to achieve, for a large weight tends to push the input of the activation function into a region where the derivative of the sigmoid is very low. This also helps establish the fact that the vanishing gradient issue is difficult to prevent. To address this problem, we choose other activation functions and avoid the sigmoid.

Though the sigmoid is a popular choice because it squashes its input between zero and one, and because its derivative can be written as a function of the sigmoid itself, neural networks relying on it may suffer from unstable gradients. Moreover, sigmoid outputs are not zero-centred; they are all positive. This means that, during backpropagation, all the gradients of the weights feeding into a unit will have the same sign, which depends only on the gradient flowing in from the next layer.

The most recommended activation function one may use is Maxout. A Maxout unit maintains two sets of weights and biases, computes both linear responses, and outputs the larger of the two. Fixing those weights to particular values gives special cases of Maxout, one of which is the Leaky Rectified Linear Unit (Leaky ReLU). Here the output equals the input (gradient 1) when the input is greater than 0, and is the input scaled by a small constant when it is less than 0, so the gradient there is small but non-zero.
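
As a quick illustration, here is a minimal Octave sketch of the two activations just described; the slope of 0.01 used for the leaky ReLU is a commonly used default, not a value fixed above:

function a = leaky_relu(z, alpha)
  % Pass positive inputs through unchanged; scale negative inputs by alpha.
  a = max(z, alpha * z);
end

function a = maxout(x, W1, b1, W2, b2)
  % A Maxout unit keeps two sets of parameters and returns, element-wise,
  % the larger of the two linear responses.
  a = max(W1 * x + b1, W2 * x + b2);
end

a = leaky_relu([-2; 0.5; 3], 0.01);   % gives [-0.02; 0.5; 3]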

Another problem encountered in neural networks, especially deep ones, is internal covariate shift. The statistical distribution of each layer's input keeps changing as training proceeds, which can significantly shift the domain a layer has to deal with and hence reduce training efficiency. A solution is to normalise the inputs for every mini-batch: we compute the mean and variance for each mini-batch instead of the entire data, and normalise the input before feeding it into almost every hidden layer. The process is commonly known as batch normalisation. Applying batch normalisation can assist in overcoming the issue of vanishing gradients as well.
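
A minimal sketch of the normalisation step for a single mini-batch in Octave could look as follows; gamma and beta are the learnable scale and shift from the batch normalisation paper, and epsilon is just a small constant for numerical stability:

function z_bn = batch_norm(z, gamma, beta, epsilon)
  % z holds one mini-batch of pre-activations, one example per row.
  % Normalise each unit using the mini-batch statistics, then apply
  % the learnable scale (gamma) and shift (beta).
  mu = mean(z);
  sigma2 = var(z, 1);                          % biased, per-unit variance
  z_hat = (z - mu) ./ sqrt(sigma2 + epsilon);
  z_bn = gamma .* z_hat + beta;
end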

Regularisation can be improved by implementing dropout. During training, certain nodes in some or all layers of the network are randomly switched off. Hence, in every iteration we effectively train a different sub-network, and the resulting network (obtained at the end of training) behaves like a combination of all of them. This also helps in addressing the problem of overfitting.
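
A minimal sketch of (inverted) dropout applied to a layer's activations in Octave; keep_prob is the probability of keeping a unit, and dividing by it keeps the expected activation unchanged so nothing extra is needed at test time:

function a_drop = dropout(a, keep_prob)
  % Randomly zero out units with probability (1 - keep_prob) and rescale
  % the survivors so the expected value of the activations stays the same.
  mask = rand(size(a)) < keep_prob;
  a_drop = (a .* mask) / keep_prob;
end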

Whatever tweaks are applied, one must always keep track of the percentage of dead neurons in the network and adjust the learning rate accordingly.

Certain diagnostics may be performed on the parameters to get better statistics. Bias and variance are the two important factors here. They can be estimated by plotting the output of the loss function (without regularisation) on the training and the cross-validation sets against the number of training examples.

(i) High bias (ii) High variance

In the figure above, the red curve represents the cross-validation data, while blue marks the training set. The first figure is roughly what is obtained when the architecture suffers from high bias: the model is too simple, so it gives fairly high errors even on the training set. Adding more features to the network (for instance more hidden layers, and hence higher-degree polynomial features) could be useful. If it suffers from high variance, the trained parameters fit the training set well but perform poorly on “unseen” data (the validation or test set). This happens because the model “over-fits” the training data. Getting more data could act as a fix, and reducing the number of hidden layers might also help in this case. Playing with the regularisation parameter helps as well: increasing its value can fix high variance, whereas decreasing it should assist in fixing high bias.

An Octave implementation of plotting diagnostic curves would be:

function [error_train, error_val] = ...
        learningCurve(X, y, Xval, yval, lambda)
  % Train on the first i examples and record the (unregularised) loss
  % on both the training subset and the full cross-validation set.
  m = size(X, 1);
  error_train = zeros(m, 1);
  error_val   = zeros(m, 1);

  for i = 1:m
    X_here = X(1:i, :);
    y_here = y(1:i);

    theta = train(X_here, y_here, lambda);

    error_train(i) = LossFunction(X_here, y_here, theta, 0);
    error_val(i)   = LossFunction(Xval, yval, theta, 0);
  end
end

lambda = input("Enter regularisation parameter: ");
[theta] = train(X_poly, y, lambda);

graphics_toolkit "gnuplot";
figure(1);
m = size(X_poly, 1);          % number of training examples
[error_train, error_val] = ...
        learningCurve(X_poly, y, X_poly_val, yval, lambda);
plot(1:m, error_train, 1:m, error_val);
xlabel('Number of training examples');
ylabel('Error');
legend('Train', 'Cross Validation');

Though it has been observed that a large amount of training data can improve the performance of almost any network, collecting a lot of data may be costly and time-consuming, and if the network suffers from high bias or from vanishing gradients, more data would be of no use. Hence simple diagnostics like the ones above should guide which step to take next.

References:
* Machine Learning, Stanford University
* Convolutional Neural Networks for Visual Recognition, Stanford University
* Michael A. Nielsen, “Neural Networks and Deep Learning”, Determination Press, 2015
* Batch Normalization — What the hey? (by Karl N.)
* Paper summary → Character-level Convolutional Networks for Text Classification (by Nishant Nikhil)
Code repository: https://github.com/americast/ML_ANg
