Convolutional Neural Networks: An Introduction

A brief journey from Perceptron to DenseNet

Oliver Kramer
Towards Data Science


Image created with an AI.

One may rightly wonder why another introduction to convolutional neural networks is necessary when there are numerous introductions to the same topic on the web. However, this article takes the reader from the simplest neural network, the perceptron, to the deep learning architectures ResNet and DenseNet, hopefully in an understandable and definitely in a concise way, covering many of the basics of deep learning in a few steps. So here we go, if you want.

Introduction

Machine learning is an important part of artificial intelligence (A.I.). It is used in numerous application areas like image recognition, speech recognition, wind power prediction, and drug design. Data science is an emerging area closely related to A.I. that focuses on collecting, visualizing, and preprocessing data for inference and learning models. A feature is a variable describing an observation. A pattern x ∈ ℝᵈ is a set of features. A label y ∈ ℝ is an observation we are interested in. Pattern-label pairs form the ground truth.

Supervised learning trains a classifier in a training phase by adapting its parameters to the training patterns. From a set of pattern-label pairs (xᵢ, yᵢ), i = 1, …, n, with xᵢ ∈ ℝᵈ, which form a d-dimensional training set, a machine learning model f is trained that is used to predict appropriate label information y for a novel pattern x. In classification the label is discrete, e.g., {0, 1}, and known as class or category, while in regression the label is continuous.
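
As a minimal illustration of this notation in Python with NumPy (the numbers are made up), a training set is simply an n × d matrix of patterns together with a vector of n labels:

import numpy as np

# Hypothetical toy data: n = 4 patterns with d = 3 features each,
# paired with binary class labels (the ground truth).
X = np.array([[0.1, 1.2, -0.3],
              [0.7, 0.4, 0.9],
              [1.5, -0.2, 0.6],
              [0.3, 0.8, -1.1]])  # patterns x_i in R^d, shape (n, d)
y = np.array([0, 1, 1, 0])        # labels y_i, shape (n,)

print(X.shape, y.shape)           # (4, 3) (4,)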

Perceptron

A perceptron [1] is a simple neural unit (f : ℝᵈ ➝ ℝ) that sums up the weighted inputs and feeds them to an activation function:

f(x) = σ(w · x + b)

Here, x ∈ ℝᵈ is the input to the perceptron, w ∈ ℝᵈ is the weight vector, b ∈ ℝ is the bias, and w · x is the sum of the component-wise products of all xᵢ and wᵢ.
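
A perceptron forward pass is only a few lines of NumPy; the input, weights, and bias below are made-up values:

import numpy as np

def relu(z):
    # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

def perceptron(x, w, b, activation=relu):
    # weighted sum of the inputs plus bias, fed through an activation function
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # input pattern (d = 3)
w = np.array([0.2, 0.4, -0.1])   # weight vector
b = 0.1                          # bias
print(perceptron(x, w, b))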

Figure 1: Plot of the ReLU activation function.

Function σ: ℝ ➝ ℝ is an activation function like the rectified linear unit, short ReLU, see Figure 1:

ReLU(z) = max(0, z)

ReLU can be computed very fast, which explains its success in deep learning. Among the zoo of further activation functions is the sigmoid function:

σ(z) = 1 / (1 + e⁻ᶻ)

mapping the input to values between 0 and 1, and the hyperbolic tangent:

tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)

mapping to the interval [-1, 1].
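
A minimal NumPy sketch of the three activation functions:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # max(0, z), element-wise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                  # maps to (-1, 1)

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))
print(sigmoid(z))
print(tanh(z))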

Layering multiple perceptrons allows separating more complex sets, including non-linearly separable ones like the XOR problem. A dense layer puts numerous perceptrons into one layer. All neurons of successive layers are connected to each other, i.e., are densely connected. Dense layers are also known as fully-connected (FC) layers. The input layer’s number of neurons corresponds to the dimensionality of the pattern. A multi-layer perceptron (MLP) consists of an input and an output layer with one or more hidden layers, see Figure 2.

Figure 2: Left: Illustration of MLP with one hidden layer. Right: Multiple layers allow non-linear classification boundaries.

Each edge in this MLP graph with neurons as nodes is equipped with its own weight. If x ∈ ℝᵈ is the input to a layer and W ∈ ℝ^{k × d} the weight matrix of the layer, then

z = W x + b

is the weighted sum with bias b ∈ ℝᵏ. The output is a k-dimensional vector, whose elements are fed element-wise to the activation function σ, resulting in a vector of activations of the corresponding layer. The extension to tensors is natural if channel information (e.g., the RGB values of images) needs to be processed. Information is passed from the input layer to the output layer. Hence the network architecture is called a feed-forward network, in contrast to recurrent networks that have backward connections.
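
The forward pass through dense layers is a repeated matrix-vector product followed by an element-wise activation; a small NumPy sketch with made-up layer sizes and random weights:

import numpy as np

def dense(x, W, b, activation):
    # one fully-connected layer: z = W x + b, then element-wise activation
    return activation(W @ x + b)

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, k, out = 4, 8, 1                       # input, hidden, and output sizes (made up)
x = rng.normal(size=d)                    # input pattern
W1, b1 = rng.normal(size=(k, d)), np.zeros(k)
W2, b2 = rng.normal(size=(out, k)), np.zeros(out)

h = dense(x, W1, b1, relu)                # hidden layer activations
y_hat = dense(h, W2, b2, sigmoid)         # network output
print(y_hat)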

Network weights and biases are usually initialized with small values, e.g., drawn uniformly from a range like [−0.01, 0.01]. Alternatively, Glorot initialization aims at making the variance of the outputs of a layer equal to the variance of its inputs. Glorot draws samples from a normal distribution centered on 0 with a standard deviation of √(2 / (nᵢₙ + nₒᵤₜ)), where nᵢₙ and nₒᵤₜ are the numbers of input and output units of the layer.
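
A minimal NumPy sketch of Glorot normal initialization, using the standard deviation stated above:

import numpy as np

def glorot_normal(fan_in, fan_out, seed=0):
    # zero-mean normal with standard deviation sqrt(2 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = glorot_normal(fan_in=64, fan_out=32)
print(W.shape, round(W.std(), 3))   # empirical std close to sqrt(2/96) ≈ 0.144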

Training

Learning is weight adaptation. The typical setting in the training phase is to adapt all weights and biases of the network such that the training patterns are mapped by the MLP to their correct labels.
The main weight adaptation algorithm in neural learning is backpropagation. Backpropagation performs gradient descent on the weights to minimize the loss function L, which can, e.g., be the cross-entropy or the MSE. Gradient descent is an optimization method that moves the search in the direction opposite to the gradient.

In regression the MSE is most often used as loss function:

L = 1/n ∑ᵢ₌₁ⁿ (yᵢ − y’ᵢ)²

for labels y₁, …, yₙ ∈ ℝ and outputs y’₁, …, y’ₙ ∈ ℝ. The introduced loss functions are used to adapt the weights of the neural networks.

The gradient ∇L(w) at point w ∈ ℝᵏ is a k-dimensional vector of partial derivatives ∂L(w)/∂wᵢ w.r.t. each parameter wᵢ, i = 1, …, k:

∇L(w) = (∂L(w)/∂w₁, …, ∂L(w)/∂wₖ)ᵀ

Gradient descent is usually faster than undirected randomized search. For regression problems, the loss is typically the sum of squared residuals, see the MSE above; for classification a common loss function is the cross-entropy loss.

If the loss function L is differentiable, the partial derivatives ∂L/∂wᵢ w.r.t. the weights can be computed, leading to the gradient ∇L(w). Gradient descent performs minimization by going into the opposite direction of the gradient with a learning rate η:

w ← w − η ∇L(w)

This update is also known as vanilla update.
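
A minimal sketch of the vanilla update for a one-dimensional linear model with MSE loss (the data points are made up); after a few hundred steps the weight approaches the least-squares solution:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])    # roughly y = 2x

w, eta = 0.0, 0.01                    # initial weight and learning rate
for step in range(200):
    y_pred = w * x
    # dL/dw for L = 1/n * sum (y - w*x)^2
    grad = -2.0 / len(x) * np.sum((y - y_pred) * x)
    w = w - eta * grad                # move against the gradient
print(w)                              # close to 2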

For a simple perceptron f(x) = σ(w · x + b) with input pattern x, target label y, sigmoid activation function, and loss L = 1/2 (y − σ(z))² with z = w · x + b, we derive backpropagation as an example:

∂L/∂w = ∂L/∂σ(z) · ∂σ(z)/∂z · ∂z/∂w

with the chain rule. The factor 1/2 cancels against the 2 from the square; applying the chain rule again and using the sigmoid’s derivative σ’(z) = σ(z)(1 − σ(z)), we get:

∂L/∂w = −(y − σ(z)) · σ(z) (1 − σ(z)) · x
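
The derived gradient can be checked numerically with finite differences; a short NumPy sketch with made-up values:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    return 0.5 * (y - sigmoid(np.dot(w, x) + b)) ** 2

def grad_w(w, b, x, y):
    # analytic gradient from the derivation above
    s = sigmoid(np.dot(w, x) + b)
    return -(y - s) * s * (1.0 - s) * x

x, y = np.array([0.4, -0.7, 1.2]), 1.0     # made-up pattern and label
w, b = np.array([0.1, 0.2, -0.3]), 0.05    # made-up parameters

# numerical check with central finite differences
eps = 1e-6
approx = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    approx[i] = (loss(w + e, b, x, y) - loss(w - e, b, x, y)) / (2 * eps)

print(grad_w(w, b, x, y))   # analytic gradient
print(approx)               # should match closely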

In one epoch all patterns of the complete training set are presented to the network. A successful variant for weight updates is called stochastic gradient descent (SGD). SGD updates the weights and computes the gradients after the presentation of one pattern. Hence, it tries to approximate the true gradient by considering only one training sample at a time. Training can be efficient if the training set is shuffled and divided into disjoint or overlapping batches. In mini-batch mode the neural network is trained with gradients computed on a subset of the training samples. SGD is less robust than mini-batch training, but it allows faster steps. Further, it may get stuck in local optima less frequently. A local optimum has a better objective value than its neighborhood, but may not be the global optimum, see Figure 3.

Figure 3: Illustration of local optimum.

Mini-batch gradient descent is thus a compromise between true gradient descent on the whole training set and SGD.
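
As a sketch of these training modes in Keras (assuming TensorFlow is installed; the data below is random and only serves as a placeholder), the batch_size argument of fit switches between SGD (batch_size=1) and mini-batch training:

import numpy as np
from tensorflow import keras

# made-up training data: 256 patterns with 20 features and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20)).astype("float32")
y = (X[:, 0] > 0).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss="binary_crossentropy", metrics=["accuracy"])

# batch_size=1 corresponds to SGD proper, larger values to mini-batch training;
# shuffle=True reshuffles the training set every epoch
model.fit(X, y, epochs=5, batch_size=32, shuffle=True, verbose=0)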

Overfitting

A model that concentrates on adapting to the training patterns may overfit to the training data and learn complex adaptations although the desired model structure may be less complex. This impairs the model’s ability to generalize. Overfitting can be avoided with regularization, cross-validation, and dropout.

Regularization adds a penalty to the classification or regression error L(w) based on the magnitude of the weights, e.g., the sum of squares of all weights:

L’(w) = L(w) + α ‖w‖²

with regularization parameter α, where

‖w‖ = √(∑ᵢ wᵢ²)

is the norm of weight vector w. Large weights are associated with overfitting, while small weights are supposed to prevent it. A penalty on the weights enforces small weights and hence prevents overfitting.
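
In Keras, such an L2 penalty can be attached to a layer via a kernel regularizer; a minimal sketch (the value α = 0.01 is an arbitrary choice):

from tensorflow import keras
from tensorflow.keras import regularizers

# dense layer whose weights contribute alpha * ||w||^2 to the loss
layer = keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=regularizers.l2(0.01),
)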

Cross-validation (CV) uses a set of training samples for training and evaluates the model quality on an independent validation set. Generalizing this to more than one such split, N-fold cross-validation divides the randomly shuffled dataset into N disjoint subsets, see Figure 5. Each subset is left out once to serve as the validation set. The process is repeated for all N folds and the error is averaged. A final evaluation on an independent test set can be applied to illustrate the model quality.

Figure 5: CV repeatedly (here 3 times for 3-fold CV) leaves out one (grey) validation set and trains the model on the remaining (blue) folds.

An extreme case is leave-one-out cross-validation (LOO-CV) with N = n, i.e., every pattern is its own fold. It is useful from a statistical viewpoint, but due to the large number of training runs it is very inefficient and hence mostly applicable to small datasets.
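
A minimal sketch of N-fold cross-validation using scikit-learn's KFold on made-up data (3 folds as in Figure 5); the actual model training is only indicated by a comment:

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 5)), rng.integers(0, 2, size=30)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]
    # train the model on X_train / y_train, evaluate on X_val / y_val,
    # then average the validation errors over the folds
    print(f"fold {fold}: {len(train_idx)} training, {len(val_idx)} validation samples")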

Controlling the error on a left-out validation set allows early stopping, i.e., stopping the training process when the validation error grows while the training error keeps decreasing.

Dropout turns off each neuron with a probability p ∈ [0, 1], called the dropout rate, during the training phase. For each hidden layer, each training sample, and each iteration, a random fraction p of the activations is ignored, also said to be zeroed out. In the testing phase, all activations are used, but scaled by the factor 1 − p. This accounts for the missing activations during the training phase. Dropout is also used in convolutional layers, where it likewise randomly zeroes out activations.

Figure 6 illustrates dropout. The grey neurons do not take part in the training process. Their weights are not updated.

Figure 6: During dropout neurons are deactivated with probability p, here illustrated for the grey neuron during training.

Dropout forces the network to learn multiple independent representations of classes, which prevents overfitting. It can also be understood as ensemble learning of multiple subnetworks, which combine their decisions.
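
In Keras, dropout is available as a layer inserted after the layers whose activations should be zeroed out; a minimal sketch with a dropout rate of p = 0.5 (note that Keras uses inverted dropout, i.e., it rescales during training instead of at test time):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    keras.layers.Dropout(0.5),    # zero out half of the activations during training
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])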

Convolution

Convolutional networks were introduced early on [2], but led to a breakthrough in image recognition with AlexNet by Alex Krizhevsky et al. in the ImageNet Large Scale Visual Recognition Challenge in 2012. Figure 7 shows an exemplary network architecture. Convolutional layers act as translation-invariant feature learners.

Figure 7: Architecture of convolutional network with filters and channels.

Let X be an H × W × C-dimensional input. A convolutional layer consists of C’ filters or kernels of shape m × m × C. The convolution operation is performed per channel as an element-wise multiplication, and the results across all channels are summed up to produce a single value. This operation is performed per filter kernel, i.e., C’ times, resulting in C’ output channels for the output A.

The convolution operation moves an m × m-dimensional kernel matrix w over the input volume, from top left to bottom right, see Figure 8, computing:

a(i, j) = ∑ₖ ∑ₗ w(k, l) · x(i + k − 1, j + l − 1) with k, l = 1, …, m

Filter kernel w is adapted via backpropagation and learns useful feature detectors. High similarities between input x and w yield high activations a. The activations are summed over the input channels, thus consisting of C summands. The process is repeated C’ times, resulting in C’ output channels for the output A.

Figure 8: Example for 2D convolutional process.
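
The 2D convolution of a single channel can be sketched in a few lines of NumPy; the input and kernel below are made up, and, as in most deep learning libraries, the operation is technically a cross-correlation. For C input channels the per-channel results are summed, and the whole procedure is repeated for each of the C’ filters:

import numpy as np

def conv2d_single_channel(x, w):
    # slide an m x m kernel over a single-channel input (stride 1, no padding),
    # computing the weighted sum at each position
    H, W = x.shape
    m = w.shape[0]
    out = np.zeros((H - m + 1, W - m + 1))
    for i in range(H - m + 1):
        for j in range(W - m + 1):
            out[i, j] = np.sum(x[i:i + m, j:j + m] * w)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)                     # made-up 5x5 input
w = np.array([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])        # made-up 3x3 kernel
print(conv2d_single_channel(x, w))                               # 3x3 feature map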

For example, the first convolutional layer of a network for CIFAR-10 with 32 × 32 color images may employ 64 filter kernels, each consisting of three 3 × 3 matrices (one per color channel). The layer would output 64 channels.

The step size by which a filter kernel is moved over the input volume is called the stride. It has a vertical and a horizontal component. A stride of one for both axes is a frequent choice, see Figure 9. Higher values reduce the computational and the memory complexity. To avoid shrinking dimensionalities, the borders of the input volume can be filled with zeros, e.g., by adding m − 1 rows and columns of zeros in total per spatial dimension. This process is called zero padding.

Figure 9: Illustration of a 1 × 1-stride.
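
A sketch of the CIFAR-10 example above in Keras (assuming TensorFlow is installed): 64 filters of size 3 × 3, stride one on both axes, and zero padding so that the spatial dimensions do not shrink:

from tensorflow import keras

inputs = keras.layers.Input(shape=(32, 32, 3))       # 32 x 32 RGB images
x = keras.layers.Conv2D(filters=64, kernel_size=(3, 3),
                        strides=(1, 1), padding="same",
                        activation="relu")(inputs)
model = keras.Model(inputs, x)
model.summary()   # output shape: (None, 32, 32, 64), i.e., 64 channels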

A bias b can be added to the output volume, i.e., to all activations. The output of a convolutional layer is passed through an activation function, A’ = σ(A), e.g., ReLU. Pooling layers reduce the dimensionality by concentrating on the maximal or average activations. The number of filter kernels usually increases with the depth of the network.

Pooling

Convolutional layers lead to a remarkable increase in the number of activations. To scale this number down, pooling layers are used, see Figure 10. Pooling is a channel-wise operation.

Figure 10: Max pooling returns the maximum value within the pooling window, which is usually shifted with a stride that corresponds to the window size (2 × 2 here).

Let A be the feature map of activations. Max pooling moves an m × m square over each channel, choosing the maximum value:

a’(i, j) = max over 1 ≤ k, l ≤ m of a((i − 1) · m + k, (j − 1) · m + l)

Average pooling is the corresponding process using the average value of each square. Pooling layers can also be used as a replacement for a dense-layer head (the last layers of activations). Global average pooling applies to the whole spatial extent of each channel.

Another way to reduce the amount of information is a 1 × 1 × C convolutional module, which computes a weighted sum over all C input channels, resulting in an H × W-dimensional volume. C’ of such modules would produce C’ output channels.
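
The pooling variants and the 1 × 1 convolution can be sketched in Keras as follows; the input shape is made up:

from tensorflow import keras

inputs = keras.layers.Input(shape=(32, 32, 64))
# 2 x 2 max pooling with stride 2 halves the spatial dimensions per channel
x = keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(inputs)
# a 1 x 1 convolution computes a weighted sum over the input channels,
# here reducing 64 channels to 16
x = keras.layers.Conv2D(16, kernel_size=(1, 1), activation="relu")(x)
# global average pooling collapses each channel to a single average value
# and can replace a dense-layer head
x = keras.layers.GlobalAveragePooling2D()(x)
model = keras.Model(inputs, x)
model.summary()   # final output shape: (None, 16)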

VGG-19 is an example of an early convolutional network. It comprises 19 weight layers with the following configuration for an input of 224 × 224 RGB images:

2 × conv3-64, maxpool, 2 × conv3-128, maxpool, 4 × conv3-256, maxpool, 4 × conv3-512, maxpool, 4 × conv3-512, maxpool, FC-4096, FC-4096, FC-1000, softmax

Here, FC means fully-connected layer and convm-k denotes a convolutional layer with k filters of size m × m. The three FC layers are also known as the MLP head. VGG-19 employs about 144 million parameters and is one example of a deep architecture.
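
The pretrained VGG-19 ships with Keras and can be loaded in two lines (this downloads the ImageNet weights on first use); the summary reports roughly 144 million parameters:

from tensorflow import keras

# include_top=True keeps the three fully connected layers of the MLP head
vgg = keras.applications.VGG19(weights="imagenet", include_top=True)
vgg.summary()   # 16 convolutional + 3 FC weight layers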

ResNet

The more parameters and weights a network employs, the more it can represent. Hundreds of layers are possible, but computationally expensive to train, and very deep networks are prone to overfitting. Moreover, gradient updates become smaller in each layer as errors are propagated backwards through the network in a multiplicative manner. This effect, which increases with the number of layers, is known as the vanishing gradient problem.
ResNet [3] addresses this problem with shortcut connections that carry the identity X and add it to a module’s output R(X), which thus becomes a residual with respect to the input. Residuals are easier to learn than full mappings: a ResNet module is given the identity and only learns the deviation from it. The sum of the learned residual and the identity forms the module’s output A:

A = R(X) + X

and the residual R has to be learned. To match the dimensions of R(X) and X, a projection matrix W can be added, resulting in:

A = R(X) + W X

Figure 11 shows an exemplary ResNet module based on two convolutional layers.

Figure 11: Identity shortcut connection of a ResNet module.

The ResNet principle can be applied to all kinds of modules consisting of fully-connected or convolutional layers. Even the shortcut connection itself may consist of a convolutional layer that bypasses a module of convolutional layers.
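
A possible ResNet module in the Keras functional API, kept minimal (batch normalization, which the original architecture uses, is omitted here); the 1 × 1 convolution plays the role of the projection W when the channel dimensions differ:

from tensorflow import keras

def residual_block(x, filters):
    # two 3 x 3 convolutions; the input is added back via the identity shortcut
    shortcut = x
    r = keras.layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    r = keras.layers.Conv2D(filters, (3, 3), padding="same")(r)
    if shortcut.shape[-1] != filters:
        # 1 x 1 convolution as projection to match the channel dimensions (W X)
        shortcut = keras.layers.Conv2D(filters, (1, 1), padding="same")(shortcut)
    out = keras.layers.Add()([r, shortcut])     # A = R(X) + X
    return keras.layers.Activation("relu")(out)

inputs = keras.layers.Input(shape=(32, 32, 16))
outputs = residual_block(inputs, filters=32)
model = keras.Model(inputs, outputs)
model.summary()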

The search for an optimal model f* can be seen as a search in a function class F. In this space a neural network is a function f ∈ F defined by its architecture, hyperparameters, and weights. Changing the function class does not necessarily move it closer to f*. A ResNet module realizes the identity plus a learned residual part and thus represents a nested function class, i.e., it holds F ⊆ F’, which guarantees that the larger class is at least as close to the optimum f*, see Figure 12.

Figure 12: Nested function classes (right), as achieved with ResNet modules, move closer to the optimal function f∗; non-nested classes (left) do not necessarily.

The ResNet architecture from the original paper employs 152 layers, starting with one 7 × 7, 64-filter convolution with stride 2 and a 3 × 3 max pooling with stride 2, followed by the following convolutional ResNet modules:

3 × [1 × 1, 64; 3 × 3, 64; 1 × 1, 256], 8 × [1 × 1, 128; 3 × 3, 128; 1 × 1, 512], 36 × [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024], 3 × [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048]

each consisting of three convolutions. At the end of the network, average pooling is employed with a 1000-dimensional fully-connected layer and softmax.
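
The pretrained 152-layer ResNet is also available through Keras applications (this downloads the ImageNet weights on first use):

from tensorflow import keras

resnet = keras.applications.ResNet152(weights="imagenet", include_top=True)
resnet.summary()   # ends in global average pooling and a 1000-way softmax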

Conclusions

Convolutional neural networks are not only among the most important methods in artificial intelligence; the underlying methods and principles are also used in many other deep learning algorithms. For example, backpropagation is the optimization backbone of almost all deep learning methods, and convolutional layers are part of numerous neural architectures. Their application is not limited to image recognition; they have proven useful in many other fields as well. If you want to dive deeper into the above topics and implementations, see the section on further reading for some references. Python is the dominant programming language for deep learning, and Keras, along with PyTorch, gives you easy access to convolutional neural networks.

All images unless otherwise noted are by the author.

Further Reading

Some (of many possible) References

[1] F. Rosenblatt, The perceptron — A perceiving and recognizing automaton, Cornell Aeronautical Laboratory, Report №85–460–1 (1957)

[2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural computation, 1(4):541–551 (1989)

[3] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR, pp. 770–778 (2016)

[4] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks. CVPR, 2261–2269 (2017)


Professor for Computational Intelligence, University of Oldenburg, Germany