
Image Classification For Beginners

VGG and ResNet architectures from 2014 and 2015

Images from Unsplash – modified by author

Image classification was the first topic I taught at Interview Kickstart to prepare professionals for landing jobs in top tech companies. I wrote this post while preparing for one of my lectures there, so if you are unfamiliar with this topic, this intuitive explanation might help you too.


In this post, we look at the VGG and ResNet models; both are seminal and influential works in the development of convolutional neural networks (CNNs) for computer vision. VGG [2] was proposed in 2014 by a research group at Oxford, and ResNet [3] was proposed by Microsoft researchers in 2015.

Let’s get started.

What Is VGG?

VGG stands for Visual Geometry Group, a research group at the University of Oxford. In 2014, they designed a deep convolutional neural network architecture for the image classification task and named it after themselves: VGG [2].

VGG Network Architecture

This network comes in a few configurations; all share the same overall architecture and differ only in the number of layers. The most famous ones are VGG16 and VGG19. VGG19 is deeper and performs slightly better than VGG16. For the sake of simplicity, we focus on VGG16.

VGG16’s architecture is depicted in the image below. As we see, it has 16 weight layers: 13 convolutional layers and 3 fully connected layers.

VGG16 architecture – image by the author

It is a very simple architecture: it consists of six blocks, where the first five blocks contain convolutional layers followed by a max pool, and the sixth block contains only fully connected layers.

All convolutional layers use 3×3 filters with stride=1, and all max pooling layers are 2×2 with stride=2, so they halve the width and height of the input feature map. This is called downsampling, as it reduces the spatial size of the feature map.

Note that the convolutional layers start with 64 filters, and the number of filters doubles after every pooling layer until it reaches 512. All convolutional layers use "same" padding to keep the output the same size as the input, and they all use the ReLU activation function. Below, we explain these concepts:

Same padding: Same padding is a padding technique that ensures the output of a convolution has the same height and width as the input. It works by padding the input with zeros evenly on all sides; for a 3×3 filter with stride=1, one pixel of zero padding on each side is enough to keep the spatial dimensions unchanged.

Max pooling: As we saw above, a 2×2 max pool with stride=2 is applied after each convolutional block. Max pooling outputs the maximum value in each window; with stride=2 it halves the spatial dimensions while retaining the strongest activations for robust feature detection. This reduction also improves computational efficiency.

ReLU activation: As we mentioned, the activation function VGG uses is ReLU. ReLU sets negative values to zero and passes positive values unchanged. The non-linearity it adds increases the model’s expressiveness and helps it detect complex patterns. VGG applies ReLU after every convolutional layer.

image by author
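To make these three concepts concrete, here is a minimal sketch of one VGG-style block in PyTorch. This is just an illustration (the framework choice and layer sizes are mine, not taken from the paper): two 3×3 convolutions with "same" padding and ReLU, followed by a 2×2 max pool that halves the spatial dimensions.

import torch
import torch.nn as nn

# One VGG-style block: 3x3 convolutions with "same" padding, ReLU, then 2x2 max pooling.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),  # padding=1 keeps height and width
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # halves height and width
)

x = torch.randn(1, 3, 224, 224)  # a dummy RGB image of size 224x224
print(block(x).shape)            # torch.Size([1, 64, 112, 112])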

Let’s go through the VGG16 architecture layer by layer:

  • Let’s say the input is a color image with some height and width; its size is (height, width, 3), since RGB has 3 channels.
  • The first layer has 64 filters and applies 3×3 convolutions with "same" padding, so its output feature map is (height, width, 64).
  • The second layer is the same as the first, so its output feature map is also (height, width, 64).
  • The third layer is a 2×2 max pool with stride=2, so it shrinks the size to (height/2, width/2, 64).
  • The fourth and fifth layers are conv3-128 (3×3 convolutions with 128 filters) with "same" padding, so they change the output size to (height/2, width/2, 128).
  • The sixth layer is a 2×2 max pool again, and it changes the output size to (height/4, width/4, 128).
  • If we continue like this, we see that by the time the data reaches the first fully connected layer, it has shape (height/32, width/32, 512). The number of channels has increased from 3 to 512, while the height and width have shrunk by a factor of 32! Think of it as compressing the spatial information and capturing patterns in the channels instead (a quick sanity check of these shapes follows this list).
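Here is that sanity check, assuming a recent version of torchvision is installed (this snippet is mine, not part of the original post): we push a dummy 224×224 image through torchvision’s VGG16 feature extractor and inspect the output shape.

import torch
from torchvision.models import vgg16

model = vgg16(weights=None)      # an untrained VGG16; we only care about shapes here
x = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
features = model.features(x)     # the 13 convolutional layers and 5 max pools
print(features.shape)            # torch.Size([1, 512, 7, 7]), i.e. (height/32, width/32, 512)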

VGG Computational Cost

VGG16 is one of the largest CNN models; it has 138 million parameters. The image below shows two variants of VGG: VGG16 (with 16 layers) and VGG19 (with 19 layers).

image from [1]

We see that VGG16 and VGG19 are among the largest CNN models in terms of the number of operations required for one forward pass. Note that the number of operations is roughly proportional to the number of parameters the model has. Later in this post, we will look at the ResNet [3] model, which, as we will see, is much smaller than VGG and outperforms it.
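If you want to verify the parameter count yourself, a quick sketch (again assuming torchvision is available) sums the sizes of all parameter tensors and comes out to roughly 138 million for VGG16:

from torchvision.models import vgg16

model = vgg16(weights=None)
num_params = sum(p.numel() for p in model.parameters())  # total number of trainable parameters
print(f"{num_params:,}")                                 # roughly 138 million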

Why Was VGG Proposed?

Prior to VGG, CNN models had fewer layers and larger convolutional filters. The VGG network was introduced to show that a simple CNN built only from stacked 3×3 convolutional layers can work as well as more complex models with larger filters.

It also showed the importance of depth in convolutional networks: stacking many small 3×3 convolutional layers effectively simulates larger receptive fields. For example, two stacked 3×3 convolutions have a 5×5 receptive field, and three have a 7×7 receptive field, while using fewer parameters. At the time VGG was introduced, it achieved top results on the image classification task on the ImageNet dataset.


What Is ResNet?

ResNet, short for Residual Network, was proposed by Microsoft researchers in 2015 [3]. Before we dive into its architecture, let’s first see why it was proposed.

Why Was ResNet Proposed?

In a nutshell, ResNet was proposed to address the vanishing gradient problem in very deep networks. Let’s take a closer look:

As we saw in the case of VGG, deep neural networks are extremely powerful. But they also have more parameters, so they take longer to train and cost more in computation. In addition, they need more training data.

Besides computational cost and the size of the training data, there is another obstacle in training deep neural networks. As the image below shows, when we train shallow neural networks, the training loss starts decreasing in the early epochs. But in deep neural networks, the training loss decreases only minimally in the early epochs, and only after the first few epochs does it suddenly drop. This is a big obstacle in training deep neural networks in practice.

Now why does it happen?

training loss decreasing by epoch in shallow and deep neural networks – image by author

It happens for two reasons:

  1. In the early layers of a deep NN, the vanishing gradient problem kicks in; i.e., the gradient of the loss has almost vanished by the time it reaches these layers, so their parameters receive very small updates.
  2. In the late layers of a deep NN, very little of the original signal (i.e., the original input) arrives. Why? Because the signal has been multiplied by the weights of all previous layers and passed through activations that squash it toward zero. So in the early epochs, the output of these layers is almost random noise, the gradient of the loss with respect to these outputs is noise as well, and the updates to these layers’ parameters are essentially meaningless.
deep neural networks learn very little in the first few epochs – image by author

This is why we do not see much improvement in the first few epochs of training deep NNs.

To address this, we’d like to find a way for the input to reach the late layers, and for the gradients to reach the early layers. We can achieve both using skip connections.

Skip Connection

The idea of a skip connection is to group the layers of the network into blocks, and for each block let the input both go through the block and around it, like this:

image by author

Within each block, the layers pass their data forward normally, and between blocks, we have a new type of connection.

As we see above, this connection works by combining the input to the block with the output from the block. So the data has two paths to flow through: one through the block, and one around the block.

So one residual block looks like this:

residual block – image by author

The "+" sign above shows the "combine" symbol which combines input tensor and output tensor. It has to be an operation that passes gradient undisturbed. The "+" operation can be either of the following:

  1. element-wise addition of the two tensors
  2. concatenation of the two tensors

It is worth emphasizing that a residual block is called "residual" because it enables a residual learning approach. Each residual block learns a residual function with reference to its input, rather than directly fitting a desired underlying mapping.

In plain feed-forward networks, we learn the direct mapping from input to output, i.e. x -> f(x). In residual blocks, however, as we see above, instead of trying to learn the direct mapping, each residual block learns a residual function f(x) and outputs f(x) + x. This residual function represents the modification that needs to be made to the input to get the desired output.

image taken from [3]
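To make the residual idea concrete, here is a minimal sketch of a residual block in PyTorch (the layer sizes are illustrative and mine, not taken from the paper): the block computes f(x) with two 3×3 convolutions and outputs f(x) + x.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # f(x): two 3x3 convolutions that preserve the spatial size and channel count
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # combine the block's output with the skip connection
        return self.relu(self.f(x) + x)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])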

ResNet Is Easier to Train

A network made of residual blocks is called a residual network, or ResNet. Residual networks have a couple of advantages that make them faster and easier to train.

  1. First, every residual block augments its input: since the block passes the input around itself unchanged, its job is not to figure out what important information the input contains, but rather what additional information can be added to the input to reach the output. This turns out to be a simpler job.
  2. Second, the network has shorter gradient paths: since each block has a path that goes around it, the gradient flows through that path too, so every layer in the network has a relatively short path by which loss gradients can arrive.
gradient flows in two paths: through the layers and around the blocks – image by author

Concerns For ResNet

There are a few concerns around residual blocks that we have to pay attention to when we design residual networks:

  1. To add or concatenate the input and output of a residual block, the two tensors must have compatible shapes (identical shapes in the case of element-wise addition). Obviously, if we force every layer’s output to have the same shape as its input, this issue never arises, but enforcing that constraint limits the model’s capacity.
  2. If we use concatenation instead of element-wise addition to combine each block’s input and output tensors, the channel count grows at every block and we end up with an explosion of parameters. So we should not overuse concatenation and should prefer addition if our network is deep; often, concatenation is used in at most one or two blocks. The short sketch after this list contrasts the two operations.
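Here is that sketch, using dummy tensors purely for illustration: element-wise addition keeps the shape, while concatenation along the channel dimension doubles the channel count.

import torch

x  = torch.randn(1, 64, 56, 56)     # input to the block
fx = torch.randn(1, 64, 56, 56)     # output of the block (same shape, as required for addition)

added  = fx + x                     # element-wise addition: shape stays (1, 64, 56, 56)
concat = torch.cat([fx, x], dim=1)  # concatenation along channels: shape becomes (1, 128, 56, 56)
print(added.shape, concat.shape)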

ResNet Architecture

Now that we know about residual blocks and skip connections, we can see that ResNet [3] for image classification is built by stacking multiple residual blocks. This lets us construct very deep networks of over 100 layers; the original ResNet comes in variants ranging from 18 to 152 layers [3].

Each residual block consists of convolutional layers, batch normalization, and ReLU activation functions. As we see in the image below, batch normalization is applied after every convolutional layer; it normalizes activations by subtracting the mean and dividing by the standard deviation. This operation stabilizes training.

Residual block – image by author
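A sketch of such a block in PyTorch, roughly following the basic block in [3] (the exact layer sizes and the 1×1 projection on the skip path are my additions for illustration): each convolution is followed by batch normalization, and when the block changes the spatial size or channel count, a 1×1 convolution on the skip path reshapes the input so it can still be added to the block’s output.

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # main path: conv -> batch norm -> ReLU -> conv -> batch norm
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # skip path: identity if shapes match, otherwise a 1x1 projection
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 28, 28])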

When ResNet was proposed, it achieved state-of-the-art results on the ImageNet classification task [3].

Takeaways

Takeaway 1: The last layers of a deep NN receive very little of the input signal. This is because intermediate activation functions such as sigmoid or tanh saturate for large positive or negative inputs, which diminishes the signal as it passes through the layers. This is referred to as "saturation".

Takeaway 2: The early layers of a deep NN receive very little gradient in the first few epochs of training. This is because, as the error gradient is propagated back through many layers, it shrinks exponentially, which makes it difficult for the early layers to learn effectively. This problem is referred to as the "vanishing gradient problem".

Takeaway 3: VGG was proposed to show that deep networks with simple 3×3 filters can match more complex networks with large convolutional filters. ResNet was proposed to address vanishing gradients in very deep networks.

Summary

In this post, we looked at two seminal CNN architectures: VGG and ResNet. VGG is a deep CNN whose convolutional layers use only 3×3 filters. It was widely used for image classification, and at the time it was proposed it outperformed AlexNet and other competing models on the ImageNet challenge. It demonstrated the power of depth in CNNs and showed that stacks of simple 3×3 convolutions can resemble the effect of larger kernels. ResNet was introduced after VGG and outperformed it; its innovation was the residual block, which makes training deep networks easier and faster.


If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/

References

  1. An Analysis of Deep Neural Network Models for Practical Applications
  2. Very Deep Convolutional Networks for Large-Scale Image Recognition
  3. Deep Residual Learning for Image Recognition
