
Understand and Implement ResNet-50 with TensorFlow 2.0

Image Classification with a Very Deep Neural Network

Early Morning in Kyoto: Source (Author)

Our intuition may suggest that deeper neural networks should be able to capture more complex features, and thus represent more complex functions, than shallower ones. The question that naturally arises is: is learning a better network as simple as stacking more and more layers? What are the problems and benefits of this approach? These questions, along with several other very important concepts, were discussed in the Deep Residual Learning for Image Recognition paper by K. He et al. in 2015. The architecture is known as ResNet, and the paper introduced many must-know concepts related to Deep Neural Networks (DNNs). All of them will be addressed in this post, including the implementation of a 50-layer ResNet in TensorFlow 2.0. What you can expect to learn from this post:

  1. The problem with very deep neural networks.
  2. The mathematical intuition behind ResNet.
  3. Residual blocks and skip connections.
  4. Structuring ResNet and the importance of 1×1 convolution.
  5. Implementing ResNet with TensorFlow.

Let’s begin!

** A faster version of the image classification pipeline using tf.data is discussed here.


Degradation Problem:

The main motivation of the original ResNet work was to address the degradation problem in a deep network: as more layers are added to a sufficiently deep neural network, accuracy first saturates and then degrades. He et al. presented the following picture of training and test error on the Cifar-10 dataset using a plain (vanilla) network:

Fig. 1: Classification error with Cifar-10 data increases with increasing number of layers for both training (left) and test (right) data in a plain DNN. Reference: [1]

As we can see, both the training (left) and test (right) errors for the deeper network (56 layers) are higher than for the 20-layer network, and the gap persists as training progresses. At first glance it may appear that, since more layers mean more parameters, this is simply a problem of overfitting. But it is not: if it were overfitting, the deeper network would have a lower training error and only a higher test error. Let's understand what is really going on.

One way of thinking about the problem is to consider a deep network that already computes a sufficiently strong set of features for the task at hand (e.g. image classification). If we add one more layer to this already very deep network, what will the additional layer do? Since the network can already compute strong features, the additional layer does not need to compute anything extra; it could simply copy the already computed features, i.e. perform an identity mapping (the kernels in the added layer produce exactly the same feature maps as the previous layer). Learning this identity mapping seems like a very simple operation, but in practice a plain deep network struggles to do it.


Mathematical Intuition behind ResNet:

Let us consider a DNN architecture (including learning rate and other hyperparameters) that can reach a class of functions F. So, for every f ∈ F, there exist parameters W that we can obtain by training the network on a particular dataset. If f* denotes the function that we would really like to find (the result of the best possible optimization) but it is not within F, then we try to find f1, the best approximation within F. If we design a more powerful architecture with function class G, we should expect to arrive at a better outcome g1, i.e. one better than f1. But if F ⊈ G, there is no guarantee that this assumption holds; in fact, g1 could be worse than f1, and this is the degradation problem. So the main point is: if the function class of the deeper network contains the function class of the simpler, shallower network, then adding depth cannot hurt and can only increase the representational power of the original shallow network. This will become clearer once we introduce the residual block in the next section.


Residual Block:

The idea of a residual block is based entirely on the intuition explained above: the simpler function (shallower network) should be a subset of the complex function (deeper network) so that the degradation problem can be addressed. Let us consider an input x and denote the desired mapping from input to output by g(x). Instead of learning this function directly, we learn the simpler function f(x) = g(x) - x. The original mapping is then recast as f(x) + x. In the ResNet paper, He et al. hypothesized that it is easier to optimize the residual f(x) than the original g itself. Optimizing the residual also takes care of the dreaded identity mapping: if the optimal g(x) is simply x, the block only has to push f(x) towards zero, which is much easier for a stack of non-linear layers than learning the identity directly. Let's see the schematic of the residual block below:

Fig. 2: residual block and the skip connection for identity mapping. Re-created following Reference: [3]

The residual learning formulation ensures that when identity mappings are optimal (i.e. g(x) = x), the optimization will simply drive the weights of the residual function towards zero. ResNet consists of many residual blocks, where residual learning is applied to every few (usually 2 or 3) stacked layers. The building block is shown in Figure 2, and the final output can be written as y = f(x, W) + x. Here the W's are the weights, which are learned during training. The operation f + x is performed by a shortcut connection (which skips the 2 or 3 layers) followed by element-wise addition. This is the simplest block, where no additional parameters are involved in the skip connection. Element-wise addition is only possible when the dimensions of f and x are the same; if this is not the case, we multiply the input x by a projection matrix Ws so that the dimensions of f and x match. In this case the output changes from the previous equation to y = f(x, W) + Ws * x, and the elements of the projection matrix are also trainable.
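As a minimal sketch of this idea in TensorFlow (the layer choices, filter counts, and the BatchNormalization on the shortcut are illustrative assumptions, not the exact blocks built later in the post):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_add(x, f_x, projection_filters=None, strides=1):
    """Add the shortcut x to the residual branch output f_x.

    If the shapes differ, project x with a trainable 1x1 convolution
    (the matrix Ws in the text) so the element-wise addition works.
    """
    shortcut = x
    if projection_filters is not None:
        # Ws: learned 1x1 projection matching the depth (and stride) of f_x
        shortcut = layers.Conv2D(projection_filters, kernel_size=1,
                                 strides=strides, padding='same')(x)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([f_x, shortcut])      # y = f(x, W) + Ws * x
    return layers.Activation('relu')(y)
```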


Building ResNet and 1×1 Convolution:

We will build a 50-layer ResNet following the method adopted in the original paper by He et al. The architecture used for ResNet-50 is different from the 34-layer architecture: the shortcut connection skips 3 layers instead of 2. The schematic diagram below will help clarify this:

Fig. 3: Left: Skip 2 layers, ResNet-34. Right: Skip 3 layers including 1×1 convolution in ResNet-50. Reference: [1]

In ResNet-50, the stacked layers within a residual block always consist of 1×1, 3×3, and 1×1 convolution layers. The first 1×1 convolution reduces the depth, the features are then computed in the bottleneck 3×3 layer, and the depth is restored by the following 1×1 layer. This use of 1×1 filters to reduce and restore the dimension of the feature maps around a bottleneck layer was described in the GoogLeNet model by Szegedy et al. in their Inception paper. Since there is no pooling layer within the residual block, spatial downsampling is done by a 1×1 convolution with stride 2. With these points in mind, let's build ResNet-50 using TensorFlow 2.0.
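Before doing that, a quick back-of-the-envelope count shows why the bottleneck design is so economical. The channel numbers below are illustrative (64 and 256, as in the first bottleneck stage); the comparison itself is mine, not taken from the paper:

```python
# Weights of two plain 3x3 conv layers vs. a 1x1-3x3-1x1 bottleneck,
# both mapping 256 channels to 256 channels (biases ignored).
plain = 2 * (3 * 3 * 256 * 256)                  # two stacked 3x3 conv layers
bottleneck = (1 * 1 * 256 * 64                   # 1x1 reduce to 64 channels
              + 3 * 3 * 64 * 64                  # 3x3 on the reduced maps
              + 1 * 1 * 64 * 256)                # 1x1 restore to 256 channels
print(plain, bottleneck)  # ~1.18M vs ~0.07M weights
```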


Building ResNet-50:

Before coding, let’s see the ResNet-34 architecture as presented in the original paper –

ResNet-34 (from K. He et al. "Deep Residual Learning"). Reference: [1]

The only pooling layers are placed at the very beginning and just before the dense layer at the end of the architecture. Wherever the dimension needs to change elsewhere, a 1×1 convolution is used, as described in the previous section. For the number of filters and other parameters, I followed the Keras example [2]. Now it is time to code. First, we define the simplest identity block, where the spatial dimensions of the input do not change (only the depth varies inside the block, and it is restored before the addition); a sketch of such a block is shown below.
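The exact code used in the post is in the notebook [4]. As a minimal sketch, assuming the Conv, BatchNorm, Activation ordering used throughout the post and a `filters` tuple (f1, f2, f3) for the three convolutions (the names are mine), an identity block could look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters, kernel_size=3):
    """Bottleneck block whose shortcut is a pure identity.

    `filters` = (f1, f2, f3); the block output has f3 channels,
    which must equal the channel depth of the input `x`.
    """
    f1, f2, f3 = filters
    shortcut = x

    # 1x1 convolution: reduce depth to f1
    x = layers.Conv2D(f1, 1, padding='valid')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    # 3x3 bottleneck convolution
    x = layers.Conv2D(f2, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    # 1x1 convolution: restore depth to f3
    x = layers.Conv2D(f3, 1, padding='valid')(x)
    x = layers.BatchNormalization()(x)

    # element-wise addition with the identity shortcut, then ReLU
    x = layers.Add()([x, shortcut])
    return layers.Activation('relu')(x)
```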

The other residual block changes the dimension of the input by using a 1×1 convolution with stride 2; the skip connection therefore also has to go through a dimension change (the projection discussed earlier):
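Again, the author's exact code is in the notebook [4]; a hedged sketch of such a block, reusing the imports above and defaulting to a stride of 2 (my parameterisation), could be:

```python
def conv_block(x, filters, kernel_size=3, strides=2):
    """Bottleneck block that changes spatial dimensions and depth.

    The shortcut is a stride-`strides` 1x1 convolution (the projection Ws),
    so its output shape matches the main branch for the addition.
    """
    f1, f2, f3 = filters

    # projection shortcut: 1x1 conv with the same stride as the main branch
    shortcut = layers.Conv2D(f3, 1, strides=strides, padding='valid')(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # main branch: 1x1 (strided, reduce) -> 3x3 -> 1x1 (restore)
    x = layers.Conv2D(f1, 1, strides=strides, padding='valid')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    x = layers.Conv2D(f2, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    x = layers.Conv2D(f3, 1, padding='valid')(x)
    x = layers.BatchNormalization()(x)

    x = layers.Add()([x, shortcut])
    return layers.Activation('relu')(x)
```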

Combining these two residual blocks, we can now build the complete 50-layer ResNet as below:
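A sketch of the full assembly, reusing the identity_block and conv_block sketches above and following the (3, 4, 6, 3) stage pattern of the paper; the stem and the exact filter counts here are assumptions on my part, and the author's configuration is in the notebook [4]:

```python
def resnet50(input_shape=(32, 32, 3), num_classes=10):
    """ResNet-50: a stem, then (3, 4, 6, 3) bottleneck blocks, then a dense head."""
    inputs = layers.Input(shape=input_shape)

    # stem: convolution + max pooling at the very beginning
    x = layers.Conv2D(64, 7, strides=2, padding='same')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D(3, strides=2, padding='same')(x)

    # stage 1: 3 blocks, 256 output channels (no extra downsampling)
    x = conv_block(x, (64, 64, 256), strides=1)
    for _ in range(2):
        x = identity_block(x, (64, 64, 256))

    # stage 2: 4 blocks, 512 output channels
    x = conv_block(x, (128, 128, 512))
    for _ in range(3):
        x = identity_block(x, (128, 128, 512))

    # stage 3: 6 blocks, 1024 output channels
    x = conv_block(x, (256, 256, 1024))
    for _ in range(5):
        x = identity_block(x, (256, 256, 1024))

    # stage 4: 3 blocks, 2048 output channels
    x = conv_block(x, (512, 512, 2048))
    for _ in range(2):
        x = identity_block(x, (512, 512, 2048))

    # global pooling just before the final dense layer
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)
```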

Using a batch size of 64, 160 epochs, and data augmentation, an accuracy of ∼85% on training data and ∼82% on test data was achieved; a sketch of this training setup is shown below, followed by the training and validation curves.
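For reference, a rough sketch of such a training setup, continuing from the resnet50 sketch above; the optimizer and the augmentation parameters are assumptions of mine, and the exact settings are in the notebook [4]:

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = resnet50()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# simple data augmentation: small shifts and horizontal flips
datagen = ImageDataGenerator(width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

history = model.fit(datagen.flow(x_train, y_train, batch_size=64),
                    epochs=160,
                    validation_data=(x_test, y_test))
```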

Train and validation accuracy/loss for the 50-layer ResNet on Cifar-10 data. (Source: Author)

Also, the confusion matrix for all 10 classes of the Cifar-10 data can be plotted; one way to compute it is sketched below.
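One possible way to obtain the matrix, continuing from the training sketch above and using scikit-learn (the plotting itself is omitted):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(model.predict(x_test), axis=1)
cm = confusion_matrix(y_test.ravel(), y_pred)
print(cm)  # 10x10 matrix: rows = true class, columns = predicted class
```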

Confusion matrix for Cifar-10 data trained using ResNet-50. (Source: Author)

Discussion:

Here we have seen one example of implementing ResNet-50 with TensorFlow and training the model on the Cifar-10 data. One important point of discussion is the order of Convolution – BatchNorm – Activation, which is still debated. The ordering used in the original BatchNorm paper is not considered the best by many; see a GitHub issue here. I recommend you try parameters different from those used in the notebook [4] to understand their effects.
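For illustration only, the two orderings being debated can be written side by side (a sketch, reusing the layers import from above; the blocks in this post use the first ordering):

```python
def conv_bn_relu(x, filters, kernel_size=3):
    # Conv -> BatchNorm -> Activation: ordering used in the blocks above
    x = layers.Conv2D(filters, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

def bn_relu_conv(x, filters, kernel_size=3):
    # "pre-activation" ordering discussed in follow-up work
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    return layers.Conv2D(filters, kernel_size, padding='same')(x)
```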

A few of the important points that you can take away from this are:

  1. The distinction between degradation and overfitting, and why degradation occurs in very deep networks.
  2. Using 1×1 convolutions to increase and decrease the dimension of the feature maps.
  3. How the residual block helps to prevent the degradation problem.

That's all for now! Hope this helps you a little, and stay strong!!

P.S.: If you want to build a faster image classification pipeline using tf.data, then check this post.


Reference:

[1] Kaiming He et al.: Deep Residual Learning for Image Recognition (original ResNet paper).

[2] Keras example implementation.

[3] Alex Smola: ResNet intuition lecture.

[4] Notebook with the code used: GitHub link.



