A powerful method for weakly supervised object localization and debugging

Introduction
In this article I want to share a very powerful and interesting technique with you: Class Activation Maps (CAMs), which were first introduced by researchers at MIT in the paper "Learning Deep Features for Discriminative Localization" [1]. CAMs allow you to see not only the class the network predicts, but also the part of the image the network is especially interested in. This gives more insight into what the network has learned and also simplifies debugging, because the user gets an object localization for the predicted class without having to explicitly label a bounding box for that object. The chapter Methodology goes through the data used and where to find the code for this article. The chapter Model describes the model and its training. The chapter Class Activation Mapping explains the idea behind CAMs and how to compute them. The chapter Conclusion summarizes the findings.
Methodology
The training of the network and the computation of the CAMs are done in a Jupyter notebook using TensorFlow. The Fruits 360 data set from Kaggle is used. It contains 90,483 images of fruits and vegetables spread over 131 different classes. Figure 1 shows an example image with the ground truth label as title. The corresponding notebooks can be found in my GitHub repository.
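A minimal sketch of how the data could be loaded with tf.keras is shown below; the directory names and the preprocessing details are assumptions and may differ from the actual notebook:

```python
import tensorflow as tf

# Assumed local directory layout after downloading the Fruits 360 data set
# from Kaggle; the exact paths may differ from the notebook.
TRAIN_DIR = "fruits-360/Training"
TEST_DIR = "fruits-360/Test"

# Images are resized to the 224x224 input resolution expected by ResNet50.
train_ds = tf.keras.utils.image_dataset_from_directory(
    TRAIN_DIR, image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    TEST_DIR, image_size=(224, 224), batch_size=32)

print("Number of classes:", len(train_ds.class_names))  # expected: 131
```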

Model
As the model, I decided to use the pre-trained ResNet50 [4] for transfer learning (TL). This model was trained on the ImageNet challenge, which contains 1000 different classes. TL is very useful for quickly training a large and deep network, because the network doesn't have to be trained from scratch. Normally, the deeper you go in a network, the more complex the features the filters learn. The filters of the first layer only learn very low level features like edges, because they only see a small portion of the input image. On the second layer, this portion (also called the receptive field) is already larger, so the filters can learn more complex features. On the last convolution layers, the filters can already detect complete objects. When using TL, a pre-trained model is loaded and its classification layer is dropped. Afterwards, the complete CNN except the newly added classification layer is frozen and the model is trained until the loss converges. This avoids destroying already learned features through the initially potentially large gradients. Optionally, you can afterwards unfreeze the rest of the CNN, or only its last layers, and fine-tune their weights to your current task.

For this task I decided to only re-train the classification layer in order to focus on computing the CAMs. Therefore, I dropped all dense layers at the end of the ResNet50 and added a global average pooling layer (see the next chapter for why this is added) plus a softmax layer with as many neurons as there are classes to classify. The architecture is described in more detail in the next chapter. The model was then trained for five epochs, resulting in a validation accuracy of 99%.
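In Keras, this setup could look roughly like the following sketch; the optimizer, loss and preprocessing choices are assumptions and not necessarily what was used in the notebook:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 131  # fruit/vegetable classes in the Fruits 360 data set

# Load ResNet50 pre-trained on ImageNet and drop its dense classification head.
base_model = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the convolutional backbone

# New head: global average pooling followed by the softmax classification layer.
pooled = layers.GlobalAveragePooling2D()(base_model.output)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(pooled)
model = tf.keras.Model(base_model.input, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Only the new softmax layer is trainable; inputs are assumed to be
# preprocessed with tf.keras.applications.resnet50.preprocess_input.
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```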
Class Activation Mapping
A CAM is a weighted activation map generated for each image [1]. It helps to identify the region a CNN is looking at while classifying an image. CAMs aren't trained in a fully supervised fashion, but in a weakly supervised one. This means that the objects do not have to be labeled manually and the localization is learned "for free". The only thing that has to be changed in the architecture is to get rid of the fully connected dense layers at the end, in order to maintain the spatial information contained in the output of the last convolution layer. In addition, a global average pooling layer is added afterwards. This layer is usually used for regularization in order to prevent the network from overfitting [2]. Finally, the output softmax layer with as many neurons as classes to classify (in our case: 131) is added. Figure 2 summarizes the required architecture and shows how to compute the CAM.
![Figure 2: CAM architecture and procedure [1]](https://towardsdatascience.com/wp-content/uploads/2020/07/1O5azF2X0KF1NQmpcooXF1Q.png)
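In formulas, the CAM for the predicted class c is the weighted sum of the feature maps f_k of the final convolution layer, where w_k^c denotes the weight connecting feature map k to the output neuron of class c [1]:

$$M_c(x, y) = \sum_k w_k^c \, f_k(x, y)$$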
In our case, the architectural changes and the training are already done (c.f. chapter Model), so we can directly start with computing the CAM. As a first step, we feed an image to the network and compute its classification output. Next, we fetch the weights connected to the "winning" neuron. Additionally, we store the output of the final convolution layer. Then, we compute the CAM by multiplying each depth slice of the final convolution layer's output with the corresponding weight connected to the "winning" neuron (i.e. w1 is multiplied with depth one of the final convolution layer's output, w2 with depth two, …) and summing them all up. Finally, we use bilinear upsampling to extend the size of the CAM to match the size of the input image. Figure 3 shows the function from the Jupyter notebook on my GitHub page.
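A minimal sketch of such a function is shown below; the actual notebook version may differ in details. It assumes the model is built as in the previous sketch, where "conv5_block3_out" is the name of the last convolutional block in Keras' ResNet50:

```python
import numpy as np
import tensorflow as tf

def compute_cam(model, image, last_conv_layer_name="conv5_block3_out"):
    """Compute the class activation map for a single (preprocessed) image.

    Assumes the last convolution block is followed only by global average
    pooling and the softmax layer, as sketched in the Model chapter.
    """
    # Sub-model returning the last conv feature maps and the class scores.
    last_conv = model.get_layer(last_conv_layer_name)
    cam_model = tf.keras.Model(model.input, [last_conv.output, model.output])

    features, preds = cam_model(image[np.newaxis, ...])
    features = features[0]                               # (h, w, channels)

    # Weights of the softmax layer that connect to the "winning" neuron.
    class_idx = int(tf.argmax(preds[0]))
    class_weights = model.layers[-1].get_weights()[0]    # (channels, classes)
    winning_weights = class_weights[:, class_idx]         # (channels,)

    # Weighted sum over the depth dimension: CAM = sum_k w_k * f_k.
    cam = tf.reduce_sum(features * winning_weights, axis=-1)

    # Bilinear upsampling to the input image resolution.
    cam = tf.image.resize(cam[..., tf.newaxis], image.shape[:2],
                          method="bilinear")
    return tf.squeeze(cam).numpy(), class_idx
```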
Figure 4 shows six example CAMs. The title of each CAM shows the label predicted by the network.
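To produce overlays like those in Figure 4, the upsampled CAM can simply be drawn as a semi-transparent heatmap on top of the original image, e.g. with matplotlib. This is a sketch assuming the hypothetical compute_cam function from above:

```python
import matplotlib.pyplot as plt
import numpy as np

def show_cam(image, cam, predicted_label):
    """Overlay the CAM as a semi-transparent heatmap on the original image."""
    plt.imshow(np.uint8(image))                # original (non-preprocessed) image
    plt.imshow(cam, cmap="jet", alpha=0.5)     # CAM heatmap on top
    plt.title(predicted_label)
    plt.axis("off")
    plt.show()
```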

Conclusion
As one can see, the CAM can be computed easily by making small adjustments to the network architecture and comes essentially for free, so no one has to expensively label the data. The CAM can also be used for debugging, because one can see which area of the image the network is focusing on. This is helpful if, for example, the network is supposed to classify the presence of a harbor in an image and, by plotting the CAM, it becomes clear that the network is focusing on the sky instead of the harbor itself. The user could then adapt the dataset to also contain images of harbors without a visible sky and thereby avoid such spurious correlations.
Resources
[1]: Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva and Antonio Torralba, Learning Deep Features for Discriminative Localization (2015), MIT
[2]: Min Lin, Qiang Chen, and Shuicheng Yan, Network In Network (2013), National University of Singapore
[3]: Alexis Cook, Global Average Pooling (2017), GitHub Blogpost
[4]: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition (2015), Microsoft Research