Recently, more and more attention has been focused on the interpretability of Machine Learning models, especially Deep Learning ones, because of their black-box nature. One important Deep Learning architecture is the Convolutional Neural Network (CNN), which has driven breakthroughs in Computer Vision, including image classification, object detection, semantic segmentation, instance segmentation, image captioning, etc. Progress in the refinement and development of CNNs has been rapid and the architectures have been greatly simplified, but their predictions still cannot be decomposed into intuitive and completely understandable parts.
Understanding and interpreting models is crucial to building people's confidence in our systems. The most common problem tackled by CNNs is image classification, and a visual explanation approach, usually known as a saliency map or attribution map, is used to find the features that most influence the model's predictions. There are several techniques for generating saliency maps, and the most common among them is gradient-based visualization, which backpropagates the partial derivative of the score for the target class with respect to the input layer. However, the problem with gradient-based maps is that they are generally of low quality and contain random noise. Other methods include perturbation-based methods, which work by adding a small noise to the input (perturbing it) and observing the change in the prediction. Another family is Class Activation Maps (CAM), which I explained recently in my post, and their extensions such as Grad-Cam and Grad-Cam++. They generate saliency maps as a linear weighted combination of the activation maps to highlight important regions in the image space. However, they also suffer from random noise that is irrelevant to the target object in the image, and the weights do not capture the importance of each activation map well.
SCORE-CAM:
In this article, I will try to explain a very recent paper, Score-Cam, which improves on the previous methods mentioned above and, like its predecessors, tries to make CNNs interpretable.
Score-Cam builds on top of Class Activation Mapping and claims to solve the previously reported problem of irrelevant noise, generating cleaner and more meaningful explanations. In Score-Cam, the researchers follow the idea of perturbation-based methods, which mask parts of the regions in the original input and observe the change in the target score. Each activation map obtained is treated as a kind of mask for the input image: it masks parts of the input image, and the model then predicts on the partially masked image. The score for the target class is then used to represent the importance of that activation map.
Examples of Masked Activations using Perturbation-based methods:-




Score-Cam, unlike the famous Grad-Cam, does not make use of gradients, because the researchers believe that the propagated gradients are quite unstable and produce the random noise seen in gradient-based saliency maps. The unstable nature of the gradients is shown in Figure 3, where the gradient changes sharply when the input image is changed slightly, even though the change is not perceptible to the human eye and does not change the prediction result. That is why it is reasonable to doubt the effectiveness of gradient-based methods at reducing redundant noise.

Methodology:
Various previous works, such as CAM and Grad-Cam, have asserted that deeper layers in CNNs capture higher-level visual information. Furthermore, convolutional features naturally retain spatial information that is lost in fully connected layers, so it is common to expect the last convolutional layer to offer the best compromise between high-level semantics and detailed spatial information; the neurons in these layers look for class-specific semantic information in the input image. Hence, in Score-Cam we use the last convolutional layer to get activation maps containing this balanced representation.
In contrast to Grad-Cam and Grad-Cam++, which use the gradient information flowing into the last convolutional layer of the CNN to represent the importance of each activation map, Score-Cam uses the score obtained for a specific target class as the weights. Hence, Score-Cam gets rid of the dependence on gradients and works as a more general framework, as it only requires access to the activation maps and the output scores of the model.
The pipeline of Score-Cam is shown in Figure 4, which illustrates all the steps involved in reaching the final saliency map.

In order to obtain the class-discriminative saliency map using Score-Cam, the process is divided into the following steps:
- The first step is to pass the image to a CNN model and perform a forward pass. After the forward pass, the activations are extracted from the last convolutional layer in the network.
- Each activation map obtained from the last layer, having shape 1 x m x n, is then upsampled using bilinear interpolation to the same size as the input image.
- After upsampling, each activation map is normalized so that every pixel lies within the range [0, 1], maintaining the relative intensities between the pixels. The normalization is achieved using the formula shown in Figure 5.

- After the normalization of the activation maps is complete, the highlighted areas of the activation maps are projected onto the input space by multiplying each normalized activation map (1 x W x H) with the original input image (3 x W x H) to obtain a masked image M of shape 3 x W x H.

- The masked images M thus obtained are then passed to the Convolutional Neural Network with a softmax output.

- After getting the scores for each class, we extract the score of the target class to represent the importance of the k-th activation map.

- Then we compute a linear combination between the target class scores and the activation maps, summing across all the activation maps. This results in a single activation map having the same size as the input image.
- Finally, we apply pixel-wise ReLU to the final activation map. We apply ReLU because we are interested only in the features that have a positive influence on the class of interest.
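Putting these steps together, the Score-Cam saliency map for a target class c can be summarized compactly as follows (my own notation, reconstructed from the steps described in this post rather than quoted from the paper):

$$
L^{c}_{\text{Score-Cam}} = \mathrm{ReLU}\Big(\sum_{k} w^{c}_{k}\,A_{k}\Big),\qquad
w^{c}_{k} = \mathrm{Softmax}\big(f\big(X \circ s(\mathrm{Up}(A_{k}))\big)\big)_{c},\qquad
s(A) = \frac{A - \min A}{\max A - \min A}
$$

where A_k is the k-th activation map of the last convolutional layer, Up is bilinear upsampling to the input size, s is the min-max normalization from Figure 5, X ∘ (·) denotes element-wise multiplication with the input image, and f is the CNN with softmax output.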

IMPLEMENTATION OF SCORE CAM IN KERAS:
We will follow the same pipeline described earlier, both in the figure and in the methodology section.
- We use VGG16 pre-trained on ImageNet as our model for the whole pipeline.
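The snippets below are a minimal sketch of the pipeline in tf.keras (assuming TensorFlow 2.x); they form one continuing script, so later steps reuse variables defined earlier. Loading the pre-trained model:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# VGG16 pre-trained on ImageNet, with the fully connected head and softmax output
model = VGG16(weights='imagenet')
```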

- We load the input image shown above in Figure 1 and preprocess it so that it is suitable to be passed into the VGG16 model.
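A sketch of the loading and preprocessing step; the file name is hypothetical and should point to the image from Figure 1:

```python
# Load the image at VGG16's expected input resolution and preprocess it
img = image.load_img('input_image.jpg', target_size=(224, 224))  # hypothetical file name
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)   # shape: (1, 224, 224, 3)
x = preprocess_input(x)         # VGG16-specific channel ordering / mean subtraction
```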

- After that, we pass the image through the model and obtain prediction scores for each class. We extract the index for the target class from the predictions.
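Something like the following obtains the predictions and picks the target class (here simply the top predicted class):

```python
preds = model.predict(x)                      # shape: (1, 1000) softmax scores
print(decode_predictions(preds, top=3)[0])    # sanity check of the top predictions
target_class = int(np.argmax(preds[0]))       # index of the class we want to explain
```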

- Then we extract the activations from the final convolutional layer of size (1 x 14 x 14 x 512).
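To grab the activations of the last convolutional layer (block5_conv3 in Keras' VGG16), we can build a small sub-model:

```python
# Sub-model that outputs the activations of VGG16's last convolutional layer
last_conv = model.get_layer('block5_conv3')
act_model = tf.keras.Model(inputs=model.input, outputs=last_conv.output)
activations = act_model.predict(x)            # shape: (1, 14, 14, 512)
```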


- Then we upsample all the activation maps to match the size of the original input image.
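Bilinear upsampling of all 512 maps at once, e.g. with tf.image.resize:

```python
# Upsample every activation map from 14x14 to the 224x224 input resolution
upsampled = tf.image.resize(activations, (224, 224), method='bilinear').numpy()
maps = upsampled[0]                           # shape: (224, 224, 512)
```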



- Then, we normalize each of the 512 activation maps using the formula shown in Figure 5.
Note: We add a very small term, i.e. 1e-5, to the denominator to prevent a division-by-zero error that would lead to NaN values.
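The min-max normalization from Figure 5, vectorized over all maps, with the small 1e-5 term in the denominator:

```python
# Normalize each of the 512 maps to [0, 1]; 1e-5 avoids division by zero
mins = maps.min(axis=(0, 1), keepdims=True)
maxs = maps.max(axis=(0, 1), keepdims=True)
norm_maps = (maps - mins) / (maxs - mins + 1e-5)   # shape: (224, 224, 512)
```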

- After step 6, we project the masks produced onto the original image by performing an element-wise multiplication between the masks and the input image, as shown in Figure 6.
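A sketch of the masking step; here the masks are applied to the already-preprocessed input so the masked images can be fed straight back into the model:

```python
# Multiply each normalized map (224, 224, 1) with the input image (224, 224, 3)
num_maps = norm_maps.shape[-1]                          # 512
masked_inputs = np.empty((num_maps, 224, 224, 3), dtype=np.float32)
for k in range(num_maps):
    masked_inputs[k] = x[0] * norm_maps[..., k:k + 1]   # broadcast over the colour channels
```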

- The masked images thus obtained are then forward propagated through the VGG16 model and the softmax scores are obtained.
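All 512 masked images can be scored in one batched call:

```python
# Softmax scores for every masked image: shape (512, 1000)
masked_scores = model.predict(masked_inputs, batch_size=32)
```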


- After obtaining the scores for all the classes, we just extract the scores corresponding to our target class. In the case of our input image, we have two classes present, but for demonstration purposes we will consider the dog as our target class.
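Extracting the score of the chosen target class for every masked image gives one weight per activation map:

```python
# One importance weight per activation map, shape: (512,)
weights = masked_scores[:, target_class]
```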

- Now we have everything required to produce the class-discriminative saliency map, i.e. the normalized activation maps and the scores of the target class to be used as weights. Next, we perform an element-wise multiplication of the target class softmax scores and the normalized activation maps.
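The weighting itself is a broadcasted element-wise multiplication:

```python
# Weight each normalized map by its target-class score, shape: (224, 224, 512)
weighted_maps = norm_maps * weights[np.newaxis, np.newaxis, :]
```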

- After the last step is complete, we perform a summation over all the activation maps (512), combining them into a single activation map of shape 1 x 224 x 224.
- To obtain the final saliency map, we apply a pixel-wise ReLU to the activation map obtained in the last step.
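Summing over the 512 channels and clipping the negative values gives the final map; the rescaling to [0, 1] is not part of the original steps and is added here only to make the map easy to overlay on the image:

```python
saliency = weighted_maps.sum(axis=-1)              # shape: (224, 224)
saliency = np.maximum(saliency, 0)                 # pixel-wise ReLU
saliency = saliency / (saliency.max() + 1e-5)      # rescale to [0, 1] for visualization
```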

The Final Class Discriminative Map obtained is shown in Figure 9 below.

As we can see, Score-Cam is class-discriminative and also has less noise compared to its predecessors.
Advantages of Score-Cam:
- Score-Cam, like Grad-Cam and Grad-Cam++, can be used with any Convolutional Neural Network architecture and does not require retraining the model to produce saliency maps, unlike CAM.
- Score-Cam is class discriminative and removes irrelevant noise to produce a meaningful saliency map.
- It uses the softmax scores as weights, removing the dependence on unstable gradients.
Conclusion:
In this post, we discussed a recently released paper on a novel architecture for score-based Class Activation Mapping (Score-Cam), which provides a new way to tackle the problem of model interpretability. The paper introduces an approach in which the useless noise is removed and only the important saliency information is produced. Moreover, it removes the dependence on target class gradients and offers a more general way to produce saliency maps. The paper draws its inspiration from two families of methods for producing saliency maps, namely perturbation methods and class activation maps, and proposes a method combining the best of both.
To learn more about Score-Cam, read the paper at the following link: https://arxiv.org/abs/1910.01279.
Hope you liked the post. For further discussions, queries, or related content, you can contact me on LinkedIn or follow me on Twitter.