Visualizing Convolutional Neural Networks using PyTorch

Niranjan Kumar
Towards Data Science
12 min read · Oct 12, 2019


Convolutional Neural Networks (CNNs) are a type of neural network that enables machines to visualize things and perform tasks such as image classification, image recognition, object detection, and instance segmentation. But neural network models are often termed ‘black box’ models because it is quite difficult to understand how a model learns the complex dependencies present in the input, and to analyze why a given prediction is made during inference.

In this article, we will look at two different visualization techniques:

  1. Visualizing learned filter weights.
  2. Performing occlusion experiments on the image.

These methods help us answer questions like: What does a filter learn? What kinds of images cause certain neurons to fire? How good are the hidden representations of the input image?

Citation Note: The content and structure of this article are based on the deep learning lectures from One-Fourth Labs — PadhAI. If you are interested, check out their course.

Receptive Field of a Neuron

Before we go ahead and visualize the workings of a Convolutional Neural Network, we will discuss the receptive field of the filters present in a CNN.

Consider a two-layer Convolutional Neural Network that uses 3x3 filters throughout. The pixel marked in yellow at the center of Layer 2 is the result of applying the convolution operation on the center pixel of Layer 1 (using a 3x3 kernel and stride = 1). Similarly, the center pixel of Layer 3 is the result of applying the convolution operation on the center pixel of Layer 2.

The receptive field of a neuron is defined as the region of the input image that can influence that neuron in a convolution layer, i.e., how many pixels in the original image influence the neuron in that layer.

It is clear that the central pixel in Layer 3 depends on a 3x3 neighborhood of the previous layer (Layer 2). The 9 pixels (marked in pink) in Layer 2, including the central pixel, correspond to a 5x5 region in Layer 1. As we go deeper into the network, pixels in the deeper layers have a larger receptive field, i.e., their region of interest with respect to the original image grows.
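
To make this concrete, here is a small sketch (an illustration added here, not part of the original notebook) that computes the receptive field of a stack of convolution layers using the standard recurrence:

#receptive field of stacked convolutions: r_out = r_in + (kernel - 1) * jump,
#where jump is the cumulative stride between adjacent outputs
def receptive_field(layers):
    r, j = 1, 1
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * j
        j *= stride
    return r

#two stacked 3x3, stride-1 convolutions see a 5x5 region of the input
print(receptive_field([(3, 1), (3, 1)]))  #5
print(receptive_field([(3, 1)] * 5))      #11, deeper layers see more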

From the above image, we can observe that the highlighted pixel in the second convolution layer has a large receptive field with respect to the original input image.

Visualizing CNN

To visualize the workings of a CNN, we will explore two commonly used methods to understand how the neural network learns these complex relationships.

  1. Filter visualization with a pre-trained model.
  2. Occlusion analysis with a pre-trained model.

Run this notebook in Colab

All the code discussed in the article is present on my GitHub. You can open the notebook in Colab, which runs on a Google virtual machine, directly from GitHub without any local setup. Click here if you just want to quickly open the notebook and follow along with this tutorial.

Don’t forget to upload the input images folder (which can be downloaded from the GitHub repo) to Google Colab before executing the code there.

Visualize Input Images

In this article, we will use a small subset of the ImageNet dataset (which spans 1000 categories) to visualize the filters of the model. The dataset can be downloaded from my GitHub repo.
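
The exact loading code is in the notebook; a typical setup (the folder name and image size below are assumptions) looks like this:

import torch
import torchvision
import torchvision.transforms as transforms

#standard imagenet preprocessing expected by the pretrained torchvision models
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])])

#'input_images' is a placeholder; point it at the folder from the repo
dataset = torchvision.datasets.ImageFolder(root='input_images', transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)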

To visualize the dataset, we will implement a custom function, imshow.

The function imshow takes two arguments: an image tensor and a title for the image. First, it performs the inverse normalization of the image with respect to the ImageNet mean and standard deviation values. After that, it uses matplotlib to display the image.
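
The notebook has the exact implementation; a minimal version along these lines might look like this:

import matplotlib.pyplot as plt
import numpy as np

#imagenet mean and standard deviation used to normalize the inputs
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

def imshow(img, title):
    #convert the CHW tensor to an HWC numpy array
    npimg = img.numpy().transpose((1, 2, 0))
    #inverse normalization: undo (x - mean) / std
    npimg = std * npimg + mean
    npimg = np.clip(npimg, 0, 1)
    plt.imshow(npimg)
    plt.title(title)
    plt.axis('off')
    plt.show()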

Sample Input Image

Filter Visualization

By visualizing the filters of a trained model, we can understand how the CNN learns the complex spatial and temporal pixel dependencies present in the image.

What does a filter capture?

Consider a 2D input of size 4x4 and a 2x2 filter (marked in red) applied starting from the top-left corner of the image. As we slide the kernel over the image from left to right and top to bottom to perform the convolution operation, we get an output that is smaller than the input: with stride 1, a 4x4 input convolved with a 2x2 kernel gives a 3x3 output, since (4 − 2)/1 + 1 = 3.

The output of each convolution step (like h₁₄) is the dot product of the input patch (flattened into a vector) and the weight vector: h₁₄ = W · X = ‖W‖‖X‖ cos θ. We know that the dot product between two vectors is proportional to the cosine of the angle between them.

During the convolution operation, certain parts of the input image, like the portion containing the face of a dog, might give a high value when we apply the filter on top of them. Using the above example, let’s discuss in what kind of scenario the output h₁₄ will be high.

The output h₁₄ will be high when the cosine between the vectors is high, i.e., equal to 1. A cosine of 1 means the angle between the vectors is 0°; in other words, when the input vector X (a portion of the image) and the weight vector W point in the same direction, the neuron fires maximally.

The neuron h₁₄ will fire maximally when the input X (the portion of the image under the filter) is a positive multiple of the unit vector in the direction of the filter vector W.

In other words, we can think of a filter as an image. As we slide the filter over the input from left to right and top to bottom, the neuron fires whenever the filter coincides with a similar portion of the input. For all other parts of the input image that don’t align with the filter, the output will be low. This is why we call the kernel or weight matrix a filter: it filters out the portions of the input image that don’t align with it.
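
A quick toy example (added here for illustration, not from the article’s notebook) makes this concrete: a small “edge” filter responds strongly to a patch that points in the same direction as the filter and gives no response to a flat patch.

import numpy as np

#a toy 2x2 vertical-edge filter, flattened into a vector
w = np.array([[1., -1.],
              [1., -1.]]).ravel()

#a patch that looks like the filter versus a flat, featureless patch
aligned = np.array([[0.9, -0.9],
                    [0.9, -0.9]]).ravel()
flat = np.array([[0.5, 0.5],
                 [0.5, 0.5]]).ravel()

print(np.dot(w, aligned))  #3.6, strong response: the patch aligns with the filter
print(np.dot(w, flat))     #0.0, the filter "filters out" this patch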

To understand what kind of patterns a filter learns, we can simply plot the filter, i.e., the weights associated with it. For filter visualization, we will use AlexNet pre-trained on the ImageNet dataset.

#alexnet pretrained with imagenet data
#import model zoo in torchvision
import torchvision.models as models
alexnet = models.alexnet(pretrained=True)

AlexNet contains 5 convolutional layers and 3 fully connected layers, with ReLU applied after every convolution. Remember that in a convolution over 3D (RGB) inputs, there is no movement of the kernel along the depth dimension, since the kernel and the image have the same depth. We will visualize these filters (kernels) in two ways.

  1. Visualizing each filter by combining its three channels into one RGB image.
  2. Visualizing each channel of a filter independently as a heatmap.

The main function to plot the weights is plot_weights. The function takes 4 parameters:

model — Alexnet model or any trained model

layer_num — Convolution Layer number to visualize the weights

single_channel — Visualization mode

collated — Applicable for single-channel visualization only.

In the plot_weights function, we take our trained model and read the layer at the given layer number. In AlexNet (from the PyTorch model zoo), the first convolution layer has a layer index of zero. Once we extract the layer at that index, we check whether it is a convolution layer, since only convolutional layers can be visualized this way. After validating the layer index, we extract the learned weights of that layer.

#getting the weight tensor data
weight_tensor = model.features[layer_num].weight.data
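
If you are unsure which layer indices hold convolution layers, printing the feature extractor shows the full structure:

#in torchvision's AlexNet the Conv2d layers sit at indices 0, 3, 6, 8 and 10
print(alexnet.features)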

Depending on the argument single_channel, we can plot the weight data as single-channel or multi-channel images. AlexNet’s first convolution layer has 64 filters of size 11x11. We will plot these filters in two different ways and see what kinds of patterns they learn.
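
The complete plot_weights implementation is in the notebook; a condensed sketch of the logic described above (the grid sizes and figure dimensions here are illustrative choices) could look like this:

import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn

def plot_weights(model, layer_num, single_channel=True, collated=False):
    #extract the layer and make sure it is convolutional
    layer = model.features[layer_num]
    if not isinstance(layer, nn.Conv2d):
        raise ValueError("only convolutional layers can be visualized")
    #weight tensor shape: (out_channels, in_channels, k, k)
    w = layer.weight.data

    if not single_channel:
        #multi-channel mode: show each filter as one RGB image
        n = w.shape[0]
        cols = 8
        rows = int(np.ceil(n / cols))
        fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
        for i, ax in enumerate(axes.flat):
            ax.axis('off')
            if i < n:
                img = w[i].permute(1, 2, 0).numpy()
                #rescale the weights to [0, 1] so they display as a valid image
                img = (img - img.min()) / (img.max() - img.min())
                ax.imshow(img)
    else:
        #single-channel mode: every channel becomes its own grayscale image
        channels = w.reshape(-1, w.shape[2], w.shape[3]).numpy()
        k = channels.shape[1]
        cols = 16
        rows = int(np.ceil(len(channels) / cols))
        if collated:
            #concatenate all channel images into one big grayscale heatmap
            grid = np.zeros((rows * k, cols * k))
            for i, c in enumerate(channels):
                r, col = divmod(i, cols)
                grid[r * k:(r + 1) * k, col * k:(col + 1) * k] = c
            plt.figure(figsize=(cols, rows))
            plt.imshow(grid, cmap='gray')
            plt.axis('off')
        else:
            fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
            for i, ax in enumerate(axes.flat):
                ax.axis('off')
                if i < len(channels):
                    ax.imshow(channels[i], cmap='gray')
    plt.show()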

Visualizing Filters — Multi-Channel

In the case of single_channel = False, we have 64 filters of depth 3 (RGB). We combine each filter’s RGB channels into one RGB image of size 11x11x3. As a result, we get 64 RGB images as the output.

#visualize weights for alexnet — first conv layer
plot_weights(alexnet, 0, single_channel = False)
Filters from first convolution layer in AlexNet

From the images, we can see that the kernels appear to learn blurry edges, contours, and boundaries. For example, figure 4 in the above image indicates a filter that is trying to learn a boundary. Similarly, figure 37 indicates a filter that has learned contours, which could help with image classification.

Visualizing Filter — Single Channel

By setting single_channel = True, we interpret each channel of a filter as a separate image. For each filter, we get 3 separate images representing the individual channels, since the depth of the filters in the first convolution layer is 3. In total, we will have 64*3 = 192 images as the output for visualization.

Filters from the first convolution layer in AlexNet (A subset out of 192 images)

From the above figure, we can see that each channel of each of the 64 filters (0–63) is visualized separately. For example, figure 0,0 indicates that the image represents the zeroth filter’s zeroth channel. Similarly, figure 0,1 indicates that the image represents the zeroth filter’s first channel, and so on.

Visualizing the filter channels individually gives more intuition about what the different filters are trying to learn from the input data. Looking closely at the visualizations, it is clear that some channels of the same filter show different patterns. That means not all channels of a filter are trying to learn the same information from the input image. As we move deeper into the network, the filter patterns become more complex; they tend to capture high-level information like the face of a dog or a cat.

As we go deeper into the network, the number of filters used for convolution increases, so it is not practical to visualize all the filter channels individually, either as single images or channel by channel. The second convolution layer of AlexNet (indexed as layer 3 in the PyTorch sequential model structure) has 192 filters, each of depth 64, so we would get 192*64 = 12,288 individual filter-channel plots. Another way to plot these filters is to collate all the channel images into a single grayscale heatmap.

#plotting single channel images
plot_weights(alexnet, 0, single_channel = True, collated = True)
Filters from the first convolution layer in AlexNet — Collated Values
#plotting single channel images - second convolution layer
plot_weights(alexnet, 3, single_channel = True, collated = True)
Filters from the second convolution layer in AlexNet — Collated Values
#plotting single channel images - third convolution layer
plot_weights(alexnet, 6, single_channel = True, collated = True)
Filters from the third convolution layer in AlexNet — Collated Values

As you can see, there are some interpretable features like edges, angles, and boundaries in the images from the first convolution layer, but as we go deeper into the network the filters become harder to interpret.

Image Occlusion Experiments

Occlusion experiments are performed to determine which patches of the image contribute maximally to the output of a neural network.

In an image classification problem, how do we know that the model is actually picking up the object of interest (e.g., a car wheel) as opposed to the surrounding background?

In occlusion experiments, we systematically iterate over all regions of the image, occluding a part of it with a grey patch (with pixels set to zero) and monitoring the classifier’s output probability.

For example, we start the occlusion experiment by greying out the top-left corner of the image and computing the probability of a particular class by passing the modified image through the network. We then iterate over all regions of the image and record the classifier’s probability for each experiment. The heatmap in the above figure clearly shows that the probability of the true class drops significantly if we occlude the object of interest, like a car wheel or the face of a dog (the dark blue region).

The occlusion experiments tell us that our convolutional neural network is actually learning meaningful patterns, like detecting the face of a dog in the input. That means the model is truly picking up on the location of the dog rather than identifying it based on the surrounding context, like a sofa or a couch.

To understand this concept clearly, let’s take an image from our data set and perform occlusion experiments on it.

For occlusion experiments, we will use VGG-16 pre-trained on ImageNet data.

#for visualization we will use vgg16 pretrained on imagenet data
model = models.vgg16(pretrained=True)

To perform the experiments, we need to write a custom function to conduct occlusion on the input image. The function occlusion takes 6 arguments: the model, an input image, the input image’s label, and three occlusion hyperparameters: the size of the occlusion patch, the occlusion stride, and the occlusion pixel value.

In the function, we first get the width and height of the input image. Then we compute the width and height of the output heatmap from the input image dimensions, the occlusion patch size, and the stride, and initialize a heatmap tensor with those dimensions.

Now we iterate through each position in the heatmap. In each iteration, we compute the coordinates of the occlusion patch to be placed in the original image and replace all the pixel information in that region, i.e., we modify the input image by covering a certain area with a grey patch. Once we have the modified input, we pass it through the model for inference, compute the probability of the true class, and update the heatmap at the corresponding location with that probability.
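
The full function is in the notebook; a sketch that follows the steps just described (the parameter names and defaults are assumptions) is:

import torch
import torch.nn.functional as F

def occlusion(model, image, label, occ_size=50, occ_stride=50, occ_pixel=0.5):
    #image is a (1, 3, H, W) tensor; label is the true class index
    model.eval()
    height, width = image.shape[-2], image.shape[-1]

    #output heatmap dimensions from the image size, patch size and stride
    output_height = (height - occ_size) // occ_stride + 1
    output_width = (width - occ_size) // occ_stride + 1
    heatmap = torch.zeros((output_height, output_width))

    for h in range(output_height):
        for w in range(output_width):
            h_start, w_start = h * occ_stride, w * occ_stride
            h_end = min(height, h_start + occ_size)
            w_end = min(width, w_start + occ_size)

            #replace the patch with a uniform grey value
            occluded = image.clone()
            occluded[:, :, h_start:h_end, w_start:w_end] = occ_pixel

            #probability of the true class for the occluded input
            with torch.no_grad():
                output = model(occluded)
            prob = F.softmax(output, dim=1)[0, label].item()
            heatmap[h, w] = prob
    return heatmap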

Once we obtain the heatmap, we display it using a seaborn heatmap, setting the maximum of the color scale (vmax) to the probability of the true class on the unoccluded image.
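
Displaying it might look like this (assuming a preprocessed input tensor image and its true class index label from earlier in the notebook):

import seaborn as sns

#probability of the true class without occlusion, used to cap the color scale
with torch.no_grad():
    prob_no_occ = F.softmax(model(image), dim=1)[0, label].item()

heatmap = occlusion(model, image, label, occ_size=50, occ_stride=50)
sns.heatmap(heatmap.numpy(), xticklabels=False, yticklabels=False, vmax=prob_no_occ)
plt.show()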

Occlusion Heatmap

In the heatmap, a darker color represents a smaller probability, meaning that occlusion in that area is very effective. If we occlude the darker regions of the original image, the probability of classifying the image correctly falls significantly (below 0.15).

What’s Next?

If you want to learn more about Artificial Neural Networks using Keras & TensorFlow 2.0 (Python or R), check out the Artificial Neural Networks course by Abhishek and Pukhraj from Starttechacademy. They explain the fundamentals of deep learning in a simple manner.

Conclusion

In this article, we discussed the receptive field of a neural network and two different methods to visualize a CNN model, along with their PyTorch implementations. Visualizing neural network models gives us better intuition about how to improve the model’s performance for a wide range of applications.


Feel free to reach out to me via LinkedIn or Twitter if you face any problems while implementing the code present in my GitHub repository.

Until next time Peace :)

NK.

Disclaimer — There might be some affiliate links in this post to relevant resources. You can purchase the bundle at the lowest price possible. I will receive a small commission if you purchase the course.
