Model Interpretability

A few months ago, OpenAI released CLIP, a transformer-based neural network that uses Contrastive Language–Image Pre-training to classify images. CLIP’s performance was quite impressive, since it takes an unusual approach that combines both text and images as input to classify images. Usually there is just one paper boasting about a new state-of-the-art result that achieves some really high score, and that’s it. The best thing about CLIP is that a few days ago, another small paper was released to explore CLIP’s interpretability.
Our paper builds on nearly a decade of research into interpreting convolutional networks, beginning with the observation that many of these classical techniques are directly applicable to CLIP. We employ two tools to understand the activations of the model: feature visualization, which maximizes the neuron’s firing by doing gradient-based optimization on the input, and dataset examples, which looks at the distribution of maximal activating images for a neuron from a dataset.
Source: OpenAI
This paper is quite interesting because it reveals a lot of useful information that helps explain why CLIP performs so well. Personally, I have been quite interested in the topic of neural network interpretability, simply because I recently realized how vital it is when releasing an actual AI model into production.
Using feature visualization (which I will get into in a bit), the authors of CLIP started examining the model at the level of individual neurons: they would pass a few similar images through the network and check whether the same neuron gets activated by a similar amount. This is quite interesting because, for example, if you have a network that classifies animals, you can imagine having a "dog neuron" or a "cat neuron". Also, if your network fails at classifying a certain animal after you have done this analysis, you would know where to look!
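To make this concrete, here is a minimal sketch (in PyTorch) of what this kind of neuron-level probing could look like. I use a standard torchvision ResNet as a stand-in for CLIP's vision encoder, and the layer choice, channel index, and image file names are made-up placeholders, not anything from the paper:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Sketch: probe how strongly one channel ("neuron") of an intermediate layer
# fires for different images. A torchvision ResNet stands in for CLIP's vision
# encoder; the layer, channel index, and file names are illustrative.
model = models.resnet50(pretrained=True).eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a forward hook on the layer we want to inspect.
model.layer4.register_forward_hook(save_activation("layer4"))

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def neuron_response(image_path, channel):
    """Mean activation of one channel for a single image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(img)
    return activations["layer4"][0, channel].mean().item()

# Do similar images light up the same hypothetical "dog neuron" similarly?
for path in ["dog1.jpg", "dog2.jpg", "cat1.jpg"]:
    print(path, neuron_response(path, channel=42))
```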
There is a lot to talk about here, and the paper covers many areas. One of the most interesting things is actually having a look at the activation maps; I can’t include them here because of licensing issues, but I urge you to go have a look at them. The other interesting area (and the one I am going to cover here) is the interpretability of image-based neural networks in general, and what benefits the authors actually got from their analysis.
Feature visualization

Typically I would focus on the "news" side of things, where the main goal of my article would be to summarise the recent release of a model. However, before talking about CLIP, I want to explain how the most common model interpretability techniques work in general. I think this will be quite beneficial to the reader, since it will not be a CLIP-specific explanation (and can thus be applied to other models). If you aren’t interested, you can just skip this "Feature visualization" section.
A good first step in interpreting an image model is to implement the following three methods from the literature. Although the field of neural network interpretation is relatively recent, these methods have had success in a range of problems.
1. Class saliency maps
This method of model interpretation involves ranking the pixels of a given image according to their influence on the class score obtained by that image. We first consider a simplified example using a linear model, where the score for class c of a vectorized image I is S_c(I) = w_c·I + b_c. Taking the derivative of S_c with respect to I, we see that the importance of the pixels of image I is given by the components of the weight vector w_c. This is the key idea in the method: the magnitude of the weight values shows which pixels in the image need to be changed the most to increase the class score.
However, we are not actually using a linear system to make our predictions. In fact, the function applied by the CNN is highly non-linear, so we use a first-order Taylor expansion to approximate the score function in the neighborhood of a given image I₀: S_c(I) ≈ w·I + b, where w is the derivative ∂S_c/∂I evaluated at I₀.
Given an image I with m rows and n columns, we first convert the image to greyscale so that each pixel only has one value. The weight matrix w (found by performing a single pass of back-propagation) then also has dimensions m by n. For each pixel (i, j) in image I, the class saliency map M ∈ ℝ^{m×n} is defined as:
M_{i,j} = |w_{i,j}|
This method is considerably faster than the two methods outlined below because only one pass of back-propagation is required.
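As a rough sketch of what this looks like in code (again using a pretrained torchvision classifier rather than CLIP, and a placeholder image path), a saliency map needs just one forward pass and one backward pass:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Sketch of a class saliency map: one forward pass, one backward pass from the
# target class score, then the absolute gradient magnitude per pixel.
model = models.resnet50(pretrained=True).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("input.jpg").convert("RGB")).unsqueeze(0)
img.requires_grad_(True)

scores = model(img)
target_class = scores.argmax(dim=1).item()
scores[0, target_class].backward()

# Collapse the RGB gradient to one value per pixel (playing the role of the
# greyscale step above) and take the magnitude: M_{i,j} = |w_{i,j}|.
saliency = img.grad[0].abs().max(dim=0).values   # shape (224, 224)
```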
2. Class-specific image generation
This method aims to generate an image that is "representative of the class" according to the scores generated by the model. The generated image I is the one that maximizes the score S_c(I) for a given class c, while also being regularized according to its L2-norm.
This image I is found using back-propagation, in a similar way to when training the model. The key difference is that now, instead of keeping the same inputs and optimizing the weights, we are keeping the weights constant and optimizing the inputs (i.e. the values of the pixels in the image). For this method, the image is initialized to have random RGB values for each pixel.
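A minimal sketch of this gradient-ascent procedure might look as follows; the target class index, learning rate, number of steps, and L2 weight are illustrative choices, not values from any paper:

```python
import torch
import torchvision.models as models

# Sketch of class-specific image generation: optimize the input pixels to
# maximize one class score while the network weights stay frozen.
model = models.resnet50(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 207       # illustrative ImageNet class index
l2_weight = 1e-4         # strength of the L2-norm regularizer
image = torch.randn(1, 3, 224, 224, requires_grad=True)  # random RGB start
optimizer = torch.optim.SGD([image], lr=0.5)

for step in range(200):
    optimizer.zero_grad()
    score = model(image)[0, target_class]
    loss = -score + l2_weight * image.norm()  # maximize score, penalize L2 norm
    loss.backward()
    optimizer.step()
```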
3. Deep Dream image generation
The key difference between Deep Dream image generation and class-specific image generation is that the starting image is no longer random. In the case of Deep Dream, an actual image from the data set is used as the initial image. The rest of the method is the same as class-specific image generation; we simply perform back-propagation and find the gradient with respect to the input image to update the image.
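In code, the only change relative to the sketch above would be the initialization, something along these lines (with a placeholder file name for the dataset image):

```python
import torchvision.transforms as T
from PIL import Image

# Deep Dream-style start: a real (placeholder) dataset image instead of noise;
# the optimization loop is the same as in the previous sketch.
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
image = preprocess(Image.open("dataset_example.jpg").convert("RGB"))
image = image.unsqueeze(0).clone().requires_grad_(True)
```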
The methods outlined above have mostly been applied in the literature to images from the ImageNet data set. These images are of everyday objects and generally consist of a clear foreground and background.

Outcomes of interpretability analysis
Okay, let’s get back to CLIP now. The authors explored the model’s interpretability and classified the neurons into multiple groups: region, people, and emotion neurons. Note that most of their analysis was on the vision side of CLIP, not the text side. The resulting analysis reveals some unusual associations between images and their meanings.
For example, for the emotion neurons, "bored" roughly decomposes into "relaxing + grumpy + sunset". The model also shows a clear bias (inherited from the dataset) in a lot of cases; for example, "illegal immigration" seems to be highly correlated with "Latin America". My point here is not to iterate over the examples, which you can just read about in their article, but rather to highlight the benefit of the interpretability analysis, because as a machine learning developer this is quite useful information.
Very funny failure cases
I am sure a lot of you have probably seen this somewhere around in the media:

It basically shows that although CLIP classifies an apple correctly, a simple "typographic attack" (writing "iPod" on a piece of paper and sticking it on the apple) can fool the network into thinking that the image is an iPod. This was quite eye-opening for me, because it shows that although CLIP is a powerful model, it can be easily fooled. And I think this is true for a lot of current AI systems (and I wonder how Elon Musk thinks AI is going to take over the world). I don’t mean to be discouraging; I think it is a good thing that we are aware of such issues so that we can fix them, and that is one of the main aims of this article: to promote more interpretability studies. Analyzing the attacks that can make a neural network fail is a significant part of such a study.
Finally, one of the most interesting things they released is the OpenAI Microscope, a tool that you can use to actually see those activations for yourself!
Final Thoughts
I hope you have enjoyed this article; I tried not to make it too long. One of the most important takeaways I hope you get from it is to include a small interpretability section in your machine learning projects. This will help you understand your deep learning models in depth and will provide extremely valuable images that you can present (in addition to your scores and metrics).