In recent times, the study of interpretability and explainability methods for deep neural networks has gathered pace. This comes at a time when AI is being adopted in more and more mission-critical sectors. The slightest miscomputation by these deep-learning-based systems can cause mayhem in terms of loss of trust, loss of money, loss of socio-economic stability, and even loss of human life. There is an urgent need to break the black-box nature of deep learning models and make them understandable to a wider audience, especially laypeople. The need for comprehensible models has never been felt more strongly than since the rise of deep learning. There is also a broader study of fairness in AI going on around this, but in this blog we are not going into those details. Rather, I tried to find out whether a very accurate model can attain human-level vision.
In my quest to find the answer, I trained a simple CNN model on the MNIST digit data and tried to answer questions like:
- What part of the image is important for the classification result?
- Does high accuracy mean reliability?
- Does the model think as we humans do?
To find these answers, I employed some of the already popular interpretability methods along with a custom high-level, segment-based visual interpretability method.
Model:
MNIST is probably one of the simplest computer vision datasets, so we don’t need a very deep network to achieve very high accuracy (>90%) on the classification task. I created a very simple model for this purpose (probably in a bit of a slapdash manner) and trained it to 96.25% validation accuracy and 95.15% test accuracy.
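The original training code is not reproduced here; a minimal sketch of the kind of small Keras CNN this could be is shown below. The layer sizes, number of epochs, and validation split are my assumptions, not the original configuration.

```python
# Minimal sketch of a small CNN for MNIST (assumed architecture, not the
# author's original code). Trains in a couple of epochs to >95% accuracy.
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0  # shape (60000, 28, 28, 1)
x_test = x_test[..., None].astype("float32") / 255.0

model = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, validation_split=0.1)
model.evaluate(x_test, y_test)
```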


Traditional Interpretability Study:
It’s now time to employ some of the popular interpretability methods to explain the classifier’s decision. I used the ‘tf_explain’ library for this task; to be precise, I used the SmoothGrad, Integrated Gradients, and Occlusion Sensitivity methods from the library. The tests are performed on the first image of the test dataset, which happens to be the digit 7.

The CNN model correctly predicts the class of this input as 7 with a softmax score of 0.9993. Below is the code for running the methods, followed by the output of the three aforementioned techniques.
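In place of the original snippet, here is a minimal sketch of how the three explainers are typically invoked through tf_explain’s core API, reusing the `model` and `x_test` from the training step. The `num_samples`, `noise`, and `n_steps` values are illustrative assumptions; only the 3×3 occlusion window comes from the experiment described below.

```python
# Sketch of running the three tf_explain methods on the first test image
# (the digit 7). `model` and `x_test` come from the training step above.
from tf_explain.core.smoothgrad import SmoothGrad
from tf_explain.core.integrated_gradients import IntegratedGradients
from tf_explain.core.occlusion_sensitivity import OcclusionSensitivity

data = (x_test[:1], None)   # tf_explain expects an (images, labels) tuple
class_index = 7             # the class we want to explain

smoothgrad = SmoothGrad()
grid = smoothgrad.explain(data, model, class_index, num_samples=20, noise=1.0)
smoothgrad.save(grid, ".", "smoothgrad.png")

intgrad = IntegratedGradients()
grid = intgrad.explain(data, model, class_index, n_steps=15)
intgrad.save(grid, ".", "integrated_gradients.png")

occlusion = OcclusionSensitivity()
grid = occlusion.explain(data, model, class_index, patch_size=3)  # 3x3 window
occlusion.save(grid, ".", "occlusion_sensitivity.png")
```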


Attribution-based interpretability methods explain the model’s outcome in terms of per-pixel gradients; each pixel is treated as an individual (independent) input variable of the model. Occlusion sensitivity, on the other hand, tells us how different parts of the input image impact the model’s decision by sliding a window over the image; for our experiment, the window size is 3×3. A single pixel by itself carries little high-level information about the image beyond its intensity.
So, the question remains whether this sort of interpretability is comparable to human vision.
High-Level Segmentation-Based Interpretability:
Human vision differs from computer vision in two main aspects. Firstly, the human brain draws on a huge store of prior knowledge, acquired through diverse sensory organs, experience, and memory; a deep learning model lacks this sort of prior knowledge for vision-related tasks. Secondly, when we see a picture, rather than taking in the complete image at once, we pay attention to different areas of the image, gather high-level features, and then consolidate those features to decide what the image is. So, if we ask ourselves why the input is an image of the digit 7, we would probably answer that it has a horizontal line connected to a slanting vertical line, and that this matches our prior knowledge of the digit 7; hence the input image is of class 7.
Can we get this level of interpretation from the CNN model? To find out, I employed a special technique: I segmented the input image with the ‘Felzenszwalb’ method from the ‘skimage’ library and, rather than giving the whole image as input to the model, fed each individual segment to the model and recorded the predicted class along with its score.
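A rough sketch of this segment-level probing, again reusing `model` and `x_test` from earlier, might look like the following; the Felzenszwalb parameters are illustrative assumptions rather than the values used in the original experiment.

```python
# Sketch: segment the image with Felzenszwalb, then feed each segment to
# the model in isolation. scale/sigma/min_size are assumed values.
import numpy as np
from skimage.segmentation import felzenszwalb

image = x_test[0, ..., 0]                 # first test image as a 2D array
segments = felzenszwalb(image, scale=100, sigma=0.5, min_size=5)

for seg_id in np.unique(segments):
    # Keep only the pixels of this segment, zero out everything else.
    masked = np.where(segments == seg_id, image, 0.0)
    probs = model.predict(masked[None, ..., None], verbose=0)[0]
    pred = int(np.argmax(probs))
    print(f"segment {seg_id}: predicted class {pred}, score {probs[pred]:.4f}")
```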

I find the outcome of this experiment unusual, interesting, uncanny, and dangerous at the same time. If you look at the top three segments, which are nothing but the horizontal line from the actual image of the digit 7, the model predicts them as class 7 with a near-perfect score, even though those segments look nothing like the digit 7. For the fourth segment, which does look somewhat like a 7, the prediction score comes down to 0.913.
This finding further underscores the question of what the network is actually learning. Is it able to learn high-level features the way we humans do, or does it just find some low-level interaction of pixel-intensity patterns and classify images based on the presence or absence of those patterns?
Conclusion:
This blog aims to demonstrate that high accuracy doesn’t always mean the model is reliable or that it has achieved a human-level understanding of images; rather, the outcome can be quite the opposite. So, as deep learning practitioners, we must ensure that a model really performs vision-related tasks based on reliable high-level features. This sense of reliability is very much needed before we hand over our tasks to AI- or deep-learning-based systems. I feel that the search for more robust interpretability and reliability methods is still at a nascent stage, and that we will be able to build more reliable models for vision-related tasks in the future.