
AI Explainability Requires Robustness

How robustness to adversarial input perturbations affects model interpretability


Photo by Nagara Oyodo on Unsplash

Due to their opaqueness, a great deal of mystique surrounds the apparent power of deep neural networks. Consequently, we often want to gain better insight into our models through explanations of their behavior. As we will see, however, the existence of adversarial examples – known to plague typical neural networks – implies that explanations will often be unintelligible. Luckily, recent efforts to train so-called robust models reveal a pathway to more interpretable models: models that are trained to be robust to adversarial input perturbations exhibit higher-quality explanations.

Explanations and Interpretability

In the context of machine learning, an explanation broadly refers to some construct that helps us understand a model’s behavior. Most frequently this is realized by an attribution method, which quantifies the degree to which the model uses each of its features on a particular input.
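
To make this concrete, below is a minimal sketch of one simple attribution method: the gradient of the model’s output with respect to its input, multiplied element-wise by the input. The `model`, `x`, and `target_class` names are placeholders for illustration, not part of any particular library’s API.

```python
import torch

def gradient_times_input(model, x, target_class):
    """Minimal 'gradient x input' attribution: how much does each input
    feature locally push the score of `target_class` up or down?"""
    x = x.clone().detach().requires_grad_(True)  # x is assumed to have a batch dimension
    score = model(x)[0, target_class]            # scalar logit for the class of interest
    score.backward()                             # populates x.grad with d(score)/dx
    return (x.grad * x).detach()                 # element-wise product, same shape as x
```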

Typically the consumers of explanations are humans, so we need a way of interpreting explanations. In the case of image classifiers, for example, we may visualize explanations by highlighting the parts of an image that were deemed most relevant by the model. An explanation can be considered interpretable if it can convey useful insight that can be easily understood by the person inspecting the explanation. Likewise, we will typically refer to the model itself as interpretable or explainable if explanations produced for the model generally make sense to humans.

While there are many methods for producing explanations, we will focus on mathematically rigorous methods (e.g., [1]) that produce explanations that are causally relevant to the model’s actual behavior, a property we will refer to as faithfulness. This is an essential property to have – an intuitive explanation is more misleading than helpful if it doesn’t accurately describe the model’s behavior. Thus, we should resist the temptation to "improve" our explanations by seeking explanation methods that always produce interpretable explanations, as such methods may be biased away from being faithful to the model.

In this light, we see that a lack of interpretability is actually a problem intrinsic to the model itself. The remainder of this article will explore a key instance of this insight.

Adversarial Examples

An adversarial example is an input to a model that resembles one class (e.g., "panda"), while being classified as another class (e.g., "gibbon") by the model. While the concept of resemblance is nebulous, we typically take it to mean that an adversarial example is derived by perturbing a natural input in a semantically meaningless way – for example, the perturbation may be small enough to be imperceptible to the human eye, or simply inconspicuous in the given context.

A classic example of an adversarial example, adapted from Goodfellow et al. [3], where an image of a panda is manipulated imperceptibly to fool a neural network into predicting "gibbon."

Adversarial examples impact the reliability of neural networks that are vulnerable to them – and constitute a security concern in safety-critical ML systems – as they lead to unexpected erroneous behavior on seemingly benign inputs.
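
The panda example above was produced with the fast gradient sign method (FGSM) from [3]. A minimal sketch of that attack is shown below; `model`, `x`, and `label` are placeholders, and the value of `epsilon` is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon=0.01):
    """Fast gradient sign method [3]: take one step of size epsilon in the
    direction that increases the loss, producing a perturbed input that
    often changes the model's prediction while looking unchanged."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()    # small, inconspicuous perturbation
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixel values in a valid range
```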

Upon closer inspection, it becomes clear that the existence of adversarial examples has implications for explainability as well. Specifically, adversarial examples imply – by their existence – that legitimate explanations may rightly be as perplexing as the anomalous behavior induced by the adversarial examples themselves.

Consider the following thought experiment (illustrated in the figure below). Suppose we have an image, like the image of a panda on the left in the figure, that our model correctly labels as "panda." Further, let us suppose that changing one pixel in the corner of the image to red leads the model to produce a different label from "panda."

Example of how an explanation can be unintelligible, yet justified. Image by author.

When we ask the model why it labeled the original image as a panda, we might get an explanation highlighting nothing but the single pixel in the corner of the image, as shown in the figure. While this explanation is certainly confusing, when we consider the context of the model’s behavior, there is actually an argument for why such an explanation would be justified. After all, if it weren’t for the value taken by that pixel in the original image, the label might not have been "panda." Thus it is reasonable to consider that pixel to be highly important in the model’s decision to label the image as "panda."

This suggests that in order for a model to be interpretable, it must make decisions based on features that are fundamentally intelligible to humans. Furthermore, we cannot expect that a model will naturally learn human-intelligible features without adequate regularization. After all, there are many ways of using features that could be consistent with the training data, but clearly not all of them are intelligible.

Robust Models

In order to defend against adversarial examples, we often aim to obtain so-called robust models, which are resistant to malicious perturbations. These defenses are usually tailored to a specific class of adversarial examples that can be precisely defined without depending on human perception, namely small-norm adversarial examples.

As the name suggests, a small-norm adversarial example is one for which the norm of the perturbation is below some small threshold, typically denoted by ε. In other words, the distance of the adversarial example (according to some metric, e.g., Euclidean distance) from the original input is less than ε. In terms of perception, when ε is sufficiently small, any points that are ε-close to the original input will be perceptually indistinguishable from the original input.

We say that a model is _locally robust_ at a point, x, if all points within a distance of ε from x receive the same label as x from the model. Hence, we see that small-norm adversarial examples cannot be derived from points for which a model is locally robust. Robust models resist adversarial examples by endeavoring to achieve local robustness at as many points as possible.
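
In symbols, writing F for the classifier’s predicted label and d for the chosen distance metric, one common way to state local robustness at x with radius ε is:

```latex
\forall x'.\; d(x', x) \le \epsilon \;\Longrightarrow\; F(x') = F(x)
```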

There has been a great deal of research on various methods for producing robust models. For example, a popular family of approaches is adversarial training [4], in which the training set is augmented by adversarial perturbations during training – essentially, the network is trained on adversarial examples. While adversarial training often provides a decent empirical defense, it does not offer any guarantee that would allow us to know which points, if any, are truly locally robust.
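
To give a flavor of how this looks in practice, here is a rough sketch of an adversarial training loop in the spirit of [4], using a projected gradient descent (PGD) attack as the inner step. The `model`, `optimizer`, `x`, and `y` names are placeholders, and the values of `epsilon`, `alpha`, and `steps` are illustrative choices rather than prescribed settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.3, alpha=0.01, steps=40):
    """Projected gradient descent attack: repeatedly step along the sign of
    the loss gradient and project back into the epsilon-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)  # project to the epsilon-ball
        x_adv = x_adv.clamp(0.0, 1.0)                     # stay in valid pixel range
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """One training step on adversarially perturbed inputs, as in adversarial training [4]."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```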

We might alternatively want to provide provable guarantees of robustness, as opposed to using heuristic defenses like adversarial training. For example, GloRo Nets [2], a recent type of neural network designed to be robust-by-construction, offer a state-of-the-art approach for training models with robustness guarantees. (For a crash course on how GloRo Nets achieve robustness, check out my blog post on the subject).
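
To give a sense of how such guarantees can work, the sketch below shows the kind of Lipschitz-margin check that robust-by-construction approaches like GloRo Nets build on: if every other class’s logit trails the predicted class by more than ε times a Lipschitz bound on their difference, no perturbation within the ε-ball can flip the prediction. This is an illustrative sketch, not the GloRo implementation; in particular, the `pairwise_lipschitz` bounds are assumed to be supplied by some external computation, and obtaining tight bounds is precisely the hard part these methods address.

```python
import torch

def lipschitz_margin_certified(logits, pairwise_lipschitz, epsilon):
    """Certify local robustness from Lipschitz bounds (illustrative sketch).

    logits:              tensor of shape (num_classes,) for a single input
    pairwise_lipschitz:  K[j, i] upper-bounds the Lipschitz constant of
                         logit_j - logit_i (assumed given)
    """
    j = torch.argmax(logits).item()   # predicted class
    margins = logits[j] - logits      # gap to every other class
    margins[j] = float("inf")         # ignore the predicted class itself
    # Within an epsilon-ball, class i can close the gap to class j by at most
    # epsilon * K[j, i]; if every margin exceeds that, the prediction cannot flip.
    return bool(torch.all(margins > epsilon * pairwise_lipschitz[j]))
```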

Robustness and Explainability

Intuitively, when a model is robust, it can’t heavily rely on imperceptible patterns to make its decisions – otherwise, these patterns could be inconspicuously added to natural images to lead the network astray, resulting in adversarial examples. This point is made more formally in work by Ilyas et al. [5], who argue that, broadly speaking, there is a dichotomy between "robust features" and "non-robust features." The latter are responsible for adversarial examples and are inherently non-interpretable. Robust models, on the other hand, are dissuaded from learning these non-robust features, meaning they will primarily use features that are at the very least perceptible to humans, giving their explanations a chance at being interpretable.

In practice, this leads robust models to exhibit explanations that are far more intelligible than those produced for their non-robust counterparts. An example of this is shown in the figure below.

Example demonstrating how explanation quality is improved on robust models. Image by author, derived from the MNIST dataset.

In the example shown in the figure, we trained two simple convolutional network models on the MNIST dataset: one was trained non-robustly using standard training (bottom); the other was trained using GloRo training [2], which produces a provably robust model (top). We then calculated and visualized gradient-based input explanations for both models on a sample of test inputs, using the TruLens library. In the visualizations, red regions correspond to pixels that would increase the confidence in the correct class if their brightness were amplified, while blue regions correspond to pixels that would decrease the confidence if brightened.
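
To make the red/blue convention concrete, the snippet below shows one way a signed saliency map can be rendered with a diverging colormap. The saliency values here are random placeholders standing in for actual input gradients; the figures above were produced with TruLens, whose API is not shown here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder saliency map; in practice this would be the gradient of the
# correct-class score with respect to the input pixels.
saliency = np.random.randn(28, 28)

# Symmetric color scale so zero attribution maps to white,
# positive values to red, and negative values to blue.
bound = np.abs(saliency).max()
plt.imshow(saliency, cmap="bwr", vmin=-bound, vmax=bound)
plt.axis("off")
plt.colorbar(label="attribution")
plt.show()
```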

Intuitively, the most positively-relevant (red) pixels ought to be those that correspond to the hand-written digit in each image. We see that on the robust model, the salient pixels indeed line up with this intuition. On the other hand, the explanations on the non-robust model are far noisier, and seem to indicate that the model is less focused on the actual digits and more sensitive to irrelevant artifacts.

Summary

Fundamentally, quality explanations require quality models, as an explanation is intended to accurately illuminate a model’s behavior. At its core, a lack of robustness is a problem of model quality and conceptual soundness; robustness is therefore a basic requirement for ensuring interpretability. However, we should be clear that robustness alone may not guarantee that a model uses features in a reasonable way. In other words, while robustness is not automatically sufficient for conceptual soundness, it is necessary; accordingly, we should keep robustness in our toolbox whenever we want explainable models.

References

  1. Leino et al. "Influence-directed Explanations for Deep Convolutional Networks." ITC 2018. ArXiv
  2. Leino et al. "Globally-Robust Neural Networks." ICML 2021. ArXiv
  3. Goodfellow et al. "Explaining and Harnessing Adversarial Examples." ICLR 2015. ArXiv
  4. Madry et al. "Towards Deep Learning Models Resistant to Adversarial Attacks." ICLR 2018. ArXiv
  5. Ilyas et al. "Adversarial Examples Are Not Bugs, They Are Features." NeurIPS 2019. ArXiv
