
Overfitting and Conceptual Soundness

How feature usage informs our understanding of overfitting in deep networks

Klas Leino
Towards Data Science
Oct 21, 2021 · 8 min read


Photo by Shane Aldendorff on Unsplash

Overfitting is a central problem in machine learning that is strongly tied to the reliability of a learned model when it is deployed on unseen data. Overfitting is often measured — or even defined — by the difference between a model’s accuracy on its training data and its accuracy on previously unseen validation data. While this is a useful metric that broadly captures the extent to which a model will make mistakes on new points (one of the key problematic implications of overfitting), we will instead adopt a more general and nuanced view of overfitting. In particular, this article will use the TruLens explanatory framework to examine a key mechanism underlying overfitting: the encoding and use of unsound features.

At a high level, deep networks work by learning to extract high-level features that enable them to make predictions on new inputs. While some of these features may be genuinely generalizable predictors, others may coincidentally aid classification only on the training set. In the former case, we say the learned features are conceptually sound. Features of the latter type are not conceptually sound, and may thus lead to anomalous or incorrect behavior on unseen points, i.e., overfitting.

In the remainder of this article, we will present evidence supporting this perspective of overfitting, and show how TruLens can be used to assess the features that are learned and used by neural networks. For a more general introduction to TruLens, see this article.

An Illustrative Example

Our hypothesis is that overfitting manifests itself in a model through idiosyncratic feature use. To illustrate this point, we will consider an example from the “labeled faces in the wild” (LFW) dataset. The LFW dataset contains images of many celebrities and prominent public figures circa the early 2000s, and the task is to identify the person in each picture. We have selected a subset containing five of the most frequently appearing identities. The full dataset can be obtained via scikit-learn.
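
For reference, a subset like ours can be loaded with scikit-learn’s fetch_lfw_people. The minimum-image threshold and the color and resize options below are assumptions about the preprocessing, and the images would still need to be resized to match the 64×64 inputs used later.

from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split

# Keeping only identities with at least ~100 images leaves a handful of
# the most frequent people (including Tony Blair and Gerhard Schroeder).
lfw = fetch_lfw_people(min_faces_per_person=100, color=True, resize=0.5)

print(lfw.target_names)    # the selected identities
print(lfw.images.shape)    # (n_samples, height, width, 3)

# Hold out a validation split.
x_train, x_test, y_train, y_test = train_test_split(
    lfw.images, lfw.target, test_size=0.25, random_state=0)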

Sample of LFW training points. We see the image in the top right corner has a distinctive pink background. (Image from Leveraging Model Memorization for Calibrated White-Box Membership Inference)

In the training set, we find that a few images of Tony Blair have a unique and distinctive pink background. Our hypothesis suggests that a model can overfit by learning to use the pink background as a feature for Tony Blair, as the feature is indeed predictive of Tony Blair on the training set. Of course, despite its coincidental usefulness on the training set, the background is clearly not conceptually sound, and is unlikely to be useful on new data.

If the model overfits in this way, it will be evident from an inspection of the features that are encoded and used by the model on instances with pink backgrounds. This can be done using internal influence [1] with TruLens. A notebook reproducing the images and experiments in this article can be found here.

We will begin by training a simple convolutional neural network (CNN) on our LFW training set. For example, using TensorFlow:

from tensorflow.keras.layers import (
    Input, Conv2D, Activation, MaxPooling2D, Flatten, Dense)
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.models import Model

from trulens.nn.models import get_model_wrapper

# Define our model.
x = Input((64, 64, 3))
z = Conv2D(20, 5, padding='same')(x)
z = Activation('relu')(z)
z = MaxPooling2D()(z)
z = Conv2D(50, 5, padding='same')(z)
z = Activation('relu')(z)
z = MaxPooling2D()(z)
z = Flatten()(z)
z = Dense(500)(z)
z = Activation('relu')(z)
y = Dense(5)(z)

keras_model = Model(x, y)

# Compile and train the model.
keras_model.compile(
    loss=SparseCategoricalCrossentropy(from_logits=True),
    optimizer='rmsprop',
    metrics=['sparse_categorical_accuracy'])
keras_model.fit(
    x_train,
    y_train,
    epochs=50,
    batch_size=64,
    validation_data=(x_test, y_test))

# Wrap the model as a TruLens model.
model = get_model_wrapper(keras_model)

Next we will use internal influence to examine the primary learned features the model uses to make its decisions. According to our hypothesis, an overfit model will have encoded the pink background as a feature used for classification. To test this hypothesis, we need to find where (if anywhere) this feature is encoded in the model, and determine whether it is used in classification.

The convolutional layers of a CNN consist of many channels, or feature maps, which are in turn made of a grid of individual neurons. Each channel represents a single type of feature, while the neurons inside each channel represent that type of feature at a specific location in the image. It is possible that a high-level feature (e.g., a pink background) is encoded by a network, but that it doesn’t correspond to a single channel. For example, it may be formed by a linear combination of multiple channels. In our example, for simplicity, we will limit our search to considering single channels, which happens to work for us.

The network we have trained is not particularly deep, so we don’t have too many choices of layer to search in. Typically, deeper layers encode progressively higher-level features; to find our pink-background feature, we will begin by searching the second convolutional layer.
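
To make “the second convolutional layer” concrete, we can enumerate the layers of the underlying Keras model; in this implementation, index 4 is the second Conv2D layer. (A quick sketch, assuming the TruLens wrapper indexes layers the same way as keras_model.layers.)

# Print each layer's index, type, and output shape to locate the second
# convolutional layer (index 4 in this architecture).
for i, keras_layer in enumerate(keras_model.layers):
    print(i, keras_layer.__class__.__name__, keras_layer.output.shape)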

We use the following procedure: first, we find the most influential channel in the second convolutional layer (layer 4 in the implementation of our model). There are a number of ways we could do this; in our case we will assign influence to each channel according to the maximum influence among the neurons in that channel. Once we have determined the most influential channel, we visualize it by finding the input pixels that contribute most to that channel. Altogether, this procedure tells us which feature (at our chosen layer) is most influential on the model’s prediction, and which parts of the image make up that feature.

from trulens.nn.attribution import InternalInfluence
from trulens.visualizations import HeatmapVisualizer

layer = 4

# Define the influence measure.
internal_infl_attributer = InternalInfluence(
    model, layer, qoi='max', doi='point')

internal_attributions = internal_infl_attributer.attributions(
    instance)

# Take the max over the width and height to get an attribution for
# each channel.
channel_attributions = internal_attributions.max(
    axis=(1, 2)
).mean(axis=0)

target_channel = int(channel_attributions.argmax())

# Calculate the input pixels that are most influential on the
# target channel.
input_attributions = InternalInfluence(
    model, (0, layer), qoi=target_channel, doi='point'
).attributions(instance)

# Visualize the influential input pixels.
_ = HeatmapVisualizer(blur=3)(input_attributions, instance)

(Image by Author)

The most important pixels are highlighted in red. We see that the background is indeed being heavily used by our model. Using an alternative visualization technique, we can again confirm that the explanation focuses on the background in these distinctive training points:

import matplotlib.pyplot as plt

from trulens.visualizations import ChannelMaskVisualizer
from trulens.visualizations import Tiler

visualizer = ChannelMaskVisualizer(
    model,
    layer,
    target_channel,
    blur=3,
    threshold=0.9)

visualization = visualizer(instance)
plt.imshow(Tiler().tile(visualization))

(Image by Author)

For the sake of comparison, we can follow the same procedure on a different model that did not see any pink backgrounds during training. This model has no reason to encode a pink background feature, let alone use it to identify Tony Blair. As expected, we see that the result is quite different:

(Image by Author)
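
For completeness, one way such a comparison model could be obtained is by retraining the same architecture on a training set with the pink-background images removed. The sketch below does this with a hypothetical, hand-picked index list; the comparison model in this article may have been constructed differently.

import numpy as np
import tensorflow as tf

# Hypothetical placeholder: indices of the few training images with the
# distinctive pink background, found by manual inspection.
pink_idx = [0, 1, 2]
keep = np.setdiff1d(np.arange(len(x_train)), pink_idx)

# Retrain the same architecture on the filtered training set.
keras_model_no_pink = tf.keras.models.clone_model(keras_model)
keras_model_no_pink.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='rmsprop',
    metrics=['sparse_categorical_accuracy'])
keras_model_no_pink.fit(
    x_train[keep], y_train[keep],
    epochs=50, batch_size=64,
    validation_data=(x_test, y_test))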

Catching Mistakes with Explanations

Explanations can help increase our trust in conceptually sound models, or help us anticipate future mistakes that may arise from the use of unsound features.

Consider again our running example. The model learned that a pink background is a feature of Tony Blair. As it happens, there are no images in our test set — of Tony Blair or any other person — with a pink background. Our test set will thus not be useful in identifying this case of conceptual unsoundness. But should the model be trusted? Presumably pink backgrounds could easily arise in deployment, even if they are not found in the test set.

Both the model trained with the pink-background images and the model trained without them achieved roughly the same validation accuracy (between 83% and 84%). From the perspective of the validation metrics, we should be just as happy with either of them. But again, the explanations generated in the previous sections make it clear that one model has a weakness that the other does not.
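
If both models are at hand, this comparison is a quick check with Keras:

# Both models reach roughly the same held-out accuracy (about 83-84% in
# our experiments), despite one of them relying on an unsound feature.
_, acc_pink = keras_model.evaluate(x_test, y_test, verbose=0)
_, acc_no_pink = keras_model_no_pink.evaluate(x_test, y_test, verbose=0)
print('with pink backgrounds:   ', round(acc_pink, 3))
print('without pink backgrounds:', round(acc_no_pink, 3))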

In fact, we can directly demonstrate the implications of unsound feature use, which can be foreseen upon examining explanations. Though we have no examples in the test set that display a pink background, this can be easily fixed with some basic photo-editing. Here we have edited an image of a non-Tony-Blair person from LFW, Gerhard Schroeder, to have a pink background. Pictures like the edited image could, of course, easily be realized in real life.

Original image of Gerhard Schroeder from LFW (left) and edited version (right). (Image by Author)
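
The edit itself requires nothing sophisticated. Below is a minimal sketch of doing it programmatically, assuming a hypothetical boolean mask, bg_mask, that marks the background pixels; the edited image in the figure could just as well have been produced in any photo editor.

import numpy as np

def paint_background_pink(image, background_mask, pink=(1.0, 0.75, 0.8)):
    # Replace the pixels marked True in `background_mask` with a flat
    # pink color (the values assume images scaled to [0, 1]).
    edited_image = image.copy()
    edited_image[background_mask] = pink
    return edited_image

# `original` has shape (1, 64, 64, 3); `bg_mask` is a hypothetical
# (64, 64) boolean array covering the background.
# edited = paint_background_pink(original[0], bg_mask)[np.newaxis]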

We see that on the original image, the model makes the correct prediction of class 3, corresponding to Gerhard Schroeder. However, on the edited image, the model predicts class 4, corresponding to Tony Blair.

>>> keras_model.predict(original).argmax(axis=1)
array([3])
>>>
>>> keras_model.predict(edited).argmax(axis=1)
array([4])

And, predictably, if we ask the model why it has predicted Tony Blair on the edited image, we see that the pink background is again highlighted.

(Image by Author)
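
This explanation can be reproduced by pointing the attribution code from earlier at the edited image, for instance by attributing the pink-background channel found above back to the input pixels. (A sketch; the exact quantity of interest used for this figure may differ.)

# Attribute the target channel to the input pixels of the edited image
# and render the heatmap, as we did for the training images.
edited_attributions = InternalInfluence(
    model, (0, layer), qoi=target_channel, doi='point'
).attributions(edited)

_ = HeatmapVisualizer(blur=3)(edited_attributions, edited)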

Finally, if we turn to our alternative model trained without the pink background, we observe that our edited image does not cause the same erroneous behavior. After all, the alternative model has no reason to associate a pink background with Tony Blair (or any other person), and did not appear to do so.

>>> keras_model_no_pink.predict(original).argmax(axis=1)
array([3])
>>>
>>> keras_model_no_pink.predict(edited).argmax(axis=1)
array([3])

Other Implications of Overfitting

We have seen how overfitting can lead a model to make peculiar mistakes on unseen data. In addition to causing misclassifications, overfitting presents a privacy risk. Intuitively, by learning features that are overly specific to the training set, models inadvertently leak information about their training data. My research [2] on this topic with Matt Fredrikson (appearing at USENIX Security 2020) uses insights similar to those presented here to show how an attacker can make inferences about the data used to train a model.

In particular, the attack we designed works even on models with essentially no gap in accuracy between their training and validation sets. This underscores the point that overfitting does not need to show up as mistakes on the validation data to cause problems. By examining the way in which our models use features, we can gain more trust in their efficacy than simple performance metrics alone would justify; and conversely, we can identify potential problems that might otherwise go unnoticed.

Summary

Machine learning models are prone to learning unsound features that can lead to prediction errors, privacy vulnerabilities, etc. Explanations can help identify cases of unsound feature usage that might otherwise go undetected even on the validation data. On the other hand, we should strive for models on which explanations indicate sound feature usage, increasing our trust in the model’s performance on future unseen data. TruLens is a powerful and easy-to-use tool that can help us apply this valuable form of model analysis.

References

  1. Leino et al. “Influence-directed Explanations for Deep Convolutional Networks.” ITC 2018. ArXiv
  2. Leino & Fredrikson. “Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference.” USENIX Security 2020. ArXiv


Klas received his PhD at CMU studying the weaknesses and vulnerabilities of deep learning; he works to improve DNN security, transparency, and privacy