Identifying Global Feature Relationships with SHAP Values

How to calculate global information for your features and use it in model-agnostic feature selection procedures

Tiago Toledo Jr.
Towards Data Science

--

Photo by Shahadat Rahman on Unsplash

SHAP values are now one of the most widely used ways of explaining machine learning models and understanding how the features of your data relate to the model's outputs.

One of the biggest advantages of SHAP Values is that they provide local explainability: we can see how each feature affected the result for each instance.

This power, however, comes at a cost: using SHAP values for feature selection is not straightforward. It is common to cut off features based on their average or maximum SHAP value, but those approaches do not guarantee that you are not removing an important feature from your dataset. To date, there is no consensus on how to use SHAP values for feature selection.

Another option is to define feature importance based on the percentage impact on your dataset. This, however, requires a lot of fine-tuning and is not standardized enough to be applied in a model-agnostic way.

In this post, we will study a paper [1] that proposes a way of building global explanations from SHAP values, letting us quantify how strongly our features interact with and duplicate each other, which leads to a better understanding of the dataset and a more principled way of doing feature selection.

All the code from this post is available in a notebook on Kaggle and also on my GitHub.

SHAP Quick Overview

First, let’s do a recap on SHAP Values and the SHAP library. The idea is not to explain the inner workings of the method, but rather to review the interpretation of the results from the library and understand how to get them.

When we compute the SHAP values for a model and a dataset, we get an N x M matrix, where N is the number of instances in your dataset and M is the number of features, so it has the same shape as the original dataset.

Each i,j entry in this matrix denotes the impact that feature j had on the prediction of instance i. This allows us to explain, at the local level, how our model is making its predictions based on the features.

If you take the average (base) prediction of your model and add to it the sum of each row of the SHAP value matrix, you get exactly the model's prediction for each instance of your dataset.
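As a quick sanity check of this additive property, here is a minimal sketch using the same Wine dataset and Random Forest we will use later in this post (the random_state is an arbitrary choice added here for reproducibility):

import numpy as np
import shap
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
rf = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.Explainer(rf)
sv = explainer(X)   # Explanation object; sv.values has shape (N, M, n_classes)

# Base value + row-sum of SHAP values should reproduce the model output
# (for this classifier, the predicted probability of each class)
reconstructed = sv.base_values + sv.values.sum(axis=1)
print(np.allclose(reconstructed, rf.predict_proba(X)))   # expected: True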

This additive structure is the basic property of SHAP values that we are going to use.

Another very important structure is the SHAP interaction vector.

SHAP Interaction Vectors

The SHAP interaction vector between two features quantifies how those features jointly affect the predictions. To calculate it, the SHAP value for feature i is computed once with feature j present and once with feature j absent; repeating this over all possible permutations and stacking the results for every sample generates the interaction vector.

If we define the SHAP vector for feature i, written $p_i$, as the vector containing the SHAP values of feature i for every sample, then we know that:

$$p_i = \sum_{j} p_{ij}$$

where $p_{ij}$ is the SHAP interaction vector between features i and j.
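Continuing the sanity-check sketch from the overview above, we can verify this identity numerically for one class (the [0] index picks the first class, mirroring the code later in this post; depending on your shap version the interaction values may come back as a list with one array per class):

# SHAP interaction values for class 0, shape (N, M, M)
siv = explainer.shap_interaction_values(X)[0]
# SHAP values for class 0, shape (N, M)
sv_class0 = sv.values[:, :, 0]

# Summing the interaction vectors over j should recover the SHAP vector of feature i
print(np.abs(siv.sum(axis=2) - sv_class0).max())   # should be numerically close to zero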

As we are going to see in the next sections, these notions are important because the described methods have a really strong geometric interpretation that will help us understand what is happening.

To start, let’s plot both of these vectors in space and build things up from there:

SHAP Vector and SHAP Interaction Vector. Image by the author, based on an image from [1]

Synergy

Synergy captures how much one feature benefits from the presence of another feature in our dataset. If two features are highly synergistic, both should stay in the dataset because they aid each other a lot.

The geometric interpretation is that we can create a synergy vector by projecting the $p_i$ vector onto the $p_{ij}$ vector, which translates to the equation:

$$s_{ij} = \frac{\langle p_i, p_{ij} \rangle}{\lVert p_{ij} \rVert^2} \, p_{ij}$$

Visually we get:

Synergy Vector. Image by the author, based on an image from [1]

We can then generate a synergy value between those features by calculating the (relative squared) length of that projection:

$$S_{ij} = \frac{\lVert s_{ij} \rVert^2}{\lVert p_i \rVert^2}$$

By defining this vector, we are essentially asking: given the predictive power of feature i, how much of it comes from the interaction with feature j? Once we answer that question, we also know how much of the predictive power does not come from that interaction.

With this in mind, we can define the autonomy vector between those two features as the vector subtraction:

$$a_{ij} = p_i - s_{ij}$$

Geometrically, we can see that the synergy (what comes from the interaction) and the autonomy (what does not) add up to the original feature vector:

Autonomy Vector. Image by the author, based on an image from [1]
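To build some intuition for these projections, here is a small numeric sketch on made-up three-dimensional "SHAP vectors" (the numbers are arbitrary and only illustrate the geometry):

import numpy as np

def project(u, v):
    """Orthogonal projection of vector u onto vector v."""
    return (np.inner(u, v) / np.linalg.norm(v) ** 2) * v

p_i  = np.array([2.0, 1.0, 0.0])   # made-up SHAP vector for feature i
p_ij = np.array([1.0, 0.0, 0.0])   # made-up SHAP interaction vector between i and j

s_ij = project(p_i, p_ij)          # synergy vector: the part of p_i explained by the interaction
a_ij = p_i - s_ij                  # autonomy vector: what is left of p_i
print(s_ij, a_ij)                  # [2. 0. 0.] [0. 1. 0.]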

Redundancy

Redundancy captures the amount of information from feature i that is replicated in feature j. Two perfectly redundant features would carry exactly the same information; one example is the temperature in Kelvin and in Celsius.

We can measure how much information is shared by projecting the autonomy vector from i to j onto the autonomy vector from j to i. Algebraically, this translates to:

$$r_{ij} = \frac{\langle a_{ij}, a_{ji} \rangle}{\lVert a_{ji} \rVert^2} \, a_{ji}$$

Visually, we have the following image. Notice that the aji vector lies in a different plane from the aij and pij vectors:

Redundancy vector. Image by the author, based on an image from [1]

And then, as we did with the synergy, we can generate a single value from this vector by calculating the length of the projection:

$$R_{ij} = \frac{\lVert r_{ij} \rVert^2}{\lVert p_i \rVert^2}$$

Independence

Finally, we define the last piece of information: the part of feature i's information that is neither synergistic nor redundant with feature j is the independence of feature i from feature j.

This can be calculated as the difference between the autonomy and the redundancy between the features:

$$i_{ij} = a_{ij} - r_{ij}$$

And, as we did for the other quantities, we can calculate a scalar value from the length of the projection of this vector on $p_i$:

$$I_{ij} = \frac{\lVert i_{ij} \rVert^2}{\lVert p_i \rVert^2}$$
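Continuing the toy sketch above with a made-up vector for feature j (and using the fact that SHAP interaction values are symmetric, so pji = pij), we can compute the remaining vectors and values. Note that, because every step is an orthogonal projection, the three scalar values of a pair add up to one:

p_j  = np.array([0.0, 1.0, 2.0])   # made-up SHAP vector for feature j
p_ji = p_ij                        # SHAP interaction values are symmetric

s_ji = project(p_j, p_ji)          # synergy of j with i
a_ji = p_j - s_ji                  # autonomy of j with respect to i

r_ij = project(a_ij, a_ji)         # redundancy vector
i_ij = a_ij - r_ij                 # independence vector

S_ij = np.linalg.norm(s_ij) ** 2 / np.linalg.norm(p_i) ** 2
R_ij = np.linalg.norm(r_ij) ** 2 / np.linalg.norm(p_i) ** 2
I_ij = np.linalg.norm(i_ij) ** 2 / np.linalg.norm(p_i) ** 2
print(S_ij, R_ij, I_ij, S_ij + R_ij + I_ij)   # ≈ 0.8, 0.04, 0.16, 1.0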

Coding the features

Now, let’s dive into some coding to see how we can use the SHAP Values results from the SHAP library to generate these values.

For this example, we will use the Wine dataset from the UCI repository, which is free to use, loading it through a function from the sklearn package. We will then fit a Random Forest classifier.

First, let’s import the required libraries:

import shap
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

Now, let’s fit the classifier on our dataset:

# Get the dataset and fit a Random Forest on it
X, y = load_wine(return_X_y=True, as_frame=True)
rf = RandomForestClassifier()
rf.fit(X, y)

Now we can use the SHAP library to generate the SHAP values:

# Runs the explainer on the model and the dataset to grab the Shap Values
explainer = shap.Explainer(rf)
shap_values = explainer(X)

The explainer returns an Explanation object whose values array has one slice per class for this multi-class model, so we select the SHAP values for the first class:

# The values array has shape (N, M, n_classes); keep the SHAP values of the first class
shap_values = shap_values.values[:, :, 0]

Now, let’s compute the SHAP interaction values, which give us the SHAP interaction vectors (again keeping only the first class):

# SHAP interaction values for the first class, shape (N, M, M)
shap_interaction_values = explainer.shap_interaction_values(X)[0]

We will now define some zero matrices to fill with our calculations. This is not the fastest approach, but it keeps the code didactic:

# Define matrices to be filled
# Vector matrices (M x M x N): one length-N vector per ordered pair of features
s = np.zeros((shap_values.shape[1], shap_values.shape[1], shap_values.shape[0]))
a = np.zeros((shap_values.shape[1], shap_values.shape[1], shap_values.shape[0]))
r = np.zeros((shap_values.shape[1], shap_values.shape[1], shap_values.shape[0]))
i_ = np.zeros((shap_values.shape[1], shap_values.shape[1], shap_values.shape[0]))
# Scalar matrices (M x M): synergy, redundancy, and independence values per pair
S = np.zeros((shap_values.shape[1], shap_values.shape[1]))
R = np.zeros((shap_values.shape[1], shap_values.shape[1]))
I = np.zeros((shap_values.shape[1], shap_values.shape[1]))

We defined one matrix for each of our vectors and one matrix for each of our scalar values. Now, let’s iterate over every pair of features (imagine a double for loop over i and j) and select the vectors we are going to use:

# Selects the p_i vector -> Shap Values vector for feature i
pi = shap_values[:, i]
# Selects pij -> SHAP interaction vector between features i and j
pij = shap_interaction_values[:, i, j]

# Other required vectors
pji = shap_interaction_values[:, j, i]
pj = shap_values[:, j]

With that in our hands, it is easy to calculate the following vectors just following the equations provided above:

# Synergy vector
s[i, j] = (np.inner(pi, pij) / np.linalg.norm(pij)**2) * pij
s[j, i] = (np.inner(pj, pji) / np.linalg.norm(pji)**2) * pji
# Autonomy vector
a[i,j] = pi - s[i, j]
a[j,i] = pj - s[j, i]
# Redundancy vector
r[i,j] = (np.inner(a[i, j], a[j, i]) / np.linalg.norm(a[j, i])**2) * a[j, i]
r[j,i] = (np.inner(a[j, i], a[i, j]) / np.linalg.norm(a[i, j])**2) * a[i, j]
# Independence vector
i_[i, j] = a[i, j] - r[i, j]
i_[j, i] = a[j, i] - r[j, i]

Then, using the length calculation equation, we get the final scalar values:

# Synergy value
S[i, j] = np.linalg.norm(s[i, j])**2 / np.linalg.norm(pi)**2
# Redundancy value
R[i, j] = np.linalg.norm(r[i, j])**2 / np.linalg.norm(pi)**2
# Independence value
I[i, j] = np.linalg.norm(i_[i, j])**2 / np.linalg.norm(pi)**2
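Putting the fragments above together, a complete (if unoptimized) version of the double loop could look like the sketch below; it simply fills the matrices we defined earlier and assumes the interaction and autonomy vectors are non-zero:

n_features = shap_values.shape[1]

for i in range(n_features):
    for j in range(n_features):
        if i == j:
            continue
        # SHAP vectors and interaction vectors for the pair (i, j)
        pi = shap_values[:, i]
        pj = shap_values[:, j]
        pij = shap_interaction_values[:, i, j]
        pji = shap_interaction_values[:, j, i]

        # Synergy and autonomy
        s[i, j] = (np.inner(pi, pij) / np.linalg.norm(pij) ** 2) * pij
        a[i, j] = pi - s[i, j]
        s[j, i] = (np.inner(pj, pji) / np.linalg.norm(pji) ** 2) * pji
        a[j, i] = pj - s[j, i]

        # Redundancy and independence
        r[i, j] = (np.inner(a[i, j], a[j, i]) / np.linalg.norm(a[j, i]) ** 2) * a[j, i]
        i_[i, j] = a[i, j] - r[i, j]

        # Scalar values for the pair (i, j)
        S[i, j] = np.linalg.norm(s[i, j]) ** 2 / np.linalg.norm(pi) ** 2
        R[i, j] = np.linalg.norm(r[i, j]) ** 2 / np.linalg.norm(pi) ** 2
        I[i, j] = np.linalg.norm(i_[i, j]) ** 2 / np.linalg.norm(pi) ** 2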

Also, there is an open-source implementation of these methods provided by the authors of the paper in the FACET library.

A Feature Selection model proposal

Now that we already have an understanding of how the method works, we can start to think about how we can use it to generate a feature selection methodology.

A proposal of a method could be:

  • Given a trained model on your dataset, grab the SHAP information from it
  • Run the synergy, redundancy, and independence calculations
  • If a pair of features has a redundancy greater than a threshold, mark the pair for removal
  • From that pair, remove the feature that has the least synergy with the rest of the dataset. For that, you could use the average synergy or another metric.

This is just a basic idea of how these values can be used to improve your feature selection methodology; a rough sketch of it follows below. I expect to develop more robust ideas on this subject in the future.
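As a rough, hedged sketch of this idea, assuming the S (synergy) and R (redundancy) matrices computed earlier and the DataFrame X, with an arbitrary threshold:

# Hypothetical cut-off; tune it for your own dataset
REDUNDANCY_THRESHOLD = 0.8

to_drop = set()
n_features = R.shape[0]
for i in range(n_features):
    for j in range(i + 1, n_features):
        if R[i, j] > REDUNDANCY_THRESHOLD and i not in to_drop and j not in to_drop:
            # Drop the feature of the pair with the lower average synergy
            # with respect to all other features
            mean_syn_i = np.mean(np.delete(S[i, :], i))
            mean_syn_j = np.mean(np.delete(S[j, :], j))
            to_drop.add(i if mean_syn_i < mean_syn_j else j)

selected_columns = [c for k, c in enumerate(X.columns) if k not in to_drop]
print(selected_columns)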

[1] Ittner et al., Feature Synergy, Redundancy, and Independence in Global Model Explanations using SHAP Vector Decomposition (2021), arXiv:2107.12436 [cs.LG]
