You Are Underutilizing SHAP Values — Feature Groups and Correlations

Your model is a lens into your data, and SHAP its telescope

Estevão Uyrá Pardillos Vieira
Towards Data Science


I have been working with SHAP values for some time and have had the opportunity to test many applications that I had never seen on the web, but that worked very nicely, so I wanted to share them with more people.

Since this post is meant to cover more advanced material, I won’t give any deeper introduction to what SHAP values are, beyond reiterating that they give local explanations, tied to specific data points, and that is where their power comes from. They are also among the most formally rigorous approaches to feature importance.

In this story, I will show two useful techniques to better understand your data: grouping your features and looking at correlations between their SHAP values. All code is here. Also take a look at this other post with more advanced analyses.

The datasets

I have used two distinct Kaggle datasets for these analyses to try to expose the potential of these techniques, but it is only a toy demonstration. You can only grasp how much they really help when you think of the huge, hundreds-of-variables datasets you get when practicing machine learning in industry.

Will it rain in Australia?
21 variables about the day before
Wind, humidity, temperature, etc.

Is the car accident fatal?
59 variables about the crash
Vehicle, location and accident information

I chose two datasets that I had never dealt with before, and I wanted to see whether any interesting insights would emerge from this model-based analysis. In any case, I will be sharing with you what I see in the plots.

Just to note: to have sensible SHAP values, the model should not be overfitting too much, or else the SHAP values may not be representative of the test-set target statistics.
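To make the setup concrete, here is a minimal sketch (not the exact notebook) of how the rain model and its SHAP values could be computed; the file name, target column and preprocessing below are assumptions for illustration only.

```python
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical path and target column for the "Rain in Australia" dataset.
df = pd.read_csv("weatherAUS.csv")
y = (df["RainTomorrow"] == "Yes").astype(int)
# Keep only numeric features to keep the sketch short; a full notebook
# would also encode the categorical ones.
X = df.drop(columns=["RainTomorrow"]).select_dtypes("number").fillna(-999)

# Hold out a test set: SHAP values are computed on data the model has not
# memorized, so they stay representative of the target statistics.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # (n_samples, n_features) for this model type
shap.summary_plot(shap_values, X_test)
```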

SHAP for feature groups

We usually have natural ways to group our features: for example, different data sources or different types of information. For time-window data, we can have features from different window sizes, and so on. In our case it is possible to look at each feature individually, but in tabular modelling nowadays it is very common to have many hundreds of variables, or even thousands in earlier modelling stages.

Sometimes people just look at the variables as a whole using the ready-made summary plot, and then sum up the raw SHAP values of the variables inside each group, or even their absolute values, across the whole dataset to come up with a single number for the “average importance of that group of features”. First, it is important that you sum the raw values per sample: correlated variables can push against each other, so the whole group can end up with close to zero net impact even though each variable on its own seems to have some impact.

Then you can look at the SHAP values for feature groups with the same sample-by-sample detail that you had for the features themselves (you lose the feature value as the color, because of course a group does not have a single value, but you can always pick some variable to show as the color if that is interesting in your case).
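As a sketch of the grouping itself, reusing shap_values and X_test from the snippet above; the group assignment below is only illustrative, not the exact grouping behind the plots.

```python
import pandas as pd
import shap

# Illustrative grouping of the rain features; the real grouping is whatever
# makes sense for your data sources or variable types.
groups = {
    "rain_humidity": ["Rainfall", "Humidity9am", "Humidity3pm"],
    "wind":          ["WindGustSpeed", "WindSpeed9am", "WindSpeed3pm"],
    "pressure":      ["Pressure9am", "Pressure3pm"],
    "temperature":   ["MinTemp", "MaxTemp", "Temp9am", "Temp3pm"],
}

shap_df = pd.DataFrame(shap_values, columns=X_test.columns, index=X_test.index)

# Sum the *raw* (signed) SHAP values inside each group, per sample, so that
# correlated features pushing in opposite directions can cancel out instead
# of inflating the group's apparent importance.
group_shap = pd.DataFrame({
    name: shap_df[[c for c in cols if c in shap_df.columns]].sum(axis=1)
    for name, cols in groups.items()
})

# Same beeswarm-style summary plot, now one dot per sample and one row per
# group; without a features matrix there is no single value to color by.
shap.summary_plot(group_shap.values, feature_names=list(group_shap.columns))
```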

The groups give different lenses through which to look at the data. It is very clear that 3pm is more important than 9am (and it makes sense, right? It is later in the day, so it is closer to the day we want to predict). Also, it is easy to see that variables related directly to rain or humidity are more important than those related to wind, pressure, and especially temperature.

Things get even more interesting when we deal with more variables. In the UK accidents dataset, there seems to be much more going on than we can take in at a first glance at the summary plot, and we can start questioning our data by grouping the features.

It makes a lot of sense that the features related to the accident itself are the most impactful for our question (“was the accident fatal?”). But there seem to be an awful lot of different things in the accident group, and it would be nice to have a more detailed view. I have separated the accident variables into three groups: before, during, and after. For example, Vehicle_Manoeuvre is before, 1st_Point_of_Impact is during, and Did_Police_Officer_Attend_Scene_of_Accident is after.
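In code, this is just another grouping dictionary fed to the same per-sample sum as above; only the three columns named in the text are certain, the rest of the assignment is left open.

```python
# Sub-groups of the accident features; extend each list with the remaining
# accident columns according to whether they are known before, during or
# after the crash.
accident_groups = {
    "accident_before": ["Vehicle_Manoeuvre"],
    "accident_during": ["1st_Point_of_Impact"],
    "accident_after":  ["Did_Police_Officer_Attend_Scene_of_Accident"],
}
```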

It makes a lot of sense that the accidents_after variables carry so much information, since the police are probably only going to attend the scene if the accident is really bad. But there is also information available before the accident, such as which kind of manoeuvre the driver was performing or whether the vehicle was leaving the carriageway, and this is potentially predictive stuff that could help prevent accidents in the future (“don’t manoeuvre here”, “reduce speed when leaving the carriageway”). This has to be done with caution, since it is not a causal model, but I do not want to go too deep into the data itself; my purpose here is just to give some ideas for different SHAP analyses. You can do this for specific insights or, like I am doing here, just as a model-based exploration of the dataset.

SHAP correlations

If you are looking into advanced SHAP analysis, you probably already know a lot about correlations, and you know how thankless it can be to analyze correlations when the features have completely different distributions, or worse, to compare categorical and numerical features. SHAP correlation analysis has the very useful property of being undisturbed by such differences: whether the feature is categorical, ordinal or continuous, its SHAP value is continuous. You can get some insight into the distributions of the features themselves from the SHAP correlation plot, but you also see how their impacts are related in the model. You can use this inspection to drive further analysis, such as dependence plots between specific pairs of variables with unexpected relationships.

Here I am plotting the absolute value of the Spearman correlation. You may also be interested in the signed correlation itself, but sometimes you just want to know that the variables are related, and then the absolute correlation gives a better overall visualization.
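A minimal sketch of this correlation analysis, assuming shap_df (the per-sample SHAP values as a DataFrame) built as in the grouping snippet above:

```python
import seaborn as sns

# Spearman correlation between the SHAP values of every pair of features;
# the absolute value is taken because here we mostly care whether two
# features' impacts are related at all, not in which direction.
shap_corr = shap_df.corr(method="spearman").abs()

# Cluster rows and columns so that related features end up next to each other.
sns.clustermap(shap_corr, cmap="viridis", figsize=(10, 10))
```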

This kind of analysis can also help you with feature selection, since you can better understand the relationships between variables, which can be synergistic or redundant. If variables are redundant, they may be splitting the importance between them, and removing one of them will greatly increase the importance of the other. But I will write a post specifically about feature selection later on, so let’s finish with one last example to tickle your brain about the potential and the possibilities.

The number of variables in the car accidents dataset makes it hard to focus here. You can still spot some obvious correlations, such as Junction Detail, Special Conditions and Junction Location. I’m not yet sure what the best visualizations are for large feature sets in cases like this, but I have prepared three other visualizations that can be interesting, if only to give a sense of the different lenses we can get on our data.

Some of them are kind of obvious, such as age_band and age; some make sense, such as skidding_and_overturning and speed_limit (left). But what about the relationship between vehicle_leaving_carriageway and police_officer_attendance (right)? Two correlations related to Bus_or_Coach_Passenger were interesting to see: with Age_Band_of_Driver (middle) and with Vehicle_Manoeuvre (left). Think a little about why these correlations could exist! Buses do not perform a lot of manoeuvres, and their drivers are less varied than the overall driver population. We could dive deeper to understand what is really happening in our model if that were of interest to us.

Clustermaps of absolute correlation for some selected variables (selection in title).
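To dig into one of these unexpected pairs, a dependence plot is the natural next step. A sketch, assuming shap_values and X_test were computed for the accidents model in the same way as for the rain model above (exact column spellings may differ in your copy of the data):

```python
import shap

# SHAP dependence plot for one feature, colored by a second feature we
# suspect it interacts with.
shap.dependence_plot(
    "Vehicle_Leaving_Carriageway",
    shap_values,   # SHAP matrix for the accidents model
    X_test,        # matching feature matrix
    interaction_index="Did_Police_Officer_Attend_Scene_of_Accident",
)
```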

Final remarks

All the code is available in this Kaggle notebook, where I load the datasets and apply the analyses very straightforwardly. After this post I also wrote another one with more advanced analyses; if you liked this one, take a look there as well.

I hope I was able to give you some insights, and maybe help you with some specific analysis in your day-to-day work that you didn’t know how to do, or else that these ideas can help you explore and understand either your model or your dataset, remembering that SHAP can help us with both.
