Tips and Tricks
Quantmetry – MAIF

Introduction
Following a golden age of performance in AI with the advent of ensemble models and deep learning, the Machine Learning community is entering the age of reliable AI. In that context, there is a rising interest in algorithm explainability as more and more tools are being developed to "open the black box" [1]. Among those tools, one seems to reign supreme: SHAP (SHapley Additive exPlanations). SHAP is a game theoretic approach to explain the output of any machine learning model using an efficient computation of Shapley Values [2]. In a nutshell, Shapley Values estimate the contribution of each variable to a model prediction. If you’ve ever tried to explain a black box model, I am betting that you have used SHAP…
Despite its popularity, SHAP has some limitations [3]:
- A poor handling of one-hot encoded categorical features that leads to erroneous Shapley values: the contribution of the feature is not the sum of the contributions of the one-hot columns!
- A biased (or even wrong) approximation of Shapley values when features are correlated.
When trying to understand how a model makes predictions, it is important to use reliable methods. In that regard, the ACV library (Active Coalition of Variables) offers a new way to compute Shapley values that addresses those issues, thanks to:
- A rigorous way of computing the contribution of a coalition of features, solving the one-hot encoding problem.
- A robust way to compute Shapley values when features are correlated.
To highlight the difference between the two libraries, we conducted a comparative study on the Telco Churn dataset, already used in the explainable AI literature [4].
Outline of the study
The differences between SHAP and ACV are expected to be significant when a model is trained on a dataset with a high proportion of one-hot encoded categorical features and correlated features. With 10 out of its 18 features being categorical and a high level of correlation overall, the Telco Churn data checks both of those boxes.
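As a quick sanity check, both properties can be verified directly on the raw data. Here is a minimal sketch, assuming the public Telco Churn CSV with its usual column names (the file name telco_churn.csv is hypothetical):

```python
import pandas as pd

df = pd.read_csv("telco_churn.csv")

# TotalCharges is often parsed as a string in the public CSV and needs conversion.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# How many categorical vs. numerical features are there (excluding the target)?
print(df.drop(columns=["Churn"]).dtypes.value_counts())

# Correlation between the two numerical variables discussed later in the article.
print(df["MonthlyCharges"].corr(df["TotalCharges"]))
```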
Now that we have this covered, we want to train a black box model and answer the question: "Does ACV provide better explanations of my model than SHAP?". This is by definition a tricky question because it is impossible to know and understand how the black box works (that is why we need SHAP in the first place!). So how can we compare the quality of two explanations?
In this study, we suggest a simple approach to answer that question: if we cannot look inside the black box, we can try to control the generative relationship of our data in the first place. If we know the "real" generative function y = g(X) of our dataset (X, y) and we assume that our black box model yᵖʳᵉᵈ = f(X) is reasonably faithful to g, then a good way to gauge the quality of an explanation of f is to verify how well it explains the initial function g.
In order to have access to the generative function of our data, we trained an interpretable additive model g(X) = g₁(X₁) + g₂(X₂) + ⋯ + gₚ(Xₚ) on our dataset and replaced the real y with the predictions of the model. For this step, we used the Explainable Boosting Machine (EBM) implemented in the InterpretML library. Because of its additive structure, this model is easy to interpret, and we can directly access the functions that link the predictions to each of the variables. In other words, thanks to EBM, we can display the graphic representation of the contribution gᵢ(Xᵢ) of each feature to the prediction. For instance, the "real" contribution of the variable MonthlyCharges is represented in the following graph:

To recap, here is a brief overview of the protocol that we used (a code sketch follows the list):
- Training EBM (Explainable Boosting Machine) on the training set (Xᵗʳ,yᵗʳ) to obtain g.
- Replacing yᵗʳ with the prediction yᵉᵇᵐ = g(Xᵗʳ) + ϵ (some noise ϵ was added to mimic real conditions).
- Training a traditional black box gradient boosting model on (Xᵗʳ, yᵉᵇᵐ) (we use scikit-learn’s GradientBoostingClassifier) to obtain f. Our model performs exceptionally well on this semi-synthetic dataset: ROC AUC = 0.94.
- Explaining the gradient boosting model f with SHAP and ACV.
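For readers who want to follow along, here is a minimal sketch of that protocol in Python. The variable names (X_train, X_test, y_train), the noise level and the way ϵ is injected are our own assumptions; the exact setup of the study may differ, and X_train is assumed to be already one-hot encoded.

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# 1. Train the interpretable additive model g on the (one-hot encoded) training set.
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# 2. Replace the real labels with the EBM predictions; here the noise ϵ is added
#    to the predicted probabilities before thresholding (one possible choice).
proba_train = ebm.predict_proba(X_train)[:, 1]
y_ebm_train = (proba_train + rng.normal(scale=0.05, size=len(proba_train)) > 0.5).astype(int)

# 3. Train the black box f on the semi-synthetic labels.
gbm = GradientBoostingClassifier().fit(X_train, y_ebm_train)

# 4. Sanity check: f should closely reproduce g on held-out data.
y_ebm_test = (ebm.predict_proba(X_test)[:, 1] > 0.5).astype(int)
print("ROC AUC:", roc_auc_score(y_ebm_test, gbm.predict_proba(X_test)[:, 1]))
```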
Shapley values with SHAP and ACV
After training the model, we computed two different sets of Shapley Values:
- Using the Tree Explainer algorithm from SHAP, setting the feature_perturbation to "tree_path_dependent" which is supposed to handle the correlation between variables.
- Using the ACV Tree algorithm from ACV, which claims to do a much better job than SHAP at handling correlated features (see the code sketch below).
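Here is a minimal sketch of both computations, reusing the gbm model from the protocol sketch above. The SHAP call is standard; the ACV call follows the examples in the library’s README, but the exact signature (in particular the C argument, which lists coalitions of columns to be treated jointly) may vary between versions, so please check the ACV GitHub.

```python
import shap
from acv_explainers import ACVTree

# SHAP: "tree_path_dependent" uses the cover statistics stored in the trees
# instead of an interventional background dataset.
explainer = shap.TreeExplainer(gbm, feature_perturbation="tree_path_dependent")
shap_values = explainer.shap_values(X_test)

# ACV: build the tree explainer from the model and background data, then
# compute Shapley values (C=[[]] means no forced coalition of columns).
acvtree = ACVTree(gbm, X_train.values)
acv_values = acvtree.shap_values(X_test.values, C=[[]])
```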
Before verifying that claim in detail, we can compare the average of the Shapley values on the test set (Figure 2) to see if SHAP and ACV agree on the global importance of the variables:

Both libraries seem to generally agree on which features are the most important. We can nevertheless observe a few differences, especially for the importance of MonthlyCharges and TotalCharges, which are two strongly correlated numerical variables (Pearson correlation coefficient = 0.65). To display the differences in a more rigorous way, we can compute the relative L1 distance between the distributions of the contributions for each feature:

MonthlyCharges and TotalCharges have the biggest differences in their distributions, followed by Contract, StreamingMovies and TechSupport, which are three one-hot encoded categorical variables.
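For reference, here is one plausible way to compute such a relative L1 distance, assuming shap_values and acv_values are (n_samples, n_features) arrays from the previous step; the exact normalisation used for the figure above may differ.

```python
import numpy as np
import pandas as pd

def relative_l1_distance(shap_values, acv_values, feature_names):
    # Mean absolute difference per feature, scaled by the mean absolute
    # SHAP contribution of that feature (one possible normalisation).
    diff = np.abs(np.asarray(shap_values) - np.asarray(acv_values)).mean(axis=0)
    scale = np.abs(np.asarray(shap_values)).mean(axis=0)
    return pd.Series(diff / scale, index=feature_names).sort_values(ascending=False)

print(relative_l1_distance(shap_values, acv_values, X_test.columns))
```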
Local explanation differences between SHAP and ACV
Now that we know that certain features are explained differently, let’s see what this means for the actual explanation of our predictions. In order to measure the "difference" between explanations, we can compute Kendall’s rank correlation coefficient between the SHAP and ACV contributions for each observation. To study the predictions with the "most different" explanations, we can look at the ones with the lowest coefficients. Let’s look at the difference between SHAP and ACV for one of these predictions:

- MonthlyCharges is the most important positive contribution for SHAP (0.269) but contributes negatively for ACV (-0.186).
- tenure is the most important contribution for ACV (-0.479) but only the 10th most important for SHAP (-0.048).
- Contract is the second most important contribution for SHAP (-0.435) but only the 10th most important for ACV (-0.054).
We can see that for some individuals, the explanation given by ACV is very different from the one provided by SHAP. In a context where the model is used for decision making that could impact an individual, this can have some real-life consequences. For instance, if you want to change the prediction of the model, knowing if a variable like MonthlyCharges contributes positively or negatively can lead to opposite decisions. Similarly, knowing if a variable like tenure is the most important one for the model or not at all can drastically change the interpretation and the potential decision that follows.
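As described above, the most divergent explanations were found by ranking observations by their Kendall’s tau. Here is a minimal sketch of that selection, assuming shap_values and acv_values are per-observation contribution arrays:

```python
import numpy as np
from scipy.stats import kendalltau

# Kendall's tau between the SHAP and ACV contribution vectors of each
# observation; low values flag the most divergent local explanations.
taus = np.array([kendalltau(shap_values[i], acv_values[i])[0]
                 for i in range(len(shap_values))])
most_different = np.argsort(taus)[:10]  # indices of the 10 most divergent predictions
```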
Focus on MonthlyCharges and TotalCharges
The differences in explanations for the features MonthlyCharges and TotalCharges are especially interesting due to their high level of correlation. ACV claims to provide better and more stable explanations than SHAP when features are correlated. Let’s compare the distribution of the computed contributions (Shapley values) of MonthlyCharges to the generative graph of our data provided by the Explainable Boosting Machine:
Even if they both broadly follow the trend of the real contributions, the graphs for both SHAP and ACV seem to make some errors in the sign of the contributions for values of MonthlyCharges within [20, 40] and [100, 120] (if we assume that our model broadly respects the real contributions). This could lead to an erroneous interpretation of the behavior of our model, which could be formulated like this:
"MonthlyCharges contribution to our prediction generally increases with its value, except for the interval [20, 40] when it contributes positively and for the interval [100, 120] when its contribution decreases rapidly"
When the actual explanation should be something like:
"MonthlyCharges contributes negatively when its value is inferior to 70, at which point its contribution starts increasing"
Even if both methods make errors, we do observe that ACV is a lot more stable and faithful to the generative pattern: very few contributions are positive at the beginning of the graph, and the "decrease" at the end is much less pronounced. Point for ACV!
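This kind of dependence-style comparison can be reproduced with a simple scatter plot of the contributions against the feature values. A hedged sketch, reusing shap_values and acv_values from the earlier steps:

```python
import matplotlib.pyplot as plt

# One point per test observation: feature value on x, estimated contribution on y.
j = list(X_test.columns).index("MonthlyCharges")
fig, ax = plt.subplots()
ax.scatter(X_test["MonthlyCharges"], shap_values[:, j], s=8, alpha=0.5, label="SHAP")
ax.scatter(X_test["MonthlyCharges"], acv_values[:, j], s=8, alpha=0.5, label="ACV")
ax.axhline(0, color="grey", linewidth=0.8)
ax.set_xlabel("MonthlyCharges")
ax.set_ylabel("Estimated contribution to the prediction")
ax.legend()
plt.show()
```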
Now let’s take a look at TotalCharges:
For this feature, the difference between SHAP and ACV is striking. For values of TotalCharges above 6000, the SHAP contribution starts increasing sharply and reaches positive values up to 2, while the generative function remains constant and negative. Due to the complexity of the computation, it is really hard to know why SHAP behaves this way, but we can say that it provides erroneous explanations for a significant part of the dataset. The ACV contributions, on the other hand, remain stable and coherent with the generative function. Point for ACV!
To conclude this comparison, let’s look at the explanations given by SHAP and ACV for a black box model trained directly on the raw dataset, in other words, what a data scientist would do in a standard machine learning pipeline. Here are the contributions for TotalCharges and MonthlyCharges in that case:

Interesting! We see that for both features, ACV seems to be a lot more faithful to the functions learned by EBM (presented in Figures 5 and 6) than SHAP. Due to the properties of the black box, we cannot draw a rigorous conclusion from this result. Nevertheless, it seems to indicate that the Shapley values computed with ACV provide a more robust explanation of the model. I think this still deserves another point for ACV!
Active coalition and Active Shapley Values
Shapley values are often used to find the most important features for a model. The selection is made after observing the explanation, and the number of variables to retain is often arbitrary. To address this issue, ACV also introduces a new tool called Active Shapley Values [5], which makes the selection for you by computing a "null coalition" of variables whose contribution is deemed negligible. It provides a more concise explanation by attributing a null contribution to the variables in the "null coalition" and fairly distributing the "payout" among the variables of the active coalition.
The method used to compute the active and null coalitions relies on the concept of the Same Decision Probability, which will not be covered in this article. If you want to know more about these fancy methods, please visit the ACV GitHub or read Amoukou et al. 2021 [5].
The Active Shapley Values computed for client 233 are shown in the figure below:
Only four variables amongst the most important ones were retained, and we can see that the contributions of Contract and PaymentMethods have significantly increased. This is a much more interpretable way to compute Shapley values if your objective is to find the most important variables. In our case, we directly see that PaymentMethods, Contract, MonthlyCharges and tenure are the most important variables for this prediction.
Conclusion
In this article, we have shown that ACV can provide a much more reliable way to compute Shapley values, leading to more truthful explanations when features are correlated or categorical. For "simpler" scenarios, the difference between the two libraries is not always so important, but ACV remains a more robust way to provide Shapley-value-based explanations. It is important to note that ACV is not completely model agnostic and only works for tree-based models (Random Forest, XGBoost, LightGBM…). Finally, ACV provides other tools like the Same Decision Probability or Active Shapley Values [5] to rigorously explain models. Visit the GitHub page to learn how to use it, and to discover the papers of Salim Amoukou and his co-authors.
All the Shapley value graphs in this article were obtained with Shapash, a wonderful, user-friendly package to visualize Shapley values (developed by MAIF with contributions from Quantmetry). Shapash can use SHAP or ACV as a backend for computing the Shapley values.
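For completeness, here is a minimal Shapash sketch. The constructor and compile arguments have changed between Shapash versions, and the availability and name of the ACV backend depend on the version, so check the Shapash documentation for your installation.

```python
from shapash import SmartExplainer  # in Shapash 1.x: from shapash.explainer.smart_explainer import SmartExplainer

xpl = SmartExplainer(model=gbm)      # recent versions take the model in the constructor
xpl.compile(x=X_test)                # computes the contributions (SHAP by default)
xpl.plot.features_importance()       # global importance plot
xpl.plot.local_plot(index=233)       # local explanation for one client, e.g. client 233
```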
Similar analyses were performed during a collaborative research project between MAIF and Quantmetry on real datasets and internal models, and similar differences were observed. We both think that trustworthy AI methods must be grounded in transparent and testable computational techniques, and we recommend being critical when using explainability tools that are black boxes themselves! The experiment can be reproduced with a notebook available on the ACV GitHub.
Affiliation
I work at Quantmetry. Pioneer and independent since its creation in 2011, Quantmetry is the leading French pure-player artificial intelligence consultancy. Driven by the desire to offer superior data governance and state-of-the-art artificial intelligence solutions, Quantmetry’s 120 employees and researcher-consultants put their passion at the service of companies in all sectors, for high-impact business results.
Contributors:
Cyril Lemaire, Thibaud Real, Salim Amoukou, Nicolas Brunel
References
[1] Christoph Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2021)
[2] Lundberg and Lee, A Unified Approach to Interpreting Model Predictions (2017), NIPS 2017
[3] Amoukou et al., The Shapley Value of coalition of variables provides better explanations (2021)
[4] Nori et al., InterpretML: A Unified Framework for Machine Learning Interpretability (2019)
[5] Amoukou et al., Accurate and robust Shapley Values for explaining predictions and focusing on local important variables (2021)