Machine Learning has risen sharply over the last decades, taking over new industries and applications. Its increasing complexity has turned models into black boxes, making the processes between input and output more opaque. In turn, and driven by other factors as well, the need to explain Machine Learning decision-making processes has become crucial. Explainability (XAI) provides a better understanding of how models work, increases confidence in them, and leads to better decisions.
Explainability must be an integral part of modeling for a Data Scientist. Suppose we were to develop a credit scoring model (deciding whether or not to grant someone a loan); explainability could provide many insights:
- verify that expected features (salary, debt ratio…) have a significant impact, or conversely understand why unexpected ones are over-represented
- focus on features with great predictive power and improve feature engineering to boost performance
- provide recommendations that could be used by the client advisor to explain the reasons for the denial / approval
- share a model and results with Data Scientists and non-Data users
- etc…
XAI has become a strong research topic in recent years, and multiple explainability methods have emerged. However, these methods are not yet completely satisfactory, and their results should be interpreted carefully.
Therefore, at Société Générale, we have developed metrics to help evaluate the quality of explanations. Our contributions are now available in version 1.6.1 of the Shapash library.

Shapash is an Open Source Python library developed by MAIF about explainability. It relies on existing methods (such as SHAP and LIME) and proposes simple visualizations and an interface that allows navigation between global and local explainability.
For more details on the implementation of these explanation-quality metrics, a Jupyter notebook tutorial is available on the Shapash GitHub.
Local explainability methods
Local explainability comes in many forms. Among the most popular are the weight-based methods. These assign each feature a weight proportional to its contribution to the prediction for a given sample. For example:

Basically, you transform your initial problem (where you might have used any model to get to the prediction) into a linear one, where the weights add up (more or less) to the model output.
By looking at the features that capture the biggest weights, you can assess statements like: "Your loan application was rejected because of your income (70% of the decision) and your young age (30%)".
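To make the additive idea concrete, here is a tiny, purely illustrative sketch (the numbers are made up and not produced by any real model):

# Hypothetical additive decomposition of a single loan prediction
base_value = 0.50                                # model's average output (made-up)
contributions = {"income": 0.28, "age": 0.12}    # made-up local weights

prediction = base_value + sum(contributions.values())
print(prediction)  # 0.90: income accounts for ~70% of the shift, age for ~30%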
SHAP and LIME are two of the most used weight-based methods; a quick recap is presented below.
LIME
The idea behind LIME is the following: we locally approximate the black-box model with a surrogate (a linear regression) and interpret its coefficients as the contributions of the features.

For a more detailed explanation, you can check the original paper [1].
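As an illustration, here is a minimal sketch of LIME on tabular data using the lime package, assuming a trained scikit-learn classifier clf and a training DataFrame Xtrain (the same names used in the Shapash snippets below):

from lime.lime_tabular import LimeTabularExplainer

# Build the explainer on the training data
explainer = LimeTabularExplainer(Xtrain.values,
                                 feature_names=Xtrain.columns.tolist(),
                                 mode="classification")

# Fit a local linear surrogate around one instance and read its coefficients
exp = explainer.explain_instance(Xtrain.iloc[0].values, clf.predict_proba, num_features=5)
print(exp.as_list())  # list of (feature condition, weight) pairs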
SHAP
SHAP (SHapley Additive exPlanations) is based on Shapley values, a concept coming from Game Theory. SHAP uses the idea that the outcome of each possible combination of features should be considered to determine the importance of a single feature.
Shapley values are calculated by looking at each combination of features and seeing how the model output changes.

Ideally, you would need to train a different model for each subset of features (2ⁿ models for n features), which is of course intractable.
Instead, SHAP proposes approximations to make Shapley values computable. More details are available in the paper [2].
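As an illustration, here is a minimal sketch with the shap package, again assuming a trained tree-based classifier clf and a DataFrame Xtrain:

import shap

# TreeExplainer implements an efficient algorithm dedicated to tree ensembles
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(Xtrain)  # one weight per feature and per instance

# Together with the expected value, these weights add up to the model output
print(explainer.expected_value)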
The need to develop metrics to assess quality
While existing methods may bring interesting insights and help in certain scenarios, they are unfortunately not perfect and suffer from limitations.
Among them, and without going into details, SHAP values cannot be calculated accurately when correlated variables are involved: either the calculations rely on artificially created instances that would never appear in real life, or weight can be assigned to features that are not actually used by the model (because they are correlated with features that are).
For more info on those limitations, you can check this article [3].
In addition, explainability methods do not always agree with each other: a feature might have a strong contribution according to one method and a weak one according to another, or even contributions of opposite signs, which makes obtaining a trustworthy explanation harder.

Thus, we need tools to estimate when it is relevant to trust explanations in a specific case, and conversely, when we should consider them as mere insights.
We have developed three metrics, which will be illustrated through an example on the Titanic dataset (a classification problem: predicting whether a passenger survived the sinking based on some of their characteristics).
We will suppose that the model has already been trained, and we now want to understand its inner workings.
The three metrics are:
- Consistency
- Stability
- Compacity
Consistency metric
As mentioned above, several weight-based explainability methods exist to date. These methods can differ from each other on several points: the strength of their theoretical foundations, their starting assumptions, their varying levels of maturity… in short, many factors that can influence the values of the weights associated with each variable.
When we compare explainability methods on the same instance, we often get explanations that are not very similar (or even radically opposed in some cases). Unfortunately, it is difficult to determine the right method to choose (if there is one).
While a difference between methods can be understood because their assumptions differ, it is not satisfactory from a business perspective. If we want to understand why a client has not obtained a loan, we cannot answer: "it depends on the assumptions, it is either because of your income or because of your age".
In order to highlight and quantify these differences between methods, we have developed a metric called Consistency. This metric answers the following question: do different explainability methods give, on average, similar explanations?
We can thus distinguish situations with small differences, which increase confidence in the explanations provided by the methods, from those situations with strong disparities, in which case we will need to carefully interpret the explanations.
Graph description
The output comes as two graphs:


The 1st graph displays the average distance between the explanations provided by the different methods. It is a 2D representation which gives an overview of the similarities between the methods. Here for example, KernelSHAP and SamplingSHAP seem to give on average closer explanations than TreeSHAP.
However, distances are not easy to interpret on their own. For example, is 0.33 good or not? To help give them meaning, the second graph serves as a support to illustrate the values displayed in the first one.
In this second graph, real instances are selected from the dataset (they can be retrieved thanks to the Id at the top of each plot) to give a better sense of the displayed values. Here, for example, a distance of 0.14 corresponds to very similar contributions, which is somewhat less true for a distance of 0.33.
Code
After declaring the object, the compile method below calculates the contributions on a given dataset using the default methods supported by Shapash (namely SHAP, LIME and ACV).
from shapash.explainer.consistency import Consistency
cns = Consistency()
cns.compile(x=Xtrain, model=clf)
cns.consistency_plot()
You can also use your own contributions, not computed by Shapash. Make sure they are all in the same Pandas DataFrame format (same columns, same indices) and gathered in a dictionary.
contributions = {"treeSHAP": treeSHAP,
                 "samplingSHAP": samplingSHAP,
                 "kernelSHAP": kernelSHAP}
cns.compile(contributions=contributions)
cns.consistency_plot()
Technical details
Mathematically speaking, the distances have been defined as follows. For two explainability methods M₁ and M₂, the average distance between the explanations is:

$$ d(M_1, M_2) = \frac{1}{N} \sum_{i=1}^{N} \left\| w_1^{(i)} - w_2^{(i)} \right\|_2 $$

where N is the number of instances considered in the calculation, and w₁ and w₂ are the vectors created by the contributions of all the features, normalized with the L2 norm.
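As an illustration, here is a minimal sketch of this computation (not the exact Shapash implementation, just the formula above applied to two contribution DataFrames of identical shape, such as treeSHAP and kernelSHAP from the snippet above):

import numpy as np

def average_distance(contrib_1, contrib_2):
    # L2-normalize each instance's contribution vector,
    # then average the L2 distances over the N instances
    w1 = contrib_1.values / np.linalg.norm(contrib_1.values, axis=1, keepdims=True)
    w2 = contrib_2.values / np.linalg.norm(contrib_2.values, axis=1, keepdims=True)
    return np.linalg.norm(w1 - w2, axis=1).mean()

print(average_distance(treeSHAP, kernelSHAP))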
Stability metric
Another way to increase confidence in an explainability method is to study its stability. Intuitively, if we were to choose similar instances (we'll see how to define this similarity below), we would expect explanations to be similar as well. In other words, for similar instances, are the explanations similar? This is the question this second metric, Stability, answers.
The notion of similarity is based on two factors:
- the instances must be close in the feature space
- model predictions must be close
Indeed, a similarity based only on the features' values would not be enough to expect the same explanations. For example, on both sides of a model's decision boundary, the feature values may be very similar, yet the predictions, and thus the explanations, will differ. And that would make perfect sense.
The question of stability is important because instability at this level would undermine confidence in the explanations.
Stability is generally not addressed among the theoretical assumptions on which current explainability methods (e.g. LIME and SHAP) rely, so it does not follow automatically. It is therefore important to have a metric to evaluate this aspect.
Graph description
Several graphs can be displayed, depending on whether one wishes to look at the stability on a full dataset (globally) or on a particular instance (locally).

This graph illustrates the stability on a set of instances. Each point on the graph represents a feature (as this is a Plotly graph, you have to hover over a point to see the feature's name).
- The Y axis displays the average importance of each feature over all the considered instances. The higher you go, the more important the feature.
- The X axis displays the average stability in the neighborhood of each considered instance. The further to the right, the more unstable the feature.
Thus, we see that the two features at the top left are important and relatively stable on average. If we wanted to zoom in a bit more, and look at the distribution of stability rather than its mean, we would get this graph:

We find the same two features with the greatest importance ("Sex" and "Ticket class"), but we now see some differences: while their stability was close on average, their distributions vary; "Sex" remains very stable across all instances, whereas "Ticket class" shows much higher variability and depends on the studied instance.
A third vision, this time local, allows you to study stability on a specific instance:

Here, for example, "Sex" is stable in its neighborhood, which allows the contribution to be interpreted with a certain confidence, while "Port of embarkation" displays real contradictions.
Code
from shapash.explainer.smart_explainer import SmartExplainer
xpl = SmartExplainer()
xpl.compile(x=Xtrain, model=clf)
xpl.plot.stability_plot()
By default, the chosen explainability method is SHAP.
Technical details
Two steps are required to calculate the stability metric:
- Choose the right neighborhood
- Calculate stability itself
Regarding the neighborhood of a given instance, it is selected as follows:
- We normalize the data to have a unit variance (getting features on the same scale)
- We choose the top N nearest neighbors via the L1 distance (by default 10, which is large enough to allow subsequent filtering criteria)
- We reject those neighbors whose model output is too different from the instance's ("too different" is defined below)
- We reject neighbors whose distance to the instance is greater than a certain threshold in order to remove outliers (the threshold is defined from the distance distributions)
The maximum allowed difference between the model outputs is defined as follows:
- For regression:

- For classification:

Once the neighbors have been selected, the calculation of the stability metric itself is done by considering the instance and all its neighbors and calculating the ratio between:
- the standard deviation of the normalized vectors (L1 norm) created by the contributions of the features.
- the average of the absolute values of the normalized vectors (L1 norm) created by the contributions of the features.
This ratio, unitless, is known as the Coefficient of Variation (CV).
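As an illustration, here is a simplified sketch of these two steps for a single instance (not the exact Shapash implementation: in particular, the filtering of neighbors on model outputs and outlying distances is omitted):

import numpy as np
from sklearn.preprocessing import StandardScaler

def local_stability(instance_idx, X, contributions, n_neighbors=10):
    # Step 1: standardize the features and keep the nearest neighbors
    # according to the L1 distance
    X_std = StandardScaler().fit_transform(X)
    dist = np.abs(X_std - X_std[instance_idx]).sum(axis=1)
    neighbors = np.argsort(dist)[:n_neighbors + 1]  # includes the instance itself

    # Step 2: coefficient of variation (std / mean of absolute values) of the
    # L1-normalized contributions over the neighborhood, one value per feature
    w = contributions.values[neighbors]
    w = w / np.abs(w).sum(axis=1, keepdims=True)
    return w.std(axis=0) / np.abs(w).mean(axis=0)

Averaging these per-feature values over many instances gives the kind of quantities plotted on the X axis of the first stability graph.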
Compacity metric
For explainability to be interpretable, it must be easily understandable by humans. Even if the model used is simple, this does not guarantee that its explainability will be just as simple: a simple linear regression relying on 50 features is already hard to interpret; the number of features greatly affects explainability.
A simple explanation is ideally based on a small number of features, and it is therefore necessary to choose an adequate number of them. However, the selected features must have enough predictive power to approximate the model sufficiently well. A trade-off must therefore be considered.
Unfortunately, this trade-off is not always achievable. If the model indeed uses a large number of features to make its decisions, it will be impossible to extract a satisfactory subset.
The Compacity metric developed here determines whether we are in a situation where it will generally be easy to obtain an explanation based on few features, or whether, conversely, this will not be possible. To do this, the output of the model using all the features is compared to the top N contributions. Finally, statistics are displayed.
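As an illustration, here is a simplified sketch of the underlying idea (not the exact Shapash implementation): for each instance, keep the nb_features largest contributions in absolute value and check what share of the full explanation they represent. The contributions argument is assumed to be a Pandas DataFrame of local contributions, such as the treeSHAP DataFrame used earlier.

import numpy as np

def top_n_share(contributions, nb_features=5):
    # For each instance: sum of the nb_features largest |contributions|,
    # divided by the sum of all |contributions|
    abs_contrib = np.abs(contributions.values)
    top_n = np.sort(abs_contrib, axis=1)[:, -nb_features:].sum(axis=1)
    return top_n / abs_contrib.sum(axis=1)

# Share of instances where 5 features account for at least 90% of the explanation
print((top_n_share(treeSHAP, nb_features=5) >= 0.9).mean())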
Graph description
The graphs produced below provide statistics globally (on a set of points), and not on a specific instance.
There are two parameters that influence the results:
- the desired level of approximation to the model that comprises all the features
- the number of features selected in the explanation

By fixing the desired level of approximation and varying the number of features, we obtain the left graph; conversely, by fixing the number of features and varying the approximation reached, we obtain the graph on the right.
As we can see in this example (left graph), selecting 4 features would approximate the full model at 90% in X% of cases. Conversely (right graph), selecting 5 features would reach a 90% approximation in X% of cases.
Code
from shapash.explainer.smart_explainer import SmartExplainer
xpl = SmartExplainer()
xpl.compile(x=Xtrain, model=clf)
xpl.plot.compacity_plot(distance=0.9, nb_features=5)
By default, the chosen explainability method is SHAP.
Technical details
Here, only the distances between the outputs are used to calculate the metric. The distance used is the same as the one illustrated above:
- For regression:

- For classification:

Conclusion
The article presented three metrics to evaluate the quality of the explanations provided by existing explainability methods. They answer the following questions:
- Do different explainability methods give similar explanations on average? (Consistency)
- Are the explanations similar for similar instances? (Local stability)
- Do a few features drive the model? (Compacity)
While explainability methods help us and provide interesting insights that can be used in certain cases, their interpretation must be done with caution, as the metrics have pointed out.
By developing these metrics, we make it possible to distinguish situations where we can trust an explanation from more complex ones that require closer attention.
The metrics are available in version 1.6.1 of Shapash. A notebook tutorial is also provided to help you get familiar with them.
I thank Julien Bohné (Société Générale), Yann Golhen (MAIF) and Thomas Bouché (MAIF) for their contribution and commitment to this fruitful collaboration.
References
[1] Ribeiro et al., "Why Should I Trust You?": Explaining the Predictions of Any Classifier (2016)
[2] Lundberg et al., A Unified Approach to Interpreting Model Predictions (2017)
[3] Kumar et al., Problems with Shapley-value-based explanations as feature importance measures (2020)