With the progress in the field of machine learning, many new and complex models are being widely used across various domains. Especially with the advancement of deep learning, data are being used to make critical decisions. But sometimes it becomes difficult even for AI experts to explain certain predictions made by the so-called "black-box models". When it comes to high-risk fields such as healthcare or autonomous cars, it becomes very important to know:
- What is our model learning?
- Which parts of the model are responsible for making certain predictions?
- Is the model robust?
Different model interpretability techniques help answer these questions. In this article, I have shared an overview of some of the commonly used interpretability tools.
Let’s make the interpretability tools interpretable to all 😀.
Before starting, it is good to know some of the terms that will be used several times in this article.

The best way to interpret a particular prediction is to use an inherently interpretable machine learning technique for the decision in the first place. Some examples of interpretable ML techniques are:
![Examples of some interpretable ML techniques [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/1fAkQap-33MiwbJ5patzekA.png)
All the members of the Generalized Linear Model (GLM) family are highly interpretable. In this article, I will not go deeper into interpretable ML algorithms. To learn how GLMs work, go through my other article.
Permutation Feature Importance:
Feature importance assigns a score to each feature that tells us which features matter to our model and which ones play a crucial role in driving its predictions. There exist a few model-specific measures of feature importance: for GLMs, the parameter coefficients scaled by the standard deviation of the corresponding feature serve as feature importance; similarly, for tree-based models, the decrease in impurity from a split on a feature gives a measure of its importance. A quick sketch of these two model-specific measures is given below.
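As a rough, hedged illustration of the two model-specific measures mentioned above (the dataset and models here are my own illustrative assumptions):

```python
# A minimal sketch of model-specific feature importance; dataset and model
# choices are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# GLM-style importance: |coefficient| scaled by the feature's standard deviation.
glm = LogisticRegression(max_iter=5000).fit(X, y)
glm_importance = np.abs(glm.coef_[0]) * X.std(axis=0)

# Tree-based importance: mean decrease in impurity across all splits.
forest = RandomForestClassifier(random_state=0).fit(X, y)
tree_importance = forest.feature_importances_
```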
Permutation feature importance, on the other hand, is a model-agnostic measure where we calculate the importance of a feature by permuting (shuffling) its values.
If the prediction error increases after shuffling the values of a particular feature, then we can say that the feature is important.
![Permutation feature importance example [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1wkEuTGyA5YahbkqFg_x_FQ.png)
Feature importance is calculated as:
![Feature importance using Permutation feature importance [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1-Z7bHDFrJvnbG7hpjxz3SA.png)
A feature can be considered important if the value of Fᵢⱼ is greater than 1. The values of a feature can be permuted in many different ways (specifically, n! ways if the number of data instances is n), so permutation feature importance is usually reported with a confidence interval. A minimal code sketch of this idea is given below.
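As a rough sketch of permutation feature importance with scikit-learn (the dataset, model, and number of repeats are my own illustrative assumptions; note that scikit-learn reports the drop in score rather than the ratio Fᵢⱼ described above):

```python
# A minimal sketch of permutation feature importance using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature several times and measure the drop in test accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

ranked = sorted(
    zip(X.columns, result.importances_mean, result.importances_std),
    key=lambda t: t[1], reverse=True,
)
for name, mean, std in ranked[:5]:
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```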
Partial Dependence Plot (PDP):
The partial dependence plot is a global method: it considers all the data instances and gives an idea of the global relationship between the predictors/features and the outcome variable.
PDP calculates the marginal effect of one or two features on the outcome. It does not capture the interactions among features or their effect on the predicted outcome.
For plotting the PDP, we have two sets of features:
- The feature for which we want to plot the PDP
- The other features used in the machine learning model.
The mechanism for plotting the PDP for a numerical feature is a little different from that for a categorical feature.
To plot the PDP for a numerical feature X₁:
- The machine learning model f is fitted on the original data.
- To get the PDP value at a specific value, say X₁₁, of the feature X₁, an artificial dataset is created by setting the value of X₁ to X₁₁ for all data instances.
- Predictions are made with the already fitted machine learning model for each data instance.
- The value of the PDP at X₁₁ is the average of all the predictions made in step 3.
![PDP for a particular value of a feature [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1hzmUjKCt0bmRlWD_DGULow.png)
- The above process is repeated to cover the full interval of X₁, and that gives the PDP of X₁.
Similarly, to plot the PDP for a categorical feature, an artificial dataset is created by replacing the values of the feature with each of its possible categories in turn. The PDPs for a numerical and a categorical feature look like these, respectively (a minimal code sketch follows after the figure).
![PDP for a numerical feature and a categorical feature [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/1Ea1slBP-Ge9XC6ALsubcqQ.png)
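As a rough, hedged sketch of the PDP procedure described above (the dataset, model, and feature name are illustrative assumptions; scikit-learn also provides sklearn.inspection.PartialDependenceDisplay for the same purpose):

```python
# A minimal sketch of the PDP procedure for a numerical feature.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor

def partial_dependence(model, X, feature, grid):
    """Average model prediction over all rows while forcing `feature` to each grid value."""
    pdp_values = []
    for value in grid:
        X_artificial = X.copy()
        X_artificial[feature] = value            # step 2: replace the feature for every instance
        preds = model.predict(X_artificial)      # step 3: predict with the fitted model
        pdp_values.append(preds.mean())          # step 4: average the predictions
    return np.array(pdp_values)

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)    # step 1: fit the model

grid = np.linspace(X["MedInc"].min(), X["MedInc"].max(), 30)   # cover the full interval
pdp = partial_dependence(model, X, "MedInc", grid)

plt.plot(grid, pdp)
plt.xlabel("MedInc")
plt.ylabel("Partial dependence")
plt.show()
```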
So, we have learned how to plot a PDP. But what can we say about feature importance by looking at the plots?
The main idea is that a flat PDP tells us the feature is not very important.
The more variation there is in the PDP, the more important the feature is.
For a numerical feature, the feature importance is calculated by the following formula, which is basically the standard deviation of the PDP values of that feature.
![Feature importance of numerical feature [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/1BGWsLI_jtiw1ysKEjmesKw.png)
Similarly, for a categorical feature, the importance score is calculated in the following way:
![Feature importance of categorical feature [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/16uUKzexerDqCASNm4kDKoA.png)
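Continuing the hedged PDP sketch above, the numerical-feature version of this importance measure is simply the (sample) standard deviation of the computed PDP values:

```python
# PDP-based importance for a numerical feature, reusing the `pdp` array computed
# in the sketch above (the feature name is still an illustrative assumption).
importance = pdp.std(ddof=1)
print(f"PDP-based importance of MedInc: {importance:.3f}")
```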
What is a surrogate model?
A surrogate model is an interpretable model trained to approximate the predictions of an underlying black-box model as closely as possible while remaining interpretable. It takes the original feature values as input, uses the predictions made by the black-box model as the target, and tries to fit an interpretable model (discussed in the previous section) that closely approximates the black-box model.
- We can say that a surrogate model is a close approximation of the underlying black-box model if the value of R² is high. We can calculate the R-squared value in the following way:
![R-squared value of a surrogate model [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1EGiuoX7jc8vRLmXRax5tZw.png)
- Training a surrogate model is a model-agnostic approach, as it does not require any information about the underlying model; all that is required is the feature values for each data instance and the corresponding predictions of the underlying model.
- Global surrogate models make approximations of the whole black-box model.
- Local surrogate models are used to approximate the underlying black-box model's prediction for a particular instance (a short sketch of a global surrogate follows below).
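As a rough sketch of a global surrogate model (the dataset, the black-box model, and the depth of the surrogate tree are my own illustrative assumptions):

```python
# A minimal sketch of a global surrogate model.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

black_box = RandomForestRegressor(random_state=0).fit(X, y)
y_black_box = black_box.predict(X)          # the black-box predictions become the new target

# Fit an interpretable model on the black-box predictions, not on the true labels.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_black_box)

# R² between surrogate and black-box predictions tells us how closely it approximates f.
print("Surrogate R²:", r2_score(y_black_box, surrogate.predict(X)))
```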
LIME (Local Interpretable Model-agnostic Explanations):
LIME uses local surrogate models to provide a concrete explanation of an individual prediction rather than focusing on explaining the whole model.
The assumption behind LIME is that we can probe the black-box model as many times as we want. Also, LIME treats any algorithm as a black box (even if we apply LIME to linear regression, it will treat the linear regression model as a black box).
Our goal is to know why the black-box model made a certain prediction.
![Data point of interest for interpretation [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1eRbt7tB87C016Mkw81QFKg.png)
From the above picture, we can see that there exists a complex decision boundary separating the two classes. But we want to know why the model made a certain decision for the highlighted data instance and which features are most responsible for that decision.
For that, we create a new dataset consisting of perturbed samples from the original dataset along with the black-box model's predictions for those perturbed samples.
![New dataset created for the purpose of LIME [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1BsO48VDq5FIQtblkSoNsCQ.png)
The perturbed samples are weighted according to their proximity to the data point of interest.
![Weighted samples in the neighbourhood of the datapoint of interest[Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1VqIKFqD8gvEnhoENHM7vuQ.png)
From the above image, we can see that the samples close to the instance of interest are given higher weight (the bigger the circle, the bigger the weight) and the samples far from the point of interest are given lower weight.
The local surrogate model is trained on this new dataset. It can be any model from the family of interpretable models listed above.
For the explanation of the data instance x, we want to minimise a loss L, which measures how close the predictions of the black-box model f are to the predictions of the interpretable model g in the neighbourhood of x, while keeping the complexity of g low (fewer features for linear regression; a shallower tree for a decision tree). This is a trade-off between the complexity of g and how closely it approximates f, known as the fidelity-interpretability trade-off.
![Optimization function for LIME [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/15xZk-1i7rFo_T4mMnBT4eA.png)
Where G is the family of interpretable models.
While training the local interpretable model, the weights given to the perturbed samples are taken into account.
We are searching for the best approximation of the complex model in the neighbourhood of x.
An exponential smoothing kernel is used to define the neighbourhood of x: a smaller kernel width means a sample must be very close to the point of interest to influence the locally fitted model. I am not going into much detail about it here; for more details you can check this book.
![Interpretation of the datapoint of interest using local surrogate model [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1boh-wJu62Y1y72CneTxkow.png)
From the above image, you can see that a local interpretable linear model is fitted. Generally, these kinds of linear models are sparse in nature (Lasso regression is explained wonderfully by Saptashwa Bhattacharyya). By interpreting this local model, we can explain the prediction made by the complex model for the point of interest. A short code sketch follows below.
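As a rough sketch of how this looks in code with the lime package (the dataset, model, and number of features shown are my own illustrative assumptions):

```python
# A minimal sketch of LIME for tabular data using the `lime` package.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
black_box = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

# Explain one data point of interest by fitting a weighted, sparse local linear model.
exp = explainer.explain_instance(data.data[0], black_box.predict_proba, num_features=5)
print(exp.as_list())   # feature conditions and their local weights
```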
We have understood how LIME works, but that is not enough. Let’s see how to interpret a LIME output.
![LIME output for a particular data point [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/1jQesiVuhMW7ANFnsW7NbPQ.png)
Suppose we want to interpret the LIME output for a particular data instance of a tabular dataset with two classes, "edible" and "poisonous". Here we want to see how the values of the different features of that particular data instance impact the prediction. Orange highlights support "poisonous" and blue ones support "edible". The locally fitted model predicted "poisonous" for this data instance with a probability score of 1. The weights shown in the middle panel of the above image are the parameters of the locally fitted model. Here we can see that the value "odor=foul" is the most important one for increasing the chance of being classified as poisonous, and "gill-size=broad" is the only one that decreases that chance.
To know how LIME works for image and text data please check this source.
SHAP (SHapley Additive exPlanations):
SHapley Additive exPlanations, or SHAP, is based on the concept of Shapley values from game theory. The main idea behind SHAP is to know how much each individual feature contributes to a certain prediction. Calculating these individual contributions can be tricky, as there may be interactions among the features.
SHAP is one of the measures that can be used for both local and global explanations.
So, we understand the "Shapley" and the "Explanations", but why "Additive"?
Let’s see that.
Shapley values explain how the output for a particular data point differs from the baseline output of the underlying model. Let Φᵢⱼ be the Shapley value of feature j for the iᵗʰ data instance. These values contribute either positively or negatively to moving the output away from the baseline. Because the effects of the Shapley values are added to the baseline output to reconstruct the prediction, the method is "Additive".
![Explanation of SHAP [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1bEldWjZC_cTpfvSHkJXObw.png)
Now let’s see how these Φᵢⱼ are calculated.
The Shapley value Φᵢⱼ for the jᵗʰ feature of the iᵗʰ data instance is calculated in the following manner:
![Formula for calculating Shapley values for jᵗʰ feature of the iᵗʰ data instance [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/13f0G4jLEw7Sww4liG2LP7w.png)
Where,
S is a subset of features.
f is the underlying black-box model.
f(S ∪ {j}) is the prediction of the black-box model for the subset S together with the feature of interest.
f(S) is the prediction of the black-box model for the subset S without the feature of interest.
Suppose we have a total of p features. Then each S is a subset of the remaining p − 1 features (all features except the jᵗʰ).
For example, suppose we want to calculate Φ₁₁; then S is a subset of the green-highlighted feature values. Here our goal is to see how a particular value of a feature affects the output. We consider different combinations S to capture the interaction effects among the features.
![Subset selection for calculating Shapley values [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1gdlnaVs_Tp_dwzg-k7saSQ.png)
The weighting term depends on how many features are present in the subset S. The intuition is to give more weight to the contribution of adding the jᵗʰ feature when many features are already included in S, since this indicates that the jᵗʰ feature causes a strong change in the prediction even when many other features are already present.
Now let’s see how to interpret SHAP output for a particular data point.
![SHAP output for a particular data point [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/1pW_NocaFgTL9vg8V_oDKkQ.png)
For this particular example, the baseline output of the underlying black-box model is 22.841 and the output for this particular data point is 16.178. The red highlighted bars indicate the feature values that push the output towards the baseline output, and the blue ones push it in the opposite direction. A short code sketch of producing such a local explanation is given below.
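As a rough sketch with the shap package (the dataset and model are my own illustrative assumptions, not the ones used in the figure above):

```python
# A minimal sketch of local SHAP explanations for a tree-based model.
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # efficient Shapley value estimation for tree models
shap_values = explainer(X.iloc[:200])     # explain a subset of the data

# Local explanation: baseline output + sum of Shapley values = model prediction.
shap.plots.waterfall(shap_values[0])
```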
To understand how SHAP helps with global interpretation, we need to know about SHAP feature importance and the SHAP dependence plot.
SHAP Feature Importance:
The global importance of the jᵗʰ feature is calculated from the Shapley values in the following manner:
![Formula for calculating global feature importance using Shapley values [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/166b-gHxrxJLjknCQlclJ1g.png)
The higher the feature importance value, the more important the feature is.
![Global feature importance plot [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/1Y9YXl4vOBIl2Hr6v5zNp2g.png)
For this particular example, the number of years on hormonal contraceptives is the most important feature, and the values on the x-axis represent the average absolute effect of each feature on the output.
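Continuing the hedged SHAP sketch above, the same kind of global importance plot can be obtained from the computed Shapley values:

```python
# Global SHAP feature importance: the mean absolute Shapley value per feature,
# reusing the `shap_values` computed in the sketch above.
shap.plots.bar(shap_values)
```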
SHAP Dependence Plot:
The SHAP dependence plot can be considered an alternative to the PDP. Using a SHAP dependence plot, we can also infer the interaction effects among features on the model predictions, which we cannot do with a PDP. It is created by plotting the values of a feature on the x-axis and the corresponding Shapley values on the y-axis.
For the jᵗʰ feature, the SHAP dependence plot is created by plotting:
![Data for plotting SHAP dependence plot [Image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/12d8VyFwfkp0GVCCd3rOUMg.png)
Now let’s check how it looks and how to interpret it.
![SHAP dependence plot [Source]](https://towardsdatascience.com/wp-content/uploads/2021/11/1ppwsICZFSvmMZomXz5Eqvw.png)
For this example, age plays a crucial role in the model prediction when it is between 20 and 40, and after that the effect of age on the model output is quite stable. The interaction effect of age with another feature, Education-Num, can also be read from this plot: near age 20, lower education affects the output more than higher education does. A short code sketch is given below.
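Continuing the hedged SHAP sketch above, a dependence plot for a single feature (the feature name here is an illustrative assumption for the assumed dataset) can be drawn as:

```python
# SHAP dependence plot: feature values on the x-axis, Shapley values on the y-axis,
# coloured by the feature with the strongest interaction. Reuses `shap_values`
# from the sketch above.
shap.plots.scatter(shap_values[:, "MedInc"], color=shap_values)
```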
Note: The plots that are used in this article are collected from different sources for better understanding of the methods and to interpret the outputs as clearly as possible.
Conclusions
Among all the interpretability tools discussed in this article, SHAP is the most commonly used because of its granularity and because it can be used for both local and global explanations, although it takes a little more time than the other techniques. So, it is very important to know the purpose of the explanation before choosing an interpretability tool.
If you like this article, please hit recommend. That would be incredible.
Follow me on Medium and LinkedIn for my future blog posts and updates.
References
- C. Molnar. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/
- SHAP documentation: https://shap.readthedocs.io/en/latest/index.html
- M. T. Ribeiro, S. Singh, C. Guestrin. "Why Should I Trust You?" Explaining the Predictions of Any Classifier. https://arxiv.org/pdf/1602.04938v3.pdf
Let’s wrap it up. Thank you so much for reading, and happy learning!