Model Interpretability

ML Model Explainability (sometimes referred to as Model Interpretability or ML Model Transparency) is a fundamental pillar of AI Quality. It is impossible to trust a Machine Learning model without understanding how and why it makes its decisions, and whether these decisions are justified. Peering into ML models is absolutely necessary before deploying them in the wild, where a poorly understood model can not only fail to achieve its objective, but also cause negative business or social impacts, or encounter regulatory trouble. Explainability is also an important backbone to other trustworthy ML pillars like fairness and stability.
Yet "explainability" is often a broad and sometimes confusing concept. At its simplest, a machine learning explanation is a set of views of model function that help you to understand results predicted by machine learning models.
How do I pick the best explainability technique?
A fair amount of complexity lies beneath this simple description. There are competing methods for providing model explanations: coefficients of logistic regressions, LIME, Shapley value techniques (QII, SHAP), and Integrated Gradients explanations. How do you know if you are getting a good explanation of the model? How do you compare different methods of providing explanations to select the best one for your model? To provide clarity, we introduce a taxonomy of explanation methods that characterizes how these myriad techniques fit together.
Our ML explainability taxonomy has four dimensions:
- Explanation scope. What is the scope of the explanation and what output are we trying to explain?
- Inputs. What inputs is our explanation method using?
- Access. What model and data access does the explanation method have?
- Stage. To what stage of the model do we apply our explanations?
Let’s break down each of these questions and understand what they mean.

Explanation scope: what outputs are we trying to explain?
Let’s use a concrete example to illustrate this nuance. Suppose that a data scientist trains a machine learning model to predict the probability that a given individual is at risk for a disease. The model uses a variety of medical data in its decision-making, and the data scientist wants to justify the model’s predictions to doctors who might use the algorithm.
Global or local?
The data scientist might present a list of the top five features driving model predictions across all individuals, such as a person’s blood pressure and whether they have a family history of the disease. This is an example of a global explanation, as the data scientist is explaining the model via the overall drivers of its predictions across many data points. The data scientist might also supplement this with a local explanation, which would instead explain the drivers of a prediction for a single individual or data point – for example, why was John specifically predicted to be at risk for this disease? The scope of an explanation can range widely from local to global. Somewhere in between, a data scientist could instead explain the drivers of a model’s predictions on a segment of the population – for example, the drivers of predictions for women only.
Type of output
It’s not just about how many outputs we want to explain, but also their type. Consider a model that determines whether someone qualifies for a loan. Are we explaining the probability scores (the probability of a person getting the loan) that the model assigns to each data point? Or the classification decisions (whether or not to give the loan) that are produced when the data scientist thresholds the raw probability scores? Small changes to the model output being explained can have vast implications for the explanation itself.
As a quick example to demonstrate how small changes to the output being explained can have significant impact on the explanation itself, let’s take the U.S. presidential election. Each state can be considered a "feature" that assigns a score (electoral college votes) which then gets aggregated to determine the election outcome. If you were to explain the raw score of the election by posing the question "Why did Joe Biden receive 306 electoral college votes?", large states like Texas, California, and New York would be the primary contributors to his precise score. In contrast, to explain the classification outcome of "Why did Joe Biden win the presidency?", we’d instead turn to smaller swing states that were pivotal in deciding between candidates.
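To make this concrete in code, here is a minimal sketch showing that the same trained model exposes two different outputs one could explain: a raw probability score and a thresholded classification decision. The synthetic dataset, the logistic regression model, and the 0.5 cutoff are all illustrative assumptions.

```python
# Minimal sketch: one model, two explainable outputs.
# The synthetic data, model choice, and 0.5 cutoff are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

applicant = X[:1]                                # one "loan applicant"
score = model.predict_proba(applicant)[0, 1]     # raw probability score
decision = int(score >= 0.5)                     # thresholded classification

# Explaining `score` asks "why this probability?"; explaining `decision`
# asks "why did this applicant land on this side of the cutoff?"
print(f"probability={score:.2f}, decision={decision}")
```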
Explanation inputs: from which inputs is our explanation being calculated?
Explainability methods must summarize their results with respect to some component of the model. Typically, these components being analyzed are either the constituent features, intermediate features, or training data.
- Input features. Most commonly, it is the model’s constituent features that are being explained. A method will attempt to explain the local or global properties of the model by explaining how each feature affected the overall model decision(s). In the healthcare example above, a data scientist might point to "blood pressure" and "family history of the disease" as specific input features that are impactful on model decisions.
- Intermediate features. Let’s now turn to a new example: suppose a researcher has trained a convolutional neural network on image data to predict whether a dog appears in an image. The researcher might first explain the model with respect to each pixel of a particular input image, but this can be hard to reason about over thousands of pixels. However, the researcher knows that a particular intermediate layer of the network is responsible for identifying patterns that would indicate a dog is present, and wants to sanity check these filters for correctness. So, she instead generates explanations with respect to that intermediate layer of the network (see the sketch following this list). These types of internal explanations are common for models with an inherently sequential structure, such as intermediate layers of a neural network or branching structures within decision trees.
- Training data characteristics. But explanations that are calculated with respect to model internals are still using intermediate "features" of the model. What if our explanation method instead tried to justify model behavior by tracing it back to the training data itself? Other explanation techniques attempt to do just that by quantifying the data points that most contributed to a certain aspect of learned model behavior.
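To make the intermediate-features case concrete, here is a minimal sketch in PyTorch that captures one convolutional block’s activations with a forward hook, so an attribution method could be run against those activations instead of raw pixels. The tiny network and the choice of hooked layer are illustrative assumptions, not a particular published technique.

```python
# Minimal sketch: capture an intermediate layer's activations so that an
# attribution method can explain them instead of thousands of raw pixels.
# The toy network and the hooked layer are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),   # "dog pattern" filters
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)

captured = {}

def save_activation(module, inputs, output):
    captured["conv_block"] = output        # the intermediate "features"

model[0].register_forward_hook(save_activation)

image = torch.randn(1, 3, 32, 32)          # stand-in for a real photo
logits = model(image)

# captured["conv_block"] (shape 1x8x32x32) is what we would attribute against.
print(captured["conv_block"].shape, logits.shape)
```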
Explanation access: how much information does our explanation know about the model?
The amount of access that an explanation method has to the model itself is another defining dimension of our taxonomy. "Limited access" is not inherently better than "full access," or vice versa – they each have specific benefits.
- Limited access. Many explanation techniques assume only limited access to a model’s inputs and outputs, without any knowledge of the model architecture itself. Model-agnostic techniques such as LIME, SHAP, and QII examine the effect that inputs have on model outputs without peering into model internals. This can be useful for comparing explanations across multiple models trained on the same dataset: how does model behavior change when a data scientist swaps a logistic regression for something more complex like an ensemble of trees? (See the sketch following this list.)
- Intermediate access. Explanation techniques for specific model classes have become increasingly popular – efficient approximations of Shapley values for tree models, for example, leverage the structure of the model to improve performance.
- Full access. In the extreme case, model-specific techniques might need full access to a model object. Although they are not translatable across multiple model classes, they can leverage the structure of a particular model type to give deep insight or provide a performance improvement over model-agnostic methods. Popular examples of model-specific techniques include gradient-based strategies like Integrated Gradients, SmoothGrad, and Grad-CAM which are designed for neural networks.
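As a concrete illustration of the limited-access case, here is a minimal sketch using scikit-learn’s permutation importance, one model-agnostic technique among many; the synthetic dataset and the two model choices are illustrative assumptions. Because the method only needs inputs and outputs, the exact same procedure can be applied to a logistic regression and to a tree ensemble trained on the same data, making their explanations directly comparable.

```python
# Minimal sketch of a limited-access, model-agnostic comparison: permutation
# importance only queries inputs and outputs, so the same procedure applies to
# two very different model classes trained on the same data.
# The synthetic dataset and model choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
    model.fit(X_tr, y_tr)
    result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    print(type(model).__name__, result.importances_mean.round(3))
```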
Explanation stage: when do explanations come into the picture?
The last part of our explanation taxonomy concerns the stage of model development at which explanations are applied: during training, or after it? Explanation methods can follow naturally from the structure of the model (as in the case of a decision tree), or be retroactively applied to understand a more opaque model (such as a neural network). To this point, a common debate in the literature is whether to build self-interpretable (inherently interpretable) models or algorithmically interpretable ones.
Explainable during training
The prototypical self-interpretable (also called "white box") model is the simple linear or logistic regression: feature importances can be read directly from the model’s coefficients, and the exact way the model reaches its conclusions is clear because it creates its predictions as a weighted sum of input features.
Decision trees, decision sets, rule-based classifiers, and scorecards are also considered white box models in that they force models to create predictions via clearly defined rules. More recently, there has also been research into adding interpretability constraints directly into model training itself, e.g. Bayesian rule-based reasoning models, sparsifying neural networks through gradient penalties, or even adding human feedback into an optimization procedure.
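As a small illustration of a white box model being its own explanation, here is a minimal sketch that prints the clearly defined rules a shallow decision tree has learned. The Iris dataset and the depth-2 tree are illustrative assumptions.

```python
# Minimal sketch: a white box model whose learned rules can be read directly.
# The Iris dataset and the depth-2 tree are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# The model *is* its explanation: a small set of if/then rules.
print(export_text(tree, feature_names=list(data.feature_names)))
```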
While inherently interpretable models do simplify the "explanation" of a model, limiting oneself to linear or rule-based models severely constrains the types of models, and therefore the use cases, that data scientists can tackle with machine learning.
Explainable after training
Explainability techniques that are applied to models after training is complete can explain the "self-interpretable" models mentioned above, while also explaining "algorithmically interpretable" models.
Algorithmically interpretable models are those that are not restricted in training to conform to a set of rules. The rules are derived during the training process itself and are not necessarily immediately obvious to the data scientist or an outside observer. For these reasons, they are often called "black box" models. In order to explain algorithmically interpretable models, explainability techniques have to be applied after a model is finished training. For this reason, these techniques are often known as "post-hoc explainability," and they can be used with any machine learning model type, both white box and black box.
Zooming out: common explanation techniques using this taxonomy
We’ve dissected each dimension of our explanation taxonomy, but let’s ground this with a few examples of popular explainability techniques, summarized below and then discussed in turn.
- Logistic regression coefficients: global scope; input features; model-specific (requires the learned weights); self-interpretable.
- LIME: local scope; input features; model-agnostic (inputs and outputs only); post-hoc.
- Shapley Value-based methods (SHAP, QII): local and global scope; most commonly input features; model-agnostic; post-hoc.

Interpreting the coefficients of a logistic regression model
Logistic regression models are simple models that capture a linear relationship between the input features and the log-odds of the output. As such, the model’s decision making is easy to understand and explain – it is captured entirely in the coefficients of the logistic regression. The relative magnitudes of these coefficients (for features on comparable scales) give an intrinsic sense of how strongly each feature drives the model’s predictions.
Because of this, interpreting the coefficients of a logistic regression model is a self-interpretable technique: the model itself is inherently constrained to a simple relationship between inputs and outputs. The scope of the explanation is global – coefficient magnitudes cannot be used to reason about one specific data point, but rather describe the aggregate drivers of predictions across the data. It is also a model-specific technique, as it requires access to the learned weights of the model.
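As a minimal sketch of this technique, the global drivers can be read straight off the model’s learned weights. The breast cancer dataset is an illustrative assumption, and the features are standardized so that coefficient magnitudes are comparable.

```python
# Minimal sketch: global drivers read straight from a logistic regression's
# learned weights. The dataset is an illustrative assumption; features are
# standardized so coefficient magnitudes can be compared.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(data.data, data.target)

coefs = pipe.named_steps["logisticregression"].coef_[0]
top = np.argsort(np.abs(coefs))[::-1][:5]        # five largest |coefficients|
for i in top:
    print(f"{data.feature_names[i]:<25} {coefs[i]:+.2f}")
```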
LIME
Ribeiro et al.’s "Why Should I Trust You?" paper introduced Local Interpretable Model-Agnostic Explanations, or LIME, a well-known technique. At a high level, LIME explains a model’s decision on a single input by perturbing that data point’s features and observing how the model’s predictions shift.
As you can gather from the name, LIME is a local explainability technique that is only suited to interpret a single data point in isolation. It is also model-agnostic as it assumes only access to model inputs and outputs. It does not constrain model training at all and instead explains a model’s decision making after it has trained, making it an algorithmically interpretable technique. Finally, LIME is usually used to calculate explanations with respect to a model’s input features.
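Here is a minimal sketch using the open-source `lime` package: the explainer only sees training data and a prediction function, and returns the top local feature contributions for a single data point. The dataset and the random forest model are illustrative assumptions.

```python
# Minimal sketch of LIME explaining one prediction. The explainer only needs
# training data and a predict function, not model internals.
# The dataset and model are illustrative assumptions; requires `pip install lime`.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=5
)
print(explanation.as_list())     # top local feature contributions for this point
```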
Shapley value-based explanations
Shapley Value-based explanation techniques, such as Lundberg et al.’s SHAP and Datta et al.’s QII, are increasingly popular. We will cover these techniques in greater depth in subsequent posts, but both use the concept of the Shapley Value (a term from coalitional game theory) to attribute a model’s output score or classification decision across each of its input features.
Shapley Value-based techniques, like LIME, are algorithmically interpretable and model-agnostic methods. They assume no access to model internals and can be applied to any model type. The core algorithm itself can be applied to any input, but is most commonly used to explain a model’s constituent features. Unlike LIME, however, Shapley Value-based explanations can be used to ascertain both local and global model reasoning for a variety of model outputs (probability, regression, and classification outcomes, to name a few).
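Here is a minimal sketch using the open-source `shap` package’s model-agnostic KernelExplainer. The dataset, model, background sample, and `nsamples` budget are illustrative assumptions. The per-row attributions are local explanations; averaging their absolute values across rows gives a simple global view of the same model.

```python
# Minimal sketch of Shapley-value attributions with the model-agnostic
# KernelExplainer: it only queries the prediction function.
# The dataset, model, background set, and nsamples budget are illustrative.
# Requires `pip install shap`.
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(random_state=0).fit(data.data, data.target)

background = data.data[:50]                      # reference points for feature "absence"

def predict_pos(X):
    return model.predict_proba(X)[:, 1]          # probability of the positive class

explainer = shap.KernelExplainer(predict_pos, background)

# Local attributions: one Shapley value per feature for each explained row.
shap_values = explainer.shap_values(data.data[:5], nsamples=200)
print(shap_values.shape)                         # (5, n_features)

# Aggregating |attributions| across rows gives a global view of the same model.
print(np.abs(shap_values).mean(axis=0).round(3))
```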
Let’s explain our machine learning models… intelligently
As we have seen, there are a wide variety of ways to explain machine learning models, and different techniques suit different model types and situations. The taxonomy above – explanation scope, inputs, access, and stage – gives a data scientist a way to characterize, compare, and select the most appropriate technique for their model and their unique ML scenario.
When a data scientist selects a suitable technique, they put themselves in the best position to accurately describe their model’s results and to build stakeholder and user trust, which in turn gives the model the best chance of approval for real-world use and long-term adoption. When an unsuitable method is selected, the opposite happens: the data scientist ends up scrambling to explain why the model behaves as it does, the explanations prove unreliable, and confidence spirals downward. We hope this taxonomy helps you make informed decisions about how to evaluate a model’s trustworthiness.