Understanding how LIME explains predictions

Pol Ferrando
Towards Data Science
11 min read · Dec 28, 2018


In a recent post I introduced three existing approaches to explain individual predictions of any machine learning model. In this post I will focus on one of them: Local Interpretable Model-agnostic Explanations (LIME), a method presented by Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin in 2016.

In my opinion, its name perfectly summarizes the three basic ideas behind this explanation method:

  1. Model-agnosticism. In other words, LIME is model-independent: it doesn’t make any assumptions about the model whose prediction is being explained. It treats the model as a black box, so the only way it can understand the model’s behavior is to perturb the input and see how the predictions change.
  2. Interpretability. Above all, explanations have to be easy for users to understand. This is not necessarily true of the feature space used by the model: it may involve too many input variables (even a linear model can be difficult to interpret if it has hundreds or thousands of coefficients), or variables that are too complex or artificial (explanations in terms of such variables will not be interpretable by a human). For this reason, LIME’s explanations use a data representation (called the interpretable representation) that is different from the original feature space.
  3. Locality. LIME produces an explanation by approximating the black-box model by an interpretable model (for example, a linear model with a few non-zero coefficients) in the neighborhood of the instance we want to explain.

In summary, LIME generates an explanation for a prediction from the components of an interpretable model (for instance, the coefficients in a linear regression) that approximates the black-box model in the vicinity of the point of interest and that is trained on a new data representation to ensure interpretability.

Interpretable representation

To ensure that the explanation is interpretable, LIME distinguishes an interpretable representation from the original feature space that the model uses. The interpretable representation has to be understandable to humans, so its dimension is not necessarily the same as the dimension of the original feature space.

Let p be the dimension of the original feature space X and let p’ be the dimension of the interpretable space X’. The interpretable inputs map to the original inputs through a mapping function hʸ: X’→X, which is specific to the instance we want to explain, y ∈ X.

Different types of mappings are used for different input spaces:

  • For text data, a possible interpretable representation is a binary vector indicating the presence or absence of a word, even though the classifier may use more complex and incomprehensible features such as word embeddings. Formally, X’={0,1}ᵖ’ ≡ {0,1}× ⋅⋅⋅ ×{0,1}, where p’ is the number of words in the instance being explained, and the mapping function converts a vector of 1’s and 0’s (presence or absence of a word, respectively) into the representation used by the model: if the model uses word counts, the mapping maps 1 to the original word count and 0 to 0; if the model uses word embeddings, the mapping converts any sentence expressed as a vector of 1’s and 0’s into its embedded version (a minimal code sketch of this mapping appears after this list).
  • For images, a possible interpretable representation is a binary vector indicating the presence or absence of a set of contiguous similar pixels (also called a super pixel). Formally, X’={0,1}ᵖ’, where p’ is the number of super pixels considered, usually obtained by an image segmentation algorithm such as quick shift. In this case, the mapping function maps 1 to leaving the super pixel as in the original image and 0 to graying the super pixel out (which represents it being missing).
Original representation (left) and interpretable representation (right) of an image. Sets of contiguous similar pixels (delimited by yellow lines) are called super pixels. Each super pixel defines one interpretable feature. Source: https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
  • For tabular data (i.e., matrices), the interpretable representation depends on the type of features: categorical, numerical or mixed data. For categorical data, X’={0,1}ᵖ where p is the actual number of features used by the model (i.e., p’=p) and the mapping function maps 1 to the original class of the instance and 0 to a different one sampled according to the distribution of training data. For numerical data, X’=X and the mapping function is the identity. However, we can discretize numerical features so that they can be considered categorical features. For mixed data, the interpretable representation will be a p-dimensional space composed of both binary features corresponding to categorical features and numerical features corresponding to numerical features (if they are not discretized). Each component of the mapping function is also defined according to the previous definitions depending on the type of the explanatory variable.
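As referenced in the text bullet above, here is a minimal sketch of the mapping hʸ for text data, assuming the black-box model consumes raw sentences; all names are illustrative, not part of any LIME API.

```python
import numpy as np

def h_y(z_prime, words):
    """Map an interpretable binary vector z' back to a sentence.

    z_prime : binary vector of length p' (1 = keep the word, 0 = remove it)
    words   : the p' words of the instance being explained, in order
    """
    return " ".join(w for w, keep in zip(words, z_prime) if keep)

# Instance to explain and its interpretable representation y' (all ones).
words = "I hate this movie".split()
y_prime = np.ones(len(words), dtype=int)

# A perturbed interpretable sample and its original-space counterpart.
z_prime = np.array([1, 1, 0, 1])
print(h_y(z_prime, words))  # "I hate movie"
```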

Locality

Although a simple model may not be able to approximate the black box model globally, approximating it in the neighborhood of the individual instance we want to explain may be feasible. In other words, LIME relies on the assumption that every complex model is linear on a local scale.

Formally, we need a weight function wʸ: X→ℝ⁺ that gives the most weight to the instances z ∈ X closest to the instance we want to explain, y ∈ X, and the least weight to the instances that are furthest away.

Fidelity-interpretability trade-off

Ribeiro et al. define an explanation as a model g: X’→ ℝ such that g ∈ G, where G is a class of “potentially” interpretable models, that is, models that can be readily presented to the user with visual or textual artifacts, such as linear models or decision trees. Note that the domain of g is the interpretable space X’.

Let 𝔏(f,g,wʸ) be a loss function that measures how unfaithful g is in approximating f, taking into account the weights wʸ. As not every g ∈ G may be simple enough to be interpretable, we let Ω(g) be a regularization term that measures the complexity (as opposed to the interpretability) of the explanation g ∈ G. For example, for linear models Ω(g) may be the number of non-zero coefficients, while for decision trees it may be the depth of the tree.

Method

Let X=ℝᵖ be the feature space. Let y∈ℝᵖ be the original representation of an instance being explained, and y’ ∈ X’ be its interpretable representation.

Also, let f: X=ℝᵖ→ℝ be the model being explained. In classification, f(x) is the probability (or a binary indicator) that x belongs to a certain class. For multiple classes, LIME explains each class separately, thus f(x) is the prediction of the relevant class. In regression, f(x) is the regression function.
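As an illustration of this convention, f could be a thin wrapper around a scikit-learn classifier that returns the probability of the class being explained; the toy data and names below are assumptions, not part of LIME.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data (a stand-in for the model's real training set).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

black_box = RandomForestClassifier(random_state=0).fit(X_train, y_train)
class_idx = 1  # the class whose prediction we want to explain

def f(x):
    """Black-box model as a function X -> R: probability of the chosen class."""
    return black_box.predict_proba(x)[:, class_idx]
```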

Finally, let g: X’→ ℝ be the explanation model. Let 𝔏(f,g,wʸ) be a loss function that measures how unfaithful g is in approximating f in the locality defined by wʸ, and let Ω(g) be a measure of the complexity of the explanation g ∈ G.

In order to ensure both interpretability and local fidelity, we must minimize 𝔏(f,g,wʸ) while keeping Ω(g) low enough for the explanation to be interpretable by humans. Thus, the explanation ξ(y) produced by LIME is obtained as follows:

ξ(y) = argmin_{g ∈ G} 𝔏(f,g,wʸ) + Ω(g)

Note that this formulation can be used with different explanation families G, loss functions 𝔏 and regularization terms Ω.

In practice, the general approach LIME uses to produce an explanation is the following:

  1. Generate N “perturbed” samples of the interpretable version of the instance to explain, y’. Let {zᵢ’ ∈ X’ | i=1,…,N} be the set of these observations.
  2. Recover the “perturbed” observations in the original feature space by means of the mapping function. Let {zᵢ ≡ hʸ(zᵢ’) ∈ X | i=1,…,N} be the set in the original representation.
  3. Let the black box model predict the outcome of every “perturbed” observation. Let {f(zᵢ)∈ℝ | i=1,…,N} be the set of responses.
  4. Compute the weight of every “perturbed” observation. Let {wʸ(zᵢ)∈ℝ⁺ | i=1,…,N} be the set of weights.
  5. Solve the equation above using the dataset of “perturbed” samples with their responses {(z’ᵢ, f(zᵢ))∈X’×ℝ | i=1,…,N} as training data.

Note that the complexity does not depend on the size of the training set (because LIME produces the explanation for an individual prediction), but rather on the time needed to compute predictions f(z) and on the number of samples N.
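The five steps above can be wired together generically. The sketch below assumes the user supplies the problem-specific pieces (a sampler over X’, the mapping hʸ, the weight function wʸ); all names are illustrative, and a ridge regression stands in for the interpretable model g.

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(y_prime, sample_z_prime, h_y, f, w_y, N=5000):
    """Generic LIME loop: steps 1-5 for a single instance.

    y_prime        : interpretable representation of the instance to explain
    sample_z_prime : function returning one perturbed sample z' in X'
    h_y            : mapping from X' back to the original feature space X
    f              : black-box prediction function on X
    w_y            : weight function on X (proximity to the instance)
    """
    Z_prime = np.array([sample_z_prime(y_prime) for _ in range(N)])   # step 1
    Z = np.array([h_y(z_prime) for z_prime in Z_prime])               # step 2
    responses = f(Z)                                                  # step 3
    weights = np.array([w_y(z) for z in Z])                           # step 4
    g = Ridge(alpha=1.0)                                              # step 5 (one choice of interpretable model)
    g.fit(Z_prime, responses, sample_weight=weights)
    return g.coef_
```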

The sampling process described in step 1 depends on the interpretable space X’, so it is different depending on the type of input data (see Section “Interpretable representation”). For text data or images, whose interpretable space is composed of binary features (i.e., X’={0,1}ᵖ’), the samples zᵢ’ ∈ X’ are obtained by drawing non-zero elements of y’ uniformly at random, where the number of such draws is also uniformly sampled. For tabular data (i.e., matrices), step 1 depends on the type of features and it requires the original training set.

For text data, since the interpretable space is a binary vector indicating the presence or absence of a word, the process means randomly removing words from the instance being explained. For example, if we are trying to explain the prediction of a text classifier for the sentence “I hate this movie”, we will perturb the sentence and get predictions on sentences such as “I hate movie”, “I this movie”, “I movie”, “I hate”, etc. Note that if the classifier uses some uninterpretable representation such as word embeddings, this still works: in step 2 we will represent the perturbed sentences with word embeddings, and the explanation will still be in terms of words such as “hate” or “movie”.
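Here is a minimal sketch of this word-removal sampling, following the uniform scheme described above; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
words = "I hate this movie".split()
p_prime = len(words)

def sample_z_prime():
    """Draw a perturbed interpretable sample: switch off a random number of words."""
    n_off = rng.integers(1, p_prime + 1)               # how many words to remove, sampled uniformly
    off = rng.choice(p_prime, size=n_off, replace=False)
    z_prime = np.ones(p_prime, dtype=int)
    z_prime[off] = 0
    return z_prime

for _ in range(3):
    z_prime = sample_z_prime()
    sentence = " ".join(w for w, keep in zip(words, z_prime) if keep)
    print(z_prime, "->", repr(sentence))               # e.g. [1 0 1 1] -> 'I this movie'
```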

For images, since the interpretable space is a binary vector indicating the presence or absence of a super pixel, the process consists of randomly making some of the super pixels gray. In other words, if we want to explain the prediction for an image, we will perturb the image and get predictions on images which have one or more hidden super pixels. This process is described in the following image:

Examples of perturbed instances of an image and their predictions. Source: https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
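A sketch of the image case, assuming scikit-image’s quickshift for the segmentation; the gray fill value and the segmentation parameters are illustrative choices, not necessarily the defaults of any LIME implementation.

```python
import numpy as np
from skimage.segmentation import quickshift

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                          # stand-in for a real RGB image in [0, 1]

# Segment once: each super pixel becomes one interpretable feature.
segments = quickshift(image, kernel_size=4, max_dist=200, ratio=0.2)
n_superpixels = segments.max() + 1

def perturb_image(z_prime, fill_value=0.5):
    """Map an interpretable binary vector to an image with some super pixels grayed out."""
    perturbed = image.copy()
    for s in np.where(z_prime == 0)[0]:
        perturbed[segments == s] = fill_value            # hide ("gray out") super pixel s
    return perturbed

z_prime = rng.integers(0, 2, size=n_superpixels)         # 1 = keep, 0 = hide
perturbed = perturb_image(z_prime)
```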

For tabular data (i.e., matrices), there are differences between categorical and numerical features, but both types depend on the training set. For categorical features (whose interpretable representation is binary), perturbed samples are obtained by sampling values according to the training distribution and setting the binary feature to 1 when the sampled value is the same as that of the instance being explained. For numerical features, the instance being explained is perturbed by sampling from a Normal(0,1) distribution and undoing the mean-centering and scaling (using the means and standard deviations of the training data).
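For numerical tabular features, here is a minimal sketch of this perturbation, assuming access to the training data; the toy data below is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data with three numerical features (stand-in for the real training set).
X_train = rng.normal(loc=[10.0, -2.0, 0.5], scale=[3.0, 1.0, 0.1], size=(500, 3))
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

# Steps 1-2 for numerical tabular data: draw standard normal noise, then undo
# the mean-centering and scaling using the training means and standard deviations.
N = 1000
Z = rng.normal(size=(N, X_train.shape[1])) * sigma + mu
```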

Sparse Linear Explanations

Ribeiro et al. focus on what they call sparse linear explanations, which correspond to specific choices of the explanation family G, the loss function 𝔏 and the regularization term Ω. The choices are not explicitly justified, but some probable reasons can be inferred. They are the following:

  • The class of linear models as the explanation family G: g(z’) = β⋅z’

Linear models let us measure each feature’s influence on a prediction by inspecting both the magnitude and the sign of its coefficient. Therefore, linear models provide a qualitative understanding of the relationship between the explanatory variables and the response, which makes them potentially interpretable models suitable as explanation models.

  • An exponential kernel defined on some distance function D as the weight function: wʸ(z) = exp(−D(y,z)²/σ²)

The choice of using an exponential kernel is not justified. The distance functions used are the cosine similarity measure for text data and the Euclidean distance for images and tabular data. However, the implementations of LIME in Python and R accept any other distance chosen by the user. Finally, even though the choice of the kernel width σ is not discussed, it defaults to σ = 0.75·√p, where p is the number of features.
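A sketch of this weight function with the Euclidean distance and the default width mentioned above, purely for illustration:

```python
import numpy as np

def exponential_kernel(y, z, sigma):
    """Weight of a perturbed sample z given the instance y: exp(-D(y, z)^2 / sigma^2)."""
    d = np.linalg.norm(y - z)                 # Euclidean distance as D
    return np.exp(-(d ** 2) / (sigma ** 2))

p = 10                                        # number of features
sigma = 0.75 * np.sqrt(p)                     # default kernel width mentioned above

y = np.zeros(p)
z = np.ones(p)
print(exponential_kernel(y, z, sigma))
```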

  • Weighted least squares error as the loss function: 𝔏(f,g,wʸ) = ∑ᵢ wʸ(zᵢ) (f(zᵢ) − g(z’ᵢ))²

where {(z’ᵢ, f(zᵢ)) | i=1,…,N} is the dataset of “perturbed” samples with their responses defined above. Quadratic loss functions are often more mathematically tractable than other loss functions and, specifically, the weighted least squares problem has a closed-form solution.
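As a sketch, the closed-form solution β = (Z’ᵀWZ’)⁻¹ Z’ᵀW f can be computed directly with NumPy on made-up perturbed data (in practice, implementations rely on a library solver):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 3
Z_prime = rng.normal(size=(N, K))             # perturbed samples in the interpretable space
responses = Z_prime @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=N)
weights = rng.uniform(0.1, 1.0, size=N)       # proximity weights w^y(z_i)

# Closed-form weighted least squares: beta = (Z'^T W Z')^{-1} Z'^T W f
ZtW = Z_prime.T * weights                     # equivalent to Z'^T @ diag(w)
beta = np.linalg.solve(ZtW @ Z_prime, ZtW @ responses)
print(beta)
```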

  • Limiting the number of non-zero coefficients to K as the regularization term: Ω(g) = ∞·𝟙[∥β∥₀ > K]

where ∥β∥₀=∑ⱼ|βⱼ|⁰ is the L0 “norm” (i.e., the number of non-zero entries of β). The regularization parameter K is crucial for the explanation task because it establishes the number of components that will be presented to the user.

Note that there are two hyperparameters: the kernel width σ and the number of features K used for the explanation.

With the previous choices, the equation that needs to be solved to produce an explanation with LIME becomes:

ξ(y) = argmin_β ∑ᵢ wʸ(zᵢ) (f(zᵢ) − β⋅z’ᵢ)²  subject to ∥β∥₀ ≤ K

This equation cannot be solved directly (the constraint on the number of non-zero coefficients makes the problem combinatorial), so the authors approximate the solution by first selecting K features using a feature selection technique and then estimating the coefficients via weighted least squares.

In the original paper, the authors select K features using the regularization path produced by LASSO (i.e., the estimated coefficients by LASSO as a function of the shrinkage factor) and then estimate the coefficients via weighted least squares. They call this procedure K-LASSO. Nevertheless, implementations of LIME currently support more feature selection algorithms for selecting the K features. Also, all features can be used for the explanation, but it is not recommended unless there are very few features.
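A sketch of K-LASSO-style feature selection using scikit-learn’s lars_path, which traces the LASSO regularization path; selecting the last K features to enter the path is one simple reading of the procedure, and the details may differ from actual LIME implementations.

```python
import numpy as np
from sklearn.linear_model import lars_path

def select_k_features(Z_prime, responses, weights, K):
    """Pick K features using the LASSO regularization path (K-LASSO-style)."""
    # Incorporate the proximity weights by sqrt-rescaling the data, so the
    # unweighted LASSO path approximates the weighted problem.
    sw = np.sqrt(weights)
    Xw = Z_prime * sw[:, None]
    yw = responses * sw

    alphas, active, coefs = lars_path(Xw, yw, method="lasso")
    # Walk the path from least to most regularized until at most K coefficients remain.
    for i in range(coefs.shape[1] - 1, -1, -1):
        nonzero = np.flatnonzero(coefs[:, i])
        if len(nonzero) <= K:
            return nonzero
    return np.array([], dtype=int)

rng = np.random.default_rng(0)
Z_prime = rng.normal(size=(500, 10))
responses = Z_prime[:, 0] * 3 - Z_prime[:, 3] + rng.normal(scale=0.1, size=500)
weights = rng.uniform(0.1, 1.0, size=500)
print(select_k_features(Z_prime, responses, weights, K=2))   # likely [0 3]
```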

In summary, the current approach that LIME uses to approximate the solution of the previous equation and produce an explanation for instance y is:

  1. Generate N “perturbed” samples of the interpretable version of the instance to explain, y’. Let {zᵢ’ ∈ X’ | i=1,…,N} be the set of these observations.
  2. Recover the “perturbed” observations in the original feature space by means of the mapping function. Let {zᵢ ≡ hʸ(zᵢ’) ∈ X | i=1,…,N} be the set in the original representation.
  3. Let the black box model predict the outcome of every “perturbed” observation. Let {f(zᵢ)∈ℝ | i=1,…,N} be the set of responses, and let ℨ = {(z’ᵢ, f(zᵢ)) ∈ X’×ℝ | i=1,…,N} be the dataset of “perturbed” samples with their responses.
  4. Compute the weight of every “perturbed” observation. Let {wʸ(zᵢ)∈ℝ⁺ | i=1,…,N} be the set of weights.
  5. Select K features best describing the black box model outcome from the perturbed dataset ℨ.
  6. Fit a weighted linear regression model (actually, current implementations of LIME fit a weighted ridge regression with the regularization parameter set to 1) to a feature-reduced dataset composed of the K selected features in step 5. If the black box model is a regressor, the linear model will predict the output of the black box model directly. If the black box model is a classifier, the linear model will predict the probability of the chosen class.
  7. Extract the coefficients from the linear model and use them as explanations for the black box model’s local behavior (a minimal sketch of steps 5–7 follows this list).
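A sketch of steps 5–7 on made-up perturbed data; Ridge with alpha=1 mirrors what the post says current implementations do, and the hard-coded selected indices stand in for the output of step 5. Treat the details as illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
Z_prime = rng.normal(size=(500, 10))                         # perturbed samples in X'
responses = Z_prime[:, 0] * 3 - Z_prime[:, 3] + rng.normal(scale=0.1, size=500)
weights = rng.uniform(0.1, 1.0, size=500)                    # w^y(z_i)

selected = np.array([0, 3])                                  # step 5: assume K=2 features were selected

# Step 6: weighted ridge regression (alpha=1) on the K selected features.
g = Ridge(alpha=1.0)
g.fit(Z_prime[:, selected], responses, sample_weight=weights)

# Step 7: the coefficients are the explanation of the local behavior.
for feature, coef in zip(selected, g.coef_):
    print(f"feature {feature}: {coef:+.3f}")
```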

An example of this procedure for images is shown in the image below. Note that, in addition to a black box model (classifier or regressor) f and an instance to explain y (and its interpretable representation y’), the previous procedure requires setting in advance the number of samples N, the kernel width σ and the length of the explanation K.

Explaining the prediction of a classifier with LIME. Source: https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime

In future posts I will show how to explain predictions of ML models with LIME using R and Python.
