
Model Interpretability Using Credit Card Fraud Data

Why model interpretability is important

Recently, I stumbled upon an online book that describes different tools that can be used for machine learning model interpretability (https://christophm.github.io/interpretable-ml-book/). The idea that machine learning models should not be a black box and can be explained fascinated me, and I decided to dive deep into this topic. Previously, when I started working on a new machine learning project, I would follow the same procedure: identifying the problem, getting familiar with the dataset, feature engineering, model selection, training/testing and hyperparameter tuning, and result analysis. However, I didn’t realize that I was missing the most crucial step: model interpretability.

What is model interpretability?

Model interpretability is the process of explaining how a black box (the model) works and how it makes predictions. Let’s imagine a situation where a person applies for a credit loan and is denied because the model made a negative prediction. Anyone would want to know why they were denied and what they could possibly change to turn the decision into a positive one, and all the bank employee can do is point at a machine learning model and say "It said so!". This is not a great scenario, and it damages the bank’s reputation, since it looks like the bank doesn’t have control over its own product. It would be much better if the bank employee could explain to the customer which specific features in their data influenced the decision and what their impact was. That’s where model interpretability comes into play: it can not only explain which features are most important, but also show how much they influenced the model’s decision.

There are different tools that can be used for model interpretability, and you can find a detailed description of each of them in the online book I mentioned above, but in general, such techniques can be classified into two categories: global and local. Global techniques try to explain how features behave in combination across all data points in a model, while local ones focus on a specific sample and walk you through how individual features influenced the model’s decision. In my experience, global methods only give a brief picture of how each feature behaves, while local ones help you focus on particular examples, for instance False Positive or False Negative samples, and understand which features had the biggest impact on the model’s prediction. This is a useful tool that might help you focus on the features that influence the decision the most, rather than blindly changing preprocessing techniques.


Credit Card Fraud detection

To show the capabilities of model interpretability methods, I decided to use a simple dataset from Kaggle, which contains credit card transactions, their features and a label indicating whether each transaction is fraudulent or not.

Getting familiar with a dataset

The dataset contains transactions made by credit cards in September 2013. There are around 280K transactions and 30 features. The dataset itself is quite small, especially in terms of features. Unfortunately, the feature names don’t provide any meaningful information (e.g. V1 or V14).
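As a quick sketch, loading the data could look like this (assuming the CSV is stored locally as creditcard.csv, the file name used on Kaggle):

import pandas as pd

# Load the Kaggle credit card fraud dataset (the file name is an assumption)
df = pd.read_csv('creditcard.csv')
print(df.shape)  # number of transactions and columns
df.head()        # anonymised features V1-V28 plus Time, Amount and the Class target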

After checking for null values (there are none) and generally looking into the dataset, we can inspect the correlation matrix and see whether our features are highly correlated with each other (picture below). We can see that "Time" and "Amount" have a negative correlation with a couple of features; however, the majority of features do not correlate with each other at all. In my opinion, dimensionality reduction is not needed, since there are not that many features and most of them do not correlate with each other.
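A minimal sketch of the null check and the correlation matrix plot, assuming the DataFrame df from above and seaborn for plotting:

import matplotlib.pyplot as plt
import seaborn as sns

# Check for null values (there are none in this dataset)
print(df.isnull().sum().sum())

# Plot the correlation matrix for all columns
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()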

Correlation Matrix

Next, we can look at the features in a bit more detail and plot their distributions (I find this the best way to visually see what data we have and whether there might be any outliers). Below is a picture of the distribution of some of the features. I did not find extreme outliers in this dataset, and the only change to the features themselves worth making is to transform the "Time" feature (in seconds) into an "Hour" feature.
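The distributions and the "Time" to "Hour" transformation can be sketched as follows (the exact transformation used in the article is not shown, so this is one possible version):

import matplotlib.pyplot as plt

# Plot the distribution of every feature
df.hist(bins=50, figsize=(20, 15))
plt.show()

# Transform "Time" (seconds since the first transaction) into an "Hour" feature
df['Hour'] = (df['Time'] // 3600) % 24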

Feature distribution

There are only 492 transactions that are identified as fraudulent, which makes this dataset highly unbalanced. I tried to experiment with changing the distribution using downsampling, upsampling and SMOTE, but nothing helped increase the model’s performance.
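Checking the class balance is straightforward (assuming the same df as above):

import matplotlib.pyplot as plt
import seaborn as sns

# 0 = not fraud, 1 = fraud
print(df['Class'].value_counts())
sns.countplot(x='Class', data=df)
plt.show()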

Distribution of Class Variable

Preprocessing

For preprocessing, I am using an sklearn Pipeline. Since all features are numerical, I apply SimpleImputer and StandardScaler to all of them. LabelBinarizer can be used to encode the class labels, but it is not needed in our case since the target is already encoded. Other than that, there are no preprocessing steps; a minimal sketch is shown below.
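A minimal sketch of this preprocessing, assuming the DataFrame df from above; the split parameters (test size, stratification, random state) are my assumptions, not necessarily the ones used in the article:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Split features and target
X = df.drop(columns=['Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Impute and scale all (numerical) features
preprocessing = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Keep the results as DataFrames so feature names survive for the SHAP/LIME plots later
X_train = pd.DataFrame(preprocessing.fit_transform(X_train),
                       columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(preprocessing.transform(X_test),
                      columns=X.columns, index=X_test.index)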

Model selection and training

I will not go into much detail about the experiments with different models (RandomForest, LogisticRegression, SVM, KNN, etc.), since this is not the goal of this article; I only want to mention that I tried several models with different hyperparameters, and (not surprisingly) XGBoost showed the best results. There is also LightGBM, which uses less processing power to produce similar or even better results than XGBoost, but for simplicity, I decided to go with XGBoost.

I am using the GridSearchCV function, which performs k-fold cross-validation (meaning the data is divided into k subsets or "folds", and k-1 folds are used for training and the remaining one for validation). As my scoring metric, I decided to use the F2 score, which, compared to the F1 score, prioritizes recall over precision, since in our case recall is much more important than precision. I am not using recall alone as a metric because I don’t want to ignore precision completely and max out recall without caring about precision (I did manage to get a recall as high as 0.94, but precision dropped to 0.18). The ROC curve might not be the best metric here either, since for credit card fraud detection problems, in addition to the unbalanced dataset, catching true positives is much more important than avoiding false positives (meaning that if we were to flag 100 transactions as fraudulent, we wouldn’t mind if some percentage of them were falsely flagged).

from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
# Define the model
model = XGBClassifier(use_label_encoder=False)
# Define the parameters to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 4, 6],
    'learning_rate': [0.01, 0.1, 0.2]
}
# Define a custom scorer
f2_scorer = make_scorer(fbeta_score, beta=2)
# Define the grid search
grid_search = GridSearchCV(model, param_grid, cv=5, scoring=f2_scorer, verbose=2)
# Fit the grid search
grid_search.fit(X_train, y_train)
# Get the best parameters
best_params = grid_search.best_params_
print('Best parameters:', best_params)

After getting the best parameters for the XGBoost model, we can build a confusion matrix and see what the final recall and precision are. We can see that there are a few samples that were wrongly labelled as fraudulent (False Positives: 4) and a few more missed fraudulent transactions (False Negatives: 7).
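A sketch of how the confusion matrix and the two metrics can be produced from the fitted grid search (the plotting details are my assumption):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, precision_score, recall_score

# Take the best estimator found by the grid search and predict on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Plot the confusion matrix and print recall/precision
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
print('Recall:   ', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))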

Confusion matrix

Here is where we come to the first model interpretability part, which can be done without using any other libraries. We can plot feature importance directly from the XGBoost model using the code below:

import matplotlib.pyplot as plt
import xgboost as xgb
# Get feature importance from the best estimator found by the grid search
feature_importance = best_model.feature_importances_
# Print feature importance
for i, score in enumerate(feature_importance):
    print(f'Feature: {i}, Importance: {score}')
# Plot feature importance
xgb.plot_importance(best_model)
plt.show()
Feature importance

Features V4 and V14 are the two most important features of the model. However, apart from knowing which features are the most and least important to the model, we can’t do much with this information. We could go back to the preprocessing step and try to remove the least important features or use different preprocessing techniques for the most important ones, but that would be mostly guessing. We don’t know whether features V4 and V14 were the ones that influenced the FP or FN samples the most, nor whether there is any hidden correlation between features that the model can see. That’s where model interpretability tools come into play.

Local Methods

There are two main categories of model interpretability methods: local and global. Global methods describe how features contribute to the model’s outcome and how much they influence the decision overall. Local methods focus on a specific sample and explain why the model made a certain decision for it. I would like to start with local methods rather than global ones, since I found them more fascinating and they help you view features from different angles.

SHAP

The SHAP (SHapley Additive exPlanations) method explains individual predictions by computing the contribution of each feature to the prediction. It computes Shapley values, a concept from cooperative game theory that assigns each player a share of the total payout based on their contribution. In the case of a machine learning model, a "player" is a feature or a set of features (for example, a group of pixels in Computer Vision problems).

To calculate Shapley values, SHAP considers all possible combinations (subsets) of features. For each feature, it computes the difference in the model’s prediction with and without that feature across these combinations. For example, if you have features A, B, and C, then for feature A it would compare {A} vs {}, {A, B} vs {B}, {A, C} vs {C}, and {A, B, C} vs {B, C}. The SHAP value for a feature is then the (weighted) average of these marginal contributions across all subsets, and those values can be either negative or positive. To use SHAP in Python, we can import the library and calculate the Shapley values using the code below. We only need to provide the model and the features (X_test) without the ground truth (y_true), because the goal of SHAP is not to evaluate whether predictions are correct, but to measure how much each feature contributes to the model’s prediction.

import shap
# Create the SHAP explainer
explainer = shap.Explainer(best_model, X_test)
# Calculate SHAP values
shap_values = explainer(X_test)

The SHAP value calculation step might take some time, depending on how big your dataset is. After completion, you get an array of size [n_samples, n_features] (if you have a multiclass classification problem, the shape will be [n_samples, n_features, n_class], and in each plotting call you would need to specify which class you want to plot, e.g. [0, :, 1]). The SHAP library has a lot of different functions that can be used for model interpretability, but I will concentrate only on the ones that I found the most useful. The first one is the "waterfall" function, which shows how much each feature contributed to a single prediction.

shap.plots.waterfall(shap_values[0, :], max_display=5)
SHAP waterfall plot for a not fraudulent prediction

Firstly, we need to understand how to read this plot. The y-axis lists the features in your dataset; you can show more of them by changing the max_display parameter. Each feature on the y-axis is shown together with its actual value in X_test (e.g. 0.899 = V4). The x-axis shows the impact each feature has on the model’s output. Values in red increase the prediction, while values in blue decrease it.

To read the graph, you start from the base value, which is E[f(X)] = -12.138 (around 0.000006 if we apply the sigmoid function). The base value (or expected value) is the average prediction of the model over the entire training dataset when no features are present. This value is the same for all samples (in the case of a binary classification problem). In our case, since we have many more negative predictions, the base value after the sigmoid is close to 0. Then, you add/subtract each feature’s Shapley value, and the final value f(x) is the actual prediction. From the plot above we can see that feature V4 contributed the most to the prediction and pushed it closer to fraudulent, while the "Amount" feature contributed negatively, meaning that its value pushed the model’s decision closer to not fraudulent.
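As a quick sanity check, here is a sketch of converting the log-odds base value from the plot into a probability:

import numpy as np

def sigmoid(x):
    # Converts log-odds (the scale the XGBoost/SHAP output is on) to a probability
    return 1 / (1 + np.exp(-x))

print(sigmoid(-12.138))  # ~0.0000054, the base value as a probability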

Let us look at another example, where the model predicted a transaction to be fraudulent. In the graph below you can see that the base value stays the same, but the final prediction is much larger than before (f(x) = 4.107, or 0.98 after the sigmoid), which indicates that the model is very confident that this transaction is fraudulent. Luckily, the SHAP values help us understand why the model decided so. The magnitude of feature V14 became much larger than in the previous example, and we can see that the feature value is -9.81 compared to the previous 0.19. This indicates that the model will most likely predict a transaction as fraudulent if the value of feature V14 is much lower than the average V14 value. We can also clearly see which other features contributed to the final prediction, either positively or negatively. For example, feature V20 contributed negatively, which means that its value is similar to non-fraudulent transactions.

SHAP waterfall plot for a fraudulent prediction

We can also merge all these contributions onto a single axis and plot a compact representation of the same magnitudes using the force plot below.

shap.plots.force(shap_values[27671, :])
SHAP magnitude in a single plot

This representation is much more compact and helps to visualize how much each feature contributed to the final prediction. You can also pass the link="logit" parameter to convert the log-odds numbers into probabilities.
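For the same sample as above, the call might look like this:

shap.plots.force(shap_values[27671, :], link="logit")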

SHAP magnitude in a single plot (using actual feature value)

We can also do a global analysis and plot the mean absolute SHAP value of each feature across all samples. This graph represents feature importance and should be similar to our earlier feature importance calculation.

shap.plots.bar(shap_values[:, :])
SHAP feature importance

We can also draw a summary plot, which shows the feature distribution, the feature values (high/low) and their magnitude. The y-axis lists the features ranked by their mean absolute SHAP value (as in the graph above). The x-axis represents SHAP values. Positive values for a given feature push the model’s prediction closer to the label being examined (label=1), while negative values push it towards the opposite class (label=0). Basically, if we see two distinct groups of values (red/blue) for a feature, it means that this feature is probably a good indicator of whether a transaction is fraudulent or not in the model’s decision-making process. For example, we can see that when feature V14 is low, its SHAP value (magnitude) is much higher, pushing the model towards predicting the transaction as fraudulent. On the other hand, when the value is high, it is an indication to the model that the transaction is most likely not fraudulent.

shap.summary_plot(shap_values[:,:], X_test)
SHAP value impact on the model output

Next, we can use the dependence plot, which shows the relationship between two features. In the example below, I use auto-detection, so the function decides which feature interacts most strongly with the specified feature (V14), but you can choose the second feature yourself. In the graph below, we can see that when feature V14 is lower than -5 (x-axis), its SHAP value (magnitude) increases drastically (y-axis), which contributes positively to the final prediction. The colour of each point (red/blue) indicates the value of the interacting feature (in this case V4). We can see that, most likely, when feature V14 contributes the most positively to the final prediction, feature V4 will contribute positively as well.

shap.dependence_plot('V14', shap_values.values[:, :], X_test, interaction_index='auto')
SHAP dependency plot for V14-V4 features

We can also plot the dependence against another feature (V14 – V7), and we can see that feature V7 contributes positively only when the V14 feature is roughly between -7 and 3; otherwise, its contribution is negative.
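The plot below can be produced with the same function, just fixing the interaction feature instead of relying on auto-detection (my assumption about how it was generated):

shap.dependence_plot('V14', shap_values.values[:, :], X_test, interaction_index='V7')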

SHAP dependency plot for V14-V7 features

However, what I like to do is plot the distribution of a feature’s values against their SHAP values and highlight which class each point belongs to. We can see that when the value of feature V14 is below -5, most of the transactions can be labelled as fraudulent, with a couple of exceptions. We can also see that, for the model, any value below -2 always affects the final prediction positively, pushing it closer to being predicted as fraudulent.
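A sketch of such a plot, assuming y_test is aligned with X_test and shap_values comes from the explainer above:

import matplotlib.pyplot as plt

# Scatter the actual V14 values against their SHAP values, coloured by the true class
v14_idx = list(X_test.columns).index('V14')
plt.scatter(X_test['V14'], shap_values.values[:, v14_idx],
            c=y_test, cmap='coolwarm', alpha=0.5)
plt.xlabel('V14 value')
plt.ylabel('SHAP value for V14')
plt.colorbar(label='Class (0 = not fraud, 1 = fraud)')
plt.show()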

Feature V14 distribution over SHAP value with highlighted target

In case you want to see how the SHAP values of different samples look in the same graph, the best method is to use the heatmap function. You will be able to see each instance (x-axis) and the SHAP values of each feature (y-axis), and whether they are positive or negative (red/blue). For instance, we can see that there are a couple of instances with very high SHAP values for V4 (instances 10–15) and some with relatively low SHAP values (instances 21–65). We can always go back to the original dataset and look at those samples in more detail.

shap.plots.heatmap(shap_values[0:100, :])
SHAP values heatmap plot

LIME

The next local method is called LIME (Local Interpretable Model-agnostic Explanations). LIME can be used with any model, since it treats the model as a black box. Instead of considering the entire dataset, LIME creates variations of the instance of interest by slightly perturbing its feature values and monitoring the impact on the output. These perturbed samples are passed through the black-box model to get their predictions, and LIME assigns higher weights to samples that are more similar to the original instance. Using these weighted samples, LIME trains an interpretable model, such as a linear regression or a decision tree, to approximate the black-box model’s behaviour locally. This surrogate model helps in understanding how the features influence the prediction for this specific instance.

# Import the LimeTabularExplainer module
from lime.lime_tabular import LimeTabularExplainer
# Get the class names
class_names = ['Not fraud', 'Fraud']
# Get the feature names
feature_names = list(X_train.columns)
# Fit the Explainer on the training data set using the LimeTabularExplainer
explainer = LimeTabularExplainer(X_train.values, feature_names=feature_names,
                                 class_names=class_names, 
                                 mode = 'classification') # classification or regression
exp = explainer.explain_instance(X_test.iloc[0], best_model.predict_proba, num_features=30)
exp.show_in_notebook(show_table=True)
LIME output

There are three main parts to the LIME plot. The first one (left) shows a bar chart with the classes being predicted and their probabilities. The centre part shows the feature thresholds that influence the model’s prediction; for example, V4 > 0.52 means that feature V4 being greater than 0.52 pushes the prediction. On the right side, you can see the list of features with their actual values. The colour indicates whether a feature contributes more towards the fraudulent class (orange) or not (blue). The value next to each feature is its weight, where a higher value means a bigger impact on the prediction.

Individual Conditional Expectation (ICE)

ICE plots show a line for each instance, illustrating how its prediction changes when one feature changes. They take a subset (or the full set) of the data and, for each instance, vary the chosen feature over a grid of values while keeping the other features fixed, recording how the prediction changes.

To illustrate how it works, I will take the sample described previously in the SHAP waterfall section, which the model predicted as fraudulent and where feature V14 had the biggest impact. Since we know that the V14 feature impacts the prediction significantly, it would be interesting to know what would happen if this value were different. For faster calculation, we can limit the amount of data passed to the function. You can’t pass just one data point, since the grid of feature values is built from the data you provide.

from sklearn.inspection import partial_dependence
pdp_lines = partial_dependence(best_model, 
                               X_test[:30000],
                               ['V14'],
                               percentiles=(0, 1),
                               grid_resolution=100,
                               kind='individual')

After the calculation, we can take the values for the sample described above and plot them, as sketched below.
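A sketch of the plotting step, assuming the fraudulent sample sits at position 27671 (the index used for the force plot above); note that older sklearn versions expose the grid under 'values' instead of 'grid_values':

import matplotlib.pyplot as plt

grid = pdp_lines['grid_values'][0]            # the grid of V14 values
ice_line = pdp_lines['individual'][0][27671]  # ICE line for the sample of interest

plt.plot(grid, ice_line)
plt.xlabel('V14')
plt.ylabel('Predicted probability of fraud')
plt.title('ICE line for the sample predicted as fraudulent')
plt.show()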

ICE plot for the sample that was predicted as fraudulent

We can see that if feature V14 were greater than around -2 or -1, the model’s predicted probability would drastically decrease. Furthermore, if the value were > 0, the transaction would be predicted as not fraudulent. The ICE technique helps to understand how significant a particular feature is for a given prediction and how much the probability would change if its value were different. This also reveals another problem, connected to a potential security threat: if an attacker knew that the model heavily relies on feature V14 and that the outcome probability depends on whether V14 is lower or greater than 0, they could potentially manipulate their data and bypass this machine learning model.

Global Methods

There are a couple of global methods that can be useful; however, as you have seen, some of the local methods can also be used to extract global feature behaviour, so I will not spend too much time on these methods.

Feature importance

The simplest one is feature importance, which shows how important a particular feature is for the model’s predictions. In XGBoost it can be extracted using feature_importances_ (described above), and other classic machine learning models have similar attributes. For neural network models, you can iterate over each feature, randomly shuffle its values and feed the data to the model to see how the predictions change (permutation importance); a sketch is shown below. This method gives you a general idea of which features influence the model’s predictions the most.
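A model-agnostic sketch using scikit-learn’s permutation importance (the scoring choice is my assumption):

from sklearn.inspection import permutation_importance

# Shuffle each feature a few times and measure how much the score drops
result = permutation_importance(best_model, X_test, y_test,
                                scoring='recall', n_repeats=5, random_state=42)
for name, score in sorted(zip(X_test.columns, result.importances_mean),
                          key=lambda pair: -pair[1])[:10]:
    print(f'{name}: {score:.4f}')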

Partial Dependence Plot (PDP)

PDP is very similar to the ICE plots described above, with the only difference being that it computes the values for all instances and plots the average. Unfortunately, for an unbalanced dataset this plot doesn’t provide a lot of meaningful information, since the average is shifted towards the majority class. However, below is a PDP graph computed on a subset. We can see that the average prediction decreases as the V14 feature increases, which is consistent with what we saw in the local methods.

from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(best_model, X_test[27680:27690], ['V14'],
                                        kind='both')
PDP average plot for feature V14

You can also use it to plot the dependence between different features and see which combination of feature values produces the highest probability. Basically, the partial dependence function tells us, for given values of the features, what the average marginal effect on the prediction is.

from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(best_model, X_test[27670:27710], ['V4', 'V14', ['V4', 'V14']])
PDP plot for features V4 and V14

Accumulated Local Effects (ALE)

The final method that I want to cover is called ALE (Accumulated Local Effects). It is very similar to PDP; however, if features are correlated, the partial dependence plot cannot be trusted, because in real life it would be extremely unlikely for one feature to change while the other stays the same. Imagine you have two features describing an apartment: the apartment size and the number of rooms. PDP may show that if we keep increasing the number of rooms, the price will increase as well. However, PDP doesn’t take into consideration that we can’t increase the number of rooms indefinitely without increasing the apartment size. ALE fixes this problem by limiting the potential change for each data point. Unfortunately, it is hard to show how ALE works on the dataset used in this article, but you can learn more about it here.

Conclusion

In my opinion, the SHAP and ICE methods are the most fascinating ones; they can help you better understand the samples that the model predicted wrongly and give you an idea of how each feature contributes to the final prediction. I have used a binary classification problem in this article; for a multiclass problem the code will look a bit different, but the main idea stays the same. Hopefully, you liked this article and learnt something new today.

Cheers!


Unless otherwise noted, all images are by the author

