Ensemble Learning case study: Model Interpretability

Gabriel Signoretti
Towards Data Science
14 min read · Oct 15, 2019


This is the first of a two-part article in which we will explore the 1994 census income dataset, which contains information such as age, years of education, marital status, race, and many others. We will use this dataset to classify people's income into 2 categories: people who make less than or equal to $50K a year (encoded as 0) and people who make more than $50K a year (encoded as 1).

In this first part, we will use this dataset to compare the performance of a simple decision tree to that of ensemble methods. Later on, we will also explore some tools to help us interpret why the models made the decisions that they did, following some principles of eXplainable Artificial Intelligence (XAI).

The first thing we need to do is take a look at the chosen dataset to get to know it a little better. So, let's get right to it!

1. Preparing the data

We will be working with a pre-processed version of the dataset free of null values. First of all, we load the basic libraries and the dataset itself and take a look at the dataframe info:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
# load the dataset
income = pd.read_csv("income.csv")
income.info()
Figure 1: DataFrame information

As we can see, the dataset has 32561 observations and 15 columns, 14 of them being features and one the target variable (high_income). Some of those features are categorical (type object) and some are numeric (type int64) so we will need to perform different preprocessing steps for them.

To accomplish this, we will create two Pipelines: one will perform the preprocessing steps on the categorical features and the other on the numerical ones. Then, we will use FeatureUnion to join these two pipelines together to form the final preprocessing pipeline. For this and the following steps in this article, we need to import the necessary modules from the scikit-learn library:

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

Using the BaseEstimator and the TransformerMixin classes, we can create two custom transformers to put in our pipeline: one to select the columns belonging to each feature group and another to preprocess the categorical features. Both of these transformers can be seen below:

# Custom transformer that extracts the columns passed as argument
class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class constructor
    def __init__(self, feature_names):
        self.feature_names = feature_names

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    # Method that describes what we need this transformer to do
    def transform(self, X, y=None):
        return X[self.feature_names]

# Converts the categorical features to numeric codes
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes a boolean as its argument
    def __init__(self, new_features=True):
        self.new_features = new_features

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    # Transformer method we wrote for this transformer
    def transform(self, X, y=None):
        df = X.copy()
        if self.new_features:
            # Treat '?' workclass as unknown
            df['workclass'] = df['workclass'].replace('?', 'Unknown')
            # Too many category levels: collapse native_country to just US and non-US
            df.loc[df['native_country'] != ' United-States', 'native_country'] = 'non_usa'
            df.loc[df['native_country'] == ' United-States', 'native_country'] = 'usa'
        # Convert all columns to their categorical integer codes
        for name in df.columns.to_list():
            col = pd.Categorical(df[name])
            df[name] = col.codes
        # Return the encoded DataFrame
        return df

With these custom transformers at hand, we can build the preprocessing pipeline. The first thing we need to do is create the X feature matrix and the y target vector:

# Create the X feature matrix and the y target vector
X = income.drop(labels="high_income", axis=1)
y = income["high_income"]
# the only step necessary to be done outside of pipeline
# convert the target column to categorical
col = pd.Categorical(y)
y = pd.Series(col.codes)
# global variables
seed = 108

After that, we extract the categorical and numerical feature names and create the 2 pipelines, after defining the steps to be used in each of them. For the categorical pipeline, we use the FeatureSelector to select only the categorical columns, followed by the CategoricalTransformer to transform the data into the desired format; as for the numerical pipeline, we also use the FeatureSelector, this time to select only the numerical features, followed by a StandardScaler to normalize the data. The code can be seen below:

# get the categorical feature names
categorical_features = X.select_dtypes("object").columns.to_list()
# get the numerical feature names
numerical_features = X.select_dtypes("int64").columns.to_list()
# create the steps for the categorical pipeline
categorical_steps = [
    ('cat_selector', FeatureSelector(categorical_features)),
    ('cat_transformer', CategoricalTransformer())
]
# create the steps for the numerical pipeline
numerical_steps = [
    ('num_selector', FeatureSelector(numerical_features)),
    ('std_scaler', StandardScaler()),
]
# create the 2 pipelines with the respective steps
categorical_pipeline = Pipeline(categorical_steps)
numerical_pipeline = Pipeline(numerical_steps)

Now we can use the FeatureUnion class to join the two pipelines horizontally; that way, we end up with only one final preprocessing pipeline:

pipeline_list = [
    ('categorical_pipeline', categorical_pipeline),
    ('numerical_pipeline', numerical_pipeline)
]
# Combining the 2 pipelines horizontally into one full pipeline
preprocessing_pipeline = FeatureUnion(transformer_list=pipeline_list)

And that's it! Now, all you have to do to perform all the preprocessing steps on your data is call the fit_transform method of the preprocessing_pipeline object, passing the X matrix as an argument. Simple and concise!
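
As a quick sanity check, this is what that single call looks like (a minimal sketch, reusing the objects defined above):

# run the whole preprocessing in a single call (illustrative)
X_processed = preprocessing_pipeline.fit_transform(X)
# same number of rows as before, with all 14 features now numeric
print(X_processed.shape)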

2. Training the first model

Before training any machine learning model we have to split the data into a train and a test set. To do this we use the train_test_split function:

# split-out train/validation and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed, shuffle=True, stratify=y)

The first model we will use is a simple DecisionTreeClassifier. To fully leverage the power of Pipelines, we can create a full pipeline, passing the preprocessing pipeline as the first step and the desired classification model as the second step:

# we pass the preprocessing pipeline as a step to the full pipeline
full_pipeline_steps = [
    ('preprocessing_pipeline', preprocessing_pipeline),
    ('model', DecisionTreeClassifier(random_state=seed))
]
# create the full pipeline object
full_pipeline = Pipeline(steps=full_pipeline_steps)

This way, if we want to try different models (as we will later on), all we have to do is update the 'model' step of this pipeline!
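
For example (an illustrative sketch), swapping in a different classifier is a one-liner thanks to the named 'model' step:

# replace the classifier without rebuilding the rest of the pipeline
full_pipeline.set_params(model=RandomForestClassifier(random_state=seed))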

The next step is to use RandomizedSearchCV to perform the model hyperparameter tuning. Randomized search is not as thorough as a regular grid search, as it does not test every possible combination of hyperparameters. On the other hand, it is computationally cheaper, enabling us to achieve 'good enough' results on lower-end hardware while tuning multiple hyperparameters. With the n_iter parameter, we limit the number of iterations to 50.

We will also use StratifiedKFold to perform cross-validation. Unlike the regular KFold, it preserves the class distribution across folds, potentially leading to better results.
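
To see the difference (a minimal sketch, reusing the X, y, and seed defined above), we can compare the positive-class rate in the first training fold produced by each splitter:

# compare the class balance in the first fold of each splitter (illustrative)
for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=seed)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=seed))]:
    train_idx, _ = next(iter(cv.split(X, y)))
    # the stratified splitter should match the overall positive rate almost exactly
    print(name, round(y.iloc[train_idx].mean(), 4))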

The final step is to build a param_grid dictionary with the hyperparameters that we need to tune. The full code can be seen below. Notice how we pass the full_pipeline object as the RandomizedSearchCV estimator and then call the fit method on the resulting object as we would with any other sklearn model. This way, when we want to try other models, all we have to do is change the model on the pipeline and create a new parameter grid to pass, simple as that!

# Create the grid search parameter grid and scoring functions
param_grid = {
    "model": [DecisionTreeClassifier(random_state=seed)],
    "model__criterion": ["gini", "entropy"],
    "model__splitter": ["best", "random"],
    "model__max_leaf_nodes": [16, 64, 128, 256],
    "model__max_depth": np.arange(1, 33)  # integer depths from 1 to 32
}
scoring = {
    'AUC': 'roc_auc',
    'Accuracy': make_scorer(accuracy_score)
}
# create the KFold object (shuffle must be enabled when a random_state is set)
num_folds = 10
kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)
# create the grid search object with the full pipeline as estimator
n_iter = 50
grid = RandomizedSearchCV(
    estimator=full_pipeline,
    param_distributions=param_grid,
    cv=kfold,
    scoring=scoring,
    n_jobs=-1,
    n_iter=n_iter,
    refit="AUC"
)
# fit grid search
best_model = grid.fit(X_train, y_train)

For the DecisionTreeClassifier we are tuning the following parameters:

  • criterion: defines the function to measure the quality of the splits on the tree nodes;
  • splitter: defines the strategy used to choose the split at each tree node;
  • max_leaf_nodes: limits the maximum number of leaf nodes in the tree;
  • max_depth: limits the maximum depth that the tree can grow;
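
Before measuring its performance, we can also peek at the winning combination the search found (a small sketch, following the same pattern we will use for the ensemble models below):

# inspect the best score and hyperparameter combination (illustrative)
print(f'Best score: {best_model.best_score_}')
print(f'Best model: {best_model.best_params_}')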

After the randomized search finishes and the best model is fitted to our data, we can make predictions and measure the model's performance:

# final Decision Tree model
pred_test = best_model.predict(X_test)
pred_train = best_model.predict(X_train)
print('Train Accuracy: ', accuracy_score(y_train, pred_train))
print('Test Accuracy: ', accuracy_score(y_test, pred_test))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, pred_test))
print('\nClassification Report:')
print(classification_report(y_test, pred_test))
Figure 2: Decision tree performance

3. Ensemble methods

Ensemble methods are models that combine several base models in order to produce one optimal predictive model. The RandomForestClassifier is a classic example of this, as it combines several simpler decision trees to generate its output. With this, it is possible to overcome several limitations of the decision tree model, such as its tendency to overfit.

3.1. Bagging Classifier

The first ensemble method that we are going to work with is the BaggingClassifier. The name bagging comes from bootstrap aggregating, and it works by drawing several random subsamples of the data, each of which is used to train one individual base estimator. The sampling can be performed in one of two ways: with replacement, meaning that the same sample can be drawn several times for the same estimator; or without replacement, meaning that each sample can be drawn at most once per estimator (this approach is called pasting).
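
In scikit-learn both variants live in the same class, and the bootstrap flag switches between them (a minimal sketch, reusing the seed defined above):

# bagging: each estimator sees a bootstrap sample drawn with replacement (illustrative)
bagging = BaggingClassifier(DecisionTreeClassifier(), bootstrap=True, random_state=seed)
# pasting: samples are drawn without replacement, so max_samples < 1.0 is needed to actually subsample
pasting = BaggingClassifier(DecisionTreeClassifier(), bootstrap=False, max_samples=0.8, random_state=seed)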

Bagging generally yields better results than pasting, and it also has a neat trick up its sleeve: Out-of-Bag (OOB) evaluation. Since samples are drawn with replacement, each base estimator ends up never seeing some of the training samples, roughly 37% of them on average. These 'out-of-bag' samples act as a free validation set for that estimator, which means we can use them for further model evaluation!
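
As a quick back-of-the-envelope check, the probability that a given sample is never drawn into a bootstrap sample of size n is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n:

# expected fraction of out-of-bag samples per estimator (illustrative)
n = len(X_train)
print((1 - 1/n) ** n)  # ~0.368, i.e. about 37% of the training samples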

So, with all that in mind, we are going to update our previous pipeline to work with the BaggingClassifier, using a DecisionTree as the base estimator. To do that, all we have to do is change the 'model' step in the pipeline definition and redefine the RandomizedSearchCV parameter search space, as can be seen in the code snippet below:

# we pass the preprocessing pipeline as a step to the full pipeline
full_pipeline_steps = [
    ('preprocessing_pipeline', preprocessing_pipeline),
    ('model', BaggingClassifier(
        # "sqrt" is the classifier equivalent of the now-removed "auto" setting
        DecisionTreeClassifier(max_features="sqrt", splitter="random", max_leaf_nodes=128, random_state=seed),
        random_state=seed
    ))
]
# create the full pipeline object
full_pipeline = Pipeline(steps=full_pipeline_steps)
# Create the grid search parameter grid
param_grid = {
    "model": [BaggingClassifier(
        DecisionTreeClassifier(max_features="sqrt", splitter="random", max_leaf_nodes=128, random_state=seed),
        random_state=seed
    )],
    "model__n_estimators": np.arange(100, 1000, 100),
    "model__max_samples": [0.8, 1.0],
    "model__max_features": [0.8, 1.0],
    "model__bootstrap": [True],
    "model__oob_score": [True],
}
scoring = {
    'AUC': 'roc_auc',
    'Accuracy': make_scorer(accuracy_score)
}
# create the KFold object (shuffle must be enabled when a random_state is set)
num_folds = 10
kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)
# create the grid search object with the full pipeline as estimator
n_iter = 25
grid = RandomizedSearchCV(
    estimator=full_pipeline,
    param_distributions=param_grid,
    cv=kfold,
    scoring=scoring,
    n_jobs=-1,
    n_iter=n_iter,
    refit="AUC"
)

# fit grid search
best_bag = grid.fit(X_train, y_train)

For the BaggingClassifier we are tuning the following parameters:

  • n_estimators: defines the total number of estimators (in this case decision trees) to use;
  • max_samples: The maximum percentage of samples to draw from X to train each base estimator;
  • max_features: The maximum percentage of features to draw from X to train each base estimator;
  • bootstrap: When set to False, draw samples without replacement (Pasting). When set to True, draw samples with replacement (bagging). Since we are going to use the out-of-bag score, we have to set this parameter to True;
  • oob_score: When set to True, return the out-of-bag score of the best model;

After the randomized search finishes and the best model is fitted to our data, we can check the tuned model hyperparameters, make predictions, and measure the model's performance:

print(f'Best score: {best_bag.best_score_}')
print(f'Best model: {best_bag.best_params_}')
Figure 3: Best Bagging model
# the fitted bagging model lives inside the refit best_estimator_ pipeline
bag_model = best_bag.best_estimator_.named_steps['model']
pred_test = best_bag.predict(X_test)
pred_train = best_bag.predict(X_train)
print('Train Accuracy: ', accuracy_score(y_train, pred_train))
print('Test Accuracy: ', accuracy_score(y_test, pred_test))
print("Out-of-Bag Accuracy: ", bag_model.oob_score_)
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, pred_test))
print('\nClassification Report:')
print(classification_report(y_test, pred_test))
Figure 4: Bagging performance
Figure 4: Bagging performance

As we can see, all 3 accuracies achieved by the model are very close to each other, indicating that the model generalizes well to new data and thus is likely not overfitting.

3.2. Random Forest Classifier

The last model we will be training is the RandomForestClassifier. The process to use it is the same as the one presented above and can be seen in the code snippet below:

# we pass the preprocessing pipeline as a step to the full pipeline
full_pipeline_steps = [
    ('preprocessing_pipeline', preprocessing_pipeline),
    ('model', RandomForestClassifier(random_state=seed))
]
# create the full pipeline object
full_pipeline = Pipeline(steps=full_pipeline_steps)
# Create the grid search parameter grid and scoring functions
param_grid = {
    "model": [RandomForestClassifier(random_state=seed)],
    "model__max_depth": np.arange(1, 33),  # integer depths from 1 to 32
    "model__n_estimators": np.arange(100, 1000, 100),
    "model__criterion": ["gini", "entropy"],
    "model__max_leaf_nodes": [16, 64, 128, 256],
    "model__oob_score": [True],
}
scoring = {
    'AUC': 'roc_auc',
    'Accuracy': make_scorer(accuracy_score)
}
# create the KFold object (shuffle must be enabled when a random_state is set)
num_folds = 10
kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)
# create the grid search object with the full pipeline as estimator
n_iter = 50
grid = RandomizedSearchCV(
    estimator=full_pipeline,
    param_distributions=param_grid,
    cv=kfold,
    scoring=scoring,
    n_jobs=-1,
    n_iter=n_iter,
    refit="AUC"
)
# fit grid search
best_rf = grid.fit(X_train, y_train)

For the RandomForestClassifier we are tuning the following parameters:

  • criterion: defines the function to measure the quality of the splits on the tree nodes;
  • max_leaf_nodes: limits the maximum number of leaf nodes in the trees;
  • max_depth: limits the maximum depth that the trees can grow;
  • n_estimators: defines the total number of trees to use in the forest;
  • oob_score: When set to True, return the out-of-bag score of the best model;

As we did before, after the randomized search finishes and the best model is fitted to our data, we can make predictions and measure the model’s performance:

print(f'Best score: {best_rf.best_score_}')
print(f'Best model: {best_rf.best_params_}')
Figure 5: Best Random Forest model
# the fitted forest lives inside the refit best_estimator_ pipeline
rf_model = best_rf.best_estimator_.named_steps['model']
pred_test = best_rf.predict(X_test)
pred_train = best_rf.predict(X_train)
print('Train Accuracy: ', accuracy_score(y_train, pred_train))
print('Test Accuracy: ', accuracy_score(y_test, pred_test))
print("Out-of-Bag Accuracy: ", rf_model.oob_score_)
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, pred_test))
print('\nClassification Report:')
print(classification_report(y_test, pred_test))
Figure 6: Random Forest performance
Figure 6: Random Forest performance

Similar to the BaggingClassifier, the accuracies are very close to each other, indicating good generalization to unseen data and, thus, no overfitting. As this was the best-performing model we found, we will use it as the example for the rest of the article.

4. Model interpretability

In some cases, getting good predictions from your model is arguably as important as understanding why it gives the answers that it does. Those are the cases where we can leverage the concepts of eXplainable Artificial Intelligence (XAI) and try to make our models more interpretable to humans.

One way to do this is by analyzing the importance of each feature to the model's outcome. Luckily for us, the random forest has a built-in attribute that can tell us just that: it's called feature_importances_. The steps to access and plot this information are presented in the code below:

# get the fitted random forest and preprocessing pipeline from the refit best estimator
rf_model = best_rf.best_estimator_.named_steps['model']
fitted_preprocessing = best_rf.best_estimator_.named_steps['preprocessing_pipeline']
# FeatureUnion outputs the categorical columns first, then the numerical ones
features = np.array(categorical_features + numerical_features)
# Transforming the test data (transform only: the pipeline was already fitted on the training set)
new_X_test = fitted_preprocessing.transform(X_test)
new_X_test = pd.DataFrame(new_X_test, columns=features)
# get the predictions from the random forest object
y_pred = rf_model.predict(new_X_test)
# get the feature importances
importances = rf_model.feature_importances_
# sort the indexes
sorted_index = np.argsort(importances)
sorted_importances = importances[sorted_index]
sorted_features = features[sorted_index]
# plot the feature importances using a horizontal bar plot
fig, ax = plt.subplots()
ax.barh(sorted_features, sorted_importances)
ax.set_xlabel('Importances')
ax.set_ylabel('Features')
Figure 7: Feature importances

This is an example of a simple, yet effective, way to gain more insight into both your data and your model's reasoning. With the plot, we can see that the most important features were education_num, capital_loss, fnlwgt, capital_gain, and race.

To go even further, we will briefly explore two third-party libraries that enable us to get different visualizations: ELI5 and SHAP.

4.1. Model interpretation with ELI5

ELI5 is a Python package that helps to debug machine learning classifiers and explain their predictions. It provides support for some machine learning libraries, including scikit-learn and XGBoost.

With it, we are able to look at a classification model in two main ways: the first is to inspect the model parameters and analyze how the model works globally (similarly to the default feature importance attribute); the second is to inspect individual predictions to figure out why the model makes the decisions that it does.

For the first use case we use the show_weights() function as shown in the code snippet below:

import eli5
# the fitted random forest from the refit best estimator, and the pipeline-ordered feature names
rf_model = best_rf.best_estimator_.named_steps['model']
features = np.array(categorical_features + numerical_features)
eli5.show_weights(rf_model, feature_names=features)
Figure 8: ELI5 feature weights

As we can see in the image above, the results are pretty similar to the ones we obtained from the tree's built-in feature importances.

As for the second use case, we can use the explain_prediction() function to inspect and analyze individual predictions of the model. To test this out, we checked a true negative prediction (actual value was 0 and the predicted value was also 0) and a true positive prediction (actual value was 1 and the predicted value was also 1):

# predicting a person earns less than 50k/year (true negative)
index = 4
print('Actual Label:', y_test.iloc[index])
print('Predicted Label:', y_pred[index])
eli5.explain_prediction(rf_model, new_X_test.iloc[index], feature_names=features)
# predicting a person earns more than 50k/year (true positive)
index = 7
print('Actual Label:', y_test.iloc[index])
print('Predicted Label:', y_pred[index])
eli5.explain_prediction(rf_model, new_X_test.iloc[index], feature_names=features)
Figure 9: ELI5 true negative prediction example
Figure 10: ELI5 true positive prediction example

So, the most influential features contributing to the model's predictions for these particular observations were, respectively, race and education_num.

4.2. Model interpretation with SHAP

The SHAP (SHapley Additive exPlanations) library is a unified approach to explaining the output of any machine learning model. Similarly to ELI5, it has support for several machine learning libraries, including scikit-learn and XGBoost.

To use its functionality for our Random Forest model we first need to create a TreeExplainer object and obtain the so-called shap_values for the model. This process is shown in the code below:

import shap
shap.initjs()
# Create the explainer object
explainer = shap.TreeExplainer(rf_model)
print('Expected Value:', explainer.expected_value)
# get the shap values from the explainer
shap_values = explainer.shap_values(new_X_test)

As we did with ELI5, we can also use the SHAP library to explain individual model predictions, as shown below for the same data points that we worked with previously:

# predicting a person earns less than 50k/year (true negative)
shap.force_plot(explainer.expected_value[0],
                shap_values[0][4], X_test.iloc[4])
# predicting a person earns more than 50k/year (true positive)
shap.force_plot(explainer.expected_value[1],
                shap_values[1][7], X_test.iloc[7])
Figure 11: SHAP true negative prediction example
Figure 12: SHAP true positive prediction example

In addition, it is also possible to visualize multiple predictions at once, as shown below for the first 1000 samples of the dataset:

shap.force_plot(explainer.expected_value[0],
                shap_values[0][:1000, :], X_test.iloc[:1000, :])
Figure 13: SHAP prediction explanation for the first 1000 samples

In the same plot, it is also possible to analyze the impact of different features on the final model prediction. To test this, we configured the plot (via its interactive dropdown menus) to show the effect of the education_num feature on the same samples:

Figure 14: SHAP education_num effect for the first 1000 samples
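
Since the feature shown above is picked interactively in the force plot, there is no extra code behind that figure. For a static alternative, a minimal sketch using SHAP's dependence_plot (our substitution here, not part of the original walkthrough) would be:

# SHAP dependence plot for education_num on the >50K class (illustrative)
# shap_values[1] holds the SHAP values for the positive class
shap.dependence_plot("education_num", shap_values[1][:1000, :], new_X_test.iloc[:1000, :])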

Finally, we can use the summary_plot function to plot the feature importances divided by class:

shap.summary_plot(shap_values, X_test)
Figure 15: SHAP feature importances by class

We can see that the result is very similar to the ones obtained both from the tree's built-in feature importances and from the ELI5 library.

5. Conclusion

We barely scratched the surface of several important topics in the machine learning landscape, such as Pipelines, hyperparameter tuning, ensemble methods, and model interpretability. There is much more to cover!

In the next part of this series, we are going to take a look at the new 'cool kid on the block' of ensemble methods: Gradient Boosting. We will take a look at the implementation provided by the XGBoost library, so stay tuned!
