Boruta SHAP: A Tool for Feature Selection Every Data Scientist Should Know

How we can use Boruta and SHAP to build an amazing feature selection process — with Python examples

Vinícius Trevisan
Towards Data Science


Original by Noah Näf on Unsplash

When building a machine learning model, we know that having too many features brings issues such as the curse of dimensionality, in addition to the need for more memory, processing time, and computing power.

In our feature engineering pipelines we employ feature selection techniques to remove the less useful features from our datasets. This raises a problem: how can we determine which features are useful?

For this task we can use Boruta, a feature selection algorithm based on a statistical approach. It relies on two principles: shadow features and binomial distributions.

1. Shadow Features

The first step of the Boruta algorithm is to evaluate the feature importances. This is usually done with tree-based models, but in Boruta the features do not compete among themselves: they compete with randomized versions of themselves, called “shadow features”.

Say we have a dataset with 3 features and 100 observations. We make a copy of the dataset and shuffle each feature column. The permuted columns are called “shadow features” (cool name, by the way), and together with the originals they form a new dataset, the Boruta dataset, that joins the 3 original features with the 3 shadow features.

Image by author
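To make the idea concrete, here is a minimal sketch of how such a Boruta dataset could be assembled with pandas. This is just an illustration, not the BorutaPy implementation, and the column names are made up:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy dataset: 100 observations and 3 features (the names are made up)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["feat_1", "feat_2", "feat_3"])

# Shadow features: a shuffled copy of every column, which destroys any
# relationship with the target while keeping the same distribution
X_shadow = pd.DataFrame(
    {f"shadow_{c}": rng.permutation(X[c].to_numpy()) for c in X.columns}
)

# The Boruta dataset: the 3 original features plus the 3 shadow features
X_boruta = pd.concat([X, X_shadow], axis=1)
print(X_boruta.columns.tolist())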

Now we evaluate the feature importances of all 6 features using any method of preference, such as a RandomForest. The idea of the Boruta algorithm is to select features that perform better than pure randomness, represented here by the shadow features, so we compare the importance of each original feature with the highest feature importance among the shadow features. Every time a feature scores higher than this threshold, we call it a “hit”.

We could now keep only the hits and discard the rest, right? But what if some features were discarded out of pure bad luck? The answer lies in iteration, and that is the next step.

2. Binomial Distributions

Each feature has only two possible outcomes in a run: “hit” or “not hit”. We can therefore repeat the previous step several times and build a binomial distribution of hits for each feature.
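As an illustration of this loop, here is a rough, self-contained sketch of the hit-counting procedure on a toy regression dataset. Again, this is not the BorutaPy implementation; the feature names and the RandomForest settings are arbitrary choices for the example:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data: 100 observations, 3 features, only 2 of them actually informative
X_arr, y = make_regression(n_samples=100, n_features=3, n_informative=2, random_state=0)
X = pd.DataFrame(X_arr, columns=["feat_1", "feat_2", "feat_3"])

n_iterations = 20
hits = pd.Series(0, index=X.columns)

for i in range(n_iterations):
    # Fresh shadow features (shuffled copies of the originals) at every iteration
    X_shadow = pd.DataFrame(
        {f"shadow_{c}": rng.permutation(X[c].to_numpy()) for c in X.columns}
    )
    X_boruta = pd.concat([X, X_shadow], axis=1)

    # Feature importances of all 6 columns, here from a RandomForest
    model = RandomForestRegressor(n_estimators=100, random_state=i).fit(X_boruta, y)
    importances = pd.Series(model.feature_importances_, index=X_boruta.columns)

    # Threshold: the highest importance among the shadow features
    threshold = importances[X_shadow.columns].max()

    # A feature scores a "hit" when it beats that threshold
    hits += (importances[X.columns] > threshold).astype(int)

print(hits)  # number of hits per feature out of 20 runs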

Consider a movie dataset with three features: “genre”, “audience_score” and “critic_score”. Out of 20 iterations we could get the following results:

We can place these results on a Binomial Distribution plot:

Image by author

The tails of the distribution are the most important part. In this example, each tail accounts for 0.5% of the probability.
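Assuming 20 iterations and 0.5% in each tail, as in the example, the cut-off points between these areas can be computed from the binomial distribution (Boruta treats the number of hits under the null hypothesis as Binomial(n, 0.5)):

from scipy.stats import binom

n_iterations = 20
p = 0.5        # under the null hypothesis, hits follow Binomial(n, 0.5)
alpha = 0.005  # 0.5% of probability in each tail, as in the example

# Features with fewer hits than this fall in the red (rejection) area
reject_below = binom.ppf(alpha, n_iterations, p)

# Features with more hits than this fall in the green (acceptance) area
accept_above = binom.ppf(1 - alpha, n_iterations, p)

print(reject_below, accept_above)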

The genre variable fell in the red area, which is the “rejection” area. We are confident that features in this area have no effect on the target variable.

The green area is the “acceptance” area. We are equally confident that these features are predictive and should be kept in the model. In this example, critic_score is a good feature that should be kept.

In the blue area Boruta is undecided about whether the feature is predictive or not. In this case we can keep the features and perhaps use other methods to check whether they have any influence on the model’s predictions. In the example, audience_score falls in this area, even though it is close to the green tail.

We keep the features that fall in the green and blue areas, and discard the ones in the red area.

You can check a great explanation of the Boruta algorithm here [6].

3. Boruta in Python

The code for the examples is also available on my GitHub, so feel free to skip this section.

To use Boruta we can use the BorutaPy library [1]:

pip install boruta

Then we can import the Diabetes Dataset (available from Scikit-Learn [2]):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
# Fetches the data
dataset = load_diabetes(as_frame = True)
# Gets the independent variables
X = dataset['data']
# Gets the dependent variable (the target)
y = dataset['target']
# Splits the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In order to use Boruta we need to define an estimator, which will be used to estimate the feature importances. In this case I chose the RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressor

# Defines the estimator used by the Boruta algorithm
estimator = RandomForestRegressor()

Now we can create the BorutaPy object and fit it to the data using the estimator:

import numpy as np
from boruta import BorutaPy

# Creates the BorutaPy object
boruta = BorutaPy(estimator = estimator, n_estimators = 'auto', max_iter = 100)
# Fits Boruta
boruta.fit(np.array(X_train), np.array(y_train))

Finally we can discover which features are important, which are unimportant, and which are uncertain:

# Important features
important = list(X.columns[boruta.support_])
print(f"Features confirmed as important: {important}")
# Tentative features
tentative = list(X.columns[boruta.support_weak_])
print(f"Unconfirmed features (tentative): {tentative}")
# Unimportant features
unimportant = list(X.columns[~(boruta.support_ | boruta.support_weak_)])
print(f"Features confirmed as unimportant: {unimportant}")

The output is:

Features confirmed as important: ['bmi', 'bp', 's5', 's6']
Unconfirmed features (tentative): []
Features confirmed as unimportant: ['age', 'sex', 's1', 's2', 's3', 's4']

4. Boruta SHAP Feature Selection

Boruta is a robust method for feature selection, but it relies heavily on the feature importance calculation, which might be biased or might not capture the structure of the data well.

This is where SHAP [3] joins the team. By using SHAP values as the importance measure in Boruta, we get the Boruta SHAP feature selection algorithm. With this approach we get the strong additive feature explanations of the SHAP method while keeping the robustness of the Boruta algorithm, ensuring that only significant variables remain in the set.
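To give a feeling for this importance measure, here is a minimal sketch, independent of the BorutaShap internals, that computes mean absolute SHAP values with a TreeExplainer on the same diabetes data:

import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Same diabetes data as before
X, y = load_diabetes(as_frame=True, return_X_y=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# SHAP values from a tree explainer: one value per observation and feature
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# One importance score per feature: the mean absolute SHAP value
shap_importance = np.abs(shap_values).mean(axis=0)
print(dict(zip(X.columns, shap_importance.round(3))))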

If you don’t know what SHAP is, take a look at my article that explains it.

5. Boruta SHAP in Python

To use Boruta SHAP we can use the BorutaShap library [4]:

pip install BorutaShap

First we need to create a BorutaShap object. We keep importance_measure at its default value, “shap”, since we want to use SHAP values as the feature importance discriminator, and we set the classification parameter to False because this is a regression problem (it should be True for classification problems).

from BorutaShap import BorutaShap

# Creates a BorutaShap selector for regression
selector = BorutaShap(importance_measure = 'shap', classification = False)

Then we fit the BorutaShap selector on the data, or on a sample of it. The n_trials parameter defines the number of iterations of the Boruta algorithm, while the sample boolean determines whether the method will internally sample the data to speed up the process.

# Fits the selector
selector.fit(X = X_train, y = y_train, n_trials = 100, sample = False, verbose = True)
# n_trials -> number of iterations for Boruta algorithm
# sample -> samples the data so it goes faster

After the fit, the following result will be shown:

4 attributes confirmed important: ['s5', 'bp', 'bmi', 's6']
5 attributes confirmed unimportant: ['s2', 's4', 's3', 'age', 'sex']
1 tentative attributes remains: ['s1']

Finally we can see which features will be removed and drop them from our data:

# Display features to be removed
features_to_remove = selector.features_to_remove
print(features_to_remove)
# Removes them
X_train_boruta_shap = X_train.drop(columns = features_to_remove)
X_test_boruta_shap = X_test.drop(columns = features_to_remove)
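If I remember the BorutaShap API correctly, the selector also offers a plot method to visualize the importance distributions of accepted, tentative, and rejected features, and a Subset method that returns the data restricted to the accepted features. Treat the exact names and parameters below as assumptions and double-check the package documentation:

# Hedged sketch: these helpers (plot, Subset) exist in the BorutaShap package
# as far as I remember, but check the documentation for the exact signatures.

# Box plot of the feature importances for all features, shadows included
selector.plot(which_features='all')

# DataFrame containing only the features that Boruta SHAP accepted
X_train_subset = selector.Subset()
print(X_train_subset.columns.tolist())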

6. Conclusion

As important as feature selection is to our ML pipelines, we need to use the best algorithms to ensure the best results.

A downside of this method is the evaluation time, which can become very long when running many Boruta iterations or when SHAP is fitted on many observations. Beware of the runtime!

With that in mind, Boruta SHAP is one of the best methods we can employ to select the most important features on our machine learning pipelines.

Use it whenever you can, but remember to compare the results with other methods to ensure greater reliability.


References

[1] BorutaPy package: https://github.com/scikit-learn-contrib/boruta_py

[2] Scikit-Learn Diabetes Dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html

[3] SHAP package: https://shap.readthedocs.io/en/latest/index.html

[4] BorutaShap package: https://github.com/Ekeany/Boruta-Shap

[5] https://medium.com/analytics-vidhya/is-this-the-best-feature-selection-algorithm-borutashap-8bc238aa1677

[6] https://towardsdatascience.com/boruta-explained-the-way-i-wish-someone-explained-it-to-me-4489d70e154a
