Confidence intervals for permutation importance

A new theoretical perspective on an old measure of feature importance

Luke Merrick
Towards Data Science


Feature importance helps us find the features that matter.

Introduction

In this post, we explain how a new theoretical perspective on the popular permutation feature importance technique allows us to quantify its uncertainty with confidence intervals and avoid potential pitfalls in its use.

First, let’s motivate the “why” of using this technique in the first place. Imagine you just got hired onto the data science team at a major international retailer. Prior to your arrival, this team built a complex model to forecast weekly sales at each of your dozens of locations around the globe. The model takes into account a multitude of factors: geographic data (like local population density and demographics), seasonality data, weather forecast data, information about individual stores (like total square footage), and even the number of likes your company’s tweets have been getting recently. Let’s assume, too, that this model works wonders, giving the business team insight into sales patterns weeks in advance. There is just one problem. Can you guess what it is?

Nobody knows why the sales forecast model works so well.

Why is this a problem? A number of reasons. The business folks relying on the model’s predictions have no idea how reliable they would be if, say, Twitter experienced an outage and tweet likes decreased one week. On the data science team, you have little sense of which factors are most useful to the model, so you’re flying blind when it comes to identifying new signals with which to bolster your model’s performance. And let’s not forget other stakeholders. If a decision based on this model’s forecast were to lead to bad results for the company, the board would want to know a lot more about this model than “it just works,” especially as AI continues to grow more regulated.

So what can we do? A great first step is to get some measure of feature importance. This means assigning a numerical score of importance to each of the factors that your model uses. These numerical scores represent how important these features are to your model’s ability to make quality predictions.

Many modeling techniques come with built-in feature importance measurements. Perhaps you can use the information-gain-based importance measure that comes by default with your xgboost model? Not so fast! As your teammates will point out, there is no guarantee that these feature importances accurately describe your complex ensemble, and besides, gain-based importance measures are known to be biased [1].

So what can we do instead? We can use “randomized ablation” (aka “permutation”) feature importance measurements. Christoph Molnar offers a clear and concise description of this technique in his Interpretable ML Book [2]:

The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction. A feature is “unimportant” if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction.
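The procedure Molnar describes can be sketched in a few lines. The toy model, data, and loss below are my own illustrative choices (a linear model and mean squared error), not anything from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Toy data: the target depends strongly on feature 0 and not at all on feature 1.
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)
baseline = mean_squared_error(y, model.predict(X))

importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column
    # Importance = increase in error caused by breaking this feature's link to y.
    importances.append(mean_squared_error(y, model.predict(X_perm)) - baseline)
```

Run on this toy data, feature 0 receives a large score while feature 1 scores near zero: shuffling an ignored column leaves the model's error essentially unchanged.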

Background

Where did this technique come from? Randomized ablation feature importance is certainly not new. Indeed, its inception dates back to at least 2001, when a variant of this technique was introduced as the “noising” of variables to better understand how random forest models use them [3]. Recently, however, this technique has seen a resurgence in use and variation. For example, an implementation of this technique will be included in the upcoming version 0.22 of the popular Scikit-learn library [4]. For a more theoretical example, consider the recently introduced framework of “model class reliance,” which terms a variant of randomized ablation feature importance “model reliance” and uses it as a core building block [5].
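For readers on scikit-learn 0.22 or later, usage looks roughly like the following (the model and synthetic dataset here are placeholders of my own choosing):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# n_repeats controls how many independent permutations are averaged per feature.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # one averaged score per feature
print(result.importances_std)   # run-to-run spread across the 10 repetitions
```

Note that scikit-learn reports a per-feature standard deviation over repetitions, but not the confidence intervals discussed below.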

A new theoretical perspective

While working with this technique at Fiddler Labs, we have sought to develop a clear sense of what it means, theoretically, to permute a column of your features, run that through your model, and see how much the model’s error increases. This has led us to use the theoretical lens of randomized ablation, hence our new name for what is commonly called permutation feature importance.

In a recent preprint released on arXiv, we develop a clear theoretical formulation of this technique as it relates to the classic statistical learning problem statement. We find that the notion of measuring error after permuting features (or, more formally, ablating them through randomization) actually fits in quite nicely with the mathematics of risk minimization in supervised learning [6]. If you are familiar with this body of theory, we hope this connection will be as helpful to your intuition as it has been to ours.

Additionally, our reformulation provides two ways of constructing confidence intervals around randomized ablation feature importance scores, a technique that practitioners can use to avoid potential pitfalls in the application of randomized ablation feature importance. To the best of our knowledge, current formulations and implementations of this technique do not include these confidence measurements.

Confidence intervals on feature importance

Consider what might happen if we were to re-run randomized ablation feature importance with a different randomized ablation (e.g. by using a different random seed), or if we run it on two different random subsets of a very large dataset (e.g. to avoid using a full dataset that would exceed our machine’s memory capacity). Our feature importances might change! Ideally, we would want to use a large dataset and average over many ablations to mitigate the randomness inherent in the algorithm, but in practice, we may not have enough data or compute power to do so.

There are two sources of uncertainty in the randomized ablation feature importance scores: the data points we use, and the random ablation values (i.e. permutation) we use. By running the algorithm multiple times and examining the run-to-run variance, we can construct a confidence interval (CI) that measures the uncertainty stemming from the ablation used. Similarly, by looking point-by-point at the loss increases caused by ablation (instead of just averaging loss over our dataset), we can construct a CI that measures the uncertainty stemming from our finite dataset.
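As a sketch of these two constructions, consider the snippet below. The per-point loss increases are simulated stand-ins (real values would come from ablating a feature and re-scoring the model), and the intervals use a simple normal-approximation 95% multiplier, which is one reasonable choice rather than the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-point loss increases for K=30 permutations over n=100 points.
# In practice, entry (k, i) would be loss(ablated) - loss(original) for point i
# under the k-th random permutation of the feature being scored.
K, n = 30, 100
deltas = rng.normal(loc=2.0, scale=1.5, size=(K, n))

# CI over ablation randomness: treat the K per-repetition mean scores as samples.
per_run_means = deltas.mean(axis=1)
run_ci = 1.96 * per_run_means.std(ddof=1) / np.sqrt(K)

# CI over the finite dataset: use per-point loss increases from one permutation.
per_point = deltas[0]
point_ci = 1.96 * per_point.std(ddof=1) / np.sqrt(n)

print(f"score ~ {deltas.mean():.2f}, repetition CI +/- {run_ci:.3f}, "
      f"point-wise CI +/- {point_ci:.3f}")
```

With these simulated numbers the point-wise interval comes out wider than the repetition-based one, mirroring the observation below that the finite dataset, not the number of ablation repetitions, can dominate the uncertainty.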

Example: forecasting the price of a home

To demonstrate the use of randomized ablation feature importance values with CIs, let’s apply the technique to a real model. To this end, I used the Ames Housing Dataset [7] to build a complex model that estimates the sale price of houses. The full code for this example is available in a Jupyter notebook here.

To show the importance of confidence intervals, we run randomized ablation feature importance using just 100 points and K=3 repetitions. This gives us the following top-10 features by score, with 95% confidence intervals indicated by the black error bars:

Randomized ablation feature importance for 100 points after 3 repetitions.

As we can see from our error bars, it is uncertain which feature is actually the third most important over these 100 points. Re-running randomized ablation feature importance with K=30 iterations, we arrive at much tighter error bounds, and we find with confidence that a house’s neighborhood actually edges out its total basement square footage in importance to our model:

Randomized ablation feature importance for the same 100 points after 30 repetitions.

However, it turns out that a larger source of uncertainty in these feature importance scores actually stems from the small size of the dataset used, rather than the small number of ablation repetitions. This fact is uncovered by using the other CI methodology presented in our paper, which captures uncertainty resulting from both ablation and the size of the dataset. Running this other CI technique on another 100 points of our dataset (with just one repetition) we observe the following wide CIs:

Randomized ablation feature importance for 100 points with point-by-point CIs.

By increasing the number of points to 500 instead of 100, our confidence improves significantly, and we become fairly confident that neighborhood is the third most important feature to our model overall (not just in our limited dataset).

Randomized ablation feature importance for 500 points with point-by-point CIs.

Conclusion

Feature importance techniques are a powerful and easy way to gain valuable insight about your machine learning models. The randomized ablation feature importance technique, often referred to as “permutation” importance, offers a straightforward, broadly applicable way to compute feature importances. We also showed here how, through a new way of theorizing and formulating the “true” value of randomized ablation feature importance, we are able to construct confidence intervals around our feature importance measurements. These confidence intervals are a useful tool for avoiding pitfalls in practice, especially when datasets are not large.

If you liked this post, you can find more like it on Fiddler’s blog, and if you want a deeper dive into CIs for randomized ablation feature importance, be sure to check out the full paper. Don’t worry, it’s only four pages long!

References

[1] Parr et al. Beware Default Random Forest Importances (2018). https://explained.ai/rf-importance/

[2] Molnar, Christoph. Interpretable Machine Learning (2019). https://christophm.github.io/interpretable-ml-book/feature-importance.html

[3] Breiman, Leo. Random Forests (2001). https://www.stat.berkeley.edu/%7Ebreiman/randomforest2001.pdf

[4] Scikit-learn Contributors. Permutation feature importance (2019). https://scikit-learn.org/dev/modules/permutation_importance.html

[5] Fisher et al. Model Class Reliance (2019). https://arxiv.org/abs/1801.01489

[6] Merrick, Luke. Randomized Ablation Feature Importance (2019). https://arxiv.org/abs/1910.00174

[7] De Cock, Dean. Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project (2011). http://jse.amstat.org/v19n3/decock.pdf
