Variable Importance in Random Forests

Traditional Methods and New Developments

Jeffrey Näf
Towards Data Science


Features of (Distributional) Random Forests. In this article: the ability to produce variable importance measures. Source: Author.

Random Forests and their generalizations (in particular, Generalized Random Forests (GRF) and Distributional Random Forests (DRF)) are powerful and easy-to-use machine learning methods that should not be absent from the toolbox of any data scientist. They not only show robust performance over a large range of datasets without the need for tuning, but can also easily handle missing values and even provide confidence intervals. In this article, we focus on another feature they are able to provide: notions of feature importance. In particular, we focus on:

  1. Traditional Random Forest (RF), which is used to predict the conditional expectation of a variable Y given p predictors X.
  2. The Distributional Random Forest, which is used to predict the whole conditional distribution of a d-variate Y given p predictors X.

Unfortunately, like many modern machine learning methods, both forests lack interpretability. That is, so many operations are involved that it seems impossible to determine what the functional relationship between the predictors and Y actually is. A common way to tackle this problem is to define Variable Importance measures (VIMP) that at least help decide which predictors are important. Generally, this has two different objectives:

(1) finding a small number of variables with maximal accuracy,

(2) detecting and ranking all influential variables to focus on for further exploration.

The difference between (1) and (2) matters as soon as there is dependence between the elements in X (so pretty much always). For example, if two variables are highly correlated with each other and with Y, one of the two inputs can be removed without hurting accuracy for objective (1), since both variables convey the same information. However, both should be included for objective (2), since these two variables may have different meanings in practice for domain experts.

Today we focus on (1) and try to find a smaller number of predictors that display more or less the same predictive accuracy. For instance, in the wage example below, we are able to reduce the number of predictors from 79 to about 20 with only a small reduction in accuracy. These most important predictors contain variables such as age and education, which are well known to influence wages. There are also many great articles on Medium about (2) using Shapley values, such as this one or this one. There is also very recent and exciting academic literature on how to efficiently calculate Shapley values with Random Forests. But this is material for a second article.

The two measures we look at today are actually more general variable importance measures that can be used for any method, based on the drop-and-relearn principle, which we will look at below. We focus exclusively on tree-based methods here, however. Moreover, we don’t go into great detail explaining the methods, but rather try to focus on their applications and on why the newer versions are preferable to the more traditional ones.

Overview of variable importance measures for Random Forests. Mean Decrease Impurity (MDI) and Mean Decrease Accuracy (MDA) were both proposed by Breiman. Due to their empirical nature, however, several problems remained, which were recently addressed by the Sobol-MDA. Source: Author

The Beginnings

Variable importance measures for RFs are in fact as old as RFs themselves. The first, the Mean Decrease Accuracy (MDA), was proposed by Breiman in his seminal Random Forest paper [1]. The idea is simple: for every dimension j=1,…,p, one compares the accuracy of the full prediction with the accuracy of the prediction when X_j is randomly permuted. Permuting X_j is meant to break its relationship with Y, so that one compares the accuracy when X_j is prevented from helping to predict Y by design, to the case when it is potentially of use.
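As a quick illustration, here is a minimal sketch of this permutation principle in R with the ranger package, evaluated on a held-out test set (Breiman's original proposal used the out-of-bag samples instead; the function name, the test-set variant, and the assumed response column "Y" are my own choices):

library(ranger)

# Minimal permutation-importance sketch: for each predictor, permute its
# column in the test data and record the resulting increase in test MSE.
permutation_mda <- function(forest, data_test) {
  base_mse <- mean((data_test$Y - predict(forest, data = data_test)$predictions)^2)
  sapply(setdiff(colnames(data_test), "Y"), function(j) {
    permuted <- data_test
    permuted[[j]] <- sample(permuted[[j]])  # break the link between X_j and Y
    mean((permuted$Y - predict(forest, data = permuted)$predictions)^2) - base_mse
  })
}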

There are various different versions of MDA implemented in R and Python:

Different Versions of MDA, implemented in different packages. Source: Table 1 in [3]

Unfortunately, permuting variable X_j in this way not only breaks its relationship to Y, but also its relationship to the other variables in X. This is not a problem if X_j is independent of all the other variables, but it becomes a problem once there is dependence. Consequently, [3] is able to show that as soon as there is dependence in X, the MDA converges to something nonsensical. In particular, MDA can give high importance to a variable X_j that is not important for predicting Y but is highly correlated with another variable, say X_l, that actually is important for predicting Y (as demonstrated in the example below). At the same time, it can fail to detect variables that are actually relevant, as demonstrated by a long list of papers in [3, Section 2.1]. Intuitively, what we would want to measure is the performance of the model if X_j is not included, whereas what we actually measure is the performance of a model with a permuted X_j variable.

The second traditional accuracy measure is Mean Decrease Impurity (MDI), which sums the weighted decreases of impurity over all nodes that split on a given covariate, averaged over all trees in the forest. Unfortunately, MDI is ill-defined from the start (it is not clear what it should measure), and several papers highlight the practical problems of this approach (e.g., [5]). As such, we will not go into detail about MDI, as MDA is often the preferred choice.
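For completeness, MDI is readily available in ranger; a one-line usage sketch (assuming a data frame XY with response column Y, as in the simulated example below):

# Mean Decrease Impurity as implemented in ranger
forest_mdi <- ranger::ranger(Y ~ ., data = XY, importance = "impurity")
sort(forest_mdi$variable.importance, decreasing = TRUE)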

Modern Developments I: Sobol-MDA

For the longest time, I thought these somewhat informal measures were the best we could do. One paper that changed that came out only very recently, [3]. In this paper, the authors demonstrate theoretically that the popular measures above are actually quite flawed and do not measure what we want to measure. So the first question might be: what do we actually want to measure? One potential answer: the Sobol index (originally proposed in the computer science literature):

$$ST^{(j)} = \frac{\operatorname{Var}(\tau(X)) - \operatorname{Var}\left(\mathbb{E}\left[\tau(X) \mid X^{(-j)}\right]\right)}{\operatorname{Var}(\tau(X))}.$$

Let’s unpack this. First, tau(X)=E[Y | X] is the conditional expectation function we would like to estimate. This is a random variable because it is a function of the random X. Now X^{(-j)} is the (p-1)-dimensional vector of covariates with covariate j removed. Thus ST^{(j)} is the fraction of explained output variance that is lost if the jth input variable is removed.

The above is the more traditional way of writing the measure. However, for me, writing

$$ST^{(j)} = \frac{\mathbb{E}\left[\,d\!\left(\tau(X),\ \tau\!\left(X^{(-j)}\right)\right)^2\right]}{\operatorname{Var}(\tau(X))}$$

is much more intuitive. Here d is a distance between two random vectors, and for the ST^{(j)} above, this distance is simply the usual Euclidean distance. Thus the upper part of ST^{(j)} is simply measuring the average squared distance between what we want (tau(X)) and what we get without variable j. (The two ways of writing ST^{(j)} agree because tau(X^{(-j)}) is the conditional expectation of tau(X) given X^{(-j)}, i.e., an orthogonal projection, so the cross term in the expansion vanishes.) The latter is

$$\tau\!\left(X^{(-j)}\right) = \mathbb{E}\left[Y \mid X^{(-j)}\right].$$

The question becomes how to estimate this efficiently. It turns out that the intuitive drop-and-relearn principle would be enough: simply estimating tau(X) using RF and then dropping X_j and refitting the RF to obtain an estimate of tau(X^{(-j)}), one obtains the consistent estimator

$$\widehat{ST}^{(j)} = \frac{1}{n\,\hat{\sigma}^2}\sum_{i=1}^{n}\left(\tau_n(X_i) - \tau_n\!\left(X_i^{(-j)}\right)\right)^2, \qquad (1)$$

where tau_n(X_i) is the RF estimate for a test point X_i using all p predictors, tau_n(X_i^{(-j)}) is similarly the estimate of the refitted forest using only p-1 predictors, and \hat{\sigma}^2 estimates the variance term in the denominator of ST^{(j)}.
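As an illustration, this drop-and-relearn estimator can be sketched in a few lines with ranger. This is a naive version of (1), not the efficient Sobol-MDA discussed next; using the out-of-bag predictions as the tau_n(X_i) and their empirical variance as the denominator estimate are my own simplifications:

library(ranger)

# Naive drop-and-relearn estimate of ST^(j): refit the forest once per predictor.
# XY is a data frame with response column Y.
sobol_drop_relearn <- function(XY, num.trees = 500) {
  full <- ranger(Y ~ ., data = XY, num.trees = num.trees)
  tau_full <- full$predictions  # out-of-bag estimates of tau(X_i)
  sapply(setdiff(colnames(XY), "Y"), function(j) {
    refit <- ranger(Y ~ ., data = XY[, colnames(XY) != j], num.trees = num.trees)
    mean((tau_full - refit$predictions)^2) / var(tau_full)
  })
}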

However, this means the forest needs to be refitted p times, which is not very efficient when p is large! As such, the authors in [3] develop what they call the Sobol-MDA. Instead of refitting the forest each time, the forest is only fitted once. Then test points are dropped down the same forest and the resulting predictions are “projected” to form the measure in (1). That is, splits on X_j are simply ignored (remember, the goal is to obtain an estimate without X_j). The authors are able to show that calculating (1) above with this projected approach also results in a consistent estimator! This is a beautiful idea indeed and renders the algorithm applicable even in high dimensions.

Illustration of the projection approach. On the left, the division of the two-dimensional space by RF. On the right, the projection approach ignores splits on X^{(2)}, thereby removing that variable when making predictions. As can be seen, the point X gets projected onto X^{(-j)} using this principle. Source: Figure 1 in [3]

The method is implemented in R in the sobolMDA package, based on the very fast ranger package.

Modern Developments II: MMD-based sensitivity index

Looking at the formulation using the distance d, a natural question is whether different distances could be used to obtain variable importance measures for more difficult problems. One such recent example is to use the MMD distance as d:

$$I^{(j)} = \frac{\mathbb{E}\left[\operatorname{MMD}_k\!\left(P(Y \mid X),\ P\!\left(Y \mid X^{(-j)}\right)\right)^2\right]}{\mathbb{E}\left[\operatorname{MMD}_k\!\left(P(Y \mid X),\ P(Y)\right)^2\right]}.$$

The MMD distance is a wonderful tool that makes it quite easy to build a distance between two distributions P and Q using a kernel k (such as the Gaussian kernel):

$$\operatorname{MMD}_k(P, Q)^2 = \mathbb{E}\left[k(Y, Y')\right] + \mathbb{E}\left[k(Z, Z')\right] - 2\,\mathbb{E}\left[k(Y, Z)\right],$$

where Y, Y' are independent copies with distribution P and Z, Z' are independent copies with distribution Q.
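To make this concrete, here is a minimal (biased, V-statistic) empirical version of this formula for two samples, using the same kernlab functions as the code at the end of this article; the function name and the fixed bandwidth are my own choices:

library(kernlab)

# Empirical MMD^2 between samples Y1 ~ P and Y2 ~ Q with a Gaussian kernel
mmd2 <- function(Y1, Y2, sigma = 1) {
  k <- rbfdot(sigma = sigma)
  Y1 <- as.matrix(Y1); Y2 <- as.matrix(Y2)
  mean(kernelMatrix(k, Y1, Y1)) + mean(kernelMatrix(k, Y2, Y2)) -
    2 * mean(kernelMatrix(k, Y1, Y2))
}

mmd2(rnorm(200), rnorm(200, mean = 1))  # clearly positive: the distributions differ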

For the moment, I leave the details to further articles. The most important takeaway is simply that I^{(j)} considers a more general target than the conditional expectation. It recognizes a variable X_j as important as soon as it influences the distribution of Y in any way. It might be that X_j only changes the variance or the quantiles and leaves the conditional mean of Y untouched (see the example below). In this case, the Sobol-MDA would not recognize X_j as important, but the MMD method would. This doesn’t necessarily make it better; it is simply a different tool: if one is interested in predicting the conditional expectation, ST^{(j)} is the right measure. However, if one is interested in predicting other aspects of the distribution, especially quantiles, I^{(j)} would be better suited. Again, I^{(j)} can be consistently estimated using the drop-and-relearn principle (refitting DRF for j=1,…,p, each time with variable j removed), or the same projection approach as for the Sobol-MDA can be used. A drop-and-relearn-based implementation is attached at the end of this article. We refer to this method here as MMD-MDA.

Simulated Data

We now illustrate these two modern measures on a simple simulated example. We first download and install the Sobol-MDA package from GitLab and then load all the packages necessary for this example:

library(kernlab)
library(drf)
library(Matrix)
library(DescTools)
library(mice)
library(sobolMDA)
source("compute_drf_vimp.R") ##Contents of this file can be found below
source("evaluation.R") ##Contents of this file can be found below

Then we simulate from this simple example: we take X_1, X_2, X_4, …, X_10 independently uniform on (-1,1) and create dependence between X_1 and X_3 by taking X_3 = X_1 + uniform error. Then we simulate Y as

$$Y \mid X \sim \mathcal{N}\left(0.8 \cdot \mathbf{1}\{X_1 > 0\},\ \left(1 + \mathbf{1}\{X_2 > 0\}\right)^2\right),$$

that is, with a mean shift driven by X_1 and a standard deviation shift driven by X_2:

##Simulate Data that experiences both a mean as well as an sd shift

# Sample size (not fixed in the text; chosen here for illustration)
n <- 1000

# Simulate from X
x1 <- runif(n,-1,1)
x2 <- runif(n,-1,1)
X0 <- matrix(runif(7*n,-1,1), nrow=n, ncol=7)
x3 <- x1 + runif(n,-1,1)
X <- cbind(x1, x2, x3, X0)

# Simulate dependent variable Y: mean shift from x1, sd shift from x2
Y <- as.matrix(rnorm(n, mean = 0.8*(x1 > 0), sd = 1 + 1*(x2 > 0)))
colnames(X) <- paste0("X", 1:10)


head(cbind(Y,X))

We then apply the Sobol-MDA approach for estimating the conditional expectation of Y given X:

## Variable importance for conditional Expectation Estimation

XY <- as.data.frame(cbind(X, Y))
colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')
num.trees <- 500
forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'sobolMDA')
sobolMDA <- forest$variable.importance
names(sobolMDA) <- colnames(X)

sort(sobolMDA, decreasing = T)

X1 X8 X7 X6 X5 X9
0.062220958 0.021946135 0.016818860 0.016777223 -0.001290326 -0.001540919
X3 X10 X4 X2
-0.001578540 -0.007400854 -0.008299478 -0.020334150

As can be seen, it correctly identifies that X_1 is the most important variable, while the others are ranked equally (un)important. This makes sense because the conditional expectation of Y is only changed by X_1. Crucially, the measure manages to do this despite the dependence between X_1 and X_3. Thus, in this example, we successfully achieved goal (1), as explained above. On the other hand, we can also have a look at the traditional MDA:

forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'permutation')
MDA <- forest$variable.importance
names(MDA) <- colnames(X)

sort(MDA, decreasing = T)

X1 X3 X6 X7 X8 X2
0.464516976 0.118147061 0.063969310 0.032741521 0.029004312 -0.004494380
X4 X9 X10 X5
-0.009977733 -0.011030996 -0.014281844 -0.018062544

In this case, while it correctly identifies X_1 as the most important variable, it also places X_3 in second place, with a value that seems quite a bit higher than those of the remaining variables. This despite the fact that X_3 is just as unimportant as X_2, X_4, …, X_10!

But what if we are interested in predicting the distribution of Y more generally, say for estimating quantiles? In this case, we need a measure that is able to recognize the influence of X_2 on the conditional variance of Y. Here the MMD variable importance measure comes into play:

MMDVimp <- compute_drf_vimp(X=X,Y=Y)
sort(MMDVimp, decreasing = T)

X2 X1 X10 X6 X8 X3
0.683315006 0.318517259 0.014066410 0.009904518 0.006859128 0.005529749
X7 X9 X4 X5
0.003476256 0.003290550 0.002417677 0.002036174

Again the measure is able to correctly identify what matters: X_1 and X_2 are the two most important variables. And again, it does this despite the dependence between X_1 and X_3. Interestingly, it also gives the variance shift from X_2 a higher importance than the expectation shift from X_1.

Real Data

Finally, I present a real data application to demonstrate the variable importance measures. Note that with DRF, we could even look at a multivariate Y, but to keep things simple, we focus on a univariate setting and consider the US wage data from the 2018 American Community Survey by the US Census Bureau. For the first DRF paper, we obtained data on approximately 1 million full-time employees from this survey, from which we extracted the salary information and all covariates that might be relevant for salaries. This wealth of data is ideal to experiment with a method like DRF (in fact, we will only use a tiny subset for this analysis). The data we load can be found here.

# Load data (https://github.com/lorismichel/drf/blob/master/applications/wage_data/data/datasets/wage_benchmark.Rdata)
load("wage_benchmark.Rdata")

##Define the training data

n<-1000

Xtrain<-X[1:n,]
Ytrain<-Y[1:n,]
Xtrain<-cbind(Xtrain,Ytrain[,"male"])
colnames(Xtrain)[ncol(Xtrain)]<-"male"
Ytrain<-Ytrain[,1, drop=F]


##Define the test data
ntest<-2000
Xtest<-X[(n+1):(n+ntest),]
Ytest<-Y[(n+1):(n+ntest),]
Xtest<-cbind(Xtest,Ytest[,"male"])
colnames(Xtest)[ncol(Xtest)]<-"male"
Ytest<-Ytest[,1, drop=F]

We now calculate both variable importance measures (this will take a while as only the drop-and-relearn method is implemented for DRF):

# Calculate variable importance for both measures
# 1. Sobol-MDA
XY <- as.data.frame(cbind(Xtrain, Ytrain))
colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')
num.trees <- 500
forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'sobolMDA')
SobolMDA <- forest$variable.importance
names(SobolMDA) <- colnames(Xtrain)

# 2. MMD-MDA
MMDVimp <- compute_drf_vimp(X=Xtrain,Y=Ytrain,silent=T)



print("Top 10 most important variables for conditional Expectation estimation")
sort(SobolMDA, decreasing = T)[1:10]
print("Top 5 most important variables for conditional Distribution estimation")
sort(MMDVimp, decreasing = T)[1:10]
Sobol-MDA:

education_level age male
0.073506769 0.027079349 0.013722756
occupation_11 occupation_43 industry_54
0.013550320 0.010025332 0.007744589
industry_44 occupation_23 occupation_15
0.006657918 0.005772662 0.004610835
marital_never married
0.004545964
MMD-MDA:

education_level age male
0.420316085 0.109212519 0.027356393
occupation_43 occupation_11 marital_never married
0.016861954 0.014122583 0.003449910
occupation_29 marital_married industry_81
0.002272629 0.002085207 0.001152210
industry_72
0.000984725

In this case, the two variable importance measures agree quite a bit on which variables are important. While this is not a causal analysis, it is also nice that variables that are known to be important for predicting wages, specifically “age”, “education_level” and “male” (the gender indicator), are indeed seen as very important by the two measures.

To obtain a small set of predictive variables, one could now, for j=1,…,p-1, proceed as follows (a sketch of the full loop is given after the list):

(I) Remove the least important variable

(II) Calculate the loss (e.g. mean squared error) on a test set

(III) Recalculate the variable importance for the remaining variables

(IV) Repeat until a certain stopping criterion is met
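A hedged sketch of what this full loop could look like with the Sobol-MDA is given below; using the out-of-bag error as the loss in step (II) instead of a test set, and a 5% stopping rule, are my own choices:

# Recursive elimination with recalculation (step (III)) after every removal.
# dat is a data frame with response column Y.
recursive_elimination <- function(dat, num.trees = 500, max_increase = 0.05) {
  base_loss <- NULL
  repeat {
    forest <- sobolMDA::ranger(Y ~ ., data = dat, num.trees = num.trees,
                               importance = "sobolMDA")
    loss <- forest$prediction.error  # (II): out-of-bag MSE as the loss
    if (is.null(base_loss)) base_loss <- loss
    if (loss > (1 + max_increase) * base_loss || ncol(dat) <= 2) break  # (IV)
    weakest <- names(which.min(forest$variable.importance))  # (I)
    dat <- dat[, colnames(dat) != weakest]  # remove, then re-loop with (III)
  }
  colnames(dat)  # names of the retained variables (plus Y)
}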

One could stop, for instance, when the loss has increased by more than 5%, as in the sketch above. To make my life easier in this article, I just use the same variable importance values saved in “SobolMDA” and “MMDVimp” above. That is, I ignore step (III) and only consider steps (I), (II) and (IV). When the goal of estimation is the full conditional distribution, step (II) is also not entirely clear. We use what we refer to as the MMD loss, described in more detail in our paper [4]. This loss considers the error we are making in the prediction of the distribution. For the conditional mean, we simply use the mean squared error. This is done in the function “evalall” found below:

# Remove variables one-by-one according to the importance values saved in SobolMDA
# and MMDVimp.
evallistSobol<-evalall(SobolMDA, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c("MSE"), num.trees )
evallistMMD<-evalall(MMDVimp, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c("MMD"), num.trees )


plot(evallistSobol$evalMSE, type="l", lwd=2, cex=0.8, col="darkgreen", main="MSE loss" , xlab="Number of Variables removed", ylab="Values")
plot(evallistMMD$evalMMD, type="l", lwd=2, cex=0.8, col="darkgreen", main="MMD loss" , xlab="Number of Variables removed", ylab="Values")

This results in the following two pictures:

Notice that both curves are somewhat wiggly, which is first due to the fact that I did not recalculate the importance measures (i.e., I left out step (III)) and second due to the randomness of the forests. Aside from this, the graphs nicely show how the errors successively increase with each variable that is removed. This increase is first slow for the least important variables and then gets quicker for the most important ones, exactly as one would expect. In particular, the loss in both cases remains virtually unchanged if one removes the 50 least important variables! In fact, one could remove about 70 variables in both cases without increasing the loss by more than 6%. One has to note, though, that many predictors are part of one-hot encoded categorical variables, and thus one needs to be somewhat careful when removing predictors, as several of them correspond to levels of the same categorical variable (a small helper for grouping such columns is sketched below). However, in an actual application, this might still be desirable.
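One simple way to respect this structure is to aggregate the importance values over the columns belonging to the same categorical variable before removing anything. A hypothetical helper, assuming that one-hot columns share a common prefix before the first underscore (as with “occupation_” and “industry_” here):

# Sum importance values over one-hot groups identified by their name prefix
group_importance <- function(vimp, split = "_") {
  groups <- sapply(strsplit(names(vimp), split, fixed = TRUE), `[`, 1)
  sort(tapply(vimp, groups, sum), decreasing = TRUE)
}

group_importance(SobolMDA)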

Conclusion

In this article, we looked at modern approaches to variable importance in Random Forests, with the goal of obtaining a small set of predictors or covariates, both with respect to the conditional expectation and for the conditional distribution more generally. We have seen in the wage data example that this can lead to a substantial reduction in predictors with virtually the same accuracy.

As noted above, the measures presented are not strictly constrained to Random Forests but can in principle be used more generally. However, forests allow for the elegant projection approach, which enables the calculation of the importance measure for all variables j without having to refit the forest each time (!). This is described in both [3] and [4].

Literature

[1] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.

[2] Breiman, L. (2003). Setting up, using, and understanding random forests v3.1. Technical report, UC Berkeley, Department of Statistics.

[3] Bénard, C., Da Veiga, S., and Scornet, E. (2022). Mean decrease accuracy for random forests: inconsistency, and a practical solution via the Sobol-MDA. Biometrika, 109(4):881–900.

[4] Bénard, C., Näf, J., and Josse, J. (2023). MMD-based variable importance for distributional random forest. arXiv preprint.

[5] Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics, 8:25.

Appendix: Code


#### Contents of compute_drf_vimp.R ######

#' Variable importance for Distributional Random Forests
#'
#' @param X Matrix with input training data.
#' @param Y Matrix with output training data.
#' @param X_test Matrix with input testing data. If NULL, out-of-bag estimates are used.
#' @param num.trees Number of trees to fit DRF. Default value is 500 trees.
#' @param silent If FALSE, the current iteration number is printed; otherwise nothing is printed. Default is FALSE.
#'
#' @return The list of importance values for all input variables.
#' @export
#'
#' @examples
compute_drf_vimp <- function(X, Y, X_test = NULL, num.trees = 500, silent = FALSE){

# fit initial DRF
bandwidth_Y <- drf:::medianHeuristic(Y)
k_Y <- rbfdot(sigma = bandwidth_Y)
K <- kernelMatrix(k_Y, Y, Y)
DRF <- drf(X, Y, num.trees = num.trees)
wall <- predict(DRF, X_test)$weights

# compute normalization constant
wbar <- colMeans(wall)
wall_wbar <- sweep(wall, 2, wbar, "-")
I0 <- as.numeric(sum(diag(wall_wbar %*% K %*% t(wall_wbar))))

# compute drf importance dropping variables one by one
I <- sapply(1:ncol(X), function(j) {
if (!silent){print(paste0('Running importance for variable X', j, '...'))}
DRFj <- drf(X = X[, -j, drop=F], Y = Y, num.trees = num.trees)
DRFpredj <- predict(DRFj, if (is.null(X_test)) NULL else X_test[, -j])  # NULL gives out-of-bag weights
wj <- DRFpredj$weights
Ij <- sum(diag((wj - wall) %*% K %*% t(wj - wall)))/I0
return(Ij)
})

# compute retraining bias
DRF0 <- drf(X = X, Y = Y, num.trees = num.trees)
DRFpred0 = predict(DRF0, X_test)
w0 <- DRFpred0$weights
vimp0 <- sum(diag((w0 - wall) %*% K %*% t(w0 - wall)))/I0

# compute final importance (remove bias & truncate negative values)
vimp <- sapply(I - vimp0, function(x){max(0,x)})

names(vimp)<-colnames(X)

return(vimp)

}



#### Contents of evaluation.R ######


compute_mmd_loss <- function(Y_train, Y_test, weights){
# Y_train <- scale(Y_train)
# Y_test <- scale(Y_test)
# Kernel bandwidth via the median heuristic on the training responses
bandwidth_Y <- (1/drf:::medianHeuristic(Y_train))^2
k_Y <- rbfdot(sigma = bandwidth_Y)
# Kernel matrices within the training responses and between test and training responses
K_train <- matrix(kernelMatrix(k_Y, Y_train, Y_train), ncol = nrow(Y_train))
K_cross <- matrix(kernelMatrix(k_Y, Y_test, Y_train), ncol = nrow(Y_train))
weights <- matrix(weights, ncol = ncol(weights))
# MMD loss between the predicted conditional distributions (given by the weights)
# and the observed test responses, up to a constant not depending on the weights
t1 <- diag(weights%*%K_train%*%t(weights))
t2 <- diag(K_cross%*%t(weights))
mmd_loss <- mean(t1) - 2*mean(t2)
mmd_loss
}

evalall <- function(Vimp, X ,Y ,Xtest, Ytest, metrics=c("MMD","MSE"), num.trees ){

if (ncol(Ytest) > 1 & "MSE" %in% metrics){
metrics <- metrics[!( metrics %in% "MSE") ]
}

# Sort for increasing importance, such that the least important variables are removed first
Vimp<-sort(Vimp)

if ( is.null(names(Vimp)) ){
stop("Need names for later")
}


evalMMD<-matrix(0, nrow=ncol(X))
evalMSE<-matrix(0, nrow=ncol(X))

###Idea: Create a function that takes a variable importance measure and does this loop!!

for (j in 1:ncol(X)){



if (j==1){

if ("MMD" %in% metrics){

DRFred<- drf(X=X,Y=Y)
weights<- predict(DRFred, newdata=Xtest)$weights
evalMMD[j]<-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)

}

if ("MSE" %in% metrics){

XY <- as.data.frame(cbind(X, Y))
colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')
RFfull <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)
XtestRF<-Xtest
colnames(XtestRF) <- paste('X', 1:ncol(XtestRF), sep='')
predRF<-predict(RFfull, data=XtestRF)
evalMSE[j] <- mean((Ytest - predRF$predictions)^2)

}

}else{


if ("MMD" %in% metrics){

DRFred<- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F],Y=Y)
weights<- predict(DRFred, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F])$weights
evalMMD[j]<-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)

}



if ("MSE" %in% metrics){

XY <- as.data.frame(cbind(X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y))
colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')
RFfull <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)
XtestRF<-Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F]
colnames(XtestRF) <- paste('X', 1:ncol(XtestRF), sep='')
predRF<-predict(RFfull, data=XtestRF)
evalMSE[j] <- mean((Ytest - predRF$predictions)^2)

# DRFall <- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y=Y, num.trees=num.trees)
# quantpredictall<-predict(DRFall, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F], functional="quantile",quantiles=c(0.5))
# evalMAD[j] <- mean(sapply(1:nrow(Xtest), function(j) abs(Ytest[j] - quantpredictall$quantile[,,"q=0.5"][j]) ))
}

}

}

return(list(Vimp=Vimp, evalMMD=evalMMD, evalMSE=evalMSE ))

}


I am a researcher with a PhD in statistics and always happy to study and share research, data-science skills, deep math and life-changing views.