Random Forests and Missing Values

There is a very intriguing practical fix

Jeffrey Näf
Towards Data Science


Features of (Distributional) Random Forests. In this article: The ability to deal with missing values. Source: Author.

Outside of some excessively cleaned data sets that one finds online, missing values are everywhere. In fact, the more complex and large the dataset, the more likely it is that missing values are present. Missing values are a fascinating field of statistical research, but in practice they are often a nuisance.

If you deal with a prediction problem where you want to predict a variable Y from p-dimensional covariates X=(X_1,…,X_p) and you face missing values in X, there is an interesting solution for tree-based methods. This method is actually rather old but appears to work remarkably well across a wide range of data sets. I am talking about the “missing incorporated in attributes” criterion (MIA; [1]). While there are many good articles about missing values (such as this one), this powerful approach seems somewhat underused. In particular, you do not need to impute, delete or predict your missing values in any way, but can instead just run your prediction as if the data were fully observed.

I will quickly explain how the method itself works, and then present an example with the distributional random forest (DRF) explained here. I chose DRF because it is a very general version of Random Forest (in particular, it can also be used to predict a random vector Y) and because I am somewhat biased here. MIA is actually implemented for the generalized random forest (GRF), which covers a wide range of forest implementations. In particular, since the implementation of DRF on CRAN is based on GRF, after a slight modification, it can use the MIA method as well.

Of course, be aware that this is a quick fix that (as far as I know) has no theoretical guarantees. Depending on the missingness mechanism, it might heavily bias the analysis. On the other hand, most commonly used methods for dealing with missing values don’t have any theoretical guarantees either, or are outright known to bias the analysis, and, at least empirically, MIA appears to work well.

How it works

Recall that in a RF, splits are built of the form X_j < S or X_j ≥ S, for a dimension j=1,…,p. To find this split value S, the forest optimizes some kind of criterion on the Y’s, for example the CART criterion. The observations are thus successively divided through decision rules that depend on X.

Illustration of the splitting done in a RF. Image by author.

The original paper explains it a bit confusingly, but as far as I understand it, MIA works as follows: Let us consider a sample (Y_1, X_1),…, (Y_n, X_n), with

X_i=(X_i1,…,X_ip)’.

Splitting without missing values simply means looking for the value S as above and then throwing all Y_i with X_ij < S into Node 1 and all Y_i with X_ij ≥ S into Node 2. Calculating the target criterion, such as CART, for each candidate value S, we can choose the best one. With missing values, there are instead 3 options to consider for every candidate split value S:

  • Use the usual rule for all observations i such that X_ij is observed and send i to Node 1 if X_ij is missing.
  • Use the usual rule for all observations i such that X_ij is observed and send i to Node 2 if X_ij is missing.
  • Ignore the usual rule and just send i to Node 1 if X_ij is missing and to Node 2 if it is observed.

Which of these rules to follow is again decided according to the criterion on Y_i we use.

Illustration of how I understand the MIA procedure. Given observations in the parent node, we are looking for the best split value S. For each split value we consider the 3 options and try until we find the minimum. The sets {} on the left indicate the observations i that get sent to the left or the right. Image by author.
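To make this concrete, here is a minimal R sketch of how the three options could be evaluated for one candidate split value S in one dimension j, using a simple CART-style sum-of-squares criterion. This is only an illustration of the idea (the function name mia_criterion and the toy data are made up for this purpose), not the actual C++ implementation used in GRF/DRF; a real forest would of course also minimize over all candidate values S and dimensions j.

# Minimal sketch of MIA for one candidate split value S (illustration only)
mia_criterion <- function(y, xj, S) {
  # CART-style criterion: sum of within-node squared deviations
  sse <- function(v) if (length(v) == 0) 0 else sum((v - mean(v))^2)
  obs <- !is.na(xj)

  # Option 1: usual rule for observed X_ij, missing values go to Node 1
  crit1 <- sse(y[(obs & xj < S) | !obs]) + sse(y[obs & xj >= S])
  # Option 2: usual rule for observed X_ij, missing values go to Node 2
  crit2 <- sse(y[obs & xj < S]) + sse(y[(obs & xj >= S) | !obs])
  # Option 3: ignore S and simply split missing vs. observed
  crit3 <- sse(y[!obs]) + sse(y[obs])

  c(option1 = crit1, option2 = crit2, option3 = crit3)
}

# Toy example: the option with the smallest criterion value would be chosen
set.seed(1)
xj <- c(rnorm(50), rep(NA, 10))             # covariate with missing values
y  <- ifelse(is.na(xj), 2, xj) + rnorm(60, sd = 0.1)
mia_criterion(y, xj, S = 0)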

A Small Example

It needs to be mentioned at this point that the drf package on CRAN has not yet been updated with the newest methodology. There will be a point in the future when all of this is implemented in one package on CRAN(!) For the moment, however, there are two versions:

If you want to use the fast drf implementation with missing values (without confidence intervals), you can use the “drfown” function attached at the end of this article. This code is adapted from

lorismichel/drf: Distributional Random Forests (Cevid et al., 2020) (github.com)

If, on the other hand, you want confidence intervals for your parameters, use this (slower) code:

drfinference/drf-foo.R at main · JeffNaef/drfinference (github.com)

In particular, drf-foo.R contains all you need in the latter case.

We will focus on the slower code with confidence intervals, as explained in this article, and consider the same example as in that article:

library(MASS)   # for mvrnorm

set.seed(2)

n<-2000
beta1<-1
beta2<--1.8


# Model Simulation
X<-mvrnorm(n = n, mu=c(0,0), Sigma=matrix(c(1,0.7,0.7,1), nrow=2,ncol=2))
u<-rnorm(n=n, sd = sqrt(exp(X[,1])))
Y<- matrix(beta1*X[,1] + beta2*X[,2] + u, ncol=1)

Note that this is a heteroskedastic linear model with p=2 and with the variance of the error term depending on the X_1 values. Now we also add missing values to X_1 in a Missing at Random (MAR) fashion:

prob_na <- 0.3
X[, 1] <- ifelse(X[, 2] <= -0.2 & runif(n) < prob_na, NA, X[, 1])

This means that X_1 is missing with probability 0.3 whenever X_2 has a value smaller than or equal to -0.2. Thus the probability of X_1 being missing depends on X_2, which is what is referred to as “Missing at Random” (MAR). This is already a complex situation, and there is information to be gained from the pattern of missing values: the missingness is not “Missing Completely at Random” (MCAR), because whether X_1 is missing depends on the value of X_2. This in turn means that the distribution of X_2 we draw from differs depending on whether X_1 is missing or not, which in particular means that deleting the rows with missing values might severely bias the analysis.
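To see this directly in the simulated data, one can for instance compare X_2 between rows where X_1 is missing and rows where it is observed; this is just a quick sanity check of the missingness pattern, not part of the method:

# Sanity check: the distribution of X_2 differs depending on whether X_1 is missing
summary(X[is.na(X[, 1]), 2])    # X_2 where X_1 is missing (only values <= -0.2 appear)
summary(X[!is.na(X[, 1]), 2])   # X_2 where X_1 is observed
mean(is.na(X[, 1]))             # overall fraction of missing values in X_1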

We now fix x and estimate the conditional expectation and variance given X=x, exactly as in the previous article.

# Choose an x that is not too far out
x<-matrix(c(1,1),ncol=2)

# Choose alpha for CIs
alpha<-0.05

We then also fit DRF and predict the weights for the test point x (which corresponds to predicting the conditional distribution of Y|X=x):

## Fit the new DRF framework (drfCI and predictdrf are defined in drf-foo.R, linked above)
drf_fit <- drfCI(X=X, Y=Y, min.node.size = 5, splitting.rule='FourierMMD', num.features=10, B=100)

## predict weights for the test point x
DRF = predictdrf(drf_fit, x=x)
weights <- DRF$weights[1,]
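These weights are nonnegative and sum to one for each test point; they represent the estimated conditional distribution of Y given X=x. Just as an illustration (not needed for the analysis below), one can use them directly, for example to obtain a weighted estimate of the conditional median:

# The weights define the estimated conditional distribution of Y given X=x
sum(weights)    # numerically (close to) 1

# e.g., weighted estimate of the conditional median of Y given X=x
ord <- order(Y[, 1])
cumw <- cumsum(weights[ord])
Y[ord, 1][which(cumw >= 0.5)[1]]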

Example 1: Conditional Expectation

We first estimate the conditional expectation of Y|X=x.

# Estimate the conditional expectation at x:
condexpest<- sum(weights*Y)

# Use the distribution of weights, see below
distofcondexpest<-unlist(lapply(DRF$weightsb, function(wb) sum(wb[1,]*Y) ))

# Can either use the above directly to build confidence interval, or can use the normal approximation.
# We will use the latter
varest<-var(distofcondexpest-condexpest)

# build 95%-CI
lower<-condexpest - qnorm(1-alpha/2)*sqrt(varest)
upper<-condexpest + qnorm(1-alpha/2)*sqrt(varest)
round(c(lower, condexpest, upper),2)

# without NAs: (-1.00, -0.69, -0.37)
# with NAs: (-1.15, -0.67, -0.19)

Remarkably, the values obtained with NAs are very close to the ones from the first analysis without NAs in the previous article! This is quite astounding to me, as this missing mechanism is not easy to deal with. Interestingly, the estimated variance of the estimator also more than doubles, from around 0.025 without missing values to roughly 0.06 with missing values.

The truth is given as E[Y|X=x] = beta1*x_1 + beta2*x_2 = 1 - 1.8 = -0.8, so we have a slight error, but the confidence intervals contain the truth, as they should.

The result looks similar for a more complex target, like the conditional variance:

# Estimate the conditional variance at x:
condvarest<- sum(weights*Y^2) - condexpest^2

distofcondvarest<-unlist(lapply(DRF$weightsb, function(wb) {
sum(wb[1,]*Y^2) - sum(wb[1,]*Y)^2
} ))

# Can either use the above directly to build confidence interval, or can use the normal approximation.
# We will use the latter
varest<-var(distofcondvarest-condvarest)

# build 95%-CI
lower<-condvarest - qnorm(1-alpha/2)*sqrt(varest)
upper<-condvarest + qnorm(1-alpha/2)*sqrt(varest)

c(lower, condvarest, upper)

# without NAs: (1.89, 2.65, 3.42)
# with NAs: (1.79, 2.74, 3.69)

Here the difference in the estimated values is a bit larger. As the truth is given as Var(Y|X=x) = exp(x_1) = exp(1) ≈ 2.72, the estimate with NAs is even slightly more accurate (though of course this is likely just randomness). Again the variance estimate of the (variance) estimator increases with missing values, from 0.15 (no missing values) to 0.23.

Conclusion

In this article, we discussed MIA, which is an adaptation of the splitting method in Random Forest to deal with missing values. Since it is implemented in GRF and DRF, it can be used broadly and the small example we looked at indicates that it works remarkably well.

However, I’d like to note again that there is no theoretical guarantee of consistency, or that the confidence intervals make sense, even for a very large number of data points. The reasons for missing values are numerous, and one has to be very careful not to bias one’s analysis through a careless handling of this issue. The MIA method is by no means a well-understood fix for this problem. However, it seems to be a reasonable quick fix for the moment that appears to be able to make some use of the pattern of missingness in the data. If somebody has done or does a more extensive simulation analysis, I would be curious about the results.

Code

require(drf)            

drfown <- function(X, Y,
                   num.trees = 500,
                   splitting.rule = "FourierMMD",
                   num.features = 10,
                   bandwidth = NULL,
                   response.scaling = TRUE,
                   node.scaling = FALSE,
                   sample.weights = NULL,
                   sample.fraction = 0.5,
                   mtry = min(ceiling(sqrt(ncol(X)) + 20), ncol(X)),
                   min.node.size = 15,
                   honesty = TRUE,
                   honesty.fraction = 0.5,
                   honesty.prune.leaves = TRUE,
                   alpha = 0.05,
                   imbalance.penalty = 0,
                   compute.oob.predictions = TRUE,
                   num.threads = NULL,
                   seed = stats::runif(1, 0, .Machine$integer.max),
                   compute.variable.importance = FALSE) {

  # initial checks for X and Y
  if (is.data.frame(X)) {

    if (is.null(names(X))) {
      stop("the regressor should be named if provided under data.frame format.")
    }

    if (any(apply(X, 2, class) %in% c("factor", "character"))) {
      any.factor.or.character <- TRUE
      X.mat <- as.matrix(fastDummies::dummy_cols(X, remove_selected_columns = TRUE))
    } else {
      any.factor.or.character <- FALSE
      X.mat <- as.matrix(X)
    }

    mat.col.names.df <- names(X)
    mat.col.names <- colnames(X.mat)
  } else {
    X.mat <- X
    mat.col.names <- NULL
    mat.col.names.df <- NULL
    any.factor.or.character <- FALSE
  }

  if (is.data.frame(Y)) {

    if (any(apply(Y, 2, class) %in% c("factor", "character"))) {
      stop("Y should only contain numeric variables.")
    }
    Y <- as.matrix(Y)
  }

  if (is.vector(Y)) {
    Y <- matrix(Y, ncol = 1)
  }

  #validate_X(X.mat)

  if (inherits(X, "Matrix") && !(inherits(X, "dgCMatrix"))) {
    stop("Currently only sparse data of class 'dgCMatrix' is supported.")
  }

  drf:::validate_sample_weights(sample.weights, X.mat)
  #Y <- validate_observations(Y, X)

  # set legacy GRF parameters
  clusters <- vector(mode = "numeric", length = 0)
  samples.per.cluster <- 0
  equalize.cluster.weights <- FALSE
  ci.group.size <- 1

  num.threads <- drf:::validate_num_threads(num.threads)

  all.tunable.params <- c("sample.fraction", "mtry", "min.node.size", "honesty.fraction",
                          "honesty.prune.leaves", "alpha", "imbalance.penalty")

  # should we scale or not the data
  if (response.scaling) {
    Y.transformed <- scale(Y)
  } else {
    Y.transformed <- Y
  }

  data <- drf:::create_data_matrices(X.mat, outcome = Y.transformed, sample.weights = sample.weights)

  # bandwidth using median heuristic by default
  if (is.null(bandwidth)) {
    bandwidth <- drf:::medianHeuristic(Y.transformed)
  }

  args <- list(num.trees = num.trees,
               clusters = clusters,
               samples.per.cluster = samples.per.cluster,
               sample.fraction = sample.fraction,
               mtry = mtry,
               min.node.size = min.node.size,
               honesty = honesty,
               honesty.fraction = honesty.fraction,
               honesty.prune.leaves = honesty.prune.leaves,
               alpha = alpha,
               imbalance.penalty = imbalance.penalty,
               ci.group.size = ci.group.size,
               compute.oob.predictions = compute.oob.predictions,
               num.threads = num.threads,
               seed = seed,
               num_features = num.features,
               bandwidth = bandwidth,
               node_scaling = ifelse(node.scaling, 1, 0))

  if (splitting.rule == "CART") {
    ##forest <- do.call(gini_train, c(data, args))
    forest <- drf:::do.call.rcpp(drf:::gini_train, c(data, args))
  } else if (splitting.rule == "FourierMMD") {
    forest <- drf:::do.call.rcpp(drf:::fourier_train, c(data, args))
  } else {
    stop("splitting rule not available.")
  }

  class(forest) <- c("drf")
  forest[["ci.group.size"]] <- ci.group.size
  forest[["X.orig"]] <- X.mat
  forest[["is.df.X"]] <- is.data.frame(X)
  forest[["Y.orig"]] <- Y
  forest[["sample.weights"]] <- sample.weights
  forest[["clusters"]] <- clusters
  forest[["equalize.cluster.weights"]] <- equalize.cluster.weights
  forest[["tunable.params"]] <- args[all.tunable.params]
  forest[["mat.col.names"]] <- mat.col.names
  forest[["mat.col.names.df"]] <- mat.col.names.df
  forest[["any.factor.or.character"]] <- any.factor.or.character

  if (compute.variable.importance) {
    forest[['variable.importance']] <- variableImportance(forest, h = bandwidth)
  }

  forest
}
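For completeness, here is a hypothetical usage sketch of drfown on the example above. It assumes a drf backend that supports missing values in X is installed (as in the GitHub repository linked above) and that its predict method returns the weights in the same way as predictdrf does; point estimates are then obtained from the weights, but without confidence intervals:

# Hypothetical usage sketch of drfown (no confidence intervals), assuming a
# drf backend that handles NAs in X and a predict method returning weights
fit <- drfown(X = X, Y = Y, min.node.size = 5,
              splitting.rule = "FourierMMD", num.features = 10)
w <- predict(fit, newdata = x)$weights[1, ]

sum(w * Y)                     # estimate of the conditional expectation
sum(w * Y^2) - sum(w * Y)^2    # estimate of the conditional variance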

Citations

[1] Twala, B. E. T. H., M. C. Jones, and David J. Hand. Good methods for coping with missing data in decision trees. Pattern Recognition Letters 29, 2008.


I am a researcher with a PhD in statistics and always happy to study and share research, data-science skills, deep math and life-changing views.