How to Handle Missing Data

The fastest multiple imputation method using XGBoost

Michael Berk
Towards Data Science

--

Missing data sucks. It prevents the use of certain models and often requires complex judgement calls by the engineer. However, in 2021, researchers at the University of Auckland developed a solution…

Figure 1: missing data for a large dataset. Image by author.

Their method leverages the world-famous XGBoost algorithm to impute missing data. By relying on a model that’s optimized for speed, we can see 10–100x performance boosts relative to traditional imputation methods. XGBoost also requires little to no hyperparameter tuning which significantly reduces engineer workload. XGBoost is also able to maintain complex relationships observed in the data, such as interactions and non-linear relationships.

So, if your dataset has more than 3,000 rows, you may want to consider using XGBoost to impute missing data. Here’s how the method works…

Technical TLDR

  1. Use XGBoost to perform multiple imputation. It’s implemented in a MICE framework — instead of using linear/logistic regression we use XGBoost.
  2. Use Predictive Mean Matching (PMM) to improve our variance estimates. PMM is required because XGBoost underestimates the variance of imputed data, leading to confidence intervals with poor coverage.

Ok that’s great, but how does this method actually work?

Let’s slow down a bit and understand why this method is so effective.

Our Goal

First, let’s start with our goal. Many real-world datasets have missing data, which causes problems for both modeling and analysis. In hopes of making our lives easier, we’re going to try to fill those missing values with realistic predictions.

Figure 2: missing data imputation visualization for a large dataset. Image by author.

One common method for filling missing data is to simply plug in the mean, median, or mode. However, as you’d expect, we gain little signal and the variance of these estimates is often too low.
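
To make that baseline concrete, here’s a minimal sketch using scikit-learn’s SimpleImputer (the toy columns below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values (columns are made up for illustration).
df = pd.DataFrame({
    "age":    [23, 31, np.nan, 45, 38, np.nan],
    "income": [48_000, np.nan, 61_000, 75_000, np.nan, 52_000],
})

# Fill every missing value in a column with that column's mean.
mean_imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

# Every imputed cell gets the same value, so the column's variance shrinks.
print(df["age"].var(), df_imputed["age"].var())
```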

The Importance of Variance

But why should we care? Well, variance is the basis of all statistical significance and confidence interval calculations.

Based on the central limit theorem, we know that the average of many samples will resemble a normal distribution. If we observe a sample whose mean is far from the center of this theoretical distribution (the population), we deem it highly unlikely and thereby statistically significant. And the range where we wouldn’t call something statistically significant is called our confidence interval.

Figure 3: sample variance formula. Image by author.

To estimate the spread of this theoretical population, we use our data’s standard deviation, which is the square root of our data’s variance, shown in figure 3. So, variance is the backbone of all confidence-based calculations.
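
For reference, the usual sample variance and standard deviation formulas (which figure 3 illustrates) are:

$$
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad s = \sqrt{s^2}
$$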

Variance of Imputed Data

When imputing data, we’re looking to estimate unobserved data using observed data. If we had a perfectly representative sample, we could perfectly impute missing data. However, samples are never perfect and are often missing crucial pieces of information about the missing data.

Due to this fact, most data imputation methods underestimate the variance of missing data.

Now, it’s really hard to systematically introduce variance in the correct way. One naive method would be to simply add some random noise to each imputed value. That would certainly make our data more diverse and probably increase the variance. But that uniformly distributed noise may not be representative of our population.
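
As a hypothetical illustration of why that naive fix falls short (the numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

observed = rng.normal(loc=50, scale=10, size=1_000)   # what we actually see
mean_imputed = np.full(200, observed.mean())          # 200 missing cells filled with the mean

# Naive fix: jitter every imputed value with uniform noise.
jittered = mean_imputed + rng.uniform(-5, 5, size=mean_imputed.size)

# The jittered values are more spread out than the plain mean fill,
# but their (uniform) spread need not match the population's shape or scale.
print(observed.std(), mean_imputed.std(), jittered.std())
```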

Here’s where mixgb comes in…

The Method

The method proposed by researchers at the University of Auckland uses the popular modeling technique XGBoost together with Predictive Mean Matching (PMM) to impute data. Let’s take a look at each one in turn.

1 — XGBoost in a MICE Framework

XGBoost is a super popular tree-based algorithm, thanks to its speed, versatility, and out-of-the-box accuracy. There’s a phenomenal explanation in the comments, but for this post you can just think of XGBoost as a black box that takes in predictors and outputs an estimate of our missing data.
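
If you just want the black-box view, here’s a minimal fit/predict sketch with the xgboost Python package (the synthetic features and settings are assumptions for illustration):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(42)

# Rows where the target column is observed are used for training...
X_observed = rng.normal(size=(500, 4))
y_observed = (X_observed[:, 0] ** 2
              + X_observed[:, 1] * X_observed[:, 2]
              + rng.normal(scale=0.1, size=500))

# ...and rows where the target column is missing are what we want to fill in.
X_missing = rng.normal(size=(50, 4))

model = XGBRegressor(n_estimators=100)   # little to no tuning required
model.fit(X_observed, y_observed)
imputations = model.predict(X_missing)   # point estimates for the missing cells
```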

The MICE framework, on the other hand, is worth walking through. MICE stands for Multiple Imputation by Chained Equations. It’s not as bad as it sounds.

Figure 4: MICE framework with 4 imputation sets. Image by author.

MICE works by creating M copies of the data. It then goes sequentially through the columns of the first copy (M1 in figure 4) and uses a linear model to predict the missing values in each column, with all the other variables in that row as predictors. MICE then repeats this process for the rest of the M datasets, resulting in M complete datasets.

From there, we take the mean of the value at each index for all M datasets and these averages become our final imputed dataset.

Now, if you’re paying close attention, you’ll notice that all the datasets will be identical. So, to give the appearance of natural variation, we just add some random noise to each prediction.

Pretty straightforward, right?

Now linear regression has its limitations — it doesn’t allow for non-linear relationships and needs manual intervention to handle interactions. XGBoost is great with both non-linear relationships and interactions, so we just use XGBoost instead of linear regression to predict our missing data.
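
To make that concrete, here’s a minimal, hypothetical Python sketch of the MICE-with-XGBoost idea (mixgb itself is an R package; the function name, loop counts, and hyperparameters below are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def mice_xgb(df: pd.DataFrame, m: int = 5, n_iter: int = 5, seed: int = 0) -> list:
    """MICE-style multiple imputation sketch with XGBoost (numeric columns only).

    Returns m imputed copies of df. This illustrates the framework described
    above; it is not the mixgb implementation.
    """
    rng = np.random.default_rng(seed)
    missing_mask = df.isna()
    incomplete_cols = [c for c in df.columns if missing_mask[c].any()]
    completed = []

    for _ in range(m):
        # Start from a rough fill (column means) so every predictor is complete.
        data = df.fillna(df.mean())

        for _ in range(n_iter):
            # Cycle through the incomplete columns ("chained equations").
            for col in incomplete_cols:
                miss = missing_mask[col]
                X_obs = data.loc[~miss].drop(columns=col)
                y_obs = data.loc[~miss, col]
                X_mis = data.loc[miss].drop(columns=col)

                # Swap linear regression for XGBoost so non-linearities and
                # interactions are handled without manual feature engineering.
                model = XGBRegressor(n_estimators=100,
                                     random_state=int(rng.integers(1_000_000)))
                model.fit(X_obs, y_obs)
                data.loc[miss, col] = model.predict(X_mis)

        # Without added noise or PMM (next section), the m copies can end up
        # nearly identical, which is exactly the variance problem discussed above.
        completed.append(data)

    return completed
```

Following the averaging step described above, the m copies of a numeric dataset could then be combined with something like `sum(sets) / len(sets)`.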

2 — Predictive Mean Matching to Handle Low Variance

Now, XGBoost is limited to the data we give it, so it often underestimates the variance of our predictions. To increase the variance, we implement a method called Predictive Mean Matching (PMM).

Figure 5: Predictive Mean Matching on 2 dimensions. Image by author.

PMM randomly selects one of the five nearest observed data points to our prediction. So, in figure 5 above, the green dot is our predicted value and the highlighted circles around it are the candidate values that our prediction will become.

By replacing a prediction with an observed data point, we ensure that introduced variance will have the same structure as the variance in our population.

We repeat this for all predicted values until we’ve replaced every empty data point with an observed data point near our predicted value.
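
Here’s a minimal, hypothetical sketch of that matching step for a single column (one common PMM variant matches the prediction for each missing cell against the predictions for the observed cells, then donates the corresponding observed value; k = 5 donors as in figure 5):

```python
import numpy as np

def pmm(pred_missing: np.ndarray, pred_observed: np.ndarray,
        y_observed: np.ndarray, k: int = 5, seed: int = 0) -> np.ndarray:
    """Predictive Mean Matching sketch for one column.

    pred_missing:  model predictions for the cells we need to fill
    pred_observed: model predictions for the rows where the column is observed
    y_observed:    the actual observed values for those rows
    """
    rng = np.random.default_rng(seed)
    imputed = np.empty(len(pred_missing))

    for i, p in enumerate(pred_missing):
        # The k observed rows whose predictions sit closest to this prediction...
        donors = np.argsort(np.abs(pred_observed - p))[:k]
        # ...and one of them, chosen at random, donates its real observed value.
        imputed[i] = y_observed[rng.choice(donors)]

    return imputed
```

Because every filled-in value is a real observed value, the imputed column inherits the structure of the observed data rather than the overly smooth structure of the model’s predictions.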

Summary

And, there you have it. To quickly summarize…

  • XGBoost is a highly performant algorithm that can model complex relationships in data.
  • The mixgb package leverages XGBoost to impute missing data.
  • To ensure that we can calculate accurate confidence intervals, we use predictive mean matching to increase the variance of our imputed data.

Implementation Notes

  • On smaller datasets, XGBoost can be outperformed in computation speed. The main competition comes from a Random Forest implementation, but XGBoost still performs better on datasets larger than 3,000 × 20.
  • For PMM, there are several variations of the method and none are limited to a donor pool size of 5. Other common values are 2, 3, and 10.
  • Currently, I’m unaware of a python package that supports this method.
  • In the paper, mixgb started to outperform all other methods in computation speed at a 3,915 × 20 dataset. For all larger datasets, XGBoost was the clear winner.

Thanks for reading! I’ll be writing 39 more posts that bring academic research to the DS industry. Check out my comments for links to the main source for this post as well as the R package.
