
How to Use Python and MissForest Algorithm to Impute Missing Data

Step-by-step guide on using Random Forests to handle missing data

Missing value imputation is an age-old question in data science and machine learning. Techniques range from simple mean/median imputation to more sophisticated methods based on machine learning. How much of an impact does the choice of approach have on the final results? As it turns out, a lot.

Photo by Ryoji Iwata on Unsplash


Let’s get a couple of things straight – missing value imputation is domain-specific more often than not. For example, a dataset might contain missing values because a customer isn’t using some service, so imputation would be the wrong thing to do.

Further, simple techniques like mean/median/mode imputation often don’t work well. It’s easy to see why: extreme values can skew the average, the mean in particular. Also, filling 10% or more of the data with the exact same value doesn’t sound too peachy, at least for continuous variables.
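To see the problem concretely, here’s a minimal sketch of mean imputation with Pandas (the column and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A toy column with two missing values
df = pd.DataFrame({'sepal_length': [5.1, np.nan, 4.7, np.nan, 5.0]})

# Every missing entry gets the exact same value - the column mean
df['sepal_length'] = df['sepal_length'].fillna(df['sepal_length'].mean())
```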

The article is structured as follows:

  • Problems with KNN imputation
  • What is MissForest?
  • MissForest in practice
  • MissForest evaluation
  • Parting words

Problems with KNN imputation

Even some machine learning-based imputation techniques have issues. For example, KNN imputation is a great stepping stone from simple average imputation, but it poses a couple of problems:

  • You need to choose a value for K – not an issue for small datasets
  • It’s sensitive to outliers because it uses Euclidean distance under the hood
  • It can’t be applied to categorical data directly, as some form of conversion to a numerical representation is required
  • It can be computationally expensive, depending on the size of your dataset

Don’t get me wrong, I would pick KNN imputation over a simple average any day, but there are still better methods. If you want to find out more on the topic, here’s my recent article:

How to Handle Missing Data with Python and KNN


What is MissForest?

MissForest is a machine learning-based imputation technique that uses a Random Forest algorithm under the hood. The approach is iterative: with each iteration, the predictions for the missing values improve. You can read more about the theory behind the algorithm below, as Andre Ye provides great explanations and beautiful visuals:

MissForest: The Best Missing Data Imputation Algorithm?

This article aims more towards practical application, so we won’t dive too much into the theory. To summarize, MissForest is excellent because:

  • It doesn’t require extensive data preparation – a Random Forest can determine which features are important
  • It doesn’t require any tuning – unlike K in K-Nearest Neighbors
  • It doesn’t care about categorical data types – Random Forests know how to handle them

Next, we’ll dive deep into a practical example.


MissForest in practice

We’ll work with the Iris dataset for the practical part. The dataset doesn’t contain any missing values, but that’s the whole point. We will produce missing values randomly, so we can later evaluate the performance of the MissForest algorithm.

Before I forget, please install the required library by executing pip install missingpy from the Terminal.

Great! Next, let’s import NumPy and Pandas and read in the Iris dataset. We’ll also make a copy of the dataset so that we can evaluate against the real values later on:
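Here’s a minimal sketch of that step – the CSV file name is an assumption, so adjust the path to wherever your copy of the dataset lives:

```python
import numpy as np
import pandas as pd

# Read in the Iris dataset (file path is an assumption)
iris = pd.read_csv('iris.csv')

# Keep an untouched copy for the evaluation later
iris_orig = iris.copy()
```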


All right, let’s now make two lists of random numbers ranging from zero to the length of the Iris dataset. With some Pandas manipulation, we’ll replace the values of sepal_length and petal_width with NaNs at the randomly generated index positions:
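A sketch of how that might look – the sample size of 15 per column is an assumption:

```python
# 15 random row positions per column - np.random.randint can return
# duplicates, so a column can end up with fewer than 15 NaNs
sepal_length_idx = np.random.randint(0, len(iris), size=15)
petal_width_idx = np.random.randint(0, len(iris), size=15)

# Replace the values at those positions with NaNs
iris.loc[sepal_length_idx, 'sepal_length'] = np.nan
iris.loc[petal_width_idx, 'petal_width'] = np.nan
```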


As you can see, the petal_width column contains only 14 missing values. That’s because the randomization process produced two identical random numbers. It doesn’t pose any problem for us, as the exact number of missing values is arbitrary anyway.

The next step is to, well, perform the imputation. We’ll have to remove the target variable from the picture too. Here’s how:
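Here’s a minimal sketch, assuming the target column is named species, as in the standard Iris CSV:

```python
from missingpy import MissForest

# Remove the target variable - we only want to impute the features
X = iris.drop('species', axis=1)

# Fit the imputer and fill in the missing values;
# the result comes back as a NumPy array
imputer = MissForest()
X_imputed = imputer.fit_transform(X)
```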

And that’s it – missing values are now imputed!

But how do we evaluate the damn thing? That’s the question we’ll answer next.


MissForest evaluation

To perform the evaluation, we’ll make use of our copied, untouched dataset. We’ll add two additional columns representing the imputed columns from the MissForest algorithm – both for sepal_length and petal_width.

We’ll then create a new dataset containing only these two columns – in the original and imputed states. Finally, we will calculate the absolute errors for further inspection.

Here’s the code:
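What follows is a sketch of that evaluation, assuming the columns in X_imputed keep the original feature order (sepal_length first, petal_width last):

```python
# Map the imputed columns back by position
iris_orig['sepal_length_imputed'] = X_imputed[:, 0]
iris_orig['petal_width_imputed'] = X_imputed[:, 3]

# Keep only the relevant columns and compute absolute errors
eval_df = iris_orig[['sepal_length', 'sepal_length_imputed',
                     'petal_width', 'petal_width_imputed']].copy()
eval_df['sepal_length_error'] = (eval_df['sepal_length'] - eval_df['sepal_length_imputed']).abs()
eval_df['petal_width_error'] = (eval_df['petal_width'] - eval_df['petal_width_imputed']).abs()

# Select only the rows on which imputation was performed
eval_df = eval_df[iris['sepal_length'].isna() | iris['petal_width'].isna()]
```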

As you can see, the last line of code keeps only those rows on which imputation was performed. Printing eval_df shows how close the imputed values are to the originals.


All absolute errors are small – well within a single standard deviation of the original column’s average. The imputed values look natural if you don’t take the added decimal places into account. That can be easily fixed if necessary:
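For instance, with a hypothetical one-liner that rounds to the precision of the original measurements:

```python
# Round the imputed values to one decimal place,
# matching the precision of the original data
eval_df = eval_df.round(1)
```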


Parting words

This was a short, simple, and to-the-point article on missing value imputation with machine learning methods. You’ve learned why machine learning beats a simple average in this realm and why MissForest outperforms the KNN imputer.

I hope it was a good read for you. Take care.




Originally published at https://betterdatascience.com on November 5, 2020.

