Iterative Imputation with Scikit-learn

Enhancing model results with an advanced imputation strategy

T.J. Kyner
Towards Data Science


Introduction

While having a nice, clean dataset with minimal preprocessing needs is the ideal scenario for many data scientists, real-world data is typically anything but ideal. Certain preprocessing steps such as normalization and transformation aid in creating the best model possible but are technically optional — that is, a model can be created without them if one ignores the differences in output quality. One issue that commonly arises and cannot be ignored, however, is missing data.

The process of filling in missing data with a given statistical method is known as imputation and comes in a variety of flavors. In this article, I’ll discuss some of the most common imputation methods and compare them with a more advanced method, iterative imputation, that can lead to enhanced model results.

Common Imputation Methods

Some of the most common imputation methods involve filling in missing data with either the mean or the median of a given variable, based on the data that does exist. Deciding between the two depends largely on the data being worked with. Where the data is skewed one way or the other, the median is likely more appropriate. Conversely, normally distributed data can use either the mean or the median, as the two will be roughly the same. Let’s take a look at a couple of ways to implement them.

Using NumPy and Pandas

Imputing values with NumPy and Pandas is a piece of cake. In the example below, columns A and B each have one missing value. The mean of column A is computed with NumPy’s nanmean() function, which ignores missing values in the calculation; the same process is applied to column B using nanmedian() instead. The fillna() function then fills in each column’s missing value. The output is shown below, with the right-hand side containing the imputed values for each respective column.
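The original gist isn’t reproduced here, but a minimal sketch of the approach looks something like this (the dataframe values are illustrative stand-ins, not the article’s original data):

```python
import numpy as np
import pandas as pd

# Illustrative data — columns A and B each have one missing value
df = pd.DataFrame({
    'A': [2.0, 4.0, np.nan, 8.0, 10.0],
    'B': [3.0, 5.0, 7.0, np.nan, 100.0]
})

imputed = df.copy()
# nanmean() and nanmedian() ignore missing values in the calculation
imputed['A'] = df['A'].fillna(np.nanmean(df['A']))
imputed['B'] = df['B'].fillna(np.nanmedian(df['B']))

# Side-by-side view of the original and imputed dataframes
print(pd.concat([df, imputed], axis=1, keys=['original', 'imputed']))
```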

Image by author. The original and imputed dataframes using NumPy and Pandas.

Using SimpleImputer

Scikit-learn’s SimpleImputer (view documentation) is another way to impute missing values. While it may seem slightly more convoluted than the example with NumPy and Pandas, there are a few key benefits to using SimpleImputer. First, the missing value can be set to whatever value you’d like and does not have to be equivalent to np.nan as it does when using the fillna() function in Pandas. Additionally, the imputation strategy can easily be switched between the following four options simply by altering the strategy argument (a sketch follows the list below):

  • "mean" — replaces missing values with the mean
  • "median" — replaces missing values with the median
  • "most_frequent" — replaces missing values with the most frequent value
  • "constant" — replaces missing values with whatever value is specified in the fill_value argument. This could be useful in a scenario in which you want to replace missing values with a string saying “missing” rather than an actual value.
Image by author. The original and imputed dataframes using SimpleImputer in scikit-learn.

Iterative Imputation

Useful only when working with multivariate data, the IterativeImputer in scikit-learn (view documentation) utilizes the data available in other features in order to estimate the missing values being imputed. It does so through an…

…iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned. [Source]

If that still seems a bit abstract, hopefully the following example will help clear things up. Because the IterativeImputer is still experimental, importing enable_iterative_imputer is required before it can be used.
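The original gist isn’t shown here, but a minimal sketch along the same lines, using made-up values that follow the pattern described below (column B as the square of column A, with the entry at A = 4 missing), would be:

```python
import numpy as np
import pandas as pd
# This experimental-feature import must come before importing IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative data: B = A**2, with B missing where A = 4 (true value 16.0)
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'B': [1.0, 4.0, 9.0, np.nan, 25.0, 36.0]
})

# Each feature with missing values is modeled as a function of the others
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Note that the imputed value for B is estimated from column A rather than from column B’s own mean or median, which is the whole point of the iterative approach.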

Image by author. The original and imputed dataframes using IterativeImputer in scikit-learn.

The dummy data was constructed so that column B is the square of column A. While not perfect, the IterativeImputer filled in a value reasonably close to the “true” value of 16.0. Comparing that result to what a mean imputation strategy (12.7) or a median imputation strategy (9.0) would have produced clearly shows the benefit of an iterative imputation strategy in this case.

A Comparison using Real Data

I recently had an opportunity to put the IterativeImputer to the test with a real-world dataset while creating a next-day prediction model for rain in Australia. While I won’t be detailing the entire project here, it serves as a good example of how iterative imputation can be more beneficial than some of the simpler strategies. Using the weatherAUS.csv file from the source dataset, the continuous features are imputed below using three different strategies:

  1. Mean imputation
  2. Median imputation
  3. Iterative imputation

I chose to compare the Pressure9am and Pressure3pm features as they are directly related to one another and exhibit a linear relationship which will be useful for evaluation purposes. The code below imputes the missing data with the three different strategies, plots the data along with a regression line, and then displays the root mean squared error (RMSE, lower is better).
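The original notebook code isn’t included here, but a sketch of the comparison, assuming the two pressure columns are imputed and evaluated on their own, might look like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv('weatherAUS.csv')
cols = ['Pressure9am', 'Pressure3pm']

strategies = {
    'Mean': SimpleImputer(strategy='mean'),
    'Median': SimpleImputer(strategy='median'),
    'Iterative': IterativeImputer(random_state=0),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
for ax, (name, imputer) in zip(axes, strategies.items()):
    # Impute the two pressure features, then fit a regression line between them
    data = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
    X, y = data[['Pressure9am']], data['Pressure3pm']
    reg = LinearRegression().fit(X, y)
    rmse = np.sqrt(mean_squared_error(y, reg.predict(X)))
    ax.scatter(X, y, s=2, alpha=0.3)
    ax.plot(X, reg.predict(X), color='red')
    ax.set(title=f'{name} imputation (RMSE: {rmse:.3f})', xlabel='Pressure9am')
axes[0].set_ylabel('Pressure3pm')
plt.tight_layout()
plt.show()
```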

Image by author. Plots comparing the mean, median, and iterative imputation strategies along with the RMSE values.

Without even paying attention to the RMSE values, the iterative imputation strategy should stand out as the better fit just by looking at the plots. Since the mean and median strategies fill in all missing values with a single constant, a cross-like shape forms near the center of the data that does not necessarily fit the overall trend. The iterative imputation strategy, however, is able to use the information contained in other features to approximate each missing value instead, leading to a much cleaner plot that more accurately fits the trend.

You may be thinking that an improvement in the RMSE from 1.874 to 1.871 is not that big of a deal — and you’d be right. While not anything spectacular on its own, there are a couple of things to keep in mind:

  1. The amount of missing data in the Pressure9am and Pressure3pm features was only about 10%. As a result, the RMSE values can only improve so much compared to the mean and median strategies.
  2. This comparison is only looking at two features while the full dataset contains many more. Small improvements across each of these features can lead to a large improvement overall when using all of the data in the modeling process.

Conclusion

Simple imputation strategies such as using the mean or median can be effective when working with univariate data. When working with multivariate data, more advanced imputation methods such as iterative imputation can lead to even better results. Scikit-learn’s IterativeImputer provides a quick and easy way to implement such a strategy.

Github: https://github.com/tjkyner
Medium: https://tjkyner.medium.com/
LinkedIn: https://www.linkedin.com/in/tjkyner/
