
Stop Wasting Useful Information When Imputing Missing Values

There are better ways to impute missing values than just taking the average

Missing values are one of the most common problems in data analysis and machine learning. Most machine learning models require that a dataset contains no missing values before they can be fitted to the data. It is therefore crucial that we learn how to handle missing values properly.


I put out a video a while ago about handling missing data using Pandas, and in that video I spoke about the two main ways to deal with missing values in a dataset:

  1. If there are only a few rows with missing values, or if a column has an overwhelming number of missing values, we can simply drop those rows or that column from the dataset without running the risk of losing too much information.
  2. An alternative to dropping missing values is a process called imputation, where the missing values in a dataset are replaced with substituted values.

In this article, we will examine the process of imputation in greater detail and more specifically, we will learn about the differences between a univariate approach to imputing versus a multivariate approach to imputing.

You can find the complete notebook on my GitHub here.


Introduction

Univariate imputation implies that we are only considering the values of a single column when performing imputation. Multivariate imputation, on the other hand, involves taking into account other features in the dataset when performing imputation.

The multivariate approach to imputing is generally preferred over the univariate approach: it is more robust and produces more plausible estimates of the missing values, which in turn helps our model make better predictions.

In this article, we will explore 3 different imputation techniques with reference to the Kaggle Titanic dataset:

  1. Simple imputer
  2. Iterative imputer
  3. KNN imputer

Data preparation

For the purpose of this tutorial, I went ahead and dropped the PassengerId, Name, Ticket and Cabin columns from the original dataset.

After dropping those 4 columns, we are left with the following data frame.
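As a minimal sketch of that step, assuming a local copy of the Kaggle train.csv in the working directory:

```python
import pandas as pd

# Load the Kaggle Titanic training data (assumes a local train.csv)
df = pd.read_csv("train.csv")

# Drop the four columns we will not use in this tutorial
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

df.head()
```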


Missing data

Because there are only 2 rows with missing Embarked values, I decided to drop them from the dataset. That leaves the 177 rows with missing values in the Age column to deal with.
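In code, this amounts to counting the missing values per column and then dropping the two incomplete Embarked rows:

```python
# Count missing values per column
print(df.isnull().sum())

# Drop the 2 rows where Embarked is missing
df = df.dropna(subset=["Embarked"])
```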


Explore the Age feature

Before we look at how we can impute the missing values in the Age column, let us briefly explore the Age feature.

We can observe that Fare is slightly positively correlated with Age. In other words, older passengers tended to pay higher fares for their tickets.
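We can check this directly with the pairwise Pearson correlation, which pandas computes over the rows where both values are present:

```python
# Correlation between Age and Fare (rows with NaN are excluded automatically)
print(df["Age"].corr(df["Fare"]))
```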

What about the distribution of passenger age?
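A quick histogram answers that; the number of bins is an arbitrary choice of mine:

```python
import matplotlib.pyplot as plt

# Histogram of passenger age; missing values are ignored automatically
df["Age"].plot(kind="hist", bins=30, edgecolor="black")
plt.xlabel("Age")
plt.title("Distribution of passenger age")
plt.show()
```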


Missing values imputation

Now onto the main purpose of this article. In this section, we will look at 3 different imputation techniques using the Scikit-learn library in Python.

  1. Simple imputer
  2. Iterative imputer
  3. KNN imputer

To demonstrate the differences between the 3 techniques, I have created a sample data frame as follows.
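A minimal reconstruction of that frame; I am assuming it mirrors the Age, SibSp and Fare values of the first six Titanic rows, which is consistent with the numbers quoted below:

```python
import numpy as np
import pandas as pd

# Sample frame with a missing Age on row 6 (index 5 with the default 0-based index)
sample = pd.DataFrame({
    "Age":   [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Fare":  [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})
```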

Notice the missing value on row 6 in the Age column of the data frame. Our goal here will be to use different imputation techniques to replace the missing value with a substituted value and subsequently study the differences between the techniques.

Simple imputer

Simple imputer is an example of a univariate approach to imputing missing values, i.e. it takes only a single feature into account when performing imputation.

Some of the most common strategies used with simple imputer include:

  • Mean
  • Median
  • Most frequent (mode)
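A minimal sketch using Scikit-learn's SimpleImputer with the mean strategy on the sample frame above:

```python
from sklearn.impute import SimpleImputer

# Replace each missing value with the mean of the observed values in its column
simple_imputer = SimpleImputer(strategy="mean")
result = pd.DataFrame(
    simple_imputer.fit_transform(sample),
    columns=sample.columns,
)
print(result)
```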

Here, our simple imputer has filled the missing value in the Age column with the average age of the first 5 rows, which is 31.2.

Although easy and straightforward, simple imputer is a rather blunt approach to imputing missing values. As we saw earlier, Age is positively correlated with Fare, so it would be worthwhile to also consider the values in the Fare column when performing imputation.

This is where multivariate imputation comes in: we take multiple features of the dataset into account during imputation.

Iterative imputer

Iterative imputer is an example of a multivariate approach to imputation. It models the missing values in a column using information from the other columns in a dataset. More specifically, it treats the column with missing values as the target variable, while the remaining columns are used as predictor variables to predict the target.

In our sample data frame, the Age column has one missing value on row 6 and is therefore assigned as the target variable in this scenario. This leaves the SibSp and Fare columns as our predictor variables.

Iterative imputer will use the first 5 rows of the data frame to train a predictive model. Once the model is ready, it will then use values in the SibSp and Fare columns on row 6 as inputs and predict the Age value for that row.
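A minimal sketch of this step on the sample frame; note that IterativeImputer is still marked experimental in Scikit-learn and must be enabled explicitly:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Age is treated as the target; SibSp and Fare act as predictors
iterative_imputer = IterativeImputer()
result = pd.DataFrame(
    iterative_imputer.fit_transform(sample),
    columns=sample.columns,
)
print(result)
```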

This is what the result of our iterative imputer looks like: the missing Age on row 6 is filled with the model's prediction rather than a simple column average.

KNN imputer

KNN is short for k-nearest neighbours, a machine learning algorithm that gives us another multivariate imputation technique. KNN imputer scans the dataset for the k rows nearest to the row with missing values, then fills those missing values with the average of the corresponding values in the nearest rows.

To illustrate this, here I have set k equal to 2. In other words, I want KNN imputer to impute the missing Age value on row 6 with the average age of the 2 observations that are closest to that row.

As a result, KNN imputer decides that rows 3 and 5 are the closest to row 6, so the imputed value is the average of their ages: (26 + 35) / 2 = 30.5.
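A minimal sketch with Scikit-learn's KNNImputer on the same sample frame. One caveat worth noting: distances are computed on the raw feature scales, so the Fare column largely determines which rows count as nearest here:

```python
from sklearn.impute import KNNImputer

# Fill the missing Age with the mean Age of the 2 nearest rows
knn_imputer = KNNImputer(n_neighbors=2)
result = pd.DataFrame(
    knn_imputer.fit_transform(sample),
    columns=sample.columns,
)
print(result)
```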


Model accuracy under simple imputer versus iterative imputer

Now that we have a better understanding of how the different imputers work, we can move on to applying these techniques to the Titanic dataset and comparing model accuracy under each approach.

We should expect our model to perform better under multivariate imputation than under univariate imputation, since multivariate imputation provides more accurate estimates of the missing values and thus allows our model to make better predictions.

In this section, we will build a column transformer which consists of a one-hot encoder for encoding the Sex and Embarked columns as well as an imputer to impute the missing values in the Age column.

Following that, we will chain the column transformer with a random forest classifier to predict the survival of the passengers on the Titanic.

Finally, we will perform 10-fold cross-validation to compare the prediction results under univariate imputation versus multivariate imputation.

Univariate imputation (simple imputer)
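A sketch of the univariate pipeline, assuming the prepared df from earlier; the random_state is an arbitrary choice of mine:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = df.drop(columns=["Survived"])
y = df["Survived"]

# One-hot encode the categorical columns and mean-impute Age;
# the remaining numeric columns pass through unchanged
preprocessor = ColumnTransformer(
    [
        ("encode", OneHotEncoder(), ["Sex", "Embarked"]),
        ("impute", SimpleImputer(strategy="mean"), ["Age"]),
    ],
    remainder="passthrough",
)

pipeline = make_pipeline(preprocessor, RandomForestClassifier(random_state=42))

# Mean accuracy over 10 folds
print(cross_val_score(pipeline, X, y, cv=10).mean())
```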

Multivariate imputation (iterative imputer)
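The corresponding sketch for the multivariate pipeline; note that the iterative imputer must receive the other numeric columns as well, otherwise it would have nothing to model Age from:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Same pipeline, but Age is now imputed from the other numeric features
preprocessor = ColumnTransformer(
    [
        ("encode", OneHotEncoder(), ["Sex", "Embarked"]),
        ("impute", IterativeImputer(), ["Age", "Pclass", "SibSp", "Parch", "Fare"]),
    ],
)

pipeline = make_pipeline(preprocessor, RandomForestClassifier(random_state=42))

print(cross_val_score(pipeline, X, y, cv=10).mean())
```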

The mean cross-validation score is higher under multivariate imputation than it is under univariate imputation.

Although not a drastic difference, we can conclude that our model performed better under multivariate imputation.


Conclusion

To summarise, in this article, we have discussed the differences between univariate imputation and multivariate imputation. Furthermore, we looked at 3 different imputation techniques within Scikit-learn which include simple imputer, iterative imputer and KNN imputer.

After comparing the prediction accuracy of our model using simple imputer versus iterative imputer, we can conclude that multivariate imputation results in better model predictions, as reflected in its higher mean cross-validation score.

As usual, I hope you have found this article helpful. Feel free to check out my previous articles on other applications of the Scikit-learn library in Python.

Happy learning!

Feature Selection & Dimensionality Reduction Techniques to Improve Model Accuracy

Guide to Encoding Categorical Features Using Scikit-Learn For Machine Learning

