What’s the best way to handle NaN values?

Vasile Păpăluță
Towards Data Science
6 min read · Jan 12, 2020


During my practice in big data analysis I ran into an obstacle that can derail any data analysis process. In data science and machine learning this obstacle is known as NaN values.

What are NaN values?

NaN (Not a Number) is a special value in pandas DataFrames and NumPy arrays that represents a missing value in a cell. Programming languages have similar concepts; in Python, for example, a missing value can be represented as None.

You may think that None (or NaN) values are just zeroes because they represent the absence of a value. But here is the trap: the difference between zero and None is that zero is a value (an integer or a float, for example), while None represents the absence of any value. Here is a meme that explains the difference between None and zero very well:

The difference between 0 and NULL (Source — Google)
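Beyond the meme, a quick way to see the difference in Python (a minimal sketch; np.nan and None are the usual markers pandas treats as missing):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, None, 2.5])

print(s.isna())          # False, True, True, False: 0 is a value, NaN and None are not
print(s.sum())           # 2.5: pandas skips missing values, it does not treat them as 0
print(np.nan == np.nan)  # False: NaN is not even equal to itself
```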

How are NaN values dangerous?

NaN values are dangerous in two ways:

  • They change metrics such as the mean or the median, and therefore feed wrong information to the analyst.
  • Most scikit-learn algorithms can't be fit on datasets that contain such values (try fitting a DecisionTreeClassifier on the heart-disease dataset). A small sketch of both problems follows this list.
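Here is the sketch (hedged: the exact error message depends on the scikit-learn version, and recent tree estimators can handle NaN natively, while most other estimators still cannot):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

ages = np.array([25.0, 30.0, np.nan, 40.0])
print(np.mean(ages))     # nan: a single NaN poisons the plain mean
print(np.nanmean(ages))  # 31.67: only the observed values are averaged

X = np.array([[1.0, 2.0], [np.nan, 3.0]])
y = np.array([0, 1])
try:
    DecisionTreeClassifier().fit(X, y)
except ValueError as err:
    print(err)           # typically "Input contains NaN" on older scikit-learn versions
```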

How to deal with them?

So, if NaN values are so dangerous to the work of data scientists, what should we do with them? There are a few solutions:

  • Erase the rows that have NaN values. This is usually not a good choice, because we lose information, especially when working with small datasets.
  • Impute the NaN values with specific values or methods (see the sketch after this list). This article focuses on these imputation methods.
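Here is a minimal sketch of the two options on a toy DataFrame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "income": [3000, 3500, np.nan, 4200]})

dropped = df.dropna()           # option 1: only 2 of the 4 rows survive, information is lost
imputed = df.fillna(df.mean())  # option 2: gaps are filled and every row is kept

print(dropped.shape, imputed.shape)  # (2, 2) (4, 2)
```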

There are many ways to fill these gaps, and in most cases data scientists, especially newbies, don't know them. Here are the main ones:

  • Impute them with a specific value.
  • Impute them with a statistical metric, for example the mean or the median.
  • Impute them using a method such as MICE or KNN.

So let’s see how every method works and how they affect the dataset.

The experiment!

To evaluate every method I chose the Iris dataset, perhaps the most common dataset for testing in machine learning. I also tried these methods on bigger and more complex datasets, but for some algorithms the imputation took too long.

First, for every feature column in this dataset I generated 15 random, unique indexes between 0 and 149. Using these indexes (generated separately for every column), I replaced the corresponding column values with NaN.

After applying every method to these NaN values, I used the Mean Squared Error (MSE) to check the “accuracy” of each one.
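A rough sketch of this setup (the random seed and the variable names here are my own assumptions, since the original code is not shown):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
original = iris.data.copy()        # 150 rows, 4 feature columns
corrupted = original.copy()

rng = np.random.default_rng(42)    # assumed seed, just for reproducibility
nan_indexes = {}
for col in corrupted.columns:
    idx = rng.choice(150, size=15, replace=False)  # 15 unique row indexes per column
    corrupted.loc[idx, col] = np.nan
    nan_indexes[col] = idx         # remembered so the imputations can be scored later
```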

So, let’s start.

Method 1: Imputation with specific values.

In this method NaN values are replaced with a specific value (a number, for example); in most cases this is 0. Sometimes it is the best option, for example when your feature is the amount of money spent on sweets, and sometimes it is the worst one, for example for age.
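In pandas this is a one-liner; a minimal sketch, continuing the hypothetical corrupted DataFrame from above:

```python
from sklearn.impute import SimpleImputer

# Method 1: replace every NaN with the constant 0
filled_zero = corrupted.fillna(0)

# the equivalent with scikit-learn's SimpleImputer (returns a NumPy array)
filled_zero_sk = SimpleImputer(strategy="constant", fill_value=0).fit_transform(corrupted)
```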

Now let’s see how it affects the initial dataset:

The imputed values are represented as stars (*) and normal values as dots.

As you can see, filling the NaN values with zero strongly distorts the columns where a value of 0 is impossible. This distortion of the feature space would strongly affect the results, depending on the algorithms used, especially KNN and DecisionTreeClassifier.

Hint: we can check whether zero is a reasonable choice by applying the .describe() method to our DataFrame. If the minimum value equals 0, then 0 could be a good choice; if not, you should go for another option.
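For example (a short sketch, again on the hypothetical corrupted DataFrame):

```python
print(corrupted.describe().loc["min"])
# If a column's minimum is 0, imputing with 0 may be plausible;
# if the minimum is well above 0 (as for the iris features), 0 is clearly out of range.
```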

Method 2: Metric imputation.

Metric imputation fills NaN values with a statistical metric that depends on your data: the mean or the median, for example.

The mean is the sum of all values in a series divided by the number of values, and it is one of the most used metrics in statistics. But why impute NaN values with the mean? The mean has an interesting property: it doesn't change if you add more mean values to your series.
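A sketch of mean imputation (median imputation, described below, works the same way, just with .median() or strategy="median"):

```python
from sklearn.impute import SimpleImputer

# Method 2: fill every NaN with its own column's mean
filled_mean = corrupted.fillna(corrupted.mean())

# or with scikit-learn
filled_mean_sk = SimpleImputer(strategy="mean").fit_transform(corrupted)
```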

The data visualisation after mean imputation.

On the plot above you can see that it doesn't affect the structure of the dataset too much, and, most importantly, it doesn't place any sample from one class into the zone of another class.

The median splits the numbers into two halves with an equal number of samples. Sometimes in statistical analysis the median is more informative than the mean, because it is not skewed as much by extreme values. In a perfectly normal distribution (which, strictly speaking, doesn't exist in real data) the mean and the median are equal.

Because perfectly normal distributions don't really occur, in most cases the mean and the median are very close to each other rather than equal.

The data visualisation after median imputation.

From the plot above it's clear how the median works: if you look carefully, all stars (values that were imputed) are aligned on 2 orthogonal lines (5.6 for sepal_length and 3.0 for sepal_width).

Method 3: Imputing with KNN

This imputation algorithm is very similar to the KNearestNeighbors estimators from sklearn. It finds the k closest samples in the dataset to the sample with the NaN value, and imputes it with the mean of those samples' values for that feature.

This method is implemented in the impyute library and in sklearn (when I started writing this article I didn't know about the sklearn implementation).
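A minimal sketch with the scikit-learn implementation (KNNImputer, available since scikit-learn 0.22), again on the hypothetical corrupted DataFrame:

```python
from sklearn.impute import KNNImputer

# Method 3: each NaN becomes the mean of that feature over the 3 nearest samples,
# where distances are computed on the features observed in both rows
filled_knn = KNNImputer(n_neighbors=3).fit_transform(corrupted)
```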

The data visualisation after KNN imputation with 3 neighbours.

From the plot above we can see a critical error: a sample from the red class is placed in the blue-green zone (this is the KNN imputation with 3 neighbours).

Method 4: Imputing with MICE

And the last algorithm for this article, and the best one I know right now: Multiple Imputation by Chained Equations (MICE). For every column that has missing values, this algorithm fits a linear regression on the values that are present, using the other columns as predictors. It then uses these linear models to impute the NaN values with their predictions, cycling over the columns until the imputations stabilise.
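scikit-learn exposes a MICE-style algorithm as IterativeImputer (still marked experimental); here is a minimal sketch, with the default BayesianRidge regressor standing in for the linear regressions:

```python
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Method 4: model each column with missing values as a regression on the other
# columns and cycle through the columns until the imputed values stabilise
filled_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(corrupted)
```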

The data visualisation after MICE imputation.

We can see on the plot above that it doesn't strongly affect the dataset's representation in the 2D plot. But why?

KNN and MICE imputations use the whole dataset to replace a NaN value, while median and mean imputation use only the column that contains the missing value. That's why the whole-dataset methods affect the structure of the dataset less and lose less of its information.

But what are the figures saying?

The final word goes to the figures. To find out how well the methods mentioned above work, I used the MSE (Mean Squared Error), calculated between the original values and the imputed ones.

The table above shows the MSE for every method (for KNN and MICE I used two versions: one that includes the target value (with y) and one that doesn't).
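The scoring itself takes only a few lines; here is a sketch under the same assumptions as the setup code above (nan_indexes remembers which rows were blanked out in each column):

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

def imputation_mse(original, filled, nan_indexes):
    """Per-column MSE between the true values and the imputed ones."""
    filled = pd.DataFrame(filled, columns=original.columns)
    return {col: mean_squared_error(original.loc[idx, col], filled.loc[idx, col])
            for col, idx in nan_indexes.items()}

print(imputation_mse(original, filled_mice, nan_indexes))
```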

Below is the algorithm that worked best for each column:

  • sepal_length — MICE_y
  • sepal_width — KNN4_y
  • petal_length — MICE
  • petal_width — MICE

Exactly as I expected: MICE worked best in the majority of cases.

Last words!

The conclusion I draw from this experiment is that the best ways to impute continuous values are those that, like MICE or KNN, use the whole dataset rather than a single column.

Thank you for reading!



A young student passionate about Data Science and Machine Learning, dreaming of one day becoming an AI Engineer.