
4 Techniques To Deal With Missing Data in Datasets

Simple methods that can nullify the effects of missing values


Photo by Randy Fath on Unsplash

Missing data is a problem for every data scientist: it can prevent us from carrying out the analysis we want or from running a certain model. In this article, I will discuss simple methods for dealing with missing values. To preface, however, there is no ‘official’ best way to deal with null data. Typically, the best way to handle this scenario is to understand where the data comes from and what it means; this is referred to as domain knowledge. Nevertheless, let’s begin.

In this article, we will be using the famous and amazing Titanic dataset (CC0 Licence). I am sure you have all heard of it. The dataset is as follows:

import pandas as pd
data = pd.read_csv('test.csv')
data.info()
Image by author.
data.isnull().sum()
Image by author.

As we can see, the missing data is only in the ‘Age’ and ‘Cabin’ columns. These are float and categorical data types respectively, so we have to handle the two columns differently.
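A quick way to quantify this is to take the column-wise mean of the null mask, which gives the fraction of missing values per column. A small sketch on a toy frame (standing in for the Titanic data, since the exact percentages depend on your copy of the CSV):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame: two columns with holes
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Cabin": ["C85", None, None, "E46"],
})

# isnull() gives a boolean mask; its mean is the fraction of nulls
pct_missing = toy.isnull().mean() * 100
print(pct_missing)
```

On the real dataset, the same one-liner immediately shows that ‘Cabin’ is far more sparsely populated than ‘Age’.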


1. Delete the Data

The easiest method is to simply delete every training example where one or more columns have null entries.

data = data.dropna()
data.isnull().sum()
Image by author.

There are now no null entries! However, there is no free lunch. Take a look at how many training examples are left:

Image by author.

There are only 87 examples left! Originally there were 418, so we have reduced our dataset by around 80%. This is far from ideal here, but for other datasets the approach could be perfectly reasonable. As a rule of thumb, I would cap the reduction at about 5%; beyond that you risk losing valuable data that will affect the training of your model.
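Deletion does not have to be all-or-nothing: dropna can be scoped with subset= (drop a row only when the named columns are null) or thresh= (keep rows that still have a minimum number of non-null values). A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Cabin": ["C85", "B20", None],
    "Fare": [7.25, 8.05, np.nan],
})

# Drop rows only where 'Age' is missing, ignoring holes elsewhere
by_age = df.dropna(subset=["Age"])
print(len(by_age))  # 2 rows survive

# Keep rows that have at least 2 non-null values
by_thresh = df.dropna(thresh=2)
print(len(by_thresh))  # 2 rows survive
```

Scoping the drop to the columns your model actually uses usually throws away far fewer examples than a blanket dropna().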

2. Imputing Averages

The next method is to assign some average value (mean, median, or mode) to the null entries. Let’s take a look at the following snippet from the data:

data[100:110]
Image by author.

For the ‘Age’ column, we can fill the null entries with the column mean (numeric_only=True restricts the calculation to numeric columns, which newer versions of pandas require here):

data.fillna(data.mean(numeric_only=True), inplace=True)
Image by author.

The average age of 30 has now been added to the null entries. Notice that the ‘Cabin’ entries are still NaN: you can’t calculate a mean for an object (categorical) datatype. This can be fixed by imputing the column’s mode instead:

data['Cabin'] = data['Cabin'].fillna(data['Cabin'].value_counts().index[0])
Image by author.
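The same idea works with other averages: the median is more robust to outliers than the mean, and mode() is the idiomatic way to get the most frequent value of a categorical column. A small sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 90.0],
    "Cabin": ["C85", None, "C85", "E46"],
})

# Median is less sensitive to extreme ages (like 90) than the mean
df["Age"] = df["Age"].fillna(df["Age"].median())

# mode() returns a Series (ties are possible), so take the first value
df["Cabin"] = df["Cabin"].fillna(df["Cabin"].mode()[0])

print(df)
```

Here the null age becomes 40 (the median), while a mean fill would have produced 50 because of the single 90-year-old.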

3. Assign New Category

Regarding the ‘Cabin’ feature, it has only 91 non-null entries, which is about 22% of the total examples. Therefore, the mode value that we previously calculated is not very reliable. A better way is to assign the NaN values their own category:

data['Cabin'] = data['Cabin'].fillna('Unknown')
Image by author.

As we no longer have any NaN values, machine learning algorithms can now use this dataset. However, they will treat ‘Unknown’ as its own category in the ‘Cabin’ column, even though no such cabin existed on the Titanic.
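To see what “its own category” means in practice: once the fill is applied, one-hot encoding gives ‘Unknown’ its own indicator column, just like any real cabin. A sketch on a toy column:

```python
import pandas as pd

cabin = pd.Series(["C85", None, "E46", None], name="Cabin")
cabin = cabin.fillna("Unknown")

# One-hot encode: 'Unknown' gets its own indicator column
encoded = pd.get_dummies(cabin)
print(encoded.columns.tolist())  # ['C85', 'E46', 'Unknown']
```

Whether that extra category helps or hurts depends on whether missingness itself carries signal (on the Titanic, a missing cabin plausibly does).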

4. Certain Algorithms

The final technique is to do nothing. The majority of machine learning algorithms do not work with missing data, but some, such as K-Nearest Neighbours, Naive Bayes, and XGBoost, can handle it. There is plenty of literature online about these algorithms and their implementations.

Conclusion

There are many ways to deal with missing data, and some methods are better than others depending on the type of data and the amount that is missing. There are also more sophisticated ways to impute missing data that I have not covered here, but the techniques above are a great place to start.

For the full code, please see my GitHub:

Medium-Articles/Dealing With Missing Data.ipynb at main · egorhowell/Medium-Articles

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack
