How To Deal With Missing Data

Missing data is a problem for every data scientist: it can stop us from carrying out the analysis we want or from fitting a certain model. In this article, I will discuss simple methods for dealing with missing values. To preface, however, there is no ‘official’ best way to handle null data. Typically, the best approach is to understand where the data comes from and what it means, which is referred to as domain knowledge. Nevertheless, let’s begin.
In this article, we will be using the famous and amazing Titanic dataset (CC0 Licence). I am sure you have all heard of it. The dataset is as follows:
import pandas as pd

# Load the Titanic test split and inspect the column types
data = pd.read_csv('test.csv')
data.info()

# Count the null entries per column
data.isnull().sum()

As we can see, the missing data is only in the ‘Age’ and ‘Cabin’ columns. These are float and categorical data types respectively, so we have to handle the two columns differently.
1. Delete the Data
The easiest method is simply to delete every training example (row) in which one or more columns have null entries.
data = data.dropna()
data.isnull().sum()

There are now no null entries! However, there is no free lunch. Take a look at how many training examples are left:

There are only 87 examples left! Originally there were 418, so we have reduced our dataset by around 80%. This is far from ideal, but for other datasets this approach could be very reasonable. As a rule of thumb, I would keep the reduction to a maximum of around 5%; otherwise you may lose valuable data that will affect the training of your model.
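Before committing to `dropna()`, it is worth checking how much of the dataset you would actually lose. A minimal sketch on a small hypothetical frame (the column names mirror the Titanic data, but the values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame standing in for the Titanic data
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan],
    "Cabin": ["C85", None, None, "E46"],
})

# Measure row loss before deciding whether dropping is acceptable
rows_before = len(df)
rows_after = len(df.dropna())
loss = 1 - rows_after / rows_before
print(f"dropna() would discard {loss:.0%} of the rows")  # → 75%
```

If the printed loss is well above your tolerance (say, the 5% rule of thumb), one of the imputation methods below is probably the better choice.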
2. Imputing Averages
The next method is to assign some average value (mean, median, or mode) to the null entries. Let’s take a look at the following snippet from the data:
data[100:110]

For the ‘Age’ column, the mean can be computed as the following:
# numeric_only=True is needed in recent pandas versions, which otherwise
# raise an error when averaging object columns such as 'Cabin'
data = data.fillna(data.mean(numeric_only=True))

The average age of 30 has now been added to the null entries. Notice that the ‘Cabin’ entries are still NaN: you can’t calculate a mean for an object datatype, as it’s categorical. This can be fixed by imputing its mode instead:
# Fill only the 'Cabin' column with its most frequent value (the mode)
data['Cabin'] = data['Cabin'].fillna(data['Cabin'].value_counts().index[0])
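The article mentions three possible averages (mean, median, and mode) but only shows the first and last. A small sketch on a made-up ‘Age’ column shows how each behaves, since they can give different fill values on skewed data:

```python
import pandas as pd
import numpy as np

# Toy 'Age' column with one missing value (illustrative, not the real data)
age = pd.Series([22.0, 30.0, 30.0, 40.0, np.nan], name="Age")

# Three common averages to impute with -- choose based on the distribution
print(age.fillna(age.mean()).iloc[-1])     # mean   → 30.5
print(age.fillna(age.median()).iloc[-1])   # median → 30.0
print(age.fillna(age.mode()[0]).iloc[-1])  # mode   → 30.0
```

The median is often the safer default for skewed numeric columns, as it is not pulled towards outliers the way the mean is.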

3. Assign New Category
Regarding the ‘Cabin’ feature, it only has 91 non-null entries, which is about 22% of the total examples. Therefore, the mode value that we previously calculated is not very reliable. A better way is to assign these NaN values their own category:
data['Cabin'] = data['Cabin'].fillna('Unknown')

As we no longer have any NaN values, machine learning algorithms can now use this dataset. However, they will treat the ‘Unknown’ value in the ‘Cabin’ column as its own category, even though it never existed on the Titanic.
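To see what “its own category” means in practice, here is a sketch using one-hot encoding on a toy ‘Cabin’ column (the cabin labels are illustrative): after the sentinel fill, ‘Unknown’ simply gets its own indicator column alongside the real cabins.

```python
import pandas as pd

# Toy 'Cabin' column with missing values replaced by a sentinel category
cabin = pd.Series(["C85", None, "E46", None], name="Cabin")
cabin = cabin.fillna("Unknown")

# One-hot encoding now gives 'Unknown' its own indicator column
dummies = pd.get_dummies(cabin)
print(list(dummies.columns))  # → ['C85', 'E46', 'Unknown']
```

This keeps the “missingness” signal available to the model, which can be useful if the fact that a value is missing is itself informative.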
4. Certain Algorithms
The final technique is to do nothing. The majority of machine learning algorithms do not work with missing data. On the other hand, algorithms such as K-Nearest Neighbours, Naive Bayes, and XGBoost can handle missing data. There is plenty of literature online about these algorithms and their implementations.
Conclusion
There are many ways to deal with missing data, and certain methods are better than others depending on the type of data and the amount that is missing. There are also more sophisticated ways to impute missing data that I have not covered here, but these options are a great place to start.
For the full code, please see my GitHub:
Medium-Articles/Dealing With Missing Data.ipynb at main · egorhowell/Medium-Articles
Another Thing!
I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.