Stop Using Mean to Fill Missing Data

Dario Radečić
Towards Data Science
5 min read · Sep 17, 2019


Mean imputation was the first ‘advanced’ (sighs) method for dealing with missing data that I ever used. In a way, it is a huge step up from filling missing values with 0 or an arbitrary constant such as -999 (please don’t do that).

However, it still isn’t an optimal method, and today's post will show you why.

Photo by Pietro Jeng on Unsplash

The Dataset

For this article, I’ve chosen the Titanic dataset, mainly because it’s well known and its Age column contains some missing values. To start, let’s import everything needed and do some basic data cleaning. Here are the imports:
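As a minimal sketch, assuming the dataset is saved locally as a CSV (the filename `titanic.csv` and the helper `load_titanic` are my placeholders, not something fixed by the dataset), loading could look like this:

```python
import pandas as pd


def load_titanic(path="titanic.csv"):
    """Read the Titanic CSV into a DataFrame.

    The default path is an assumed local filename; point it at
    wherever your copy of the dataset lives.
    """
    return pd.read_csv(path)
```

Calling `load_titanic().head()` then shows the first five rows.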

When you check the head of the dataset you will get the following:

As for data cleaning, the following should be done:

  1. PassengerId and Ticket should be dropped: the first is just an arbitrary integer value, and the second is close to unique per passenger.
  2. Values from the Sex column should be remapped to 0 and 1 instead of ‘male’ and ‘female’.
  3. The person’s title (e.g. Mr., Mrs., Miss.) should be extracted from Name and then converted to 0 or 1: 0 if the title is common (Mr., Miss.) and 1 if it isn’t (Dr., Rev., Capt.). Finally, Name should be dropped.
  4. Cabin should be replaced with Cabin_Known: 0 if the value is NaN, 1 otherwise.
  5. Dummy columns should be created from Embarked, and the first dummy column should be dropped to avoid collinearity issues.

Here’s a short code snippet for achieving this:
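A minimal sketch of the five steps, assuming the standard Kaggle Titanic column names. The helper name `clean_titanic`, the output column `Title_Unusual`, the male = 0 / female = 1 direction, and the exact set of "common" titles are my assumptions for illustration:

```python
import pandas as pd

# Assumption: which titles count as "common"; adjust to taste.
COMMON_TITLES = {"Mr", "Mrs", "Miss"}


def clean_titanic(df):
    """Return a cleaned copy of the raw Titanic DataFrame."""
    df = df.copy()

    # 1. Drop identifier-like columns
    df = df.drop(columns=["PassengerId", "Ticket"])

    # 2. Encode Sex as 0/1 (assumed direction: male=0, female=1)
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

    # 3. Extract the title (text between the comma and the first period),
    #    flag uncommon titles with 1, then drop Name
    titles = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    df["Title_Unusual"] = (~titles.isin(COMMON_TITLES)).astype(int)
    df = df.drop(columns=["Name"])

    # 4. Replace Cabin with a known/unknown indicator
    df["Cabin_Known"] = df["Cabin"].notnull().astype(int)
    df = df.drop(columns=["Cabin"])

    # 5. One-hot encode Embarked, dropping the first dummy column
    dummies = pd.get_dummies(
        df["Embarked"], prefix="Embarked", drop_first=True, dtype=int
    )
    df = pd.concat([df.drop(columns=["Embarked"]), dummies], axis=1)

    return df
```

Working on a copy keeps the raw DataFrame intact, which is handy when you want to compare several imputation strategies against the same starting point.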

Mean Imputation

The dataset is now clean-ish, but it still contains missing values in the Age column:
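One way to check this, sketched with a hypothetical helper `missing_summary` and a toy DataFrame whose values are made up purely for illustration:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count missing values per column and express them as a percentage."""
    counts = df.isnull().sum()
    return pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })


# Toy data (made-up values), standing in for the cleaned Titanic DataFrame
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})
summary = missing_summary(toy_df)
print(summary)
```

On the real data, the same call on the cleaned DataFrame shows how many Age values are missing.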
