Mean imputation was the first ‘advanced’ (sighs) method of dealing with missing data I ever used. In a way, it is a huge step up from filling missing values with 0 or some arbitrary constant, -999 for example (please don’t do that).
However, it still isn’t an optimal method, and today’s post will show you why.

The Dataset
For this article, I’ve chosen to use the Titanic dataset, mainly because it’s well known and its Age column contains some missing values. To start out, let’s import everything needed and do some basic data cleaning. Here are the imports:
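A minimal set that covers everything used below (the MICE import comes later, once impyute is installed):

import pandas as pd
import matplotlib.pyplot as plt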
When you check the head of the dataset, you will get the following:
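If you’re following along, something like this does it (titanic.csv is an assumed file name – point read_csv at wherever your copy of the Kaggle train set lives):

df = pd.read_csv('titanic.csv')
df.head()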

And with regards to data cleaning, the following should be done:
- PassengerId and Ticket should be dropped – the first is just an arbitrary integer, and the second is essentially a unique label per passenger, so neither carries predictive information.
- Values from the Sex column should be remapped to 0 and 1 instead of ‘male’ and ‘female’.
- The person’s title should be extracted from Name – e.g. Mr., Mrs., Miss. – and converted into 0 and 1: 0 if the title is common (Mr., Miss.), and 1 if it isn’t (Dr., Rev., Capt.). Finally, Name should be dropped.
- Cabin should be replaced with _Cabin_Known_ – 0 if the value is NaN, 1 otherwise.
- Dummy columns should be created from Embarked, and the first dummy column should be dropped to avoid collinearity issues.
Here’s a short code snippet for achieving this:
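A sketch covering all five points (the list of ‘common’ titles and the helper column names are my choices – the original snippet isn’t shown):

# 1. Drop columns that carry no useful signal
df = df.drop(['PassengerId', 'Ticket'], axis=1)

# 2. Remap Sex to 0/1
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# 3. Extract the title, flag uncommon ones, then drop Name
titles = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
common_titles = ['Mr', 'Miss', 'Mrs', 'Master']  # assumption: what counts as common
df['Title_Unusual'] = (~titles.isin(common_titles)).astype(int)
df = df.drop('Name', axis=1)

# 4. Cabin -> Cabin_Known: 0 if NaN, 1 otherwise
df['Cabin_Known'] = df['Cabin'].notnull().astype(int)
df = df.drop('Cabin', axis=1)

# 5. Dummies from Embarked, dropping the first to avoid collinearity
embarked = pd.get_dummies(df['Embarked'], prefix='Embarked', drop_first=True).astype(int)
df = pd.concat([df.drop('Embarked', axis=1), embarked], axis=1)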
Mean Imputation
The dataset is now clean-ish, but it still contains missing values in the Age column:
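A quick isnull check confirms it – after the cleaning above, Age is the only attribute still missing values:

df['Age'].isnull().sum()  # 177 on the standard Kaggle train split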

Because I don’t want to mess up the original dataset, I will make a copy and mark it, so it’s clearly visible that mean imputation was done there. Then, I will create a new attribute – _Age_Mean_Filled_ – which will, as the name suggests, contain mean-imputed values for the _Age_ attribute:
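A minimal sketch – the copy’s name, df_mean, is my choice; the new column follows the name in the text:

df_mean = df.copy()
df_mean['Age_Mean_Filled'] = df_mean['Age'].fillna(df_mean['Age'].mean())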
Statistically speaking, here’s what mean imputation does to the dataset:
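Here’s one way to produce that comparison (column names follow the snippet above):

df_mean[['Age', 'Age_Mean_Filled']].describe()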

Yes, the mean is the same, obviously, but take a look at the 25th, 50th, and 75th percentiles. Also, look at the change in standard deviation. The conclusion is obvious – mean imputation shrinks the attribute’s variability.
Let’s explore how the distributions look:
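A plain matplotlib overlay does the job (the original plot may have used a different library; density-normalized histograms keep the two comparable):

plt.hist(df_mean['Age'].dropna(), bins=30, alpha=0.5, density=True, label='Original')
plt.hist(df_mean['Age_Mean_Filled'], bins=30, alpha=0.5, density=True, label='Mean imputed')
plt.xlabel('Age')
plt.legend()
plt.show()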

Both tails are close-ish, but the significant differences are around the middle – as you can see, mean imputation stacks every filled-in value at a single point, creating a spike at the mean.
The question now is – how to address this issue?
Introducing MICE
MICE, or Multivariate Imputation by Chained Equations (what a memorable term), is an imputation method that works by filling in the missing data multiple times. The chained equations approach also has the benefit of handling different data types efficiently – such as continuous and binary.
To quote statsmodels.org,
The basic idea is to treat each variable with missing values as the dependent variable in a regression, with some or all of the remaining variables as its predictors. The MICE procedure cycles through these models, fitting each in turn, then uses a procedure called "predictive mean matching" (PMM) to generate random draws from the predictive distributions determined by the fitted models. These random draws become the imputed values for one imputed data set.[1]
And here are some of the main advantages of using MICE, according to the National Center for Biotechnology Information:
Multiple imputation has a number of advantages over these other missing data approaches. Multiple imputation involves filling in the missing values multiple times, creating multiple "complete" datasets. Described in detail by Schafer and Graham (2002), the missing values are imputed based on the observed values for a given individual and the relations observed in the data for other participants, assuming the observed variables are included in the imputation model.[2]
Okay, enough talk, let’s do some coding! To start out, you will need to install the impyute library through pip:
pip install impyute
If you are used to mean imputation, you may think that anything else will be far more complex, at least on the implementation side. In this case, you would be terribly wrong.
Due to the simplicity of the impyute library, the MICE implementation couldn’t be any simpler. You will need to pass in raw values in array form, so a Pandas DataFrame won’t cut it by default. Luckily, you can just call .values and you’re set:
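Something like this (assuming impyute 0.x, where mice lives in impyute.imputation.cs; the variable names are mine):

from impyute.imputation.cs import mice

# impyute expects a plain numeric NumPy array - all columns are numeric
# after the cleaning above, so .values is enough
imputed = mice(df.values)

# pull out the imputed Age column for inspection
mice_ages = pd.Series(imputed[:, df.columns.get_loc('Age')], name='Age_MICE_Filled')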
This was quick.
A quick call to .describe() on _mice_ages_ will yield the following:
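Since mice_ages was stored as a pandas Series above, that’s a one-liner:

mice_ages.describe()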

Please note the minimum value – it’s -7. Because this attribute represents age, negative values don’t make any sense. This can be easily fixed with a basic list comprehension:
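What to substitute for the impossible values is a judgment call – the original fix isn’t shown, so here I fall back to the observed mean of Age:

observed_mean = df['Age'].mean()  # NaNs are skipped by default
mice_ages = pd.Series(
    [age if age > 0 else observed_mean for age in mice_ages],
    name='Age_MICE_Filled'
)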
If you now check the statistics of the original Age, and MICE imputed one, you will see that the values are much closer:
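One way to line the two summaries up side by side:

pd.concat([df['Age'].describe(), mice_ages.describe()], axis=1)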

And a quick distribution plot confirms those claims. The distribution of MICE imputed values is much closer to the original distribution.
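Same overlay as before, with the MICE values swapped in:

plt.hist(df['Age'].dropna(), bins=30, alpha=0.5, density=True, label='Original')
plt.hist(mice_ages, bins=30, alpha=0.5, density=True, label='MICE imputed')
plt.xlabel('Age')
plt.legend()
plt.show()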

Final Words
You now have a new skill in your inventory. My recommendation is to try it on some datasets you’ve already worked on, but where you used mean imputation or some other method instead.
If you developed a predictive model there, see how the accuracy compares. It would be great if you could squeeze out some extra accuracy because of this.
Thanks for reading.
Until next time…