I’m sure that every Data Scientist/ML Practitioner has faced the challenge of missing values in their dataset. Handling them is a common data-cleaning step, but frankly, a very overlooked and neglected one. However, an effective missing-value strategy can have a significant impact on your model’s performance.
Reasons for the occurrence of missing values
The reasons why missing values occur are often specific to the problem domain. However, most of the time they arise from one of the following scenarios:
- Code bug: the data collection method encountered a bug and some values were not obtained properly (for example, if you collected data via a REST API and didn’t parse the response correctly, the value would end up missing).
- Unavailability: the data simply doesn’t exist for a given observation (for example, if there is a feature named "College" but the observation (person/athlete etc.) did not attend college, then the value would obviously be empty).
- Deliberate NaN insertion: this can occur in coding competitions such as Kaggle, where dealing with missing values is part of the challenge.

The reason you should deal with missing values is that many ML algorithms require complete numeric input and cannot operate on missing values; if you try to run such an algorithm (scikit-learn’s estimators, for instance) on data containing NaNs, it will raise an error. However, some algorithms, such as XGBoost, handle missing values natively by learning which way to route them based on training loss reduction.
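To see the difference in behaviour, here is a minimal sketch with made-up data: the scikit-learn estimator rejects NaN inputs, while XGBoost (left commented out, since it is a separate package) would train on the same data without complaint:
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 0])
try:
    LogisticRegression().fit(X, y)  # raises ValueError: input contains NaN
except ValueError as e:
    print("scikit-learn error:", e)
# XGBoost learns a default split direction for missing values instead:
# from xgboost import XGBClassifier
# XGBClassifier().fit(X, y)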
Approaches to dealing with missing values:
There are two approaches one can take to deal with the missing values:
- Imputation: the missing data is filled in using some method
- Removal: the rows with missing values are deleted
While there is no real "best" option, it is usually better to impute data rather than remove it, as in not doing so might result in a lot of information loss, as well as a model that is more likely to underfit the data.
However, be mindful: sometimes removing the rows is the better option, as it can produce a more robust model and speed up training.
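As a quick illustration of the two approaches, here is a minimal pandas sketch (the toy DataFrame is made up purely for demonstration):
import pandas as pd
import numpy as np
toy = pd.DataFrame({"Age": [22, np.nan, 35], "Fare": [7.25, 71.28, np.nan]})
# Removal: drop every row that contains at least one missing value
dropped = toy.dropna()
# Imputation: fill missing values, here with each column's mean
imputed = toy.fillna(toy.mean())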
Ok, I know you’re probably dying to get to the Deep Learning method (I know, amazing!). So, without further ado, let’s get technical!
Strategy 1: KNNImputer

This method essentially uses KNN, a Machine Learning algorithm, to impute the missing values: each missing value is filled with the mean of the n_neighbors samples found closest to it.
If you don’t know how KNN works, you can check out my article on it, where I break it down from first principles. But essentially, the KNNImputer will do the following:
- Measure the distance between the sample containing the missing value and its N closest samples (as specified by the n_neighbors parameter)
- Fill in the missing value with the mean of the values of those N closest non-null neighbors
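To make that concrete, here is a tiny toy example (the numbers are made up) before we move on to a real dataset:
import numpy as np
from sklearn.impute import KNNImputer
# The NaN in the last row is replaced by the mean of the corresponding
# values from its 2 nearest rows: (2.0 + 3.0) / 2 = 2.5
X = [[1.0, 2.0], [2.0, 3.0], [8.0, 9.0], [2.5, np.nan]]
print(KNNImputer(n_neighbors=2).fit_transform(X))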
KNNImputer in Action
Let’s see a simple example of KNNImputer being used:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
We will use the famous Titanic Dataset as our example dataset.
Next, we check which features have missing values:
df.isnull().sum()
OUT:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Using this method, we can see which features have missing values that need to be imputed.
df = df.drop(['PassengerId','Name'],axis=1)
df = df[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]]
df["Sex"] = [1 if x=="male" else 0 for x in df["Sex"]]
Here, we drop some unneeded features, keep a numeric subset of columns, and quickly encode our Sex feature as 0/1.
NOTE: usually one would do some feature engineering and transformation, but this is not the purpose of the article, therefore I am skipping this part. However, in a normal project, you should always examine and clean your data properly.
Next, we instantiate our KNNImputer, giving it an n_neighbors value of 5.
imputer = KNNImputer(n_neighbors=5)
imputer.fit(df)
Now all that’s left to do is transform the data so that the values are imputed (note that transform returns a NumPy array rather than a DataFrame):
imputed = imputer.transform(df)
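If you prefer to keep working with named columns, you can wrap the array back into a DataFrame and confirm that Age no longer has gaps (a small convenience step, not part of the original snippet):
df_imputed = pd.DataFrame(imputed, columns=df.columns)
print(df_imputed.isnull().sum())  # Age should now show 0 missing values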
And there you have it; KNNImputer. Once again, scikit-learn makes this process very simple and intuitive, but I recommend looking at the source code of this algorithm on GitHub to get a better sense of what KNNImputer really does.
Advantages of the KNNImputer:
- Can be much more accurate than the mean, median or mode (it depends on the dataset).
Disadvantages of the KNNImputer:
- Computationally expensive, as it stores the entire dataset in memory.
- Is quite sensitive to outliers, so imputed values may cause the model to not perform as well as possible.
- You have to specify the number of neighbors
Strategy 2: Multiple Imputation by Chained Equations (MICE)

This is a very powerful algorithm that works by selecting one feature with missing values as the target variable and fitting a regression model to impute that feature’s missing values based on all the other variables in the dataset.
It then repeats this process in a round-robin fashion, meaning that each feature with missing values will be regressed against all the other features.
A little confusing? Yep. That’s why it’s … Analogy time!
Understanding MICE with the use of an analogy

Let’s suppose that you have a dataset with the following features:
- Age
- Gender
- BMI
- Income
And each feature, except Gender, has missing values. In this scenario, the MICE algorithm would do the following:
- For each feature with missing values (Age, BMI, Income), fill in those values with some temporary "placeholder". This is usually the mean of all the observed values in that feature, so here we would fill in the missing Ages with the mean Age, the missing BMIs with the mean BMI, and so on.
- Set the values of one feature, the one you would like to impute next, back to missing. So, if we choose to impute Age, Age becomes the only feature with missing values, since the others were filled with placeholders in the previous step.
- Regress Age on all (or some) of the other features. To make this step work, fit the model only on the rows where Age is actually observed. Essentially, we are fitting a Linear Regression with Age as the target and the other features as the independent variables.
- Use the fitted regression model to predict the missing values of Age. (Important note: when Age is later used as an independent variable to predict the missing values of other features, both its observed and its predicted values are used.) A random component is also added to this prediction.
- Repeat steps 2–4 for every feature that has missing data (in this case, BMI and Income).
Completing steps 1–5 for all features once is known as a cycle. Usually, MICE runs through the data for around 5–10 cycles.
Once the final cycle finishes, we get back an imputed dataset. Usually, we want to run this algorithm on around 5 copies of the dataset and then pool the results together, analysing and reporting the combined results.
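To make the steps concrete, here is a minimal sketch of a single cycle on a toy dataset with the features from the analogy, using scikit-learn’s LinearRegression (the numbers are made up, and the random component is left out for brevity):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
toy = pd.DataFrame({
    "Age":    [25, np.nan, 40, 35, np.nan],
    "Gender": [0, 1, 0, 1, 1],
    "BMI":    [22.0, 27.5, np.nan, 24.0, 30.0],
    "Income": [40000, 52000, 61000, np.nan, 45000],
})
# Step 1: fill every missing value with its column mean as a placeholder
filled = toy.fillna(toy.mean())
# Steps 2-4 for one feature (Age): set its missing entries back to NaN,
# regress Age on the other columns using only the observed rows, then predict
missing = toy["Age"].isna()
X = filled.drop(columns="Age")
reg = LinearRegression().fit(X[~missing], toy.loc[~missing, "Age"])
filled.loc[missing, "Age"] = reg.predict(X[missing])
# Step 5: repeat the same procedure for BMI and Income to complete one cycle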
MICE in Action
While the steps may seem long and difficult, thanks to our friend scikit-learn, implementing MICE is as easy as pie:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df = df.drop(['PassengerId','Name'],axis=1)
df = df[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]]
df["Sex"] = [1 if x=="male" else 0 for x in df["Sex"]]
Note: make sure to import enable_iterative_imputer before you import IterativeImputer. The feature is still classified as experimental, and failing to do so will result in an ImportError.
Now let’s instantiate the IterativeImputer class:
imputer = IterativeImputer(imputation_order='ascending', max_iter=10, random_state=42, n_nearest_features=None)
A few things to note here:
max_iter: the number of cycles to do of the imputation before returning the imputed dataset
imputation_order: The order in which the features will be imputed.
Possible values:
- "ascending": from the features with the fewest missing values to those with the most.
- "descending": from the features with the most missing values to those with the fewest.
- "roman": left to right.
- "arabic": right to left.
- "random": a random order for each round.
n_nearest_features: the number of other features to use to estimate the missing values of each feature column. The nearness of a feature is determined by its absolute correlation coefficient.
random_state: the random seed to be set to make your results reproducible
Next, we just fit it to the data, and transform it!
imputed_dataset = imputer.fit_transform(df)
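The snippet above produces a single imputed dataset. If you want the "multiple" part of multiple imputation discussed earlier, one simple hand-rolled approach (a sketch of the idea, not a built-in scikit-learn feature) is to run the imputer several times with sample_posterior=True and different seeds, then pool the copies. Note that averaging the imputed values is a simplification; proper multiple imputation pools the downstream analysis results instead.
# Generate several imputed copies by varying the random seed, then average them
imputations = [
    IterativeImputer(max_iter=10, random_state=seed, sample_posterior=True).fit_transform(df)
    for seed in range(5)
]
pooled = np.mean(imputations, axis=0)  # element-wise average of the 5 imputed copies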
Advantages of MICE:
- Very flexible, can handle both binary and continuous values
- Usually more accurate than simple mean/mode/median imputations.
Disadvantages of MICE:
- Computationally expensive, as it stores and loops through the dataset for N cycles.
- Can be slow for large datasets
Strategy 3: Imputation with Deep Learning

This is probably the most powerful and accurate imputation method covered here. It uses deep neural networks built on Apache MXNet to predict missing values. It is very flexible, as it supports both categorical and continuous variables, and can run on a CPU as well as a GPU.
Deep Learning Imputation in Action
The library we are using is called datawig. To install it, simply run:
pip3 install datawig
In your terminal/command line.
import pandas as pd
import numpy as np
import datawig
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df = df.drop(['PassengerId','Name'],axis=1)
df = df[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]]
df["Sex"] = [1 if x=="male" else 0 for x in df["Sex"]]
Now, we will instantiate datawig’s SimpleImputer class:
imputer = datawig.SimpleImputer(
input_columns=['Pclass','SibSp','Parch'],
output_column= 'Age',
output_path = 'imputer_model'
)
Some parameters to note here:
- input_columns: the column(s) containing information about the column we want to impute
- output_column: the column we’d like to impute values for
- output_path: stores model data and metrics
Next we fit the imputer to our data, impute missing values and return the imputed DataFrame:
# Fit an imputer model on the train data
# num_epochs: the number of passes to make over the training data
imputer.fit(train_df=df, num_epochs=50)
# Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df)
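The DataFrame returned by predict keeps the original columns and adds the predictions in a new column; in the datawig versions I’ve used, that column is named after the output column with an _imputed suffix (e.g. Age_imputed), but check imputed.columns on your version before relying on that name:
print(imputed.columns)
# Fill the original missing Age values from the predicted column
# (assumes the prediction column is called 'Age_imputed')
imputed["Age"] = imputed["Age"].fillna(imputed["Age_imputed"])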
And that is how you impute with Deep Learning.
Advantages of Deep Learning Imputation
- Pretty accurate compared to other methods.
- It has functions that can handle categorical data (Feature Encoder).
- It supports CPUs and GPUs.
Disadvantages of Deep Learning Imputation
- Can be slow on large datasets
- Only one feature at a time can be imputed
- You must specify input and output columns
Conclusion

To wrap up, we must admit that there is no guaranteed strategy that will always work for every dataset. Some methods may work magnificently for one dataset, but poorly for others.
Therefore, it is always good to examine your data to decide how you should go about imputing values and which strategy to use. Make sure to use cross-validation to compare the different imputation methods and see which one works best for a given dataset.
I hope you enjoyed this article, and many thanks to all my followers, who keep me constantly motivated and hungry to learn! I hope you learned something new today, and stay tuned as I have much more content yet to come!
