Cleaning data is a necessary part of almost all data projects. A common theme in data cleaning is dealing with samples that are not complete. There are many options for imputation, and each of them has its benefits and risks. I will go over a few standard options I like to use and then discuss important considerations when choosing an imputation strategy.
Strategy #1 – Drop dirty samples
This is the most straightforward approach and one of the most common I have come across. Drop the rows that contain missing values. In fact, as part of my initial EDA steps, I often choose to do this to get some baseline plots of what my data looks like. This, however, sacrifices some of your precious data! In most cases, this is not a great strategy, as you are reducing the total amount of information you have for drawing conclusions or training your model.
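Here is a minimal pandas sketch of this strategy; the `df`, `age`, and `income` names are invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with some missing values
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, np.nan, 70000],
})

# Drop any row that contains at least one missing value
df_complete = df.dropna()

# Or only drop rows that are missing a specific column
df_partial = df.dropna(subset=["income"])
```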
Strategy #2 – Replace missing values with 0
This is another simple strategy I see people take; however, I argue that it's rarely appropriate. Replacing missing values with a constant can be a good choice, but that constant should not always be 0. The issue is that 0 is an arbitrary choice in most cases and does not reflect the actual situation the dataset is attempting to describe. There is a caveat here, though: if you have normalised your data to have a mean of 0, then 0 is a great choice. (See the next strategy!)
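A quick sketch of constant-fill, again with an invented `reading` column, showing why 0 only really makes sense once the column has been standardised:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"reading": [0.2, np.nan, -0.1, np.nan, 0.4]})

# Fill with an arbitrary constant (often a poor choice)
filled_zero = df["reading"].fillna(0)

# If the column has been standardised (mean 0, unit variance), 0 coincides
# with the mean, which is the one situation where this tends to be defensible
standardised = (df["reading"] - df["reading"].mean()) / df["reading"].std()
filled_standardised = standardised.fillna(0)
```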
Strategy #3 – Replace missing values with the mean
This is a much better option than replacing the missing values with an arbitrary value like 0. This is because the samples are all distributed around this value, and there is a high probability that if the missing value were not missing, it would be somewhere around the mean. This is the best of the simple approaches, in my opinion. Also, keep in mind that there is more than one possible mean to use. Often a good choice for the imputation is the population mean for that column. However, when dealing with time-series data, you can use the individual's mean for that feature instead.
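Here is a small sketch of both variants in pandas, using a made-up `subject`/`value` panel; the per-individual version leans on `groupby` plus `transform`:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "subject": ["a", "a", "a", "b", "b", "b"],
    "value":   [1.0, np.nan, 3.0, 10.0, np.nan, 14.0],
})

# Column-wide (population) mean imputation
df["value_pop_mean"] = df["value"].fillna(df["value"].mean())

# Per-individual mean imputation, useful for panel / time-series data
df["value_ind_mean"] = df["value"].fillna(
    df.groupby("subject")["value"].transform("mean")
)
```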
Strategy #4 – Replace missing values with linear modelling
This strategy is for time-series data. When your data has a trend, the mean value for each individual is often not a good fit for imputation, because it doesn't take into account the change of that variable over time! Fitting a simple linear regression model to your data, however, will impute (and extrapolate) data points for your time series. This, unfortunately, isn't as useful when it's clear that your dataset is not describing a linear trend.
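One simple way to do this, sketched below with an invented series, is to fit a degree-1 polynomial with NumPy's `polyfit` on the observed points and evaluate the fitted line wherever values are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a linear trend and a couple of gaps
t = np.arange(10, dtype=float)
y = 2.0 * t + 1.0
y[[3, 7]] = np.nan
s = pd.Series(y, index=t)

# Fit a straight line to the observed points only
mask = s.notna()
coeffs = np.polyfit(t[mask], s[mask], deg=1)

# Use the fitted line to fill the gaps (and, if needed, extrapolate)
s_imputed = s.copy()
s_imputed[~mask] = np.polyval(coeffs, t[~mask])
```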
Strategy #5 – Replacing values with polynomial modelling
Again, this is an option for time-series datasets. In the previous strategy, we discussed using linear modelling for imputation and extrapolation, and we can extend this concept further by using polynomial models for the same purpose. This allows us to impute values for non-linear trends!
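The same `polyfit` approach extends directly; the sketch below assumes a made-up quadratic series and simply raises the degree:

```python
import numpy as np
import pandas as pd

# Hypothetical non-linear (quadratic) trend with gaps
t = np.arange(12, dtype=float)
y = 0.5 * t**2 - 3.0 * t + 4.0
y[[2, 6, 9]] = np.nan
s = pd.Series(y, index=t)

mask = s.notna()

# Fit a low-order polynomial to the observed points
coeffs = np.polyfit(t[mask], s[mask], deg=2)

s_imputed = s.copy()
s_imputed[~mask] = np.polyval(coeffs, t[~mask])
```

For interior gaps only, pandas' `Series.interpolate(method="polynomial", order=2)` (which relies on SciPy under the hood) can achieve a similar result with less code.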
Strategy #6 – Replacing values using clustering
Another valid option when imputing values is to use unsupervised learning techniques like clustering to identify which sub-group of your dataset an individual belongs to, and then make an informed decision about what value to impute. I have used this strategy successfully with K-means clustering and, more recently, with affinity propagation. Affinity propagation is a clustering technique that identifies a sample representing the centre of each cluster. Using this centre sample gives you some concept of an average value to use for imputation for other members of that cluster.
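Here is one way that could look with K-means from scikit-learn; the `height`/`weight` columns are invented, and the idea is to cluster on the complete feature(s) and then fill gaps with a per-cluster statistic:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: 'height' is complete, 'weight' has gaps
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 100),
    "weight": rng.normal(70, 8, 100),
})
df.loc[df.sample(frac=0.1, random_state=0).index, "weight"] = np.nan

# Cluster on the complete feature(s) only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["height"]])

# Fill each missing weight with the mean weight of its cluster
df["weight"] = df["weight"].fillna(
    df.groupby("cluster")["weight"].transform("mean")
)
```

With scikit-learn's `AffinityPropagation`, the fitted model's `cluster_centers_indices_` attribute points at the exemplar sample of each cluster, and that exemplar's observed value could be copied in place of the cluster mean.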
Strategy #7 – Replacing values with other Machine Learning modelling
I think you can start to see a familiar pattern here. There are a myriad of other valid Machine Learning (ML) models you can use for this purpose. Rather than list them all individually, I’ll throw in this strategy as a bit of a catch-all. Remember, when selecting a machine learning model for this purpose, you have to be careful to understand how that model works and what bias it may introduce to your dataset through imputation.
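As one concrete example (far from the only one), scikit-learn ships a `KNNImputer` that fills each gap from an average over the nearest reasonably complete neighbours; the array below is invented:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
    [np.nan, 8.0],
])

# Each missing entry is replaced by the mean of that feature
# taken over the k nearest neighbours with observed values
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```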

There are other important considerations when dealing with imputation. A big concern is sparsity: how much of your data is actually missing. In cases where you have large amounts of missing data, mean imputation is often not a good choice. It heavily shrinks the variance of your dataset and results in distributions that no longer represent the underlying concept the dataset is encoding. I feel that linear/polynomial/ML modelling is a better option in these situations, as those approaches impute a range of values. While they will rarely match the actual variance of the underlying dataset, they go a long way towards reducing the massive kurtosis you get from using something like mean imputation on a sparse dataset.
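A tiny, made-up demonstration of that variance collapse (the 40% missingness rate is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
full = pd.Series(rng.normal(0, 1, 1000))

# Knock out 40% of the values to simulate a sparse column
sparse = full.copy()
sparse[sparse.sample(frac=0.4, random_state=0).index] = np.nan

mean_imputed = sparse.fillna(sparse.mean())

print(full.var(), mean_imputed.var())    # variance drops to roughly 60% of the original
print(full.kurt(), mean_imputed.kurt())  # excess kurtosis rises with the spike at the mean
```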
Polynomial modelling has a significant drawback when you are hoping to use it for both imputation and extrapolation. Higher-order polynomials tend to launch towards ±infinity outside of the domain you fitted them to. This means that your extrapolated values have a very high chance of not representing the actual trend. This is something you can, of course, visualise, so don't rule this option out completely; just keep in mind it's important to inspect what's going on with this model over the entire domain you want to extrapolate for, not just the domain you fitted it to.
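A cheap sanity check, sketched here with invented numbers, is to evaluate the fitted polynomial across the whole range you intend to extrapolate into before trusting it:

```python
import numpy as np

# Suppose a degree-6 polynomial was fitted on t in [0, 12)
t_fit = np.arange(12, dtype=float)
y_fit = np.sin(t_fit / 2.0)
coeffs = np.polyfit(t_fit, y_fit, deg=6)

# Evaluate over the full extrapolation range, not just the fitted domain
t_extrap = np.linspace(0, 24, 200)
y_extrap = np.polyval(coeffs, t_extrap)
print(y_extrap.min(), y_extrap.max())  # typically blows up well outside [0, 12)
```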
Sometimes, if you are cleaning data with the intention of building a machine learning model, you might not need to impute these values at all! Modern ML models can sometimes treat missing information as a feature in its own right. LightGBM, for instance, uses the presence of missing values internally as part of the information in the model!
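A minimal sketch with an invented dataset; the point is simply that the NaNs can be passed straight to LightGBM's scikit-learn wrapper:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Punch holes in one feature; LightGBM handles NaNs natively
X[rng.random(200) < 0.2, 1] = np.nan

model = lgb.LGBMRegressor(n_estimators=50)
model.fit(X, y)  # no separate imputation step required
```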
Missing values are something that we data scientists will always have to contend with. I hope this article has helped shed some light on the complexity of the issue and given you a place to start in your journey of understanding how to manage them. Good luck out there and happy coding!
Further reading: LightGBM documentation.