Missing values are a common problem in many datasets, yet many practitioners want to spend as little time on them as possible so they can focus on modeling.
I write this after recently coming across many portfolio projects that treat missing data as an afterthought. A great deal of thought goes into choosing the best algorithms and hyperparameters, while little justification is given for the chosen imputation method.
I get it, building models is fun, but the success of a machine-learning project hinges on how the data is processed prior to any modeling.
Here, we delve into the important step many tend to skip when handling missing data and show how users can carry it out in their own projects.
The Most Important Step
So what step do you have to carry out to ensure that you’re properly dealing with missing data?
The answer is simple: Exploratory Data Analysis (EDA)!
Unfortunately, it is a common mistake to skip EDA when addressing missing data, relying instead on whichever approaches feel most familiar or comfortable.
In general, executing techniques blindly without considering the "why" is a recipe for disaster.
After all, the nature of missing data varies from case to case. When it comes to handling missing values, what works in one dataset may not work in another.
Users who skip EDA have a higher likelihood of using techniques based on false assumptions, which will hamper data quality. Poorer data quality will naturally have a negative influence on any machine learning model built in the subsequent modeling phase (garbage in, garbage out).
Using EDA For Missing Data
EDA is an effective way to grasp the key characteristics of the dataset and its missing values. By arming yourself with more information on the missing data, you will be able to make more informed decisions with regard to how these missing values should be dealt with.
The next question is: what exactly should users look for when examining their data with EDA?
Ideally, users handling missing data should answer the following:
- Which features contain missing values?
- What proportion of each feature's records is missing?
- Is the missing data missing at random (MAR) or missing not at random (MNAR)?
- Are the features with missing values correlated with other features?
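The questions above can be answered with a few lines of Pandas. Here is a minimal sketch using a made-up dataset (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative dataset; column names and values are made up for this sketch.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, np.nan, 38],
    "income": [48000, 52000, np.nan, 61000, 58000, np.nan],
    "city": ["Boston", "Austin", None, "Denver", "Austin", None],
})

# Which features contain missing values, and in what proportion?
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)
print(missing_share)

# Are the features with missing values correlated with other features?
print(df[["age", "income"]].corr())
```

Distinguishing MAR from MNAR is harder to automate; it usually requires reasoning about how the data was collected, not just inspecting the values.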
Methods For Dealing With Missing Data
To effectively address the dilemma of missing values, users will benefit from getting familiar with the various techniques available to them. Knowing several methods alongside their pros and cons will improve the chances of selecting the best one.
Now, we give a brief overview of a few basic and advanced methods of handling missing data, also considering their benefits and limitations.
Furthermore, we will demonstrate how these methods can be utilized in Python using the following mock dataset.
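The original mock dataset is not reproduced here, so the snippets below use a small hypothetical stand-in with missing values in both numeric and categorical columns (all names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mock dataset with missing values in both
# numeric and categorical columns (names are illustrative).
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, np.nan, 38],
    "income": [48000, 52000, np.nan, 61000, 58000, np.nan],
    "city": ["Boston", "Austin", None, "Denver", "Austin", None],
})
print(df)
```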
Basic Methods
1. Omitting records with missing data
Have data points with missing values?
Just remove them from the dataset! Problem solved!
The advantage of this method is that it is the simplest way to deal with missing values. In terms of code, you can execute this operation with a one-liner using the Pandas library.
However, this method yields a smaller dataset, which gives models less information to train on. Moreover, if the missing values are missing not at random (MNAR), removing these data points will inevitably introduce bias into the model.
This is rarely the go-to option for handling missing data. For those considering this approach, at least ensure that there is sufficient data and that the missing values are missing at random.
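The Pandas one-liner mentioned above is `dropna`. A quick sketch on a toy frame (values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 47],
    "income": [48000, 52000, np.nan],
})

# Drop every row that contains at least one missing value.
complete_rows = df.dropna()
print(complete_rows)  # only the first row survives
```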
2. Omitting features with missing data
Another option would be to just remove the features instead of the data points.
Once again, this entails removing data, so users should first verify the amount of missing data in the features before considering this method.
If a feature has too much missing data, it will most likely not contribute to the model. For such cases, it would be best to omit the feature from consideration altogether.
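One way to operationalize this is to drop any column whose share of missing values exceeds a chosen threshold. A sketch, with an illustrative cutoff that you would tune to your own data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 47, 51],
    "notes": [np.nan, np.nan, np.nan, "ok"],  # 75% missing
})

# Drop features whose share of missing values exceeds a threshold.
threshold = 0.5  # illustrative cutoff, not a universal rule
keep = df.columns[df.isna().mean() <= threshold]
df_reduced = df[keep]
print(df_reduced.columns.tolist())  # ['age']
```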
3. Imputing missing data with a statistic
Instead of removing records or features with missing values, users can rely on simple imputation methods to replace missing data with derived values. For numeric features, missing data can be replaced by the mean or median of the values in the distribution. For categorical features, missing data can be replaced by the mode.
Imputations in Python can be carried out with scikit-learn's SimpleImputer class.
While simple imputation sounds like a faultless method, it disregards any correlation the feature of interest may have with other features. As a result, replacing missing values with these basic statistics may yield a distribution of values that doesn't adequately represent the data.
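A short sketch of SimpleImputer on a hypothetical frame, using the mean for a numeric column and the mode for a categorical one:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0],
    "city": ["Boston", "Austin", np.nan, "Austin"],
})

# Mean imputation for the numeric feature.
num_imputer = SimpleImputer(strategy="mean")
df[["age"]] = num_imputer.fit_transform(df[["age"]])

# Mode (most frequent) imputation for the categorical feature.
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)
```

Here the missing age becomes the mean of 25, 47, and 51 (i.e., 41), and the missing city becomes "Austin", the most frequent value.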
Advanced Methods For Handling Missing Data
So far, we have explored some of the simpler ways to address missing values. However, these methods are based on assumptions that may not apply in real-life scenarios.
For cases where the simpler methods won’t cut it, it is worth considering the more advanced methods that consider multiple variables when imputing missing data.
1. Imputing with K-nearest neighbors
You’re probably familiar with the K-nearest neighbors algorithm for its applications in classification, but did you know that it can also be used to impute missing values?
This process entails plotting the features in the feature space to find the "nearest neighbors" of the records with the missing values. The mean of the values in the nearest neighbors is then used to impute the missing values.
You can execute this technique using the Scikit-learn module’s KNNImputer class.
The main downsides of KNN imputation are that it is computationally intensive and sensitive to outliers.
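A minimal sketch of KNNImputer on a tiny toy array: the missing entry in the second row is filled with the mean of that feature's values among the two nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 4.0],
    [8.0, 9.0],
])

# Impute each missing value using the mean of the corresponding
# feature among the 2 nearest neighbors in feature space.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

For the row [2.0, NaN], the two nearest rows are [1.0, 2.0] and [3.0, 4.0], so the imputed value is (2.0 + 4.0) / 2 = 3.0.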
2. Multivariate Imputation with Chained Equation (MICE)
MICE is a unique imputation method that doesn’t get enough attention. It arguably deserves an article on its own.
In short, the method entails creating multiple imputations for each missing value as opposed to just one. The algorithm addresses statistical uncertainty and enables users to impute values for data of different types.
If you wish to understand the ins and outs of this algorithm, I found a pretty good paper detailing it here.
The MICE algorithm can be implemented in Python using scikit-learn's IterativeImputer class. Note that this class is still experimental, so you also need to import enable_iterative_imputer to use it.
The MICE algorithm is robust, but it runs on the assumption that the missing values in the data are missing at random (MAR).
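A sketch of IterativeImputer on a toy array, including the experimental-enabling import mentioned above:

```python
import numpy as np
# IterativeImputer is experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
])

# Each feature with missing values is modeled as a function of the
# other features, and the imputation rounds are repeated iteratively.
imputer = IterativeImputer(random_state=0, max_iter=10)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Note that IterativeImputer is inspired by MICE but, by default, returns a single imputation rather than multiple ones.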
Finding the best method
As there isn’t a one-size-fits-all method that can be used to handle missing values, the ideal approach depends on the data in question.
Thus, how you handle missing data should always be an informed decision, which is why it is so important to carry out EDA before any handling of missing data.
By identifying the characteristics of the dataset of interest, you can find the methods that fit best or, at the very least, remove some methods from consideration.
Here are some examples of what you could detect from your EDA and how you can approach handling missing data based on your findings:
- If a dataset’s feature has missing data in more than 80% of its records, it is probably best to remove that feature altogether.
- If a feature with missing values is strongly correlated with other features, it's worth considering advanced imputation techniques that use information from those other features to derive values to replace the missing data.
- If a feature’s values are missing not at random (MNAR), remove methods like MICE from consideration.
Key Takeaways
The main takeaway of this article should be to always allocate time for performing EDA to get a better understanding of your data. The results of your EDA will give insight into the techniques that are most compatible with the given use case.
Furthermore, it’s worth arming yourself with knowledge of multiple methods of handling missing data to ensure that your findings from the EDA will lead you to choose the best approach.
It may seem like a hassle to spend extra time looking at your data when you can jump straight to the modeling phase (I get it, EDA can be boring), but taking extra time to determine the best techniques to use will pay off when the resulting model performs at a satisfactory level.
I wish you the best of luck in your Data Science endeavors!