
How to Handle Missing Values in Python

A gentle introduction to imputation of missing values

GETTING STARTED

Photo by Markus Winkler on Unsplash

The biggest challenge for data scientists is probably something that sounds mundane but is crucial for any analysis: cleaning dirty data. When you think of dirty data, you probably picture inaccurate or malformed records. In truth, missing data is the most common form of dirty data. Imagine trying to run a customer segmentation analysis when 50% of your customers have no address on record. The analysis would be difficult or impossible, and its results would be biased toward showing no customers in certain areas.

Explore Missing Data

  • How much data is missing? Run a simple exploratory analysis to look at the frequency of missing values. If the percentage is small, say 5% or less, and the data is missing completely at random, you could consider dropping those cases. But keep in mind that it’s always better to analyze all the data if possible, since dropping cases can introduce bias. Check the distribution to see where the missing data is coming from.
  • Analyze how the data is missing (MCAR, MAR, MNAR). We will cover the different types of missing data in the next section.
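As a quick sketch of the first step, here is how you might measure missing-data frequency with pandas (the small DataFrame is purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values, for illustration only
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [50000, np.nan, np.nan, 72000, 48000],
    "city": ["NY", "LA", "NY", None, "SF"],
})

# Percentage of missing values per column
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct)  # age 20.0, income 40.0, city 20.0
```

Columns well above your chosen threshold (e.g. income at 40% here) deserve a closer look before you decide to drop or impute.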

Types of missing data

There are three kinds of missing data:

  • data missing completely at random (MCAR)
  • data missing at random (MAR)
  • data missing not at random (MNAR)

In this article, I’ll go over the types of missing data with examples, and share how to handle missing data with imputation.

Data Missing Completely at Random (MCAR)

When we say data are missing completely at random, we mean that the missingness has nothing to do with the person being studied. For example, a questionnaire might be lost in the post, or a blood sample might be damaged in the lab.

The statistical advantage of data missing completely at random is that the analysis remains unbiased, because the absence of data does not distort the estimated parameters. However, it can be hard to verify that the data are indeed MCAR.

Data Missing at Random (MAR)

When we say data are missing at random, we assume that the propensity for a data point to be missing is not related to the missing data itself, but is related to some of the observed data. In other words, the probability that a value is missing depends on characteristics of the observed data. For example, if your crush rejected you, you can probably figure out why by looking at other variables, like "she’s already committed" or "her life goals are different from yours."

Missingness at random is relatively easy to handle – simply include as regression inputs all variables that affect the probability of missingness. Unfortunately, we generally cannot be sure whether the data really are missing at random, or whether the missingness depends on unobserved predictors or the missing data themselves.

Data Missing Not at Random (MNAR)

When we say data are missing not at random, we assume there is a pattern in the missingness that affects your primary dependent variable. For example, people with the lowest education levels are missing education values, the sickest people are the most likely to drop out of a study, or drug addicts leave drug-usage fields blank in a survey. These values aren’t blank out of randomness; they are left null on purpose.

If the data are missing not at random, naively imputing the missing values does not make sense, since it can bias your analysis or your model’s predictions.

Diagnosing the Mechanism

  1. MAR vs. MNAR: The only true way to distinguish MNAR from MAR is to measure some of the missing data, for example by following up with non-respondents. If respondents’ answers differ substantially from non-respondents’ answers, that is good evidence the data are MNAR. Also, the more sensitive a survey question is, the less likely people are to answer it.
  2. MCAR vs. MAR: One technique is to create a dummy variable indicating whether a value is missing, then run t-tests and chi-square tests between this indicator and the other variables in the dataset to see whether the missingness is related to their values. For example, if women really are less likely to report their weight than men, a chi-square test will show that the percentage of missing weight values is higher for women than for men, and we can conclude that weight is MAR.
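The dummy-variable technique in step 2 can be sketched with SciPy’s chi2_contingency; the toy data below is hypothetical, just to show the mechanics:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: does missingness in `weight` depend on `gender`?
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "weight": [np.nan, np.nan, 60, np.nan, 80, 85, np.nan, 78],
})

# Dummy variable: 1 if weight is missing, 0 otherwise
df["weight_missing"] = df["weight"].isna().astype(int)

# Chi-square test of independence between gender and missingness
table = pd.crosstab(df["gender"], df["weight_missing"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p-value={p:.3f}")
```

A small p-value would suggest that missingness depends on gender, pointing to MAR rather than MCAR (with a sample this tiny, the test is only illustrative).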

Methods of Data Imputation

Simple Imputation

Replacing the missing values in a column with its mean, median, or mode is the most basic imputation method. It is the fastest approach, but it doesn’t work well on encoded categorical features, and it ignores correlations between features. If you have to use simple imputation, consider the median rather than the mean, since the mean is sensitive to outliers and noise.

import numpy as np
from sklearn.impute import SimpleImputer

# the old sklearn.preprocessing.Imputer has been removed; use SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
transformed_values = imputer.fit_transform(data.values)
# notice that the imputation strategy can be changed to "mean", "most_frequent", or "constant"
# read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

KNN Imputation

KNNImputer fills in missing values by finding the k nearest neighbors of each observation, using a NaN-aware Euclidean distance. This method is usually more accurate than simple imputation; however, it can be computationally expensive and is sensitive to outliers.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)  # define the k nearest neighbors
imputed_values = imputer.fit_transform(data)
# read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html

Multivariate Imputation

Multivariate imputation reduces the noise problem by factoring in the other variables in the data. The basic multiple imputation by chained equations (MICE) procedure assumes the data are missing at random: we can make an educated guess about a missing value by looking at the other observed values.

Here are the three main steps:

  1. Create m sets of imputations for the missing values using an imputation process with a random component.
  2. Analyze each completed data set. Each set of parameter estimates will differ slightly because the data differs slightly.
  3. Integrate the m analysis results into a final result.
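As a rough sketch of these three steps with scikit-learn, you can run IterativeImputer several times with sample_posterior=True so each run draws a different imputed dataset, then pool the per-dataset estimates (here, simple column means on synthetic data):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with some values knocked out, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[::10, 1] = np.nan

# Step 1: create m imputed datasets; sample_posterior adds the random component
m = 5
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(m)
]

# Steps 2-3: analyze each completed dataset, then pool the results
estimates = np.array([Xi.mean(axis=0) for Xi in imputed_sets])
pooled = estimates.mean(axis=0)
print(pooled)
```

Proper MICE pooling (Rubin’s rules) also combines the within- and between-imputation variances; the simple averaging above only illustrates the workflow.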

If you are interested in seeing how it works, check out this video. Note that it is still an open question how useful univariate vs. multivariate imputation is for prediction and classification when we are not interested in measuring the uncertainty due to missing values.

Image by stefvanbuuren

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit(X_train)                          # learn the imputation model on the training set
X_test_imputed = imp.transform(X_test)    # apply it to new data
# read the documentation here: https://scikit-learn.org/stable/modules/impute.html

Conclusion

Missing data cannot be ignored in any analysis. As data scientists or data analysts, we can’t simply drop the missing values. We need to understand why the data is missing and handle the NaN values accordingly.


If you find this helpful, please follow me and check out my other blogs ❤️

Until next time, happy learning! 👩🏻‍💻
