
Handling “Missing Data” Like a Pro – Part 2 – Imputation Methods

Basic and Advanced Techniques for the 21st Century Data Scientist

Data Science. Analytics. Statistics. Python.

Photo by Jon Tyson on Unsplash

As we mentioned in the first article of this series on missing data, knowledge of the mechanism or structure of “missingness” is crucial because our response depends on it.

In Handling “Missing Data” Like a Pro – Part 1 – Deletion Methods, we discussed deletion methods.

For this part of the article, we will focus on imputation methods. We will compare the effects of each method on the dataset, as well as its advantages and disadvantages.

LOAD THE DATASET AND SIMULATE MISSINGNESS

Load the Adult dataset and simulate an MCAR dataset as described in the first article.
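
If you are starting from scratch, a minimal sketch along the following lines reproduces a comparable setup (the UCI file location and column names are standard for the Adult dataset, but the 10% MCAR rate and the choice of columns blanked out here are assumptions rather than the exact simulation from Part 1):

import numpy as np
import pandas as pd
np.random.seed(25)
#Load the Adult dataset from the UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income']
df = pd.read_csv(url, header=None, names=columns, skipinitialspace=True)
#Simulate MCAR: blank out a random 10% of 'age' and 'fnlwgt' (assumed rate)
for col in ['age', 'fnlwgt']:
    mask = np.random.rand(len(df)) < 0.10
    df.loc[mask, col] = np.nan
#Keep track of which rows were blanked, for later comparisons
age_missing = df['age'].isna()
fnlwgt_missing = df['fnlwgt'].isna()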

IMPUTATION METHODS

Now that we have a dataset to practice our imputations, let us begin to discuss what these are.

Below is a summary of the modern-day imputation methods we can employ in our studies:

Technique Classes and Methods for Modern-Day Missing Value Imputations

While we will discuss the theories and concepts behind these methods, let us employ scikit-learn to do the dirty work for us.

CONSTANT REPLACEMENT METHODS

Constant imputation is the most popular single-imputation method for dealing with missing data.

Constant imputation methods replace missing data in an observation with a constant value. Simple as this is, there are variations of this technique and some ways for data scientists to make it more effective.

MEAN SUBSTITUTION

For mean substitution, missing values are replaced with the arithmetic mean of the feature.

#Import the imputer
import numpy as np
from sklearn.impute import SimpleImputer
#Initiate the imputer object
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
#Isolate the columns we want to impute
X = df[['age', 'fnlwgt']]
#Fit to learn the mean
imp.fit(X)
#Impute
imp.transform(X)
Mean Imputation Comparison Across Vectors
Mean imputation preserves the mean of the dataset with missing values. However, a big disadvantage of this method is the lowering of variance.
The imputed dataset has a significantly lower variance.

Mean imputation preserves the mean of the dataset with missing values, as can be seen in our example above. This, however, is only appropriate if we assume that our data are normally distributed, where most observations are expected to lie around the mean anyway. It is also substantially helpful when only a small portion of the data is missing.

The main disadvantage of mean imputation is that it tends to produce biased estimates for some parameters, particularly the variance. This affects the construction of confidence intervals, which is a serious issue for some researchers.

There is one way to remedy this a bit, and it applies to all constant replacement methods: impute different means for different subgroups. For example, for ‘age’ imputation, you can choose to impute the mean age of males for all observations where the age value is missing and the observation belongs to the male group.

Note that for variables that are represented as an integer, such as age, you can round up or down after imputation.
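
Continuing from the mean-imputation snippet above (reusing imp and X), a minimal post-processing sketch:

import pandas as pd
#transform() returns a NumPy array; wrap it and round 'age' back to whole numbers
X_imputed = pd.DataFrame(imp.transform(X), columns=['age', 'fnlwgt'])
X_imputed['age'] = X_imputed['age'].round().astype(int)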

MAXIMUM LIKELIHOOD MEAN SUBSTITUTION

The maximum likelihood (ML) method is an amazing technique that has the greatest capability of recovering the true population parameters.

ML methods are highly praised and widely used because they make use of every available observation in the dataset to estimate the population parameters. So if your dataset is MCAR, they have the greatest probability of converging to the true values.

We will discuss this in detail, along with the mathematics behind it, in the “model-based” data augmentation article, but for now, let’s calculate the ML mean of our dataset.

import numpy as np
from scipy import stats
from scipy.optimize import minimize
def fnlwgt_ML_mean(params):
    mean = params[0]
    sd = params[1]
    #Calculate the negative log-likelihood under a normal model
    nll = -np.sum(stats.norm.logpdf(df['fnlwgt'].dropna(), loc=mean, scale=sd))
    return nll
initParams = [7, 12]
fnlwgt_results = minimize(fnlwgt_ML_mean, initParams, method='Nelder-Mead')
mle_fnlwgt = fnlwgt_results.x[0]

This results in the following estimates of the mean and standard deviation:

Look at the estimates; they are close to those of our full dataset.

If you compare this with what we have:

The estimates can be compared to those of the MCAR dataset. The difference is due to our oversimplifying distributional assumption (recall that the age variable is right-skewed).

For smaller datasets, as long as we have the correct distributional assumptions, the ML estimate of the mean may actually be better than the ordinary sample mean.

After getting the estimate, you can then substitute it as a constant in the imputer.
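
For example, a short sketch that plugs the ML estimate into SimpleImputer as a constant fill value:

#Use the ML estimate as a constant replacement value for 'fnlwgt'
df_ml = df.copy()
imp_ml = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=mle_fnlwgt)
df_ml[['fnlwgt']] = imp_ml.fit_transform(df_ml[['fnlwgt']])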

A particular disadvantage of maximum-likelihood methods is that we need to assume the distribution of the data. Prior knowledge of the distribution, or some preliminary EDA, may help a bit in this regard. In addition, a separate MLE calculation is done per feature, unlike for the mean and median constant replacements.

MEDIAN SUBSTITUTION

For median substitution, instead of the mean, the median is used as a replacement value for missing observations.

#Median Imputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X)
imp.transform(X)

Median substitution, while perhaps a good choice for skewed datasets, biases both the mean and the variance of the dataset. This will therefore need to be factored into the researcher’s considerations.

ZERO IMPUTATION

For some types of studies, it is more natural to impute zero (‘0’) for missing values. Zero may make sense for variables that are social in nature, such as "withdrawal of interest," or for people who failed to show up for an exam, where they would naturally have received a score of zero anyway.

Of course, this is only possible for variables where zero is a valid value, so it is not appropriate for the age variable, since our participants are not newborns.
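
Where zero is a valid value, a sketch of this again uses the constant strategy (the column in the commented line is purely hypothetical):

#Zero imputation via the constant strategy
imp_zero = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
#imp_zero.fit_transform(df[['exam_score']])  #hypothetical column where zero is valid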

MODE IMPUTATION

As the name implies, mode imputation imputes the "most frequent" value for a particular variable and may be a good choice of method for normally distributed variables.
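
A sketch with scikit-learn's most_frequent strategy, applied to the same two numeric columns (the strategy also works on string columns such as 'workclass'):

#Mode imputation: replace missing values with the most frequent value
df_mode = df.copy()
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df_mode[['age', 'fnlwgt']] = imp_mode.fit_transform(df_mode[['age', 'fnlwgt']])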

RANDOM REPLACEMENT METHODS

As opposed to constant value replacement methods, random replacement methods replace missing data with randomly generated values.

There are two general ways of accomplishing this:

  1. Using empirical data – If you are familiar with bootstrap methods, then you can view this as similar to that. This means that observations used to replace missing values come from the available data within the dataset itself.
  2. Using statistical distributions – If we know the distribution of a variable, we can draw samples from its theoretical/statistical distribution (i.e., parametrically). For this, we can substitute our ML estimates for the parameters, as they are considered more robust; a sketch of this approach appears right after this list.
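
As a sketch of the parametric route, assuming a normal distribution for 'fnlwgt' and reusing the ML estimates obtained earlier (mle_fnlwgt for the mean, with the standard deviation taken from the same optimization result):

#Parametric random replacement: draw from a normal distribution
#parameterized with the ML estimates (a simplifying distributional assumption)
np.random.seed(25)
df_param = df.copy()
mle_sd_fnlwgt = fnlwgt_results.x[1]
missing = df_param['fnlwgt'].isna()
df_param.loc[missing, 'fnlwgt'] = np.random.normal(loc=mle_fnlwgt, scale=mle_sd_fnlwgt, size=missing.sum())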

Let’s try to discuss some of the empirical random replacement methods.

HOT-DECK METHOD

Hot-deck methods are methods that replace missing values with randomly selected values from the current dataset on hand. This is contrasted with cold-deck methods where you may have a separate dataset to randomly draw values from.

For example, for our adult dataset, if a person has forgotten to report his/her age, this method would pick a random value from those that have reported their age.

import random
random.seed(25)
df4 = df.copy()
#For weight: draw a random donor value from the observed data for each missing entry
df4.loc[:,'fnlwgt'] = [random.choice(df4['fnlwgt'].dropna().tolist()) if np.isnan(i) else i for i in df4['fnlwgt']]
#Inspect the rows that were originally missing
df4.loc[fnlwgt_missing,'fnlwgt']
The algorithm imputed the missing values using a random choice from available data for that variable.
Hot Deck Imputation may result in a standard deviation that is higher than the original.

Hot-deck imputation may result in a standard deviation that is higher (or lower) than that of our full dataset, which is, of course, no better than an understated (or overstated) value for confidence interval construction.

As with mean imputation, you can do hot-deck imputation within subgroups (e.g., imputing a random choice drawn not from the full dataset but from a subset of it, such as the male subgroup or the 25–64 age subgroup).
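
A rough sketch of this, drawing donors only from within the same 'workclass' group (any grouping variable you trust would do):

#Hot-deck imputation within subgroups: donors come from the same 'workclass'
def group_hot_deck(series):
    pool = series.dropna().to_numpy()
    if len(pool) == 0:
        return series  #no donors available in this subgroup
    return series.apply(lambda v: np.random.choice(pool) if np.isnan(v) else v)
df4b = df.copy()
df4b['fnlwgt'] = df4b.groupby('workclass')['fnlwgt'].transform(group_hot_deck)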

COLD DECK METHODS

It’s possible to draw a replacement value from a separate dataset that is similar to the one with missing values.

For example, you may want to study two groups of people where the population is homogeneous but you simply happened to divide it into two groups (e.g., a Monday group and a Tuesday group). If you have missing values for the Tuesday group, say for age, then under the premise that both groups are homogeneous and randomly assigned, it’s possible to fill in the missing age values using randomly chosen age values from the Monday group.

A cold deck can be implemented using two subgroups of the training dataset as well, much as we do with validation. Be careful not to use data from your test dataset, to avoid data leakage.
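
A minimal sketch, assuming donor_df is a separate but comparable dataset (a hypothetical frame here) with no missing 'age' values:

#Cold-deck imputation: donors come from a separate, comparable dataset
donor_pool = donor_df['age'].dropna().to_numpy()  #donor_df is hypothetical
df_cold = df.copy()
missing = df_cold['age'].isna()
df_cold.loc[missing, 'age'] = np.random.choice(donor_pool, size=missing.sum())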

MODEL-BASED REPLACEMENT METHODS

Model-based replacement methods are used to generate parameter estimates conditional on the given data that we have, the observed relationship between variables, and constraints imposed by the underlying distributions.

Because we make use of underlying distributions, we refer to these methods as "model-based".

Model-based methods include those of Markov-Chain Monte Carlo (MCMC), Maximum Likelihood, Expectation-Maximization algorithm, and Bayesian Ridge.

Decision trees and extra trees can be used as well, though they are not among the original model-based methods (those that rely heavily on data distributions). We’ll include them here as they are valid machine learning models anyway.

As these are beautiful, sophisticated techniques, we need to address them in a separate article so we can appreciate them more deeply.

NON-RANDOM REPLACEMENT: ONE CONDITION

GROUP MEAN / GROUP MEDIAN

We have discussed non-random replacement earlier in the constant replacement methods section.

For Group Mean and Group Median, instead of imputing a single value (mean or median) for all missing values, we divide the observations into subgroups and impute the mean/median for the missing values in those subgroups.

Examples of subgroups for sex are the male and female groups; for the age variable (which, as we saw, can be positively skewed), we can use customized age brackets.

For example, if the final weight value in our example is missing, then we can divide the observations into subgroups, say by work class, get each subgroup’s mean/median, and impute it for the missing values in the respective subgroups.

df5 = df.copy()
#Accomplish Using Transform Method
df5["age"] = df5['age'].fillna(df.groupby('workclass')['age'].transform('mean'))
df5["fnlwgt"] = df5['fnlwgt'].fillna(df.groupby('workclass')['fnlwgt'].transform('mean'))
Using the "work class" variable you can get the mean per work class and substitute the mean "age" or "fnlwgt" to those with missing values for those classes.

Using the groupby() method, you can create multiple group levels; say, after work class, you can further group by educational level.

You can be as creative and exploratory in the group formulation as long as it progresses your research.

LAST OBSERVATION CARRIED FORWARD (LOCF)

For some time-series data, a primary reason for missing data is ‘attrition’. For example, suppose you are studying the effect of a weight-loss program on a specific person. If you see continuous improvement up to the last observation, then the first missing observation can be assumed to be around the same value as the last recorded one. Attrition here happened because that person achieved his/her ideal weight.

Assuming that your rows are arranged per year:

#Just assuming the variable below is time-series data
#df['column'].fillna(method='ffill')
#Another implementation, combined with the group-means method
#df['age'] = df.groupby(['workclass'])['age'].ffill()

If you apply this method to a non-time-series dataset, it is considered a "hot deck" method, as it uses actual observations from your dataset. Caution, however, should be exercised, as this may not be appropriate in many cases: it has been shown to bias parameter estimates and increase Type 1 errors.

Mean and Standard Deviation Calculations for Group Mean Imputations.

NEXT OBSERVATION CARRIED BACKWARD (NOCB)

Similar in spirit to LOCF, "Next Observation Carried Backward (NOCB)" also carries an adjacent value over, but instead of carrying it forward, it carries it backward. If you have ever heard the term "backfill," this is essentially that process.

If you think about it, there are a lot of cases where this is employed. Say, for example, that you are studying the salary progression of different test subjects. If you know that companies did not give raises in a particular year (for example, during the COVID pandemic), you can backfill past years with the current year’s salary.

As with LOCF, this is appropriate for time-series data and suffers the same disadvantages.

#Assuming a time-series variable
df['variable'].fillna(method='backfill')
#Another implementation, combined with the group-means method
#df['age'] = df.groupby(['workclass'])['age'].bfill()

NON-RANDOM REPLACEMENT: MULTIPLE CONDITIONS

MEAN PREVIOUS / MEAN SUBSEQUENT OBSERVATIONS

Instead of relying only on a single prior or subsequent observation, what we can do for a more robust measure, in certain cases, is to average across several observations.

This is certainly preferred for research involving stocks or security prices for example.

Let us say you want to average three (3) periods and carry the result forward; the code you should use is:

df6 = df.copy()
df6[['age', 'fnlwgt']] = df6[['age', 'fnlwgt']].fillna(df6[['age', 'fnlwgt']].rolling(3, min_periods=0).mean())

If instead we want the mean of three (3) periods for the backfill:

df7 = df.copy()
#Rough code, as I can't find a more elegant solution to this
df7[['age', 'fnlwgt']] = df7[['age', 'fnlwgt']].fillna(df7[['age', 'fnlwgt']].iloc[::-1].rolling(3, min_periods=0).mean().iloc[::-1])

REGRESSION AND REGRESSION WITH ERROR

Regression and regression-with-error methods fill in the missing values of a variable by predicting them based on the other variables in the dataset.

In a way, you can think of the missing value as the target variable in a linear regression model. Come to think of it, when you employ any supervised learning model, you are trying to predict or recover an unobserved outcome. And missing data are, by themselves, unobserved outcomes.

The prediction can use all the other variables in the dataset or just a subset of them.

We could craft code that does this from scratch, but let us simply use an available package: autoimpute.

After running pip install autoimpute in your terminal, we can run the following code:

from autoimpute.imputations import SingleImputer, MultipleImputer
df8 = df.copy()
# create an instance of the single imputer and impute the data
# with autoimpute, you can choose a strategy per category or variable 
si_dict = SingleImputer(strategy={"age":'least squares', "fnlwgt": 'least squares'})
si_data_full = si_dict.fit_transform(df8[['age', 'fnlwgt']])
The dataset with imputed values from the least squares regression model.
While the technique is fancy, it seems comparable with the other methods in terms of parameter estimates. Of course, results may differ in actual machine learning training, and this is something we need to test for ourselves.

In some cases, adding error to the regression prediction allows greater stochasticity, which may improve the parameter estimates of the model, especially the variance. Unfortunately, this can’t be accomplished through autoimpute, but we can do so if the regression model is made from scratch.
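
A from-scratch sketch of regression with error: predict 'fnlwgt' from a couple of fully observed columns (here we assume 'education-num' and 'hours-per-week' have no missing values) and add Gaussian noise scaled by the residual standard deviation:

from sklearn.linear_model import LinearRegression
np.random.seed(25)
df_reg = df.copy()
predictors = ['education-num', 'hours-per-week']  #assumed to be fully observed
observed = df_reg['fnlwgt'].notna()
reg = LinearRegression()
reg.fit(df_reg.loc[observed, predictors], df_reg.loc[observed, 'fnlwgt'])
#Residual standard deviation from the observed rows
residuals = df_reg.loc[observed, 'fnlwgt'] - reg.predict(df_reg.loc[observed, predictors])
resid_sd = residuals.std()
#Predict for the missing rows and add random error
preds = reg.predict(df_reg.loc[~observed, predictors])
df_reg.loc[~observed, 'fnlwgt'] = preds + np.random.normal(0, resid_sd, size=len(preds))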

One potential disadvantage of using the same variables for imputation as those included in the machine learning model you are going to build is that it may introduce bias into the parameter estimates. Preferably, use a set of variables that are not included in the machine learning model you are currently studying to carry out the regression imputation.

K-NEAREST NEIGHBORS (KNN)

Similar to the regression and regression-with-error methods that we have just discussed, KNN can be used to fill in missing values in a dataset.

The intuition behind this is that a missing value can be approximated by the values of the points nearest to it.

We can use the KNNImputer from scikit-learn to accomplish this:

df9 = df.copy()
# import the KNNImputer from scikit-learn
from sklearn.impute import KNNImputer
# instantiate the KNN imputer
knn_imputer = KNNImputer(n_neighbors=3)
# impute the missing values with the KNN imputer
df9[['age', 'fnlwgt']] = knn_imputer.fit_transform(df9[['age', 'fnlwgt']])
Comparison of the KNN imputations against other imputation methods.
Mean and Variance Estimates Including KNN Imputations

As we can see above, KNN seems to perform a bit better than the other imputation methods when it comes to estimating the variance.

You are encouraged to try different values for the number of neighbors as well, to achieve better results than those above.
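
A quick way to experiment is to sweep over a few values of n_neighbors and compare the variance of the imputed 'fnlwgt' column (a rough diagnostic sketch, not a formal selection procedure):

#Compare the imputed variance of 'fnlwgt' for different neighbor counts
for k in [1, 3, 5, 10]:
    imputed = KNNImputer(n_neighbors=k).fit_transform(df[['age', 'fnlwgt']])
    print(f"n_neighbors={k}: fnlwgt variance = {imputed[:, 1].var(ddof=1):.2f}")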

FINAL THOUGHTS

While there is no single way to deal with missing data, this article sheds light on the various classes of techniques and methods one can employ to handle it, along with their weaknesses and commentary from practitioners.

This field of study is, surprisingly and rightfully, still growing, and new methods are being developed to handle missing data. Ultimately, the method chosen should bear in mind the research objective, the mechanism of missingness, and the potential to bias the dataset. Weighing all of these, some data practitioners have concluded that for simple MCAR missingness, deletion methods may be preferred.

Data scientists are encouraged to explore one or more of these methods, or even combine them, to achieve a better model.

While we have tested the effects of the different imputation methods on the parameter estimates, ultimately we want to see how these methods improve machine learning models and their predictive capacities.

In the next article, let’s look at some of the most advanced methods for dealing with missing data: model-based and multiple imputation methods.

Handling "Missing Data" Like a Pro – Part 3: Model-Based & Multiple Imputation Methods

Full code can be found on my Github page.


