
As we mentioned in the first article of this series on missing data, knowing the mechanism or structure of "missingness" is crucial because our response depends on it.
In Handling "Missing Data" Like a Pro – Part 1 – Deletion Methods, we discussed deletion methods.
In this part of the series, we focus on imputation methods, comparing their effects on the dataset as well as the advantages and disadvantages of each.
LOAD THE DATASET AND SIMULATE MISSINGNESS
Load the Adult dataset and simulate MCAR missingness as we did in the first article.
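If you no longer have the Part 1 code handy, here is a minimal sketch of one way to reproduce that setup. The UCI download URL, the column names, the 10% missing rate, and the seed are assumptions on my part, so adapt them to match your own copy:
import numpy as np
import pandas as pd
#Assumed UCI-hosted copy of the Adult dataset (adjust the path/URL to your own copy)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income']
df = pd.read_csv(url, header=None, names=cols, skipinitialspace=True, na_values='?')
#Cast the numeric columns to float so NaN can be inserted cleanly
df[['age', 'fnlwgt']] = df[['age', 'fnlwgt']].astype(float)
#Simulate MCAR: knock out roughly 10% of 'age' and 'fnlwgt' completely at random
rng = np.random.default_rng(25)
age_missing = rng.choice(df.index, size=int(0.1 * len(df)), replace=False)
fnlwgt_missing = rng.choice(df.index, size=int(0.1 * len(df)), replace=False)
df.loc[age_missing, 'age'] = np.nan
df.loc[fnlwgt_missing, 'fnlwgt'] = np.nan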
IMPUTATION METHODS
Now that we have a dataset to practice our imputations, let us begin to discuss what these are.
Below is a summary of the modern-day imputation methods we can employ in our studies:

While we will discuss the theories and concepts behind each method, let us employ Scikit-learn to do the dirty work for us.
CONSTANT REPLACEMENT METHODS
Constant imputation is the most popular single imputation method for dealing with missing data.
Constant imputation methods replace missing data in an observation with a constant value. Simple as the idea is, there are variations of this technique and some ways for data scientists to make it more effective.
MEAN SUBSTITUTION
For mean substitution, missing values are replaced with the arithmetic mean of the feature.
#Import the imputer
from sklearn.impute import SimpleImputer
#Initiate the imputer object
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
#Isolate the columns where we want to impute
X = df[['age', 'fnlwgt']]
#Fit to learn the mean
imp.fit(X)
#Impute
imp.transform(X)



Mean imputation preserves the mean of the dataset with missing values, as can be seen in our example above. This, however, is only appropriate if we assume that our data is normally distributed, where it is common to assume that most observations lie around the mean anyway. It is also substantially helpful when only a small amount of data is missing.
The main disadvantage of mean imputation is that it tends to produce biased estimates for some parameters, particularly the variance. This in turn affects the construction of confidence intervals, which is a serious issue for some researchers.
There is one way to remedy this a bit, and it applies to all the constant replacement methods: impute different means for different subgroups. For example, for the ‘age’ variable, you can choose to impute the mean age of males for every observation where age is missing and the observation belongs to the male group.
Note that for variables that are represented as an integer, such as age, you can round up or down after imputation.
MAXIMUM LIKELIHOOD MEAN SUBSTITUTION
The maximum likelihood (ML) method is a powerful technique with the greatest capability of recovering the true population parameters.
ML methods are highly praised and widely used because they make use of every available observation in the dataset to estimate the population parameters; so if your dataset is MCAR, they have the greatest probability of convergence.
We will be discussing this in detail and the mathematics behind it again in the "model-based" data augmentation article but for now, let’s calculate the ML mean of our dataset.
from scipy import stats
from scipy.optimize import minimize

#Negative log-likelihood of a normal model for 'fnlwgt'
def fnlwgt_ML_mean(params):
    mean = params[0]
    sd = params[1]
    #Calculate the negative log likelihood over the observed values
    nll = -np.sum(stats.norm.logpdf(df['fnlwgt'].dropna(), loc=mean, scale=sd))
    return nll

initParams = [7, 12]
fnlwgt_results = minimize(fnlwgt_ML_mean, initParams, method='Nelder-Mead')
mle_fnlwgt = fnlwgt_results.x[0]
This results in the following estimates of the mean and standard deviation:

If you compare this with what we obtained earlier:

For smaller datasets, as long as our distributional assumptions are correct, the ML estimate of the mean may actually be better than the one we get from ordinary mean estimation.
After getting the estimate, you can then substitute it as a constant value in the imputer.
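As a quick sketch, the ML estimate computed above can be plugged in via scikit-learn's 'constant' strategy (working on a copy of the data so the later examples are unaffected):
#Use the ML mean as the constant replacement value for 'fnlwgt'
df_mle = df.copy()
imp_mle = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=mle_fnlwgt)
df_mle[['fnlwgt']] = imp_mle.fit_transform(df_mle[['fnlwgt']])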
A particular disadvantage of maximum likelihood methods is that we need to assume the distribution of the data; prior knowledge of the distribution, or some preliminary EDA, can help in this regard. In addition, a separate MLE calculation has to be carried out for each feature, unlike the mean and median constant replacements, which the imputer handles per column automatically.
MEDIAN SUBSTITUTION
For median substitution, instead of the mean, the median is used as a replacement value for missing observations.
#Median Imputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X)
#Impute
imp.transform(X)
Median substitution, while it may be a good choice for skewed datasets, biases both the mean and the variance of the dataset. This will therefore need to be factored into the researcher's considerations.
ZERO IMPUTATION
For some types of studies, it is more natural to impute zero (‘0’) for missing values. Zero may make sense for variables that are social in nature, such as "withdrawal of interest", or for people who failed to show up for an exam and would naturally have scored zero anyway.
Of course, this is only possible for variables where zero is a valid value, so it is not an option for the age variable, since our participants are not newborns.
MODE IMPUTATION
From the name itself, mode imputation imputes the "most frequent" value for a particular variable. It is a natural fit for categorical variables, and it can also be reasonable for normally distributed numeric variables, where the mode sits close to the mean.
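Both of the last two methods can be expressed with the same SimpleImputer interface we used earlier; a short sketch (the column choice is just for illustration):
#Mode imputation: scikit-learn calls this the 'most_frequent' strategy
mode_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df_mode = df.copy()
df_mode[['age', 'fnlwgt']] = mode_imp.fit_transform(df_mode[['age', 'fnlwgt']])
#Zero imputation works the same way, with strategy='constant' and fill_value=0
zero_imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)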
RANDOM REPLACEMENT METHODS
As opposed to constant value replacement methods, random replacement methods replace missing data with randomly generated values.
There are two general ways of accomplishing this:
- Using empirical data – If you are familiar with bootstrap methods, then you can view this as similar to that. This means that observations used to replace missing values come from the available data within the dataset itself.
- Using statistical distributions – If we know the distribution of a variable, we can draw samples from theoretical/statistical distributions (i.e., parametrically). For this, we can substitute our ML estimates for the parameters, as they are considered more robust (a short sketch follows this list).
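To make the parametric route concrete, here is a minimal sketch that draws replacement values for 'fnlwgt' from a normal distribution parameterized by the ML estimates computed earlier; the normality assumption and the seed are mine:
#Parametric random replacement: fill missing 'fnlwgt' with draws from N(mle_mean, mle_sd)
rng = np.random.default_rng(25)
df_param = df.copy()
mle_sd = fnlwgt_results.x[1]  #second parameter from the Nelder-Mead fit above
missing_mask = df_param['fnlwgt'].isna()
df_param.loc[missing_mask, 'fnlwgt'] = rng.normal(mle_fnlwgt, mle_sd, size=missing_mask.sum())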
Let’s try to discuss some of the empirical random replacement methods.
HOT-DECK METHOD
Hot-deck methods replace missing values with randomly selected values from the current dataset on hand. This is contrasted with cold-deck methods, where you draw replacement values from a separate dataset.
For example, for our adult dataset, if a person has forgotten to report his/her age, this method would pick a random value from those that have reported their age.
import random

random.seed(25)
df4 = df.copy()
#For Weight: pool of observed (non-missing) values to draw from
observed_fnlwgt = df4['fnlwgt'].dropna().tolist()
df4.loc[:,'fnlwgt'] = [random.choice(observed_fnlwgt) if np.isnan(i) else i for i in df4['fnlwgt']]
#Inspect the rows that were originally missing
df4.loc[fnlwgt_missing,'fnlwgt']


Hot deck imputation may result in a standard deviation that is higher (or lower) than that of the full dataset, which is, of course, no better than an understated (or overstated) value when constructing confidence intervals.
As with mean imputation, you can do hot deck imputation using subgroups (e.g imputing a random choice, not from a full dataset, but on a subset of that dataset like male subgroup, 25–64 age subgroup, etc.).
COLD DECK METHODS
It’s possible to draw a replacement value from a separate dataset that is similar to the one with missing values.
For example, you may want to study two groups of people where the population is homogeneous but you simply happened to divide it into two groups (e.g., a Monday group and a Tuesday group). If you have missing values for the Tuesday group, say for age, then under the premise that both groups are homogeneous and randomly assigned, it is possible to fill in the missing age values using randomly chosen age values from the Monday group.
A cold deck can also be implemented using two subsets of the training data, much as we do with validation splits; see the sketch below. Be careful not to use data from your test dataset, to avoid data leakage.
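A quick sketch of the idea; the Monday/Tuesday split below is purely illustrative (here simulated by halving our single dataset), whereas in practice the donor values would come from a genuinely separate collection:
import random
#Illustrative split: treat the first half as the 'Monday' donor group, the second half as 'Tuesday'
monday_df = df.iloc[:len(df)//2]
tuesday_df = df.iloc[len(df)//2:].copy()
donor_pool = monday_df['age'].dropna().tolist()
#Fill Tuesday's missing ages with random draws from the Monday donor pool
random.seed(25)
tuesday_df['age'] = [random.choice(donor_pool) if np.isnan(v) else v for v in tuesday_df['age']]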
MODEL-BASED REPLACEMENT METHODS
Model-based replacement methods are used to generate parameter estimates conditional on the given data that we have, the observed relationship between variables, and constraints imposed by the underlying distributions.
Because we make use of underlying distributions, we refer to these methods as "model-based".
Model-based methods include those of Markov-Chain Monte Carlo (MCMC), Maximum Likelihood, Expectation-Maximization algorithm, and Bayesian Ridge.
Decision Trees and Extra Trees can be used as well, though they are not among the original model-based methods (those that rely heavily on data distributions). We’ll include them here as they are valid machine learning models anyway.
As these are beautiful, sophisticated techniques, we need to address them in a separate article so we can appreciate them more deeply.
NON-RANDOM REPLACEMENT: ONE CONDITION
GROUP MEAN / GROUP MEDIAN
We have discussed non-random replacement earlier in the constant replacement methods section.
For Group Mean and Group Median, instead of imputing a single value (mean or median) for all missing values, we divide the observations into subgroups and impute the mean/median for the missing values in those subgroups.
Examples of subgroups for sex are the male and female groups; for the age variable (which, as we saw, can be positively skewed), we can use customized age groups.
For example, if the final weight value in our example is missing, we can divide the observations into subgroups by, say, work class, compute each subgroup's mean/median, and impute it for the missing values in that subgroup.
df5 = df.copy()
#Accomplish Using Transform Method
df5["age"] = df5['age'].fillna(df.groupby('workclass')['age'].transform('mean'))
df5["fnlwgt"] = df5['fnlwgt'].fillna(df.groupby('workclass')['fnlwgt'].transform('mean'))

Using the groupby() method, you can create multiple group levels; say, after work class, you can further group by educational level (see the sketch below).
You can be as creative and exploratory in the group formulation as long as it progresses your research.
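For instance, a sketch of a two-level grouping, first by work class and then by education (the grouping variables and column names here are just for illustration):
#Two-level group means: impute within each (workclass, education) cell
df5b = df.copy()
df5b['age'] = df5b['age'].fillna(df5b.groupby(['workclass', 'education'])['age'].transform('mean'))
df5b['fnlwgt'] = df5b['fnlwgt'].fillna(df5b.groupby(['workclass', 'education'])['fnlwgt'].transform('mean'))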
LAST OBSERVATION CARRIED FORWARD (LOCF)
For some time-series data, a primary reason for missing data is attrition. For example, suppose you are studying the effect of a weight-loss program on a specific person. If you see continuous improvement up to the last recorded observation, then the first missing observation can be assumed to be around the same value as the last one. Attrition here happened because the person achieved his/her ideal weight.
Assuming that your rows are arranged per year:
#Just assuming that the variable below is time-series data
#df['column'].fillna(method='ffill')
#Another implementation but combined with groupmeans method
#df['age'] = df.groupby(['workclass'])['age'].ffill()
If you apply this method to a non-time-series dataset, it is considered a "hot deck" method, as it uses actual observations from your dataset. Caution, however, should be exercised, as this may not be appropriate in many cases: it has been shown to bias parameter estimates and inflate Type I error rates.

NEXT OBSERVATION CARRIED BACKWARD (NOCB)
Similar in spirit to LOCF, Next Observation Carried Backward (NOCB) also carries observed values over to missing ones, but it does so backward rather than forward. If you have ever heard the term "backfill", this is essentially that process.
If you think about it, there are a lot of cases where this is employed. Say for example that you are studying the salary progression of different test subjects. If you know that companies did not give a raise for a particular year (for example, during the COVID-pandemic), you can backfill past years with the current year’s salary.
As with LOCF, this is appropriate for time-series data and suffers the same disadvantages.
#Assuming a time-series variable
df['variable'].fillna(method='backfill')
#Another implementation, combined with the group-means method
#df['age'] = df.groupby(['workclass'])['age'].bfill()
NON-RANDOM REPLACEMENT: MULTIPLE CONDITIONS
MEAN PREVIOUS/ MEAN SUBSEQUENT OBSERVATIONS
Instead of relying on only one prior or one subsequent observation, a more robust option for certain cases is to average across several observations.
This is certainly preferred for research involving stocks or security prices for example.
Let us say you want to average three (3) prior periods and carry the result forward; the code would be:
df6 = df.copy()
df6[['age', 'fnlwgt']] = df6[['age', 'fnlwgt']].fillna(df6[['age', 'fnlwgt']].rolling(3, min_periods=0).mean())
If instead we wanted the mean of the three (3) subsequent periods as a backfill:
df7 = df.copy()
#Reverse the frame, take the rolling mean, reverse back, and use the result only to fill missing values
df7[['age', 'fnlwgt']] = df7[['age', 'fnlwgt']].fillna(df7[['age', 'fnlwgt']].iloc[::-1].rolling(3, min_periods=0).mean().iloc[::-1])
REGRESSION AND REGRESSION WITH ERROR
Regression and regression-with-error methods fill in missing values for a variable by predicting them from the other variables in the dataset.
In a way, you can think of the missing value as the target variable in a linear regression model. Come to think of it, whenever you employ a supervised learning model, you are trying to predict an unobserved outcome, and missing data are, by themselves, unobserved outcomes.
The prediction can use all the other variables in the dataset or just a subset of them.
We could craft code that does this from scratch, but let us simply use an available package: autoimpute.
After running pip install autoimpute on your terminal, we can run the following code:
from autoimpute.imputations import SingleImputer, MultipleImputer
df8 = df.copy()
# create an instance of the single imputer and impute the data
# with autoimpute, you can choose a strategy per category or variable
si_dict = SingleImputer(strategy={"age":'least squares', "fnlwgt": 'least squares'})
si_data_full = si_dict.fit_transform(df8[['age', 'fnlwgt']])


In some cases, adding error to the regression prediction introduces greater stochasticity, which may improve the parameter estimates of the model, especially the variance. Unfortunately, this can't be accomplished through autoimpute, but we can do so if the regression model is built from scratch; a sketch follows below.
One potential disadvantage of using the same variables for imputation as those included in the machine learning model you are going to build is that it may introduce bias into the parameter estimates. Preferably, use a set of variables that are not part of the model you are currently studying to carry out the regression imputation.
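This is not the autoimpute route above, but a minimal from-scratch sketch of regression with error: fit a regression of 'fnlwgt' on 'age' using the complete cases, predict the missing values, and add Gaussian noise scaled by the residual standard deviation (the single predictor, the normal-residual assumption, and the seed are all my choices):
from sklearn.linear_model import LinearRegression
df10 = df.copy()
complete = df10[['age', 'fnlwgt']].dropna()
#Fit a simple regression of fnlwgt on age using complete cases only
reg = LinearRegression().fit(complete[['age']], complete['fnlwgt'])
resid_sd = np.std(complete['fnlwgt'] - reg.predict(complete[['age']]))
#Predict the missing fnlwgt values and add noise drawn from the residual distribution
mask = df10['fnlwgt'].isna() & df10['age'].notna()
rng = np.random.default_rng(25)
df10.loc[mask, 'fnlwgt'] = reg.predict(df10.loc[mask, ['age']]) + rng.normal(0, resid_sd, size=mask.sum())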
K-NEAREST NEIGHBORS (KNN)
Similar to the regression and regression with error model that we have just discussed, KNN can be used to fill in missing values in a dataset.
The intuition behind this is that a point value can be approximated by the points nearest to that missing point.
We can use the KNNImputer from scikit-learn to accomplish this:
df9 = df.copy()
# importing the KNNImputer from scikit-learn
from sklearn.impute import KNNImputer
# calling the KNN class
knn_imputer = KNNImputer(n_neighbors=3)
# imputing the missing value with knn imputer
df9[['age', 'fnlwgt']] = knn_imputer.fit_transform(df9[['age', 'fnlwgt']])


As we can see above, where KNN seems to perform a bit better than the other imputation methods is in the estimation of the variance.
It is also encouraged to try different values for the number of neighbors to achieve better results than those above; see the sketch below.
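A quick sketch of what that comparison could look like, checking how the post-imputation standard deviation of 'fnlwgt' moves as the neighbor count changes (the candidate values are arbitrary):
#Compare the post-imputation standard deviation of 'fnlwgt' for a few neighbor counts
for k in (3, 5, 10):
    imputed = KNNImputer(n_neighbors=k).fit_transform(df[['age', 'fnlwgt']])
    print(k, imputed[:, 1].std())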
FINAL THOUGHTS
While there is no single way to deal with missing data, this article sheds light on the various classes of techniques and methods one can employ to handle it, along with their weaknesses and the commentary practitioners have offered on them.
This field of study is growing, and rightfully so: new methods are continually being developed to handle missing data. Ultimately, the chosen method should bear in mind the research objective, the mechanism of missingness, and the potential to bias the dataset. Factoring all these in, some data practitioners have concluded that for simple MCAR missingness, deletion methods may be preferred.
Data Scientists are encouraged to explore one or more or even combine methods to achieve a better model.
While we have tested the effects of the different imputation methods on the parameter estimates, ultimately we want to see how these methods improve machine learning models and their predictive capacities.
In the next article, let’s look at some of the most advanced methods for dealing with missing data: model-based and multiple imputation methods.
Handling "Missing Data" Like a Pro – Part 3: Model-Based & Multiple Imputation Methods
Full code can be found on my Github page.