
Handling “Missing Data” Like a Pro – Part 3: Model-Based & Multiple Imputation Methods

Basic and Advanced Techniques for the 21st-Century Data Scientist

DATA SCIENCE. ANALYTICS. PYTHON

Photo by Matt Walsh on Unsplash

As we mentioned in the first article of this series on missing data, knowing the mechanism or structure of "missingness" is crucial because our choice of handling method depends primarily on it.

In Handling "Missing Data" Like a Pro – Part 1 – Deletion Methods, we have discussed deletion methods.

In Handling "Missing Data" Like a Pro – Part 2: Imputation Methods, we discussed simple imputation methods. While some imputation methods are deemed appropriate for a specific type of data, e.g. normally distributed data, MCAR missingness, etc., these methods are criticized mostly for biasing our estimates and models. Some, therefore, believed that deletion methods are safer in some circumstances.

Fortunately for us, newer categories of imputation methods address these weaknesses of the simple imputation and the deletion methods.

These are model-based and multiple imputation methods.

LOAD THE DATASET AND SIMULATE MISSINGNESS

Load the Adult dataset and simulate an MCAR dataset as described in this article.
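The exact simulation lives in that earlier article; the sketch below only shows the setup that the rest of the code assumes, and the file name, missingness rate, and seed are illustrative placeholders rather than the original values.

import numpy as np
import pandas as pd

# Load the Adult dataset and keep an intact copy for later comparison
orig_df = pd.read_csv("adult.csv")   # hypothetical local path
df = orig_df.copy()

# Knock out ~10% of 'age' and 'fnlwgt' completely at random (MCAR)
rng = np.random.default_rng(19)
mcar_mask = rng.random(len(df)) < 0.10
df.loc[mcar_mask, ["age", "fnlwgt"]] = np.nan

# Objects the later snippets rely on
X = df[["age", "fnlwgt"]].copy()      # the MCAR versions of the two columns
fnlwgt_missing = df["fnlwgt"].isna()  # rows whose fnlwgt we will inspect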

MODEL-BASED IMPUTATION METHODS

Summary of the Methods Discussed in this Article

Model-based methods are defined differently by McKnight (2007), our main reference: there, these methods are used not to estimate missing data but to generate parameter estimates as if the missing data had been observed. As such, they are more appropriately referred to as Data Augmentation Methods.

For our purposes, we call these model-based because they use machine learning / statistical models to come up with estimates for the missing data. In fact, the regression estimates from our last article arguably belong here as well, but we have separated the methods below because they are treated as much more complex (and are therefore used less often by data scientists).

We started to discuss Maximum Likelihood (ML) when we generated our ML-mean. Let us discuss two more of these approaches: the EM algorithm and the Markov Chain Monte Carlo method.

EXPECTATION-MAXIMIZATION ALGORITHM

The EM algorithm is a general method for obtaining ML estimates when data are missing (Dempster, Laird & Rubin, 1977).

Basically, the EM algorithm is composed of two steps: the expectation step (E) and the maximization step (M). It is a beautiful algorithm designed for handling latent (unobserved) variables and is therefore appropriate for missing data.

To execute this algorithm:

  1. Impute the values for missing data using Maximum-Likelihood. Use the non-missing variables per observation to calculate the ML estimate for the missing value.
  2. Generate parameter estimates for a "simulated" complete dataset based on step 1.
  3. Re-impute the values based on the parameter estimates (or "updated" parameter estimates) obtained from step 2.
  4. Re-estimate the parameters based on imputed data from step 3.

The iterative procedure stops when our parameter estimates are "no longer changing" or no longer updating; technically, when the difference between the current and updated values falls below some small epsilon.

As with many machine learning methods that use iterations, the EM algorithm produces less biased estimates.
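To make the loop concrete, here is a toy sketch of the four steps above for a single partly missing variable y and one fully observed predictor x, assuming the pair is roughly bivariate normal. This is only an illustration of the idea, not the impyute implementation used below.

import numpy as np

def em_impute(x, y, n_iter=100, tol=1e-6):
    """Toy EM-style loop: x fully observed, y partly missing."""
    y = y.astype(float).copy()
    miss = np.isnan(y)
    if not miss.any():
        return y
    y[miss] = np.nanmean(y)                          # step 1: crude initial imputation
    for _ in range(n_iter):
        mu_x, mu_y = x.mean(), y.mean()              # step 2: parameter estimates
        beta = np.cov(x, y, bias=True)[0, 1] / x.var()
        new_vals = mu_y + beta * (x[miss] - mu_x)    # step 3: re-impute from the estimates
        converged = np.max(np.abs(new_vals - y[miss])) < tol
        y[miss] = new_vals                           # step 4: repeat with updated values
        if converged:                                # stop once the updates stall
            break
    return y

# tiny usage example
x = np.array([25.0, 38.0, 44.0, 51.0, 60.0])
y = np.array([30.0, np.nan, 48.0, np.nan, 66.0])
print(em_impute(x, y))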

To perform EM imputation with a package, first install impyute through pip install impyute.

As with our linear regression imputation, it is best to include in the calculation of the ML estimates even the variables that are not part of your study (auxiliary variables), so as not to bias the model.

Suppose, as with KNN, we want to estimate the missing data using the observed values of 'age', 'fnlwgt', 'educational-num', and 'hours-per-week'.

The code would be:

df10 = df.copy()

# importing the package
import impyute as impy

# impute the missing values; impy.em expects a matrix (array) of values
cols_to_impute = ['age', 'fnlwgt', 'educational-num', 'capital-gain',
                  'capital-loss', 'hours-per-week']
df10[cols_to_impute] = impy.em(df10[cols_to_impute].values, loops=100)

# Simulate a new comparison container (so we can separate these new categories)
comparison_df = pd.concat([orig_df[['age', 'fnlwgt']], X], axis=1)

# Rename so we can compare across datasets
comparison_df.columns = ["age_orig", "fnlwgt_orig", "age_MCAR", "fnlwgt_MCAR"]
cols = comparison_df.columns.to_list()
comparison_df = pd.concat([comparison_df, df10[['age', 'fnlwgt']]], axis=1)
comparison_df.columns = [*cols, 'age_EM.imp', 'fnlwgt_EM.imp']

# View the comparison across datasets
comparison_df.loc[fnlwgt_missing, ["fnlwgt_orig", "fnlwgt_MCAR", 'fnlwgt_EM.imp']]
Comparison with EM algorithm imputations.

As we can see, with just a few lines of code we were able to perform an EM imputation. Those who have been following the series will immediately see that, so far, this is the method that comes closest on the standard deviation parameter, which is what we ideally want to recover.

This particular method, however, assumes that our data are multivariate normal, and the features with missing values cannot be assumed to be normally distributed: age and final weight are usually positively skewed and never negative.

As such, mindless application of the code resulted in the imputation of negative values for both age and final weight, which is not possible!

Negative Final Weights Imputed by The EM Method

Before applying the code above, therefore, we have to find a way to normalize the values.

We can simply apply a log transformation and review how the algorithm performs on the newly transformed variables.

df11 = df.copy()

# impute on the log scale, then map back with the exponential
cols_to_impute = ['age', 'fnlwgt', 'educational-num', 'capital-gain',
                  'capital-loss', 'hours-per-week']
df11[cols_to_impute] = np.exp(
    impy.em(np.log(df11[cols_to_impute].values), loops=100)
)
Comparison of the EM code on original and log-transformed datasets.

While the standard deviation estimate is now lower, this approach still gives better estimates than the other single imputation methods we have discussed, and the mean estimate is much closer to the original value.

MARKOV-CHAIN MONTE CARLO METHOD (MCMC)

One limitation of models that are based on the Maximum Likelihood method is that they require distributional assumptions of data (e.g. multivariate normality).

The increasingly popular Markov Chain Monte Carlo (MCMC) procedure can be used in the absence of this knowledge. The process is Bayesian in nature with the ultimate goal of obtaining a posterior distribution.

We would need to break down the concept into what Markov chains are and what Monte Carlo has to do with them, but we leave that for another article to keep this one short. Like the EM algorithm, however, MCMC augments the observed data to handle the estimation of parameters.

NumPyro provides a package for this as well. As the method requires considerably more code than the others, we direct readers to NumPyro's official tutorial: http://num.pyro.ai/en/latest/tutorials/bayesian_imputation.html
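For readers who want a taste of it without leaving this article, below is a heavily condensed sketch in the spirit of that tutorial: the missing entries of a single (assumed normal) column are treated as latent values and sampled together with the parameters. The priors, toy data, and variable names are illustrative assumptions, not NumPyro's example verbatim.

import jax
import jax.numpy as jnp
import numpy as np
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def impute_model(x, miss_idx):
    # priors for the column's (assumed) normal distribution
    mu = numpyro.sample("mu", dist.Normal(0.0, 100.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(100.0))
    # each missing entry becomes a latent value to be sampled
    x_miss = numpyro.sample(
        "x_miss", dist.Normal(mu, sigma).expand([len(miss_idx)]).mask(False)
    )
    # plug the sampled values into the gaps and score the completed column
    x_filled = jnp.asarray(x).at[miss_idx].set(x_miss)
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=x_filled)

x = np.array([39.0, 50.0, np.nan, 53.0, np.nan, 37.0])  # toy column with gaps
miss_idx = np.flatnonzero(np.isnan(x))

mcmc = MCMC(NUTS(impute_model), num_warmup=500, num_samples=1000)
mcmc.run(jax.random.PRNGKey(0), x, miss_idx)
posterior_imputations = mcmc.get_samples()["x_miss"].mean(axis=0)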

OTHER SCIKIT IMPUTERS: BAYESIAN RIDGE, DECISION TREES, EXTRA TREES, K-NEIGHBORS

One way to categorize all the methods that we have been discussing in this article is to call them "multivariate imputers". That is, they impute based on the values of all the other variables that are present in the dataset.

As most readers are assumed to be familiar with machine learning, another way to look at this is as fitting a machine learning model that imputes the missing data using the available variables in the dataset as predictors.

As the process is much the same for each, let's simply create a loop over the four estimators above.

from sklearn.experimental import enable_iterative_imputer #MUST IMPORT THIS
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor

# STEP 1 - CHOOSE VARIABLES AND CREATE A MATRIX OF VALUES
df12 = df.copy()[['age', 'fnlwgt', 'educational-num', 'capital-gain',
                  'capital-loss', 'hours-per-week']]
X = df12.values #Matrix of Values

# STEP 2 - INITIALIZE ESTIMATORS
estimators = {
    "bayesianridge": BayesianRidge(),
    "DTrees": DecisionTreeRegressor(max_features='sqrt'),
    "ETrees": ExtraTreesRegressor(n_estimators=10),
    "KNreg": KNeighborsRegressor(n_neighbors=15)
}

# STEP 3 - RUN IMPUTATIONS AND STORE IMPUTED VALUES
for key, estimator in estimators.items():
    imputer = IterativeImputer(random_state=19, estimator=estimator)
    transformed = imputer.fit_transform(X)

    # Temporarily store the imputed matrix as a DataFrame
    temp_df = pd.DataFrame(transformed, columns=df12.columns)

    # Get the updated columns list
    cols = comparison_df.columns.to_list()

    # Combine for comparison
    comparison_df = pd.concat([comparison_df, temp_df[['age', 'fnlwgt']]], axis=1)
    comparison_df.columns = [*cols, f'age_{key}.imp', f'fnlwgt_{key}.imp']
Comparison of the different scikit imputations.

Note that the estimators we can try are not limited to the ones above. These are simply the ones discussed in the official documentation found here: https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html

MULTIPLE IMPUTATION METHODS

Multiple Imputation (MI) is currently the most acclaimed approach for handling missing data. These approaches provide estimates that are unbiased (and are therefore generalizable) and recover the population variance, which is critical to statistical inference.

The main difference from single imputation is that instead of imputing a single value for a missing observation, several values (say 3 to 10) are imputed, and the average of these is treated as the final imputed value.

MI is not just one method but a term for numerous approaches that deal with multiple imputations of values. These multiple values are derived from an iterative process that uses both the:

  1. observed data and
  2. sample value generated during the iterations.

Each set of imputed values is then used to replace the missing values, creating a complete dataset. So if we choose to impute 3 values, we end up with three complete datasets.

Those multiple estimates are combined to obtain a single best estimate of the parameter of interest. For example, if our approach is a multiple regression model, three regression models are constructed, one for each complete dataset. The resulting models have their corresponding parameter and coefficient estimates, and the mean of these estimates becomes our final one.
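As an illustration of that pooling step, here is a minimal sketch; completed_dfs, the predictor list, and the target are placeholders for whatever your MI routine and analysis model actually produce.

import numpy as np
from sklearn.linear_model import LinearRegression

def pooled_coefficients(completed_dfs, predictors, target):
    """Fit one regression per completed dataset and average the coefficients."""
    coefs = [
        LinearRegression().fit(d[predictors], d[target]).coef_
        for d in completed_dfs
    ]
    return np.mean(coefs, axis=0)  # pooled (final) point estimates

# e.g. pooled_coefficients(completed_dfs, ["age", "educational-num"], "hours-per-week")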

One package that implements this in Python is miceforest, which performs Multiple Imputation by Chained Equations (MICE) with random forests (pip install miceforest).

Recall that in our earlier example, decision trees performed relatively well in recovering the population characteristics. This package applies multiple imputation using the random forest approach, so we can hope it performs even better than what we had earlier.

import miceforest as mf

df13 = df.copy()[['age', 'fnlwgt', 'educational-num', 'capital-gain',
                  'capital-loss', 'hours-per-week']]

# Create kernel.
kernel = mf.MultipleImputedKernel(
  df13,
  datasets=4,
  save_all_iterations=True,
  random_state=1989
)

# Run the MICE algorithm for 3 iterations on each of the datasets
kernel.mice(3)

# View one of the completed (imputed) datasets, here dataset index 2
kernel.complete_data(2).loc[fnlwgt_missing]
Results from MICEFOREST

There’s one more thing we need to do after generating these multiple imputed datasets: we need to average them.

mi_results = pd.concat(
    [kernel.complete_data(i).loc[:, ["age", "fnlwgt"]] for i in range(3)]
).groupby(level=0).mean()   # row-wise average of the completed datasets
Final Comparisons including MICE Forest Imputations

As we can see, our MI procedure does a terrific job of recovering the population parameters. One understated advantage of MI is that it frees us from the distributional assumptions that come with some of the methods discussed above, particularly the ML-based ones.

Because MI methods produce asymptotically unbiased estimates, they can be implemented for MAR and MNAR mechanisms! This is a great win for us data scientists.

There’s a lot more to discuss about the MICE approach, and one can review it here: https://onlinelibrary.wiley.com/doi/epdf/10.1002/sim.4067

FINAL REMARKS

While we presented a lot of modern and highly praised techniques in this series of articles, we have to keep the following in mind:

  1. That these techniques are not comprehensive. New techniques are constantly being developed that advance the way we approach missing data;
  2. That application varies per use case and research objective; in our case, we simply approached our study through parameter estimates;
  3. And that despite the sophistication of the methods we have discussed here, there is no better way to handle missing data than to avoid it. Proper research design, collection, storage, and extraction should be given appropriate thought by the researcher.

Full code can be found on my Github page.

REFERENCES

McKnight, P. E. (2007). Missing data: a gentle introduction. Guilford Press.

GitHub – AnotherSamWilson/miceforest: Multiple Imputation with Random Forests in Python

Multiple imputation using chained equations: Issues and guidance for practice

