
Missing Value Handling - Imputation and Advanced Models

The pros and cons of different imputation methods and the models that incorporate missing values automatically.

Missing values are a common problem in many data science projects. In my previous article about missing values here, I discussed the types of missing data and how to identify each one. In this follow-up post, I cover imputation methods and models that handle missing values directly.


Imputation

Imputation is an effective tool for handling missing values. The problems caused by the different types of missing data can be mitigated by inserting a descriptive value or by computing a value from the remaining known values.

While no imputed value is perfect, and none is better than the actual data, imputation can be better than removing the instance entirely.

There are many different approaches to missing value imputation; this article focuses on three:

  • Simple imputation
  • KNN Imputation
  • Iterative Imputation

These methods are available in the commonly used scikit-learn package and are compatible with standard data formats in Python. The basic process for imputing missing values in a dataframe with a given imputer is shown in the code block below.

import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
# df is a pandas dataframe with missing values
# fit_transform returns a numpy array
df_imputed = imputer.fit_transform(df)
# Convert back to a pandas dataframe with the original column names
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)

Simple Imputation

The most basic imputation method fills each missing data point with a constant value. Alternatively, it can calculate and impute the mean, median, or most frequent value of each feature.

When the number of features is relatively large and missing values are few, this is a practical approach, as the few imputed values may have a negligible effect on overall model performance.
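
As a minimal sketch of these options, the strategy parameter of scikit-learn's SimpleImputer selects the fill value; the small dataframe below is purely hypothetical.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataframe with missing values
df = pd.DataFrame({'age': [25, np.nan, 40, 31],
                   'income': [50000, 62000, np.nan, 58000]})

# Fill every missing entry with a fixed constant
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)

# Or fill with a per-column statistic: 'mean', 'median', or 'most_frequent'
median_imputer = SimpleImputer(strategy='median')

df_imputed = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)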

KNN Imputation

KNN imputation provides a more sophisticated approach than simple imputation. By using the K most similar records to the instance with the missing value, some of the dependencies between missing and non-missing values can be modeled.

Thus, this method is more flexible and can somewhat handle data that is missing at random.

KNN imputation is more computationally expensive than simple imputation. Still, if your dataset is not in the range of tens of millions of records, this method works fine.
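
A minimal sketch with scikit-learn's KNNImputer follows, assuming df is a pandas dataframe with missing numeric values; n_neighbors controls how many similar records are averaged.

import pandas as pd
from sklearn.impute import KNNImputer

# Each missing entry is filled using the k most similar rows,
# where similarity is measured on the features that are present
imputer = KNNImputer(n_neighbors=5, weights='distance')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)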

Iterative Imputation

Similar to KNN imputation, iterative imputation can model complex relationships between known values and predict missing features. This method is a multi-step process that creates a series of models to predict missing features based on the known values of other features.

Iterative imputation is a complicated algorithm, but the overall approach is relatively straightforward.

  1. Impute missing values with simple imputation. This step allows the models to fit and predict correctly.
  2. Determine an order of imputation. The implementation offers several options for this. The choice can affect the final result, since earlier predictions are used when imputing later features.
  3. Impute one feature by training a model on all the other features. The feature being imputed is the target variable, and its known values are used to fit the model.
  4. Repeat this process for each feature.
  5. Repeat the process across all features several times or until the changes between complete iterations are below a threshold tolerance.

Iterative imputation uses Bayesian Ridge regression as the default estimator; however, you can modify this to an estimator of your choice.
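
A minimal sketch follows, assuming df is a pandas dataframe with missing numeric values. Note that IterativeImputer is still marked experimental in scikit-learn, so the enabling import is required; the substituted random forest estimator is only an illustration.

from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Default estimator is BayesianRidge; max_iter and tol control when the cycles stop
imputer = IterativeImputer(max_iter=10, tol=1e-3, random_state=0)

# Any scikit-learn regressor can be swapped in as the per-feature model
rf_imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                              max_iter=10, random_state=0)

df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)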

One drawback to iterative imputation is that it is more computationally expensive compared to the other imputation methods. Thus, for enormous datasets, KNN imputation may be preferable.


Comparison of imputation methods

These imputation methods differ in how they fill in the data. To compare them, I run several experiments in which data is removed and then imputed again with each imputer.

The datasets have values removed at random at different levels of missingness. Each experiment is run ten times to improve the reliability of the results, and the whole procedure is repeated on three UCI datasets.

After imputation, the original dataset with the true known values is compared to the imputed dataset. The mean squared error is calculated between the two datasets to evaluate the efficacy of the imputation method.
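
A minimal sketch of one such run is shown below, assuming a complete numeric dataframe df_true (hypothetical); values are masked at random, imputed, and scored with the mean squared error on the masked entries only.

import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

def score_imputer(imputer, df_true, missing_rate=0.2):
    # Mask a fraction of the values at random
    values = df_true.to_numpy(dtype=float)
    mask = rng.random(values.shape) < missing_rate
    corrupted = values.copy()
    corrupted[mask] = np.nan
    # Impute and compare only the entries that were removed
    imputed = imputer.fit_transform(corrupted)
    return mean_squared_error(values[mask], imputed[mask])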

From the experiments, it is clear that with many missing values, the performance of simple imputation drops off rapidly. Most frequent imputation consistently performs the worst and is perhaps best used only when the data's distribution is well known.

KNN improves the performance but is more dependent on the underlying data distributions.

Iterative imputation is superior when there are many missing values in the data. These more complex methods mainly improve performance in that regime; when there are only a few missing values, simple imputation performs comparably.


Models that handle missing values

Sometimes the very fact that a value is missing is important. For example, suppose there is an extension of the breast cancer dataset. In this alternative set, there is a feature ‘measured_blood_pressure’ that is correlated with breast cancer.

In some cases, this feature is not measured by the physician because all other features indicate that it would be irrelevant from the physician’s point of view.

Now, this feature certainly does have an actual value, but it is not known. Moreover, the fact that it is missing is potentially more valuable compared to imputing some other value.

Fortunately, there are some models which handle missing data without the need for imputation. Each model used here is a gradient boosting model, an ensemble model based on decision trees.

These models handle missing values at the splits within each tree: the missing values are sent to the side of the split that reduces the overall loss the most. When a missing value appears at prediction time for a split that never saw one during training, the instance goes to the side that contained the most samples.
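
A minimal sketch with scikit-learn's HistGradientBoostingClassifier, which accepts NaN values directly; XGBoost and LightGBM behave similarly through their own APIs. The 20% missingness rate here is arbitrary.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Knock out 20% of the feature values at random; no imputation step follows
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.2] = np.nan

# NaNs are routed at each split to the side that reduces the loss the most
model = HistGradientBoostingClassifier().fit(X, y)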

Evaluation

Each of these models has many parameters that allow for fine-tuning. For details about how to properly fine-tune a model, see my other article about hyperparameter optimization here.

I am using the breast cancer dataset and the default model hyperparameters to compare each of the models.

For each test, I remove an increasing amount of the data at random from the dataset and then split it, training each model on 70% of the data and testing it on the remaining 30%.
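
A minimal sketch of this evaluation loop is below, assuming HistGradientBoostingClassifier as the model; the helper name and defaults are my own, and the ten-run averaging described next is folded into it.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

def accuracy_with_missing(X, y, missing_rate, n_repeats=10, seed=0):
    # Average test accuracy after removing a fraction of the feature values
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        X_missing = X.copy()
        X_missing[rng.random(X.shape) < missing_rate] = np.nan
        X_train, X_test, y_train, y_test = train_test_split(
            X_missing, y, test_size=0.3)
        model = HistGradientBoostingClassifier().fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.mean(scores)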

The experiment is repeated ten times at each level of missingness, and the accuracy is averaged over the ten experiments. The results are in the image below.

The performance here is less relevant than the rate of performance decline as the portion of missing data increases. In most scenarios, more than 50% missing data is unlikely. Note how the performance of these models drops by less than 5% even when 50% of the data is removed.


Conclusion

Missing data is an incredibly frustrating aspect of data science. Moreover, determining the mechanism behind the missing data can be a significant problem in itself.

However, several options are available to either handle or get around missing data. These methods each have their benefits and disadvantages, and there is no perfect approach.

Imputation offers an excellent solution to maintain flexibility with your data without removing the instances entirely. However, there are many more methods for imputation that are not discussed here.


Summary (TL;DR)

  • Imputation is an effective way to handle missing values. Use KNN imputation and iterative imputation when possible to model data that is missing at random and missing not at random. The choice depends mainly on the computational resources available and the nature of the data.
  • If the fact the data is missing is meaningful, try models that allow for missing data during training and prediction, such as XGBoost, Histogram Gradient Boosting, and LightGBM.

If you’re interested in reading articles about novel data science tools and understanding machine learning algorithms, consider following me on Medium.

If you’re interested in my writing and want to support me directly, please subscribe through the following link. This link ensures that I will receive a portion of your membership fees.

Join Medium with my referral link – Zachary Warnes

