Coding a custom imputer in scikit-learn

Learn how to create custom imputers, including groupby aggregation for more advanced use cases

Photo by Gabriel Crismariu on Unsplash

Working with missing data is an inherent part of most machine learning projects. A typical approach is to use scikit-learn's SimpleImputer (or another imputer from the sklearn.impute module). However, the simplest approach is often not the best one, and we can gain some extra performance by using something more sophisticated.

That is why in this article I wanted to demonstrate how to code a custom scikit-learn-based imputer. To make the case more interesting, the imputer will fill in the missing values based on the groups' averages/medians.

Why should you write custom imputers as classes?

Before jumping straight into coding, I wanted to elaborate on a few reasons why writing a custom imputer class (inheriting from scikit-learn) might be worth your time:

  • It can help you develop your programming skills: while writing imputers that inherit from scikit-learn, you learn about some best practices already used by the contributors. Additionally, via inheritance you can reuse some of the already prepared methods. This way, your code will be cleaner and potentially more robust to unforeseen issues.
  • Your custom classes can be developed further over time and potentially shared with other users (or maybe even integrated into scikit-learn!).
  • More on the practical side, by creating imputers using the scikit-learn framework you make them compatible with scikit-learn's Pipelines, which make the project's flow much cleaner and easier to reproduce/productionize. Another practical benefit is the clear distinction between the fit and transform methods, so you will not accidentally introduce data leakage by including the test data in the process of determining the values used for imputing.

Implementing the custom imputer

In this section, we will implement the custom imputer in Python.

Setup

First, we load all the required libraries:
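For the code sketches in this article, the following imports are sufficient:

```python
import numpy as np
import pandas as pd

# base classes that provide get_params/set_params and fit_transform for free
from sklearn.base import BaseEstimator, TransformerMixin

# helper used to verify the imputer was fitted before transforming
from sklearn.utils.validation import check_is_fitted
```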

For writing this article, I used scikit-learn version 0.22.2.

Generating sample data

For this article we will use a toy dataset. We assume we are collecting the heights of people coming from two different populations (samples A and B), hence some variability in the data. Additionally, the first sample has a distinguishing feature called variant (with values a and b). The meaning behind this naming structure is of no importance; the goal was simply to have two different levels of possible aggregation. Then, we sample the heights from the normal distribution (using numpy.random.normal) with different values of the scale and location parameters per sample_name.
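A sketch of the data generation, where the sample sizes and the loc/scale parameters are assumptions made for illustration (sample B gets a single placeholder variant, so the grouping columns used later contain no missing values):

```python
np.random.seed(42)  # for reproducibility

df = pd.DataFrame({
    'sample_name': ['A'] * 60 + ['B'] * 40,
    # sample A comes in two variants; sample B gets a single placeholder
    # variant (assumption) so the grouping columns are complete
    'variant': ['a'] * 30 + ['b'] * 30 + ['a'] * 40,
    # heights drawn from normal distributions with different location/scale
    # per sample_name (the exact parameters are assumptions)
    'height': np.concatenate([
        np.random.normal(loc=175, scale=5, size=60),
        np.random.normal(loc=165, scale=10, size=40),
    ]),
})

# reshuffle the rows so the dataset does not look so artificial
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```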

By using sample(frac=1) we effectively reshuffle the DataFrame, so our dataset does not look so artificial. Below you can see a preview of the created DataFrame.

Preview of the generated data

Then, we replace 10 random heights with NaN values using the following code:
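A sketch of one way to do this, sampling 10 random row indices (the random_state value is arbitrary):

```python
# pick 10 random rows and overwrite their height with NaN
nan_idx = df.sample(n=10, random_state=21).index
df.loc[nan_idx, 'height'] = np.nan
```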

Now, the DataFrame is ready for imputation.

Coding the imputer

It is time to code the imputer. You can find the definition of the class below:
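The sketch below is consistent with the description in the following paragraphs; the exact docstring wording and assertion messages are illustrative:

```python
class GroupImputer(BaseEstimator, TransformerMixin):
    '''
    Imputes missing values in a pd.DataFrame using the mean or median
    of a group.

    Parameters
    ----------
    group_cols : list
        List of columns used for calculating the aggregated value.
    target : str
        The name of the column to impute.
    metric : str
        The metric used for imputation, one of ['mean', 'median'].
    '''
    def __init__(self, group_cols, target, metric='mean'):
        # validate the input before storing it
        assert metric in ['mean', 'median'], \
            'Unrecognized value for metric, should be mean/median'
        assert isinstance(group_cols, list), \
            'group_cols should be a list of columns'
        assert isinstance(target, str), 'target should be a string'

        self.group_cols = group_cols
        self.target = target
        self.metric = metric

    def fit(self, X, y=None):
        # the columns used for aggregation must themselves be complete
        assert not X[self.group_cols].isnull().any(axis=None), \
            'There are missing values in group_cols'

        # DataFrame with one row per group and the aggregated metric
        self.impute_map_ = (X.groupby(self.group_cols)[self.target]
                             .agg(self.metric)
                             .reset_index(drop=False))

        # fit should always return self
        return self

    def transform(self, X, y=None):
        # make sure the imputer was fitted before transforming
        check_is_fitted(self, 'impute_map_')

        # work on a copy so the original data is not modified
        X = X.copy()

        # fill the missing target values group by group
        for _, row in self.impute_map_.iterrows():
            ind = (X[self.group_cols] == row[self.group_cols]).all(axis=1)
            X.loc[ind, self.target] = X.loc[ind, self.target] \
                .fillna(row[self.target])

        # scikit-learn transformers return numpy arrays
        return X.values
```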

As described before, by inheriting from the sklearn.base classes (BaseEstimator, TransformerMixin) we get a lot of the work done for us, and at the same time the custom imputer class is compatible with scikit-learn's Pipelines.

So what actually happens in the background? By inheriting from BaseEstimator we automatically get the get_params and set_params methods (all scikit-learn estimators require those). Then, inheriting from TransformerMixin provides the fit_transform method.

Note: There are also other kinds of Mixin classes available for inheritance. Whether we need to do so depends on the type of estimator we want to code. For example, ClassifierMixin and RegressorMixin give us access to the score method used for evaluating the performance of the estimators.

In the __init__ method, we stored the input parameters:

  • group_cols – the list of columns to aggregate over,
  • target – the target column for imputation (the column in which the missing values are located),
  • metric – the metric we want to use for imputation; it can be either the mean or the median of the group.

Additionally, we included a set of assertions to make sure we pass in the correct input.

In the fit method, we calculate the impute_map_, which is a DataFrame with the aggregated metric used for imputing. We also check that there are no missing values in the columns used for aggregation. It is also very important to know that the fit method should always return self!

Lastly, in the transform method we replace the missing values in each group (indicated by the rows of the impute_map_) with the appropriate values. As an extra precaution, we use check_is_fitted to make sure that we have already fitted the imputer object before using the transform method. Before actually transforming the data, we make a copy of it using the copy method to make sure we do not modify the original source data. For more on the topic, you can refer to one of my previous articles.

In both the fit and transform methods, we have also specified y=None in the method definition, even though the GroupImputer class will not be using the y value of the dataset (also known as the target, not to be confused with the target parameter, which indicates the imputation target). The reason for including it is to ensure compatibility with other scikit-learn classes.

It is time to see the custom imputer in action!
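A usage sketch on the toy DataFrame created earlier:

```python
imputer = GroupImputer(
    group_cols=['sample_name', 'variant'],
    target='height',
    metric='mean'
)

# fit_transform returns a numpy array, so we rebuild the DataFrame
df_imp = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(f'df contains {df["height"].isnull().sum()} missing values.')
print(f'df_imp contains {df_imp["height"].isnull().sum()} missing values.')
```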

Running the code prints out the following:

df contains 10 missing values.
df_imp contains 0 missing values.

As with all imputers in scikit-learn, we first create an instance of the object and specify the parameters. Then, we use the fit_transform method to obtain the transformed data, with the missing values in the height column replaced by averages calculated over sample_name and variant.

To create df_imp, we need to manually convert the output of the transformation into a pd.DataFrame, as the raw output is a numpy array. That is the case with all imputers/transformers in scikit-learn.

We can see that the imputer worked as expected and replaced all the missing values in our toy DataFrame.
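As mentioned earlier, inheriting from the scikit-learn base classes also makes the imputer compatible with Pipelines. A minimal sketch, with the imputer as the only step purely to illustrate the point:

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ('group_imputer', GroupImputer(group_cols=['sample_name', 'variant'],
                                   target='height',
                                   metric='mean')),
])

# the imputer slots in like any other scikit-learn transformer
df_imp_pipe = pd.DataFrame(pipe.fit_transform(df), columns=df.columns)
```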

Conclusions

In this article, I showed how to quickly create a custom imputer by inheriting from some base classes in scikit-learn. This way, the coding is much faster and we also ensure that the imputer is compatible with the entire scikit-learn framework.

Creating custom imputers/transformers can definitely come in handy while working on machine learning projects. Additionally, we can always reuse the created classes for other projects, as we tried to make them as flexible as possible in the first place.

You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.


