
Working with missing data is an inherent part of most machine learning projects. A typical approach is to use scikit-learn's SimpleImputer (or another imputer from the sklearn.impute module). However, the simplest approach is often not the best one, and we can gain some extra performance by using a more sophisticated approach.
That is why in this article I wanted to demonstrate how to code a custom scikit-learn-based imputer. To make the case more interesting, the imputer will fill in the missing values based on the groups' averages/medians.
Why should you write custom imputers as classes?
Before jumping straight into coding, I wanted to elaborate on a few potential reasons why writing a custom imputer class (inheriting from scikit-learn) might be worth your time:
- It can help you develop your programming skills – while writing imputers that inherit from scikit-learn, you learn about some best practices already used by the contributors. Additionally, via inheritance you can use some of the already prepared methods. This way, your code will be better/cleaner and potentially more robust to some unforeseen issues.
- Your custom classes can be further developed over time and potentially shared with other users (or maybe even integrated into scikit-learn!).
- More on the practical side, by creating imputers using the scikit-learn framework you make them compatible with scikit-learn's Pipelines, which make the project's flow much cleaner and easier to reproduce/productionize. Another practical matter is the clear distinction between the fit and transform methods, so you will not accidentally introduce data leakage – including the test data in the process of determining the values to be used for imputing.
Implementing the custom imputer
In this section, we will implement the custom imputer in Python.
Setup
First, we load all the required libraries:
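The original import listing was omitted from this copy of the article; a likely set of imports for what follows would be:

```python
import numpy as np
import pandas as pd

# base classes and helpers used when defining the custom imputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
```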
For writing this article, I used scikit-learn version 0.22.2.
Generating sample data
For this article, we will use a toy dataset. We assume the case of collecting the heights of people coming from two different populations (samples A and B), hence some variability in the data. Additionally, the first sample also has a distinguishing feature called variant (with values of a and b). What is behind this naming structure is of no importance; the goal was to have two different levels of possible aggregation. Then, we sample the heights from the Normal distribution (using numpy.random.normal) with different values of the scale and location parameters per sample_name.
By using sample(frac=1) we basically reshuffled the DataFrame, so our dataset does not look so artificial. Below you can see the preview of the created DataFrame.

Then, we replace 10 random heights with NaN values using the following code:
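The replacement code itself was omitted from this copy; one way to do it could be the snippet below (the stand-in DataFrame and its column names are assumptions mirroring the toy dataset):

```python
import numpy as np
import pandas as pd

# minimal stand-in for the DataFrame created earlier
np.random.seed(42)
df = pd.DataFrame({
    "sample_name": ["A"] * 30 + ["B"] * 30,
    "variant": ["a"] * 15 + ["b"] * 15 + ["a"] * 30,
    "height": np.random.normal(loc=170, scale=6, size=60),
})

# pick 10 distinct row indices at random and blank out their heights
nan_idx = np.random.choice(df.index, size=10, replace=False)
df.loc[nan_idx, "height"] = np.nan
```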
Now, the DataFrame is ready for imputation.
Coding the imputer
It is time to code the imputer. You can find the definition of the class below:
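The class definition is missing from this copy of the article, so below is a reconstruction based on the description that follows (inheritance from BaseEstimator/TransformerMixin, input assertions in __init__, the impute_map_ built in fit, and check_is_fitted plus a copy in transform). Exact details such as the assertion messages are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class GroupImputer(BaseEstimator, TransformerMixin):
    """Impute the missing values of `target` with the group-wise mean/median."""

    def __init__(self, group_cols, target, metric="mean"):
        # assertions to make sure we pass in the correct input
        assert metric in ["mean", "median"], "metric should be mean/median"
        assert isinstance(group_cols, list), "group_cols should be a list of columns"
        assert isinstance(target, str), "target should be a string"
        self.group_cols = group_cols
        self.target = target
        self.metric = metric

    def fit(self, X, y=None):
        # the columns we aggregate over must not contain missing values
        assert not pd.isna(X[self.group_cols]).any(axis=None), \
            "There are missing values in group_cols"
        # aggregated metric per group, used later for imputing
        self.impute_map_ = (
            X.groupby(self.group_cols)[self.target]
            .agg(self.metric)
            .reset_index(drop=False)
        )
        # fit should always return self
        return self

    def transform(self, X, y=None):
        # make sure the imputer was fitted before transforming
        check_is_fitted(self, "impute_map_")
        X = X.copy()  # do not modify the original source data
        for _, row in self.impute_map_.iterrows():
            ind = (X[self.group_cols] == row[self.group_cols]).all(axis=1)
            X.loc[ind, self.target] = X.loc[ind, self.target].fillna(row[self.target])
        return X.values
```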
As described before, by inheriting from the sklearn.base classes (BaseEstimator, TransformerMixin) we get a lot of work done for us, and at the same time the custom imputer class is compatible with scikit-learn's Pipelines.
So what actually happens in the background? By inheriting from BaseEstimator we automatically get the get_params and set_params methods (all scikit-learn estimators require those). Then, inheriting from TransformerMixin provides the fit_transform method.
Note: There are also other kinds of Mixin classes available for inheritance. Whether we need to use them depends on the type of estimator we want to code. For example, ClassifierMixin and RegressorMixin give us access to the score method used for evaluating the performance of the estimators.
In the __init__ method, we stored the input parameters:
- group_cols – the list of columns to aggregate over,
- target – the target column for imputation (the column in which the missing values are located),
- metric – the metric we want to use for imputation; it can be either the mean or the median of the group.
Additionally, we included a set of assertions to make sure we pass in the correct input.
In the fit method, we calculate the impute_map_, which is a DataFrame with the aggregated metric used for imputing. We also check that there are no missing values in the columns we use for aggregation. It is also very important to know that the fit method should always return self!
Lastly, in the transform method we replace the missing values in each group (indicated by the rows of the impute_map_) with the appropriate values. As an extra precaution, we use check_is_fitted to make sure that we have already fitted the imputer object before using the transform method. Before actually transforming the data, we make a copy of it using the copy method to make sure we do not modify the original source data. For more on the topic, you can refer to one of my previous articles.
In both the fit and transform methods, we have also specified y=None in the method definition, even though the GroupImputer class will not be using the y value of the dataset (also known as the target, not to be confused with the target parameter, which indicates the imputation target). The reason for including it is to ensure compatibility with other scikit-learn classes.
It is time to see the custom imputer in action!
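A usage sketch might look like the following. The class body is restated here in condensed form (without the input checks) so the snippet runs on its own, and the toy data is a stand-in with assumed sizes and values:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupImputer(BaseEstimator, TransformerMixin):
    """Condensed restatement of the imputer described in this article."""

    def __init__(self, group_cols, target, metric="mean"):
        self.group_cols = group_cols
        self.target = target
        self.metric = metric

    def fit(self, X, y=None):
        self.impute_map_ = (
            X.groupby(self.group_cols)[self.target].agg(self.metric).reset_index()
        )
        return self

    def transform(self, X, y=None):
        X = X.copy()
        for _, row in self.impute_map_.iterrows():
            ind = (X[self.group_cols] == row[self.group_cols]).all(axis=1)
            X.loc[ind, self.target] = X.loc[ind, self.target].fillna(row[self.target])
        return X.values


# toy data with 10 missing heights (sizes and parameters are assumptions)
np.random.seed(42)
df = pd.DataFrame({
    "sample_name": ["A"] * 40 + ["B"] * 20,
    "variant": ["a"] * 20 + ["b"] * 20 + ["a"] * 20,
    "height": np.random.normal(loc=170, scale=6, size=60),
})
df.loc[df.index[::6][:10], "height"] = np.nan  # spread the NaNs across groups

imputer = GroupImputer(
    group_cols=["sample_name", "variant"], target="height", metric="mean"
)
df_imp = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(f"df contains {df['height'].isna().sum()} missing values.")
print(f"df_imp contains {df_imp['height'].isna().sum()} missing values.")
```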
Running the code prints out the following:
df contains 10 missing values.
df_imp contains 0 missing values.
As with all imputers in scikit-learn, we first create the instance of the object and specify the parameters. Then, we use the fit_transform method to create the new object, with the missing values in the height column replaced by the averages calculated over sample_name and variant.
To create df_imp, we actually need to manually convert the output of the transformation into a pd.DataFrame, as the original output is a numpy array. That is the case with all imputers/transformers in scikit-learn.
We can see that the imputer worked as expected and replaced all the missing values in our toy DataFrame.
Conclusions
In this article, I showed how to quickly create a custom imputer by inheriting from some base classes in scikit-learn. This way, the coding is much faster, and we also ensure that the imputer is compatible with the entire scikit-learn framework.
Creating custom imputers/transformers can definitely come in handy while working on machine learning projects. Additionally, we can always reuse the created classes in other projects, as we tried to make them as flexible as possible in the first place.
You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.