Custom Transformers for Machine Learning Pipelines using Sklearn

Introduction to Custom Transformers: A Walk-through in Scikit-learn with Python

Srivignesh Rajan
Towards Data Science


Data Transformers

Data Transformers are Python classes that modify data according to a specific requirement. Data Transformers in scikit-learn include SimpleImputer, StandardScaler, LabelEncoder, and many more. SimpleImputer modifies data by imputing the missing values in data columns, StandardScaler modifies data by standardizing column values to zero mean and unit variance, and LabelEncoder maps the categorical values in a column to numerical values. In short, every Data Transformer modifies the data according to the purpose it was made for.

Data Transformers usually expose methods such as fit and transform. What exactly do they do? Let us take a look!

In practice, the fit method computes the appropriate statistics from the training data, and the transform method modifies both the training and the test data. The fit method must never be applied to the test data; the test data must only be transformed using the statistics computed from the training data. Unlike transform, the fit method does not modify the data. Consider the following example:
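A minimal sketch of mean imputation with SimpleImputer, using a toy DataFrame (the column names here are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A small frame with one missing value in each column.
X = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                  "salary": [1000.0, 2000.0, np.nan]})

# fit computes the mean of every column; it does not modify X.
imputer = SimpleImputer(strategy="mean")
imputer.fit(X)
print(imputer.statistics_)  # the learned column means

# transform replaces the missing values with the learned means.
X_imputed = imputer.transform(X)
print(X_imputed)
```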

The above code fills in the missing values in each column with the mean of that column. When we fit the imputer on the data X, SimpleImputer computes the mean of each column and stores the means in an instance variable called statistics_. The fit method does not impute anything on its own; the data is modified only when the transform method is invoked.

Custom Data Transformers

Custom Data Transformers can add functionality to an existing Data Transformer, or they can be created independently to serve a specific purpose. A key advantage of Custom Data Transformers is that they can be used with sklearn’s Pipeline. We will discuss sklearn’s Pipeline later in this post.

In this post, let us walk through both aspects of Custom Data Transformers.

  • Custom Data Transformers as an extension to existing Data Transformers
  • Independent Custom Data Transformers

Custom Data Transformers as an extension to existing Data Transformers

Let us create a Custom MinMaxScaler that receives the data, extracts numerical columns, and scales the values according to the following formula.

x_scaled = (x - x_min) / (x_max - x_min)

MinMaxScaler accepts only columns with numerical values; if the data we provide contains categorical columns, the scaler throws a ValueError like the one shown below.

ValueError: could not convert string to float: 'Banu'
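A quick way to reproduce this failure, assuming a toy frame with one categorical column (the names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A frame mixing a categorical column with a numerical one.
df = pd.DataFrame({"name": ["Banu", "Ravi", "Siva"],
                   "age": [20, 30, 40]})

try:
    # MinMaxScaler tries to convert everything to float and fails on strings.
    MinMaxScaler().fit_transform(df)
except ValueError as err:
    print(err)
```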

CustomMinMaxScaler extends MinMaxScaler to support data containing categorical columns as well. CustomMinMaxScaler does nothing to the categorical columns; it extracts all the numerical columns and applies the scaler to them.

How does CustomMinMaxScaler work?

  • When the fit method is invoked, the code extracts all the numerical columns and fits the scaler on them.
  • When the transform method is invoked, the code extracts all the numerical columns, transforms them, concatenates the transformed numerical data with the remaining (categorical) data, and returns the result.
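The steps above can be sketched as follows; this is a minimal reconstruction built on BaseEstimator and TransformerMixin, with illustrative data:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

class CustomMinMaxScaler(BaseEstimator, TransformerMixin):
    """MinMaxScaler that skips categorical columns instead of raising."""

    def fit(self, X, y=None):
        # Remember the numerical columns and fit a scaler on them only.
        self.columns_ = X.select_dtypes(include="number").columns
        self.scaler_ = MinMaxScaler().fit(X[self.columns_])
        return self

    def transform(self, X):
        X = X.copy()
        # Scale numerical columns; categorical columns pass through untouched.
        X[self.columns_] = self.scaler_.transform(X[self.columns_])
        return X

df = pd.DataFrame({"name": ["Banu", "Ravi", "Siva"],
                   "age": [20, 30, 40]})
scaled = CustomMinMaxScaler().fit_transform(df)
print(scaled)
```

Because TransformerMixin supplies fit_transform for free, the class only has to implement fit and transform.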

Independent Custom Data Transformers

Independent Custom Data Transformers don’t depend on any existing Data Transformer, nor do they add functionality to one.

Let us create a custom data transformer named MeanImputer that receives data, extracts numerical columns, and imputes the missing values by the mean of the column.

How does MeanImputer work?

  • When the fit method is invoked, the code extracts all the numerical columns and fits MeanImputer on them, computing the mean of each numerical column.
  • When the transform method is invoked, the code extracts all the numerical columns, imputes the missing values in each with that column’s mean, concatenates the imputed numerical data with the remaining (categorical) data, and returns the result.
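A minimal sketch of such a transformer, again with illustrative data:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanImputer(BaseEstimator, TransformerMixin):
    """Impute missing values in numerical columns with the column mean."""

    def fit(self, X, y=None):
        # Compute and store the mean of every numerical column.
        self.means_ = X.select_dtypes(include="number").mean()
        return self

    def transform(self, X):
        X = X.copy()
        # Fill missing numerical values with the means learned during fit;
        # categorical columns are left untouched.
        X[self.means_.index] = X[self.means_.index].fillna(self.means_)
        return X

df = pd.DataFrame({"name": ["Banu", "Ravi", None],
                   "age": [20.0, np.nan, 40.0]})
imputed = MeanImputer().fit_transform(df)
print(imputed)
```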

Pipeline in sklearn

The definition of Pipeline in sklearn,

“Pipeline of transforms with a final estimator”

What does that mean?

  • A Pipeline in sklearn is a sequence of Data Transformers executed one after the other, optionally ending with a final estimator.
  • A Data Transformer must implement the fit and transform methods to be part of a Pipeline.

Let us look at an example.

The dataset used in this post is the House Price Prediction dataset from Kaggle.

Data before transformation and data after transformation

What happened here?

  • A Pipeline is defined with two sequential steps. The first step imputes the numerical columns with the mean of each column, using MeanImputer.
  • The second step scales the numerical data using CustomMinMaxScaler.
  • When fit_transform is invoked, the data is first imputed by MeanImputer, and the output is fed to the second step, CustomMinMaxScaler, which scales the numerical data.
  • Notice that the categorical data is left untouched and untransformed, because MeanImputer and CustomMinMaxScaler operate only on numerical data.
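The whole flow can be sketched end to end with compact versions of both transformers. A toy frame stands in for the Kaggle dataset here; the column names Neighborhood and LotArea are merely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class MeanImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the mean of every numerical column.
        self.means_ = X.select_dtypes(include="number").mean()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.means_.index] = X[self.means_.index].fillna(self.means_)
        return X

class CustomMinMaxScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Fit the scaler on numerical columns only.
        self.columns_ = X.select_dtypes(include="number").columns
        self.scaler_ = MinMaxScaler().fit(X[self.columns_])
        return self

    def transform(self, X):
        X = X.copy()
        X[self.columns_] = self.scaler_.transform(X[self.columns_])
        return X

# Toy stand-in for the house-price data: one categorical, one numerical column.
df = pd.DataFrame({"Neighborhood": ["A", "B", "A", "C"],
                   "LotArea": [8000.0, np.nan, 12000.0, 10000.0]})

# Step 1 imputes the missing LotArea; step 2 scales it to [0, 1].
pipeline = Pipeline([("impute", MeanImputer()),
                     ("scale", CustomMinMaxScaler())])
transformed = pipeline.fit_transform(df)
print(transformed)
```

Each step receives the output of the previous one, so the scaler never sees missing values.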

Pipeline makes our process much easier: by merely invoking fit_transform on a single pipeline instance, we get several transformations done at once ✅


Connect with me on LinkedIn, Twitter!

Happy Machine Learning!

Thank you!
