Pipelines & Custom Transformers in scikit-learn: The step-by-step guide (with Python code)

Understand the basics and workings of scikit-learn pipelines from the ground up, so that you can build your own.

Himanshu Chandra
Towards Data Science

--

This article will cover:

  1. Why another tutorial on Pipelines?
  2. Creating a Custom Transformer from scratch, to include in the Pipeline.
  3. Modifying and parameterizing Transformers.
  4. Custom target transformation via TransformedTargetRegressor.
  5. Chaining everything together in a single Pipeline.
  6. Link to download the complete code from GitHub.

There’s a video walkthrough of the code at the end, for those who prefer that format. I personally like written tutorials, but I’ve had requests for video versions in the past, so there it is.

Image by Robson Machado from Pixabay

Why another tutorial on Pipelines?

Since you are here, there’s a very good chance you already know that Pipelines make your life easier by streamlining data pre-processing. I heard that too, and tried to implement one in my code.

A shout-out to the few great tutorials I could find on the topic! I certainly recommend browsing through them, before or after reading this article:

i. https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
ii. https://machinelearningmastery.com/how-to-transform-target-variables-for-regression-with-scikit-learn
iii. http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html

It was all good while following the tutorials and using standard imputing, scaling, power-transforms, etc. But then I wanted to write specific logic to be applied to the data, and wasn’t sure what was being called where.

I tried to find a lucid explanation of when the constructor, fit() and transform() functions are actually called, but couldn’t find a simple example. So I decided to step through the code bit by bit and present my understanding, for anyone who wants to learn this from scratch.

Let’s get started then!

Creating a Custom Transformer from scratch, to include in the Pipeline

Create DataFrame

To make the examples easy to follow, we’ll create a small dataset to explore the code with.
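The original code gist isn’t embedded here, so below is a minimal sketch of such a dataset. The exact values are hypothetical; I’ve picked the test rows so that their true targets come out to exactly 14 and 17, matching the predictions discussed in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical data following y = X1 + 2*sqrt(X2)
df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = df['X1'] + 2 * np.sqrt(df['X2'])

train_X, train_y = df[['X1', 'X2']], df['y']
# Two unseen rows whose true targets are exactly 14 and 17
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})
```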

The code above creates data that follows the equation y = X1 + 2 * sqrt(X2). This ensures that a simple Linear Regression model cannot fit it perfectly.

Let’s see what prediction results are thrown at us:

LinearRegression predictions on raw data
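A sketch of that first attempt, reusing the hypothetical dataset from above (the values are assumed, not the original gist’s):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset, repeated so the snippet runs on its own
df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = df['X1'] + 2 * np.sqrt(df['X2'])
train_X, train_y = df[['X1', 'X2']], df['y']
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})  # true y: 14 and 17

m1 = LinearRegression()
m1.fit(train_X, train_y)
preds = m1.predict(test_X)
print(preds)  # in the right ballpark, but not exact
```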

A perfect prediction would be 14 and 17. The predictions are not bad, but can we do some calculations on the input features to make this better?

Predictions after input feature manipulation
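Done manually, the manipulation could look like this (same assumed data): replace X2 with 2 * sqrt(X2) in both train and test sets before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = df['X1'] + 2 * np.sqrt(df['X2'])
train_X, train_y = df[['X1', 'X2']], df['y']
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})  # true y: 14 and 17

# Manipulate the input feature: X2 -> 2*sqrt(X2)
train_X = train_X.assign(X2=2 * np.sqrt(train_X['X2']))
test_X = test_X.assign(X2=2 * np.sqrt(test_X['X2']))

m2 = LinearRegression()
m2.fit(train_X, train_y)
print(m2.predict(test_X))  # [14. 17.] up to floating point
```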

The input manipulation makes the data follow a perfect linear trend (y = X1 + X2 now), hence the perfect predictions. This is just a toy example, but suppose your analysis of some dataset suggested that such an input transformation would help: how do you apply it safely via Pipelines?

Let’s see a basic LinearRegression() model fitted by using a Pipeline.

LinearRegression() with Pipeline

As expected, we get the same predictions as in our first attempt. The syntax at this point is quite simple:

  1. We declare a pipe1 variable using the Pipeline class, with a list of steps inside it. The name of a step (in this case linear_model) can be any unique string of your choice. It is followed by an actual Transformer or Estimator (in this case, our LinearRegression() model).
  2. Like any other model, it is fitted on the training data, but using the pipe1 variable.
  3. Use pipe1 to predict on the test set, as you would with any other model.
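Those three steps, sketched with the assumed dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = df['X1'] + 2 * np.sqrt(df['X2'])
train_X, train_y = df[['X1', 'X2']], df['y']
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})

# 1. declare the pipeline; 'linear_model' is just a label of our choice
pipe1 = Pipeline(steps=[('linear_model', LinearRegression())])
# 2. fit on the training data, via the pipeline
pipe1.fit(train_X, train_y)
# 3. predict on the test set, via the pipeline
preds = pipe1.predict(test_X)
print(preds)
```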

To perform the input calculations/transformations, we’ll design a custom transformer.

Custom Input Transformer

We create a class and name it ExperimentalTransformer. All transformers we design will inherit from BaseEstimator and TransformerMixin classes as they give us pre-existing methods for free. You can read more about them in the article links I provided above.

There are 3 methods to take care of here:

  1. __init__ : This is the constructor. Called when the pipeline is initialized.
  2. fit() : Called when we fit the pipeline.
  3. transform() : Called when we use fit() or transform() on the pipeline.

For the moment, let’s just put print() messages in __init__ & fit(), and write our calculations in transform(). As you see above, we return the modified values there. All the input features will be passed into X when fit() or transform() is called.
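Put together, the transformer could look like the sketch below. The print messages and the hard-coded column name X2 follow the description above; the original gist’s exact wording may differ.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('init() called')

    def fit(self, X, y=None):
        print('fit() called')
        return self                       # fit must return self

    def transform(self, X, y=None):
        print('transform() called')
        X_ = X.copy()                     # avoid mutating the caller's DataFrame
        X_['X2'] = 2 * np.sqrt(X_['X2'])  # the input manipulation from earlier
        return X_

# quick check on a single row
out = ExperimentalTransformer().fit_transform(pd.DataFrame({'X1': [4], 'X2': [25]}))
print(out)
```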

Let’s put this into a pipeline to see the order in which these functions are called.

ExperimentalTransformer in Pipeline

As the code comments above show, you can also use the shorter make_pipeline() syntax to create pipelines.
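A self-contained sketch of that pipeline, with the assumed data and transformer from above:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline

class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('init() called')
    def fit(self, X, y=None):
        print('fit() called')
        return self
    def transform(self, X, y=None):
        print('transform() called')
        X_ = X.copy()
        X_['X2'] = 2 * np.sqrt(X_['X2'])
        return X_

df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = df['X1'] + 2 * np.sqrt(df['X2'])
train_X, train_y = df[['X1', 'X2']], df['y']
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})  # true y: 14 and 17

pipe2 = Pipeline(steps=[('experimental_trans', ExperimentalTransformer()),
                        ('linear_model', LinearRegression())])
# shorter alternative with auto-generated step names:
# pipe2 = make_pipeline(ExperimentalTransformer(), LinearRegression())
pipe2.fit(train_X, train_y)
print(pipe2.predict(test_X))  # [14. 17.] up to floating point
```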

Now the output:

Output with ExperimentalTransformer

3 important things to note:

a. __init__ was called the moment we initialized the pipe2 variable.

b. Both fit() and transform() of our ExperimentalTransformer were called when we fitted the pipeline on training data. This makes sense as that is how model fitting works. You would need to transform input features while trying to predict train_y.

c. transform() is called, as expected, when we call predict(test_X) — the input test features need to be square-rooted and doubled too before making predictions.

The result — perfect predictions!

Modifying and parameterizing Transformers

But…

We’ve assumed in the transform() function of our ExperimentalTransformer that the column is named X2. Let’s not assume that, and instead pass the column name in via the constructor, __init__().

Here’s our ExperimentalTransformer_2:

Passing arguments to the constructor
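A sketch of that transformer; the default value of additional_param shown here is made up, since the original gist isn’t embedded:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    # the argument names below MUST match the attribute names they are stored in
    def __init__(self, feature_name, additional_param='some-default'):
        print('init() called')
        self.feature_name = feature_name
        self.additional_param = additional_param

    def fit(self, X, y=None):
        print('fit() called')
        return self

    def transform(self, X, y=None):
        print('transform() called')
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

# quick check: the column name is now configurable
out = ExperimentalTransformer_2('X2').fit_transform(pd.DataFrame({'X2': [25]}))
print(out)
```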

Take care to keep the parameter name exactly the same in the function argument and in the class variable (feature_name, or whichever name you choose). Changing it will cause problems later, when we also try to transform the target feature (y): scikit-learn clones estimators by reading the argument names of __init__ through get_params() and setting them back via set_params(), and a mismatched name breaks that round trip. Cloning is also why you may see __init__ called more than once.

I also added an additional_param with a default value, just to mix things up. It’s not really needed for anything in our case, and acts as an optional argument.

Create the new pipeline now:

Calling with new Pipeline

Output is as expected:

New output, same as before

Custom target transformation via TransformedTargetRegressor

What about a situation where some pre- and post-processing needs to be done?

Consider a slightly modified data set:

y squared in dataset
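Sketching the modified dataset (same assumed values as before, with the target squared; the true test targets become 196 and 289):

```python
import numpy as np
import pandas as pd

# Same hypothetical data as before, but with y squared
df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = (df['X1'] + 2 * np.sqrt(df['X2'])) ** 2

train_X, train_y = df[['X1', 'X2']], df['y']
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})  # true y: 196 and 289
```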

Everything’s the same, but now y has been squared. To make this fit a simple linear model, we will need to square-root y before fitting, and later square any predictions the model makes.

We can use scikit-learn’s TransformedTargetRegressor to instruct our pipeline to perform some calculation and inverse-calculation on the target variable. Let’s first write those two functions:

Transform and inverse-transform functions
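A sketch of the two functions (print messages assumed, to trace the calls later):

```python
import numpy as np

def target_transform(target):
    print('target_transform() called')
    return np.sqrt(target)

def inverse_target_transform(target):
    print('inverse_target_transform() called')
    return target ** 2
```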

One square-roots y and the other squares it back.

Calling via pipeline:

TransformedTargetRegressor call

The TransformedTargetRegressor class takes regressor, func and inverse_func arguments, which connect our pipeline to these new functions.

Note how we fit the model now, not the pipeline.
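In full, with the assumed data and transformer from earlier (watch the console while fitting; the two target functions get called several times):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

def target_transform(target):
    print('target_transform() called')
    return np.sqrt(target)

def inverse_target_transform(target):
    print('inverse_target_transform() called')
    return target ** 2

df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = (df['X1'] + 2 * np.sqrt(df['X2'])) ** 2
train_X, train_y = df[['X1', 'X2']], df['y']
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})  # true y: 196 and 289

pipe2 = Pipeline(steps=[('experimental_trans', ExperimentalTransformer_2('X2')),
                        ('linear_model', LinearRegression())])
model = TransformedTargetRegressor(regressor=pipe2,
                                   func=target_transform,
                                   inverse_func=inverse_target_transform)
model.fit(train_X, train_y)   # fit the model, not the pipeline
print(model.predict(test_X))  # [196. 289.] up to floating point
```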

The output shows something interesting and unexpected, though:

First output with TargetRegressor

The results are fine, but notice how our target_transform() and inverse_target_transform() methods were called multiple times when fit() was called. By default, TransformedTargetRegressor transforms and inverse-transforms a subset of y to verify that the two functions are consistent, and that becomes an overhead in big projects with complex pipelines. The fix is simply to set the check_inverse param of TransformedTargetRegressor to False. We’ll do that in the next step, along with looking at another way to handle target transformation: using the transformer param of TransformedTargetRegressor instead of func and inverse_func.

We can pass a built-in transformer, or our own custom transformer, instead of the two functions we designed. The custom transformer will look almost identical to the one we designed earlier for our pipeline, but with an additional inverse_transform function. Here’s the implementation:

CustomTargetTransformer
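A sketch of such a transformer (print messages assumed; note that TransformedTargetRegressor will call fit, transform and inverse_transform on the target, not on the features):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTargetTransformer(BaseEstimator, TransformerMixin):
    def fit(self, target):
        print('fit() called')
        return self

    def transform(self, target):
        print('transform() called')
        return np.sqrt(target)

    def inverse_transform(self, target):
        print('inverse_transform() called')
        return target ** 2

# quick check: square-root on the way in
y_t = CustomTargetTransformer().fit_transform(np.array([16.0, 25.0]))
print(y_t)  # [4. 5.]
```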

That’s it, just use it in our TransformedTargetRegressor call now:

Call with transformer param
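The complete call could look like this (same assumed data, condensed classes, and check_inverse set to False to avoid the repeated sanity calls):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

class CustomTargetTransformer(BaseEstimator, TransformerMixin):
    def fit(self, target):
        return self
    def transform(self, target):
        print('transform() called')
        return np.sqrt(target)
    def inverse_transform(self, target):
        print('inverse_transform() called')
        return target ** 2

df = pd.DataFrame({'X1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'X2': [3, 7, 12, 20, 28, 36, 45, 55]})
df['y'] = (df['X1'] + 2 * np.sqrt(df['X2'])) ** 2
train_X, train_y = df[['X1', 'X2']], df['y']
test_X = pd.DataFrame({'X1': [4, 5], 'X2': [25, 36]})  # true y: 196 and 289

pipe2 = Pipeline(steps=[('experimental_trans', ExperimentalTransformer_2('X2')),
                        ('linear_model', LinearRegression())])
model2 = TransformedTargetRegressor(regressor=pipe2,
                                    transformer=CustomTargetTransformer(),
                                    check_inverse=False)  # no repeated checks
model2.fit(train_X, train_y)
print(model2.predict(test_X))  # [196. 289.] up to floating point
```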

The output looks fixed now:

Output with no repeated calls

One last thing to do here. We’ll use caching to save repeated computation, and also see how to get and set parameters of our pipeline from the outside (you’ll need this later if you want to run GridSearch on top of it).

get_params()

Notice how each parameter of each component of the pipeline can be accessed by joining the component’s name and the parameter name with a double underscore (__).
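For example, on a small pipeline built from two standard components (an illustration, not this article’s exact pipeline):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('linear_model', LinearRegression())])
for name in sorted(pipe.get_params()):
    print(name)  # e.g. scaler__with_mean, linear_model__fit_intercept, ...
```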

We’ll tie it all together, and even try to set a parameter from the outside: the column name X2 that we have been passing to the constructor.

Tying it all together
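A sketch of the final assembly: a cached pipeline inside TransformedTargetRegressor, with set_params() used to change the column name from outside (names and data assumed as before):

```python
from tempfile import mkdtemp
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

cache_dir = mkdtemp()  # fitted transformers get cached here
pipe = Pipeline(steps=[('experimental_trans', ExperimentalTransformer_2('X1')),
                       ('linear_model', LinearRegression())],
                memory=cache_dir)
model = TransformedTargetRegressor(regressor=pipe,
                                   func=np.sqrt,
                                   inverse_func=np.square,
                                   check_inverse=False)

# set a nested parameter from outside: point the transformer at column X2
model.set_params(regressor__experimental_trans__feature_name='X2')
print(model.get_params()['regressor__experimental_trans__feature_name'])
```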

Complete Code: https://github.com/HCGrit/MachineLearning-iamJustAStudent/tree/master/PipelineFoundation

Code walkthrough: https://youtu.be/mOYJCR0IDk8


What Next?

Any practical pipeline implementation would rarely be complete without a FeatureUnion or a ColumnTransformer. The very first reference link above walks you through FeatureUnions. References I found extremely helpful on ColumnTransformer vs FeatureUnion are:

i. https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces
ii. https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer

Also, you will eventually use GridSearch on your model. Using it with pipelines is explained here: https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html?highlight=pipeline

Using these concepts should be easy enough, now that you have a good grasp of the foundations of pipeline creation.

Interested in sharing ideas, asking questions or simply discussing thoughts? Connect with me on LinkedIn, YouTube, GitHub or through my website: I am Just a Student.

See you around & happy learning!
