Pipelines & Custom Transformers in scikit-learn: The step-by-step guide (with Python code)
Understand the basics and workings of scikit-learn pipelines from the ground up, so that you can build your own.
This article will cover:
- Why another tutorial on Pipelines?
- Creating a Custom Transformer from scratch, to include in the Pipeline.
- Modifying and parameterizing Transformers.
- Custom target transformation via TransformedTargetRegressor.
- Chaining everything together in a single Pipeline.
- Link to download the complete code from GitHub.
There’s a video walkthrough of the code at the end for those who prefer the format. I personally like written tutorials, but I’ve had requests for video versions too in the past, so there it is.
Why another tutorial on Pipelines?
Since you are here, there’s a very good chance you already know that Pipelines make your life easier by keeping data pre-processing safe and repeatable. I had heard that too, and tried to implement one in my own code.
A shout-out to the few great tutorials I could find on the topic! I certainly recommend browsing through them, before or after this article:
i. https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
ii. https://machinelearningmastery.com/how-to-transform-target-variables-for-regression-with-scikit-learn
iii. http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
It was all fine while following the tutorials and using standard imputing, scaling, power transforms, etc. But then I wanted to apply some specific logic of my own to the data, and wasn’t sure what was being called where.
I looked for a lucid explanation of when the constructor, fit() and transform() functions are actually called, but couldn’t find a simple example. So I decided to step through the code bit by bit and present my understanding for anyone who wants to learn this from scratch.
Let’s get started then!
Creating a Custom Transformer from scratch, to include in the Pipeline
To make the examples concrete, we’ll create a small dataset to explore the code with.
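The original code embed isn’t included here; a minimal reconstruction might look like the following (the exact values are my invention, chosen so the true test targets come out to 14 and 17):

```python
import numpy as np
import pandas as pd

# Illustrative data: y follows y = X1 + 2 * sqrt(X2),
# so a plain linear model cannot fit it exactly.
rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = train_X['X1'] + 2 * np.sqrt(train_X['X2'])

# Two held-out rows whose true targets are 14 and 17
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})
test_y = test_X['X1'] + 2 * np.sqrt(test_X['X2'])
```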
The code above creates data which follows the equation y = X1 + 2 * sqrt(X2). This makes sure a simple Linear Regression model is not able to fit it perfectly.
Let’s see what prediction results are thrown at us:
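A sketch of that first attempt (the dataset here is an illustrative stand-in following y = X1 + 2 * sqrt(X2), with true test targets 14 and 17):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data following y = X1 + 2 * sqrt(X2)
rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = train_X['X1'] + 2 * np.sqrt(train_X['X2'])
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})  # true y: 14, 17

m1 = LinearRegression()
m1.fit(train_X, train_y)
preds = m1.predict(test_X)
print(preds)  # close to, but not exactly, [14, 17]
```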
A perfect prediction would be 14 and 17. The predictions are not bad, but can we do some calculations on the input features to make this better?
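One such manipulation, sketched on the same illustrative data: replace X2 with 2 * sqrt(X2), so the relationship becomes exactly linear in the new features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = train_X['X1'] + 2 * np.sqrt(train_X['X2'])
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})  # true y: 14, 17

# Manually replace X2 with 2 * sqrt(X2) in both train and test
train_X2 = train_X.assign(X2=2 * np.sqrt(train_X['X2']))
test_X2 = test_X.assign(X2=2 * np.sqrt(test_X['X2']))

m2 = LinearRegression()
m2.fit(train_X2, train_y)
preds = m2.predict(test_X2)
print(preds)  # ~[14. 17.]
```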
The input manipulations cause it to fit a perfect linear trend (y = X1 + X2 now), and hence the perfect predictions. This is just a toy example, but suppose your analysis of a dataset suggested such an input transformation would help — how do you apply it safely via Pipelines?
Let’s see a basic LinearRegression() model fitted by using a Pipeline.
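A sketch, again on illustrative data (the variable and step names below match the prose, the rest is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = train_X['X1'] + 2 * np.sqrt(train_X['X2'])
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})

# A single-step pipeline: just the estimator, no transformers yet
pipe1 = Pipeline(steps=[('linear_model', LinearRegression())])
pipe1.fit(train_X, train_y)
preds = pipe1.predict(test_X)
print(preds)  # same as fitting LinearRegression directly
```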
As expected, we get the same predictions as our first attempt. The syntax at this point is quite simple:
- We declare a pipe1 variable using the Pipeline class, with an array of steps inside it. The name of a step (in this case linear_model) can be anything unique of your choice. It is followed by the actual Transformer or Estimator (in this case, our LinearRegression() model).
- Like any other model, it is fitted on the training data, but using the pipe1 variable.
- Use pipe1 to predict on the test set as you would in any other model.
To perform the input calculations/transformations, we’ll design a custom transformer.
We create a class and name it ExperimentalTransformer. All transformers we design will inherit from the BaseEstimator and TransformerMixin classes, as they give us pre-existing methods for free. You can read more about them in the article links I provided above.
There are 3 methods to take care of here:
- __init__: the constructor, called when the pipeline is initialized.
- fit(): called when we fit the pipeline.
- transform(): called when we use fit or transform on the pipeline.
For the moment, let’s just put print() messages in __init__ and fit(), and write our calculations in transform(), returning the modified values there. All the input features will be passed into X when fit() or transform() is called.
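A minimal sketch of such a transformer (the print messages and the small demo at the end are illustrative):

```python
import numpy as np
import pandas as pd  # used only for the small demo below
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('__init__ called')           # runs when the pipeline is built

    def fit(self, X, y=None):
        print('fit() called')
        return self                        # fit must return self

    def transform(self, X, y=None):
        print('transform() called')
        X_ = X.copy()                      # don't mutate the caller's data
        X_['X2'] = 2 * np.sqrt(X_['X2'])   # column name hard-coded for now
        return X_

# Quick demo on a single row
df = pd.DataFrame({'X1': [6], 'X2': [16]})
out = ExperimentalTransformer().fit(df).transform(df)
print(out)  # X2 becomes 8.0
```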
Let’s put this into a pipeline to see the order in which these functions are called.
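A self-contained sketch (data, print messages and step names are illustrative; the class is repeated so the snippet runs standalone):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        print('__init__ called')
    def fit(self, X, y=None):
        print('fit() called')
        return self
    def transform(self, X, y=None):
        print('transform() called')
        X_ = X.copy()
        X_['X2'] = 2 * np.sqrt(X_['X2'])
        return X_

rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = train_X['X1'] + 2 * np.sqrt(train_X['X2'])
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})  # true y: 14, 17

pipe2 = Pipeline(steps=[('experimental_trans', ExperimentalTransformer()),  # prints '__init__ called' here
                        ('linear_model', LinearRegression())])
# shorter alternative (from sklearn.pipeline import make_pipeline):
#   pipe2 = make_pipeline(ExperimentalTransformer(), LinearRegression())
pipe2.fit(train_X, train_y)      # prints 'fit() called' then 'transform() called'
preds2 = pipe2.predict(test_X)   # prints 'transform() called' again
print(preds2)                    # ~[14. 17.]
```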
One can also use the shorter make_pipeline() syntax to create pipelines.
Now the output:
3 important things to note:
a. __init__ was called the moment we initialized the pipe2 variable.
b. Both fit() and transform() of our ExperimentalTransformer were called when we fitted the pipeline on training data. This makes sense as that is how model fitting works. You would need to transform input features while trying to predict train_y.
c. transform() is called, as expected, when we call predict(test_X) — the input test features need to be square-rooted and doubled too before making predictions.
The result — perfect predictions!
Modifying and parameterizing Transformers
But..
We’ve assumed in the transform() function of our ExperimentalTransformer that the column name is X2. Let’s not do so and pass the column name via the constructor, __init__().
Here’s our ExperimentalTransformer_2:
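A sketch of such a parameterized transformer (the default value of additional_param and the demo row are my own placeholders):

```python
import numpy as np
import pandas as pd  # used only for the small demo below
from sklearn.base import BaseEstimator, TransformerMixin

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    # The argument names must match the attribute names exactly
    # (feature_name -> self.feature_name) so get_params()/clone() work.
    def __init__(self, feature_name, additional_param='unused-default'):
        print('__init__ called')
        self.feature_name = feature_name
        self.additional_param = additional_param

    def fit(self, X, y=None):
        print('fit() called')
        return self

    def transform(self, X, y=None):
        print('transform() called')
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

# Quick demo on a single row
df = pd.DataFrame({'X1': [6], 'X2': [16]})
out = ExperimentalTransformer_2('X2').fit(df).transform(df)
print(out)  # X2 becomes 8.0
```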
Take care to keep the parameter name exactly the same in the function argument and in the class attribute (feature_name, or whichever name you choose). Changing that will cause problems later, when we also try to transform the target feature (y): scikit-learn clones estimators (TransformedTargetRegressor, for example, clones its regressor during fit), and clone() rebuilds the object from get_params(), which relies on the __init__ argument names matching the attribute names. That cloning is also why you may see __init__ called more than once.
I also added an additional_param with a default value, just to mix things up. It’s not really needed for anything in our case, and acts as an optional argument.
Create the new pipeline now:
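A self-contained sketch (data and step names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name, additional_param='unused-default'):
        print('__init__ called')
        self.feature_name = feature_name
        self.additional_param = additional_param
    def fit(self, X, y=None):
        print('fit() called')
        return self
    def transform(self, X, y=None):
        print('transform() called')
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = train_X['X1'] + 2 * np.sqrt(train_X['X2'])
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})  # true y: 14, 17

# Column name now passed in via the constructor
pipe3 = Pipeline(steps=[('experimental_trans', ExperimentalTransformer_2('X2')),
                        ('linear_model', LinearRegression())])
pipe3.fit(train_X, train_y)
preds3 = pipe3.predict(test_X)
print(preds3)  # ~[14. 17.]
```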
Output is as expected:
Custom target transformation via TransformedTargetRegressor
What about a situation when some pre and post processing needs to be done?
Consider a slightly modified data set:
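A sketch of the modified data (same illustrative features as before, target squared):

```python
import numpy as np
import pandas as pd

# Same features as before, but the target is now squared:
# y = (X1 + 2 * sqrt(X2)) ** 2
rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = (train_X['X1'] + 2 * np.sqrt(train_X['X2'])) ** 2
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})
test_y = (test_X['X1'] + 2 * np.sqrt(test_X['X2'])) ** 2  # 196, 289
```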
Everything’s the same, but now y has been squared. To make this fit a simple linear model, we will need to square-root y before fitting our model, and later square any predictions made by the model.
We can use scikit-learn’s TransformedTargetRegressor to instruct our pipeline to perform some calculation and inverse-calculation on the target variable. Let’s first write those two functions:
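A sketch (the function names match the prose; the print messages are added so we can watch when they fire):

```python
import numpy as np

def target_transform(y):
    print('target_transform called')
    return np.sqrt(y)

def inverse_target_transform(y):
    print('inverse_target_transform called')
    return np.square(y)
```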
One square-roots y and the other squares it back.
Calling via pipeline:
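A self-contained sketch of the whole arrangement (data and step names are illustrative); running it shows the two target functions firing more often than you might expect during fit():

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

def target_transform(y):
    print('target_transform called')
    return np.sqrt(y)

def inverse_target_transform(y):
    print('inverse_target_transform called')
    return np.square(y)

rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = (train_X['X1'] + 2 * np.sqrt(train_X['X2'])) ** 2  # squared target
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})  # true y: 196, 289

pipe3 = Pipeline(steps=[('experimental_trans', ExperimentalTransformer_2('X2')),
                        ('linear_model', LinearRegression())])
model = TransformedTargetRegressor(regressor=pipe3,
                                   func=target_transform,
                                   inverse_func=inverse_target_transform)
model.fit(train_X, train_y)     # note: we fit the model, not the pipeline
preds = model.predict(test_X)
print(preds)                    # ~[196. 289.]
```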
The TransformedTargetRegressor class takes regressor, func and inverse_func arguments, which connect our pipeline to these new functions. Note how we now fit the model, not the pipeline.
The output shows something interesting and unexpected, though:
The results are fine, but can you see how our target_transform() and inverse_target_transform() methods were called multiple times when fit() was called? That becomes an overhead in big projects with complex pipelines. The fix is simply to set the check_inverse param of TransformedTargetRegressor to False. We’ll do that in the next step, along with looking at another way to handle target transformation: the transformer param of TransformedTargetRegressor, used instead of func and inverse_func.
We can pass a built-in transformer, or our own custom transformer, instead of the two functions we designed. The custom transformer will look almost identical to the one we designed earlier for our pipeline, but will have an additional inverse_transform function inside it. Here’s the implementation:
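A sketch (the class name is a placeholder of my choosing):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTargetTransformer(BaseEstimator, TransformerMixin):
    """Square-roots y on the way in, squares predictions on the way out."""
    def fit(self, target, y=None):
        return self

    def transform(self, target):
        return np.sqrt(target)

    def inverse_transform(self, target):
        return np.square(target)
```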
That’s it, just use it in our TransformedTargetRegressor call now:
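A self-contained sketch with the transformer param and check_inverse=False (data, step and class names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

class CustomTargetTransformer(BaseEstimator, TransformerMixin):
    def fit(self, target, y=None):
        return self
    def transform(self, target):
        return np.sqrt(target)
    def inverse_transform(self, target):
        return np.square(target)

rng = np.random.default_rng(0)
train_X = pd.DataFrame({'X1': rng.integers(1, 10, 50),
                        'X2': rng.integers(1, 10, 50)})
train_y = (train_X['X1'] + 2 * np.sqrt(train_X['X2'])) ** 2
test_X = pd.DataFrame({'X1': [6, 9], 'X2': [16, 16]})  # true y: 196, 289

model2 = TransformedTargetRegressor(
    regressor=Pipeline(steps=[('experimental_trans', ExperimentalTransformer_2('X2')),
                              ('linear_model', LinearRegression())]),
    transformer=CustomTargetTransformer(),
    check_inverse=False)  # skip the redundant inverse-consistency calls
model2.fit(train_X, train_y)
preds = model2.predict(test_X)
print(preds)  # ~[196. 289.]
```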
The output looks fixed now:
One last thing to do here. We’ll make use of caching to avoid repeating computations, and also see how to get or set parameters of our pipeline from the outside (you’ll need this later if you want to run GridSearch on top of it).
Notice how each parameter of each component of the pipeline can be accessed using its name followed by a double underscore (__).
We’ll tie it all together and even try to set a parameter from the outside: the column name X2 we have been passing to the constructor.
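A sketch of caching plus get_params()/set_params() (the step name and temp-directory cache are my assumptions):

```python
import numpy as np
import pandas as pd
from tempfile import mkdtemp
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class ExperimentalTransformer_2(BaseEstimator, TransformerMixin):
    def __init__(self, feature_name):
        self.feature_name = feature_name
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_ = X.copy()
        X_[self.feature_name] = 2 * np.sqrt(X_[self.feature_name])
        return X_

# memory= caches the transformers' fitted results in a directory
pipe = Pipeline(steps=[('experimental_trans', ExperimentalTransformer_2('X1')),
                       ('linear_model', LinearRegression())],
                memory=mkdtemp())

# step name + double underscore + parameter name
print(pipe.get_params()['experimental_trans__feature_name'])  # 'X1'
pipe.set_params(experimental_trans__feature_name='X2')        # set from outside
```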
Complete Code: https://github.com/HCGrit/MachineLearning-iamJustAStudent/tree/master/PipelineFoundation
Code walkthrough: https://youtu.be/mOYJCR0IDk8
What Next?
Any practical pipeline implementation would rarely be complete without using either a FeatureUnion or a ColumnTransformer. The very first reference link I provided above walks you through FeatureUnions. References I found extremely helpful for ColumnTransformers vs FeatureUnions are:
i. https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces
ii. https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer
Also, you will eventually use GridSearch on your model. Using it with pipelines is explained here: https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html?highlight=pipeline
Using these concepts should be easy enough, now that you have a good grasp of the foundations of pipeline creation.
Interested in sharing ideas, asking questions or simply discussing thoughts? Connect with me on LinkedIn, YouTube, GitHub or through my website: I am Just a Student.
See you around & happy learning!