
Make a rock-solid ML model using Sklearn Pipeline!

Why, How & When to use Sklearn's Pipeline!

Photo by Sigmund on Unsplash

If you are building a predictive model and achieving the desired accuracy without any preprocessing steps like cleaning the data or imputing missing values, you just happen to be the luckiest person in the world! But most of today’s data do not speak up unless you perform a decent amount of transformation and preprocessing. There are several steps involved, from preprocessing and transformation to modeling. And when putting those models into production, it is important to build a robust process that can run without disruption. That is where Sklearn Pipeline and its supporting components come into the picture.

Why use Sklearn Pipeline?

I would spin up a Jupyter notebook to start exploring my data, engineering new features, and performing preprocessing like cleaning and scaling before anything is fed into the model. However, I know this can become a mess inside Jupyter or similar IDEs. It makes more sense to create an automated pathway rather than pressing Shift+Enter through your Jupyter Notebook. Using Sklearn Pipeline is a convenient way to enforce the order of your preprocessing steps and to ensure the code’s reproducibility.

Unstable and inconsistent results from production models can significantly impact businesses that rely on Machine Learning models for their day-to-day decisions.

I am not a member of the "Jupyter haters fan-club"; I build everything inside a Jupyter Notebook myself and then convert it into a Python script that integrates Sklearn Pipeline!


General Sklearn Pipeline workflow:

If we look at a generalized diagram of an end-to-end machine learning Pipeline, it will look something like the one below:

Once the data is provided,

  • We do imputation – filling missing values with the mean, median, etc.
  • Next, some feature engineering can be done, like creating a flag for missing values or taking the mean of a groupby over a categorical feature.
  • Scaling/normalization can be applied to help the algorithm converge faster, and perhaps PCA to remove some noise.
  • Finally, the estimator is fitted to the data. You can then use the Pipeline to generate predictions for unseen data (a minimal sketch follows this list).
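
A minimal sketch of that flow, assuming purely numeric input data; the step choices and parameters here are illustrative, not prescriptive:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Each step is a (name, transformer) pair; the last step is the estimator.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # helps convergence
    ("pca", PCA(n_components=0.95)),               # keep 95% of the variance
    ("model", LogisticRegression(max_iter=1000)),  # final estimator
])

# Fitting the Pipeline runs every step in order; predicting on unseen
# data replays the same transformations automatically:
# pipe.fit(X_train, y_train); pipe.predict(X_unseen)
```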

Sklearn’s Pipeline integrates all the components mentioned above and orchestrates the process to deliver the model. Just as different Lego pieces contribute to your final product, different pieces rhyme together inside a Pipeline to make it work! Let’s briefly go through a few commonly used transformers together:

  • Pipeline: It is an assembly of steps, including the final estimator, that helps automate machine learning workflows. The figure drawn above is an example of a Pipeline.
  • Feature Union: When the data flow is not linear and multiple transformations need to run together, FeatureUnion executes them in parallel and concatenates their results. Use it when you want to apply different transformations to the same columns.
Image created by the author
  • Column Transformer: Helps apply different kinds of transformations to different columns and concatenates the results.
Image created by the author
  • make_column_selector: This Lego piece selects columns based on datatype; a regex can also be used to select columns by name.
Image created by the author
  • Simple Imputer: This transformer fills in missing values with the mean, median, most frequent value, or a constant.
  • BaseEstimator, TransformerMixin: You can create your own transformer from these base classes. You implement fit and transform, and TransformerMixin supplies fit_transform for free.
  • FunctionTransformer: This converts a user-defined function into a transformer that can be easily used and integrated with a Pipeline.
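
To make these pieces concrete, here is a small sketch that wires several of them together; the toy DataFrame, column names, and strategies are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A toy frame with one numeric and one categorical column, both with gaps.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0],
                   "city": ["NY", "LA", np.nan]})

# FeatureUnion: run two transformations on the same input in parallel
# and concatenate their outputs side by side.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("union", FeatureUnion([("scaled", StandardScaler()),
                            ("pca", PCA(n_components=1))])),
])

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer: route each group of columns to its own pipeline,
# selecting the groups by dtype via make_column_selector.
preprocess = ColumnTransformer([
    ("num", numeric_pipe, make_column_selector(dtype_include="number")),
    ("cat", categorical_pipe, make_column_selector(dtype_exclude="number")),
], sparse_threshold=0.0)  # force a dense output for easy printing

print(preprocess.fit_transform(df))
```

And a sketch of the last two pieces: a custom transformer built on BaseEstimator/TransformerMixin, plus a plain NumPy function wrapped with FunctionTransformer (the log transform is just an illustrative choice):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

class MissingFlagger(BaseEstimator, TransformerMixin):
    """Append a 0/1 column flagging rows that contain any missing value."""

    def fit(self, X, y=None):
        return self  # nothing to learn; TransformerMixin adds fit_transform

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        flag = np.isnan(X).any(axis=1).astype(float).reshape(-1, 1)
        return np.hstack([X, flag])

pipe = Pipeline([
    ("flag", MissingFlagger()),
    # FunctionTransformer turns an ordinary function into a Pipeline step.
    ("log", FunctionTransformer(np.log1p)),
])

X = np.array([[1.0, 2.0], [np.nan, 4.0]])
print(pipe.fit_transform(X))
```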

Short Example:

Let’s take a small implementation on the breast cancer dataset and see how different the code can be when using a Pipeline.

Normal ML code:
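
The snippets below are minimal sketches; the exact preprocessing steps (scaling, PCA) and the model are assumptions chosen for illustration. First, the manual version, where every transformer is fitted and applied by hand:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each step must be fitted on the training data and then re-applied,
# in exactly the same order, to every new batch of data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

pca = PCA(n_components=10).fit(X_train_s)
X_train_p = pca.transform(X_train_s)
X_test_p = pca.transform(X_test_s)

model = LogisticRegression(max_iter=1000).fit(X_train_p, y_train)
print("accuracy:", model.score(X_test_p, y_test))
```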

Implementing Pipeline:
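
And the same workflow expressed as a single Pipeline (same assumed steps as above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One object owns the whole workflow: fit once, predict anywhere.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print("accuracy:", pipe.score(X_test, y_test))
```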

As we can see from the code above, several things stand out:

  • Clean, high-quality code
  • Easy to reproduce
  • No redundant fit/transform boilerplate

Some Advantages:

Readability and reproducibility

  • Readability is one of the aspects often overlooked in the world of Machine Learning. We often give much more importance to achieving results than to the quality of the code. The code must be easily comprehensible to any user or colleague you share it with.
  • Moreover, it is important that the code produces the same results when run again and again with the same steps. This helps tremendously when experimenting with small adjustments to gain a better ML model without disturbing the other components.

Standardized workflow – less chance for data leakage

  • When using a Pipeline to fit the data, the workflow becomes very standard, and there are fewer chances for data leakage to happen. Data leakage is one of the biggest hurdles ML practitioners face when putting their models into production.
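
For example, passing the whole Pipeline to cross-validation guarantees the scaler is re-fitted inside each training fold, so no statistics from the held-out fold ever leak into preprocessing (a minimal sketch, reusing the breast cancer data from above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is fitted only on each fold's training split,
# never on the data being scored.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("mean CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```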

Easy to debug when getting an error

  • Your pipeline might throw an error or fail outright. In such cases it becomes much easier to trace the root cause of the problem if your code is clean and efficient!

Re-usability

  • You can port an already built Pipeline from one project to another with minor adjustments for the new project’s requirements. This has made me more efficient and helped me focus on other components of the entire ML pipeline.

More applications

  • We can use Grid Search in combination with a Pipeline (a minimal sketch follows this list).
  • Parallelize them using libraries like Dask.
  • You can run a loop over multiple estimators and build a Pipeline around each of them.
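
A minimal sketch of the Grid Search combination mentioned above; the parameter grid values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step names double as parameter prefixes: "<step>__<parameter>".
grid = GridSearchCV(pipe, param_grid={
    "pca__n_components": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}, cv=5)

grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```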

Last but not least, Sklearn Pipeline really should be integrated and made an essential part of your workflow. I have built several projects with it and used it extensively. Its extensions and applications are limitless, and when employed well, it can be a great tool! Thank you for your time!


Follow me on Twitter or LinkedIn. You may also reach out to me via [email protected]

