
Introduction to Scikit-learn’s Pipelines

Building an end-to-end machine learning pipeline using scikit-learn

Photo by JJ Ying on Unsplash

Solving a machine learning problem is usually quite a repetitive task when it comes to the processes we have to follow. Generally speaking, when we want to solve a regression or classification problem on tabular data, we follow three steps:

  1. Read the data from the source
  2. Preprocess the data in different ways
  3. Feed a model with the preprocessed data

In this article, I want to show beginner data scientists how to perform these steps in a more readable, maintainable, and easier way using scikit-learn’s pipelines. If you are not a beginner, you probably already know how pipelines work and the advantages they bring to the table.

Many beginners in data science tend to execute every preprocessing step one by one, using different functions in different chunks of code. Although this works, this way of preprocessing data and creating models is not clean or maintainable at all.


Let’s say we have a simple dataset such as scikit-learn’s wine dataset. A beginner data scientist would probably do something like this:
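The original code snippet is not included here, so the following is a minimal sketch of what such step-by-step code might look like. The choice of `StandardScaler`, `LogisticRegression`, and the exact split parameters are assumptions, since the article does not show them; the point is the repetition of the scaling calls for the train and test sets:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: read the data from the source
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Step 2: preprocess the data -- note the scaler is applied
# separately (and repetitively) to the train and test sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 3: feed a model with the preprocessed data
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))
```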

If we look closely at this chunk of code, we’ll see a lot of redundancy, because we are applying the same transformations twice. However, this code works properly, and we get 97.77% accuracy.

Now, we will replicate the code above using scikit-learn’s pipeline. But wait, what are scikit-learn pipelines?

A scikit-learn pipeline is a component provided by the scikit-learn package that allows us to chain different components of scikit-learn’s API and run them sequentially. This is very helpful because once we create the pipeline, it does all the X_train and X_test preprocessing and model fitting for us, which helps minimize human error during the data transformation and fitting process. Let’s see an example:
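The pipeline example itself is not shown in this extract, so here is a minimal sketch of the same workflow using `sklearn.pipeline.Pipeline`, under the same assumptions as before (a `StandardScaler` followed by a `LogisticRegression`; the step names `"scaler"` and `"model"` are arbitrary labels):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Read the data and split it
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Chain the preprocessing and the model into a single estimator
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() scales X_train and fits the model in one call;
# score() scales X_test with the already-fitted scaler and evaluates
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Note that we never touch the scaled arrays ourselves: the pipeline guarantees the test set is transformed with the parameters learned on the training set, which is exactly the kind of human error it protects us from.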

This approach also gets 97.77% accuracy, but it’s much more readable and maintainable than the previous chunk of code.

One major advantage of using these pipelines is that we can save them like any other model within scikit-learn. Since these pipelines are estimators, we can easily save them, with all their preprocessing and modelling steps, into a binary file using joblib, and once saved we can load them from that binary file whenever we want:
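The save/load snippet is also missing from this extract; here is a minimal sketch of persisting a fitted pipeline with `joblib.dump` and restoring it with `joblib.load`. The file name `wine_pipeline.joblib` is a hypothetical choice, not one from the original article:

```python
import joblib
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Persist the whole fitted pipeline (scaler + model) to one binary file.
# "wine_pipeline.joblib" is an illustrative file name.
joblib.dump(pipe, "wine_pipeline.joblib")

# Later -- possibly in a different script -- load it back and
# predict directly on raw, unscaled data
loaded_pipe = joblib.load("wine_pipeline.joblib")
preds = loaded_pipe.predict(X_test)
```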

Conclusion

As you saw above, scikit-learn pipelines are a very powerful tool that every data scientist should keep in mind before doing any preprocessing steps. However, this article barely scratches the surface of what you can do with this component. If you are interested in knowing more about pipelines, I may write another article in the future explaining how to create custom preprocessing functions that can be inserted within a scikit-learn pipeline, or how to do hyperparameter tuning on the components inside the pipeline.

If you enjoyed this post you can follow me and check my other posts too:

How to Detect, Handle and Visualize Outliers
