The world’s leading publication for data science, AI, and ML professionals.

Unleash the Power of Scikit-learn’s Pipelines

An intermediate guide to scikit-learn's pipelines

Photo by Christophe Dion on Unsplash

In my last post, I wrote an introduction to scikit-learn’s pipelines. If you haven’t read it yet, you can access it through the link below:

Introduction to Scikit-learn’s Pipelines

In this article, I want to extend the previous post by showing some cool features for creating more powerful pipelines, and at the end of the post I’m going to show you how to fine-tune these pipelines to improve their accuracy.

As we saw in the previous post, pipelines can be very useful, but we barely scratched the surface in the introduction.

What if we want to apply feature transformations that take feature types into account?

Imagine we have a pandas DataFrame that contains numerical and categorical features, and we want to handle these two types of features differently. In this case, we can use the ColumnTransformer component from scikit-learn. Let’s begin!

For educational purposes, we’re going to use Kaggle’s Adult Census Income dataset:

Adult Census Income

This dataset contains 6 numerical features and 9 categorical features, so it seems suitable for the pipeline we want to build. The Jupyter notebook is available on my GitHub, and you can download it through this link:

towards-data-science-posts-notebooks/Unleash the Power of Scikit-learn’s Pipelines.ipynb at master…

Before implementing the pipeline, we have to apply a small transformation to the dataset, because all the unknown values are saved as ‘?’ instead of NaN, and the income column (the target) is a string that we want to encode as integers.
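
A minimal sketch of this cleanup step, using a tiny hypothetical sample in place of the full CSV (in the notebook, the DataFrame would come from reading the Kaggle file):

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the Adult Census Income dataset;
# in practice, df would come from pd.read_csv on the Kaggle file.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# Replace the '?' placeholders with real NaN values so scikit-learn's
# imputers can recognize them as missing.
df = df.replace("?", np.nan)

# Encode the target as integers: '<=50K' -> 0, '>50K' -> 1
df["income"] = df["income"].map({"<=50K": 0, ">50K": 1})
```

With the missing values and the target in numeric form, the rest of the preprocessing can live inside the pipeline itself.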

Once these transformations are done, we can start implementing our pipeline. To do this, we’re going to use scikit-learn’s ColumnTransformer component, selecting and handling categorical and numerical features in different ways:

make_column_selector will help us select the columns by type. In our case, these types will be int and float for numerical features and object for categorical features. If we look closely, the ColumnTransformer component also uses two variables called numerical_pipe and categorical_pipe. These pipelines are defined as follows:
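
A sketch of the two sub-pipelines and the ColumnTransformer that routes columns to them. The specific imputation strategies and encoder settings here are my assumptions, not necessarily the notebook’s exact choices:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numerical features: fill missing values, then standardize.
numerical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical features: fill missing values, then one-hot encode.
categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# make_column_selector routes each column by dtype: int/float columns go
# to numerical_pipe, object columns go to categorical_pipe.
preprocessor = ColumnTransformer([
    ("numerical", numerical_pipe, make_column_selector(dtype_include="number")),
    ("categorical", categorical_pipe, make_column_selector(dtype_include=object)),
])

# Quick check on a tiny hypothetical frame: one scaled numeric column
# plus one one-hot-encoded categorical column.
demo = pd.DataFrame({"age": [25.0, 40.0], "workclass": ["Private", "State-gov"]})
transformed = preprocessor.fit_transform(demo)
```

The nice part is that the selectors are evaluated at fit time, so the same preprocessor works on any DataFrame with this mix of dtypes.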

Once we have defined our ColumnTransformer component and all the elements within it, we will use it to create our main pipeline:
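
A sketch of the main pipeline, with the preprocessor from the previous step repeated so the snippet stands alone. The RandomForestClassifier is an assumption for illustration; the notebook’s final estimator may differ:

```python
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same preprocessor as before, repeated here for self-containment.
preprocessor = ColumnTransformer([
    ("numerical",
     Pipeline([("imputer", SimpleImputer(strategy="median")),
               ("scaler", StandardScaler())]),
     make_column_selector(dtype_include="number")),
    ("categorical",
     Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
               ("encoder", OneHotEncoder(handle_unknown="ignore"))]),
     make_column_selector(dtype_include=object)),
])

# The ColumnTransformer becomes the first step of the main pipeline,
# followed by the classifier (a RandomForestClassifier is assumed here).
pipe = Pipeline([
    ("preprocessing", preprocessor),
    ("model", RandomForestClassifier(random_state=42)),
])
```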

That’s it! Our pipeline is finished! Let’s use it for training and let’s measure its accuracy:
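A sketch of the training and evaluation step. Since the Kaggle CSV isn’t bundled here, this uses a small synthetic stand-in, so the score will not match the article’s 83.2% — only the mechanics are the same:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the Adult dataset (hypothetical data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours_per_week": rng.integers(20, 60, n),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], n),
})
y = (X["age"] + X["hours_per_week"] > 75).astype(int)

pipe = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())]),
         make_column_selector(dtype_include="number")),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("encoder", OneHotEncoder(handle_unknown="ignore"))]),
         make_column_selector(dtype_include=object)),
    ])),
    ("model", RandomForestClassifier(random_state=42)),
])

# Hold out 33% of the data as the test set, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# The whole pipeline trains in one call: imputers and encoders are fit
# on the training set only, avoiding leakage into the test set.
pipe.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipe.predict(X_test))
```

Note that calling fit on the pipeline fits every step in order, and predict runs the test data through the same fitted transformers automatically.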

Our accuracy using 33% of the data as the test set is 83.2%. That’s not bad at all!

How can we improve the accuracy of this pipeline?

To improve the results of this pipeline, we can fine-tune each of its components. Scikit-learn makes this easy with GridSearchCV. This component performs an exhaustive search, trying every possible combination of hyperparameters within our defined scope. We have to be careful here, because the computational cost of this process can skyrocket if we try too many combinations:
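
A sketch of the grid search, again on synthetic stand-in data. The grid below is purely illustrative — the notebook’s actual hyperparameter space, model, and scores will differ. The key idea is the step__parameter naming convention, which lets the grid reach inside any step of the pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the real dataset (hypothetical data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours_per_week": rng.integers(20, 60, n),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], n),
})
y = (X["age"] + X["hours_per_week"] > 75).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

pipe = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())]),
         make_column_selector(dtype_include="number")),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("encoder", OneHotEncoder(handle_unknown="ignore"))]),
         make_column_selector(dtype_include=object)),
    ])),
    ("model", RandomForestClassifier(random_state=42)),
])

# Hyperparameters are addressed as <step>__<parameter>, so the grid can
# tune parameters of any pipeline step. This grid is illustrative only.
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [5, None],
}

# Define the metric to optimize, then fit every combination with
# cross-validation; best_estimator_ is the refit winning pipeline.
grid = GridSearchCV(pipe, param_grid,
                    scoring=make_scorer(accuracy_score), cv=3)
grid.fit(X_train, y_train)
best_accuracy = accuracy_score(y_test, grid.best_estimator_.predict(X_test))
```

Because the search cross-validates every combination, doubling the values per hyperparameter multiplies the number of fits, which is why the grid has to be kept modest.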

If we look closely at the code, we have defined some hyperparameters for the pipeline, and then, after defining the metric we want to optimize, we have started the process of fitting all the possible combinations. After this process, our accuracy improves by 0.9 percentage points, reaching a total accuracy of 84.1%. It doesn’t sound like much, but this was a fairly simple optimization, and the results could improve even more by defining a bigger hyperparameter space, trying more powerful models, using Bayesian or genetic optimization instead of grid search, etc.

Conclusion

Pipelines and hyperparameter fine-tuning work very well together. This post was an example of the incredible potential of pipelines and how extremely useful they can be in a data scientist’s daily work. Although this pipeline was more complex than the one in my first post, pipelines can get much more complex still. For example, we could create custom transformations and use them in our pipelines. However, that is beyond this post’s scope; if you would like a post about custom transformers, let me know.

References

sklearn.model_selection.GridSearchCV

sklearn.metrics.make_scorer

sklearn.compose.ColumnTransformer

sklearn.pipeline.Pipeline
