
How To Build Scikit-learn Pipelines Like A Pro

Learn to build preprocessing, modelling, and grid search pipelines the easy way with a mini project

Photo by Mark Boss on Unsplash

Every time you pick up a dataset for a project, you are tasked with cleaning and preprocessing the data, dealing with missing data and outliers, modelling, and even performing hyperparameter searches to find the optimal set of hyperparameters to use for your estimators.

Fortunately, there is a convenient and neat way to do all of this in code: scikit-learn Pipelines.

In this article, we will take a fairly popular Kaggle dataset, perform all of these steps on it, and build a real sklearn pipeline to learn from.

Let’s get started👇

Exploring the dataset

The dataset for this mini project comes from Kaggle: the Heart Failure Detection Tabular Dataset, available under a Creative Commons license. Grab it from the Kaggle link below:

Heart Failure Prediction

Let’s import it and see what it looks like!

image by author - data preview

The next step is to split the dataset into training and test sets. Every column except the last one, "Death Event", is a feature we can train on. The last column is our target, and looking at it we can see that this is a binary classification task.

The shape of the data:
Output:
((209, 12), (90, 12), (209,), (90,))
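These shapes add up to 299 rows and are consistent with a 70/30 split. Here is a minimal sketch of such a split, assuming `test_size=0.3` and `random_state=0` (both are my assumptions, not values confirmed by the article) and using a random stand-in frame with placeholder column names instead of the real Kaggle CSV:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the Kaggle CSV: 299 rows, 12 numeric features plus a binary target.
# The column names are placeholders, not the dataset's real headers.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(12)]
df = pd.DataFrame(rng.normal(size=(299, 12)), columns=feature_names)
df["target"] = rng.integers(0, 2, size=299)

# The target is the last column; everything before it is a feature.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# test_size=0.3 and random_state=0 are assumptions; 0.3 reproduces the shapes above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

shapes = (X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(shapes)  # ((209, 12), (90, 12), (209,), (90,))
```

With the real data, you would replace the stand-in frame with the CSV loaded through `pd.read_csv`.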

Finally, we explore all the numerical columns of our dataset:

X_train.describe().T
image by author - describe the dataset

Checking for categorical columns, we verify that there are none:

# there are no categorical features
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
categorical_features
image by author - no cat features

And now, we can move on to building our pipeline!

Our Scikit-learn Pipeline

The preprocessing pipeline

First, we build our preprocessing pipeline. It consists of two components: 1) a SimpleImputer instance that fills missing values with the mean of the existing values in each column, and 2) a MinMaxScaler instance that rescales each feature to the range [0, 1].

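from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

col_transformation_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])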

We put them together using a ColumnTransformer.

A ColumnTransformer takes a list of tuples, one per transformation we want to apply to our data. Each tuple holds a name, the transformer itself, and the list of columns that transformer should be applied to. Since we only have numeric columns here, we supply all of our columns to our column transformer object.

Let’s put it all together then:
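As a minimal sketch of this step, the snippet below wraps the imputing and scaling pipeline in a ColumnTransformer named `columns_transformer`, which is the name the model pipeline uses later on. The tiny demo frame, its column names, and the step name `numeric_columns` are placeholders I chose for illustration, not the article's actual data or naming:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny placeholder frame with one missing value per column (not the Kaggle data).
X_demo = pd.DataFrame({
    "age": [40.0, 60.0, np.nan, 80.0],
    "platelets": [100.0, np.nan, 300.0, 200.0],
})

col_transformation_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])

# Every column is numeric, so the same pipeline is applied to all of them.
numeric_features = X_demo.columns.tolist()
columns_transformer = ColumnTransformer(transformers=[
    ("numeric_columns", col_transformation_pipeline, numeric_features),
])

X_demo_transformed = columns_transformer.fit_transform(X_demo)
print(X_demo_transformed)
# [[0.  0. ]
#  [0.5 0.5]
#  [0.5 1. ]
#  [1.  0.5]]
```

The missing values are first replaced by the column means (60 and 200), and only then is each column rescaled, which is why the imputed entries land at 0.5.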

Awesome! The first part of our pipeline is done!

Let’s go and build our model now.

The model pipeline

We choose a Random Forest classifier for this task. Let’s spin up a quick classifier object:

# random forest classifier
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=11, criterion='entropy', random_state=0)

Now we can combine our preprocessing and our model in a single pipeline:

rf_model_pipeline = Pipeline(steps=[
    ('preprocessing', columns_transformer),
    ('rf_model', rf_classifier),
])

Now, fitting on our training data is simple enough:

rf_model_pipeline.fit(X_train, y_train)

And finally, we can predict on our test set and calculate our accuracy score:

# predict on the test set, then score (accuracy_score is from sklearn.metrics)
y_pred = rf_model_pipeline.predict(X_test)
accuracy_score(y_test, y_pred)

Putting it together:
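Here is a self-contained sketch of the whole flow, from preprocessing to accuracy. To keep it runnable without the Kaggle file, it uses a synthetic dataset from `make_classification` with the same number of rows and features; that stand-in data, the split settings, and the step name `numeric_columns` are assumptions on my part, so the printed accuracy will not match the article's result:

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 299 rows, 12 numeric features, binary target.
X, y = make_classification(n_samples=299, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

col_transformation_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])

# X is a NumPy array here, so columns are selected by integer position.
columns_transformer = ColumnTransformer(transformers=[
    ("numeric_columns", col_transformation_pipeline, list(range(X.shape[1]))),
])

rf_classifier = RandomForestClassifier(
    n_estimators=11, criterion="entropy", random_state=0
)

rf_model_pipeline = Pipeline(steps=[
    ("preprocessing", columns_transformer),
    ("rf_model", rf_classifier),
])

# fit() learns the imputer means and scaler ranges on the training data only,
# then trains the forest on the transformed features.
rf_model_pipeline.fit(X_train, y_train)

y_pred = rf_model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

One nice property of this setup is that the test set is transformed with statistics learned from the training set, so nothing leaks from test to train.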

This is all well and good. However, what if I said that you could perform Grid Search for finding optimal hyperparameters with this pipeline as well? Wouldn’t that be cool?

Let’s explore that next!

Using GridSearch with our pipeline

We have already built our model and used it to make predictions on our dataset. We will now focus on finding the best hyperparameters for our random forest model.

Let’s build up our grid of parameters first:

params_dict = {
    'rf_model__n_estimators': np.arange(5, 100, 1),
    'rf_model__criterion': ['gini', 'entropy'],
    'rf_model__max_depth': np.arange(10, 200, 5),
}

In this case, we focus on tuning three parameters for our model:

  1. n_estimators: The number of trees in the random forest,
  2. criterion: The function used to measure the quality of a split, and
  3. max_depth: The maximum depth of each tree

One important thing to note: instead of using n_estimators directly as the parameter name in our grid, we use rf_model__n_estimators. The rf_model__ prefix is the name we gave the random forest step in our pipeline (see the previous section), followed by a double underscore.
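If you are ever unsure which names a pipeline accepts, you can ask it. The sketch below uses a simplified two-step pipeline (a stand-in I made up for illustration, not the article's full pipeline) and lists the prefixed parameter names through `get_params()`, then changes one of them with `set_params()`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Simplified stand-in pipeline; only the step name "rf_model" matters here.
pipeline = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("rf_model", RandomForestClassifier(n_estimators=11, random_state=0)),
])

# get_params() exposes every nested parameter as <step name>__<parameter>.
rf_param_names = sorted(
    name for name in pipeline.get_params() if name.startswith("rf_model__")
)
print(rf_param_names)

# set_params() accepts the same prefixed names, which is what grid search relies on.
pipeline.set_params(rf_model__n_estimators=50)
n_trees = pipeline.named_steps["rf_model"].n_estimators
print(n_trees)  # 50
```

Any name in that printed list is a valid key for the parameter grid.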

Next, we use the GridSearchCV class to search over this grid while training our classifier:

grid_search = GridSearchCV(rf_model_pipeline, params_dict, cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)

Let’s put it all together into one:

Now, it is easy enough to predict using our grid_search object like so:
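A word of caution before you run the full search: the grid above spans 95 x 2 x 38 = 7,220 combinations, which at cv=10 means 72,200 cross-validation fits. The sketch below therefore uses a much smaller grid, 3 folds, synthetic stand-in data, and a pipeline reduced to a scaler plus the forest. All of those choices are my assumptions to keep the example quick and self-contained, so the scores will differ from the article's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 299 rows, 12 numeric features, binary target.
X, y = make_classification(n_samples=299, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rf_model_pipeline = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("rf_model", RandomForestClassifier(random_state=0)),
])

# Deliberately small grid (3 x 2 x 2 = 12 candidates) so the sketch runs quickly.
params_dict = {
    "rf_model__n_estimators": np.arange(5, 20, 5),  # 5, 10, 15
    "rf_model__criterion": ["gini", "entropy"],
    "rf_model__max_depth": [5, 10],
}

# Pass n_jobs=-1 as in the article to parallelize the search across cores.
grid_search = GridSearchCV(rf_model_pipeline, params_dict, cv=3)
grid_search.fit(X_train, y_train)

# After fitting, predict() uses the best pipeline, refit on the full training set.
y_pred = grid_search.predict(X_test)
best_params = grid_search.best_params_
accuracy = accuracy_score(y_test, y_pred)

print(best_params)
print(f"Test accuracy: {accuracy:.3f}")
```

Because GridSearchCV refits the best candidate by default, you can treat the fitted `grid_search` object just like a trained pipeline.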

image by author - accuracy score

Awesome! We have now built a full pipeline for our project!

A few parting words…

So, there you have it! A full sklearn pipeline consisting of a preprocessor, a model, and a grid search, all tried out on a mini project from Kaggle. I hope you found this tutorial illuminating and easy to follow.

It’s time to give yourself a pat on the back! 😀

Find the entire code for this tutorial here. This is the code repository of all of my Data Science articles. Star and bookmark it if you please!

In the future, I’ll be coming back and doing some more Scikit-learn based articles. So follow me on Medium and stay in the loop!

I also recommend becoming a Medium member to never miss any of the Data Science articles I publish every week. Join here 👇

Join Medium with my referral link – Yash Prakash


Get connected!

Follow me on Twitter. Check out the full code repository of all of my Data Science posts!

A few other articles of mine you might be interested in:

The Nice Way To Deploy An ML Model Using Docker

31 Datasets For Your Next Data Science Project

How To Use Bash To Automate The Boring Stuff For Data Science

