Every time you pick up a dataset for a project, you have to clean and preprocess the data, deal with missing values and outliers, build models, and even run hyperparameter searches to find the optimal settings for your estimators.
As it turns out, there is a convenient and neat way to do all of this in code: scikit-learn Pipelines.
In this article, we will work through a fairly popular Kaggle dataset, perform all of these steps, and build a real sklearn pipeline to learn from.
Let’s get started👇
Exploring the dataset
The dataset we will be using for this mini project is the Heart Failure Detection Tabular Dataset from Kaggle, available under the Creative Commons license. Grab it from the Kaggle link below:
Let’s import it and see what it looks like!
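Here is a minimal loading sketch; the file name heart_failure_clinical_records_dataset.csv is an assumption based on the usual Kaggle download:

import pandas as pd

# load the heart failure dataset (file name assumed from the Kaggle download)
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
df.head()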

The next step is to split the dataset into training and test sets. All columns except the last one, "Death Event", are our training features. Looking at that last column, we can see that this is a binary classification task.
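A minimal sketch of the split, assuming the target column is named DEATH_EVENT as in the Kaggle CSV, a 70/30 split (which matches the shapes reported below), and an arbitrary random_state:

from sklearn.model_selection import train_test_split

# separate features and target (column name assumed from the Kaggle CSV)
X = df.drop('DEATH_EVENT', axis=1)
y = df['DEATH_EVENT']

# 70/30 train/test split; random_state is an arbitrary choice here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)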
The shape of the data:
Output:
((209, 12), (90, 12), (209,), (90,))
Next, we explore the numerical columns of our dataset:
X_train.describe().T

Finally, we check for categorical columns and verify that there are none:
# there are no categorical features
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
categorical_features

And now, we can move on to building our pipeline!
Our Scikit-learn Pipeline
The preprocessing pipeline
First, we build our preprocessing pipeline. It will consist of two components: 1) a MinMaxScaler instance for scaling the data to lie between (0, 1), and 2) a SimpleImputer instance for filling the missing values using the mean of the existing values in each column.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

col_transformation_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
We put them together using a ColumnTransformer. A ColumnTransformer takes tuples describing the different transformations we need to apply to our data, and each tuple also includes the list of columns that transformation should be applied to. Since we only have numeric columns here, we supply all of our columns to the column transformer object.
Let’s put it all together then:
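A minimal sketch of how the pieces can be wired up; the columns_transformer name matches the model pipeline used later, while the 'num_pipeline' label and the way the numeric columns are collected are illustrative choices:

from sklearn.compose import ColumnTransformer

# apply the imputation + scaling pipeline to all (numeric) columns
numeric_features = X_train.select_dtypes(include='number').columns.tolist()

columns_transformer = ColumnTransformer(transformers=[
    ('num_pipeline', col_transformation_pipeline, numeric_features)
])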
Awesome! The first part of our pipeline is done!
Let’s go and build our model now.
The model pipeline
We choose a Random Forest classifier for this task. Let’s spin up a quick classifier object:
from sklearn.ensemble import RandomForestClassifier

# random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=11, criterion='entropy', random_state=0)
And we can combine our preprocessing and model steps in a single pipeline:
rf_model_pipeline = Pipeline(steps=[
    ('preprocessing', columns_transformer),
    ('rf_model', rf_classifier),
])
Now, fitting on our training data is simple enough:
rf_model_pipeline.fit(X_train, y_train)
And finally, we can predict on our test set and calculate our accuracy score:
from sklearn.metrics import accuracy_score
# predict on the test set and calculate the accuracy score
y_pred = rf_model_pipeline.predict(X_test)
accuracy_score(y_test, y_pred)
Putting it together:
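As a hedged recap, the whole pipeline built so far can be sketched like this, reusing the same variable names and illustrative choices as above:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# preprocessing: impute missing values, then scale to (0, 1)
col_transformation_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
numeric_features = X_train.select_dtypes(include='number').columns.tolist()
columns_transformer = ColumnTransformer(transformers=[
    ('num_pipeline', col_transformation_pipeline, numeric_features)
])

# model: random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=11, criterion='entropy', random_state=0)

# full pipeline: preprocessing + model
rf_model_pipeline = Pipeline(steps=[
    ('preprocessing', columns_transformer),
    ('rf_model', rf_classifier),
])

# fit, predict, and evaluate
rf_model_pipeline.fit(X_train, y_train)
y_pred = rf_model_pipeline.predict(X_test)
print(accuracy_score(y_test, y_pred))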
This is all well and good. However, what if I told you that you could also perform a grid search for the optimal hyperparameters with this pipeline? Wouldn’t that be cool?
Let’s explore that next!
Using GridSearch with our pipeline
We have already built our model and used it to make predictions on our dataset. We will now focus on finding the best hyperparameters for our random forest model.
Let’s build up our grid of parameters first:
params_dict = {
    'rf_model__n_estimators': np.arange(5, 100, 1),
    'rf_model__criterion': ['gini', 'entropy'],
    'rf_model__max_depth': np.arange(10, 200, 5)
}
In this case, we focus on tuning three parameters for our model:
- n_estimators: the number of trees in the random forest,
- criterion: the function to measure the quality of a split, and
- max_depth: the maximum depth of the tree.
One important thing to note here: instead of simply using n_estimators as the parameter name in our grid, we use rf_model__n_estimators. The rf_model__ prefix comes from the name we chose for our random forest model step in the pipeline (refer to the previous section).
Next, we simply use GridSearchCV to train our classifier:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(rf_model_pipeline, params_dict, cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)
Let’s put it all together into one:
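A minimal sketch of the combined grid search, reusing the params_dict and pipeline from above and inspecting the results through GridSearchCV’s standard best_params_ and best_score_ attributes:

import numpy as np
from sklearn.model_selection import GridSearchCV

# hyperparameter grid, keyed with the 'rf_model__' step prefix
params_dict = {
    'rf_model__n_estimators': np.arange(5, 100, 1),
    'rf_model__criterion': ['gini', 'entropy'],
    'rf_model__max_depth': np.arange(10, 200, 5)
}

# 10-fold cross-validated grid search over the whole pipeline
grid_search = GridSearchCV(rf_model_pipeline, params_dict, cv=10, n_jobs=-1)
grid_search.fit(X_train, y_train)

# best hyperparameters found and the corresponding cross-validation score
print(grid_search.best_params_)
print(grid_search.best_score_)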
Now, it is easy enough to predict using our grid_search
object like so:
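A minimal sketch, assuming we score it the same way as before (GridSearchCV delegates predict to the refitted best estimator):

from sklearn.metrics import accuracy_score

# predict on the test set using the best estimator found by the grid search
y_pred = grid_search.predict(X_test)
print(accuracy_score(y_test, y_pred))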

Awesome! We have now built a full pipeline for our project!
A few parting words…
So, there you have it! A full sklearn pipeline consisting of a preprocessor, a model, and grid search, all exercised on a mini project from Kaggle. I hope you found this tutorial illuminating and easy to follow.
It’s time to give yourself a pat on the back! 😀
Find the entire code for this tutorial here. This is the code repository of all of my Data Science articles. Star and bookmark it if you please!
In the future, I’ll be coming back and doing some more Scikit-learn based articles. So follow me on Medium and stay in the loop!
I also recommend becoming a Medium member to never miss any of the Data Science articles I publish every week. Join here 👇
Get connected!
Follow me on Twitter. Check out the full code repository of all of my Data Science posts!
A few other articles of mine you might be interested in:
The Nice Way To Deploy An ML Model Using Docker
31 Datasets For Your Next Data Science Project
How To Use Bash To Automate The Boring Stuff For Data Science