
Overview
This post is a step-by-step guide to building pipelines that streamline the machine learning workflow. I will be using the infamous Titanic dataset, obtained from Kaggle, for this tutorial. The goal is to predict whether a given passenger survived. I will implement several classification algorithms, along with grid search and cross-validation. The dataset holds a record for each passenger consisting of 10 variables (see data dictionary below). For the purposes of this tutorial, I will only be using the train dataset, which will be split into train, validation, and test sets.

Why Pipelines?
The machine learning workflow consists of many steps, starting with data preparation (e.g., dealing with missing values, scaling/encoding, feature extraction). When first learning this workflow, we perform the data preparation one step at a time. This can become time consuming, since we need to apply the preparation steps to both the training and testing data. Pipelines allow us to streamline this process by bundling the preparation steps together while easing the task of model tuning and monitoring. Scikit-Learn’s Pipeline class provides a structure for applying a series of data transformations followed by an estimator (Mayo, 2017). For a more detailed overview, take a look at the documentation. There are many benefits to implementing a Pipeline:
- Convenience and encapsulation: We call fit and predict only once on the data to fit an entire sequence of estimators.
- Joint parameter selection: We can perform a grid search over parameters of all estimators in the pipeline.
- Cross-Validation: Pipelines help to avoid data leakage from the testing data into the trained model during cross-validation. This is achieved by ensuring that the same samples are used to train the transformers and predictors.
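As a minimal sketch of these three benefits, the example below chains a scaler and a model, fits them with one call, and cross-validates the whole chain. The synthetic data from make_classification, the LogisticRegression model, and the names X_demo, y_demo, and demo_pipeline are assumptions made for illustration only; they are not part of the Titanic workflow that follows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# One fit call fits the scaler, transforms the data, then fits the model
demo_pipeline.fit(X_demo, y_demo)
demo_preds = demo_pipeline.predict(X_demo)

# During cross-validation the scaler is re-fit on each training fold only,
# so statistics from the validation fold never reach the model
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(demo_scores.mean())
```

Because the scaler lives inside the pipeline, there is no separate "scale the test data" step to forget.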
Time to see pipelines in action! Below, I will install and import the necessary libraries, then move on to loading in the dataset and handling missing values. Once the data is ready, I will create transformers for the different data types and a column transformer to encapsulate the preprocessing steps. Finally, I’ll write a function for training a model with cross-validation and a similar function that adds grid search cross-validation.
- Installing Scikit-Learn
!pip install -U scikit-learn
- Importing the necessary libraries
# Standard Imports
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pickle
# Transformers
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
# Modeling Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from IPython.display import display, Markdown
# Pipelines
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
# Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
- Loading in the data and viewing the top 5 rows
df = pd.read_csv("titanic.csv")
df.head()

- Checking for missing values
df.isna().sum()

There are 177 out of 891 missing values in the Age column. For the purposes of this pipeline tutorial, I am going to go ahead and fill in the missing Age values with the mean age. There are 687 out of 891 missing values in the Cabin column. I am removing this feature since approximately 77% of values are missing. The Embarked feature is only missing 2 values, so we can fill these with the most common value. The Name and Ticket features both hold values unique to each passenger and will not be needed for predictive classification, so they will also be dropped.
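The drop-or-fill reasoning above can be sketched on a small made-up frame. The toy values, the name toy, and the 0.7 cutoff are hypothetical and chosen only to illustrate the decision; the tutorial itself makes these choices by inspection rather than with a fixed threshold.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the three columns with missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

# Share of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# Hypothetical cutoff: drop columns that are mostly empty
drop_threshold = 0.7
to_drop = missing_share[missing_share > drop_threshold].index.tolist()
toy = toy.drop(columns=to_drop)

# Fill the numeric column with its mean, the categorical one with its mode
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
```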
- Dropping features
df.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
- Filling na values: the Age feature with the mean age, and the Embarked feature with the most frequent value, S.
df.Age = df.Age.fillna(value=df.Age.mean())
df.Embarked = df.Embarked.fillna(value='S')
Now that we’ve handled the missing values in the dataset, we can move on to defining the continuous and categorical variables.
- Defining variables for the columns in the dataframe. The numerical and categorical lists are the names the column transformer refers to.
columns = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']
numerical = ['Age', 'Fare']
categorical = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
Next, I am going to create two functions. The first function, cross_validate, takes in a classifier and a cv (cross validator), splits the training data into training and validation folds, fits the classifier on the training fold, and predicts on it. The function then predicts on the held-out validation fold and reports the scores for both the training and validation folds.
Pipelines allow us to perform the preprocessing (e.g., standardizing, encoding) and the model fitting in one step. A pipeline can take in any number of preprocessing steps, each having .fit() and .transform() methods. Below, I am creating two transformers, a standard scaler and a one-hot encoder, one for each data type.
# Creating ss transformer to scale the continuous numerical data with StandardScaler()
ss = Pipeline(steps=[('ss', StandardScaler())])

# Creating ohe transformer to encode the categorical data with OneHotEncoder()
# Note: with the default handle_unknown='error', a category that was not seen during fit raises an error at transform time
ohe = Pipeline(steps=[('ohe', OneHotEncoder(drop='first'))])

# Creating preprocess column transformer to combine the ss and ohe pipelines
preprocess = ColumnTransformer(
    transformers=[
        ('cont', ss, numerical),
        ('cat', ohe, categorical)
    ])
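To see what a column transformer like this produces, here is a small sketch on a made-up frame. The toy values and the names toy and toy_preprocess are assumptions for illustration; the transformers are passed directly rather than wrapped in single-step pipelines, which should behave the same way here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with two numerical and two categorical columns
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q", "S"],
})

toy_preprocess = ColumnTransformer(transformers=[
    ("cont", StandardScaler(), ["Age", "Fare"]),
    ("cat", OneHotEncoder(drop="first"), ["Sex", "Embarked"]),
])

transformed = toy_preprocess.fit_transform(toy)

# 2 scaled columns + 1 Sex column + 2 Embarked columns
# (drop='first' removes one level from each categorical feature)
print(transformed.shape)  # (6, 5)
```

Columns that are not listed in any transformer are dropped by default, which is why an identifier column left in the dataframe does not reach the model.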
- Creating evaluation function to plot a confusion matrix and return the accuracy, precision, recall, and f1 scores
def evaluation(y, y_hat, title = 'Confusion Matrix'):
    # Compute the confusion matrix and the four summary scores
    cm = confusion_matrix(y, y_hat)
    precision = precision_score(y, y_hat)
    recall = recall_score(y, y_hat)
    accuracy = accuracy_score(y, y_hat)
    f1 = f1_score(y, y_hat)
    print('Recall: ', recall)
    print('Accuracy: ', accuracy)
    print('Precision: ', precision)
    print('F1: ', f1)
    # Plot the confusion matrix as an annotated heatmap
    sns.heatmap(cm, cmap='PuBu', annot=True, fmt='g', annot_kws={'size': 20})
    plt.xlabel('predicted', fontsize=18)
    plt.ylabel('actual', fontsize=18)
    plt.title(title, fontsize=18)
    plt.show()
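For readers who want to see where the four scores come from, the sketch below recomputes them by hand from the cells of a confusion matrix and compares them with scikit-learn's functions. The labels y_true and y_pred are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = survived, 0 = did not survive
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)            # of the predicted survivors, how many survived
recall = tp / (tp + fn)               # of the actual survivors, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f1)  # 0.75 0.75 0.75 0.75
```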
- Performing train_test_split on the data
X = df.drop(['Survived'], axis=1)
y = df.Survived
y = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Creating cross_validate function
- Defines the full pipeline with the preprocess and classifier pipelines
- Loop through each fold in the cross validator (default is 5)
- Fit the classifier on the train set, train_ind (prevents data leakage from test set)
- Predict on the training set
- Predict on the validation set
- Print out an evaluation report containing a confusion matrix and the mean accuracy scores for both train and validation sets
def cross_validate(classifier, cv):
    pipeline = Pipeline(steps=[
        ('preprocess', preprocess),
        ('classifier', classifier)
    ])
    train_acc = []
    test_acc = []
    for train_ind, val_ind in cv.split(X_train, y_train):
        # Fit on the training fold only
        X_t, y_t = X_train.iloc[train_ind], y_train[train_ind]
        pipeline.fit(X_t, y_t)
        y_hat_t = pipeline.predict(X_t)
        train_acc.append(accuracy_score(y_t, y_hat_t))
        # Score on the held-out validation fold
        X_val, y_val = X_train.iloc[val_ind], y_train[val_ind]
        y_hat_val = pipeline.predict(X_val)
        test_acc.append(accuracy_score(y_val, y_hat_val))
    # Confusion matrix and scores for the last validation fold
    evaluation(y_val, y_hat_val)
    print('Training Accuracy: {}'.format(np.mean(train_acc)))
    print('\n')
    print('Validation Accuracy: {}'.format(np.mean(test_acc)))
    print('\n')
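For comparison, scikit-learn ships a helper with the same name, sklearn.model_selection.cross_validate, that runs a similar fold loop and can return both training and validation scores. The sketch below is an alternative to the hand-written loop, not the function used in this tutorial; the synthetic data and the alias sk_cross_validate (chosen to avoid a name clash) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate as sk_cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# return_train_score=True adds the per-fold training scores to the result
results = sk_cross_validate(
    demo_pipeline, X_demo, y_demo,
    cv=KFold(n_splits=5), scoring="accuracy", return_train_score=True,
)

print("Training Accuracy:", np.mean(results["train_score"]))
print("Validation Accuracy:", np.mean(results["test_score"]))
```

A hand-written loop remains useful when extra per-fold output is wanted, such as the confusion matrix plot.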
Inside cross_validate, I am using the cross validator to split the training data into training and validation folds, which leaves the hold-out test set (X_test, y_test) untouched. Now we can call cross_validate with the desired classifier and cross validator.
cross_validate(DecisionTreeClassifier(), KFold())
Output:

- With K Nearest Neighbors classifier
cross_validate(KNeighborsClassifier(), KFold())
Output:

Grid Search
Let’s say we want to find the optimal parameters for the model in the pipeline. We can create a grid search pipeline. For a refresher on grid search, check out the documentation. We can write a function like the one above for cross-validation and modify it a bit to perform a grid search. This function takes in the desired classifier, parameter grid, and cross validator, then goes through the same process as the cross_validate function with a grid search.
def grid_search(classifier, param_grid, cv):
    search = GridSearchCV(Pipeline(steps=[
        ('preprocess', preprocess),
        ('classifier', classifier)
    ]), param_grid, cv=cv)
    train_acc = []
    test_acc = []
    for train_ind, val_ind in cv.split(X_train, y_train):
        # Run the grid search on the training fold only
        X_t, y_t = X_train.iloc[train_ind], y_train[train_ind]
        search.fit(X_t, y_t)
        y_hat_t = search.predict(X_t)
        train_acc.append(accuracy_score(y_t, y_hat_t))
        # Score the refit best model on the held-out validation fold
        X_val, y_val = X_train.iloc[val_ind], y_train[val_ind]
        y_hat_val = search.predict(X_val)
        test_acc.append(accuracy_score(y_val, y_hat_val))
    # Confusion matrix and scores for the last validation fold
    evaluation(y_val, y_hat_val)
    print('Training Accuracy: {}'.format(np.mean(train_acc)))
    print('\n')
    print('Validation Accuracy: {}'.format(np.mean(test_acc)))
    print('\n')
    print('Grid Search Best Params:')
    print('\n')
    print(search.best_params_)
- GridSearchCV with Random Forest
When creating a parameter grid for the model in the pipeline, each parameter name needs to be prefixed with the name of the pipeline step, followed by two underscores. In the code block below, I have prefixed each parameter with 'classifier__' to match the name of the model in the pipeline (I named the model ‘classifier’ in the pipeline).
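One way to check which prefixed names a pipeline accepts is to list them with get_params(). The sketch below uses a two-step stand-in pipeline; the name demo_pipeline and the scaler step are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", RandomForestClassifier()),
])

# Every tunable parameter of a step appears as <step name>__<parameter name>
tunable = [name for name in demo_pipeline.get_params()
           if name.startswith("classifier__")]
print(sorted(tunable)[:5])

# The same prefixed names work with set_params, which is what grid search relies on
demo_pipeline.set_params(classifier__n_estimators=300)
```

A misspelled key in the parameter grid raises an error when the search is fitted, so listing the valid names first can save a long wait.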
# Creating parameter grid for Random Forest
rand_forest_parms = {'classifier__n_estimators': [100, 300, 500],
                     'classifier__max_depth': [6, 25, 50, 70],
                     'classifier__min_samples_split': [2, 5, 10],
                     'classifier__min_samples_leaf': [1, 2, 10]}
# Calling the grid_search function using the parameters above and a cross validator
grid_search(RandomForestClassifier(), rand_forest_parms, KFold())
Output:

During model training, it is important to perform feature selection so that the model keeps its most predictive features without becoming too complex. We can check the feature importances for the classifier in the pipeline using the eli5 library. To do this, we need to create a list with the numerical feature columns and the encoded columns. Then, we call eli5.explain_weights_df with the classifier taken from the named_steps of the grid search pipeline’s best_estimator_. We can add this to our grid search function to return the top ten feature importances by modifying the function to take in a boolean value that controls whether these features are printed.
Grid Search Function with Feature Importances
- The modifications to include feature importances in the function below are the print_feat argument and the if print_feat block.
import eli5  # installed separately, e.g. with pip install eli5

def grid_search(classifier, param_grid, cv, print_feat=False):
    search = GridSearchCV(Pipeline(steps=[
        ('preprocess', preprocess),
        ('classifier', classifier)
    ]), param_grid, cv=cv)
    train_acc = []
    test_acc = []
    for train_ind, val_ind in cv.split(X_train, y_train):
        X_t, y_t = X_train.iloc[train_ind], y_train[train_ind]
        search.fit(X_t, y_t)
        y_hat_t = search.predict(X_t)
        train_acc.append(accuracy_score(y_t, y_hat_t))
        X_val, y_val = X_train.iloc[val_ind], y_train[val_ind]
        y_hat_val = search.predict(X_val)
        test_acc.append(accuracy_score(y_val, y_hat_val))
    if print_feat:
        # Names of the encoded columns (get_feature_names_out replaces get_feature_names in scikit-learn >= 1.0)
        ohe_cols = list(search.best_estimator_.named_steps['preprocess'].named_transformers_['cat'].named_steps['ohe'].get_feature_names_out(
            input_features=categorical))
        num_feats = list(numerical)
        num_feats.extend(ohe_cols)
        feat_imp = eli5.explain_weights_df(search.best_estimator_.named_steps['classifier'], top=10, feature_names=num_feats)
        print(feat_imp)
        print('\n')
    # Confusion matrix and scores for the last validation fold
    evaluation(y_val, y_hat_val)
    print('Training Accuracy: {}'.format(np.mean(train_acc)))
    print('\n')
    print('Validation Accuracy: {}'.format(np.mean(test_acc)))
    print('\n')
    print('Grid Search Best Params:')
    print('\n')
    print(search.best_params_)
- Performing grid search and returning the top ten features of importance along with their weights
grid_search(RandomForestClassifier(), rand_forest_parms, KFold(), print_feat=True)
Output:

Let’s say that the Random Forest Classifier in the grid search pipeline is performing the best. The next step would be to see how the trained model performs on the hold out test data. All we need to do is create a final pipeline with GridSearchCV and fit it to the entire X_train and y_train. Then, predict on X_test.
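A detail worth knowing here: with its default refit=True, GridSearchCV refits the best parameter combination on all the data passed to fit, and its own predict method delegates to that refit model. The sketch below checks this on synthetic data; the small parameter grid and the names demo_search, X_tr, and X_te are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_search = GridSearchCV(
    Pipeline(steps=[
        ("scale", StandardScaler()),
        ("classifier", RandomForestClassifier(random_state=0)),
    ]),
    {"classifier__n_estimators": [10, 50], "classifier__max_depth": [3, 6]},
    cv=KFold(),
)
demo_search.fit(X_tr, y_tr)

# Both calls use the pipeline that was refit with the best parameters
via_search = demo_search.predict(X_te)
via_best = demo_search.best_estimator_.predict(X_te)
print(np.array_equal(via_search, via_best))  # True
```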
- Fitting final pipeline to X_train and y_train, and predicting on X_test
final_pipeline = GridSearchCV(Pipeline(steps=[
    ('preprocess', preprocess),
    ('classifier', RandomForestClassifier())
]), rand_forest_parms, cv=KFold())
# Fit and predict on train data
final_pipeline.fit(X_train, y_train)
train_pred = final_pipeline.best_estimator_.predict(X_train)
print('Evaluation on training data \n')
evaluation(y_train, train_pred)
print('\n')
# Predict on test data
test_pred = final_pipeline.best_estimator_.predict(X_test)
print('Evaluation on testing data \n')
evaluation(y_test, test_pred)
Output:

Conclusion
Pipelines keep our preprocessing steps and models encapsulated, making the machine learning workflow much easier. We can apply more than one preprocessing step if needed before fitting a model in the pipeline. The main benefit for me has been being able to come back to a project and follow the workflow I set up with pipelines. This process would take hours before I learned about pipelines. I hope this tutorial can become a helpful resource for learning the pipeline workflow.
Resources
- Mayo, M. (2017). Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction. Retrieved from https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html
- Koen, S. (2019, August 09). Architecting a Machine Learning Pipeline. Retrieved from https://towardsdatascience.com/architecting-a-machine-learning-pipeline-a847f094d1c7
- Sklearn.pipeline.Pipeline. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- M, S. (2019, December 13). What is a Pipeline in Machine Learning? How to create one? Retrieved from https://medium.com/analytics-vidhya/what-is-a-pipeline-in-machine-learning-how-to-create-one-bda91d0ceaca
- Titanic: Machine Learning from Disaster. (n.d.). Retrieved from https://www.kaggle.com/c/titanic/data
- 3.2. Tuning the hyper-parameters of an estimator. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/grid_search.html
- Overview. (n.d.). Retrieved from https://eli5.readthedocs.io/en/latest/overview.html