Pipelines & Custom Transformers in Scikit-learn

An introductory-level explanation with accompanying code snippets to follow along…

Santiago Velez Garcia
Towards Data Science


Machine Learning academic curricula tend to focus almost exclusively on the models. One may argue that the model is what performs the magic. That statement may hold some truth, but the magic only works if the data is in the right form. To make things more complicated, the ‘right form’ depends on the type of model.

Credits: https://www.freepik.com/free-vector/pipeline-brick-wall-background_3834959.htm (I liked the Mario Bros. image better… but you know: copyright)

Getting the data in the right form is what the industry calls preprocessing. It takes a large chunk of the machine learning practitioner's time. For the engineer, preprocessing and fitting, or preprocessing and predicting, are two distinct processes, but in a production environment, when we serve the model, no distinction is made. It is only data in, prediction out. Pipelines are here to do exactly that. They integrate the preprocessing steps and the fitting or predicting into a single operation. Apart from helping to make the model production-ready, they add a great deal of reproducibility to the experimental phase.

Learning Objectives

  • What is a pipeline
  • What is a transformer
  • What is a custom transformer

Resources

References

Scikit Learn. Dataset transformations

From the Scikit-learn documentation we have:

Dataset transformation …Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modeling and transforming the training data simultaneously.
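To make that contract concrete, here is a minimal sketch (toy data, not from the example below) of the three methods on Scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                   # learns mean_ and scale_ from the training data
print(scaler.transform(X_new))        # applies the learned parameters to unseen data
print(scaler.fit_transform(X_train))  # equivalent to fit(X_train) followed by transform(X_train)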

We will focus on two transformer types: custom transformers, which we write ourselves, and the standard transformers that ship with the library.

Custom transformer

Although Scikit-learn comes loaded with a set of standard transformers, we will begin with a custom one to understand what they do and how they work. The first thing to remember is that a custom transformer is both an estimator and a transformer, so we will create a class that inherits from both BaseEstimator and TransformerMixin. It is good practice to initialize it with super().__init__(). By inheriting, we get standard methods such as get_params and set_params for free. In __init__, we also want to create the model parameter or parameters we want to learn.

# Custom transformer that standardizes numerical features
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        # model parameters to be learned in fit
        self.means_ = None
        self.std_ = None

    def fit(self, X, y=None):
        # learn the column means and standard deviations from the training data
        X = X.to_numpy()
        self.means_ = X.mean(axis=0, keepdims=True)
        self.std_ = X.std(axis=0, keepdims=True)
        return self

    def transform(self, X, y=None):
        # apply the parameters learned in fit to (possibly unseen) data
        X[:] = (X.to_numpy() - self.means_) / self.std_
        return X

The fit method is where “learning” takes place. Here we perform, on the training data, the operation that yields the model parameters.

In the transform method, we apply the parameters learned in fit to unseen data. Bear in mind that the preprocessing will be part of the whole model, so during training, fit and transform are applied to the same dataset. Later, when you use the trained model, you only apply the transform method, with the parameters learned by fit on the training dataset, to unseen data.

It is key that the learned parameters, and hence the transformer operation, are the same regardless of the data they are applied to.
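As a quick sanity check (a hypothetical usage snippet, not part of the original gist), fitting on one DataFrame and transforming another uses exactly the statistics learned from the first:

import pandas as pd

train_df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
test_df = pd.DataFrame({'a': [4.0], 'b': [40.0]})

scaler = CustomScaler()
scaler.fit(train_df)                     # means_ and std_ come from train_df only
scaled_test = scaler.transform(test_df)  # test_df is scaled with the training statistics
print(scaler.means_, scaler.std_)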

Standard Transformers

Scikit-learn comes with a variety of standard transformers out of the box. Given their almost unavoidable use, you should be familiar with Standardization, or mean removal and variance scaling, and with SimpleImputer for numerical data, as well as with Encoding categorical features for categorical data, especially one-of-K, also known as one-hot encoding.
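A minimal sketch of those three standard transformers in isolation (toy data, illustrative only):

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num = np.array([[1.0], [2.0], [np.nan], [4.0]])
cat = np.array([['male'], ['female'], ['female'], ['male']])

imputed = SimpleImputer(strategy='mean').fit_transform(num)  # fills the NaN with the column mean
scaled = StandardScaler().fit_transform(imputed)             # zero mean, unit variance
encoded = OneHotEncoder(sparse=False).fit_transform(cat)     # one binary column per category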

The pipeline

Chaining estimators

Remember that transformers are estimators, but so is your model (logistic regression, random forest, etc.). Think of chaining as stacking steps vertically. Here order matters, so you want to put the preprocessing before the model. The key is that each step's output is the next step's input.
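A minimal sketch of chaining (the step names and the choice of LogisticRegression are illustrative, not from the Titanic example):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# each step's output feeds the next step; the final step is the model
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())])

# pipe.fit(X_train, y_train) fits every step in order;
# pipe.predict(X_new) transforms and then predicts in a single call.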

FeatureUnion: composite feature spaces

Often you want to apply different transformations to different subsets of your features. The required transformations for numerical and categorical data are different. It is as if you had two parallel paths, or as if the steps were horizontally stacked.

The input to the parallel paths is the same. So each path's transform method has to begin by selecting the features relevant to its transformation (for example, numerical features or categorical features).
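Here is a minimal sketch of the idea (the ColumnSelector helper is a hypothetical placeholder; the Titanic example below implements real selector transformers):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    # keeps only the listed DataFrame columns
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.columns].values

numerical_branch = Pipeline(steps=[
    ('select', ColumnSelector(['Age', 'Fare'])),
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())])

categorical_branch = Pipeline(steps=[
    ('select', ColumnSelector(['Sex'])),
    ('encode', OneHotEncoder(sparse=False))])

# both branches see the same input; their outputs are concatenated column-wise
features = FeatureUnion(transformer_list=[
    ('numerical', numerical_branch),
    ('categorical', categorical_branch)])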

Example

We will build the preprocessing pipeline for Kaggle’s Titanic Dataset. You can find Kaggle’s tutorial here.

Credits: https://commons.wikimedia.org/wiki/RMS_Titanic#/media/File:Titanic_in_color.png

For our work, you can follow the steps in the provided gists below (open them in a new tab and follow along). They contain all the code. We will break it apart for better understanding.

Now let us begin. After unzipping the file and loading the data, perform a quick exploration.

# loading and exploration
import numpy as np
import pandas as pd

filename = '/content/working_directory/train.csv'
raw_train = pd.read_csv(filename)
print('data set shape: ', raw_train.shape, '\n')
print(raw_train.head())
data set shape: (891, 12)

PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S

[5 rows x 12 columns]

Now after dropping the features we will not be using (PassengerId, Name, Ticket, Cabin, Embarked) and separating the labels (Survived), six (6) features remain, namely: Pclass, Sex, Age, SibSp, Parch and Fare.

dr = ['PassengerId','Name','Ticket','Cabin','Embarked']
train = raw_train.drop(labels = dr, axis = 1)

X = train.drop('Survived', axis=1)
y = train['Survived'].values
print('data set shape: ', X.shape, '\n')
print(X.head())
print(X.describe())
data set shape: (891, 6)

Pclass Sex Age SibSp Parch Fare
0 3 male 22.0 1 0 7.2500
1 1 female 38.0 1 0 71.2833
2 3 female 26.0 0 0 7.9250
3 1 female 35.0 1 0 53.1000
4 3 male 35.0 0 0 8.0500
Pclass Age SibSp Parch Fare
count 891.000000 714.000000 891.000000 891.000000 891.000000
mean 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.420000 0.000000 0.000000 0.000000
25% 2.000000 20.125000 0.000000 0.000000 7.910400
50% 3.000000 28.000000 0.000000 0.000000 14.454200
75% 3.000000 38.000000 1.000000 0.000000 31.000000
max 3.000000 80.000000 8.000000 6.000000 512.329200

Notice that there are both numerical (‘Pclass’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’) and categorical (‘Sex’) features, whose preprocessing will differ. Notice as well that not all passengers’ Age values are available.

# count missing values
X.isna().sum()
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64

Custom imputer

Age is presumably a key feature for predicting the chances of survival. Therefore, for a model to perform adequately, we need to fill in the missing values. One alternative is to use the dataset’s mean age. But there is a correlation between Sex, Pclass, and Age: on average, men are older than women, and passengers in the upper classes are older than passengers in the lower classes. We can use that to come up with a better replacement value than just the general average. We will use the mean age of the group given by Sex and Pclass. Notice that we use two categorical features (Pclass and Sex) to group the data points and fill in missing values for a numerical feature (Age).

# Custom Transformer that fills missing ages
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        self.age_means_ = {}

    def fit(self, X, y=None):
        # learn the mean age of each (Pclass, Sex) group from the training data
        self.age_means_ = X.groupby(['Pclass', 'Sex']).Age.mean()
        return self

    def transform(self, X, y=None):
        # fill each missing Age with the mean of the passenger's (Pclass, Sex) group
        for key, value in self.age_means_.items():
            X.loc[(np.isnan(X['Age'])) & (X.Pclass == key[0]) & (X.Sex == key[1]), 'Age'] = value
        return X
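As a quick check (a hypothetical snippet, not in the original gist), fit_transform, which TransformerMixin gives us for free, should leave no missing Age values behind:

imputer = CustomImputer()
X_filled = imputer.fit_transform(X.copy())  # work on a copy, since transform fills Age in place
print(X_filled.isna().sum())                # Age should now show 0 missing values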

Numerical features pipeline

After selecting the appropriate features, we will apply a SimpleImputer and a StandardScaler. The previously presented CustomScaler performs the same operation as the prebuilt Scikit-learn StandardScaler.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

class NumericalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Numerical features to pass down the numerical pipeline
        X = X[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
        X = X.replace([np.inf, -np.inf], np.nan)
        return X.values

# Defining the steps in the numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('num_transformer', NumericalTransformer()),
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())])

Categorical features pipeline

After selecting the appropriate feature (Sex), we will perform one-hot encoding via the prebuilt transformer OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder

class CategoricalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    # Helper function that converts values to binary depending on input
    def create_binary(self, obj):
        if obj == 0:
            return 'No'
        else:
            return 'Yes'

    # Transformer method for this transformer
    def transform(self, X, y=None):
        # Categorical features to pass down the categorical pipeline
        return X[['Sex']].values

# Defining the steps in the categorical pipeline
# (newer Scikit-learn versions use sparse_output=False instead of sparse=False)
categorical_pipeline = Pipeline(steps=[
    ('cat_transformer', CategoricalTransformer()),
    ('one_hot_encoder', OneHotEncoder(sparse=False))])

Horizontal stacking

The categorical and numerical pipelines run in parallel but independently. They have the same input but produce separate outputs that we will rejoin. To rejoin them, we use FeatureUnion.

# Combining the numerical and categorical pipelines into one big pipeline, horizontally,
# using FeatureUnion
from sklearn.pipeline import FeatureUnion

union_pipeline = FeatureUnion(transformer_list=[
    ('categorical_pipeline', categorical_pipeline),
    ('numerical_pipeline', numerical_pipeline)])

Vertical stacking

Since our custom imputer (where we fill in the missing Age values) needs both categorical and numerical features, it comes before the parallel pipelines, which are now joined together as the preprocessing pipeline. For this, we use Scikit-learn’s Pipeline.

# Combining the custom imputer with the categorical and numerical pipelines
preprocess_pipeline = Pipeline(steps=[
    ('custom_imputer', CustomImputer()),
    ('full_pipeline', union_pipeline)])
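As a sanity check (a hypothetical snippet, not in the original gist), the preprocessing pipeline alone already turns the DataFrame into a purely numerical array:

prepared = preprocess_pipeline.fit_transform(X.copy())
print(prepared.shape)  # should print (891, 7): 2 one-hot columns for Sex + 5 scaled numerical columns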

The model

We will use Scikit-learn’s DecisionTreeClassifier. Here the focus is not the model; it is rather the excuse to see the transformers and pipelines in action. The DecisionTreeClassifier is yet another estimator that we stack after our preprocessing pipeline.

To see everything in action, we will call fit on the full_pipeline, that is, preprocessing and model, and later predict.

# MODEL
from sklearn import tree

# Decision Tree
decision_tree = tree.DecisionTreeClassifier()

# define full pipeline --> preprocessing + model
full_pipeline = Pipeline(steps=[
    ('preprocess_pipeline', preprocess_pipeline),
    ('model', decision_tree)])

# fit on the complete pipeline
training = full_pipeline.fit(X, y)
print(full_pipeline.get_params())

# metrics
score_test = round(training.score(X, y) * 100, 2)
print(f"\nTraining Accuracy: {score_test}")

And finally the prediction part:

# Prediction

my_data = X.iloc[[77]]
y_pred = full_pipeline.predict(my_data)  # avoid overwriting the label vector y
print(my_data, y_pred)
Pclass Sex Age SibSp Parch Fare
77 3 male -0.211777 0 0 8.05 [0]

Closing

This very short tour of transformers and pipelines in Scikit-learn should have given you the tools to integrate, in a production-ready and reproducible manner, the preprocessing phase into your machine learning models. Hope you enjoyed it. Happy coding!
