Pipelines & Custom Transformers in Scikit-learn

An introductory-level explanation with accompanying code snippets to follow along…

Santiago Velez Garcia
Towards Data Science


Machine Learning academic curricula tend to focus almost exclusively on the models. One may argue that the model is what performs the magic. That statement may hold some truth, but the magic only works if the data is in the right form. To make things more complicated, the ‘right form’ depends on the type of model.

Credits: https://www.freepik.com/free-vector/pipeline-brick-wall-background_3834959.htm (I liked the Mario Bros. image better… but you know: copyright)

Getting the data in the right form is what the industry calls preprocessing. It takes a large chunk of the machine learning practitioner's time. For the engineer, preprocessing and fitting, or preprocessing and predicting, are two distinct processes, but in a production environment, when we serve the model, no distinction is made. It is only data in, prediction out. Pipelines are here to do exactly that. They integrate the preprocessing steps and the fitting or predicting into a single operation. Apart from helping to make the model production-ready, they add a great deal of reproducibility to the experimental phase.

Learning Objectives

  • What is a pipeline
  • What is a transformer
  • What is a custom transformer

Resources

References

Scikit Learn. Dataset transformations

From the Scikit-learn documentation we have:

Dataset transformation …Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modeling and transforming the training data simultaneously.
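To make that contract concrete, here is a minimal sketch (toy data, not from the example below) of the three methods on Scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                   # learns mean_ and scale_ from the training data
print(scaler.transform(X_new))        # applies the learned parameters to unseen data
print(scaler.fit_transform(X_train))  # equivalent to fit(X_train) followed by transform(X_train)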

We will focus on two transformer types: custom transformers, which we write ourselves, and the standard transformers that ship with the library.

Custom transformer

Although Scikit-learn comes loaded with a set of standard transformers, we will begin with a custom one to understand what they do and how they work. The first thing to remember is that a custom transformer is both an estimator and a transformer, so we will create a class that inherits from both BaseEstimator and TransformerMixin. It is good practice to initialize it with super().__init__(). By inheriting, we get standard methods such as get_params and set_params for free. In __init__, we also want to create the model parameter or parameters we want to learn.

# Custom transformer that standardizes numerical features
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        # model parameters to be learned in fit
        self.means_ = None
        self.std_ = None

    def fit(self, X, y=None):
        # learn the column means and standard deviations from the training data
        X = X.to_numpy()
        self.means_ = X.mean(axis=0, keepdims=True)
        self.std_ = X.std(axis=0, keepdims=True)
        return self

    def transform(self, X, y=None):
        # apply the parameters learned in fit to (possibly unseen) data
        X[:] = (X.to_numpy() - self.means_) / self.std_
        return X

The fit method is where “learning” takes place. Here we perform, on the training data, the operation that yields the model parameters.

In the transform method, we apply the parameters learned in fit to unseen data. Bear in mind that the preprocessing will be part of the whole model, so during training, fit and transform are applied to the same dataset. Later, when you use the trained model, you only apply the transform method, with the parameters learned by fit on the training dataset, to unseen data.

It is key that the learned parameters, and hence the transformer operation, are the same regardless of the data they are applied to.
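As a quick sanity check (a hypothetical usage snippet, not part of the original gist), fitting on one DataFrame and transforming another uses exactly the statistics learned from the first:

import pandas as pd

train_df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
test_df = pd.DataFrame({'a': [4.0], 'b': [40.0]})

scaler = CustomScaler()
scaler.fit(train_df)                     # means_ and std_ come from train_df only
scaled_test = scaler.transform(test_df)  # test_df is scaled with the training statistics
print(scaler.means_, scaler.std_)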

Standard Transformers

Scikit-learn comes with a variety of standard transformers out of the box. Given their almost unavoidable use, you should be familiar with Standardization, or mean removal and variance scaling, and with SimpleImputer for numerical data, as well as with Encoding categorical features for categorical data, especially one-of-K, also known as one-hot encoding.
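A minimal sketch of those three standard transformers in isolation (toy data, illustrative only):

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

num = np.array([[1.0], [2.0], [np.nan], [4.0]])
cat = np.array([['male'], ['female'], ['female'], ['male']])

imputed = SimpleImputer(strategy='mean').fit_transform(num)  # fills the NaN with the column mean
scaled = StandardScaler().fit_transform(imputed)             # zero mean, unit variance
encoded = OneHotEncoder(sparse=False).fit_transform(cat)     # one binary column per category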

The pipeline

Chaining estimators

Remember that transformers are estimators, but so is your model (logistic regression, random forest, etc.). Think of chaining as stacking steps vertically. Here order matters, so you want to put the preprocessing before the model. The key is that each step's output is the next step's input.
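A minimal sketch of chaining (the step names and the choice of LogisticRegression are illustrative, not from the Titanic example):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# each step's output feeds the next step; the final step is the model
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())])

# pipe.fit(X_train, y_train) fits every step in order;
# pipe.predict(X_new) transforms and then predicts in a single call.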

FeatureUnion: composite feature spaces

Often you want to apply different transformations to different subsets of your features. The required transformations for numerical and categorical data are different. It is as if you had two parallel paths, or as if the steps were horizontally stacked.

The input to the parallel paths is the same. So each path's transform method has to begin by selecting the features relevant to its transformation (for example, numerical features or categorical features).
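Here is a minimal sketch of the idea (the ColumnSelector helper is a hypothetical placeholder; the Titanic example below implements real selector transformers):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    # keeps only the listed DataFrame columns
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.columns].values

numerical_branch = Pipeline(steps=[
    ('select', ColumnSelector(['Age', 'Fare'])),
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())])

categorical_branch = Pipeline(steps=[
    ('select', ColumnSelector(['Sex'])),
    ('encode', OneHotEncoder(sparse=False))])

# both branches see the same input; their outputs are concatenated column-wise
features = FeatureUnion(transformer_list=[
    ('numerical', numerical_branch),
    ('categorical', categorical_branch)])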

Example

We will build the preprocessing pipeline for Kaggle’s Titanic Dataset. You can find Kaggle’s tutorial here.

Credits: https://commons.wikimedia.org/wiki/RMS_Titanic#/media/File:Titanic_in_color.png

For our work, you can follow the steps in the provided gists below (open them in a new tab and follow along). They contain all the code. We will break it apart for better understanding.

Now let us begin. After unzipping the file and loading the data, perform a quick exploration.

# loading and exploration
import numpy as np
import pandas as pd

filename = '/content/working_directory/train.csv'
raw_train = pd.read_csv(filename)
print('data set shape: ', raw_train.shape, '\n')
print(raw_train.head())
data set shape: (891, 12)

PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S

[5 rows x 12 columns]

Now after dropping the features we will not be using (PassengerId, Name, Ticket, Cabin, Embarked) and separating the labels (Survived), six (6) features remain, namely: Pclass, Sex, Age, SibSp, Parch and Fare.

dr = ['PassengerId','Name','Ticket','Cabin','Embarked']
train = raw_train.drop(labels = dr, axis = 1)

X = train.drop('Survived', axis=1)
y = train['Survived'].values
print('data set shape: ', X.shape, '\n')
print(X.head())
print(X.describe())
data set shape: (891, 6)

Pclass Sex Age SibSp Parch Fare
0 3 male 22.0 1 0 7.2500
1 1 female 38.0 1 0 71.2833
2 3 female 26.0 0 0 7.9250
3 1 female 35.0 1 0 53.1000
4 3 male 35.0 0 0 8.0500
Pclass Age SibSp Parch Fare
count 891.000000 714.000000 891.000000 891.000000 891.000000
mean 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.420000 0.000000 0.000000 0.000000
25% 2.000000 20.125000 0.000000 0.000000 7.910400
50% 3.000000 28.000000 0.000000 0.000000 14.454200
75% 3.000000 38.000000 1.000000 0.000000 31.000000
max 3.000000 80.000000 8.000000 6.000000 512.329200

Notice that there are both numerical (‘Pclass’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’) and categorical (‘Sex’) features, whose preprocessing will differ. Notice as well that not all passengers’ Age values are available.

# count missing values
X.isna().sum()
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64

Custom imputer

Age is presumably a key feature for predicting the chances of survival. Therefore, for a model to perform adequately, we need to fill in the missing values. One alternative is to use the dataset’s mean age. But there is a correlation between Sex, Pclass, and Age: on average, men are older than women, and passengers in the upper classes are older than passengers in the lower classes. We can use that to come up with a better replacement value than just the general average. We will use the mean age of the group given by Sex and Pclass. Notice that we use two categorical features (Pclass and Sex) to group the data points and fill in missing values for a numerical feature (Age).

# Custom Transformer that fills missing ages
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()
        self.age_means_ = {}

    def fit(self, X, y=None):
        # learn the mean age of each (Pclass, Sex) group from the training data
        self.age_means_ = X.groupby(['Pclass', 'Sex']).Age.mean()
        return self

    def transform(self, X, y=None):
        # fill each missing Age with the mean of the passenger's (Pclass, Sex) group
        for key, value in self.age_means_.items():
            X.loc[(np.isnan(X['Age'])) & (X.Pclass == key[0]) & (X.Sex == key[1]), 'Age'] = value
        return X
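As a quick check (a hypothetical snippet, not in the original gist), fit_transform, which TransformerMixin gives us for free, should leave no missing Age values behind:

imputer = CustomImputer()
X_filled = imputer.fit_transform(X.copy())  # work on a copy, since transform fills Age in place
print(X_filled.isna().sum())                # Age should now show 0 missing values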

Numerical features pipeline

After selecting the appropriate features, we will apply a SimpleImputer and a StandardScaler. The previously presented CustomScaler performs the same operation as the prebuilt Scikit-learn StandardScaler.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

class NumericalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Numerical features to pass down the numerical pipeline
        X = X[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
        X = X.replace([np.inf, -np.inf], np.nan)
        return X.values

# Defining the steps in the numerical pipeline
numerical_pipeline = Pipeline(steps=[
    ('num_transformer', NumericalTransformer()),
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())])

Categorical features pipeline

After selecting the appropriate feature (Sex), we will perform one-hot encoding via the prebuilt transformer OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder

class CategoricalTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    # Helper function that converts values to binary depending on input
    def create_binary(self, obj):
        if obj == 0:
            return 'No'
        else:
            return 'Yes'

    # Transformer method for this transformer
    def transform(self, X, y=None):
        # Categorical features to pass down the categorical pipeline
        return X[['Sex']].values

# Defining the steps in the categorical pipeline
# (newer Scikit-learn versions use sparse_output=False instead of sparse=False)
categorical_pipeline = Pipeline(steps=[
    ('cat_transformer', CategoricalTransformer()),
    ('one_hot_encoder', OneHotEncoder(sparse=False))])

Horizontal stacking

The categorical and numerical pipelines run in parallel but independently. They have the same input but produce separate outputs that we will rejoin. To rejoin them, we use FeatureUnion.

# Combining the numerical and categorical pipelines into one big pipeline, horizontally,
# using FeatureUnion
from sklearn.pipeline import FeatureUnion

union_pipeline = FeatureUnion(transformer_list=[
    ('categorical_pipeline', categorical_pipeline),
    ('numerical_pipeline', numerical_pipeline)])

Vertical stacking

Since our custom imputer (where we fill in the missing Age values) needs both categorical and numerical features, it comes before the parallel pipelines, which are now joined together as the preprocessing pipeline. For this, we use Scikit-learn’s Pipeline.

# Combining the custom imputer with the categorical and numerical pipelines
preprocess_pipeline = Pipeline(steps=[
    ('custom_imputer', CustomImputer()),
    ('full_pipeline', union_pipeline)])
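As a sanity check (a hypothetical snippet, not in the original gist), the preprocessing pipeline alone already turns the DataFrame into a purely numerical array:

prepared = preprocess_pipeline.fit_transform(X.copy())
print(prepared.shape)  # should print (891, 7): 2 one-hot columns for Sex + 5 scaled numerical columns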

The model

We will use Scikit-learn’s DecisionTreeClassifier. Here the focus is not the model; it is rather the excuse to see the transformers and pipelines in action. The DecisionTreeClassifier is yet another estimator that we stack after our preprocessing pipeline.

To see everything in action, we will call fit on the full_pipeline, that is, preprocessing and model, and later predict.

# MODEL
from sklearn import tree

# Decision Tree
decision_tree = tree.DecisionTreeClassifier()

# define full pipeline --> preprocessing + model
full_pipeline = Pipeline(steps=[
    ('preprocess_pipeline', preprocess_pipeline),
    ('model', decision_tree)])

# fit on the complete pipeline
training = full_pipeline.fit(X, y)
print(full_pipeline.get_params())

# metrics
score_test = round(training.score(X, y) * 100, 2)
print(f"\nTraining Accuracy: {score_test}")

And finally the prediction part:

# Prediction

my_data = X.iloc[[77]]
y_pred = full_pipeline.predict(my_data)  # avoid overwriting the label vector y
print(my_data, y_pred)
Pclass Sex Age SibSp Parch Fare
77 3 male -0.211777 0 0 8.05 [0]

Closing

This very short tour of transformers and pipelines in Scikit-learn should have given you the tools to integrate, in a production-ready and reproducible manner, the preprocessing phase into your machine learning models. Hope you enjoyed it. Happy coding!
