
Building Custom Column Transformers in a Pipeline

Add custom transformations to a pre-existing pipeline.

Photo by Victor on Unsplash

When Standard Transformations Aren’t Enough

The standard scikit-learn library offers a lot of different functionality. Many common functions to transform, modify, and alter your data are already present and readily compatible with pipelines.

This post discusses how to construct custom transformers for a pipeline. Several code blocks are available to copy. The final transformer exposes parameters that can be tuned with the hyperparameter optimization method of your choice.

Scikit-learn does not cover everything a data scientist needs, and most problems require some customization for the issue at hand. Fortunately, custom transformers exist for precisely this purpose.

The immediate question you may have is why custom transformations are needed as part of a pipeline. After all, you could transform your data manually before pushing it through a pipeline. However, there are several benefits to incorporating transformations into the pipeline itself.

  • A pipeline is repeatable and scalable. Every step from raw data to model input is defined in one place, so the same transformations can be reproduced on another machine from the same raw data. This also makes it easier to spread the workload across multiple machines.
  • Parameters within the pipeline are optimizable. Not every transformation takes parameters, but some do. For example, suppose you want to test whether the squared value or the cubed value of a feature performs better in your model. The exponent can then be parameterized and tuned to find the best configuration.

What is Required to Make a Custom Transformer

There are several considerations when creating a custom transformation. The first is that the transformer should be defined as a class. This design creates the framework for easy incorporation into a pipeline.

The class inherits from the BaseEstimator and TransformerMixin classes. These classes provide some base functionality for the custom transformation class and ensure compatibility with the sklearn pipeline. The next step is to define three base methods: an __init__() method, a fit() method, and a transform() method. The __init__() method initializes the class and stores its parameters. The fit() method is called when the pipeline is fit. The transform() method is called while the pipeline is fit, to pass the transformed data on to the next step, and again whenever transform or predict is called on the pipeline.
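
To make the call order concrete, here is a minimal sketch. The CallLogger class, its label parameter, and the toy data are hypothetical and exist only for illustration; the transformer passes data through unchanged and records which of its methods the pipeline calls. Appending to a fitted attribute inside transform() is done here purely for demonstration and is not something a real transformer should do.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class CallLogger(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through transformer that records which methods run."""

    def __init__(self, label="demo"):
        # Constructor arguments are stored unchanged under the same name.
        self.label = label

    def fit(self, X, y=None):
        # Attributes set during fit conventionally end with an underscore.
        self.calls_ = ["fit"]
        return self

    def transform(self, X):
        # Demonstration only: log the call, then return the data untouched.
        self.calls_.append("transform")
        return X


# Toy data, made up for this sketch.
X_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]})
y_demo = np.array([0, 0, 1, 1])

demo_pipe = Pipeline(steps=[
    ("logger", CallLogger()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Fitting the pipeline runs fit() and then transform() on the logger step.
demo_pipe.fit(X_demo, y_demo)
calls_after_fit = list(demo_pipe.named_steps["logger"].calls_)

# Predicting runs transform() again, but not fit().
demo_pipe.predict(X_demo)
calls_after_predict = list(demo_pipe.named_steps["logger"].calls_)

print(calls_after_fit)
print(calls_after_predict)
```

In this sketch, the first list contains 'fit' followed by 'transform', and the second list gains one more 'transform' entry from the predict call.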


Problem Construction

The experiments for the custom transformer use the breast cancer dataset. The code and transformations can easily be adjusted to fit the needs of your problem. The data is split into training and test sets, and ROC AUC is used as the evaluation metric for the classification problem.

The entire pipeline, transformer included, then goes through a randomized hyperparameter search. The search allows the parameters within the transformation to be optimized alongside the model parameters, using the hyperparameter search of your choice. For more detail regarding the different hyperparameter searches, please refer to my post on the subject:

Hyperparameter Tuning – Always Tune your Models

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint

# Fraction of the data held out for testing, and the seed for the split
TEST_SIZE = 0.1
RANDOM_STATE = 10

# Load the breast cancer dataset into a DataFrame with named feature columns
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop(['target'], axis=1)
y = df['target'].astype(float)

# Stratified split keeps the class balance the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)

Defining the Custom Transform

The CustomTransformer class defined in the code block below has several modifications to allow for a high degree of flexibility. Three different transformations are applied: power, logarithmic, and root. The power transformation requires that the degree of the exponent be specified.

For each of these transformations, the original feature column is maintained. This construction has potential problems, as the transformed features and the original feature will be strongly dependent on one another. Alternatively, you can adjust the code to overwrite the original column with a single transformation to eliminate these dependencies.
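
The overwrite alternative could look something like the sketch below. The InPlacePowerTransformer name and the toy DataFrame are hypothetical, and the power transformation is just one of the three options chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class InPlacePowerTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: replace each listed column with its power transform."""

    def __init__(self, feature_names, power=2):
        self.feature_names = feature_names
        self.power = power

    def fit(self, X, y=None):
        # Nothing is learned from the data.
        return self

    def transform(self, X):
        # Copy first so the caller's DataFrame is not modified.
        X_copy = X.copy()
        for feat in self.feature_names:
            # Overwrite the column instead of appending a new one.
            X_copy[feat] = np.power(X_copy[feat], self.power)
        return X_copy


# Toy data, made up for this sketch.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
replaced = InPlacePowerTransformer(feature_names=["a"], power=2).fit_transform(toy)
print(replaced)
```

The number of columns stays the same, so the model never sees both a feature and its transformed copy. The trade-off is that the search can no longer let the model choose between the raw and the transformed version of a feature.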

The columns to transform are determined via the ‘feature_names’ parameter in the transformer. The exponent for the power transformation is determined with the ‘power’ parameter. It is crucial to define the parameters you want to optimize as arguments of __init__() and to store them as attributes of the class under the same names. Only parameters exposed this way can be modified during a hyperparameter search and subsequently optimized. For additional details about different pipeline parameter definitions, refer to my post on a complete sklearn pipeline:

Automated Machine Learning with Sklearn Pipelines

from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    # 'feature_names' lists the columns to transform; 'power' is the exponent of the power transformation
    def __init__(self, feature_names, power):
        self.feature_names = feature_names
        self.power = power

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit only returns the instance
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is left untouched
        X_copy = X.copy()
        for feat in self.feature_names:
            X_copy[feat + '_power' + str(self.power)] = self.power_transformation(X_copy[feat])
            X_copy[feat + '_log'] = self.log_transformation(X_copy[feat])
            X_copy[feat + '_root'] = self.root_transformation(X_copy[feat])
        return X_copy

    def power_transformation(self, x_col):
        return np.power(x_col, self.power)

    def log_transformation(self, x_col):
        # The small offset avoids log(0) for zero-valued entries
        return np.log(x_col + 0.0001)

    def root_transformation(self, x_col):
        # Square root; assumes non-negative feature values
        return np.sqrt(x_col)
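
The reason parameters must be defined in the constructor can be seen with get_params() and set_params(), the two BaseEstimator methods a hyperparameter search relies on. The sketch below uses a hypothetical, stripped-down PowerOnly transformer so that it runs on its own.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class PowerOnly(BaseEstimator, TransformerMixin):
    """Hypothetical transformer with a single tunable constructor argument."""

    def __init__(self, power=2):
        self.power = power

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X ** self.power


step = PowerOnly(power=2)

# get_params() reports the constructor arguments and their current values.
params_before = step.get_params()

# A search calls set_params() with each candidate value before fitting.
step.set_params(power=3)
params_after = step.get_params()

# Names that are not constructor arguments are rejected with a ValueError.
try:
    step.set_params(exponent=3)
    rejected = False
except ValueError:
    rejected = True

print(params_before, params_after, rejected)
```

Inside a pipeline, the same mechanism is reached through the step name followed by a double underscore and the parameter name, which is the naming scheme used by the search dictionary keys in this post.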

Using the Transformer in a Pipeline

Several meta-parameters are initialized for the search. These include the number of search iterations, the number of folds, the scoring metric, the verbosity level, and the number of jobs for parallelization. Feel free to update these as needed; they are initialized to run with limited computational power. The pipeline is set up with a decision tree classifier and several hyperparameter distributions for the decision tree.

Additionally, I’ve included parameters for the custom transformer. First is a random distribution for the power parameter, as previously discussed. Next is a list of lists of feature names. This setup allows for transformations on subsets of features rather than all features. Since the optimization technique is a randomized hyperparameter search, one of these three feature lists is selected at random for each candidate configuration.
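
To get a feel for what a randomized search draws from such a dictionary, the sketch below samples a few candidates with sklearn's ParameterSampler. The shortened demo_space dictionary is an assumption made for brevity. Note that scipy's randint(1, 4) excludes the upper bound, so the sampled power is 1, 2, or 3.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Shortened, illustrative search space (two feature lists instead of three).
demo_space = {
    "custom_transformer__power": randint(1, 4),  # integers 1, 2, or 3
    "custom_transformer__feature_names": [
        ["mean texture", "mean perimeter", "mean area"],
        ["smoothness error", "compactness error", "concavity error"],
    ],
}

# Each sample is one candidate configuration for the pipeline.
samples = list(ParameterSampler(demo_space, n_iter=5, random_state=0))
for sample in samples:
    print(
        sample["custom_transformer__power"],
        sample["custom_transformer__feature_names"],
    )
```

Each candidate pairs one sampled power with one whole feature list; the lists themselves are never mixed or split.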

# Search settings, kept small to run with limited computational power
N_ITER = 5
K_FOLDS = 5
SCORING_METRIC = 'roc_auc'
VERBOSE = 1
N_JOBS = 1

# Keys follow the '<step name>__<parameter name>' convention
hyperparameter_dict = {
    # randint(1, 4) draws the integers 1, 2, or 3 (the upper bound is exclusive)
    'custom_transformer__power': randint(1,4),
    # One of these feature lists is picked at random for each candidate
    'custom_transformer__feature_names': [
        ['mean texture', 'mean perimeter', 'mean area'],
        ['smoothness error', 'compactness error', 'concavity error'],
        ['worst concave points', 'worst symmetry', 'worst fractal dimension'],
    ],
    'model__max_depth': randint(3,10),
    'model__max_features': ['sqrt'],
    'model__min_samples_split': randint(2,20),
}

# Initial transformer values; the search overrides feature_names and power for each candidate
pipe = Pipeline(steps=[
    ('custom_transformer', CustomTransformer(feature_names=list(X.columns), power=2)),
    ('model', DecisionTreeClassifier())
])
optimal_model = RandomizedSearchCV(
    pipe, hyperparameter_dict, n_iter = N_ITER, cv=K_FOLDS,
    scoring=SCORING_METRIC, n_jobs = N_JOBS,
    return_train_score=True, verbose = VERBOSE
)
optimal_model.fit(X_train, y_train)

# ROC AUC of the best pipeline on the held-out test set, and its parameters
print(
    optimal_model.score(X_test, y_test),
    optimal_model.best_params_
)

Conclusion

This post discussed how to create a custom transformer class that is compatible with sklearn pipelines. Moreover, the transformer exposes parameters that can be tuned during a hyperparameter search.

Ultimately, there is a lot of functionality already baked into the standard sklearn package. However, it certainly does not cover everything. Custom transformers add flexibility and custom functionality to an existing pipeline.


If you’re interested in reading articles about novel Data Science tools and understanding machine learning algorithms, consider following me on Medium.

If you’re interested in my writing and want to support me directly, please subscribe through the following link. This link ensures that I will receive a portion of your membership fees.

Join Medium with my referral link – Zachary Warnes

