
Creating Custom Transformers with Scikit-Learn

This article discusses two methods of creating custom transformers with Scikit-Learn and how to use them with Pipeline and GridSearchCV.

Photo by Arseny Togulev on Unsplash

Transformers are classes that enable data transformations while preprocessing the data for machine learning. Examples of transformers in Scikit-Learn include SimpleImputer, MinMaxScaler, OrdinalEncoder and PowerTransformer, to name a few. At times, we may need to perform data transformations that are not predefined in popular Python packages. In such cases, custom transformers come to the rescue. In this article, we’ll discuss two methods of defining custom transformers in Python using Scikit-Learn. We’ll use the Iris dataset from Scikit-Learn and define a custom transformer for outlier removal using the IQR method.

Method 1

This method defines a custom transformer by inheriting from the BaseEstimator and TransformerMixin classes of Scikit-Learn. The ‘BaseEstimator’ class enables hyperparameter tuning by adding the ‘get_params’ and ‘set_params’ methods, while the ‘TransformerMixin’ class adds the ‘fit_transform’ method without our having to define it explicitly. In the below code snippet, we’ll import the required packages and the dataset and define the ‘OutlierRemover’ class.
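The code itself appeared as an image in the original post. The sketch below is a minimal reconstruction of it, assuming the Iris data is loaded through Scikit-Learn’s load_iris and the columns are renamed to the names used later in the article (‘SepalLengthCm’, ‘SepalWidthCm’, ‘PetalLengthCm’, ‘PetalWidthCm’).

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris

# load the Iris dataset and rename the columns (assumed names)
iris = load_iris(as_frame=True)
X = iris.data.copy()
X.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
y = iris.target

class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, factor=1.5):
        # 'factor' scales the IQR; larger values flag only more extreme outliers
        self.factor = factor

    def outlier_removal(self, X, y=None):
        # replace values outside [Q1 - factor*IQR, Q3 + factor*IQR] with NaN
        X = pd.Series(X).copy()
        q1 = X.quantile(0.25)
        q3 = X.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - (self.factor * iqr)
        upper_bound = q3 + (self.factor * iqr)
        X.loc[(X < lower_bound) | (X > upper_bound)] = np.nan
        return X

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # apply the outlier removal column by column
        return pd.DataFrame(X).apply(self.outlier_removal)

outlier_remover = OutlierRemover()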


In the above code snippet, we’ve defined a class named ‘OutlierRemover’, which is our custom transformer for removing outliers, i.e. replacing them with NaN. The class has an attribute named ‘factor’, a hyperparameter that controls the outlier removal process: the higher the ‘factor’, the more extreme a value must be before it is treated as an outlier. By default, ‘factor’ is initialized to 1.5. The class has three methods, namely ‘outlier_removal’, ‘fit’ and ‘transform’. Inheriting the BaseEstimator and TransformerMixin classes adds three more methods: ‘fit_transform’, ‘get_params’ and ‘set_params’. We’ve also created an instance of ‘OutlierRemover’ named ‘outlier_remover’.

‘__init__’ is the first method called when an instance of the class is created and is used to initialize the class attributes; here we initialize the ‘factor’ of the IQR method to 1.5 (the default value). The ‘outlier_removal’ method replaces the outliers in a Series with NaN. The ‘fit’ method always returns self. The ‘transform’ method takes an array/data frame as input, applies ‘outlier_removal’ to each of its columns and returns the result.


To try out ‘OutlierRemover’, we’ll create a data frame named ‘test’ with three columns and four records and apply the outlier removal transform to it.
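The exact values in ‘test’ were shown only as an image; the values below are illustrative ones chosen so that 999 in ‘col1’ and -10 in ‘col3’ fall outside the 1.5 × IQR bounds, matching the behaviour described next.

test = pd.DataFrame({"col1": [100, 200, 300, 999],
                     "col2": [0, 0, 1, 2],
                     "col3": [-10, 0, 1, 2]})
test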


We can see that ‘col1’ has an outlier (999) and ‘col3’ also has an outlier (-10). We’ll first fit the ‘OutlierRemover’ to the ‘test’ data frame (using the already created instance, ‘outlier_remover’) and apply the transform to it.

outlier_remover.fit(test)
outlier_remover.transform(test)

We can see that the outlier in ‘col1’ (999) and the outlier in ‘col3’ (-10) are both replaced with NaN. We can also fit and apply the transform in a single step using the ‘fit_transform’ method, as shown below. This gives the same result as the one above.

outlier_remover.fit_transform(test)

We’ll create another instance of the ‘OutlierRemover’ class, named ‘outlier_remover_100’, by setting ‘factor’ to 100. As discussed earlier, the higher the ‘factor’, the more extreme a value must be before it is treated as an outlier.
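With the illustrative ‘test’ data frame from above, that looks like:

outlier_remover_100 = OutlierRemover(factor=100)
outlier_remover_100.fit_transform(test)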


We can see that ‘999’ in ‘col1’ and ‘-10’ in ‘col3’ aren’t considered outliers this time, as the ‘factor’ attribute is high. Now, we’ll apply the outlier removal transform to the Iris dataset. Before that, we’ll visualize the outliers in the four columns of the Iris dataset using a box plot.
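The plot itself was shown as an image; assuming the renamed Iris data frame ‘X’ from earlier, it can be reproduced with a pandas box plot along these lines:

import matplotlib.pyplot as plt

X.boxplot(figsize=(8, 4))
plt.title("Iris features before outlier removal")
plt.show()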

Box plot of the four Iris features before outlier removal (image by author)

In the above box plot, we can see that the column ‘SepalWidthCm’ has a few outliers, four to be precise. The other three columns have no outliers. We’ll create a ColumnTransformer to apply the ‘OutlierRemover’ to all the variables of the Iris dataset and visualize the variables after outlier removal using a box plot.
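A sketch of that step, reusing the ‘X’ data frame and the ‘OutlierRemover’ class defined earlier; wrapping the result back into a data frame is just a convenience for plotting:

from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt

ct = ColumnTransformer(
    transformers=[("outlier_remover", OutlierRemover(), list(X.columns))],
    remainder="passthrough")

X_outlier_removed = pd.DataFrame(ct.fit_transform(X), columns=X.columns)

X_outlier_removed.boxplot(figsize=(8, 4))
plt.title("Iris features after outlier removal")
plt.show()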

Box plot of the four Iris features after outlier removal (image by author)

In the above box plot, we can see that the outliers in the column ‘SepalWidthCm’ are removed, i.e. replaced with NaN. Next, we’ll find out how many outliers were removed from each column and which values they were.
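One way to check this, comparing the transformed data frame with the original (the original post showed the output as images):

# number of outliers removed (replaced with NaN) in each column
X_outlier_removed.isna().sum()

# the original 'SepalWidthCm' values that were flagged as outliers
X.loc[X_outlier_removed["SepalWidthCm"].isna(), "SepalWidthCm"]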


We can see that the values of ‘SepalWidthCm’ greater than 4 or less than or equal to 2 are removed, as they were outliers. The same can be seen in the earlier box plot showing the outliers. Now, we’ll create a pipeline that removes the outliers, imputes the removed values and fits a logistic regression model. We’ll tune the hyperparameters using GridSearchCV.
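A sketch of such a pipeline; the median imputation strategy and the grid values below are assumptions rather than the exact settings from the original notebook:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline(steps=[
    ("outlier_remover", OutlierRemover()),           # replace outliers with NaN
    ("imputer", SimpleImputer(strategy="median")),   # impute the NaNs
    ("classifier", LogisticRegression(max_iter=1000))])

param_grid = {"outlier_remover__factor": [0, 1, 2, 3, 4],
              "classifier__C": [0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
grid.best_params_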


Method 2

The second method of creating a custom transformer uses the ‘FunctionTransformer’ class of Scikit-Learn. This is a simpler approach that eliminates the need to define a class; however, we still need to define a function that performs the required transformation. Similar to method 1, we’ll create a custom transformer to remove the outliers. The below function takes an array/data frame along with ‘factor’ as inputs and replaces the outliers in each column with NaN.
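The function and the ‘FunctionTransformer’ instance appeared as an image; a minimal reconstruction, reusing the illustrative ‘test’ values from method 1:

from sklearn.preprocessing import FunctionTransformer

def outlier_removal(X, factor):
    # replace values outside [Q1 - factor*IQR, Q3 + factor*IQR] with NaN, column by column
    X = pd.DataFrame(X).copy()
    for col in X.columns:
        q1 = X[col].quantile(0.25)
        q3 = X[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - (factor * iqr)
        upper_bound = q3 + (factor * iqr)
        X[col] = X[col].mask((X[col] < lower_bound) | (X[col] > upper_bound), np.nan)
    return X

outlier_remover = FunctionTransformer(outlier_removal, kw_args={"factor": 1.5})

test = pd.DataFrame({"col1": [100, 200, 300, 999],
                     "col2": [0, 0, 1, 2],
                     "col3": [-10, 0, 1, 2]})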

In the above code snippet, we’ve also created an instance named ‘outlier_remover’ of the ‘FunctionTransformer’ class by passing it the custom function (‘outlier_removal’) we defined for outlier removal. In this method, any additional arguments to the function (other than the input array/data frame), such as ‘factor’, need to be passed to ‘FunctionTransformer’ as a dictionary via its ‘kw_args’ argument. We’ve also recreated the ‘test’ data frame we used in method 1.


We can see that ‘col1’ has an outlier (999) and ‘col3’ also has an outlier (-10). We’ll fit and apply the outlier removal transform to the data using the already created instance, ‘outlier_remover’.

outlier_remover.fit_transform(test)

We can see that the outlier in ‘col1’ (999) and the outlier in ‘col3’ (-10) are both replaced with NaN. Creating a custom transformer using ‘FunctionTransformer’ provides us with a few additional methods, as shown below.

[i for i in dir(outlier_remover) if not i.startswith('_')]

Now, we’ll create a Pipeline that removes the outliers, imputes the removed outliers and fits a logistic regression model. We’ll tune the hyperparameters using GridSearchCV.
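A sketch of that pipeline, reusing the ‘outlier_removal’ function and the imports from the earlier snippets; the ‘kw_args’ grid follows the values mentioned below, while the imputer and classifier settings are assumptions:

pipeline2 = Pipeline(steps=[
    ("outlier_remover", FunctionTransformer(outlier_removal, kw_args={"factor": 1.5})),
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", LogisticRegression(max_iter=1000))])

param_grid2 = {"outlier_remover__kw_args": [{"factor": 0}, {"factor": 1}, {"factor": 2},
                                            {"factor": 3}, {"factor": 4}],
               "classifier__C": [0.01, 0.1, 1, 10, 100]}

grid2 = GridSearchCV(pipeline2, param_grid=param_grid2, cv=5, scoring="accuracy")
grid2.fit(X, y)
grid2.best_params_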


A major difference in method 2 is that we need to tune the ‘kw_args’ hyperparameter itself, unlike with other transformers (including the one discussed in method 1). In the above code snippet, we’ve tuned the ‘kw_args’ hyperparameter using the list of values [{‘factor’:0}, {‘factor’:1}, {‘factor’:2}, {‘factor’:3}, {‘factor’:4}]. This can make it difficult to tune multiple hyperparameters of a custom transformer.

These are the two methods of defining a custom transformer using Scikit-Learn. Defining custom transformers and including them in a pipeline simplifies model development and also prevents data leakage when using k-fold cross-validation.

Learn more about my work at https://ksvmuralidhar.in/

