
5 Data Transformers to know from Scikit-Learn

Data Transformation techniques you might not know exist.

Image from Author

If you enjoy my content and want to get more in-depth knowledge regarding data or just daily life as a Data Scientist, please consider subscribing to my newsletter here.

As Data Scientists, we often face situations where exploring data and developing machine learning models becomes difficult. The struggle could come from statistical assumptions that do not fit your data or from too much noise in the data. This is where you might need to transform your data to get better clarity or to satisfy a statistical method’s assumptions.

Previously, I shared a beginner-friendly article regarding Data Transformation, which you could read here.

Beginner Explanation for Data Transformation

I suggest you read the article above if you do not yet understand what Data Transformation is and the benefits (or drawbacks) of transforming your data. If you feel you have understood the concepts, then we can move on to a more in-depth discussion.

One disclaimer I would make is that you need to be careful when doing Data Transformation because you end up with transformed data – which is no longer your original data. You need to understand why you are transforming your data and what the transformed output means, which is why I wrote this article.

In this article, I want to outline some of the more advanced data transformers from Scikit-Learn that we can use in specific situations. What are they? Let’s get into it.


1. Quantile Transformer

Quantile Transformation is a non-parametric data transformation technique that transforms your numerical data distribution to follow a certain target distribution (often the Gaussian, or Normal, Distribution). In Scikit-Learn, the Quantile Transformer can transform the data into either a Normal distribution or a Uniform distribution, depending on your reference distribution.

How does the Quantile Transformation work? Conceptually, the Quantile Transformer applies the Quantile Function to the data we intend to transform. The Quantile Function itself is the inverse of the Cumulative Distribution Function (CDF), which you could check here for the Normal Distribution. If you use the Uniform Distribution, the transformed data would be the quantile position of the data. Let’s use example data to understand the transformation better.

import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import QuantileTransformer
#Using mpg data
mpg = sns.load_dataset('mpg')
#Quantile Transformation (by default it is Uniform)
quantile_transformer = QuantileTransformer(random_state=0, output_distribution='uniform')
mpg['mpg_trans'] = pd.Series(quantile_transformer.fit_transform(np.array(mpg['mpg']).reshape(-1, 1))[:, 0])
mpg[['mpg', 'mpg_trans']].head()
mpg[['mpg', 'mpg_trans']].head()
Image by Author

As you can see in the image above, the actual data at index 0 is 18, and the transformed value is 0.28; why is it 0.28? Let’s take the 0.28 quantile of the mpg column.

np.quantile(mpg['mpg'], 0.28)
Image by Author

The result is 18; this means that the transformed value is an approximation of the quantile position of the actual data. Note that when you apply a Quantile Transformation, you lose the linear correlation between the transformed variable and other variables because the Quantile Transformer is a non-linear transformer. However, measuring the linear correlation between transformed variables is not meaningful anyway, as the data have been changed.

Quantile transformation is often used to reduce the impact of outliers or to make the data fit the Normal Distribution, although there are many similar data transformations you could compare it with.
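
The same transformer can also map the data onto a Normal distribution instead of a Uniform one. Below is a minimal sketch; mpg_norm is a hypothetical column name I use for the result.

#Quantile Transformation with a Normal (Gaussian) output distribution
normal_transformer = QuantileTransformer(random_state=0, output_distribution='normal')
mpg['mpg_norm'] = normal_transformer.fit_transform(np.array(mpg['mpg']).reshape(-1, 1))[:, 0]
mpg[['mpg', 'mpg_norm']].head()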


2. Power Transformer

While the Quantile Transformer is a non-parametric transformer that applies the Quantile Function, the Power Transformer is a parametric transformer that applies a power function. Like the Quantile Transformer, the Power Transformer is often used to transform data to follow the Normal Distribution.

Scikit-Learn provides two methods within the PowerTransformer class: the Yeo-Johnson transform and the Box-Cox transform. The basic difference between the methods is the data they allow to be transformed – Box-Cox requires the data to be strictly positive, while Yeo-Johnson allows the data to be both negative and positive. Let’s use example data to try the Power Transformation from Scikit-Learn. I want to use the previous dataset, as some features are not normally distributed – for example, the weight feature.

sns.distplot(mpg['weight'])
Image by Author

As we can see in the image above, the distribution is clearly right-skewed (positively skewed), which means it does not follow the normal distribution.
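
We can also confirm the skewness numerically with Pandas; a value well above zero indicates a right skew.

#Skewness of the weight feature; a positive value confirms the right skew
print(mpg['weight'].skew())

Let’s use the Power Transformer to transform the data so it follows the normal distribution more closely.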

from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='box-cox')
mpg['weight_trans'] = pt.fit_transform(np.array(mpg['weight']).reshape(-1, 1))[:, 0]
sns.distplot(mpg['weight_trans'])
Image by Author

Using the Power Transformer (Box-Cox in this example) pushes the data to follow the normal distribution more closely. As the data is quite skewed, the transformed data does not follow the normal distribution completely, but it is much closer than the untransformed data.
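
Box-Cox would raise an error on zero or negative values. Below is a minimal sketch of the Yeo-Johnson method on such data, using a hypothetical centered version of the weight feature to produce negative values.

#Yeo-Johnson accepts zero and negative values, unlike Box-Cox
pt_yj = PowerTransformer(method='yeo-johnson')
#Centering the weight feature produces negative values
centered = np.array(mpg['weight'] - mpg['weight'].mean()).reshape(-1, 1)
mpg['weight_yj'] = pt_yj.fit_transform(centered)[:, 0]
sns.distplot(mpg['weight_yj'])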

While both the Quantile Transformer and the Power Transformer can transform your data into another distribution while preserving the ranks, which one to use might still depend on your data and the scale you expect from the transformation. There is no sure way to say that one transformer is better than the other; what you can do to find out which transformer suits your data is to apply each one and measure the metrics you use, as sketched below.
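
As a minimal sketch of that comparison, assuming a hypothetical regression of mpg on weight, we could score each transformer inside a pipeline with cross-validation:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.array(mpg['weight']).reshape(-1, 1)
y = mpg['mpg']
#Score each transformer with the same downstream model (default R^2 metric)
for name, trans in [('quantile', QuantileTransformer(n_quantiles=100, random_state=0)),
                    ('power', PowerTransformer(method='box-cox'))]:
    pipe = Pipeline([('transform', trans), ('model', LinearRegression())])
    print(name, cross_val_score(pipe, X, y, cv=5).mean())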


3. K-Bins Discretization

Discretization is a process of transforming a continuous feature into a categorical feature by partitioning it into several bins within the expected value range (intervals). The table below shows sample data and its discretization transformation.

Image by Author

In the table above, I have five data points (10, 15, 20, 25, 30), and I discretize the continuous values into a categorical feature I call Bins, which holds the values 1 and 2. In the Bins feature, I put values between 10–20 into category 1 and the rest into category 2 – this is essentially how discretization works.
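
As a minimal sketch of the same idea in code, here is that table reproduced with Pandas’ cut function:

#The five data points from the example above
sample = pd.DataFrame({'value': [10, 15, 20, 25, 30]})
#Values from 10 up to 20 fall into bin 1, the rest into bin 2
sample['bins'] = pd.cut(sample['value'], bins=[10, 20, 30], labels=[1, 2], include_lowest=True)
sample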

In Scikit-Learn, the process of discretization using binning with a set interval (often quantile-based) is implemented in the KBinsDiscretizer class. Let’s use the example dataset to get a better understanding of the function.

from sklearn.preprocessing import KBinsDiscretizer
#Dividing the data into 5 bins with quantile intervals, encoded as ordinal categories
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
mpg['mpg_discrete'] = est.fit_transform(np.array(mpg['mpg']).reshape(-1, 1))
mpg[['mpg', 'mpg_discrete']].sample(5)
Image by Author

As we can see from the table above, the continuous feature mpg is discretized into an ordinal categorical feature. You would often benefit from One-Hot Encoding the discretized feature; that is why KBinsDiscretizer also offers One-Hot capability (in fact, the default encode parameter is ‘onehot’). However, I often use the Pandas get_dummies function for the OHE process, as it is easier to work with than doing it directly from KBinsDiscretizer.
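
As a sketch of both routes – one-hot directly from KBinsDiscretizer versus get_dummies on the ordinal output (mpg_ohe and mpg_dummies are hypothetical names):

#Option 1: let KBinsDiscretizer one-hot encode directly
#(encode='onehot' returns a sparse matrix)
est_ohe = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
mpg_ohe = est_ohe.fit_transform(np.array(mpg['mpg']).reshape(-1, 1))
#Option 2: one-hot encode the ordinal result with Pandas
mpg_dummies = pd.get_dummies(mpg['mpg_discrete'], prefix='mpg_bin')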


4. Feature Binarization

Feature Binarization is a simple discretization process that uses a certain threshold to transform a continuous feature into a categorical feature. The value resulting from Feature Binarization is a Boolean value – True or False (1 or 0). Let’s try the Binarizer class from Scikit-Learn to understand the concept.

from sklearn.preprocessing import Binarizer
#Setting the threshold to 20
transformer = Binarizer(threshold=20)
mpg['mpg_binary'] = transformer.fit_transform(np.array(mpg['mpg']).reshape(-1, 1))
mpg[['mpg', 'mpg_binary']].sample(5)
Image by Author

As you can see from the table above, values of 20 or below return False (0), and the rest return True (1). The threshold is something we set ourselves, so you should be able to explain why you chose your current threshold. Note that if you set KBinsDiscretizer’s n_bins to 2, it would be similar to the Binarizer when the threshold equals the bin edge value, as sketched below.
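
Here is a minimal sketch of that relationship; mpg_two_bins is a hypothetical column name, and bin_edges_ exposes the edge the discretizer learned.

#A two-bin quantile discretizer learns a single inner edge (the median)
est2 = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
mpg['mpg_two_bins'] = est2.fit_transform(np.array(mpg['mpg']).reshape(-1, 1))[:, 0]
#Inspect the learned edges: [min, inner edge, max] for this feature;
#a Binarizer with threshold equal to the inner edge gives a near-identical split
print(est2.bin_edges_[0])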


5. Function Transformers

Scikit-Learn has provided us with many transformation methods that we can use in a data preprocessing pipeline. However, sometimes we want to apply our own function for data transformation, and Scikit-Learn does not offer it out of the box. That is why Scikit-Learn also provides the FunctionTransformer class, so we can develop our own data transformation function.

Why would we want to develop a custom function with Scikit-Learn? Sometimes you are required to have a Scikit-Learn pipeline for your model and need a custom transformation within that data pipeline. The more you work with models and data, the more you realize that a custom transformer is used more often than you might think.

Let’s try to create our own transformer with FunctionTransformer. I want to transform my data into log values in my data pipeline, but Scikit-Learn does not offer that function out of the box; that is why I need to develop it myself.

from sklearn.preprocessing import FunctionTransformer
#Passing the log function from numpy which would be applied to every data in the column
transformer = FunctionTransformer(np.log, validate=True)
mpg['mpg_log'] = transformer.fit_transform(np.array(mpg['mpg']).reshape(-1,1))
mpg[['mpg', 'mpg_log']].head(5)
Image by Author

With the Function Transformer, you control your own transformation result, as in the output above. The data is transformed into log values, and you can use it for further processing.
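
Because FunctionTransformer is a standard Scikit-Learn transformer, it also slots straight into a pipeline. Below is a minimal sketch, assuming a hypothetical regression of mpg on weight, chaining the log transform with a model.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
#A hypothetical pipeline: log-transform the feature, then fit a linear model
pipe = Pipeline([
    ('log', FunctionTransformer(np.log, validate=True)),
    ('model', LinearRegression())
])
pipe.fit(np.array(mpg['weight']).reshape(-1, 1), mpg['mpg'])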

You don’t necessarily need to rely on the NumPy package; you can always create your own function – what matters is that the function produces an output. For example, I create my own function in the code below.

def cust_func(x):
    return x + 1

The function returns a numerical value with one added to it. Let’s pass it into our custom transformer.

transformer = FunctionTransformer(cust_func, validate=True)
mpg['mpg_cust'] = transformer.fit_transform(np.array(mpg['mpg']).reshape(-1,1))
mpg[['mpg', 'mpg_cust']].head(5)
Image by Author

As you can see from the table above, the resulting values follow the function we developed previously. FunctionTransformer sounds simple, but in the long run it would help your pipeline process.


Conclusion

Data Transformation is a typical process in your data work and often benefits your work if you know what the data transformation process produces. Scikit-Learn has provided us with several data transformation methods, including:

  1. Quantile Transformer
  2. Power Transformer
  3. K-Bins Discretizer
  4. Feature Binarization
  5. Function Transformers

I hope it helps!


Visit me on my LinkedIn or Twitter.

If you are not subscribed as a Medium Member, please consider subscribing through my referral.

