
Data Preprocessing with Scikit-Learn: Standardization and Scaling

How to preprocess numerical features with different value ranges.


Scikit-learn is a widely used machine learning library for Python. It has gained tremendous popularity among data science practitioners thanks to its wide variety of algorithms and easy-to-understand syntax. In addition to ready-to-use algorithms, scikit-learn also provides useful functions and methods for data preprocessing.

Data preprocessing is an extremely important step in machine learning and deep learning. We cannot just dump raw data into a model and expect it to perform well. Even if we build a complex, well-structured model, its performance is only as good as the data we feed it. Thus, we need to process the raw data to boost the performance of our models.

In this post, we will cover the ways to handle numerical features (columns) that have very different value ranges. We will apply standardization and scaling. Let’s start with the motivation behind these transformations and then explore the differences between them with examples.

Motivation

The datasets that we fit to machine learning models usually have many features, and the values of different features are often on very different scales. For instance, consider a model trying to predict house prices. The area of a house may be around 200 square meters, whereas its age is usually less than 20, and the number of bedrooms is typically 1, 2, or 3. All of these features matter when determining the price of a house, but if we use them without any scaling, machine learning models might give more importance to the features with higher values. Models tend to perform better and converge faster when the features are on a relatively similar scale.

Standardization and StandardScaler

One solution to this issue is standardization. Consider columns as variables. When a column is standardized, the mean of the column is subtracted from each value, and the values are then divided by the standard deviation of the column. The resulting columns have a mean very close to zero and a standard deviation of 1. Note that standardization centers and rescales the values but does not change the shape of their distribution. Standardization can be achieved with StandardScaler.

The functions and transformers used for preprocessing live in the sklearn.preprocessing package. Let’s import this package along with numpy and pandas.

import numpy as np
import pandas as pd
from sklearn import preprocessing

We can create a sample matrix representing features and then transform it with a StandardScaler object.

a = np.random.randint(10, size=(10,1))       # values between 0 and 9
b = np.random.randint(50, 100, size=(10,1))  # values between 50 and 99
c = np.random.randint(500, 700, size=(10,1)) # values between 500 and 699
X = np.concatenate((a,b,c), axis=1)          # combine into a 10x3 feature matrix
X

X represents a dataset with 10 rows (samples) and 3 columns (features). The mean and standard deviation of each column:
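We can check them with NumPy (the exact numbers will differ from run to run because the data is random):

X.mean(axis=0)   # column means, roughly [4.5, 75, 600]
X.std(axis=0)    # column standard deviations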

The columns differ greatly in both mean and standard deviation.

We can now create a StandardScaler object and fit it on X.

sc = preprocessing.StandardScaler().fit(X)

X can be transformed by calling the transform method of the StandardScaler object.

X_standardized = sc.transform(X)
X_standardized

Let’s calculate the mean and standard deviation of the transformed features.
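NumPy makes this easy, and the same result can also be reproduced by writing the formula out by hand:

X_standardized.mean(axis=0)   # very close to 0 for every column
X_standardized.std(axis=0)    # 1.0 for every column

# manual standardization gives the same values as StandardScaler
(X - X.mean(axis=0)) / X.std(axis=0)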

The mean of each feature is very close to 0, and all the features have unit (1) variance. Note that standard deviation is the square root of variance, so a standard deviation of 1 means the variance is also 1.

I would like to emphasize a very important point here. Suppose we are working on a supervised learning task, so we split the dataset into training and test subsets. In that case, we fit the scaler only on the training set, not on the entire dataset. We still need to transform the test set, but that is done with the transform method:

  • StandardScaler.fit(X_train)
  • StandardScaler.transform(X_train)
  • StandardScaler.transform(X_test)

Fitting the scaler on the entire dataset allows the model to learn about the test set. However, models are not supposed to learn anything about the test set; doing so defeats the purpose of the train-test split. In general, this issue is called data leakage.
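Here is a minimal sketch of that workflow. X_full and y are hypothetical placeholders for a full feature matrix and target; the variable names are only for illustration:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)                        # learn mean and std from the training set only
X_train_scaled = scaler.transform(X_train) # transform the training set
X_test_scaled = scaler.transform(X_test)   # transform the test set with the training statistics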


When we transform the test set, the features will not have exactly zero mean and unit standard deviation, because the scaler used in the transformation was fit on the training set. The test set is shifted and scaled using the mean and standard deviation learned from the training set. Let’s create a sample test set and transform it.

X_test = np.array([[8, 90, 650], [5, 70, 590], [7, 80, 580]])
X_test
X_test_transformed = sc.transform(X_test)   # transform with the statistics learned from X
X_test_transformed

The mean and standard deviation of the columns of the test set:
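These can be checked the same way (unlike for the training set, the means and standard deviations will not be exactly 0 and 1):

X_test_transformed.mean(axis=0)
X_test_transformed.std(axis=0)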


MinMaxScaler and RobustScaler

Another way of bringing the value ranges to a similar level is to scale them to a specific range. For instance, we can squeeze each column between 0 and 1 so that the minimum and maximum values before scaling become 0 and 1 after scaling. This kind of scaling can be achieved with scikit-learn’s MinMaxScaler. The default range is [0, 1], but we can change it with the feature_range parameter.

from sklearn.preprocessing import MinMaxScaler

mm_scaler = MinMaxScaler()                        # default range [0, 1]
X_scaled = mm_scaler.fit_transform(X)
X_scaled

mm_scaler2 = MinMaxScaler(feature_range=(0,10))   # custom range [0, 10]
X_scaled2 = mm_scaler2.fit_transform(X)
X_scaled2
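MinMaxScaler essentially applies (x - min) / (max - min) column by column, so the default [0, 1] scaling can be reproduced manually:

X_min, X_max = X.min(axis=0), X.max(axis=0)
(X - X_min) / (X_max - X_min)   # same values as X_scaled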

StandardScaler and MinMaxScaler are not robust to outliers. Suppose we have a feature whose values are between 100 and 500, with an exceptional value of 15000. If we scale this feature with MinMaxScaler(feature_range=(0,1)), 15000 is scaled to 1 and all the other values become very close to the lower bound, zero. Thus, we end up with a disproportionate scale, which negatively affects the performance of a model. One solution is to remove the outliers and then apply scaling. However, removing outliers is not always a good practice. In such cases, we can use scikit-learn’s RobustScaler.
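A quick sketch of that scenario (the numbers are only illustrative):

feature = np.array([[100], [250], [400], [500], [15000]])   # one extreme value
MinMaxScaler(feature_range=(0,1)).fit_transform(feature).ravel()
# roughly [0, 0.01, 0.02, 0.027, 1] -- everything except the outlier is squashed near 0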

RobustScaler, as the name suggests, is robust to outliers. It removes the median and scales the data according to a quantile range (by default the IQR: interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). RobustScaler does not limit the scaled values to a predetermined interval, so we do not need to specify a range as we do for MinMaxScaler.
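In other words, each column is transformed as (x - median) / IQR, which we can sketch manually with NumPy (this matches RobustScaler’s default settings):

median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
(X - median) / (q3 - q1)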

We can see the difference between MinMaxScaler and RobustScaler by adding a row of outliers to our previous dataset.

X_new = np.append(X, np.array([[50,420,1400]]), axis=0)
X_new

Let’s first apply MinMaxScaler with range [0,1].

X_new_mm = mm_scaler.fit_transform(X_new)
X_new_mm

The outliers are scaled to the upper limit of the range, so all the other values end up very close to the lower limit.

How about RobustScaler?

from sklearn.preprocessing import RobustScaler
r_scaler = RobustScaler()
X_new_rs = r_scaler.fit_transform(X_new)
X_new_rs
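We can compare how spread out the ten original (non-outlier) rows remain under each scaler:

# spread of the 10 original rows under each scaling (outlier row excluded)
X_new_mm[:10].max(axis=0) - X_new_mm[:10].min(axis=0)   # small fractions: values squashed together
X_new_rs[:10].max(axis=0) - X_new_rs[:10].min(axis=0)   # much larger: the spread is preserved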

When to use which?

We have covered StandardScaler, MinMaxScaler, and RobustScaler. Many machine learning models benefit from having features on similar scales. However, there is no strict rule that defines which kind of transformation is optimal for a particular algorithm.

Both MinMaxScaler and StandardScaler are sensitive to outliers. Thus, when there are outliers that we cannot remove, RobustScaler is a better choice than the other two.

Without outliers, MinMaxScaler performs well in most cases. However, deep learning algorithms (e.g. neural networks) and regression-based algorithms tend to work better with standardized, zero-centered features, so StandardScaler is a better choice for such cases.

These are the most commonly used transformation techniques, and they satisfy our needs in general. Scikit-learn also provides more specific transformations, which are explained in the documentation of the preprocessing package.


Thank you for reading. Please let me know if you have any feedback.

