
What is Feature Scaling & Why is it Important in Machine Learning?

MinMaxScaler vs StandardScaler vs RobustScaler

Photo by Braden Collum on Unsplash

Feature scaling is the process of normalizing the range of features in a dataset.

Real-world datasets often contain features that vary widely in magnitude, range, and units. For machine learning models to interpret these features on the same scale, we need to perform feature scaling.

In the world of science, we all know the importance of comparing apples to apples, and yet many people, especially beginners, tend to overlook feature scaling as part of their data preprocessing for machine learning. As we will see in this article, this can cause models to make inaccurate predictions.

In this article, we will discuss:

  • Why Feature Scaling is important
  • The difference between normalization and standardization
  • Why and how feature scaling affects model performance

More specifically, we will look at 3 different scalers for feature scaling in the Scikit-learn library:

  1. MinMaxScaler
  2. StandardScaler
  3. RobustScaler

As usual, you can find the full notebook on my GitHub here.

For the purpose of this tutorial, we will be using one of the toy datasets in Scikit-learn, the Boston house prices dataset.

This is a regression problem in machine learning, as house price is a continuous variable.
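If you would like to follow along, here is a minimal sketch of how the dataset can be loaded into a pandas DataFrame. Note that load_boston is only available in older versions of Scikit-learn (it was removed in version 1.2), so this assumes such a version is installed; MEDV is the dataset’s name for the target variable.

    import pandas as pd
    from sklearn.datasets import load_boston  # available in scikit-learn < 1.2 only

    # Load the Boston house prices dataset into a DataFrame
    boston = load_boston()
    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    df["MEDV"] = boston.target  # MEDV: median house value, our target variable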


Examine data

Let us first get an overall feel for our data.
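In code, this initial inspection boils down to three quick pandas calls on the DataFrame (df here refers to the DataFrame loaded above):

    # Peek at the first 5 rows
    df.head()

    # Check for missing values and data types
    df.info()

    # Summary statistics for each feature
    df.describe()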

These are the first 5 rows of the dataset.

First 5 rows of the dataset

As we can see, we have 13 independent variables and a target variable.

Missing values and data type

Hooray, no missing values! This means we don’t have to worry about imputation or dropping rows or columns with missing data.

Furthermore, it also appears that all of our independent variables, as well as the target variable, are of the float64 data type.

Summary statistics

We can clearly observe that the features have very different scales. This is largely attributed to the different units in which these features were measured and recorded.

This is exactly the issue that feature scaling resolves.


Understanding the effects of different scalers

In this section, we will learn the distinction between normalization and standardization. In addition, we will examine the transformational effects of 3 different feature scaling techniques in Scikit-learn.

Normalization

Normalization, also known as min-max scaling, is a scaling technique whereby the values in a column are shifted and rescaled so that they fall within a fixed range, typically between 0 and 1.

MinMaxScaler is the Scikit-learn function for normalization.
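In other words, each value x in a column is mapped to (x - min) / (max - min). A minimal sketch of MinMaxScaler in use, assuming the DataFrame df from earlier:

    from sklearn.preprocessing import MinMaxScaler

    minmax = MinMaxScaler()  # defaults to a (0, 1) feature range
    X_minmax = minmax.fit_transform(df)  # per column: (x - min) / (max - min)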

Standardization

On the other hand, standardization, or Z-score normalization, is a scaling technique whereby the values in a column are rescaled so that they have the properties of a standard Gaussian distribution, namely a mean of 0 and a variance of 1 (note that standardization does not change the shape of the distribution itself).

StandardScaler is the Scikit-learn function for standardization.

Unlike StandardScaler, RobustScaler scales features using statistics that are robust to outliers. More specifically, RobustScaler removes the median and scales the data according to the interquartile range, thus making it less susceptible to outliers in the data.
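Minimal sketches of both scalers, again assuming the DataFrame df from earlier (StandardScaler subtracts the mean and divides by the standard deviation, while RobustScaler subtracts the median and divides by the interquartile range):

    from sklearn.preprocessing import StandardScaler, RobustScaler

    standard = StandardScaler()
    X_standard = standard.fit_transform(df)  # per column: (x - mean) / standard deviation

    robust = RobustScaler()  # default quantile_range=(25.0, 75.0)
    X_robust = robust.fit_transform(df)  # per column: (x - median) / IQR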

Normalization vs standardization

Here comes the million-dollar question – when should we use normalization and when should we use standardization?

As much as I hate the response I’m about to give, it depends.

The choice between normalization and standardization really comes down to the application.

Standardization is generally preferred over normalization in most machine learning contexts, especially when an algorithm compares data points using distance measures that are sensitive to the scale of each feature. This is most prominent in Principal Component Analysis (PCA), a dimensionality reduction algorithm, where we are interested in the components that maximize the variance in the data; without standardization, features with larger scales would dominate those components.

Normalization, on the other hand, also has many practical applications, particularly in computer vision and image processing, where pixel intensities stored as integers between 0 and 255 are commonly rescaled to a 0 to 1 range. Moreover, neural networks generally train more reliably when their inputs are normalized to a small, consistent range such as 0 to 1.

At the end of the day, there is no definitive answer as to whether you should normalize or standardize your data. One can always apply both techniques and compare the model performance under each approach for the best result.

Application

Now that we have gained a theoretical understanding of feature scaling and the difference between normalization and standardization, let’s see how they work in practice.

To demonstrate the effects of MinMaxScaler, StandardScaler, and RobustScaler, I have chosen to examine the following features in our dataset before and after implementing feature scaling:

  • ZN
  • AGE
  • TAX
  • B
Original vs MinMaxScaler vs StandardScaler vs RobustScaler

As we can see, our original features have wildly different ranges.

MinMaxScaler has managed to rescale those features so that their values are bounded between 0 and 1.

StandardScaler and RobustScaler, on the other hand, have rescaled those features so that their values are centered around 0 (around the mean for StandardScaler and the median for RobustScaler).
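For reference, here is a hedged sketch of how a comparison like this can be produced. It assumes the DataFrame df from earlier and uses seaborn’s kdeplot; the plotting details are illustrative rather than the exact code behind the figure above.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

    cols = ["ZN", "AGE", "TAX", "B"]
    scalers = {
        "Original": None,
        "MinMaxScaler": MinMaxScaler(),
        "StandardScaler": StandardScaler(),
        "RobustScaler": RobustScaler(),
    }

    fig, axes = plt.subplots(1, 4, figsize=(20, 4))
    for ax, (name, scaler) in zip(axes, scalers.items()):
        data = df[cols] if scaler is None else pd.DataFrame(
            scaler.fit_transform(df[cols]), columns=cols
        )
        for col in cols:
            sns.kdeplot(x=data[col], ax=ax, label=col)  # distribution of each feature
        ax.set_title(name)
        ax.legend()
    plt.tight_layout()
    plt.show()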


Which models require feature scaling?

I mentioned in the introduction that unscaled data can adversely impact a model’s ability to make accurate predictions, but so far, we have yet to discuss exactly how and why it does.

As a matter of fact, feature scaling does not always result in an improvement in model performance. There are some machine learning models that do not require feature scaling.

In this section of the article, we will explore the following classes of machine learning algorithms and address whether or not feature scaling will impact their performance:

  • Gradient descent-based algorithms
  • Distance-based algorithms
  • Tree-based algorithms

Gradient descent-based algorithms

Gradient descent is an iterative optimization algorithm that takes us to the minimum of a function.

Machine learning algorithms like linear regression and logistic regression rely on gradient descent to minimize their loss functions or in other words, to reduce the error between the predicted values and the actual values.

Because the gradient with respect to each model weight is proportional to the values of the corresponding feature, features with very different magnitudes and ranges produce very different step sizes. Therefore, to ensure that gradient descent converges more smoothly and quickly, we need to scale our features so that they share a similar scale.
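As a quick illustration (not part of the article’s experiment), a gradient descent-based model such as Scikit-learn’s SGDRegressor is typically trained on standardized inputs for exactly this reason; X and y here are assumed to be the features and target from the dataset above.

    from sklearn.linear_model import SGDRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # X = df.drop(columns="MEDV"), y = df["MEDV"] are assumed from earlier.
    # Without scaling, the gradient steps for large-scale features (e.g. TAX)
    # dwarf those for small-scale features, which slows or destabilizes convergence.
    sgd_unscaled = SGDRegressor(max_iter=1000, random_state=42).fit(X, y)
    sgd_scaled = make_pipeline(StandardScaler(),
                               SGDRegressor(max_iter=1000, random_state=42)).fit(X, y)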

Check out this video where Andrew Ng explains the gradient descent algorithm in more detail.

Distance-based algorithms

Distance-based models are the most vulnerable to unscaled data, because distance calculations sit at the core of their underlying algorithms.

Algorithms like k-nearest neighbors, support vector machines, and k-means clustering use the distance between data points to determine their similarity. Hence, a feature with a greater magnitude will dominate the distance calculation and effectively be assigned a higher weight by the model. This is not an ideal scenario, as we do not want our model to be heavily biased toward a single feature.

Evidently, it is crucial that we apply feature scaling to our data before fitting distance-based algorithms, to ensure that all features contribute equally to the predictions.
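To make this concrete, here is a tiny example with made-up numbers (not taken from the dataset) showing how a large-scale feature such as TAX dominates the Euclidean distance that KNN relies on:

    import numpy as np

    # Two houses described by (ZN, TAX); TAX is orders of magnitude larger than ZN
    a = np.array([12.5, 300.0])
    b = np.array([18.0, 660.0])

    # The unscaled distance is driven almost entirely by the TAX difference of 360
    print(np.linalg.norm(a - b))  # ~360.04; the ZN difference of 5.5 barely registers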

Tree-based algorithms

Each node in a classification and regression tree (CART) model, otherwise known as a decision tree, splits on a single feature in the dataset.

The tree splits each node in a way that increases the homogeneity of the resulting child nodes, and a split on one feature is not affected by the other features in the dataset. Because only the relative ordering of values within a feature matters when choosing a split threshold, rescaling a feature leaves the resulting tree unchanged.

For that reason, we can deduce that decision trees are invariant to the scale of the features and thus do not require feature scaling.

The same applies to other tree-based ensemble models, such as random forest and gradient boosting.
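A small hedged check of this invariance, on synthetic data rather than the Boston dataset: fitting the same decision tree on raw and on standardized features should yield identical predictions, because scaling does not change the ordering of values within any feature.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X_syn = rng.uniform(0, 1000, size=(100, 3))       # features on a large, arbitrary scale
    y_syn = 0.5 * X_syn[:, 0] + rng.normal(size=100)  # target driven mainly by the first feature

    tree_raw = DecisionTreeRegressor(random_state=0).fit(X_syn, y_syn)
    scaler = StandardScaler().fit(X_syn)
    tree_scaled = DecisionTreeRegressor(random_state=0).fit(scaler.transform(X_syn), y_syn)

    # The split thresholds differ, but the predictions should not
    print(np.allclose(tree_raw.predict(X_syn),
                      tree_scaled.predict(scaler.transform(X_syn))))  # expected: True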


Comparing model accuracy

Now that we understand the types of models that are sensitive and insensitive to feature scaling, let us now convince ourselves with a concrete example using the Boston house prices dataset.

I have chosen 2 distance-based algorithms (KNN and SVR) as well as 1 tree-based algorithm (decision tree regressor) for our little experiment.

Here, I will construct a machine learning pipeline that contains a scaler and a model. Using that pipeline, we will fit and transform the features and subsequently make predictions using the model. These predictions are then evaluated using root mean squared error (RMSE).

We should expect to see an improved model performance with feature scaling under KNN and SVR and a constant model performance under decision trees with or without feature scaling.
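A hedged sketch of this kind of pipeline, assuming a train/test split of the features X and target y from earlier; the hyperparameters are the Scikit-learn defaults and are illustrative rather than the exact settings from the notebook.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_squared_error

    # X = df.drop(columns="MEDV"), y = df["MEDV"] are assumed from earlier
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    scalers = {"No scaling": None, "MinMaxScaler": MinMaxScaler(),
               "StandardScaler": StandardScaler(), "RobustScaler": RobustScaler()}
    models = {"KNN": KNeighborsRegressor(), "SVR": SVR(),
              "Decision tree": DecisionTreeRegressor(random_state=42)}

    for model_name, model in models.items():
        for scaler_name, scaler in scalers.items():
            steps = [model] if scaler is None else [scaler, model]
            pipe = make_pipeline(*steps)
            pipe.fit(X_train, y_train)
            rmse = np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))
            print(f"{model_name} with {scaler_name}: RMSE = {rmse:.2f}")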

KNN

The results of the KNN model are as follows.

KNN

As expected, the errors are much smaller with feature scaling than without feature scaling. In other words, our model performed better using scaled features.

In this example, KNN performed best under RobustScaler.

SVR

The results of the SVR model are as follows.

SVR

Similar to KNN, SVR also performed better with scaled features as seen by the smaller errors.

In this example, SVR performed best under StandardScaler.

Decision tree

The results of the decision tree model are as follows.

Decision tree

As expected, the decision tree is insensitive to all feature scaling techniques, as seen from the RMSE values, which remain essentially unchanged whether or not the features are scaled.


Conclusion

Well done for making it all the way to the end of this article!

To summarize, feature scaling is the process of transforming the features in a dataset so that their values share a similar scale.

In this article, we have learned the difference between normalization and standardization as well as 3 different scalers in the Scikit-learn library, MinMaxScaler, StandardScaler, and RobustScaler.

We also learned that gradient descent-based and distance-based algorithms require feature scaling, while tree-based algorithms do not. We demonstrated this with an example using the Boston house prices dataset, comparing model accuracy with and without feature scaling.

I hope that you have learned something new from this article. Feel free to check out my other articles on data preprocessing using Scikit-learn.

Happy learning!

Stop Wasting Useful Information When Imputing Missing Values

Guide to Encoding Categorical Features Using Scikit-Learn For Machine Learning

