
Normalization vs Standardization

The two most important feature scaling techniques

Ramya Vidiyala
Towards Data Science
6 min read · Jun 28, 2020


In machine learning, a model is only as good (or as bad) as the data you train it with. The magnitudes of different features affect different machine learning models for various reasons.

For example, consider a data set containing two features, age and income. Age ranges from 0 to 100, while income ranges from 0 to very large values that are usually far higher than 100; income can easily be about 1,000 times larger than age. So these two features are on very different scales. When we do further analysis, such as multivariate linear regression, the income attribute will intrinsically influence the result more due to its larger values, but this doesn’t necessarily mean it is more important as a predictor. Therefore, the range of all features should be scaled so that each feature contributes approximately proportionately to the final distance.

For exactly this purpose, feature scaling is essential.

In this article, we will discuss the what and why of feature scaling, the techniques used to achieve it, their usefulness, and Python snippets that implement these techniques.

The flow of discussion will be as follows,

  • Feature Scaling
  • Normalization
  • Standardization
  • Implementation
  • When to use what?

Feature scaling

Feature scaling is a technique to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.

Why use Feature Scaling?

  1. Gradient descent converges much faster with feature scaling than without it.
  2. Many classifiers (like KNN and K-means) calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that particular feature, so feature ranges should be scaled such that each feature contributes approximately proportionately to the final distance, as the short sketch below illustrates.
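
As a quick illustration of the second point, here is a minimal sketch in Python (the ages and incomes are made-up numbers for demonstration) showing how an unscaled, large-range feature dominates the Euclidean distance:

import numpy as np

# Two samples as (age, income): the ages differ by 40 years,
# the incomes by 5,000.
a = np.array([25, 50000])
b = np.array([65, 55000])

# Euclidean distance: sqrt(40**2 + 5000**2) ≈ 5000.16.
# The income difference dominates; the 40-year age gap barely registers.
print(np.linalg.norm(a - b))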

However, not every dataset requires feature scaling. It is required only when features have different ranges.

This can be achieved using two widely used techniques.

  1. Normalization
  2. Standardization

Normalization

Normalization (also called min-max normalization) is a scaling technique that rescales features so that the data falls within the range [0, 1].

The normalized form of each feature can be calculated as follows:

x′ = (x − min(x)) / (max(x) − min(x))

Here x is the original value and x′ is the normalized value.

Standardization

Standardization (also called z-score normalization) is a scaling technique that rescales features so that they have the properties of a standard normal distribution with mean μ = 0 and standard deviation σ = 1, where μ is the mean (average) and σ is the standard deviation from the mean.

Standard scores (also called z-scores) of the samples are calculated as follows:

z = (x − μ) / σ

Unlike normalization, this does not bound the features to a fixed range; the standardized values are centered around 0, and most of them fall within a few standard deviations of the mean.

Implementation

Let us now implement these techniques using simple Python code.

The dataset that we will be using is Wine classifier data from Kaggle.

About the dataset:

This data set is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

We start by importing pandas and reading the CSV (comma-separated values) file into a data frame.
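
A minimal sketch of this step (the file name wine_data.csv is an assumption; substitute the path of the file downloaded from Kaggle):

import pandas as pd

# Read the wine data set into a DataFrame
# (the file name is an assumption)
df = pd.read_csv('wine_data.csv')

# Inspect the first five rows
print(df.head())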

Output:

The output of head()

head() returns the first five rows of the data.

Now let us explore each column/feature in detail.
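
The original snippet is not shown here, so the sketch below is one plausible way to visualize a single feature, using seaborn (an assumption) to plot the distribution of alcohol:

import seaborn as sns
import matplotlib.pyplot as plt

# Plot the distribution of the alcohol feature
sns.histplot(df['alcohol'], kde=True)
plt.show()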

Output:

Here we can see that the values of the alcohol feature range from about 11 to 15.
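
A sketch of the summary-statistics step (describe() may equally have been called on the full DataFrame):

# Summary statistics for the alcohol feature
print(df['alcohol'].describe())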

Output:

The output of describe()

We can clearly see the various summary statistics, with a minimum value of around 11.03 and a maximum value of around 14.83.
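
The same kind of distribution plot can be drawn for the malic feature (again a sketch, reusing the seaborn imports above):

# Plot the distribution of the malic feature
sns.histplot(df['malic'], kde=True)
plt.show()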

Output:

Here we can see that the values of the malic feature lie in the range 1–6, with a few outliers.
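
Presumably the same describe() step was applied to malic; a sketch:

# Summary statistics for the malic feature
print(df['malic'].describe())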

Output:

Observations:

  • The alcohol feature ranges over [11, 15]
  • The malic feature ranges over [0.5, 6]
  • When we run a machine learning algorithm in which distances are considered (like KNN), the model will be biased toward the alcohol feature. Hence, feature scaling is necessary for this data set.

Let us now perform normalization and standardization and compare the results.

Applying Normalization

Now, let us apply normalization to each of the features and observe the changes.

Normalization can be achieved using MinMaxScaler.

Applying normalization on the features alcohol and malic
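
A minimal sketch of this step using scikit-learn's MinMaxScaler (the column names follow the data set used above):

from sklearn.preprocessing import MinMaxScaler

# Rescale both features to the range [0, 1]
scaler = MinMaxScaler()
df_norm = pd.DataFrame(
    scaler.fit_transform(df[['alcohol', 'malic']]),
    columns=['alcohol', 'malic']
)
print(df_norm.head())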

Output:

Applying Standardization

Let us now perform standardization.

Standardization can be achieved using StandardScaler().

Applying standardization on the features alcohol and malic
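
A matching sketch using scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# Rescale both features to zero mean and unit standard deviation
scaler = StandardScaler()
df_std = pd.DataFrame(
    scaler.fit_transform(df[['alcohol', 'malic']]),
    columns=['alcohol', 'malic']
)
print(df_std.head())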

Output:

Comparing the results

Let us now plot the data to understand the results.
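
A sketch of the comparison plot, reusing the df_norm and df_std frames from the snippets above (the exact layout is an assumption; the original figure shows three side-by-side scatter plots):

import matplotlib.pyplot as plt

# Scatter plots of alcohol vs. malic for raw, normalized,
# and standardized data
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(12, 4))

ax1.scatter(df['alcohol'], df['malic'])
ax1.set_title('Raw data')

ax2.scatter(df_norm['alcohol'], df_norm['malic'])
ax2.set_title('Normalized data')

ax3.scatter(df_std['alcohol'], df_std['malic'])
ax3.set_title('Standardized data')

for ax in (ax1, ax2, ax3):
    ax.set_xlabel('alcohol')
    ax.set_ylabel('malic')

plt.tight_layout()
plt.show()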

Output:

Scatter plot of Raw data, Normalized data, Standardized data

Observations:

  • In the raw data, the alcohol feature lies in [11, 15] and the malic feature lies in [0, 6]
  • In the normalized data, both the alcohol and malic features lie in [0, 1]
  • In the standardized data, the alcohol and malic features are centered at 0

When to use what?

“ Normalization or Standardization?” — There is no obvious answer to this question: it really depends on the application.

For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is Principal Component Analysis, where we usually prefer standardization over normalization, since we are interested in the components that maximize the variance.

However, this doesn’t mean that normalization is never useful! A popular application is image processing, where pixel intensities have to fit within a certain range (e.g., 0 to 255 for the RGB color range). Also, typical neural network algorithms work best with data on a 0–1 scale.

Thanks for the read. I am going to write more beginner-friendly posts in the future too. Follow me on Medium to be informed about them. I welcome feedback and can be reached on Twitter at ramya_vidiyala and LinkedIn at RamyaVidiyala. Happy learning!
