Normalization vs Standardization: Which One Is Better?

In this tutorial, let us see which of these feature engineering techniques is the best of them all.

Image credits to Author (Tanu Nanda Prabhu)

As we all know, feature engineering is the problem of transforming raw data into a dataset. There are various feature engineering techniques available out there. The two most widely used, and most commonly confused, are:

  • Normalization
  • Standardization

Today, on this beautiful day (or night), we will explore both of these techniques and see some of the common assumptions data analysts make while solving a Data Science problem. The whole code for this tutorial can be found in my GitHub repository below.

Tanu-N-Prabhu/Python


Normalization

Theory

Normalization is the process of converting a numerical feature into a standard range of values, typically [-1, 1] or [0, 1]. For example, suppose we have a dataset comprising two features named "Age" and "Weight", as shown below:

Image Credits to Author (Tanu Nanda Prabhu)
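The table above is an image, so for the code in this tutorial to be runnable on its own, here is a minimal sketch of such a dataset. The values below are made up for illustration only; the actual numbers live in the image above.

import pandas as pd

# Hypothetical Age/Weight values, chosen so that Age spans 5 to 100
# as described in the text below. For illustration only.
df = pd.DataFrame({
    'Age':    [5, 15, 25, 35, 45, 60, 80, 100],
    'Weight': [20, 48, 60, 75, 80, 83, 78, 72]
})
print(df)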

Suppose the actual range of the feature "Age" is 5 to 100. We can normalize these values into the range [0, 1] by subtracting 5 from every value of the "Age" column and then dividing the result by 95 (100 − 5). For example, an Age of 43 becomes (43 − 5) / 95 = 0.4. To make things clear, we can write the above as a formula.

$$\bar{x}^{(j)} = \frac{x^{(j)} - \min^{(j)}}{\max^{(j)} - \min^{(j)}}$$

Formula from The Hundred-Page Machine Learning Book by Andriy Burkov

where min^(j) and max^(j) are the minimum and maximum values of feature j in the dataset.


Implementation

Now that you know the theory behind it, let's see how to put it into practice. As usual, there are two ways to implement this: the traditional old-school manual method, and the sklearn preprocessing library. A minimal manual version is sketched right below; for the rest of the tutorial, let's take the help of the sklearn library to perform normalization.
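As a quick illustration of the formula above, here is a minimal sketch of the manual method, assuming the illustrative df defined earlier:

# Manual min-max normalization: apply the formula column by column
age_norm = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
print(age_norm)  # every value now falls in [0, 1]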

Using sklearn preprocessing – Normalizer

Before feeding the "Age" and "Weight" values directly to the method, we need to convert these columns into numpy arrays. To do this, we can use the to_numpy() method as shown below:

# Store the 'Age' column in X and the 'Weight' column in y
X = df['Age']
y = df['Weight']
X = X.to_numpy()
y = y.to_numpy()

The above step is very important, because both the fit() and the transform() methods expect a 2-D array-like; that is why each array is wrapped in a list ([X], [y]) in the calls below.

from sklearn.preprocessing import Normalizer
normalizer = Normalizer().fit([X])
normalizer.transform([X])
Output: the normalized "Age" values (image by the author)
normalizer = Normalizer().fit([y])
normalizer.transform([y])
Output: the normalized "Weight" values (image by the author)

As seen above, both arrays now have values in the range [0, 1]. One caveat worth knowing: sklearn's Normalizer rescales each sample (row) to unit norm (L2 by default) rather than applying the min-max formula from the theory section; the values land in [0, 1] here because the inputs are non-negative. More details about the library can be found below:

6.3. Preprocessing data – scikit-learn 0.22.2 documentation
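If you want the exact min-max behavior described in the theory section, scikit-learn's MinMaxScaler implements that formula directly. A minimal sketch, assuming the X and y arrays from above:

from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler applies (x - min) / (max - min) per column,
# so each feature is reshaped into a single column of shape (n_samples, 1)
mms = MinMaxScaler()
X_minmax = mms.fit_transform(X.reshape(-1, 1))
y_minmax = mms.fit_transform(y.reshape(-1, 1))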


When should we actually normalize the data?

Although normalization is not mandatory (a must-do thing), there are two ways it can help you:

  • Normalizing the data will increase the speed of learning, both while training and while testing the model. Give it a try!!
  • It will avoid numeric overflow. What this really means is that normalization ensures our inputs stay in a relatively small range, which avoids problems because computers struggle with very small or very large numbers (see the sketch after this list).
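As a toy illustration of the overflow point (my own example, not from the book), consider what happens when large raw values hit an exponential, as in logistic or softmax functions:

import numpy as np

raw = np.array([800.0, 900.0, 1000.0])  # large, unnormalized inputs
norm = (raw - raw.min()) / (raw.max() - raw.min())

print(np.exp(raw))   # overflows to inf (with a RuntimeWarning)
print(np.exp(norm))  # stays in a comfortable range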

Standardization

Theory

Standardization (also called z-score normalization) is a technique of rescaling the values of a dataset so that they have the properties of a standard normal distribution, with μ = 0 (the mean, i.e., the average value of the feature) and σ = 1 (the standard deviation from the mean). This can be written as:

$$\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sigma^{(j)}}$$

Formula from The Hundred-Page Machine Learning Book by Andriy Burkov

Implementation

Now there are plenty of ways to implement standardization. Just as with normalization, we can use the sklearn library, this time with the StandardScaler method, as shown below:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# StandardScaler standardizes each column, so reshape each
# feature into a single column of shape (n_samples, 1)
X_std = sc.fit_transform(X.reshape(-1, 1))
y_std = sc.fit_transform(y.reshape(-1, 1))
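As a quick sanity check (a sketch, assuming the arrays above), the standardized values should have mean 0 and standard deviation 1:

import numpy as np

print(np.isclose(X_std.mean(), 0.0))  # True: the mean is (numerically) zero
print(np.isclose(X_std.std(), 1.0))   # True: the population std (ddof=0) is one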

You can read more about the library from below:

6.3. Preprocessing data – scikit-learn 0.22.2 documentation


Z-Score Normalization

Similarly, we can use pandas' mean and std to do the needful. (Note that pandas' std() uses the sample standard deviation, ddof=1, so the result differs slightly from sklearn's StandardScaler, which uses the population standard deviation.)

# Calculating the mean and standard deviation
df = (df - df.mean())/df.std()
print(df)
Output: the z-score standardized dataframe (image by the author)

Min-Max scaling

Here we can use pandas' min and max to do the needful. Strictly speaking, this is the normalization formula from earlier in the article rather than standardization, but the two are often shown side by side:

# Calculating the minimum and the maximum 
df = (df-df.min())/(df.max()-df.min())
print(df)
Output: the min-max scaled dataframe (image by the author)

Usually, z-score normalization is preferred, because min-max scaling is sensitive to outliers: a single extreme value claims one end of the fixed [0, 1] range and squashes all the other values into a narrow band, as the sketch below shows.
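To see this, here is a toy example (my own, not from the book) with one outlier:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 1000])  # one extreme outlier

# Min-max scaling is bounded to [0, 1], so the outlier maps to 1.0
# and every other value is squashed into a tiny slice near 0
minmax = (s - s.min()) / (s.max() - s.min())

# Z-scores are not forced into a fixed range; the outlier simply
# receives a large score instead of distorting a bounded scale
zscore = (s - s.mean()) / s.std()

print(pd.DataFrame({'minmax': minmax, 'zscore': zscore}))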


When to actually use Standardization and Normalization?

There is no single answer to the above question. If you have a small dataset and sufficient time, then you can experiment with both of the above techniques and choose the one that performs better. Below is a rule of thumb that you can follow:

  • You can use standardization on unsupervised learning algorithms. In this case, standardization is more beneficial than normalization.
  • If you see a bell curve in your data, then standardization is preferable. To check this, you will have to plot your data.
  • If your dataset has extremely high or low values (outliers), then standardization is preferred, because normalization would compress the non-outlier values into a small range.

In any case other than the ones given above, normalization holds good. Again, if you have enough time, experiment with both feature engineering techniques.


Alright, you have reached the end of the tutorial. I hope you learned a thing or two today. I used "The Hundred-Page Machine Learning Book" by Andriy Burkov as a reference (Chapter 5) to write this tutorial; you can have a look at it. If you have any doubts regarding this tutorial, you can use the comment section down below, and I will try to answer as soon as possible. Until then, stay safe and goodbye; see you next time. For more updates, follow Datafied to read and write more Python notebooks.

Datafied

