
Feature engineering is the process of transforming raw data into a dataset that a model can learn from. There are various feature engineering techniques available out there, but the two most widely used and most commonly confused are:
- Normalization
- Standardization
Today we will explore both of these techniques and look at some of the common assumptions data analysts make while solving a Data Science problem. Also, the whole code for this tutorial can be found on my GitHub Repository below
Normalization
Theory
Normalization is the process of converting a numerical feature into a standard range of values, typically either [-1, 1] or [0, 1]. For example, suppose we have a dataset comprising two features named "Age" and "Weight", as shown below:

Suppose the actual range of the "Age" feature is 5 to 100. We can normalize these values into the range [0, 1] by subtracting 5 from every value of the "Age" column and then dividing the result by 95 (100 - 5). To make things clear, we can write the above as a formula:

x̄^(j) = (x^(j) - min^(j)) / (max^(j) - min^(j))

where min^(j) and max^(j) are the minimum and the maximum values of the feature j in the dataset.
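As a quick sanity check, here is a minimal sketch of the formula above, using a few made-up "Age" values spanning the 5-to-100 range from the example:

```python
import numpy as np

# Made-up "Age" values covering the example range [5, 100]
age = np.array([5.0, 20.0, 47.5, 100.0])

# Min-max normalization: subtract the minimum, divide by the range
age_normalized = (age - age.min()) / (age.max() - age.min())

print(age_normalized)  # every value now lies in [0, 1]
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.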
Implementation
Now that you know the theory behind it, let's see how to put it into practice. As usual, there are two ways to implement this: the traditional old-school manual method, and the sklearn preprocessing library. Today let's take the help of the sklearn library to perform normalization.
Using sklearn preprocessing – MinMaxScaler
Before feeding the "Age" and "Weight" values to the scaler we need to convert these DataFrame columns into numpy arrays. To do this we can use the to_numpy() method as shown below:
# Storing the column "Age" values into X and the "Weight" values into y
X = df['Age'].to_numpy()
y = df['Weight'].to_numpy()
The above step is important because fit_transform() expects a 2-D array of shape (n_samples, n_features), so each 1-D array must also be reshaped into a single-feature column:
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler implements the min-max formula described above
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X.reshape(-1, 1))
y_scaled = scaler.fit_transform(y.reshape(-1, 1))

(A note on naming: sklearn also has a Normalizer class, but it does something different; it rescales each sample, i.e. each row, to unit norm rather than mapping a feature into [0, 1].)
As seen above, both arrays now have their values in the range [0, 1]. More details about the library can be found below:
When should we actually normalize the data?
Although normalization is not mandatory (a must-do thing), there are two ways it can help you:
- Normalizing the data will increase the speed of learning, both while training and while testing the model. Give it a try!!
- It will help avoid numeric overflow. What this really means is that normalization ensures our inputs lie in a relatively small range, which avoids problems because computers usually struggle with very small or very large numbers.
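To make the overflow point concrete, here is a small sketch (with made-up values) showing how exp() overflows on large raw inputs but stays finite once the inputs are normalized into [0, 1]:

```python
import numpy as np

# Made-up raw feature values, large enough to overflow exp()
x = np.array([500.0, 710.0, 800.0])

with np.errstate(over='ignore'):
    raw = np.exp(x)  # float64 overflows past ~709 and produces inf
print(raw)

# After min-max normalization the inputs lie in [0, 1] and exp() stays finite
x_norm = (x - x.min()) / (x.max() - x.min())
print(np.exp(x_norm))
```

Exponentials like this show up inside sigmoid and softmax activations, which is one reason keeping inputs in a small range matters in practice.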
Standardization
Theory
Standardization (also known as z-score normalization) is a technique of rescaling the values of a feature so that they have the properties of a standard normal distribution with μ = 0 (the mean, i.e. the average value of the feature) and σ = 1 (the standard deviation from the mean). This can be written as:

x̄^(j) = (x^(j) - μ^(j)) / σ^(j)

where μ^(j) and σ^(j) are the mean and the standard deviation of the feature j.
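A minimal numeric sketch of the standardization formula, again with made-up values:

```python
import numpy as np

# Made-up feature values; the exact numbers are only illustrative
x = np.array([5.0, 20.0, 47.5, 100.0])

# Standardization: subtract the mean, then divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean())  # approximately 0 (up to floating-point error)
print(z.std())   # approximately 1
```

After the transformation the values are centered at zero with unit spread, exactly the μ = 0, σ = 1 properties described above.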
Implementation
Now there are plenty of ways to implement standardization. Just as with normalization, we can use the sklearn library, this time with the StandardScaler class, as shown below:
from sklearn.preprocessing import StandardScaler

# fit_transform learns the mean and standard deviation of each
# column and applies the scaling in one step
sc = StandardScaler()
X_std = sc.fit_transform(X.reshape(-1, 1))
y_std = sc.fit_transform(y.reshape(-1, 1))
You can read more about the library from below:
Z-Score Normalization
Similarly, we can use the pandas mean() and std() methods to achieve the same result:
# Calculating the mean and standard deviation
df = (df - df.mean())/df.std()
print(df)

Min-Max scaling
Here we can use the pandas min() and max() methods:
# Calculating the minimum and the maximum
df = (df-df.min())/(df.max()-df.min())
print(df)

Usually, the Z-score normalization is preferred because min-max scaling is sensitive to outliers: a single extreme value compresses all the other values into a very narrow band.
When to actually use Standardization and Normalization?
There is no single answer to the above question. If you have a small dataset and sufficient time, you can experiment with both of the above techniques and choose the one that works best. Below are the rules of thumb that you can follow:
- You can use standardization with unsupervised learning algorithms. In those cases, standardization tends to be more beneficial than normalization.
- If the feature's distribution looks like a bell curve, standardization is preferable. To check this, you will have to plot your data.
- If your dataset has extremely high or low values (outliers), standardization is preferred, because normalization would compress the ordinary values into a very small range.
In any cases other than the ones given above, normalization holds good. Again, if you have enough time, experiment with both feature engineering techniques.
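To illustrate the outlier rule of thumb, here is a small sketch with made-up values, where a single extreme value dominates min-max scaling:

```python
import numpy as np

# A made-up feature where one extreme outlier (1000) dominates
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Min-max scaling: the outlier maps to 1 and squeezes every
# ordinary value into a sliver near 0
minmax = (x - x.min()) / (x.max() - x.min())
print(minmax)  # the first four values are all below 0.01

# Standardization has no fixed bounds, so the values are not
# forced into the outlier-dominated [0, 1] range
z = (x - x.mean()) / x.std()
print(z)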
Alright, you guys have reached the end of the tutorial. I hope you learned a thing or two today. I used "The Hundred-Page Machine Learning Book" by Andriy Burkov (Chapter 5) as a reference while writing this tutorial; do have a look at it. If you have any doubts regarding this tutorial, you can use the comment section down below and I will try to answer as soon as possible. Until then, Stay Safe, Good Bye. See you next time. Follow Datafied to read and write more Python notebooks.