
Oftentimes, the input features in our data can have different units of measurement. As a result, each feature can have its own unique distribution of values.
Unfortunately, feeding features with very different scales into a model can bias it towards the features with larger values and higher variance.
Feature scaling addresses this issue by fitting all data to a specific scale, which is why it is often a necessary component in feature engineering.
The two most common methods of feature scaling are standardization and normalization.
Here, we explore the ins and outs of each approach and delve into how one can determine the ideal scaling method for a machine learning task.
Standardization
Standardization entails rescaling data so that it has the key properties of a standard normal distribution: a mean of 0 and a standard deviation of 1.
Visualizing standardization
To better understand standardization, it would help to visualize its effects on some data.
We’ll perform standardization on 1000 random values ranging from 1 to 1000, then display histograms of the distribution before and after scaling.
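The snippet below is a minimal sketch of this experiment; the uniform distribution, random seed, and use of scikit-learn's StandardScaler are my own choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Generate 1000 random values between 1 and 1000
rng = np.random.default_rng(42)
values = rng.uniform(1, 1000, size=1000).reshape(-1, 1)

# Standardize: subtract the mean, divide by the standard deviation
scaled = StandardScaler().fit_transform(values)

# Compare the distributions before and after scaling
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=30)
ax1.set_title('Before standardization')
ax2.hist(scaled, bins=30)
ax2.set_title('After standardization')
plt.show()
```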

From the histograms, we can see the effect of the scaling: the data now has a mean of 0 and a standard deviation of 1.
However, even with a considerable change in values, the shape of the distribution is left intact by the transformation. This is key in scaling, as the information stored in the features has to be preserved.
The Math
So, how exactly are these new values generated?
The formula used to derive standardized values is as follows:
z = (x - μ) / σ

where x is a raw value, μ is the mean, and σ is the standard deviation.
In layman’s terms, standardization transforms values based on the mean and the standard deviation of the data in question.
To see how this formula is implemented, let’s use it in an example.
Suppose we are working with the following data:
Training set: [1, 4, 5, 11]
Testing set: [7]
In this case, the mean is 5.25 and the standard deviation is roughly 3.63 (using the population formula). Keep in mind that we do not consider the testing set when determining these parameters.
Knowing all the necessary information, we can standardize the raw values with a simple plug-and-chug.
Training set: (1 - 5.25) / 3.63 ≈ -1.17, (4 - 5.25) / 3.63 ≈ -0.34, (5 - 5.25) / 3.63 ≈ -0.07, (11 - 5.25) / 3.63 ≈ 1.58
Testing set: (7 - 5.25) / 3.63 ≈ 0.48
To verify these results, we can perform the same operation in Python.
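A quick sketch using scikit-learn's StandardScaler, which, like our calculation, uses the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [4], [5], [11]])
X_test = np.array([[7]])

# Fit the scaler on the training data only, then scale both sets
scaler = StandardScaler()
print(scaler.fit_transform(X_train).ravel())  # [-1.17 -0.34 -0.07  1.58] (rounded)
print(scaler.transform(X_test).ravel())       # [0.48] (rounded)
```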

Normalization
Normalization entails scaling data to fit within a specific range.
While this range can be of your choosing, normalization typically fits data into a range of 0 to 1.
Visualizing normalization
Again, visualization can help give insight into the effects of normalization on data.
We can perform normalization on the same 1000 random numbers and use histograms to see how they change after being scaled.
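A sketch of the same experiment with scikit-learn's MinMaxScaler (again, the data generation choices are my own):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# The same 1000 random values between 1 and 1000
rng = np.random.default_rng(42)
values = rng.uniform(1, 1000, size=1000).reshape(-1, 1)

# Normalize into the default [0, 1] range
scaled = MinMaxScaler().fit_transform(values)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=30)
ax1.set_title('Before normalization')
ax2.hist(scaled, bins=30)
ax2.set_title('After normalization')
plt.show()
```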

As shown in the plots, the distribution has shifted and shrunk, with all values now falling between 0 and 1.
Similar to standardization, normalization does not alter the shape of the distribution; it only shifts and rescales the values, so the information stored in the feature is preserved.
The Math
Normalization adheres to a simple formula:
x_scaled = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum values in the data.
This means that normalization transforms values based on the minimum and maximum values in the distribution.
We can repeat the previous exercise by performing normalization on the same made-up data:
Training set: [1, 4, 5, 11]
Testing set: [7]
In this case, the minimum value is 1 and the maximum value is 11.
Once again, the testing set is not considered when deriving these parameters. This means that even if the testing set had values less than 1 or greater than 11, the minimum and maximum values in the formula would still be 1 and 11, respectively, and such values would simply be mapped outside the [0, 1] range.
With this information, the normalized values are easy to derive:
Training set: (1 - 1) / 10 = 0.0, (4 - 1) / 10 = 0.3, (5 - 1) / 10 = 0.4, (11 - 1) / 10 = 1.0
Testing set: (7 - 1) / 10 = 0.6
We can verify these results by performing the same operation in Python.
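A sketch using scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1], [4], [5], [11]])
X_test = np.array([[7]])

# As before, the scaler is fit on the training data only
scaler = MinMaxScaler()
print(scaler.fit_transform(X_train).ravel())  # [0.  0.3 0.4 1. ]
print(scaler.transform(X_test).ravel())       # [0.6]
```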

An important note
As mentioned in the examples, the testing data should not be considered when determining the parameters for a scaling operation.
The parameters used in any scaling should always be determined by the training data alone. While the testing data is also transformed, its values are only scaled based on the parameters derived from the training data.
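In scikit-learn terms, this means calling fit (or fit_transform) only on the training data and just transform on the testing data. A minimal sketch of the pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [4], [5], [11]])
X_test = np.array([[7]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # parameters come from the training data
X_test_scaled = scaler.transform(X_test)        # testing data reuses those parameters

# Calling scaler.fit_transform(X_test) here would instead derive new
# parameters from the testing set, leaking information into the model.
```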
If you’re unsure of why such an arrangement is necessary, check out my article on data leakage.
Which should you use?
Although standardization and normalization have the same basic function, they utilize different approaches. As a result, their use cases differ as well.
Standardization is ideal for data that follows a normal/Gaussian distribution.
It also copes better with outliers: because min-max normalization's range is set entirely by the most extreme values, a single outlier can squash the remaining values into a narrow interval, whereas standardization is less sensitive to such extremes. Standardization is also commonly applied before PCA, where the aim is to maximize variance while reducing dimensionality.
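For example, a typical scikit-learn workflow chains the two steps in a pipeline. A sketch, with the dataset and component count chosen purely for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Standardize first so no feature dominates the variance just because of its scale
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # (150, 2)
```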
Normalization, on the other hand, is the safer alternative when you are unsure of the distribution of your data.
All in all, determining the best feature scaling method in a machine learning task requires a strong understanding of the data being used.
Conclusion

Now you understand how standardization and normalization are used to scale data. Although they serve a similar purpose, they take different approaches, so the right choice varies depending on the situation.
Understanding the differences between the two techniques is essential as it enables you to determine and apply the best scaling method for any given situation. This will optimize your feature engineering and boost the overall performance of your models.
I wish you the best of luck in your Data Science endeavors!