Data normalization with Pandas and Scikit-Learn

The complete guide to clean datasets — Part 1

Amanda Iglesias Moreno
Towards Data Science


Photo by Jeremy Perkins on Unsplash

The success of a machine learning algorithm highly depends on the quality of the data fed into the model. Real-world data is often dirty, containing outliers, missing values, wrong data types, irrelevant features, or non-standardized values. The presence of any of these will prevent the machine learning model from learning properly. For this reason, transforming raw data into a useful format is an essential stage in the machine learning process. One technique you will come across multiple times when pre-processing data is normalization.

Data normalization is a common practice in machine learning which consists of transforming numeric columns to a common scale. In many data sets, the values of some features differ from those of others by several orders of magnitude. The features with larger values will dominate the learning process; however, that does not mean those variables are more important for predicting the outcome of the model. Data normalization transforms multiscaled data to a common scale. After normalization, all variables have a similar influence on the model, improving the stability and performance of the learning algorithm.

There are multiple normalization techniques in statistics. In this article, we will cover the most important ones:

  1. The maximum absolute scaling
  2. The min-max feature scaling
  3. The z-score method
  4. The robust scaling

In addition, we will explain how to implement each of them with Pandas and Scikit-Learn.

So, let’s get started 💙

The following data frame contains the inputs (independent variables) of a multiple regression model for predicting the price of a second-hand car: (1) the odometer reading (km) and (2) the fuel economy (km/l). In this article, we use a small data set for learning purposes. However, in the real world, the data sets employed will be much larger.
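As a minimal sketch, the data frame might look like the one below. The column names and values are illustrative, chosen only to match the ranges described next, and are not the article's original data.

```python
import pandas as pd

# Illustrative second-hand car data: odometer reading (km) and fuel economy (km/l)
df = pd.DataFrame({
    "odometer_km": [120000, 250000, 175000, 310000, 400000],
    "fuel_economy_km_l": [11, 10, 17, 15, 13],
})
print(df)
```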

As you can observe, the odometer reading ranges from 120,000 to 400,000, while the fuel economy ranges from 10 to 17. The multiple linear regression model will weight the odometer reading variable more heavily than the fuel economy attribute due to its larger values. However, this does not mean that the odometer reading is a more important predictor. To solve this problem, we have to normalize the values of both variables. ❤️

The maximum absolute scaling

The maximum absolute scaling rescales each feature between -1 and 1 by dividing every observation by its maximum absolute value.

We can apply the maximum absolute scaling in Pandas using the .max() and .abs() methods, as shown below.
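A minimal sketch of that computation, assuming the illustrative df created above:

```python
# Maximum absolute scaling: divide each column by its maximum absolute value
df_max_abs = df / df.abs().max()
print(df_max_abs)
```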

Alternatively, we can use the Scikit-learn library to compute the maximum absolute scaling. First, we create an abs_scaler with the MaxAbsScaler class. Then, we use the fit method to learn the required parameters for scaling the data (the maximum absolute value of each feature). Finally, we transform the data using those parameters.
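A sketch of the Scikit-learn version, again using the illustrative df from above:

```python
from sklearn.preprocessing import MaxAbsScaler

abs_scaler = MaxAbsScaler()             # create the scaler
abs_scaler.fit(df)                      # learn the maximum absolute value of each feature
scaled = abs_scaler.transform(df)       # apply the scaling; returns a NumPy array
df_max_abs_sklearn = pd.DataFrame(scaled, columns=df.columns)
print(df_max_abs_sklearn)
```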

As you can observe, we obtain the same results using Pandas and Scikit-learn. The following plot shows the transformed data after performing the maximum absolute scaling.

The min-max feature scaling

The min-max approach (often called normalization) rescales the feature to a fixed range of [0,1] by subtracting the minimum value of the feature and then dividing by the range.

We can apply the min-max scaling in Pandas using the .min() and .max() methods.
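A minimal sketch with the illustrative df:

```python
# Min-max scaling: subtract the minimum and divide by the range (max - min)
df_min_max = (df - df.min()) / (df.max() - df.min())
print(df_min_max)
```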

Alternatively, we can use the MinMaxScaler class available in the Scikit-learn library. First, we create a scaler object. Then, we fit the scaler parameters, meaning we calculate the minimum and maximum value for each feature. Finally, we transform the data using those parameters.
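A sketch of the Scikit-learn version, under the same assumptions:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()        # default feature range is [0, 1]
scaler.fit(df)                 # learn the per-feature minimum and maximum
df_min_max_sklearn = pd.DataFrame(scaler.transform(df), columns=df.columns)
print(df_min_max_sklearn)
```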

Additionally, we can obtain the minimum and maximum values calculated by the fit function for normalizing the data with the data_min_ and data_max_ attributes.
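For example:

```python
print(scaler.data_min_)   # per-feature minimum learned by fit
print(scaler.data_max_)   # per-feature maximum learned by fit
```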

The following plot shows the data after applying the min-max feature scaling. As you can observe, this normalization technique rescales all feature values to be within the range of [0, 1].

As you can observe, we obtain the same results using Pandas and Scikit-learn. However, if you want to perform many data transformation steps, it is recommended to use the MinMaxScaler as input in a Pipeline constructor instead of performing the normalization with Pandas.

Furthermore, it is important to bear in mind that the maximum absolute scaling and the min-max scaling are very sensitive to outliers because a single outlier can influence the minimum and maximum values and have a big effect on the results.

The z-score method

The z-score method (often called standardization) transforms the data into a distribution with a mean of 0 and a standard deviation of 1. Each standardized value is computed by subtracting the mean of the corresponding feature and then dividing by the standard deviation.

Unlike min-max scaling, the z-score does not rescale the feature to a fixed range. The z-score typically ranges from -3.00 to 3.00 (more than 99% of the data) if the input is normally distributed. However, the standardized values can also be higher or lower, as shown in the picture below.

It is important to bear in mind that z-scores are not necessarily normally distributed. They just scale the data and follow the same distribution as the original input. This transformed distribution has a mean of 0 and a standard deviation of 1 and is going to be the standard normal distribution (see the image above) only if the input feature follows a normal distribution.

We can compute the z-score in Pandas using the .mean() and .std() methods.
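A minimal sketch with the illustrative df:

```python
# Z-score standardization: subtract the mean and divide by the standard deviation
df_z = (df - df.mean()) / df.std()
print(df_z)
```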

Alternatively, we can use the StandardScaler class available in the Scikit-learn library to compute the z-score. First, we create a standard_scaler object. Then, we calculate the parameters of the transformation (in this case, the mean and the standard deviation) using the .fit() method. Next, we call the .transform() method to apply the standardization to the data frame. The .transform() method uses the parameters learned by the .fit() method to perform the z-score transformation.

To simplify the code, we have used the .fit_transform() method which combines both methods (fit and transform) together.
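A sketch of the Scikit-learn version, under the same assumptions:

```python
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
# fit_transform() learns the mean and standard deviation and applies the standardization in one step
df_z_sklearn = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)
print(df_z_sklearn)
```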

As you can observe, the results differ from those obtained using Pandas. The StandardScaler class calculates the population standard deviation, where the sum of squares is divided by N (the number of values).

By contrast, the Pandas .std() method calculates the sample standard deviation, where the denominator of the formula is N-1 instead of N.

To obtain the same results with Pandas, we set the parameter ddof equal to 0 (the default is ddof=1); the divisor used in the calculation is N - ddof.
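For example:

```python
# Use the population standard deviation (divisor N) to match StandardScaler
df_z_population = (df - df.mean()) / df.std(ddof=0)
print(df_z_population)
```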

We can obtain the parameters calculated by the fit method for standardizing the data with the mean_ and scale_ attributes. As you can observe, we obtain the same results in Scikit-learn and Pandas when setting the parameter ddof equal to 0 in the .std() method.
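For example:

```python
print(standard_scaler.mean_)    # per-feature mean learned by fit
print(standard_scaler.scale_)   # per-feature (population) standard deviation learned by fit
```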

The following plot shows the data after applying the z-score method which is computed using the population standard deviation (divided by N).

The Robust Scaling

In robust scaling, we scale each feature of the data set by subtracting the median and then dividing by the interquartile range. The interquartile range (IQR) is defined as the difference between the third and the first quartile and represents the central 50% of the data. Mathematically the robust scaler can be expressed as:
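x_scaled = (x - Q2(x)) / (Q3(x) - Q1(x))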

where Q1(x) is the first quartile of the attribute x, Q2(x) is the median, and Q3(x) is the third quartile.

This method comes in handy when working with data sets that contain many outliers because it uses statistics that are robust to outliers (median and interquartile range), in contrast with the previous scalers, which use statistics that are highly affected by outliers such as the maximum, the minimum, the mean, and the standard deviation.

Let’s see how outliers affect the results after scaling the data with min-max scaling and robust scaling.

The following data set contains 10 data points, one of them being an outlier (variable1 = 30).

Min-max scaling compresses the values of variable 1 towards 0 due to the presence of the outlier, whereas the points of variable 2 remain evenly distributed over the range from 0 to 1.

Before scaling, the first data point has a value of (1, 1): both variable 1 and variable 2 are equal. After the transformation, the value of variable 2 is much larger than that of variable 1 (0.034, 0.142). This is because variable 1 contains an outlier.

By contrast, if we apply robust scaling, both variables end up with the same value (-1.00, -1.00) after the transformation, because both features have the same median and interquartile range; it is the outlier that gets shifted instead.
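As a rough sketch of this behavior (the values below are made up and will not reproduce the exact numbers shown in the article's figures):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical data: variable1 contains an outlier (30), variable2 does not
data = pd.DataFrame({
    "variable1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 30],
    "variable2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

min_max = pd.DataFrame(MinMaxScaler().fit_transform(data), columns=data.columns)
robust = pd.DataFrame(RobustScaler().fit_transform(data), columns=data.columns)

# After min-max scaling, the outlier compresses variable1 toward 0 while
# variable2 spreads evenly over [0, 1]; after robust scaling, the first
# point gets the same value for both variables.
print(min_max)
print(robust)
```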

Now, it is time to apply the robust scaling to the cars data set 💜

As we previously did, we can perform robust scaling using Pandas.

The median is defined as the midpoint of the distribution, meaning 50% of the values of the distribution are smaller than the median. In Pandas, we can calculate it with the .median() or the .quantile(0.5) methods. The first quartile is the median of the lower half of the data set (25% of the values lie below the first quartile) and can be calculated with the .quantile(0.25) method. The third quartile represents the median of the upper half of the data set (75% of the values lie below the third quartile) and can be calculated with the .quantile(0.75) method.
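A minimal sketch with the illustrative df:

```python
# Robust scaling: subtract the median and divide by the interquartile range (IQR)
iqr = df.quantile(0.75) - df.quantile(0.25)
df_robust = (df - df.median()) / iqr
print(df_robust)
```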

As an alternative to Pandas, we can also perform robust scaling using the Scikit-learn library.
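A sketch of the Scikit-learn version, under the same assumptions:

```python
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()   # centers on the median and scales by the IQR by default
df_robust_sklearn = pd.DataFrame(robust_scaler.fit_transform(df), columns=df.columns)
print(df_robust_sklearn)
```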

As shown above, we obtain the same results as before 🙌

The following plot shows the results after transforming the data with robust scaling.

Summary

Data normalization consists of transforming numeric columns to a common scale. In Python, we can implement data normalization in a very simple way. The Pandas library contains multiple built-in methods for calculating the most common descriptive statistical functions which make data normalization techniques really easy to implement. As another option, we can use the Scikit-Learn library to transform the data into a common scale. In this library, the most frequent scaling methods are already implemented.

Besides data normalization, there are multiple data pre-processing techniques we have to apply to guarantee the performance of the learning algorithm. We will cover some of them in future articles. 🙌

Thanks for reading :)

Amanda 💜
