The elementary concept for data transformation

If you enjoy my content and want to get more in-depth knowledge regarding data or just daily life as a Data Scientist, please consider subscribing to my newsletter here.
Introduction
What is Data Transformation? I am pretty sure anybody who is learning data and statistics will come across this term at some point. Data transformation refers to a mathematical function applied to each value in the dataset to replace that value with a new one. In mathematical terms, we could express it as x' = f(x), where f is the transformation function applied to each original value x.
To put it more simply, data transformation is a process that changes your data into new data via a mathematical function.
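As a minimal illustration (using a made-up array of values, not the dataset discussed later), applying a function such as the square root to every element is already a data transformation:
import numpy as np
#A made-up array of values, for illustration only
values = np.array([1.0, 4.0, 9.0, 16.0])
#Apply the same function f to every value: x' = f(x)
transformed = np.sqrt(values)
print(transformed)  #[1. 2. 3. 4.]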
Why do we need to do Data Transformation? Is there any benefit to transforming data? From a statistical point of view, the reasons are:
- Transforming data allows you to fulfill certain statistical assumptions, e.g., Normality, Homogeneity, Linearity, etc.
- Data transformation scales the values from different columns so they become comparable, e.g., Salary in USD (range: 100–10000) with Weight in Kilograms (range: 20–100).
Data transformation is useful for gaining new insight and reducing noise in your data. However, utilizing a data transformation method requires you to understand the transformation’s effect, its implications, and how to draw conclusions from the transformed data. In my opinion, you should only do data transformation if it is necessary and you understand your transformation goal.
What are the methods for data transformation? According to McCune and Grace (2002) in their Analysis of Ecological Communities Book, the methods are:
- Monotonic Transformation
- Relativizations (Standardization)
- Probabilistic Transformation (Smoothing)
If the terms above sound unfamiliar to you, it’s alright. Let’s explore all these methods deeper!
One disclaimer I would make is that you need to be careful when doing Data Transformation, because you end up with transformed data, which is not your original data anymore. Understand the purpose of the data transformation and report any transformation you have done.
Monotonic Transformation
What is Monotonic Transformation? It is a data transformation method that applies a mathematical function to each data value independently of the other data. The word monotonic comes from the method’s procedure, which transforms the data values without changing their rank. In simpler terms, a Monotonic Transformation changes each value without relying on the other data and does not change the values’ rank within the column.
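To make the rank-preserving property concrete, here is a minimal sketch (using made-up values, not the salary dataset introduced later) showing that a log transform leaves the ordering of the values untouched:
import numpy as np
import pandas as pd
#Made-up values, for illustration only
x = pd.Series([50, 10, 200, 75])
#Apply a monotonic transformation (log base 10)
log_x = np.log10(x)
#The ranks are identical before and after the transformation
print(x.rank().equals(log_x.rank()))  #True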
An example of a renowned Monotonic Transformation function is the Logarithmic Transformation or Log Transformation. Just like the name implies, the Log Transformation changes your data values into their logarithmic values by applying a log function to each value. Many variables follow log-normal distributions, meaning the values would follow a normal distribution after the log transformation. This is one of the benefits of the Log Transformation: to meet the assumption of normality, or at least come close to it.
In mathematical terms, the Log Transformation is expressed as x' = log(x), where each value x is replaced by its logarithm.
Let’s try the log transformation method with sample data. I would use the data from Kaggle regarding the Engineering Graduate Salary. First, read the data into a data frame.
import numpy as np
import pandas as pd
import seaborn as sns
#Read the Engineering Graduate Salary dataset into a data frame
data = pd.read_csv('Engineering_graduate_salary.csv')
There are 33 features in this dataset, but I would not use all of them. This data is meant to show what affects the salary, so let’s try to visualize the salary distribution.
sns.distplot(data['Salary'])

As we can see in the image above, the salary feature is not normally distributed. Let’s apply the log transformation to bring the data closer to a normal distribution.
#Salary Log Transformation with base 10
data['log10_Salary'] = data['Salary'].apply(np.log10)
With a single line, we have transformed the data into log base 10 values. Let’s try to visualize it once more.
sns.distplot(data['log10_Salary'])

The salary data is now closer to a normal distribution. We could check the normality using a formal test such as the Shapiro-Wilk test, but I would not explain that concept in this article.
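If you do want to run that check, here is a minimal sketch with scipy (assuming the log10_Salary column created above; this is only a quick illustration, not a full treatment of the test):
from scipy.stats import shapiro
#Shapiro-Wilk test: the null hypothesis is that the data comes from a normal distribution
stat, p_value = shapiro(data['log10_Salary'])
print('Statistic:', stat, 'p-value:', p_value)
#A p-value below your chosen threshold (e.g., 0.05) suggests the data is still not perfectly normal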
Another purpose of data transformation is to gain better insight into the relationships within the data. For example, say I am only interested in the relationship between the college GPA and the Engineering graduate’s salary. Let’s try to visualize it with a scatterplot.
sns.scatterplot(x = 'Salary', y = 'collegeGPA', data = data)

I am trying to visualize the relationship between the Salary and the college GPA, and I ended up with a cluster of points that offers little insight. This is one of the moments where we could apply the log transformation to rescale the data and get better clarity.
sns.scatterplot(x = 'log10_Salary',y = 'collegeGPA', data = data)

The Salary and college GPA relationship is much clearer now: there is not much relationship between the GPA and the Salary. However, what we have done so far is visualize the relationship between a log-transformed value and an untransformed feature. Let’s transform the college GPA feature as well and visualize the relationship.
data['log10_collegeGPA'] = data['collegeGPA'].apply(np.log10)
sns.scatterplot(x = 'log10_Salary',y = 'log10_collegeGPA', data = data)

The relationship is seen far more clearly now compared to when we visualized it without any data transformation. This is another reason to do a data transformation.
It is especially useful when you need to present the data relationship to business users but your data is clustered together, making it hard to get any insight.
There are many methods within Monotonic Transformation. Still, I would not explain them in this article, as I plan to write another article outlining the other Monotonic Transformation methods. What is important is that you understand what Monotonic Transformation is.
Relativizations (Standardization)
Relativizations or Standardization is a data transformation method where the data values are transformed by a column or row statistic (e.g., Max, Sum, Mean). It differs from Monotonic Transformation in that Standardization is not independent; it relies on a statistic computed from the other data.
You would often need Standardization when you encounter attributes with different units and your analysis needs the data to be on a similar scale. Examples are clustering analysis or dimensionality reduction, which rely on distances between data points, as the sketch below illustrates.
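To illustrate why units matter for distance-based methods, here is a minimal sketch with two made-up observations (hypothetical salary and weight values, not taken from the dataset) showing that the large-unit feature dominates the Euclidean distance until the features are rescaled:
import numpy as np
#Two made-up observations: [salary in USD, weight in kg]
a = np.array([9000.0, 60.0])
b = np.array([8000.0, 90.0])
#Without scaling, the salary difference dominates the distance
print(np.linalg.norm(a - b))  #~1000.45
#After rescaling both features to a comparable range, weight matters again
a_scaled = np.array([9000.0 / 10000, 60.0 / 100])
b_scaled = np.array([8000.0 / 10000, 90.0 / 100])
print(np.linalg.norm(a_scaled - b_scaled))  #~0.32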
The most famous standardization method is Z-score standardization, where the data is transformed using the mean and standard deviation of the feature to be scaled. The transformed feature’s mean would be ~0 and its standard deviation ~1. After the Z-score standardization, the transformed values themselves are called Z-scores. In mathematical notation, it is expressed in the equation below.
z = (x - μ) / σ
where x = value in the feature, μ = feature mean, and σ = feature standard deviation.
One note to remember: even though Z-score standardization rescales your data to the standard normal convention (mean 0, standard deviation 1), the feature distribution itself does not necessarily become normal. The point of Z-score standardization is to rescale the feature, after all.
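The formula is simple enough to apply by hand; here is a minimal sketch with pandas (assuming the Salary data from the earlier example; Z_Salary_manual is just an illustrative column name), before we switch to scikit-learn’s reusable scaler below:
#Apply z = (x - mean) / std directly to the Salary column
data['Z_Salary_manual'] = (data['Salary'] - data['Salary'].mean()) / data['Salary'].std()
print(data['Z_Salary_manual'].head())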
Now let’s try the Z-score standardization on the dataset with scikit-learn. First, we need to import the package we want to use.
#Import Z-Score Standard Scaler from the Sklearn package
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Let’s say I want to rescale the Salary data from our previous example. Here are our original data and statistics.
data['Salary'].head()

data['Salary'].agg(['mean', 'std'])

Our data is in the ten-thousands, with the Salary mean shown in the image above. Next, we transform the data into Z-scores using the scaler we imported previously, but first, we need to fit the scaler (this is the process of acquiring the mean and standard deviation of the feature).
scaler.fit(np.array(data['Salary']).reshape(-1, 1))
If you want to double-check whether our scaler obtained the correct mean and standard deviation, we could access the values with the code below.
print('Salary Mean', scaler.mean_)
print('Salary STD', np.sqrt(scaler.var_))

The result is slightly different but almost negligible; the small gap comes from StandardScaler using the population standard deviation (dividing by n) while pandas’ std defaults to the sample standard deviation (dividing by n-1). Let’s transform our Salary data into Z-score values.
data['Z_Salary'] = scaler.transform(np.array(data['Salary']).reshape(-1, 1))
data[['Salary', 'Z_Salary']].head()

We can now see the difference between the original data and the transformed data. A negative Z-score means the value is below the mean, and vice versa. Let’s examine the transformed data’s statistics.
data['Z_Salary'].agg(['mean', 'std'])

As you can see, the transformed data’s mean is close to 0 and its std is close to 1. Every feature scaled using Z-Score Standardization would follow the same standard.
There is another benefit specific to Z-Score Standardization, and that is outlier detection. I would not explain it in detail, but basically, this outlier detection concept is related to the empirical rule: any Z-score greater than 3 or less than -3 is considered an outlier.
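As a rough sketch of that rule (assuming the Z_Salary column we created above), you could flag potential outliers like this:
#Flag rows whose salary Z-score falls outside the -3 to 3 range
outliers = data[data['Z_Salary'].abs() > 3]
print(len(outliers), 'potential salary outliers')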
Like Monotonic Transformation, Relativizations or Standardization includes many methods, but it would take another article to cover them in more detail.
Probabilistic Transformation (Smoothing)
Probabilistic Transformation or Smoothing is a data transformation process that eliminates noise in the data to enhance the strongest pattern within it.
The transformation is particularly effective on heterogeneous or noisy data. The smoothing process allows you to see data patterns that were previously unseen. However, you need to be careful when interpreting the result of the smoothing process: it could show you a trend that looks reliable even in random data.
The most common Smoothing technique is Kernel Density Estimation (KDE) smoothing. This technique smooths the data by estimating the probability density function of the population random variable based on a finite data sample.
Let’s try to smooth the data sample to obtain the data pattern. For example, I want to see the distribution of the computer programming data.
sns.distplot(data['ComputerProgramming'], kde = False)

The binned histogram pattern is visible there, but we might want to eliminate any noise that could distract us from the real pattern. Let’s use KDE smoothing to acquire that pattern.
sns.distplot(data['ComputerProgramming'], hist = False)

With the smoothing technique, we transformed the data into density values using the estimated probability density function of the data. As we can see, there are two peaks within our data: one at 0 and one near ~500, with the higher peak being the latter.
This pattern is only visible when we smooth the data. You might object that the smoothed pattern seems to differ from the binned histogram. Remember the purpose of smoothing? It is to eliminate noise in the data and enhance the strongest pattern. Moreover, KDE estimates the population’s probability density function based on the sample data, which means the smoothed pattern is an estimate of what would happen in the population.
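If you want to compute that density estimate yourself rather than letting seaborn draw it, here is a minimal sketch with scipy (assuming the same data frame; gaussian_kde with its default bandwidth is conceptually what the curve above shows, though the exact shape may differ slightly):
from scipy.stats import gaussian_kde
import numpy as np
#Fit a Gaussian kernel density estimate to the ComputerProgramming sample
kde = gaussian_kde(data['ComputerProgramming'].dropna())
#Evaluate the estimated density on a grid of score values
grid = np.linspace(data['ComputerProgramming'].min(), data['ComputerProgramming'].max(), 200)
density = kde(grid)
print(grid[density.argmax()])  #score value where the estimated density peaks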
There are still many Probabilistic Transformation or Smoothing methods you could learn, but I would leave them for another time. The important point of Smoothing is to transform your data to eliminate noise and enhance the pattern.
Conclusion
Data Transformation is a technique that Data Scientists should know because of its benefits. According to McCune and Grace (2002) in their Analysis of Ecological Communities Book, there are three methods for Data Transformation. They are:
- Monotonic Transformation
- Relativizations (Standardization)
- Probabilistic Transformation (Smoothing)
I hope it helps!
Visit me on my LinkedIn or Twitter.
If you are not subscribed as a Medium Member, please consider subscribing through my referral.