
Data transformation is the process of converting raw data into a format or structure that is more suitable for model building and for data discovery in general. It is an essential step in feature engineering that facilitates discovering insights. This article covers three techniques for transforming numeric data: log transformation, clipping methods, and data scaling.
Why do we need data transformation?
- the algorithm is more likely to be biased when the data distribution is skewed
- transforming data onto the same scale allows the algorithm to compare the relative relationships between data points better
When to apply data transformation
When implementing supervised algorithms, training data and test data need to be transformed in the same way. This is usually achieved by fitting the transformation on the training set and then applying the same fitted transformation to the test set.
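For example, with scikit-learn you would fit a scaler on the training set only and then reuse the fitted scaler on the test set. Here is a minimal sketch (the variable X is a placeholder feature matrix, not a column from this dataset):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# split first, then learn the transformation from the training portion only
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)        # statistics come from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)      # the same fitted scaler is applied to the test set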
Basic Feature Engineering and EDA
For this exercise, I am using the Marketing Analytics dataset from Kaggle.
If you would like to access the full code, please visit my website.
First, I performed some basic feature engineering to make the data tidier and more insightful.
# 1. Transform Year_Birth into Age
import pandas as pd
from datetime import date
df['Age'] = date.today().year - df['Year_Birth']
# 2. Transform Dt_Customer into Enrollment_Length
df['Year'] = pd.DatetimeIndex(df['Dt_Customer']).year
df['Enrollment_Length'] = date.today().year - df['Year']
# 3. Transform currency format into numbers
# strip ",", "$" and "." literally (regex=False prevents "." from matching any character)
df['Income$'] = (df[' Income '].str.replace(',', '', regex=False)
                               .str.replace('$', '', regex=False)
                               .str.replace('.', '', regex=False)
                               .fillna(0).astype(int))
df['Income_M$'] = df['Income$'].apply(lambda x: round(x/1000000, 2))
print(df[['Income_M$', 'Income$']])
1. Transform year of birth ("Year_Birth") into "Age"
Subtract Year_Birth from the current year.
2. Transform the date the customer enrolled ("Dt_Customer") into "Enrollment_Length"
This is similar to the step above, with the extra step of extracting the year from the date feature.
3. Transform currency ("Income") into numbers ("Income_M$")
This involves four steps: 1) clean the data by removing the characters ",", "$" and "."; 2) substitute null values with 0; 3) convert the strings into integers; 4) scale the numbers down to million dollars, which helps with visualizing the data distribution.
Now, let’s visualize the current data distribution using a simple univariate EDA technique – the histogram. It is not hard to see from the plots below that most variables are heavily skewed.
# select numeric variables
numeric_vars = ['Age', 'Enrollment_Length', 'Income_M$', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']
# create histograms for each numeric variable
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(24, 15))
for i in range(len(numeric_vars)):
    var = numeric_vars[i]
    sub = fig.add_subplot(3, 5, i + 1)
    sub.set_xlabel(var)
    df[var].plot(kind='hist')

1. Log Transformation – right-skewed data
When a data sample follows a power-law distribution, we can use log scaling to transform the right-skewed distribution into an approximately normal one. To achieve this, simply apply np.log() (or np.log10(), as in the code below). In this dataset, most variables fall under this category.

## log transformation - power law distribution ##
import numpy as np
log_var = ['Income_M$', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases']
fig = plt.figure(figsize=(24, 10))
for j in range(len(log_var)):
    var = log_var[j]
    transformed = "log_" + var
    # add 1 before taking the log so zero values do not produce -inf
    df[transformed] = np.log10(df[var] + 1)
    sub = fig.add_subplot(2, 5, j + 1)
    sub.set_xlabel(var)
    df[transformed].plot(kind='hist')
After the log transformation, these features have become more normally distributed.
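If you later need to report values or predictions on the original scale, the transformation can be inverted. A quick sketch, assuming the log_ columns created above (the _restored column name is just for illustration):
# invert log10(x + 1): raise 10 to the power, then subtract the offset of 1
df['Income_M$_restored'] = np.power(10, df['log_Income_M$']) - 1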

2. Clipping – handling outliers
This approach is more suitable when there are outliers in the dataset. The clipping method sets an upper and a lower bound, and all data points are constrained to stay within that range.
We can use quantile() to find the range that contains the bulk of the data (between the 0.05 and 0.95 quantiles). Any value below the lower bound (defined by the 0.05 quantile) is rounded up to the lower bound, and similarly, any value above the upper bound (defined by the 0.95 quantile) is rounded down to the upper bound.
## clipping methods - handle outliers ##
clip_var = ['Age', 'NumWebVisitsMonth']
for i in clip_var:
    transformed = 'clipped_' + i
    # upper limit - 0.95 quantile
    upper_limit = df[i].quantile(0.95)
    # lower limit - 0.05 quantile
    lower_limit = df[i].quantile(0.05)
    # cap values outside [lower_limit, upper_limit]
    df[transformed] = df[i].clip(lower_limit, upper_limit)
    print(df[i].describe())
    print(df[transformed].describe())
    plt.figure(figsize=(5, 5))
    df[i].plot(kind='hist')
    plt.figure(figsize=(5, 5))
    df[transformed].plot(kind='hist')
From the histograms in the EDA process, we can see that the variables "Age" and "NumWebVisitsMonth" have outliers with extraordinarily large values. So I will only apply clipping to these two columns.
Consequently, the maximum value for both fields dropped significantly:
- Age: from 128 to 71
- NumWebVisitsMonth: from 20 to 8


3. Scaling Transformation
After the log transformation and addressing the outliers, we can use the scikit-learn preprocessing library to convert the data onto the same scale. This library contains several useful functions: the min-max scaler, the standard scaler and the robust scaler. Each scaler serves a different purpose.
Min Max Scaler – normalization
MinMaxScaler() is usually applied when the dataset is not distorted by extreme values. It normalizes the data into a range between 0 and 1 based on the formula:
x’ = (x – min(x)) / (max(x) – min(x))
Standard Scaler – standardization
We use standardization when the dataset conforms to a normal distribution. StandardScaler() converts the numbers into the standard form with mean = 0 and variance = 1, based on the z-score formula:
x’ = (x – mean) / standard deviation.
Robust Scaler
RobustScaler() is more suitable for datasets with skewed distributions and outliers, because it transforms the data based on the median and quantiles:
x’ = (x – median) / inter-quartile range.
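To make the three formulas concrete, here is a small sketch on a made-up toy array that reproduces each scaler by hand and checks the result against scikit-learn:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0]).reshape(-1, 1)   # toy data with one outlier

# min-max: (x - min) / (max - min)
manual_minmax = (x - x.min()) / (x.max() - x.min())
# standard: (x - mean) / standard deviation (population std, matching sklearn)
manual_standard = (x - x.mean()) / x.std()
# robust: (x - median) / IQR, where IQR = Q3 - Q1
q1, q3 = np.percentile(x, [25, 75])
manual_robust = (x - np.median(x)) / (q3 - q1)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))      # True
print(np.allclose(manual_standard, StandardScaler().fit_transform(x)))  # True
print(np.allclose(manual_robust, RobustScaler().fit_transform(x)))      # True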
To compare how these three scalers work, I use a loop to scale the remaining variables, together with the two clipped variables, applying StandardScaler(), RobustScaler() and MinMaxScaler() respectively.
## data scaling methods ##
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
scale_var = ['Enrollment_Length', 'Recency', 'NumStorePurchases', 'clipped_Age', 'clipped_NumWebVisitsMonth']
scalers_list = [StandardScaler(), RobustScaler(), MinMaxScaler()]
for i in range(len(scalers_list)):
    scaler = scalers_list[i]
    fig = plt.figure(figsize=(26, 5))
    plt.title(scaler, fontsize=20)
    for j in range(len(scale_var)):
        var = scale_var[j]
        scaled_var = "scaled_" + var
        # fit the scaler on the column, then transform it
        model = scaler.fit(df[var].values.reshape(-1, 1))
        df[scaled_var] = model.transform(df[var].values.reshape(-1, 1))
        sub = fig.add_subplot(1, 5, j + 1)
        sub.set_xlabel(var)
        df[scaled_var].plot(kind='hist')
As shown, the scalers don’t change the shape of the data distribution; instead, they change the spread and location of the data points.

Take "NumStorePurchases" as an example, MinMaxScaler() converts the values to be strictly between 0 and 1, StandardScaler() transforms dataset into mean = 0 whereas RobustScaler() transforms dataset into median = 0.

In this dataset, these five variables are neither heavily distorted nor normally distributed, so choosing MinMaxScaler() among the three scalers should suffice.
Now that all features have been transformed according to their properties, let’s visualize them again. We can see that the data looks more organized and less distorted, hence more suitable for feeding into models and generating insights.
before transformation

after transformation

Hope you enjoy my article :). If you would like to read more of my articles on Medium, please feel free to support me by signing up for a Medium membership using this affiliate link (https://destingong.medium.com/membership).
Take-Home Message
This article takes you through the journey of transforming data and demonstrates how to choose the appropriate technique according to the data properties.
- data cleaning: converting data types and removing unnecessary characters
- log transformation for right skewed data
- clipping methods for handling outliers
- data scaling: minmax scaler, standard scaler, robust scaler
Originally published at https://www.visual-design.net on July 27th, 2021.
More Articles Like This
Clustering Algorithm for Customer Segmentation