
3 Common Techniques for Data Transformation

How to Choose the Appropriate One for Your Data

Data Transformation Overview (image by author from www.visual-design.net)

Data transformation is the process of converting raw data into a format or structure that is more suitable for model building and data discovery in general. It is an essential step in feature engineering that facilitates the discovery of insights. This article covers three techniques for transforming numeric data: log transformation, clipping, and data scaling.

Why do we need data transformation?

  • The algorithm is more likely to be biased when the data distribution is skewed.
  • Transforming data onto the same scale allows the algorithm to better compare the relative relationships between data points.

When to apply data transformation

When implementing supervised algorithms, training data and test data need to be transformed in the same way. This is usually achieved by fitting the transformation on the training dataset and then applying the fitted transformation to the test set.
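As a minimal sketch of this pattern using scikit-learn (the toy feature matrix below is made up purely for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy feature matrix standing in for real data
X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same fitted parameters on the test set

The key point is that transform() on the test set reuses the parameters learned from the training set, so no information leaks from test to train.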


Basic Feature Engineering and EDA

For this exercise, I am using the Marketing Analytics dataset from Kaggle.

If you would like to access the full code, please visit my website.

First, I performed some basic feature engineering to make the data tidier and more insightful.

import pandas as pd
from datetime import date

# 1. Transform Year_Birth into Age
df['Age'] = date.today().year - df['Year_Birth']

# 2. Transform Dt_Customer into Enrollment_Length
df['Year'] = pd.DatetimeIndex(df['Dt_Customer']).year
df['Enrollment_Length'] = date.today().year - df['Year']

# 3. Transform currency format into numbers
# the raw column name ' Income ' contains surrounding spaces in this dataset
df['Income$'] = (
    df[' Income ']
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
    .str.replace('.', '', regex=False)
    .fillna(0)
    .astype(int)
)
df['Income_M$'] = df['Income$'].apply(lambda x: round(x/1000000, 2))
print(df[['Income_M$', 'Income$']])

1. Transform year of birth ("Year_Birth") into "Age"

Subtract Year_Birth from the current year.

2. Transform the date the customer enrolled ("Dt_Customer") into "Enrollment_Length"

It is similar to the transformation above, with the additional step of extracting the year from the date feature.

3. Transform currency ("Income") into numbers ("Income_M$")

This involves four steps: 1) clean the data by removing the characters ",", "$", and "."; 2) substitute null values with 0; 3) convert the string into an integer; 4) scale the numbers down to million dollars ("Income_M$"), which helps with visualizing the data distribution.

Now, let’s visualize the current data distribution using a simple univariate EDA technique – the histogram. It is not hard to see that most variables are heavily skewed.

import matplotlib.pyplot as plt

# select numeric variables
numeric_vars = ['Age', 'Enrollment_Length', 'Income_M$', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']

# create a histogram for each numeric variable in a 3 x 5 grid
fig = plt.figure(figsize=(24, 15))
for i in range(len(numeric_vars)):
    var = numeric_vars[i]
    sub = fig.add_subplot(3, 5, i + 1)
    sub.set_xlabel(var)
    df[var].plot(kind='hist', ax=sub)
EDA (image by author)

If you want to know more about data cleaning, feature engineering, and exploratory data analysis (EDA), I have curated a list that might be useful:

  1. Feature Selection: Feature Selection and EDA in Machine Learning
  2. Exploratory Data Analysis: Semi-Automated Exploratory Data Analysis (EDA) in Python
  3. Data Visualization: How to Choose the Most Appropriate Chart?
  4. Address Missing Data: How to Address Missing Data


1. Log Transformation – right-skewed data

When a data sample follows a power-law distribution, we can use log scaling to transform the right-skewed distribution into an approximately normal one. To achieve this, simply apply a log function such as np.log() or np.log10(), adding 1 beforehand to avoid taking the log of zero. In this dataset, most variables fall under this category.

before transformation (image by author)
import numpy as np

## log transformation - power law distribution ##
log_var = ['Income_M$', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases']
fig = plt.figure(figsize=(24, 10))
for j in range(len(log_var)):
    var = log_var[j]
    transformed = "log_" + var
    # add 1 before taking the log to avoid log(0) on zero values
    df[transformed] = np.log10(df[var] + 1)
    sub = fig.add_subplot(2, 5, j + 1)
    sub.set_xlabel(var)
    df[transformed].plot(kind='hist', ax=sub)

After the log transformation, these features have become more normally distributed.

after transformation (image by author)

2. Clipping – handle outliers

This approach is more suitable when there are outliers in the dataset. The clipping method sets an upper and a lower bound, and all data points are then contained within that range.

We can use quantile() to find the range containing the majority of the data (between the 5th and the 95th percentile). Any value below the lower bound (defined by the 5th percentile) is raised to the lower bound; similarly, any value above the upper bound (defined by the 95th percentile) is lowered to the upper bound.

## clipping method - handle outliers ##
clip_var = ['Age', 'NumWebVisitsMonth']
for i in clip_var:
    transformed = 'clipped_' + i
    # upper limit - 0.95 quantile
    upper_limit = df[i].quantile(0.95)
    # lower limit - 0.05 quantile
    lower_limit = df[i].quantile(0.05)
    # clip all values into the [lower_limit, upper_limit] range
    df[transformed] = df[i].clip(lower_limit, upper_limit)

    print(df[i].describe())
    print(df[transformed].describe())

    plt.figure(figsize=(5, 5))
    df[i].plot(kind='hist')

    plt.figure(figsize=(5, 5))
    df[transformed].plot(kind='hist')

From the histograms in the EDA process, we can see that the variables "Age" and "NumWebVisitsMonth" have outliers with extraordinarily large values, so I apply clipping to these two columns only.

Consequently, the maximum value for both fields dropped significantly:

  • Age: from 128 to 71
  • NumWebVisitsMonth: from 20 to 8
apply clipping method to Age
apply clipping method to NumWebVisitsMonth

3. Scaling Transformation

After the log transformation and addressing the outliers, we can use the scikit-learn preprocessing library to convert the data onto the same scale. This library contains some useful functions: the min-max scaler, the standard scaler, and the robust scaler. Each scaler serves a different purpose.

Min Max Scaler – normalization

MinMaxScaler() is usually applied when the dataset is not distorted. It normalizes the data into a range between 0 and 1 based on the formula:

x’ = (x – min(x)) / (max(x) – min(x))

Standard Scaler – standardization

We use standardization when the dataset conforms to a normal distribution. StandardScaler() converts the numbers into the standard form with mean = 0 and variance = 1 based on the z-score formula:

x’ = (x – mean) / standard deviation.

Robust Scaler

RobustScaler() is more suitable for datasets with skewed distributions and outliers because it transforms the data based on the median and quantiles:

x’ = (x – median) / inter-quartile range
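As a quick illustration of how the three formulas behave, here is a minimal sketch applying each scaler to a small made-up array containing one outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# toy values with one outlier (made up for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(MinMaxScaler().fit_transform(x).ravel())
# [0.  0.0101  0.0202  0.0303  1.] - the outlier squeezes everything else towards 0
print(StandardScaler().fit_transform(x).ravel())
# mean = 0 and variance = 1, but the mean and std are themselves pulled by the outlier
print(RobustScaler().fit_transform(x).ravel())
# [-1.  -0.5  0.  0.5  48.5] - centered on the median, so the bulk of the data keeps its spread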

To compare how these three scalers work, I iterate over StandardScaler(), RobustScaler(), and MinMaxScaler(), applying each of them to the remaining variables as well as to the two clipped variables.

## data scaling methods ##
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

scale_var = ['Enrollment_Length', 'Recency', 'NumStorePurchases', 'clipped_Age', 'clipped_NumWebVisitsMonth']
scalers_list = [StandardScaler(), RobustScaler(), MinMaxScaler()]
for i in range(len(scalers_list)):
    scaler = scalers_list[i]
    fig = plt.figure(figsize=(26, 5))
    plt.title(scaler, fontsize=20)
    for j in range(len(scale_var)):
        var = scale_var[j]
        scaled_var = "scaled_" + var
        model = scaler.fit(df[var].values.reshape(-1, 1))
        df[scaled_var] = model.transform(df[var].values.reshape(-1, 1))
        sub = fig.add_subplot(1, 5, j + 1)
        sub.set_xlabel(var)
        df[scaled_var].plot(kind='hist', ax=sub)

As shown, the scalers don’t change the shape of the data distribution; instead, they change the spread of the data points.

scalers comparison (image by author)

Take "NumStorePurchases" as an example, MinMaxScaler() converts the values to be strictly between 0 and 1, StandardScaler() transforms dataset into mean = 0 whereas RobustScaler() transforms dataset into median = 0.

NumStorePurchases
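These properties are easy to verify directly; below is a minimal sketch, assuming the df and the column from the code above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

x = df['NumStorePurchases'].values.reshape(-1, 1)

mm = MinMaxScaler().fit_transform(x)
print(mm.min(), mm.max())                                  # 0.0 and 1.0
print(round(StandardScaler().fit_transform(x).mean(), 6))  # ~0.0
print(np.median(RobustScaler().fit_transform(x)))          # 0.0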

In this dataset, these five variables are neither distorted nor normally distributed, so MinMaxScaler() is a sufficient choice among the three.


Now that all features have been transformed according to their properties, let’s visualize them again. The data looks more organized and less distorted, hence more suitable for feeding into models and generating insights.

before transformation

after transformation


Hope you enjoy my article :). If you would like to read more of my articles on Medium, please feel free to contribute by signing up for a Medium membership using this affiliate link (https://destingong.medium.com/membership).


Take-Home Message

This article takes you through the journey of transforming data and demonstrates how to choose the appropriate technique according to the data properties.

  1. Data cleaning: converting data types and removing unnecessary characters
  2. Log transformation for right-skewed data
  3. Clipping for handling outliers
  4. Data scaling: min-max scaler, standard scaler, robust scaler

Originally published at https://www.visual-design.net on July 27th, 2021.

More Articles Like This

Clustering Algorithm for Customer Segmentation

Feature Selection and EDA in Machine Learning

Semi-Automated Exploratory Data Analysis (EDA) in Python

