
Elo Merchant Category Recommendation – A Case Study

A Machine Learning Case Study walk-through.

Photo by Franki Chamaki on Unsplash

In today’s modern era, Machine Learning is involved in almost every aspect of our lives. From something as simple as movie and product recommendations to something as complex as leveraging business data to make decisions for organizations, Machine Learning and AI have come a long way over the past few decades. Today, I am going to discuss one such example, the Elo Merchant Category Recommendation, a Machine Learning (ML) case study that I enjoyed working on.


Overview of the Case Study:

The case study approach is divided into the following steps:

  • Business Problem
  • ML Problem Formulation
  • Exploratory Data Analysis
  • Feature Engineering
  • Feature Correlation and Feature Selection
  • Regression Models
  • Conclusion and Results

Business Problem

Elo is one of the largest payment brands in Brazil. It provides restaurant recommendations to users, along with discounts based on the user’s credit card provider and restaurant preferences. Elo has also built partnerships with various brands in order to offer promotions and discounts from merchants to its users.

Now, the problem at hand is to find out how useful and beneficial these promotions are for the merchants as well as the users (customers). We need to find out whether customers actually use the promotions or discounts offered to them. This can be achieved by predicting a metric called the customer loyalty score, which is our target variable. Each card will therefore have a corresponding predicted loyalty score.

The customer loyalty score gives us an idea of how often users/customers actually use the promotions and discounts offered to them. With the predicted scores in hand, Elo can focus its marketing efforts on the customers who are more loyal and reduce unwanted marketing campaigns aimed at customers predicted to have low loyalty. This would ultimately lead to better customer retention rates.

ML Problem Formulation

Even though the name suggests a recommendation task, this is actually a regression problem: for a given card ID, we need to predict its loyalty score from the transactions and merchants data associated with the cards. We will therefore train regression models on features generated from this data and use them to predict the loyalty scores.

Data Overview

All data is simulated and fictitious, and is not real customer data for obvious reasons. The data provided contains up to 3 months of transaction data for each card ID. It also contains data about the merchants involved in these transactions. There is an additional file containing 2 more months of purchase/transaction data that is not included in the initial 3 months of transactions. Below is a detailed description of each of these files:

  1. Data_Dictionary.xlsx → This file contains the data field descriptions for each csv file.
  2. train.csv and test.csv → These files contain the card IDs (card_id) and information about the cards. They also contain the target variable (loyalty score) that needs to be predicted. Below are the descriptions for each of the columns:
card_id → Unique card identifier
first_active_month → month of first purchase in 'YYYY-MM' format
feature_1 → Anonymized card categorical feature
feature_2→ Anonymized card categorical feature
feature_3 → Anonymized card categorical feature
target → Loyalty numerical score calculated 2 months after historical and evaluation period
  3. historical_transactions.csv and new_merchant_transactions.csv → These files contain the transaction data, with information about the transactions for each card. Below are the descriptions for each of the columns:
card_id → Card identifier
month_lag → month lag to reference date
purchase_date → Purchase date
authorized_flag → 'Y' if approved, 'N' if denied
category_3 → anonymized category
installments → number of installments of purchase
category_1 → anonymized category
merchant_category_id → Merchant category identifier (anonymized)
subsector_id → Merchant category group identifier (anonymized)
merchant_id → Merchant identifier (anonymized)
purchase_amount → Normalized purchase amount
city_id → City identifier (anonymized)
state_id → State identifier (anonymized)
category_2 → anonymized category
  4. merchants.csv → This file contains additional information about the merchants involved in the transactions. Below are the descriptions for each of the columns:
merchant_id → Unique merchant identifier
merchant_group_id → Merchant group (anonymized)
merchant_category_id → Unique identifier for merchant category (anonymized)
subsector_id → Merchant category group (anonymized)
numerical_1 → anonymized measure
numerical_2 → anonymized measure
category_1 → anonymized category
category_2 → anonymized category
category_4 → anonymized category
city_id → City identifier (anonymized)
most_recent_sales_range → Range of revenue (monetary units) in last active month (A > B > C > D > E)
most_recent_purchases_range → Range of quantity of transactions in last active month (A > B > C > D > E)
avg_sales_lag3 → Monthly average of revenue in last 3 months divided by revenue in last active month
avg_purchases_lag3 → Monthly average of transactions in last 3 months divided by transactions in last active month
active_months_lag3 → Quantity of active months within last 3 months
avg_sales_lag6 → Monthly average of revenue in last 6 months divided by revenue in last active month
avg_purchases_lag6 → Monthly average of transactions in last 6 months divided by transactions in last active month
active_months_lag6 → Quantity of active months within last 6 months
avg_sales_lag12 → Monthly average of revenue in last 12 months divided by revenue in last active month
avg_purchases_lag12 → Monthly average of transactions in last 12 months divided by transactions in last active month
active_months_lag12 → Quantity of active months within last 12 months

All the data files can be downloaded from this Kaggle link.

Performance Metric

The performance metric that we will use to measure the error between the predicted and actual loyalty scores is RMSE (Root Mean Squared Error).

Image by Author

Here, ŷ is the loyalty score predicted and y is the actual loyalty score for each card ID.
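As a quick illustration, here is a minimal sketch of computing RMSE with NumPy; the score values below are made up:

import numpy as np

y_true = np.array([0.39, -0.72, 1.15])   # actual loyalty scores (made-up values)
y_pred = np.array([0.25, -0.60, 0.90])   # predicted loyalty scores

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)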

Exploratory Data Analysis

Train and test files

  1. Target variable distribution:
Target Variable Distribution (Image by Author)

The majority of the loyalty scores lie between -1 and 1 and are centered around zero, so it is possible that they have already been standardized.

As we can see, some points lie far away from the rest, with loyalty scores below -30. These points constitute about 1% of the data, so they cannot simply be discarded as outliers; it depends on whether such points are also present in the test data. As we will see in later stages, they are, so let’s just call them rare data points.

2. Categorical Features

Categorical Features Distribution for Non Rare Data points (Image by Author)

Let’s check out these categorical features for our rare data points:

Categorical Features Distribution for Rare Data points (Image by Author)

There is not much difference in categorical features 1, 2 and 3 between the rare and non-rare data points. Extracting features from the historical and new merchant transactions might help to better predict these rare loyalty scores of less than -30.

3. First Active Month

Since this is given in ‘YYYY-MM’ format, let’s convert it to a simpler measure such as the month difference from today’s date. This is quite simple to implement using the datetime functionality in pandas, as sketched below.
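This sketch uses a small hypothetical sample in place of the actual train.csv column:

import pandas as pd

train = pd.DataFrame({'first_active_month': ['2017-06', '2015-11', '2018-01']})
train['first_active_month'] = pd.to_datetime(train['first_active_month'])

# Approximate number of elapsed months between today and the first active month
train['active_month_diff'] = (pd.Timestamp.today() - train['first_active_month']).dt.days // 30
print(train)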

Now we can plot this against the loyalty score (target):

Scatterplot of loyalty scores against the month difference from first active month (Image by Author)
  • We can observe a trend: the most recent users have higher loyalty scores as well as higher variance in loyalty scores.
  • However, this also shows that there are more recent users than long-time users.
  • Another important observation is that the users with a loyalty score of <= -30 (our rare data points) sit at the very bottom of the plot for every value range of the active month difference. This suggests that this feature alone might not be enough to separate the rare data points from the rest of the data.
  • Nevertheless, this feature should definitely help in predicting the loyalty scores.

Historical Transactions

  1. Authorized Flag
(Image by Author)

As expected, the majority of the transactions are authorized.

  2. Installments
Installments distribution (Image by Author)

Values of -1 and 999 seem odd for installments. We may need to handle these later.

  3. Purchase Amount
(Image by Author)

It seems that these were already standardized as well.

New Merchant Transactions

It was observed that this file did not contain any unauthorized transactions.

The installments and purchase amounts distributions were very similar to what we had observed for the historical transactions.

Plotting Purchase Amounts and Installments against Loyalty Score

Scatterplot of installments against loyalty score(target) - (Image by Author)

Since the target score appears to be standardized (within some range), the users with a higher number of installments are observed to have loyalty scores closer to zero.

(Image by Author)
  • We can see here that the loyalty score increases as the sum of transaction values increases.
  • The same trend is observed for the number of transactions made by each card_id as well.

Merchants Data

This file had a lot of missing values and infinity values in its columns. The infinity values were replaced with nulls, and mean and mode imputation were then used as appropriate to deal with the missing values. A sketch of this cleaning step is shown below.
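This sketch uses a toy merchants DataFrame; the exact per-column imputation in the original pipeline may differ:

import numpy as np
import pandas as pd

merchants = pd.DataFrame({'avg_sales_lag3': [1.2, np.inf, np.nan, 0.9],
                          'category_2': [1.0, np.nan, 3.0, 1.0]})

# Replace infinity values with nulls first
merchants.replace([np.inf, -np.inf], np.nan, inplace=True)

# Mean imputation for the continuous column, mode imputation for the categorical one
merchants['avg_sales_lag3'].fillna(merchants['avg_sales_lag3'].mean(), inplace=True)
merchants['category_2'].fillna(merchants['category_2'].mode()[0], inplace=True)
print(merchants)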

  1. Anonymized Categories
(Image by Author)
  • Since all three features are anonymized, we can’t say much about the merchants themselves from these plots.
  • However, we could further explore whether transactions involving merchants from the majority category tend to lead to particular loyalty scores.
  2. Active Months Lag
(Image by Author)
  • As we can observe here, the cards with fewer active months have low loyalty scores.
  • We can see a lot of variance here. These active months columns might be useful as features.
  3. Average Purchases Lag
(Image by Author)
  • For our rare data points (loyalty scores <= -30), the average purchases lag is less than 2000.
  • This can be a very useful feature as well for predicting the loyalty scores.

EDA Summary:

  1. About 1% of the target variable (loyalty score) values look like outliers. How we deal with these points depends on whether they are also present in the test data; since they are, they aren’t actually outliers and can be termed rare data points. However, the loyalty score is difficult to predict for these points, and they can have a major impact on our final scores.
  2. The first active month in the train file could be a very useful feature.
  3. Aggregated transaction features for each card would be very helpful in predicting the loyalty score using the regression models. Most of our features would be from the transaction files.
  4. Some features can be used from the merchants file as well. For example, we can have a categorical variable that states if the transaction involved a merchant that was active in the last 3/6 or 12 months. The average purchases and sales ranges might be useful as well.

Data Preprocessing

Data preprocessing included imputing the missing values with the mode for categorical variables and the mean for continuous variables.

Extreme purchase amounts were handled by clipping the values to a range that covers up to the 99.9th percentile.

As observed earlier, installment values of -1 and 999 were replaced with NaN (null value) and imputed accordingly.

import numpy as np

# Clip extreme purchase amounts and fix the odd installment values (-1, 999)
historical_transactions['purchase_amount'] = historical_transactions['purchase_amount'].clip(upper=0.8)
historical_transactions['installments'].replace([-1, 999], np.nan, inplace=True)
historical_transactions['installments'].fillna(historical_transactions['installments'].mode()[0], inplace=True)

Feature Engineering

The Feature Engineering for this problem was done in two iterations. In the first iteration, both the transaction files were combined into a single file with all transactions and the features were generated using this combined file. The RMSE scores obtained from these features were not good enough.

Hence I had to go back to Feature Engineering to extract features from both these files separately. Along with this, additional features were also generated in the second iteration. This improved the RMSE score by ~0.35.

Handling Date Features

purchase_date and first_active_month were very useful in generating additional features such as quarter of the year and month difference from today. This can be easily done using datetime in Pandas.

Similarly, from purchase_date we generated features indicating whether the purchase was made on a weekend, a holiday or a weekday, as well as the month, year, day, time of day, and hour of the purchase. A sketch of these date-derived features is shown below.
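This sketch uses a toy transactions DataFrame; the holiday flag is omitted and the column names are illustrative:

import pandas as pd

transactions = pd.DataFrame({'purchase_date': ['2018-03-03 14:20:00', '2018-12-25 09:10:00']})
transactions['purchase_date'] = pd.to_datetime(transactions['purchase_date'])

transactions['purchase_year'] = transactions['purchase_date'].dt.year
transactions['purchase_month'] = transactions['purchase_date'].dt.month
transactions['purchase_hour'] = transactions['purchase_date'].dt.hour
transactions['purchase_quarter'] = transactions['purchase_date'].dt.quarter
transactions['is_weekend'] = (transactions['purchase_date'].dt.dayofweek >= 5).astype(int)
transactions['month_diff'] = (pd.Timestamp.today() - transactions['purchase_date']).dt.days // 30
print(transactions)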

Handling Categorical Features

Two approaches were used for Categorical Features:

  1. One Hot Encoding
  2. Mean Encoding

One hot encoding was used in the first iteration. However, mean encoding provided much better results after the second iteration of feature engineering. The mean encoding for the categorical variables was based on whether the corresponding target value is less than -30 (i.e., is a rare data point) or not. This was done to improve the predictions on the rare data points. A minimal sketch of this encoding is shown below.
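This sketch uses a toy train DataFrame; the exact encoding scheme in the project may differ slightly:

import pandas as pd

train = pd.DataFrame({'feature_1': [1, 2, 3, 1, 2],
                      'target': [0.5, -33.2, 1.2, -0.3, 0.8]})

# Flag the rare data points (loyalty score below -30)
train['is_rare'] = (train['target'] < -30).astype(int)

# Encode each category by the fraction of rare points observed for it
encoding = train.groupby('feature_1')['is_rare'].mean()
train['feature_1_enc'] = train['feature_1'].map(encoding)
print(train[['feature_1', 'feature_1_enc']])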

The transaction features were aggregated by card_id using mean, max, min, var, skew and nunique/count depending on the variable, and were eventually merged with the train and test files.

More features were generated using the existing features and the available domain knowledge; a sketch of some of these aggregations and derived features is shown below.
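This sketch assumes the historical_transactions and train DataFrames loaded earlier; the aggregations and the derived feature are illustrative examples, not the full 226-feature set:

agg_funcs = {
    'purchase_amount': ['sum', 'mean', 'max', 'min', 'var', 'skew'],
    'installments': ['sum', 'mean', 'max'],
    'merchant_id': ['nunique'],
    'purchase_date': ['count'],
}
card_features = historical_transactions.groupby('card_id').agg(agg_funcs)
card_features.columns = ['hist_' + '_'.join(col) for col in card_features.columns]
card_features = card_features.reset_index()

# Example of a derived feature: average purchase amount per installment
card_features['hist_amount_per_installment'] = (
    card_features['hist_purchase_amount_sum'] / (card_features['hist_installments_sum'] + 1)
)

# Merge the aggregated features back onto the train/test files by card_id
train = train.merge(card_features, on='card_id', how='left')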

In the end, 226 features were generated in total.

Feature Correlation and Feature Selection:

Since we have generated more than 200 features, the correlation matrix looks something like this:

Correlation matrix (Image by Author)

Yikes!! This does not look good.

It is very difficult to interpret anything from this because of the high number of features, so a slightly different approach was taken using the correlation matrix.

A threshold of 0.85 was set, and from each pair of features with a correlation above this threshold, one feature was removed. This approach brought the number of features down from 226 to 134. A sketch of this filter is shown below.
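The helper function and demo data here are illustrative:

import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.85):
    corr = df.corr().abs()
    # Keep only the upper triangle so each feature pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_correlated(demo).columns.tolist())   # drops 'b', which is perfectly correlated with 'a'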

Regression Models

Baseline Regression Models

The initial models were tuned using RandomizedSearchCV and GridSearchCV and trained on the training data obtained after a train-test split.

Linear Regression:

Linear Regression (Image by Author)

SGD Regressor:

SGD Regressor (Image by Author)

Random Forest Regressor:

Random Forest Regressor (Image by Author)

LGBM Regressor:

LGBM Regressor (Image by Author)

Since RandomizedSearchCV wasn’t providing the best results, Bayesian optimization was used for hyperparameter tuning of the LGBM models; a sketch of this tuning step is shown below.
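This sketch assumes the bayes_opt package; the synthetic data, objective and parameter ranges are illustrative only:

import lightgbm as lgb
import numpy as np
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 3 + rng.normal(size=500)

def lgb_cv(num_leaves, min_child_samples, learning_rate):
    model = lgb.LGBMRegressor(num_leaves=int(num_leaves),
                              min_child_samples=int(min_child_samples),
                              learning_rate=learning_rate,
                              n_estimators=300)
    # Negative RMSE, so maximizing this objective minimizes RMSE
    return cross_val_score(model, X, y, cv=3,
                           scoring='neg_root_mean_squared_error').mean()

optimizer = BayesianOptimization(f=lgb_cv,
                                 pbounds={'num_leaves': (20, 200),
                                          'min_child_samples': (5, 100),
                                          'learning_rate': (0.005, 0.1)},
                                 random_state=42)
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)   # best score and corresponding hyperparameters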

The LGBM models performed better on the full set of 226 features than on the reduced set of 134. A possible reason is that trees and tree-based ensemble models are not affected much by correlated features.

Bayesian Ridge Regression stacked model with LGBM Goss

Predictions were made using LGBM GOSS models trained with StratifiedKFold and RepeatedKFold cross-validation on the train data. Here, there was no need for a train-test split since we were making out-of-fold predictions on the train data, so the entire train data could be used for training the model.

These two sets of predictions were stacked and given as input to a Bayesian Ridge Regression model (the meta model), as sketched below.
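This sketch uses a single KFold LGBM GOSS model on synthetic data; the actual pipeline stacked two sets of out-of-fold predictions (StratifiedKFold and RepeatedKFold):

import numpy as np
import lightgbm as lgb
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(size=500)

# Out-of-fold predictions on the full train data (no separate train-test split needed)
oof_pred = np.zeros(len(y))
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = lgb.LGBMRegressor(boosting_type='goss', n_estimators=200)
    model.fit(X[tr_idx], y[tr_idx])
    oof_pred[val_idx] = model.predict(X[val_idx])

# Meta model trained on the stacked out-of-fold predictions
meta = BayesianRidge()
meta.fit(oof_pred.reshape(-1, 1), y)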

Low and High Probability Model Approach

Since the rare data points were severely affecting the final scores, I tried the model architecture below to deal with them:

Image by Evgeny Patekha
  1. Binary Classification Model: This classification model identifies the rare data points. It was an LGBM Classifier, and a threshold (hyperparameter) was used to classify a card/user as a rare data point (outlier) based on the class probabilities obtained from the model.
  2. Regression (Full): The stacked Bayesian Ridge regression model described above, trained on the entire dataset.
  3. Regression (Low Prob Model): A regression model trained on a low concentration (hyperparameter) of rare data points (outliers). This was an LGBM regressor trained on all features.
  4. Regression (High Prob Model): A regression model trained on a high concentration (hyperparameter) of outliers. Since we have very few rare data points, a complex model could easily overfit, so a simple model (Bayesian Ridge Regression) was used. It was trained on the predictions from the full regression model and the binary classifier, along with the 10 most important features according to the full regression model’s feature importances.
  5. Full (Blend) Model: The predictions from the high prob model, low prob model and full regression model were blended to give the final loyalty score prediction. The final meta model here was also a Bayesian Ridge Regression model. A condensed sketch of this architecture is shown below.
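This sketch uses synthetic data and simplifies the pipeline (the high prob model and the out-of-fold handling are omitted); the model choices and names are assumptions, not the exact implementation:

import numpy as np
import lightgbm as lgb
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 20))
y = X[:, 0] * 2 + rng.normal(size=2000)
y[rng.choice(2000, 20, replace=False)] = -33.0   # inject ~1% rare data points
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

is_rare = (y_tr < -30).astype(int)

# 1. Binary classifier estimating the probability of a rare data point
clf = lgb.LGBMClassifier(n_estimators=200).fit(X_tr, is_rare)

# 2. Full regression model trained on all points
full_reg = lgb.LGBMRegressor(n_estimators=300).fit(X_tr, y_tr)

# 3. Low prob model trained with few/no rare data points
low_reg = lgb.LGBMRegressor(n_estimators=300).fit(X_tr[is_rare == 0], y_tr[is_rare == 0])

def meta_features(X_):
    return np.column_stack([full_reg.predict(X_),
                            low_reg.predict(X_),
                            clf.predict_proba(X_)[:, 1]])

# 4. is omitted here for brevity; 5. blend the signals with a Bayesian Ridge
# meta model (the real pipeline fits this on out-of-fold predictions)
blend = BayesianRidge().fit(meta_features(X_tr), y_tr)
final_pred = blend.predict(meta_features(X_te))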

Results

The performance of all the above approaches is summarized in the tables below:

Baseline Regression Models:

Regression Model results (Image by Author)

Stacked and LGBM Goss Models results:

RMSE scores for the different approaches mentioned (Image by Author)

The stacked Bayesian Ridge model provided the best Kaggle score. Hence, this was used for the final submission.

Kaggle Submission

Here is my best Kaggle score after submission:

(Image by Author)

Summary

  1. Feature engineering is of utmost importance in this problem. The better the features generated from the transactions and merchants data, the better the RMSE scores.
  2. Handling the outliers/rare data points is the crux of this case study. These points hugely impact the final RMSE score.
  3. After trying different regression models and stacking architectures, LGBM GOSS models stacked with Bayesian Ridge Regression as the meta model provided the best results.

Future Work/Improvements

  1. Generating more features using target encoding can help to improve the score.
  2. The high and low prob model approach can be further optimized with more features and by experimenting with the high and low prob models. It was successfully implemented by Evgeny Patekha in his 5th-place solution, as explained here.
  3. Most of the categorical and numerical features are anonymized, so we don’t have a clear idea of what these features actually represent. This makes it difficult to generate new features based on domain expertise. However, this limitation/constraint is in place to avoid revealing user/company data.

Conclusion

There are a lot of techniques that I learned and improved upon while working on this case study. Skills such as feature engineering and model hyperparameter tuning were of huge importance here. This was also my first time using Bayesian optimization for tuning the models. In the field of ML, learning is a continuous and never-ending process as new techniques and advances appear every year. Hence, the learning will continue…


Hope you enjoyed reading this as much as I enjoyed working on the case study. Thank you for reading!!

Complete code for the case study can be found here.

LinkedIn: www.linkedin.com/in/rjt5412

Github: https://github.com/Rjt5412


References:

  1. https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-elo
  2. https://www.kaggle.com/artgor/elo-eda-and-models
  3. A Closer Look at Date Variables by Robin Denz
  4. https://www.kaggle.com/fabiendaniel/elo-world/notebook
  5. https://www.kaggle.com/fabiendaniel/hyperparameter-tuning
  6. https://www.kaggle.com/mfjwr1/simple-lightgbm-without-blending
  7. https://www.kaggle.com/roydatascience/elo-stack-with-goss-boosting
  8. https://www.kaggle.com/c/elo-merchant-category-recommendation/discussion/82314
