Will Customers Buy the Products in their Cart?

Using an XGBoost classifier to predict whether a customer will eventually buy an item once they have added it to the cart.

Andy Yu
Towards Data Science


Picture Source: https://www.thesouthafrican.com/lifestyle/black-friday-deals-in-south-africa-tips-2019/

Understanding the key drivers of customers’ buying behavior has always been the holy grail of the eCommerce industry. That knowledge can be used to improve the shopping process and ultimately lift sales and customer satisfaction. In this project, I use the November 2019 data from Kaggle (eCommerce behavior data from a multi-category store) to show how I analyze the data and build a basic prediction model with XGBoost.

Know your customers

Before diving into feature engineering and model building, it’s always good to step back and do some EDA (exploratory data analysis). It usually turns up information that helps with the later steps: data preparation, feature engineering, and modeling.

The data looks like the sample below; for details of each attribute, please see the dataset description on Kaggle.

Data source: REES46 Marketing Platform for Kaggle

After checking the basic descriptive statistics, a few questions look interesting enough to explore further. Since the data set contains time, product information, and price, here are some business questions that the current data can answer.

1. What does daily traffic look like in November?

There were 3,696,117 customer visits in November in total, but it’s unlikely that they were spread evenly across the month.

We can see a big spike around Nov 16th and 17th, and my guess is that it was a sales event. The daily price trend of individual products supports this (e.g., product_id = 1003461, Xiaomi, or 1005115, Apple): the price of each product is consistently lower on the 16th and 17th.
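
A minimal sketch of how such a daily traffic series could be computed with pandas. It assumes the events carry `event_time` and `user_id` columns and treats distinct users per day as "visits"; both are assumptions on my side, and the tiny inline frame is a made-up stand-in for the real CSV.

```python
import pandas as pd

# Made-up stand-in rows; the real events would be loaded from the Kaggle CSV.
events = pd.DataFrame({
    "event_time": ["2019-11-15 08:00:00 UTC", "2019-11-16 09:30:00 UTC",
                   "2019-11-16 10:00:00 UTC", "2019-11-16 21:15:00 UTC",
                   "2019-11-17 11:45:00 UTC"],
    "user_id": [1, 1, 2, 3, 2],
})

def daily_visitors(df):
    """Count distinct users per calendar day (assumed definition of a visit)."""
    day = pd.to_datetime(df["event_time"], utc=True).dt.date
    return df.groupby(day)["user_id"].nunique()

traffic = daily_visitors(events)
print(traffic)
```

Plotting `traffic` as a line chart would make a spike like the one around the 16th and 17th easy to spot.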

2. What product categories and brands are the most popular?

The top 5 categories people view are ‘electronics.smartphone’, ‘electronics.video.tv’, ‘computers.notebook’, ‘electronics.clocks’, and ‘apparel.shoes’, as shown in the treemap below. When I check the brands in the purchase events, the most popular ones are Samsung, Apple, Xiaomi, and other electronics brands. Clearly, the majority of customers come to shop for electronics. If the same pattern shows up in other months’ data as well, it raises an open strategic question for managers: should the store concentrate on this particular category instead of being a multi-category store, or is it still beneficial to carry other categories?
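
As a rough illustration, a ranking like this could be produced by counting view events per category. The column names `event_type` and `category_code` are assumed, and the rows are invented for the example.

```python
import pandas as pd

# Invented sample rows, only to show the shape of the computation.
sample = pd.DataFrame({
    "event_type": ["view", "view", "view", "view", "cart", "view"],
    "category_code": ["electronics.smartphone", "electronics.smartphone",
                      "computers.notebook", None, "electronics.smartphone",
                      "apparel.shoes"],
})

def top_categories(df, n=5):
    """Return the n most viewed category codes, skipping rows without a category."""
    viewed = df.loc[df["event_type"] == "view", "category_code"].dropna()
    return viewed.value_counts().head(n)

top = top_categories(sample, n=5)
print(top)
```

The same function applied to the purchase events and the `brand` column (again an assumed name) would give the brand ranking.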

Treemap of top 30 categories

3. Does the customers' purchase journey follow the typical funnel (view => cart => purchase)?

The pie chart shows that only a small portion of people actually add an item to the cart, but once an item is in the cart, the purchase conversion is roughly 30% (1.36% divided by 4.49%). Though each step of the conversion funnel can be optimized, in this project I focus on the cart-to-purchase conversion.
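
The arithmetic behind the 30% figure can be sketched as follows. The helper `funnel_shares` is a hypothetical name, and the ratio is taken at the event level, which is my reading of the pie chart rather than something the data description confirms.

```python
from collections import Counter

def funnel_shares(event_types):
    """Return each event type's share of all events."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Toy event log: 8 views, 1 cart, 1 purchase.
shares = funnel_shares(["view"] * 8 + ["cart"] + ["purchase"])

# Shares quoted in the article: cart 4.49% and purchase 1.36% of events.
cart_share, purchase_share = 0.0449, 0.0136
cart_to_purchase = purchase_share / cart_share
print(f"cart-to-purchase conversion: {cart_to_purchase:.1%}")  # prints 30.3%
```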

Will customers purchase the products they add to the shopping cart?

Now I am going to build the prediction model. For this use case, I only use “Cart” and “Purchase” data. Furthermore, I also re-engineer the data structure by introducing a few new features:

  • category_code_level1: category
  • category_code_level2: sub-category
  • event_weekday: weekday of the event
  • activity_count: number of activities in that session, including all event types
  • is_purchased: whether the item put in the cart is purchased

Together with two more features from the raw data, ‘brand’ and ‘price’, the data set is good to go. If you are interested in the data cleaning and feature engineering process, please check the code here.
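
The features listed above could be derived roughly like this. It is a sketch, not the project's actual code: the raw column names (`event_time`, `event_type`, `product_id`, `category_code`, `user_session`) are assumed, the rows and prices are invented, and the labelling rule for is_purchased (same session, same product) is a hypothetical choice.

```python
import pandas as pd

# Invented rows for illustration only.
raw = pd.DataFrame({
    "event_time": ["2019-11-16 09:30:00 UTC", "2019-11-16 09:35:00 UTC",
                   "2019-11-16 09:40:00 UTC", "2019-11-18 20:00:00 UTC"],
    "event_type": ["view", "cart", "purchase", "cart"],
    "product_id": [1003461, 1003461, 1003461, 1005115],
    "category_code": ["electronics.smartphone"] * 4,
    "brand": ["xiaomi", "xiaomi", "xiaomi", "apple"],
    "price": [130.0, 130.0, 130.0, 950.0],
    "user_session": ["s1", "s1", "s1", "s2"],
})

def build_features(df):
    # activity_count: all events in the session, whatever their type.
    activity = df.groupby("user_session").size().reset_index(name="activity_count")

    # Hypothetical rule: a carted product is "purchased" if the same
    # session also contains a purchase event for that product.
    bought = df.loc[df["event_type"] == "purchase", ["user_session", "product_id"]]
    bought = bought.drop_duplicates().assign(is_purchased=1)

    carts = df[df["event_type"] == "cart"].copy()
    levels = carts["category_code"].str.split(".", expand=True)
    carts["category_code_level1"] = levels[0]
    carts["category_code_level2"] = levels[1]
    # Monday is 0 and Sunday is 6.
    carts["event_weekday"] = pd.to_datetime(carts["event_time"], utc=True).dt.weekday

    carts = carts.merge(activity, on="user_session", how="left")
    carts = carts.merge(bought, on=["user_session", "product_id"], how="left")
    carts["is_purchased"] = carts["is_purchased"].fillna(0).astype(int)
    return carts[["category_code_level1", "category_code_level2", "event_weekday",
                  "activity_count", "brand", "price", "is_purchased"]]

features = build_features(raw)
print(features)
```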

Modeling

In my analysis, I use the XGBoost classifier (XGBClassifier). XGBoost is an implementation of gradient-boosted decision trees known for good performance. Considering the amount of data, I also randomly downsample the raw data (500,000 records for each class: purchased and not purchased), which avoids the issue of class imbalance as well. I aim to get a quick first result and then think about the next iteration of model improvement and tuning.
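
Balanced random downsampling of this kind can be sketched with pandas alone. The function name and the cap at the smallest class size are my own additions for the example; the 500,000 default simply mirrors the figure above.

```python
import pandas as pd

def downsample_balanced(df, label="is_purchased", n_per_class=500_000, seed=42):
    """Draw the same number of random rows from each class."""
    # Cap at the smallest class so sampling without replacement always works.
    n = min(n_per_class, int(df[label].value_counts().min()))
    parts = [group.sample(n=n, random_state=seed) for _, group in df.groupby(label)]
    # Shuffle so the two classes are not stored as consecutive blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame with 8 negatives and 3 positives.
toy = pd.DataFrame({"is_purchased": [0] * 8 + [1] * 3, "price": range(11)})
balanced = downsample_balanced(toy, n_per_class=2)
print(balanced["is_purchased"].value_counts())
```

Fixing `seed` keeps the sample reproducible between runs, which helps when comparing model iterations.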

The F-beta score is 0.68 and recall is 0.74, which is okay for a first simple model. However, I am more curious about which features play an important role in predicting a purchase.
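
For reference, both metrics follow directly from the confusion-matrix counts. The sketch below uses toy counts, not the model's actual results, and leaves beta as a parameter because the value used for the score above is not stated.

```python
def precision(tp, fp):
    """Share of predicted purchases that were real purchases."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of real purchases that the model caught."""
    return tp / (tp + fn)

def fbeta(tp, fp, fn, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# Toy counts: 3 true positives, 1 false positive, 3 false negatives.
f1 = fbeta(tp=3, fp=1, fn=3, beta=1)
f2 = fbeta(tp=3, fp=1, fn=3, beta=2)
print(round(f1, 3), round(f2, 3))
```

With precision above recall, as in these toy counts, the score drops as beta grows, because more weight shifts to the weaker recall.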

Feature importance gives us some insight: event_weekday and activity_count seem to dominate the prediction for this data set. When reviewing the daily traffic, I can see increases during weekends and holidays. It makes sense that customers spend more time on the site on these days and make purchases. However, there might also be an implicit reason, such as the promotion I mentioned before: people may buy on a particular date because of the price rather than out of habit. To verify this hypothesis, I need more context about the store.

Activity_count is the total number of event records in a particular user session, and it serves as a proxy for customer engagement. Its importance suggests that interactions between customers and the site really matter. This might be because customers usually spend time comparing items and gathering information before making a buying decision. If that is the case, making the UI friendlier or introducing features such as “the lowest price guarantee” or “other people also look at” to help customers speed up their buying decisions should eventually drive up conversion.

Conclusion and Future Work

Looking at the result, the analysis suggests that customer activity on the site is the key driver of whether a customer buys, so we can reconsider our strategy for developing the model further. For example, to enhance the prediction capability by exploring the personal purchase experience, we might need clickstream data for more event types, such as which components a customer clicks. We might also need customers’ profile data so that we can cluster customers by type, which would also help improve the product recommendation system. Since purchase behavior is very personal, having data from various facets allows us to better analyze which characteristics and features are worth working on.

Apart from acquiring more data points, it could be useful to try other classifiers and to tune the model’s parameters. To scale this, it would be better to build a training pipeline in which we can experiment faster and pick the best-performing result for our next model. In my other project, Finding donors, I showcase how to build such a pipeline so that you can try many different classifiers and parameter settings efficiently.

Once the model is ready to predict customer purchases, it can be used as an actionable marketing tool: when customers seem to be changing their minds, we can trigger win-back approaches. That is the real business value we can derive from the prediction!

To see more about this analysis, follow the link to my GitHub, available here.
