Optimal Coupon Targeting for Grocery Items: an Instacart Case Study

Charissa R.
Towards Data Science
5 min read · Apr 26, 2018

Or: How I Lost Myself in Feature Engineering

Photo by Leonie Wise on Unsplash

For a recent project I set out to develop an optimal coupon targeting strategy for stores offering their products through Instacart, to increase their sales, and, ultimately, profit.

In order to do so, I looked into the likelihood of an individual user reordering a product included in a past order. More specifically, I wanted to find those Instacart users who might be on the tipping point of reordering a product, and to incentivize them to reorder by sending them targeted coupons. In order to identify these users, I built a machine learning model to assign probabilities to a product being reordered by specific Instacart users in their next order, based on their past behavior. I would then use these probabilities as input for deciding which group to target with coupons for these items, to have maximum ‘persuasive’ effect.

So, what did I do to find the users in this sweet spot? And what do we do with this information?

The Data, Data Structure, and Tools

For this project I used the Instacart data available on kaggle.com. The dataset contains information about more than 3 million Instacart orders in the form of .csv files.

I set up a relational SQL database on AWS to hold all the data, and then used Python’s pandas on a powerful AWS instance to perform the analyses.

The SQL-database of original features had the following structure:

In this structure, the order_products_train table holds the latest cart ordered by each user, and therefore contains the information about whether or not each product in that cart was a reorder. This represents our target variable.

Given that not all products purchased by a user at some point will be reordered, the data showed a class imbalance, and I had to address it to avoid a predictor biased toward predicting ‘no reorder’. To do this, I made use of oversampling after making an 80/20 train-test split.
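As a rough illustration of the split-then-oversample step, here is a minimal sketch in plain Python. It assumes the oversampling is applied to the training portion only, so that the test set keeps its natural class balance; the article does not spell this out. The row layout, the `reordered` label key, and both helper functions are hypothetical names made up for this example, not taken from the project code.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def oversample_minority(train_rows, label_key="reordered", seed=0):
    """Duplicate randomly chosen minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    positives = [r for r in train_rows if r[label_key] == 1]
    negatives = [r for r in train_rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        return train_rows[:]  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy data (assumption): 100 user-product pairs, 10% of which were reordered.
rows = [{"pair_id": i, "reordered": 1 if i % 10 == 0 else 0} for i in range(100)]

train, test = train_test_split_rows(rows)      # 80/20 split first
balanced_train = oversample_minority(train)    # then oversample the training rows only
```

Splitting before oversampling matters in this sketch: if rows were duplicated first, copies of the same row could land in both the training and the test set and inflate the test score.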

The Feature Engineering

Working with the original data and features, I did a lot of feature engineering, combining the original features or extracting meaning from them to make the most of the available data. In total I created 55+ features to more accurately estimate the probability that a user will reorder a product they have ordered in the past.

The features created belonged to one of the following three categories:

  • User Specific
  • Product Specific
  • User * Product Specific

Given the nature of the problem, which is estimating the likelihood that a specific user reorders a specific product, the last category turned out to be the most relevant to the prediction.

Some of the User and User * Product features I created that proved explanatory were:

  • user_total_orders (total number of orders placed by user)
  • user_days_since_order (to be able to assign decreasing weights to orders farther in the past)
  • user_days_between_orders (mean, mode, last 4, last 2)
  • user_cart_new_product_share_mean (share of new products per cart, the higher, the lower the likelihood the user will reorder a product)
  • user_product_orders_share (share of orders that contained the product)
  • user_product_reorders (the number of times the user has reordered the product since the first purchase)
  • user_product_days_between_orders (mean, mode, first 2, last 2)
  • user_product_cart_rank_mean & _mode (the order in which products are added to the cart shows the importance of the products to the user)
  • user_product_in_last_order (boolean for whether or not product was in previous cart)
  • user_product_streak_current (boolean for whether or not user is on streak for that product, in 2+ of past orders)
  • user_product_streak_length_current (length of user’s current ordering streak for the product, conditional on product being in previous cart)
  • user_product_streak_length_latest (length of user’s latest ordering streak for the product)
  • user_product_days_since_last_order (days since last order of the product)
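To make a few of the User * Product features above concrete, here is a small sketch that derives them from one user's order history. The exact definitions in the project may differ: the input layout (a chronological list of carts, each a set of product ids), the function name, and the reading of "on a streak" as "in the 2 or more most recent carts" are all assumptions made for this example.

```python
def user_product_features(orders, product):
    """Compute a few user * product features.

    orders: chronological list of carts for one user, each a set of product ids.
    """
    flags = [product in cart for cart in orders]
    n_with_product = sum(flags)

    # Current streak: consecutive carts containing the product,
    # counted backwards from the most recent cart.
    current = 0
    for present in reversed(flags):
        if not present:
            break
        current += 1

    # Latest streak: the most recent run of consecutive carts containing
    # the product, even if that run ended before the most recent cart.
    latest = 0
    for present in reversed(flags):
        if present:
            latest += 1
        elif latest > 0:
            break

    return {
        "user_product_orders_share": n_with_product / len(orders) if orders else 0.0,
        "user_product_reorders": max(n_with_product - 1, 0),
        "user_product_in_last_order": bool(flags and flags[-1]),
        "user_product_streak_current": current >= 2,  # assumed definition
        "user_product_streak_length_current": current,
        "user_product_streak_length_latest": latest,
    }

# Toy history (assumption): five carts, oldest first.
orders = [{"milk", "eggs"}, {"milk"}, {"milk", "bread"}, {"bread"}, {"eggs"}]
milk = user_product_features(orders, "milk")
eggs = user_product_features(orders, "eggs")
```

In this toy history, milk was in the first three carts and then dropped out, so its current streak is 0 while its latest streak is 3. That kind of gap between the two streak features is one way a lapsed habit can show up in the data.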

The Models

As for the modeling, I used three types of models: Logistic Regression, Naive Bayes, and a Random Forest Classifier.

Given that the objective was to get relative likelihoods of users reordering a specific product, rather than 0–1 predictions about reorders, I did not use the common scoring method for this dataset, F1. Instead, I used ROC-AUC, which combines sensitivity and specificity and indicates the probability that the model ranks a randomly chosen reordered product higher than a randomly chosen non-reordered product.
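The ranking interpretation of ROC-AUC can be checked directly on a tiny example. The sketch below computes AUC by comparing every (reordered, non-reordered) pair of scores and counting how often the reordered item scores higher, with ties counted as half. The function name and the toy labels and scores are made up for illustration; on real data a library routine would be much faster than this quadratic loop.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy example (assumption): 1 = reordered, 0 = not reordered.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)  # 5 of the 6 pairs are ranked correctly
```

Because only the ordering of the scores matters, this metric suits a use case where probabilities are used to rank users rather than to make a hard reorder or no-reorder call.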

Of these three models, the Random Forest Classifier performed best, both on the validation set and on the test set, with a test AUC of 0.824. This means that the model will rank a randomly chosen positive instance (reordered) higher than a randomly chosen negative one (non-reordered) 82.4% of the time. Given time, I plan on doing some more tweaking to the model to see if I can get a higher AUC, but for me, this was primarily an exercise in feature engineering and working with large, relational data.

The Business Product

The results of this model then serve as input to the marketing departments of the various suppliers on Instacart. In practice, a store would select a specific product and look at each user's probability of reordering that product. The store would then select the users in the middle of the ranked reorder probabilities for that product (say, the middle 20%) and work out which kind of coupon makes the most sense and results in the highest increase in profit, taking into account factors such as the conversion rate. For example, a coupon reduces the revenue per product sold with that coupon, but this can be offset by an increase in the number of products sold if the coupon has its intended effect.
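The selection step and the coupon trade-off described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the user ids, and every number (reorder rates with and without a coupon, margin, discount) are hypothetical, and the profit formula is a deliberately simple one that ignores inventory and other costs.

```python
def middle_band(user_probs, band=0.20):
    """Return the user ids in the central `band` share of users, ranked by reorder probability."""
    ranked = sorted(user_probs, key=user_probs.get)
    n = len(ranked)
    k = max(1, round(n * band)) if n else 0
    start = (n - k) // 2
    return ranked[start:start + k]

def coupon_profit_change(n_targets, base_rate, coupon_rate, margin, discount):
    """Expected profit with the coupon minus expected profit without it.

    base_rate / coupon_rate: assumed reorder rates without / with the coupon.
    margin: profit per unit at full price; discount: coupon value per unit.
    """
    without_coupon = n_targets * base_rate * margin
    with_coupon = n_targets * coupon_rate * (margin - discount)
    return with_coupon - without_coupon

# Hypothetical reorder probabilities for ten users and one product.
user_probs = {f"u{i}": (i + 0.5) / 10 for i in range(10)}
targets = middle_band(user_probs, band=0.20)

# Hypothetical scenario: the coupon lifts the reorder rate from 40% to 60%,
# at a cost of 0.50 per unit on a 2.00 margin, across 100 targeted users.
change = coupon_profit_change(100, base_rate=0.4, coupon_rate=0.6, margin=2.0, discount=0.5)
```

In this made-up scenario the extra units sold more than make up for the lower margin per unit. With a smaller lift in the reorder rate, or a larger discount, the same function returns a negative number, which is the case a store would want to avoid.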

The end product is a dashboard in which business decision makers in the stores can look up a product and assess the number of ‘target’ users for it. The dashboard might look something like this:

Supplemented with inventories, product prices, and coupon details, this dashboard would help marketers find the optimal combination of product, targets, and coupon for their marketing efforts, and ultimately increase sales and profitability.

You can find the relevant code for this project in this GitHub repo. Thanks for your interest, more to come shortly. In the meantime, feel free to comment / ask away!

Data Scientist with a passion for social issues, health, and education - and the Oxford comma.