Hands-on Tutorials

Starbucks Offer Dataset — Udacity Capstone

An investigation of ‘wasted offers’

Linda Chen
Towards Data Science
12 min read · Oct 31, 2020


Photo by Karl Fredrickson on Unsplash

Introduction

The Starbucks Offer Dataset is one of the datasets that students can choose from to complete their capstone project for Udacity’s Data Science Nanodegree. The dataset contains simulated data that mimics customer behavior after receiving Starbucks offers. The data is collected via the Starbucks rewards mobile app, and the offers were sent out once every few days to the users of the app.

The data comes in 3 different JSON files.

File descriptions provided by Udacity:

  • portfolio.json — offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

There are 3 different types of offers: Buy One Get One Free (BOGO), Discount, and Informational, which is purely an advertisement. However, for each type of offer, the duration, difficulty, or promotional channels may vary. The dataset contains 300,000+ simulated transactions.

The goal of this project was not defined by Udacity, so it is open-ended. Nonetheless, from the standpoint of providing business value to Starbucks, the question is always one of two: how do we increase sales, or how do we save money? The question of how to save money is not about not spending at all, but about not spending money on ineffective things.

There are many things to explore from either angle. One caveat given by Udacity drew my attention: it warned that some offers are used without the user knowing it, because users do not opt in to the offers; the offers are simply given. Thus, if some users will spend at Starbucks regardless of having an offer, we might as well save those offers.

I thought this was an interesting problem and decided to investigate it. I wanted to see if I could find out who these users are and whether we could avoid or minimize this from happening. In the following article, I will walk through how I investigated this question, following the CRISP-DM process. If you’re not familiar with the concept, here is an article I wrote to catch you up.

Business Understanding and Data Understanding

Let’s first take a look at the data. From the datasets, it is clear that we need to combine all three in order to perform any analysis. Also, the datasets need a lot of cleaning, mainly because there are many categorical variables.

Here are the five business questions I would like to address by the end of the analysis. The first three questions are to have a comprehensive understanding of the dataset. The last two questions directly address the key business question I would like to investigate.

  • Q1: Which one is the most popular offer?
  • Q2: Do different groups of people react differently to offers?
  • Q3: Do people generally view and then use the offer, or do they use the offer without noticing it?
  • Q4: Which group of people is more likely to use the offer or make a purchase WITHOUT viewing the offer, if there is such a group?
  • Q5: Which type of offer is more likely to be used WITHOUT being viewed, if there is one?

After I played around with the data a bit, I decided to focus only on the BOGO and discount offers for this analysis, for 2 main reasons. One was that I believed BOGO and discount offers had a different business logic from the informational offer/advertisement. For BOGO and discount offers, we want to identify people who used them without knowing it, so that we are not giving away money for no gain. For the advertisement, we want to identify which group is incentivized to spend more. In other words, one logic is to identify the loss while the other is to measure the increase.

The other reason is linked to the first: scope. Due to the different business logic, I limited the scope of this analysis to answering the question: who are the users that ‘wasted’ our offers, and how can we avoid it? Therefore, I did not analyze the informational offer type.

Success Metrics and their justifications

To repeat, the business question I wanted to address was the phenomenon in which users use our offers without viewing them. In other words, the offers did not serve as an incentive to spend, and thus they were wasted. Therefore, the key success metric is whether I can identify this group of users and the reason behind this behavior. In addition, it would be helpful if I could build a machine learning model to predict when this is likely to happen. In that case, the company would be in a better position not to waste the offers.

To be explicit, the key success metric is whether I can give a clear answer to all the questions listed above, because being able to answer them means I can clearly identify the group of users who show this behavior and make some educated guesses about why.

For the machine learning model, I focused on cross-validation accuracy and the confusion matrix as evaluation metrics. The accuracy score is important because the purpose of my model is to help the company predict when an offer might be wasted. Therefore, the higher the accuracy, the better.

Of course, when a dataset is highly imbalanced, the accuracy score is not a good indicator of actual performance; a precision score, f1 score, or confusion matrix is better. In this case, however, imbalance is not a big concern. I built two dummy models: one guessed randomly and the other always predicted the majority class. They scored 51% and 57% accuracy respectively, which shows that the dataset is not highly imbalanced.
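
To illustrate, here is a minimal sketch of such baseline checks using scikit-learn’s DummyClassifier; the feature matrix X and label vector y are assumed to be prepared already, and this is not necessarily my exact code:

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# X, y: the prepared feature matrix and 'wasted' label (assumed ready).
for strategy in ('uniform', 'most_frequent'):  # random guess vs. majority class
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    acc = cross_val_score(dummy, X, y, cv=5, scoring='accuracy').mean()
    print(f'{strategy}: {acc:.0%}')
```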

I picked the confusion matrix as the second evaluation metric, as important as cross-validation accuracy. The reason is that the business costs associated with a False Positive and a False Negative might be different, so it is good to know which type of error the model is more prone to. Some people like the f1 score; however, I find the f1 score a bit confusing to interpret, so I stuck with the confusion matrix. To better understand Type 1 and Type 2 errors, here is another article I wrote earlier with more details.

Data Preparation

In the data preparation stage, I did 2 main things. One was to merge the 3 datasets; the other was to turn all categorical variables into numerical representations.

One difficulty in merging the 3 datasets was that the ‘value’ column in the transcript dataset contained both the offer id and the dollar amount. In addition, each cell in that column was a dictionary object. Here’s how I separated the column so that the dataset could be combined with the portfolio dataset using ‘offer_id’.
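
A minimal sketch of that step, assuming the transcript data is loaded into a pandas DataFrame called transcript; the exact dictionary key spellings are assumptions:

```python
import pandas as pd

# Assumes transcript is already loaded, e.g.:
# transcript = pd.read_json('transcript.json', orient='records', lines=True)

# Each cell of 'value' is a dict such as {'offer id': ...}, {'offer_id': ...}
# or {'amount': ...}; the key names here are assumptions about the raw data.
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
transcript = transcript.drop(columns='value')
```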

When turning categorical variables into numerical variables, there were 2 trickier columns: the ‘year’ column and the ‘channel’ column. The ‘year’ column was tricky because the order of the numerical representation matters. For example, a mapping like 0–2017, 1–2018, 2–2015, 3–2016, 4–2013 goes against our intuition. However, for other variables, like ‘gender’ and ‘event’, the order of the numbers does not matter. Thus I wrote a function for categorical variables that do not need to consider order.
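
A sketch of the idea; the helper name and the assumption that profile already has a ‘year’ column in this form are illustrative:

```python
def encode_unordered(df, col):
    """Map each unique value of an unordered categorical column to an integer."""
    mapping = {val: i for i, val in enumerate(df[col].dropna().unique())}
    df[col] = df[col].map(mapping)
    return df, mapping

# 'year' is ordinal, so map it explicitly to keep the natural order.
year_order = {2013: 0, 2014: 1, 2015: 2, 2016: 3, 2017: 4, 2018: 5}
profile['year'] = profile['year'].map(year_order)

# Unordered columns such as 'gender' can go through the generic helper.
profile, gender_map = encode_unordered(profile, 'gender')
```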

The ‘channel’ column was tricky because each cell was a list of objects. There were two ways to approach this. One way was to turn each channel into its own column and use 1/0 to represent whether that row used the channel. However, I took the other approach. I realized that there were 4 different combos of channels, and I wanted to know how different combos impact each offer differently. Therefore, I treated each list of items as one thing. Here is how I handled it.
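
A sketch of that approach, assuming the portfolio DataFrame has a ‘channels’ column of lists:

```python
# Collapse each list of channels into a single combo string,
# e.g. ['web', 'email', 'mobile'] -> 'email_mobile_web'.
portfolio['channels'] = portfolio['channels'].apply(
    lambda chs: '_'.join(sorted(chs)))

# The 4 combo strings can then be encoded like any other categorical column.
combo_map = {c: i for i, c in enumerate(portfolio['channels'].unique())}
portfolio['channels'] = portfolio['channels'].map(combo_map)
```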

EDA and Results

Q1: Which one is the most popular offer?

Left: absolute numbers, Right: percentages | Image by Author

Answer: The discount offer is more popular: not only does it have a slightly higher number of ‘offer completed’ events in absolute terms, it also has a higher overall completed/received rate (~7% higher). However, it is worth noting that the BOGO offer has a much greater chance of being viewed by customers.

Q2: Do different groups of people react differently to offers?

In both graphs, red (‘N’) represents offers that were received or viewed but not completed, and green (‘Yes’) represents ‘offer completed’.

For BOGO offer
For Discount offer

Answer: For both offers, men have a significantly lower chance of completing them. More loyal customers, those who joined 5–6 years ago, also have a significantly lower chance of using either offer. Comparing the 2 offers, women use BOGO slightly more while men use discount more. However, there is no big difference between the 2 offers just by eyeballing the charts.

Q3: Do people generally view and then use the offer, or do they use the offer without noticing it?

Answer: The peak of ‘offer completed’ came slightly before the peak of ‘offer viewed’ in the first 5 days of the experiment. The two sync up better as time goes by, indicating that the majority of people used the offer knowingly. The gap between ‘offer completed’ and ‘offer viewed’ also decreased over time.

Q4: Which group of people is more likely to use the offer or make a purchase WITHOUT viewing the offer, if there is such a group?

I picked out the customer ids whose first event for an offer was ‘offer received’ followed by the second event ‘offer completed’. These are the people who skipped ‘offer viewed’. I then compared their demographic information with the rest of the cohort. This was the trickiest part of the project because I needed to figure out how to extract the second response to each offer. Here is how I did it.
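
A sketch of how such a sequence check can be done with pandas, assuming the transcript has ‘person’, ‘offer_id’, ‘time’, and ‘event’ columns (the names are assumptions):

```python
# Sort each customer's events per offer chronologically, then look at the
# event that directly follows each one within the same (person, offer) group.
events = transcript.sort_values(['person', 'offer_id', 'time'])
events['next_event'] = (events.groupby(['person', 'offer_id'])['event']
                        .shift(-1))

# Customers whose 'offer received' is immediately followed by
# 'offer completed' skipped the 'offer viewed' step.
skipped_view_ids = events.loc[
    (events['event'] == 'offer received')
    & (events['next_event'] == 'offer completed'), 'person'].unique()
```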

Here are the results of my analysis:

Answer: As you can see, there were no significant differences, which was disappointing. This indicates that all customers are equally likely to use our offers without viewing them. So, could it be more related to the way we design our offers? Let’s look at the next question.

Q5: Which type of offer is more likely to be used WITHOUT being viewed, if there is one?

As you can see, the design of the offer did make a difference. For example, the blue sector, the offer whose id ends in ‘1d7’, is significantly larger (~17%) than its expected share under an even distribution. The offer ending in ‘2a4’ was also 4–5% larger than expected. Here is the information about the offers, sorted by how many times they were used without being viewed.

Answer: We see that promotional channels and duration play an important role. Here are the things we can conclude from this analysis.

  • If an offer is promoted through web and email, it has a much greater chance of not being seen
  • Being used without viewing is linked to the duration of the offer; a longer duration increases the chance
  • The discount offer type also has a greater chance of being used without being seen, compared to BOGO

Modeling and Evaluation

The main question that I wanted to investigate, who are the people that wasted the offers, has been answered by the previous data engineering and EDA. The purpose of building a machine-learning model was to predict how likely an offer is to be wasted. If the chance is high, we can calculate the business cost and reconsider the decision. Thus, the model can help minimize the situation of ‘wasted offers’.

For the model choice, I was deciding between decision trees and logistic regression. I narrowed it down to these two because it would be useful to have predicted class probabilities in this case: we can know how confident we are about a specific prediction. In addition, we can decide that only if there is a 70%+ chance that a customer will waste an offer will we consider withdrawing it. It doesn’t make much sense to withdraw an offer just because the customer has a 51% chance of wasting it.
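
For illustration, here is how such a threshold could be applied with a fitted scikit-learn model; model and X_new are placeholders, not artifacts from my actual code:

```python
# Probability that each new offer instance will be wasted (label 1).
proba_wasted = model.predict_proba(X_new)[:, 1]

# Flag an offer for possible withdrawal only above a 70% threshold,
# rather than at the default 50% decision boundary.
withdraw_candidates = proba_wasted > 0.70
```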

I finally picked logistic regression because it is more robust. Decision trees often require more tuning and are more sensitive to issues like imbalanced datasets. Our dataset is slightly imbalanced:

1- ‘wasted offers’: 31723
0- ‘used offers’: 23499

One important step before modeling was to get the label right. In this case, the label ‘wasted’ means that the customer either did not use the offer at all OR used it without viewing it. Here is how I created this label.
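
A simplified sketch of the labeling logic, assuming one row per event in transcript; note that it ignores event ordering, which a fuller version would check:

```python
# One row per (person, offer) pair: collect the set of events it generated.
events_per_offer = transcript.groupby(['person', 'offer_id'])['event'].apply(set)

def is_wasted(events):
    # 'wasted' = never completed, or completed without being viewed.
    # NOTE: this ignores ordering; a fuller version would also verify
    # that the view happened before the completion.
    completed = 'offer completed' in events
    viewed = 'offer viewed' in events
    return int(not completed or not viewed)

labels = events_per_offer.apply(is_wasted).rename('wasted').reset_index()
```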

I then dropped all other events, keeping only the ‘wasted’ label. I left-merged this dataset with the profile and portfolio datasets to get the features that I needed. In the end, the data frame looked like this:

I used GridSearchCV to tune the “C” parameter of the logistic regression model. I kept the default “l2” penalty, because we don’t have too many features in the dataset. Here is the code:
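
In outline, the tuning step looks like this; the exact C grid and variable names are illustrative rather than my precise settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune the regularization strength C, keeping the default 'l2' penalty.
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),  # higher max_iter to ensure convergence
    param_grid,
    cv=5,
    scoring='accuracy',
)
grid.fit(X_train, y_train)  # X_train, y_train: prepared features and labels
print(grid.best_params_, grid.best_score_)
```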

The best model achieved 71% cross-validation accuracy and a 75% precision score. In the confusion matrix, the number of False Positives (~15%) was greater than the number of False Negatives (~14%), meaning that the model is more likely to make mistakes on offers that will not actually be wasted.

To improve the model, I downsampled the majority label to balance the dataset. The reasons I used downsampling instead of other methods like upsampling or SMOTE were: 1) we have sufficient data even after downsampling; 2) to my understanding, the imbalance was not due to a biased data-collection process but due to having fewer available samples. In this case, using SMOTE or upsampling can cause the problem of overfitting the dataset. For more details, here is another article where I went in-depth into this issue.
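
A minimal sketch of the downsampling step, assuming the labeled data frame df has the ‘wasted’ column described above:

```python
import pandas as pd
from sklearn.utils import resample

# Downsample the majority class (1, 'wasted') to the size of the minority.
majority = df[df['wasted'] == 1]
minority = df[df['wasted'] == 0]
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
df_balanced = pd.concat([majority_down, minority])
```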

After balancing the dataset, the cross-validation accuracy of the best model increased to 74%, while the precision score stayed at 75%. In the confusion matrix, False Positives decreased to 11% while False Negatives rose to 15%. This means the model is now more likely to make mistakes on offers that will actually be wasted. The model has lots of potential to be further improved by tuning more parameters or trying tree models like XGBoost. However, I stopped here due to personal time and energy constraints.

Summary/Recap

In summary, I have walked you through how I processed and merged the 3 datasets so that I could do the analysis. I talked about how I used EDA to answer the business questions I asked at the beginning of the article. In the process, you could see how I needed to process the data further to suit my analysis. I also highlighted the most difficult part of handling the data and how I approached it.

The result was fruitful. I successfully answered all the business questions I asked. Although, after the investigation, it seems it was wrong to ask: who were the customers that used our offers without viewing them? Rather, the question should be: why were our offers being used without being viewed? The reason is that demographics do not make a difference, but the design of the offer does.

Finally, I built a machine learning model using logistic regression. I explained why I picked the model, how I prepared the data for modeling, and the results. I used 3 different metrics to evaluate the model: cross-validation accuracy, precision score, and the confusion matrix. I want to end this article with some suggestions for the business and potential future studies.

Final Recommendation and Future Studies

To avoid or reduce the situation of offers being used without being viewed, I suggest the following:

  • promote the offer via at least 3 channels to increase exposure
  • eliminate offers that last 10 days and cap duration at 7 days. If an offer’s difficulty is really high (e.g., level 20), a customer is much less likely to work towards it; meanwhile, those who do achieve it would likely have reached that amount of spending regardless of the offer.

Another suggestion: I believe there is a lot of potential in the discount offer. The completion rate is 78% among those who viewed it. Therefore, if the company can increase the viewing rate of discount offers, there is a great chance to incentivize more spending.

For future studies, there is still a lot that can be done. The two most obvious things are to perform an analysis that incorporates the data from the informational offers and to improve my current model’s performance. It will be interesting to see how customers react to informational offers and whether the advertisement also helps the performance of the BOGO and discount offers. It would be very helpful to increase my model’s accuracy to above 85%.

My full repo can be accessed here.
