A-Z Feature Engineering With Starbucks

Nadim Kawwa
Towards Data Science
11 min read · Aug 1, 2019

Photo by Erik Mclean on Unsplash

It is estimated that 80% of the data science process is dedicated to gathering and molding the data into a workable shape. Many organizations sit on heaps of data that can come in various shapes and forms.

In this post we are going to see how business acumen and a customer-centric approach translate into deriving useful features from raw data, a process known as feature engineering.

The data can be found on GitHub and contains simulated customer behavior on the Starbucks rewards mobile app, as it would happen in real life. Once every few days, Starbucks sends out an offer to users of the mobile app.

A rigorous methodology at the beginning can translate into a seamless implementation, such as making a recommendation engine.

This post is verbose in explanation and code comments: It seeks to elucidate every step that goes into the feature engineering process.

The Dataset

Photo by Etty Fidele on Unsplash

The dataset is divided into three JSON files:

  • portfolio: contains offer ids and meta data about each offer (duration, type, etc.)
  • profile: demographic data for each customer
  • transcript: records for transactions, offers received, offers viewed, and offers completed

Each file deals with a theme of the feature engineering process. Our end goal is to funnel the three files into a comprehensive user-feature dataframe and a user-item matrix. We will make use of Python 3.x libraries such as pandas (pd), numpy (np), matplotlib (plt), and seaborn (sns).

Portfolio

Photo by Bundo Kim on Unsplash

The first file deals with offers that can be merely an advertisement for a drink or an actual offer such as a discount or buy one get one free (BOGO). Some users might not receive any offer during certain weeks.

We load the portfolio using the pandas.read_json method.
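
A minimal sketch, assuming the three files sit in a local data/ folder as line-delimited JSON records:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # paths are illustrative; each file is a set of line-delimited JSON records
    portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
    profile = pd.read_json('data/profile.json', orient='records', lines=True)
    transcript = pd.read_json('data/transcript.json', orient='records', lines=True)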

We can display the portfolio in a single view:

Below is an interpretation of the original columns; it is advised to check the dtypes attribute at the beginning to confirm the data types.

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e. BOGO, discount, or informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings) — channels through which the offer is sent

We notice that the offer id is a series of strings with no particular meaning; it would be useful to replace those with integers. It would be even more useful if those integers carried some meaning behind the numbering: let a smaller offer id denote an easier difficulty.

We make use of Python's built-in zip and dict constructor to map offers to integers. We also create a dictionary for the other direction should we need to go back.
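
A sketch, sorting by difficulty first so that smaller ids go to easier offers:

    # sort by difficulty so that a smaller integer id means an easier offer
    ordered_ids = portfolio.sort_values('difficulty')['id'].tolist()

    # map each offer hash to an integer, plus a reverse lookup to go back
    offer_to_int = dict(zip(ordered_ids, range(1, len(ordered_ids) + 1)))
    int_to_offer = dict(zip(offer_to_int.values(), offer_to_int.keys()))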

Moreover we can see that channels can only take four values: email, mobile, social, and web. Given that it is a list nested within a column, we use pandas.get_dummies and pandas.Series.
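
One way to do it is to explode each list into rows, one-hot encode, and collapse back to one row per offer; a sketch:

    # expand each nested channels list into its own rows, one-hot encode,
    # then collapse back to one row per offer
    channel_dum = (pd.get_dummies(portfolio['channels'].apply(pd.Series).stack())
                   .groupby(level=0).sum())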

The result is a much more readable table, where a 1 indicates true and a 0 false. We apply the same method to the offer_type column and create an offer_type_dum table.

For reward and difficulty, we do not know much about the units. Is the reward in USD or some app coins? How is difficulty measured? Uncertainty is a part of the data science process; in this case we will leave the two features unchanged.

We will convert the duration into hours, which, as we will see later, matches the units in the transcript dataset. We can create a clean portfolio with pandas.concat, replace, and reset_index.
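
A sketch that puts the pieces together, reusing channel_dum and offer_to_int from above:

    # one-hot encode the offer type as described above
    offer_type_dum = pd.get_dummies(portfolio['offer_type'])

    # assemble the clean portfolio: integer ids, dummy columns, duration in hours
    portfolio_clean = pd.concat(
        [portfolio[['id', 'difficulty', 'reward', 'duration']],
         channel_dum, offer_type_dum], axis=1)
    portfolio_clean['id'] = portfolio_clean['id'].replace(offer_to_int)
    portfolio_clean['duration'] = portfolio_clean['duration'] * 24  # days -> hours
    portfolio_clean = portfolio_clean.sort_values('id').reset_index(drop=True)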

The result is a much cleaner table that can stand on its own.

We can use this cleaned data to create a Pearson correlation heat map with a few lines of code.
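
Something like the following, using the plt and sns aliases from earlier:

    # Pearson correlation across the cleaned portfolio features
    plt.figure(figsize=(10, 8))
    sns.heatmap(portfolio_clean.corr(), annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Portfolio feature correlations')
    plt.show()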

The heat map reveals some interesting correlations, for example BOGO offers tend to be correlated with higher rewards.

In the portfolio we saw how simple cleaning steps can make the data tell a story by itself. In the next section we will see how to deal with missing values.

User Profile

Photo by Nathan Dumlao on Unsplash

We load the profile data in the same manner as the portfolio; below is a sample screenshot:

We have a total of 17,000 observations under the following feature space:

  • age (int) — age of the customer (118 if unknown)
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

Right off the bat we can see that missing values are a concern here. We use the isna and mean methods to see the fraction of missing values:
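
    # fraction of missing values per column
    profile.isna().mean()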

Interestingly enough, both income and gender have the same fraction of missing values. Could it be that for the same users we know nothing about age, gender, and income? We can check the hypothesis with an assert statement:
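
    # unknown ages are recorded as 118 rather than NaN (see above)
    missing = profile['income'].isna()
    assert (missing == profile['gender'].isna()).all()
    assert (profile.loc[missing, 'age'] == 118).all()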

The statement passes: it turns out that these are the same users.

When dealing with missing values, we must ask what variance and bias we are injecting into the model. Dropping all missing values might deprive the model of useful information. For income and age, we look at the distribution of the data without the missing values.

We see that the mean and the median are almost the same. Therefore, if we impute the median, we do not expect either the mean or the variance to change much.

For gender the issue is more delicate. In 2018 Amazon scrapped their AI recruiting tool for bias against women. Gender is self-reported; we will consider those who answered 'O' for other to be in the same category as those who did not wish to answer.

The user id is a series of strings with no particular meaning; we can create a human_to_int dictionary as before.

The final feature is the date when the customer created an account. The feature appears important as a measure of customer tenure. However, a measure of time is of little use if we can't reference it against some point in time. We decide to drop the became_member_on feature.
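
Putting these decisions together, a sketch of the profile cleaning; the exact steps are assumptions in the spirit of the text:

    profile_clean = profile.copy()

    # recode the unknown-age placeholder, then impute the median for both
    # age and income; mean and median are close, so neither moves much
    profile_clean['age'] = profile_clean['age'].replace(118, np.nan)
    profile_clean['age'] = profile_clean['age'].fillna(profile_clean['age'].median())
    profile_clean['income'] = profile_clean['income'].fillna(profile_clean['income'].median())

    # group 'O' (other) with the users who did not report a gender
    profile_clean['gender'] = profile_clean['gender'].replace('O', np.nan)
    gender_dum = pd.get_dummies(profile_clean['gender'], dummy_na=True, prefix='gender')
    profile_clean = pd.concat([profile_clean.drop(columns='gender'), gender_dum], axis=1)

    # map the hash-like user ids to integers, keeping a reverse lookup
    human_to_int = dict(zip(profile_clean['id'], range(1, len(profile_clean) + 1)))
    int_to_human = dict(zip(human_to_int.values(), human_to_int.keys()))
    profile_clean['id'] = profile_clean['id'].map(human_to_int)

    # tenure has no reference point, so drop the signup date
    profile_clean = profile_clean.drop(columns='became_member_on')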

Transcript

Photo by Carli Jeen on Unsplash

The final dataset is also the most interesting and most complex. Below is a sample of the transcript:

The features are:

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value (dict of strings) — either an offer id or a transaction amount, depending on the record

We can tell that the transcript will take a lot of cleaning. As a first step, it is helpful to apply the get_dummies method on the event column.

We can explode the values within the value column by turning it into a series.
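
Both steps, as a sketch:

    # one indicator column per event type
    event_dum = pd.get_dummies(transcript['event'])

    # explode the value dicts so that each key becomes its own column;
    # among the resulting columns are both 'offer id' and 'offer_id'
    value_df = transcript['value'].apply(pd.Series)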

We see that the keys, now the columns, take four different values. It seems that offer id and offer_id are two separate columns; it could be an error in recording the data. We can check this by seeing if the two columns are mutually exclusive:
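
    # rows where both spellings carry a value would signal a real conflict
    value_df[value_df['offer id'].notna() & value_df['offer_id'].notna()]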

The line returns an empty dataframe. This means that no row has a value in both offer id and offer_id; there is no apparent conflict between the two columns. We can remedy this typo with pandas.combine_first:
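
    # fold the two spellings into a single offer_id column
    value_df['offer_id'] = value_df['offer_id'].combine_first(value_df['offer id'])
    value_df = value_df.drop(columns='offer id')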

We can now create a transcript_comb dataframe that combines these columns and makes use of the human_to_int and offer_to_int dictionaries. In addition, we suggest using the map method instead of replace for a performance boost.
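
A sketch of the assembly, reusing event_dum and value_df from above:

    # assemble the combined transcript with integer ids; map does a fast
    # dict lookup per element, whereas replace scans for matches
    transcript_comb = pd.concat(
        [transcript[['person', 'time']], event_dum, value_df], axis=1)
    transcript_comb['person'] = transcript_comb['person'].map(human_to_int)
    transcript_comb['offer_id'] = transcript_comb['offer_id'].map(offer_to_int)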

The combined dataframe transcript_comb looks like:

We can investigate person 2 and see a sample of their history:

Notice that the clock does not start ticking at time=0. This conflicts with the stated description of the data; always verify such statements rather than assuming them true until proven so. The user also has purchases that occurred before the offer, and we have no definite way of knowing if later purchases were made because of the offer.

Moreover, the start time is not tied to some reference, such as the date when the user signed up or the time of day. Therefore we will assume that the observation period is not influenced by seasonality.

Consider a different individual shown below: person 35. This customer received an offer at time=0, opened it at once, and was done within 6 hours.

On the other hand, they received an offer at time=168, opened it at once, received another offer at time=336, opened that one at once as well, and spent money during that period without completing either offer.

What these two cases reveal is that spending is not labeled yet, and that multiple offers can occur at the same time. This is no different from reality where users receive multiple offers that can be combined together.

In some cases the users have not seen the offer and still spend money. It would be a gross overestimation to assume their spending was because of an offer.

We will assume that the user is under the influence of the first offer viewed, which would be the first offer sent. The table below is a schematic of the assumption, and the crux of this entire post:

Consider a single user that will receive two offers O1 and O2, such that time is recorded at time T0 when O1 is received. At T0 the user is not under the influence of any offer. Starting T1 the user has seen O1 and therefore any purchases made are done because of O1.

At T1 the user also receives O2 but has not seen it yet. At T2 the offer O2 is seen, however the user is still under the influence of O1. Only after the validity period of O1 can we consider the purchase done because of O2.

This is a simplification deemed necessary for the scope of this post. In reality a more detailed evaluation would include at least 90 possible offer combinations, and perhaps determine which offer will govern based on metrics such as difficulty, reward, and medium of communication.

Our objective is to create a function that assigns an offer id to each transaction. In addition, we want to measure user behavior, such as time to open and time to complete an offer, where possible, in a temporal table.

We begin with a pairwise function that returns consecutive overlapping pairs from an iterable.
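
This is the classic itertools recipe:

    from itertools import tee

    def pairwise(iterable):
        # s -> (s0, s1), (s1, s2), (s2, s3), ...
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)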

We define person_fill_na.py to assign offer ids to transactions.
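
Below is a simplified sketch of the logic; the column names ('offer received', 'offer viewed', 'offer completed', 'transaction', 'time', 'offer_id') are assumptions, not the original gist:

    def person_fill_na(person_df, portfolio_clean):
        # Label one user's transactions under the post's assumption:
        # purchases are driven by the first offer viewed, for the
        # length of that offer's validity window.
        df = person_df.sort_values('time').copy()
        df['time_to_open'] = np.nan
        df['time_to_complete'] = np.nan
        durations = portfolio_clean.set_index('id')['duration']

        for _, rec in df[df['offer received'] == 1].iterrows():
            offer = rec['offer_id']
            expiry = rec['time'] + durations[offer]

            # first time this offer is viewed within its validity window
            viewed = df[(df['offer viewed'] == 1) & (df['offer_id'] == offer)
                        & df['time'].between(rec['time'], expiry)]
            if viewed.empty:
                continue
            view_time = viewed['time'].iloc[0]
            df.loc[viewed.index[0], 'time_to_open'] = (
                (view_time - rec['time']) / durations[offer])

            # unlabeled purchases between viewing and expiry are attributed
            # to this offer; earlier offers keep the rows they claimed
            spend = ((df['transaction'] == 1) & df['offer_id'].isna()
                     & df['time'].between(view_time, expiry))
            df.loc[spend, 'offer_id'] = offer

            completed = df[(df['offer completed'] == 1) & (df['offer_id'] == offer)
                           & df['time'].between(view_time, expiry)]
            if not completed.empty:
                df.loc[completed.index[0], 'time_to_complete'] = (
                    (completed['time'].iloc[0] - rec['time']) / durations[offer])
        return df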

Note that in the function above, we defined time_to_open and time_to_complete as fractions of the offer duration. Indeed, it makes more sense to measure user responsiveness relative to the offer rather than in absolute terms.

We build upon the function above with transcript_fill_na.py.
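
A plausible wrapper that applies the per-person function to every user:

    def transcript_fill_na(transcript_comb, portfolio_clean):
        # apply the per-person labeling across the whole transcript
        return (transcript_comb
                .groupby('person', group_keys=False)
                .apply(person_fill_na, portfolio_clean=portfolio_clean))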

In anticipation of a pipeline, it is useful to have one cleaner function that ingests data from a database and returns cleaned dataframes. For this purpose we define a transcript_cleaner.py.
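
A sketch with a hypothetical signature; the original gist may differ:

    def transcript_cleaner(transcript, portfolio_clean, human_to_int, offer_to_int):
        # raw transcript in, labeled transcript out: dummies, exploded
        # values, merged offer_id spellings, integer ids, offer labeling
        event_dum = pd.get_dummies(transcript['event'])
        value_df = transcript['value'].apply(pd.Series)
        value_df['offer_id'] = value_df['offer_id'].combine_first(value_df['offer id'])
        value_df = value_df.drop(columns='offer id')

        transcript_comb = pd.concat(
            [transcript[['person', 'time']], event_dum, value_df], axis=1)
        transcript_comb['person'] = transcript_comb['person'].map(human_to_int)
        transcript_comb['offer_id'] = transcript_comb['offer_id'].map(offer_to_int)

        return transcript_fill_na(transcript_comb, portfolio_clean)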

In future applications, we can process the transcript with one line:
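
    # one call, using the hypothetical signature sketched above
    transcript_full = transcript_cleaner(transcript, portfolio_clean,
                                         human_to_int, offer_to_int)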

Putting it all together

Photo by Gabi Miranda on Unsplash

Create a User Dataframe

With all three files cleaned, we can begin assembling the tables with a pandas.merge left join to match one-to-many on a common key. A pandas.groupby statement aggregates statistics by user and item. Moreover, we multiply each channel of communication by the number of times it was seen to distribute weights accordingly.

In the next step we consider which statistics should be aggregated by sum and which by mean. We can see in the code below that the number of transactions should be a sum, whereas measures of time should be a mean.
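
A sketch of the merge and aggregation; the exact feature list and channel weighting are assumptions:

    # left join: every transcript row keeps its offer metadata
    transcript_merged = transcript_full.merge(
        portfolio_clean, how='left', left_on='offer_id', right_on='id')

    # weight each channel by the number of times it was seen
    for channel in ['email', 'mobile', 'social', 'web']:
        transcript_merged[channel] = (transcript_merged[channel]
                                      * transcript_merged['offer viewed'])

    # counts and amounts are summed per user, response times are averaged
    agg_spec = {'transaction': 'sum', 'amount': 'sum',
                'offer received': 'sum', 'offer viewed': 'sum',
                'offer completed': 'sum',
                'email': 'sum', 'mobile': 'sum', 'social': 'sum', 'web': 'sum',
                'time_to_open': 'mean', 'time_to_complete': 'mean'}
    user_stats = transcript_merged.groupby('person').agg(agg_spec)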

Finally we create the user dataframe by inner joins. We also define more features such as:

  • amount_pct: percentage of money spent because of promotion
  • seen_ratio: the ratio of offers seen to offers received
  • completed_ratio: the ratio of offers completed to offers received

We also fill time_to_complete and time_to_open with 1, such that 0 indicates a very high response rate and 1 little to no response.
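
A sketch of the final assembly under the column names used above:

    user_df = profile_clean.merge(user_stats, how='inner',
                                  left_on='id', right_index=True)

    # share of spend attributed to an offer vs. total spend
    spend = transcript_full[transcript_full['transaction'] == 1]
    total = spend.groupby('person')['amount'].sum()
    promo = spend[spend['offer_id'].notna()].groupby('person')['amount'].sum()
    user_df['amount_pct'] = (promo / total).reindex(user_df['id']).fillna(0).values

    user_df['seen_ratio'] = user_df['offer viewed'] / user_df['offer received']
    user_df['completed_ratio'] = user_df['offer completed'] / user_df['offer received']

    # users who never opened or completed an offer get a 1: little to no response
    user_df[['time_to_open', 'time_to_complete']] = (
        user_df[['time_to_open', 'time_to_complete']].fillna(1))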

The user_df is now a massive feature set of each user’s profile and spending habits. The picture below shows a handful of the 23 features.

User-Item Matrix

Creating the user-item matrix is less involved, as we use the transcript_full from the cleaning function.
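
One option is to count completions per user-offer pair (views or spend would work as well):

    # rows are users, columns are offers; entries count completed offers
    user_by_item = (transcript_full[transcript_full['offer_id'].notna()]
                    .pivot_table(index='person', columns='offer_id',
                                 values='offer completed', aggfunc='sum',
                                 fill_value=0))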

The resulting user_by_item dataframe is:

Summary

Photo by Hans Vivek on Unsplash

The cleaning process is by no means a trivial task and requires calculated judgement from all responsible parties. Most real world data is disjointed, comes in a messy form, and needs to be verified for reliability.

To quote from Cathy O’Neil and Rachel Schutt, a data scientist spends her time on:

How the data is going to be used to make decisions, and how it’s going to be built back in the product […] She spends a lot of time in the process of collecting, cleaning, and mining data, because the data is never clean. This process requires persistence, statistics, and software engineering skills — skills that are also necessary for understanding biases in the data, and for debugging logging output from code.

I hope this article was useful to you! Special thanks to Udacity and Starbucks for teaming up and making this data available.
