Yet Another Full Stack Data Science Project

A CRISP-DM Implementation

Ram Saran Vuppuluri
Towards Data Science

--

Photo by rawpixel.com from Pexels

Introduction

I came across the term “Full Stack Data Science” for the first time a couple of years ago while searching for Data Science meetups in the Washington, D.C. region.

Coming from a software development background, I am quite familiar with the term Full Stack Developer, but Full Stack Data Science sounded mystical.

With more and more companies incorporating Data Science & Machine Learning into their traditional software applications, the term Full Stack Data Science makes more sense now than at any point in history.

Software development methodologies have been meticulously refined over the years to ensure high-quality software applications with short turnaround times. Unfortunately, traditional software development methodologies do not work well in the context of Data Science applications.

In this blog post, I am going to focus on the Cross-Industry Standard Process for Data Mining (CRISP-DM) as a way to develop a viable full stack Data Science product.

I firmly believe in the proverb “the proof of the pudding is in the eating,” so I have implemented the Starbucks challenge by applying the CRISP-DM methodology as a sister project for this post; it is referred to at multiple places in this article.

Starbucks Dataset Overview

Starbucks has generated the data set using a simulator program that mimics how people make purchasing decisions and how promotional offers influence those decisions.

Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable characteristics. People perform various events, including receiving offers, opening offers, and making purchases.

As a simplification, there are no specific products to track; only the amounts of each transaction or offer are recorded.

There are three types of offers that can be sent:

  • buy-one-get-one (BOGO)
  • discount, and
  • informational

In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount.

With a discount offer, a user gains a reward equal to a fraction of the amount spent.

In an informational offer, there is no reward, but neither is there a required amount that the user is expected to spend. Offers can be delivered via multiple channels.

The primary task is to use the data to identify which groups of people are most responsive to each type of offer, and how best to present these offers.

What is the Cross-Industry Standard Process for Data Mining (CRISP-DM)?

The cross-industry standard process for data mining (CRISP-DM) methodology is an open standard process that describes conventional approaches used by data mining experts.

CRISP-DM is a cyclic process that breaks down into six phases.

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment
Cross-Industry Standard Process for Data Mining (CRISP-DM), by Kenneth Jensen

Business Understanding

This phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan.

For the current scenario, we are going to:

  1. Perform exploratory data analysis on multivariate frequency distributions.
  2. Build a machine learning model that predicts whether or not someone will respond to an offer.
  3. Build a machine learning model that predicts the best offer for an individual.
  4. Build a machine learning model that predicts purchasing habits.

Data Understanding

There are two ways in which the Data Understanding phase is practiced:

  1. Start with an existing data collection and proceed with activities to get familiar with the data, to discover first insights into the data, or to detect interesting subsets to form hypotheses for concealed information.
  2. Recognize specific interesting questions and then collect data related to those questions.

The transition from the Business Understanding phase to the Data Understanding phase is not linear; instead, it is cyclic.

In this project, we are going to work only with the data provided by Starbucks, accepting its inherent limitations; we are therefore practicing the first method.

Starbucks has distributed data in three json files:

  • portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed
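Assuming the three files are line-delimited JSON (one record per line), they can be loaded with pandas. The snippet below demonstrates the pattern on an in-memory sample rather than the actual files, so the column values are illustrative:

```python
import io

import pandas as pd

# Each Starbucks file holds one JSON record per line; read_json with
# lines=True parses that format. A tiny in-memory sample stands in
# for portfolio.json here.
sample = io.StringIO(
    '{"id": "abc", "offer_type": "bogo", "difficulty": 10, "duration": 7}\n'
    '{"id": "def", "offer_type": "discount", "difficulty": 7, "duration": 10}\n'
)
portfolio = pd.read_json(sample, orient="records", lines=True)
print(portfolio.shape)  # (2, 4)
```

For the real files, the same call would be `pd.read_json('portfolio.json', orient='records', lines=True)`, and likewise for profile.json and transcript.json.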

Data Preparation

The data preparation phase covers all activities to construct the final dataset from the initial raw data. Data preparation is often said to consume about 80% of a project's effort.

Data Wrangling and Data Analysis are the core activities in the Data Preparation phase of the CRISP-DM model and are the first logical programming steps. Data Wrangling is a cyclic process; we often need to revisit earlier steps.

Data Wrangling is language and framework independent, and there is no one right way. For the sister project, I am using Python as the programming language of choice and Pandas as the data manipulation framework.

As a rule of thumb, I approach data wrangling in two steps:

  • Assess Data — In this step we perform syntactic and semantic checks on the data and identify any issues, along with potential fixes.
  • Clean Data — In this step we implement the data fixes from the assessment phase. We also run small unit tests to make sure the data repairs work as expected.
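As a sketch of this assess-then-clean loop on a toy profile-like frame (the column names and placeholder age are illustrative here, though they mirror the Starbucks profile data discussed below):

```python
import pandas as pd

# Assess: surface issues programmatically instead of eyeballing them.
profile = pd.DataFrame({
    "age": [35, 118, 42],
    "income": [72000.0, None, 56000.0],
})
issues = {
    "missing_income": int(profile["income"].isna().sum()),
    "placeholder_age": int((profile["age"] == 118).sum()),
}

# Clean: apply the fix identified during assessment...
cleaned = profile.dropna(subset=["income"])

# ...and verify it with a small unit test.
assert cleaned["income"].notna().all()
assert (cleaned["age"] != 118).all()
```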

I performed the Data Wrangling on all three of the data sources provided by Starbucks.

Data Wrangling — portfolio.json

From the visual and programmatic assessment, the Portfolio data set has only ten rows with no missing data.

However, the data is not in a machine-learning-friendly structure. We are going to apply one-hot encoding to the “channels” and “offer_type” columns.
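A sketch of that encoding on toy rows shaped like the portfolio data (the real frame has ten rows; “channels” holds lists, “offer_type” a single category per row):

```python
import pandas as pd

portfolio = pd.DataFrame({
    "id": ["a", "b"],
    "channels": [["email", "web"], ["email", "mobile", "social"]],
    "offer_type": ["bogo", "discount"],
})

# One-hot encode offer_type directly...
offer_dummies = pd.get_dummies(portfolio["offer_type"], prefix="offer")

# ...and expand the list-valued channels column into indicator columns:
# explode one channel per row, build dummies, then collapse back per offer.
channel_dummies = (
    portfolio["channels"].explode().str.get_dummies().groupby(level=0).max()
)

portfolio_ml = pd.concat(
    [portfolio.drop(columns=["channels", "offer_type"]),
     channel_dummies, offer_dummies],
    axis=1,
)
print(list(portfolio_ml.columns))
```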

Data Wrangling — portfolio.json

Data Wrangling — profile.json

From the visual assessment of the Profile data set:

  • the “became_member_on” column is not in DateTime format.
  • if the age information is missing, the age is populated by default with ‘118.’
Age frequency distribution — before data wrangling

From the programmatic assessment of the Profile data set:

  • the gender and income columns have missing data.
  • the rows with missing gender and income information have the age value ‘118.’

The following fixes are implemented on the Profile data set in the clean phase:

  • Drop rows with missing values, which should implicitly drop rows with age ‘118.’
  • Convert became_member_on to Pandas DateTime datatype.

The data is still not in a machine-learning-friendly structure. I will create a new ML-friendly Pandas data frame with the following changes:

  • apply one-hot encoding to the gender column.
  • split the “became_member_on” column into year, month, and day columns, and drop the original became_member_on column.
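A sketch of these cleaning steps on a few toy profile-like rows (the 118 placeholder accompanies missing gender/income, as described above):

```python
import pandas as pd

profile = pd.DataFrame({
    "gender": ["F", None, "M"],
    "age": [55, 118, 33],
    "income": [80000.0, None, 62000.0],
    "became_member_on": [20170612, 20180101, 20160925],
})

# Drop rows with missing values (implicitly removing the age-118 rows)
# and parse the integer date column into a DateTime.
clean = profile.dropna().copy()
clean["became_member_on"] = pd.to_datetime(
    clean["became_member_on"], format="%Y%m%d"
)

# One-hot encode gender and split the membership date into parts.
clean = pd.concat(
    [clean, pd.get_dummies(clean["gender"], prefix="gender")], axis=1
)
clean["member_year"] = clean["became_member_on"].dt.year
clean["member_month"] = clean["became_member_on"].dt.month
clean["member_day"] = clean["became_member_on"].dt.day
clean = clean.drop(columns=["gender", "became_member_on"])
```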
Data Wrangling — profile.json

Once the data wrangling step is completed, there are no rows with missing values (rows with age ‘118’ are implicitly dropped).

Age frequency distribution — after data wrangling

Data Wrangling — transcript.json

From a visual and programmatic assessment, there are no data issues in the Transcript data set.

However, whether a promotion influenced a user is not recorded directly. A user is deemed to be influenced by a promotion only if the individual made a transaction after viewing the advertisement. We will apply multiple data transformations to extract this information.
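One simplified way to derive an “influenced” flag is sketched below. The project's full logic also honors each offer's validity window; this toy version only checks that a view preceded the transaction:

```python
import pandas as pd

# Toy transcript: person p1 views an offer at t=6, then transacts at t=12;
# person p2 transacts without ever viewing an offer.
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p1", "p2"],
    "event": ["offer received", "offer viewed", "transaction", "transaction"],
    "time": [0, 6, 12, 12],
})

# Earliest view time per person.
views = transcript[transcript["event"] == "offer viewed"]
first_view = views.groupby("person")["time"].min()

# A transaction counts as influenced when the same person viewed an
# offer at or before the transaction time. (NaN comparisons are False,
# so people who never viewed an offer are never marked influenced.)
is_txn = transcript["event"] == "transaction"
viewed_before = transcript["person"].map(first_view)
transcript["influenced"] = is_txn & (viewed_before <= transcript["time"])
```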

Data Wrangling — transcript.json

Now that we have all three data frames cleaned, let us consolidate into one data frame.
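The consolidation itself can be as simple as a pair of left joins. The frames below are minimal stand-ins, and the real key columns may be named differently in the project:

```python
import pandas as pd

# Minimal stand-ins for the three cleaned frames.
transcript = pd.DataFrame({"person": ["p1"], "offer_id": ["a"], "amount": [12.5]})
profile = pd.DataFrame({"person": ["p1"], "age": [55], "income": [80000.0]})
portfolio = pd.DataFrame({"offer_id": ["a"], "offer_bogo": [1]})

# Left-join demographics and offer metadata onto each transcript row.
combined = transcript.merge(profile, on="person", how="left").merge(
    portfolio, on="offer_id", how="left"
)
print(combined.shape)  # (1, 6)
```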

Data Wrangling — consolidation

Exploratory Data Analysis — multivariate frequency distributions

From the transcript data, we identify five kinds of events:

  1. Offer received
  2. Offer viewed
  3. Offer completed
  4. Transaction (purchases)
  5. Influenced (only if the purchase was made after the offer was viewed)

The characteristics drawn from the clean data are listed below the corresponding visualizations.

Event Distribution by Gender
  1. Profiles that were registered as Male made the largest number of transactions.
  2. Profiles that were registered as Female are much more likely to be influenced by the promotions.
Event Distribution by Income
  1. Individuals in the data set are within the income range of 30,000 to 120,000.
  2. Most of the individuals in the data set make less than 80,000 per annum.
  3. Most of the transactions were made by individuals earning less than 80,000 per annum.
Event Distribution by Offer Type
  1. BOGO offers have a higher rate of influence.
  2. Informational offers have negligible influence.

Even though data exploration did not yield much new knowledge, it surfaced one critical insight: the target classes are imbalanced. Particularly when working on classification models, this information is vital for deciding which evaluation metrics to use.

Modeling & Evaluation

Modeling is not always a necessary step; it depends solely on the scope of the project. For this project, I am going to build machine learning models that:

  • Predicts whether or not someone will respond to an offer.
  • Predicts the best offer for an individual.
  • Predicts purchasing habits.

All three models are built with ensemble methods.

For classification models, due to the imbalance in the target classes, we are going to use Precision, Recall, and F1 score values as the evaluation metrics.

For the regression model, we are going to use mean squared error and R2 values as the evaluation metrics.

Predicting whether or not someone will be influenced by an offer

Model for predicting whether or not someone will be influenced by an offer

I have employed a grid search to find the model that yields a high F1 score. An AdaBoostClassifier with a learning rate of ~1.87 and 10 estimators produced a consolidated F1 score of 0.87 on both the training and testing datasets, striking the right balance between bias (underfitting) and variance (overfitting).
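A sketch of such a grid search on synthetic imbalanced data; the grid values here are illustrative, not the exact grid used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data stands in for the consolidated features.
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50], "learning_rate": [0.5, 1.0, 1.87]},
    scoring="f1",  # F1 guards against the class imbalance
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, f1_score(y_test, grid.predict(X_test)))
```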

Precision, Recall and F1-score for training and testing data sets

Not all features in the data set contribute equally to the predictions. We can get the weight of each feature from the model's perspective; the weights are in the range of 0 to 1, and the weights of all features add up to 1.
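Reading those weights out of a fitted model is a one-liner via `feature_importances_`; the data and feature names below are placeholders, not the project's features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Fit on synthetic data just to show how the weights are read out.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

importances = pd.Series(
    model.feature_importances_, index=[f"f{i}" for i in range(5)]
)
print(importances.sort_values(ascending=False))
```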

Feature Importance for model to predict influence

The model that predicts whether an individual is influenced by a promotion depends heavily on the amount spent, which is an after-the-fact attribute: the purchase has already happened. Therefore we cannot use this model as it is. Ideally, we would collect more data to address this problem, but as I am working only with the data provided by Starbucks, I cannot devise the desired model.

Predicting the best offer for an individual

Model for predicting the best offer for an individual

I have employed a grid search to find the model that yields a high F1 score. Unfortunately, the maximum consolidated F1 score that could be achieved was 0.2.

Gathering more data should help increase the F1 score, but as I am working only within the limits of the data provided, I cannot deploy this model.

Precision, Recall and F1-score for training and testing data sets

Predicting the purchasing habits

Model for predicting the purchasing habits

I have employed a grid search to find the model that yields decent R2 and MSE values. A GradientBoostingRegressor with a learning rate of 0.1 and 100 estimators produced an R2 score of ~0.32 and an MSE of ~2300 on both the training and testing datasets, striking the right balance between bias (underfitting) and variance (overfitting).
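A sketch of fitting and scoring such a regressor on synthetic data (so the scores will not match the project's numbers):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the consolidated Starbucks features;
# the hyperparameters mirror the ones reported above.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = GradientBoostingRegressor(
    learning_rate=0.1, n_estimators=100, random_state=1
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("R2:", r2_score(y_test, pred), "MSE:", mean_squared_error(y_test, pred))
```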

R2 & MSE for training and testing data sets
Feature Importance for model to predict amount

Unlike the model used to predict whether an individual will be influenced, the model that predicts purchasing habits depends on multiple features, and none of them are after-the-fact attributes.

I will use this model in the web application to make predictions.

Deployment

Generally, this will mean deploying a code representation of the model into an application to score or categorize new unseen data as it arises.

Importantly, the code representation must also include all the data prep steps leading up to modeling so that the model will treat new raw data in the same manner as during model development.

I have created a web application that uses the data analysis and the pre-trained model to predict the purchase amount for a given profile and offer code combination.

A complete description of the steps required to launch the web application is provided in the README.MD file.

Below are the screenshots from the web application:

Overview of the data set

Path: http://0.0.0.0:3001/

Conclusion

  • Contrary to conventional software development, it is not always feasible to materialize all of the business requirements. Sometimes more data will help, but in our current case, we are set to work within the data provided.
  • Evaluation on the test set was part of the modeling phase, which is not common in the real world. As there are no dedicated testers for this project, modeling and evaluation were done simultaneously. At no point was the testing data exposed to the models during training.
  • The source code for this analysis is hosted in a Git repository.
