The world’s leading publication for data science, AI, and ML professionals.

Predict Transactions On Your Website Using Big Query ML

Train a model on Google Analytics data

Photo by Pickawood on Unsplash

If you sell products online, predicting which users will convert can help you to increase your revenues. For example, it can allow you to:

  • Create custom advertising audiences
  • Generate dynamic product recommendations based on user behavior
  • Customize email workflows of your existing users

For this analysis, we will use the google_analytics_sample public dataset, which you can find here.

Big Query public datasets are an amazing resource for practicing data science. However, if you have your own website, it’s even more interesting to train your model on your own website data!

Step 1: Understanding the dataset

When we explore the schema, we can see that we have a column visitId, which means we have one line per session.

Session

The period of time a user is active on your site or app. By default, if a user is inactive for 30 minutes or more, any future activity is attributed to a new session. Users that leave your site and return within 30 minutes are counted as part of the original session.

Source: Google Analytics Documentation

We also have a column, transactions, an integer corresponding to the number of transactions made in the session. We don’t really care about the exact count; we just want to know whether the user converted or not.

Then we have our arrays of features:

  • totals contains some interesting metrics such as the number of page views and the time spent on site
  • trafficSource contains data about the source (where the user comes from)
  • device contains info about the device used
  • geoNetwork contains geographical info
  • hits contains info about events, including transactions.

We have a small issue here: hits contains data about transactions, which would cause data leakage. The ML model should only access what happened before the transaction.

In order to avoid data leakage, we will get rid of the hits column. Of course, we might still have some issues with the transaction data contained in other metrics. The duration of the session, for example, includes the duration of the checkout process, which takes some time. The same goes for the number of page views. Ideally, we would need to subtract what happened on the checkout pages from these metrics, but we will ignore this for now.
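Dropping a column is easy in Big Query’s standard SQL thanks to EXCEPT. A minimal sketch (the 20170801 suffix selects just one day of the sample dataset; any day works the same way):

```sql
-- Keep everything except the hits column, to avoid leaking
-- transaction data into the features
SELECT * EXCEPT(hits)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
```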

Step 2: Splitting your data into train and test sets

Big Query allows us to split the data while creating the model, but we also want to avoid looking at the testing data, so we will split our dataset into separate train and test tables.

Our table has 2556 rows, so an 80–20% split would give us:

  • 2044 rows for the train split
  • 512 rows for the test split

To do this, we need to generate random numbers. Big Query does offer a native RAND() function for this, but it’s also a good opportunity to use a JavaScript function. This is one of the things I love about Big Query: you can apply a JavaScript function to any column!
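A sketch of such a UDF (the function name is mine; the syntax follows Big Query’s JavaScript UDF documentation):

```sql
-- A JavaScript UDF returning a random number between 0 and 1
CREATE TEMP FUNCTION get_random()
RETURNS FLOAT64
LANGUAGE js AS """
  return Math.random();
""";

SELECT visitId, get_random() AS rnd
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
LIMIT 10
```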

Running the function confirms that it works! Now let’s create our tables. First, let’s create the table holding the random numbers:
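For example (the project and dataset names here are placeholders; adjust them to your own, and note that RAND() can stand in for the JavaScript UDF):

```sql
-- Materialize the sessions with a random number per row,
-- dropping hits to avoid data leakage (see Step 1)
CREATE OR REPLACE TABLE `my_project.ga_ml.sessions_with_rnd` AS
SELECT
  * EXCEPT(hits),
  RAND() AS rnd
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
```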

Then, we will create our training and test sets.
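With the random-number table in place, the split is a pair of filters (again, table names are placeholders):

```sql
-- ~80% of the rows for training
CREATE OR REPLACE TABLE `my_project.ga_ml.train_set` AS
SELECT * FROM `my_project.ga_ml.sessions_with_rnd`
WHERE rnd < 0.8;

-- ~20% of the rows for testing
CREATE OR REPLACE TABLE `my_project.ga_ml.test_set` AS
SELECT * FROM `my_project.ga_ml.sessions_with_rnd`
WHERE rnd >= 0.8;
```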

Step 3: Exploring the training set in Data Studio

Let’s create a new report on Data Studio and select Big Query as a source.

The first thing we need to know is the share of visits with transactions.
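This share can also be checked directly with a query against the training table from Step 2 (hypothetical name; totals.transactions is NULL when a session has no transaction):

```sql
SELECT
  COUNTIF(totals.transactions IS NOT NULL) AS converting_sessions,
  COUNT(*) AS sessions,
  ROUND(100 * COUNTIF(totals.transactions IS NOT NULL) / COUNT(*), 2)
    AS conversion_rate_pct
FROM `my_project.ga_ml.train_set`
```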

This is clearly an unbalanced dataset, we will have to bear this in mind!

Now let’s try to see the impact of page views and time spent on site for buyers and non-buyers.

There is clearly a pattern here: it seems that buyers spend more time on the site, and view more pages than non-buyers, which makes sense.

Lastly, let’s have a look at the relationship between source and conversion. We will use channelGrouping this time.

Channel Groupings

Channel Groupings are rule-based groupings of your traffic sources. Throughout Analytics reports, you can see your data organized according to the Default Channel Grouping, a grouping of the most common sources of traffic, like Paid Search and Direct. This allows you to quickly check the performance of each of your traffic channels.

Source: Google Analytics Documentation

Channel groupings seem to have an impact; however, we have to be careful, as we don’t have a lot of data.
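The same breakdown can be computed with a simple aggregation (train_set is the hypothetical table name from Step 2; channelGrouping is a top-level column of the dataset):

```sql
SELECT
  channelGrouping,
  COUNT(*) AS sessions,
  ROUND(100 * COUNTIF(totals.transactions IS NOT NULL) / COUNT(*), 2)
    AS conversion_rate_pct
FROM `my_project.ga_ml.train_set`
GROUP BY channelGrouping
ORDER BY conversion_rate_pct DESC
```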

Step 4: Building our ML model

Let’s try a Logistic Regression, one of the most common classification models. If you don’t know Logistic Regression, I have put some links to StatQuest videos in the Resources.

The syntax to create a model on Big Query is simple:
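A sketch of the CREATE MODEL statement, with an illustrative feature list (model and table names are placeholders; you can pick different features):

```sql
CREATE OR REPLACE MODEL `my_project.ga_ml.transactions_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['label'],
  auto_class_weights = TRUE  -- one way to handle the unbalanced classes
) AS
SELECT
  IF(totals.transactions IS NOT NULL, 1, 0) AS label,
  totals.pageviews,
  totals.timeOnSite,
  channelGrouping,
  device.deviceCategory,
  geoNetwork.country
FROM `my_project.ga_ml.train_set`
```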

Step 5: Evaluating our model

Now that we created our model, it’s time to see if it performs well on our test set!

As our data is unbalanced, accuracy would not be a good metric to evaluate the model: just by predicting False every time, we would get more than 98% accuracy.

Therefore, we are going to use the ROC curve to evaluate our model.

Let’s write a query to evaluate our logistic regression:
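A sketch using ML.EVALUATE, feeding it the test table with the same columns the model was trained on (names are placeholders from the earlier steps):

```sql
SELECT *
FROM ML.EVALUATE(MODEL `my_project.ga_ml.transactions_model`, (
  SELECT
    IF(totals.transactions IS NOT NULL, 1, 0) AS label,
    totals.pageviews,
    totals.timeOnSite,
    channelGrouping,
    device.deviceCategory,
    geoNetwork.country
  FROM `my_project.ga_ml.test_set`
))
```

For a logistic regression, this returns precision, recall, accuracy, f1_score, log_loss, and roc_auc in one row; ML.ROC_CURVE gives you the full curve point by point.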

This tells us:

  • We identify 50% of the transactions (the recall).
  • Among the sessions we identify as transactions, only 22% are actually transactions (the precision).

Even though these numbers don’t look so good, they’re not that bad when you consider that a user has only a 1.7% chance of converting. This is reflected in a good ROC AUC of almost 95%.

Again, we probably don’t have enough data to build a powerful model, but imagine doing this with hundreds of thousands of sessions!

Step 6: Using the model

There are several ways we could use this model in real life.

One way would be to use the ML model in real time to determine the probability that a user is going to purchase:

  • If the probability is very high, we don’t do anything.
  • If the probability is medium, we try to convince the user to buy by showing a popup with a discount code that is only valid today.
  • If the probability is low, we don’t do anything.

This model will help us convince undecided users.
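Scoring sessions boils down to ML.PREDICT. A sketch, reusing the placeholder names from the earlier steps (predicted_label_probs is the array of per-class probabilities Big Query ML returns):

```sql
SELECT
  fullVisitorId,
  (SELECT prob FROM UNNEST(predicted_label_probs) WHERE label = 1)
    AS purchase_probability
FROM ML.PREDICT(MODEL `my_project.ga_ml.transactions_model`, (
  SELECT
    fullVisitorId,
    totals.pageviews,
    totals.timeOnSite,
    channelGrouping,
    device.deviceCategory,
    geoNetwork.country
  FROM `my_project.ga_ml.test_set`
))
ORDER BY purchase_probability DESC
```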

Another way to use this model would be to build audiences: we take the users with the highest predicted probability who have not bought yet (or not for a long time), and we use them to create a custom advertising audience, or we send them an email.

Resources

