The world’s leading publication for data science, AI, and ML professionals.

How to Build Machine Learning Model using SQL

Machine Learning and Predictive Modelling in BigQuery

How to Build ML Model using BigQuery - image by author
How to Build ML Model using BigQuery – image by author

While taking the first step into the field of machine learning, it is so easy to get overwhelmed by all kinds of complex algorithms and ugly symbols. Therefore, hopefully, this article can lower the entry barrier by providing a beginner-friendly guide. Allow you to get a sense of achievement by building your own ML model using Bigquery and SQL. That’s right, we can use SQL to implement machine learning. If you are looking for several lines of code to get your hands dirty in the ML field, please continue reading 🙂


1. Set Up the Basic Structure 📁

BigQuery Basic Structure - image by author
BigQuery Basic Structure – image by author

In a nutshell, BigQuery project contains datasets and a dataset contains tables and models.

Follow these steps to set up a project using Google public dataset – "Google Analytics Sample". If you are familiar with the procedure, feel free to jump to the second section of this article.

1)Navigate to BigQuery in the left sidebar of the Google Cloud Platform Console and then create a project in Google Cloud Platform, name it as anything you want (e.g. "project" in this example)

create a project - image by author
create a project – image by author

2) With the "SQL Workspace" panel, select your project and then click "Create Dataset" and give your dataset a name (e.g. analytics).

create dataset- image by author
create dataset- image by author

3) Click "Add data" and select "Explore public data sets". From the public dataset library, search google analytics sample and add it to your project. Have a read of the Google Analytics sample schema if interested.

add Google Analytics Sample dataset - image by author
add Google Analytics Sample dataset – image by author

4) Explore the dataset by navigating to the "gasessions" in the "bigquery-public-data" and select the "Preview" tab. It is a massive dataset with multiple columns and a collection of subsets which are partitioned by date (e.g. "2017–08–01" in the example below). Notice that the dataset sits within the project "bigquery-public-data" rather than our own project "project-303707", but we can still query this dataset from our project as it is publicly accessible.

preview dataset - image by author
preview dataset – image by author

2. What’s Your Problem 🤨 ?

Identify the problem - image by author
Identify the problem – image by author

Before jumping into building the model, clarify the problem we are trying to solve by asking the following questions:

1) What is the label?

A label is a variable to be predicted. In this example, I will predict whether the website visitor will make any transactions and I gave this label the name "purchase". This can be derived from the existing variable "totals.transactions".

2) How is the label value defined?

For simplicity, let’s make this prediction a black or white situation, either "purchase" or "no purchase". Since the model training cannot handle string value as the output result, therefore it is necessary to code them into numbers.

This criterion can be defined using an IF statement in SQL:

IF(totals.transactions > 0, 1, 0) as purchase

3) What are the features/predictors?

Predictors are the features selected to predict the label. For this simple model, "totals.pageviews", "device.isMobile", "totals.timeonsite" are selected as the predictors. Let’s find out if they are actually good predictors to estimate the chance of making purchase.

4) What do we already know? What are we trying to predict?

We used the data that we already know to build the model. For this exercise, I chose a range of subsets with date between 2017–01–01 and 2017–07–01 from the google analytics dataset. We have already known if these records have made the purchase or not based on the "totals.transactions" value.

I chose the subsets from 2017–07–01 to 2017–08–01 as the prediction set. Although we already know if they have made any transactions from the data. But for illustration purposes, let’s assume that we only know the predictors and use them to predict the label.

5) What type of model should we choose?

There are two major types of problems in Machine Learning: classification and regression. Classification problem is to predict categorical data (e.g. true/false, cat/dog) whereas regression problem is to numerical data with continuous quantity (e.g. temperature, revenue etc). Since the label "purchase" is designed as a category with discrete value (either 1 or 0), this is a classification problem. Logistic regression, decision tree, support vector machine are commonly used models to solve classification. I choose logistic regression because it is the easiest to start with and also it is supported by BigQuery ML.

3. Build the Model 📊

Build the Model - image by author
Build the Model – image by author

Building ML models in BigQuery split into training dataset and evaluation dataset.

Training dataset: a subset of sample data used to create the model.

Evaluation dataset: a subset of sample data used to assess the performance of the model.

In this example, data subset with date from "2017–01–01" to "2017–05–01" is used for training and "2017–05–01" to "2017–07–01" is used for evaluation.

1) Train the Model

CREATE OR REPLACE MODEL `analytics.my_model`
OPTIONS 
(
  model_type = "logistic_reg",
  labels = ['purchase']
)
AS
SELECT * 
EXCEPT(fullvisitorid)
FROM
(SELECT fullvisitorid,
  IFNULL(totals.pageviews, 0) AS page,
  IFNULL(totals.timeonsite, 0) AS duration,
  CASE 
    WHEN device.isMobile is True THEN 1
    ELSE 0
  END mobile
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170501"
)
JOIN
(SELECT fullvisitorid, 
  IF(totals.transactions > 0, 1, 0) as purchase
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
) target 
USING (fullvisitorid)

Let’s break down this chunk of code into pieces:

CREATE OR REPLACE MODEL `analytics.my_model`
OPTIONS 
(
  model_type = "logistic_reg",
  labels = ['purchase']
)
AS

The first part creates the model "my_model" within the "analytics" dataset by passing three parameters:

  • Model name with the dataset prefix: analytics.my_model
  • Model type: "logistic_reg" stands for logistic regression model
  • Labels: "purchase" which is the variable we are predicting

If you would like to learn how to build a logistic regression model using Python instead, this article may help:

Simple Logistic Regression using Python scikit-learn

The second part, after the "AS" keyword, specifies the training dataset:

SELECT * 
EXCEPT(fullvisitorid)
FROM
(SELECT fullvisitorid,
  IFNULL(totals.pageviews, 0) AS page,
  IFNULL(totals.timeonsite, 0) AS duration,
  CASE 
    WHEN device.isMobile is True THEN 1
    ELSE 0
  END mobile
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170501"
)
JOIN
(SELECT fullvisitorid, 
  IF(totals.transactions > 0, 1, 0) as purchase
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
) target 
USING (fullvisitorid)

We are joining two tables together over here, the first one select all features which are used as predictors in this scenario:

  • "totals.pageviews": indicates the number of page views within the session and then apply IFNULL function to replace the null value with 0
  • "totals.timeonsite": indicates the time spent within the session and then apply IFNULL function to replace the null value with 0
  • "device.isMobile": indicates if the visitor uses a mobile device and then apply CASE function to transform boolean (true or false) into numeric values (1 or 0)

The keyword "_TABLE_SUFFIX" is used to filter the dataset so that only the training subset of "ga_session" is included (2017–01–01" to "2017–05–01").

Then we join the first table containing the predictors with the second table that includes each visitor’s "purchase" label defined by the statement IF(totals.transactions > 0, 1, 0).

Then we exclude the fullvisitorid from the joined table using the statement SELECT * EXCEPT(fullvisitorid) FROM ... because ID doesn’t contribute to the prediction of purchase.

After running the training statement in the query composer panel, we will be able to see "my_model" appeared in the "analytics" dataset.

model details - image by author
model details – image by author

2) Evaluate the Model

SELECT 
    roc_auc
FROM ML.EVALUATE(MODEL analytics.my_model, 

(
    SELECT * 
    EXCEPT(fullvisitorid)
    FROM
    (SELECT fullvisitorid,
    IFNULL(totals.pageviews, 0) AS page,
    IFNULL(totals.timeonsite, 0) AS duration,
    CASE 
        WHEN device.isMobile is True THEN 1
        ELSE 0
    END mobile
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN "20170501" AND "20170701"
    )

    JOIN
    (SELECT fullvisitorid, 
    IF(totals.transactions > 0, 1, 0) as purchase
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    ) target 
    USING (fullvisitorid)
));

This can be broken down into two layers.

-- outer layer
SELECT roc_auc
FROM ML.EVALUATE(...)

The outer layer is to select the metric from the result returned from the ML.EVALUATE clause. In this case, only "roc_auc" is selected, but we can also use "SELECT *" to get other common metrics to evaluate classification models such as precision, recall, accuracy, f1_score and log_loss.

-- inner layer
MODEL analytics.my_model, 
(
    SELECT * 
        EXCEPT(fullvisitorid)
        FROM
        (SELECT fullvisitorid,
        IFNULL(totals.pageviews, 0) AS page,
        IFNULL(totals.timeonsite, 0) AS duration,
        CASE 
            WHEN device.isMobile is True THEN 1
            ELSE 0
        END mobile
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
        WHERE _TABLE_SUFFIX BETWEEN "20170501" AND "20170701"
        )

        JOIN
        (SELECT fullvisitorid, 
        IF(totals.transactions > 0, 1, 0) as purchase
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
        ) target 
        USING (fullvisitorid)
)

The inner layer is to pass the model "analytics.my_model" and the evaluation dataset as the parameters of this EVALUATE function. As we can see, the evaluation dataset is almost identical to the definition of the training dataset, except that the _TABLE_SUFFIX has been changed into another set of data range between "20170501" and "2017070.

evaluation result - image by author
evaluation result – image by author

The code returned the roc_auc value as 0.75. With this value, we can only say that this is a decent but not good enough model. There are many ways to improve its performance, for instance, dive deeper into feature selection by introducing more informative features as the predictors. I encourage you to have a think about it, what other attributes may contribute to the prediction of "purchase". Give it a go to implement it and reach a higher roc_auc score. Think of it as an iterative process that continuously tests the performance of the model and improves on it, until it is reaching an optimal value (e.g. roc_auc > 0.9).

Feel free to comment down below if you thought of any approaches to improve the score.

4. Make Predictions 📈

make predictions - image by author
make predictions – image by author

Comparing the prediction statement below to the training or evaluation statement, there is apparent difference that only predictors ( page, duration and mobile) are selected. This is because the label will be the query output rather than the input.

SELECT
  *
FROM
  ML.PREDICT(MODEL `analytics.my_model`,
    (
    SELECT fullvisitorid,
    IFNULL(totals.pageviews, 0) AS page,
    IFNULL(totals.timeonsite, 0) AS duration,
    CASE 
        WHEN device.isMobile is True THEN 1
        ELSE 0
    END mobile
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170801" 
     )
  );

For each Predictive Model in BigQuery, the predicted label will be named as "predicted_

The result below shows the predicted label of each visitor as well as the probability of being labeled as 1 or 0 respectively.

prediction result - image by author
prediction result – image by author

Take-Home Message 🏠

This article briefly introduces how to build a simple machine learning model using Sql in BigQuery, with four major steps being covered:

  1. Set up the basic structure
  2. Be clear about the problem
  3. Build the model (train and evaluate the model)
  4. Make the prediction

Hopefully, it helps to ease the friction of entering the machine learning field. Most importantly, let’s learn and grow together!

More Related Articles

Top 15 Websites for Data Scientists to Follow in 2021

Learn SQL in Everyday Language

Get Started with SQL JOINs


Originally published at https://www.visual-design.net on February 9, 2021.


Related Articles