How to train and predict regression and classification ML models using only SQL — using BigQuery ML

Lak Lakshmanan
Towards Data Science
3 min readJul 25, 2018

--

In my book (Data Science on the Google Cloud Platform), I walk through a flight-delay prediction problem and show how to address it using a variety of tools including Spark Mlib and TensorFlow. Now that BigQuery ML has been announced, I thought I’d show how to predict flight delays using BQ ML.

Predict whether this flight will arrive late. Using only SQL.

Make no mistake — you still have to collect the data, explore it, clean it up, and enrich it. Essentially all the stuff I do in Chapter 1–9. In Chapter 10, I used TensorFlow. In this article, I will use BQML.

Create Regression Model

Here’s a BigQuery query to create the model:

#standardsql
CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
(model_type='linear_reg', input_label_cols=['arr_delay']) AS
SELECT
arr_delay,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL

Note that:

  1. It starts with a CREATE MODEL, with the name of the model looking just like a table name. Note: ‘flights’ is the name of the dataset to store the resulting model — so, you’ll need to create an empty dataset before running the query.
  2. The options specify the algorithm — in this case, a linear regression algorithm, with arr_delay being the label
  3. Essentially, I’m pulling in the predictor and label variables in the SELECT

About 10 minutes later, the model is trained and evaluation results have been populated for each iteration:

The loss here is mean squared error, so the model converges on iteration #6 with a RMSE of about sqrt(97) = 10 minutes.

Predict using Model

The purpose of training a model is to predict with it. You can do model predictions with a SQL statement:

#standardsql
SELECT * FROM ML.PREDICT(MODEL flights.arrdelay,
(
SELECT
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance,
arr_delay AS actual_arr_delay
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
LIMIT 10))

This results in:

As you can see, because we trained the model to predict a variable called “arr_delay”, ML.PREDICT creates a result column named predicted_arr_delay. In this case, I’m pulling 10 rows from the original table and predicting the arrival delay for those flights.

Create Classification Model

In the book, I don’t actually try to predict the arrival delay as such. Instead, I predict the probability that a flight will be more than 15 minutes late. This is a classification problem, and you can do that by changing the training query slightly:

#standardsql
CREATE OR REPLACE MODEL flights.ontime
OPTIONS
(model_type='logistic_reg', input_label_cols=['on_time']) AS
SELECT
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL

Here’s the evaluation results:

and an example of the predictions:

Evaluating the model

It is possible to evaluate the model on an independent dataset. I don’t have one handy, so I’ll just show you how to run the evaluation on the same dataset the model was trained on:

#standardsql
SELECT * FROM ML.EVALUATE(MODEL flights.ontime,
(
SELECT
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
))

The result:

BQML is really easy and really powerful. Enjoy!

--

--