How to train and predict regression and classification ML models using only SQL — using BigQuery ML
In my book (Data Science on the Google Cloud Platform), I walk through a flight-delay prediction problem and show how to address it using a variety of tools including Spark Mlib and TensorFlow. Now that BigQuery ML has been announced, I thought I’d show how to predict flight delays using BQ ML.
Make no mistake — you still have to collect the data, explore it, clean it up, and enrich it. Essentially all the stuff I do in Chapter 1–9. In Chapter 10, I used TensorFlow. In this article, I will use BQML.
Create Regression Model
Here’s a BigQuery query to create the model:
#standardsql
CREATE OR REPLACE MODEL flights.arrdelayOPTIONS
(model_type='linear_reg', input_label_cols=['arr_delay']) ASSELECT
arr_delay,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
Note that:
- It starts with a CREATE MODEL, with the name of the model looking just like a table name. Note: ‘flights’ is the name of the dataset to store the resulting model — so, you’ll need to create an empty dataset before running the query.
- The options specify the algorithm — in this case, a linear regression algorithm, with arr_delay being the label
- Essentially, I’m pulling in the predictor and label variables in the SELECT
About 10 minutes later, the model is trained and evaluation results have been populated for each iteration:
The loss here is mean squared error, so the model converges on iteration #6 with a RMSE of about sqrt(97) = 10 minutes.
Predict using Model
The purpose of training a model is to predict with it. You can do model predictions with a SQL statement:
#standardsql
SELECT * FROM ML.PREDICT(MODEL flights.arrdelay,
(
SELECT
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance,
arr_delay AS actual_arr_delay
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
LIMIT 10))
This results in:
As you can see, because we trained the model to predict a variable called “arr_delay”, ML.PREDICT creates a result column named predicted_arr_delay. In this case, I’m pulling 10 rows from the original table and predicting the arrival delay for those flights.
Create Classification Model
In the book, I don’t actually try to predict the arrival delay as such. Instead, I predict the probability that a flight will be more than 15 minutes late. This is a classification problem, and you can do that by changing the training query slightly:
#standardsql
CREATE OR REPLACE MODEL flights.ontime
OPTIONS
(model_type='logistic_reg', input_label_cols=['on_time']) ASSELECT
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
Here’s the evaluation results:
and an example of the predictions:
Evaluating the model
It is possible to evaluate the model on an independent dataset. I don’t have one handy, so I’ll just show you how to run the evaluation on the same dataset the model was trained on:
#standardsql
SELECT * FROM ML.EVALUATE(MODEL flights.ontime,
(
SELECT
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
))
The result:
BQML is really easy and really powerful. Enjoy!