The world’s leading publication for data science, AI, and ML professionals.

How to Forecast Trip Duration Using BigQuery ML

Use GCP to conduct ML regression projects

Google Cloud Platform

Photo by Lucian Alexe from Unsplash
Photo by Lucian Alexe from Unsplash

This tutorial will go through the challenge lab in Google Cloud Qwiklab: Create ML Models with BigQuery ML: Challenge Lab.

We will use a real-life dataset shared on Google Cloud Public Dataset: austin_bikeshare to help select new bicycle models for public bike share systems. This lab aims at predicting average trip durations for different bike schemes. Two schemes are provided using different assumptions of features selected. You’ll compare which one has a better performance.

Step 1: Create a dataset to store your machine learning models

After logging into the Google Cloud Platform using the assigned credential, open Bigquery and create a dataset in your project.

Create a dataset in BigQuery (image by the author)
Create a dataset in BigQuery (image by the author)

Step 2: Create a forecasting BigQuery machine learning model

We’ll incorporate the starting station name, the hour the trip started, the weekday of the trip, and the address of the start station in the first model and use 2018 data only to train the first model.

Model1 training:

# Create a Linear Regression model
CREATE or REPLACE MODEL austin_bikeshare.bikeshare_model
OPTIONS
(model_type='linear_reg', labels=['duration_minutes']) AS
# Data preprocessing and feature selection
SELECT
start_station_name,
EXTRACT(HOUR FROM start_time) AS hourofstart,
EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
address AS location,
duration_minutes
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
# Join to get the address of the start station
JOIN
`bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
ON
trips.start_station_name=stations.name
WHERE
EXTRACT(YEAR FROM start_time) = 2018
AND duration_minutes > 0

Step 3: Create the second machine learning model

The features in the second model are relatively simpler. We’ll incorporate the starting station name, the hour the trip started, and the bike share subscriber type in the second model and use 2018 data only to train this model.

Model 2 training:

# Create a Linear regression model
CREATE or REPLACE MODEL austin_bikeshare.bikeshare_model2
OPTIONS
(model_type='linear_reg', labels=['duration_minutes']) AS
# Data preprocessing and feature selection
SELECT
start_station_name,
subscriber_type,
EXTRACT(HOUR FROM start_time) AS hourofstart,
duration_minutes
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
EXTRACT(YEAR FROM start_time) = 2018
AND duration_minutes > 0

Step 4: Evaluate the two machine learning models

Both Mean Absolute Error and Root Mean Squared Error are used for evaluation.

Model1 Evaluation:

# Choose evaluation metrics
SELECT
SQRT(mean_squared_error) AS rmse,
mean_absolute_error AS mae
FROM
ML.EVALUATE(MODEL austin_bikeshare.bikeshare_model,
(
# Prepare the testing dataset with the same features as the training set using for 2019
SELECT
start_station_name,
EXTRACT(HOUR FROM start_time) AS hourofstart,
EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
address AS location,
duration_minutes
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
JOIN
`bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
ON
trips.start_station_name=stations.name
WHERE
EXTRACT(YEAR FROM start_time) = 2019))
Evaluation for the first model (image by the author)
Evaluation for the first model (image by the author)

Model2 Evaluation:

# Choose evaluation metrics
SELECT
SQRT(mean_squared_error) AS rmse,
mean_absolute_error AS mae
FROM
ML.EVALUATE(MODEL austin_bikeshare.bikeshare_model2,
(
# Prepare the testing dataset with the same features as the training set using for 2019
SELECT
start_station_name,
subscriber_type,
EXTRACT(HOUR FROM start_time) AS hourofstart,
duration_minutes
# No need to join anther table since all the features needed are in one table
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
EXTRACT(YEAR FROM start_time) = 2019))
Evaluation for the second model (image by the author)
Evaluation for the second model (image by the author)

From the comparison, we can find that the first model is more accurate.

Step 5: Use the second model to predict the average trip durations

We are going to use the second model to predict average trip length for trips from the busiest bike-sharing station in 2019, where the subscriber type is Single Trip.

#Create a table containing predicted trip duration for single trip subscribers at the busiet bike sharing station in 2019
WITH predicted AS (SELECT
*
FROM
ml.PREDICT(MODEL `austin_bikeshare.bikeshare_model2`,
(
# Find the busiest bike sharing station in 2019
WITH params AS (
SELECT
COUNT(*) AS counts,
start_station_name AS start,
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
EXTRACT(YEAR FROM start_time) = 2019
GROUP BY start_station_name
ORDER BY counts DESC
LIMIT 1
)
# Prepare the features for the model
SELECT
start_station_name,
subscriber_type,
EXTRACT(HOUR FROM start_time) AS hourofstart,
duration_minutes
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`, params
WHERE
start_station_name = params.start
AND subscriber_type = 'Single Trip'
)))
# Get the average of the predicted duration.
SELECT AVG(predicted_duration_minutes) AS   predicted_average_trip_duration, start_station_name, subscriber_type
FROM predicted
GROUP BY start_station_name, subscriber_type
Script for prediction (image by the author)
Script for prediction (image by the author)
Predicted average trip duration (image by the author)
Predicted average trip duration (image by the author)

If the above script doesn’t pass the scoring system, you can separate it into two: one to find the busiest bike-share station, the other to predict using the station name as follows:

# Find the busiest station
SELECT
COUNT(*) AS counts,
start_station_name,
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
EXTRACT(YEAR FROM start_time) = 2019
GROUP BY start_station_name
ORDER BY counts DESC
# Predict
SELECT AVG(predicted_duration_minutes) AS predicted_average_trip_duration
FROM ML.predict(MODEL `austin_bikeshare.bikeshare_model2`, (
SELECT
start_station_name,
EXTRACT(HOUR FROM start_time) AS hourofstart,
subscriber_type,
duration_minutes
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE
EXTRACT(YEAR FROM start_time) = 2019
AND subscriber_type = 'Single Trip'
AND start_station_name = '21st & Speedway @PCL'))

Congratulations!

You passed the challenge lab and got the Create ML Models with BigQuery ML Badge from Google Cloud if you finished the previous labs, which have step-by-step instructions.

I hope you enjoy this journey. Welcome to leave a comment and connect with me on Linkedin.


Related Articles