
There are many reasons why we might not be able to use Python for machine learning:
- We are still learning the programming language
- We need to deliver results soon and prefer to analyze the data where it resides
- We don’t have access to a Python environment
Whichever case applies to you, I’m going to share how to do machine learning with SQL alone, identify the key features that influence the duration of a bike-share trip, and offer my observations on how data warehouses are evolving.
Content
- Use Case: Predict the average duration of a bike-share trip
- Challenge
- Step 1: Assume key features
- Step 2: Define metrics
- Step 3: Build and evaluate the model
- Step 4: Predict the result
- Step 5: How to improve the accuracy of your model?
- Thoughts
Use Case: Predict the average duration of a bike-share trip
The average duration of a bike-share trip is directly linked to revenue for a bike-share business, just as average order value is for retail. To get more insight into the key factors that influence trip duration, I used Austin Bikeshare, a public dataset on Google Cloud Platform. To analyze it, point to bigquery-public-data:austin_bikeshare after logging in to BigQuery.
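To get a first feel for the data, a quick preview query will do. This is a minimal sketch that simply samples a few rows from the trips table:

```sql
-- Preview a handful of rows from the public trips table
SELECT *
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
LIMIT 5;
```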
Challenge
Taking a closer look at the bikeshare_trips and bikeshare_stations tables, there are many features to choose from for training. One way to decide is to leverage AutoML, but that’s not the focus of this article. Instead, we will develop two assumptions, train a model for each, and compare them with evaluation metrics.
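To see exactly which columns are available as candidate features, we can query the dataset’s INFORMATION_SCHEMA. This is a sketch; the same information is also visible in the BigQuery UI schema tab:

```sql
-- List the columns of both tables to survey candidate features
SELECT table_name, column_name, data_type
FROM `bigquery-public-data.austin_bikeshare`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name IN ('bikeshare_trips', 'bikeshare_stations')
ORDER BY table_name, ordinal_position;
```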


Step 1: Assume key features
Here are two assumptions:
- Assumption A: the key factors are the start station, the location of the start station, the day of the week, and the hour the trip started
- Assumption B: the key factors are the start station, the subscriber type, and the hour the trip started (both feature sets are sketched in SQL below)
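In SQL terms, these candidate features map onto columns of bikeshare_trips. The sketch below pulls both feature sets side by side; the column names come from the public dataset, and the join needed for Assumption A’s station location is noted but omitted:

```sql
-- Peek at the candidate features behind both assumptions
SELECT
  start_station_name,                                  -- A and B
  subscriber_type,                                     -- B only
  EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,   -- A only
  EXTRACT(HOUR FROM start_time) AS start_hour,         -- A and B
  duration_minutes                                     -- the label we want to predict
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
-- Assumption A's station location would come from joining
-- bikeshare_stations on the station name/id (omitted here)
LIMIT 10;
```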
Step 2: Define metrics
It is vital to decide on the evaluation metrics before training a model. After all, if we want to compare two results, we must clearly state the criteria. Machine learning involves permutations and combinations of features, parameters, and frameworks; the metrics let us pick the combination with the best accuracy.
There are many metrics to choose from. Here are two commonly used in regression:
- Mean Absolute Error
- Root Mean Squared Error
If you are as forgetful as I am, no worries: Khan Academy explains mean absolute deviation and root-mean-square error well.
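For reference, with $y_i$ the observed duration, $\hat{y}_i$ the prediction, and $n$ the number of trips:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Both are expressed in the same unit as the target (minutes here), and lower is better; RMSE penalizes large errors more heavily than MAE.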
Step 3: Build and evaluate the model
We then use the selected features and the desired outcome (the trip duration, in this example) to build a regression model with the CREATE MODEL statement and evaluate it with ML.EVALUATE.
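Here is a minimal sketch for Assumption B, assuming a linear regression model; the model name `mydataset.trip_duration_model_b` is illustrative, so adjust it to your own project and dataset:

```sql
-- Train a linear regression model on the Assumption B features
CREATE OR REPLACE MODEL `mydataset.trip_duration_model_b`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration_minutes']) AS
SELECT
  start_station_name,
  subscriber_type,
  EXTRACT(HOUR FROM start_time) AS start_hour,
  duration_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE duration_minutes IS NOT NULL;

-- Evaluate the trained model; for regression this returns
-- mean_absolute_error and mean_squared_error (its square root is the RMSE),
-- among other metrics
SELECT *
FROM ML.EVALUATE(MODEL `mydataset.trip_duration_model_b`);
```

The same pattern, with the Assumption A feature list substituted into the SELECT, produces the second model to compare against.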
Both the Mean Absolute Error and the Root Mean Squared Error came out much smaller for Assumption B, meaning the Assumption B features are more predictive of the desired outcome.

Step 4: Predict the result
Finally, we use ML.PREDICT to generate predictions: the predicted average duration for the 2019 data is 51 minutes.
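A sketch of that prediction step, reusing the illustrative model name from above:

```sql
-- Score 2019 trips with the trained model and average the predictions;
-- ML.PREDICT emits one predicted_<label> column per input row
SELECT AVG(predicted_duration_minutes) AS avg_predicted_duration
FROM ML.PREDICT(
  MODEL `mydataset.trip_duration_model_b`,
  (
    SELECT
      start_station_name,
      subscriber_type,
      EXTRACT(HOUR FROM start_time) AS start_hour
    FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
    WHERE EXTRACT(YEAR FROM start_time) = 2019
  ));
```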
Step 5: How to improve the accuracy of your model?
How to improve accuracy is one of the most interesting questions for data scientists.
I like to decompose the process into steps and tackle each one.
Here is my thinking process:
- Data
- Feature
- Algorithm
- Parameter
Let’s focus on the fundamentals: data and features. We’ve already run a small experiment on adjusting features. Taking a deeper look at the data, there are some basic descriptive statistics to work with, such as the average, maximum, and minimum, to check whether the values make sense. For example, some rows in the table don’t make sense: durations of more than 30,000 hours, or durations below zero.
If we trim out these rows, the prediction results become more credible, as sketched below.
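Here is a sketch of both ideas: first the sanity check, then retraining on a cleaned training set. The cutoffs (0 and 24 hours) and the model name are illustrative assumptions:

```sql
-- Sanity-check the label with basic statistics
SELECT
  AVG(duration_minutes) AS avg_minutes,
  MAX(duration_minutes) AS max_minutes,
  MIN(duration_minutes) AS min_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`;

-- Retrain on rows with plausible durations only
-- (cutoffs of 0 and 1,440 minutes = 24 hours are illustrative)
CREATE OR REPLACE MODEL `mydataset.trip_duration_model_b_clean`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration_minutes']) AS
SELECT
  start_station_name,
  subscriber_type,
  EXTRACT(HOUR FROM start_time) AS start_hour,
  duration_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE duration_minutes > 0
  AND duration_minutes <= 1440;
```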
Thoughts
BigQuery makes ML with SQL possible
Not everyone has access to Python, but what if we need to deliver anyway? For some database administrators and data analysts, fluency in Python may not be a “must-have” skill. The ML built into BigQuery lets us deliver business results without mastering a new language.
Streamlining the process so you can focus on what matters the most
It’s common to see my enterprise customers running through all kinds of prerequisites before jumping into a machine learning project. As a result, delivering results takes longer because of data movement, security checks, access control, and networking settings. Now many solution providers have integrated their tools so that the process is smoother. For example, when doing data analysis, we used to download data from the database, transfer it, and import it into an IDE. Now a single mount operation does the work, much like plugging a USB stick into a computer.
Data Warehouse is evolving to combine BI and ML as a data analytics platform
In the past, a data warehouse was simply a tool for parallel data processing and storage. Over time, it has taken on new elements such as business intelligence and machine learning for more complex analytics jobs. In addition, many data warehouses now offer real-time analysis to meet streaming demands in big-data scenarios. The architecture below shows how to use Synapse (the data warehouse solution from Microsoft) for IoT device and web click data analysis.

Disclaimer: This is a write-up after finishing Create ML Models with BigQuery ML. Feel free to refer to my queries if you need more details.