The world’s leading publication for data science, AI, and ML professionals.

Machine Learning with Google’s BigQuery

How to easily create and deploy ML Models with SQL

Photo by Peter Olexa on Unsplash
Photo by Peter Olexa on Unsplash

The traditional approach of extracting data from a Data Warehouse or Data Lake – then cleaning, transforming and building a model with it, is slowly being overtaken by a newer approach called Bring the Compute to the Data or Bring Machine Learning/Algorithms to the Data [1]. Especially, since big providers like Google, Amazon and Microsoft provide link services such as databases and data warehouses with machine learning services, it is no longer necessary to integrate the data into other platforms or systems. In services like Google’s Bigquery, traditional database systems are even extended internally with ML tools [2].

Advantages of the new Paradigm

Advantages of the new approach include a simplified infrastructure. If we look at the simplified architecture below, it becomes clear that, if the services can already communicate with each other in a cloud environment, or are integrated in a service, no further interfaces to third-party systems are necessary. This significantly shortens the setup and maintenance of these environments.

Classical Analytic Process Architecture - Image by Author
Classical Analytic Process Architecture – Image by Author

Another important factor is that the Data Science process can be remarkably streamlined. Every data scientist and engineer knows how time consuming the process can be – so the approach of having everything you need in a cloud environment or even in a service simplifies this process significantly.

Integrated Data Lake & Analytic Platform - Image by Author
Integrated Data Lake & Analytic Platform – Image by Author

The third point is the simplification of programmability – for example, analysts can easily do Machine Learning tasks only with SQL. In the following, I would like to illustrate this with a simple example in BigQuery. If you want to dive deeper into the topic Data Platform modernization, you might find this article interesting.

Example with BigQuery ML

For a little show case and to proof how solutions simplify the data analytics process, I will use BigQuery and its ML functionalities. Here, I used the public dataset which contains every wholesale purchase of liquor in the State of Iowa.

In this example, I just wanted to do some clustering. The example is very simplified but shows quite well the possibilities and steps you have to take.

Step 1: Create the Model

CREATE OR REPLACE MODEL DATA.iowa_bottle_clusters OPTIONS (model_type='kmeans', num_clusters=3, distance_type = 'euclidean') ASSELECT item_description, AVG(state_bottle_cost) AS state_bottle_cost,FROM `bigquery-public-data.iowa_liquor_sales.sales`WHERE EXTRACT(YEAR FROM date) = 2018GROUP BY item_description;

Above, you can see that I used the k-Means algorithm and set the parameter for the number of clusters to three. My idea for this highly simplified example is to use the variable _state_bottlepaid (The amount that the Alcoholic Beverages Division paid for each bottle of liquor ordered) to divide it into three price categories.

Step 2: Evaluate Model

After creating the model, BigQuery automatically provides you with some metrics. These allow an evaluation of clustering algorithms. For the Davies-Bouldin index, the goal would be the lowest possible value [3].

Metrics - Image by Author
Metrics – Image by Author

Another great feature is a graph for the loss we get provided.

Loss - Image by Author
Loss – Image by Author

With

SELECT * FROM ML.TRAINING_INFO(MODEL Data.iowa_bottle_clusters);

and

SELECT davies_bouldin_index FROM ML.EVALUATE(MODEL Data.iowa_bottle_clusters);

you can query results later on, if needed.

Step 3:Predict

Via ML.PREDICT we will see which cluster a particular brands belongs to.

SELECT
 centroid_id,
 item_description,
 state_bottle_cost
FROM
 ML.PREDICT(MODEL Data.iowa_bottle_clusters,(SELECT
  item_description,
  AVG(state_bottle_cost) AS state_bottle_cost
 FROM
  bigquery-public-data.iowa_liquor_sales.sales
 WHERE
  date <= '2018-02-02'
  AND date >='2018-01-01'
 GROUP BY
  item_description) )
 ORDER BY
  centroid_id;

Step 4: Examine the Results

Now let’s check if the results make any sense. Here are examples from the three clusters:

Only one item in the first cluster, that seems to be the high class product:

1 – Remy Martin Louis Xiii Cognac – 1599.19

In the second group we have way more bottles , that seem to be considered as middle class – e.g.:

2 – The Dalmore Cigar Malt Scotch – 93.33

2 – Kavalan Sherry Oak Single Malt – 73.33

2 – Jim Beam Distillers Masterpiece – 104.91

And a cluster with products you would probably mix with coke or other soft drinks:

3 – Smirnoff Grape – 8.25

3 – Smirnoff Sours Green Apple – 8.25

3 – Burnett’s Fruit Punch – 4.48

So the results are not that bad in the end – but could be of course optimized. All you need here is BigQuery and some basic Sql.

Conclusion

In this short article, I wanted to provide some theoretical basic knowledge about the paradigm Move the Machine Learning or Algorithm to the Data. In addition, the advantages over classical approaches are clear. Especially in terms of setup and maintenance, but also the actual data analytic process in terms of time savings. BigQuery ML democratizes the use of Machine Learning. Data Scientists and Analysts can use BigQuery ML to build and run models without having to access new business intelligence tools and tables. Predictive analytics can help with business decision making. In the end, I demonstrated how easy you can now develop a machine learning model only using SQL and Google’s BigQuery. For more information, click on the links below.

Sources and Further Readings

[1] Downs, B. N., Opheim, D. M., Hale, W., Xi, L., Donehower, L. A., & Kalra, D. (2014). A Practical Example of Bringing Computation to Data. Journal of Biomolecular Techniques : JBT, 25(Suppl), S5.

[2] Google, What is BigQuery ML? (2020)

[3] Davies, D.L., Bouldin, D.W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (2), S.224–227.


Related Articles