
Bring Machine Learning to the Data

A new Paradigm illustrated with Google BigQuery

Image from Mohamed Hassan on Pixabay

The traditional approach of extracting data from a Data Warehouse or Data Lake, then cleaning it, transforming it, and building a model is slowly being overtaken by a newer approach [1] called

Bring the Compute to the Data or

Bring Machine Learning/Algorithms to the Data

Especially since big providers like Google, Amazon, and Microsoft now link services such as databases and data warehouses with machine learning services, it is no longer necessary to move the data into other platforms or systems. In services like Google’s BigQuery, traditional database systems are even extended internally with ML tools [2].

Advantages of the new Paradigm

Advantages of the new approach include a simplified infrastructure. If we look at the simplified architecture below, it becomes clear that if the services can already communicate with each other in a cloud environment, or are integrated in a service, no further interfaces to third-party systems are necessary. This significantly shortens the setup and maintenance of these environments.

Classical Analytic Process Architecture – Image by Author

Another important factor is that the Data Science process can be streamlined remarkably. Every data scientist and engineer knows how time-consuming the process can be, so having everything you need in a cloud environment, or even within a single service, simplifies this process significantly.

Integrated Data Lake & Analytic Platform

The third point is the simplification of programmability: for example, analysts can carry out Machine Learning tasks with SQL alone. In the following, I would like to illustrate this with a simple example in Google BigQuery. If you want to dive deeper into the topic of Data Platform modernization, you might find this article interesting.

Example with BigQuery ML

For a small showcase, and to demonstrate how such solutions simplify the data analytics process, I will use BigQuery and its ML functionalities. Here, I used the public dataset that contains every wholesale purchase of liquor in the State of Iowa.
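
To get a first feel for the data, a quick preview of the columns used throughout this example can help. The following is a minimal sketch (my addition, not part of the original walkthrough; the row limit is arbitrary):

SELECT
 date,
 item_description,
 state_bottle_cost
FROM
 `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 10; -- small sample only, to inspect the columns used later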

In this example, I just wanted to do some clustering. The example is very simplified, but it shows the possibilities and the steps you have to take quite well.

Step 1 – Create the Model

CREATE OR REPLACE MODEL
 Data.iowa_bottle_clusters OPTIONS (model_type='kmeans',
 num_clusters=3, distance_type='euclidean') AS
SELECT
 item_description,
 AVG(state_bottle_cost) AS state_bottle_cost
FROM
 `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
 EXTRACT(YEAR FROM date) = 2018
GROUP BY
 item_description;

Above, you can see that I used the k-Means algorithm and set the number of clusters to three. My idea for this, as I said, highly simplified example is simply to use the variable state_bottle_cost (the amount the Alcoholic Beverages Division paid for each bottle of liquor ordered) to divide the items into three price categories.
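
Before fixing the number of clusters at three, it can also be helpful to check how the per-item average bottle cost is distributed. The following is a small exploratory sketch using BigQuery’s APPROX_QUANTILES (my addition, not part of the original example):

SELECT
 -- five approximate boundaries (min, quartiles, max) of the per-item average cost
 APPROX_QUANTILES(state_bottle_cost, 4) AS cost_quartiles
FROM (
 SELECT
  item_description,
  AVG(state_bottle_cost) AS state_bottle_cost
 FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
 WHERE
  EXTRACT(YEAR FROM date) = 2018
 GROUP BY
  item_description
);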

Step 2 – Evaluate the Model

After creating the model, BigQuery automatically provides you with some metrics that allow an evaluation of the clustering. For the Davies-Bouldin index, the goal is the lowest possible value [3].

Metrics – Image by Author

Another great feature is the graph of the training loss that is provided.

With

SELECT * FROM ML.TRAINING_INFO(MODEL Data.iowa_bottle_clusters);

and

SELECT davies_bouldin_index FROM ML.EVALUATE(MODEL Data.iowa_bottle_clusters);

you can query these results later on if needed.

Step 3 – Predict

Via ML.PREDICT we will see which cluster a particular brand belongs to.

SELECT
 centroid_id,
 item_description,
 state_bottle_cost
FROM
 ML.PREDICT(MODEL Data.iowa_bottle_clusters, (
  SELECT
   item_description,
   AVG(state_bottle_cost) AS state_bottle_cost
  FROM
   `bigquery-public-data.iowa_liquor_sales.sales`
  WHERE
   date <= '2018-02-02'
   AND date >= '2018-01-01'
  GROUP BY
   item_description))
ORDER BY
 centroid_id;

Step 4 – Examine the Results

Now let’s check whether the results make any sense (it is, of course, a very simplified example). Here are examples from the three clusters:

There is only one item in the first cluster, which seems to be the high-end product:

1 – Remy Martin Louis Xiii Cognac – 1599.19

In the second group we have far more bottles, which seem to be the mid-range products, e.g.:

2 – The Dalmore Cigar Malt Scotch – 93.33

2 – Kavalan Sherry Oak Single Malt – 73.33

2 – Jim Beam Distillers Masterpiece – 104.91

And finally a cluster with products you would probably drink with Coke or other mixers:

3 – Smirnoff Grape – 8.25

3 – Smirnoff Sours Green Apple – 8.25

3 – Burnett’s Fruit Punch – 4.48

So the results are not that bad in the end, though they could of course be optimized. All you need for this is BigQuery and some basic SQL.
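
If you want to sanity-check the price level each cluster represents without scanning all predictions, you can also inspect the learned cluster centers directly via ML.CENTROIDS. The following is a small sketch (my addition, not part of the original walkthrough):

SELECT
 centroid_id,
 feature, -- name of the input feature
 numerical_value -- centroid value for numeric features such as state_bottle_cost
FROM
 ML.CENTROIDS(MODEL Data.iowa_bottle_clusters)
ORDER BY
 centroid_id;

For state_bottle_cost, the three centroid values should roughly reproduce the price levels seen above.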

Conclusion

In this short article, I wanted to provide some theoretical background on the paradigm of moving the Machine Learning or algorithm to the data. The advantages over classical approaches are clear, especially in terms of setup and maintenance, but also in terms of time saved in the actual data analytics process. In the end, I demonstrated how easily you can now develop a machine learning model using only SQL and Google’s BigQuery. For more information, click on the links below.

Sources and Further Readings

[1] Downs, B. N., Opheim, D. M., Hale, W., Xi, L., Donehower, L. A., & Kalra, D. (2014). A Practical Example of Bringing Computation to Data. Journal of Biomolecular Techniques : JBT, 25(Suppl), S5.

[2] Google, What is BigQuery ML? (2020)

[3] Davies, D. L., Bouldin, D. W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1 (2), pp. 224–227.

