The traditional approach of extracting data from a Data Warehouse or Data Lake and then cleaning, transforming and building a model with it is slowly being overtaken by a newer approach called Bring the Compute to the Data, or Bring Machine Learning/Algorithms to the Data [1]. Especially since big providers like Google, Amazon and Microsoft link services such as databases and data warehouses with machine learning services, it is no longer necessary to integrate the data into other platforms or systems. In services like Google’s BigQuery, traditional database systems are even extended internally with ML tools [2].
Advantages of the new Paradigm
Advantages of the new approach include a simplified infrastructure. If we look at the simplified architecture below, it becomes clear that if the services can already communicate with each other in a cloud environment, or are integrated into one service, no further interfaces to third-party systems are necessary. This significantly shortens the setup and reduces the maintenance of these environments.

Another important factor is that the Data Science process can be remarkably streamlined. Every data scientist and engineer knows how time-consuming this process can be, so having everything you need in one cloud environment, or even in a single service, simplifies it significantly.

The third point is simplified programmability: analysts can, for example, perform Machine Learning tasks using only SQL. In the following, I would like to illustrate this with a simple example in BigQuery. If you want to dive deeper into the topic of Data Platform modernization, you might find this article interesting.
Example with BigQuery ML
As a little showcase, and to prove how such solutions simplify the data analytics process, I will use BigQuery and its ML functionalities. Here, I used the public dataset that contains every wholesale purchase of liquor in the State of Iowa.
In this example, I just wanted to do some clustering. The example is very simplified, but it shows the possibilities and the necessary steps quite well.
Step 1: Create the Model
CREATE OR REPLACE MODEL DATA.iowa_bottle_clusters
OPTIONS (model_type='kmeans', num_clusters=3, distance_type='euclidean') AS
SELECT
  item_description,
  AVG(state_bottle_cost) AS state_bottle_cost
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  EXTRACT(YEAR FROM date) = 2018
GROUP BY
  item_description;
Above, you can see that I used the k-means algorithm and set the number of clusters to three. My idea for this highly simplified example is to use the variable state_bottle_cost (the amount that the Alcoholic Beverages Division paid for each bottle of liquor ordered) to divide the products into three price categories.
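If you are curious what the model actually learned, BigQuery ML also lets you inspect the cluster centroids of a k-means model via ML.TRAINING_INFO and ML.CENTROIDS. Here is a minimal sketch (the exact output columns may vary with the BigQuery ML version):

-- Inspect the learned centroids of the k-means model (sketch)
SELECT *
FROM
  ML.CENTROIDS(MODEL DATA.iowa_bottle_clusters)
ORDER BY
  centroid_id;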
Step 2: Evaluate the Model
After creating the model, BigQuery automatically provides you with some metrics that allow you to evaluate the clustering. For the Davies-Bouldin index, the goal is the lowest possible value [3].
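For reference, the Davies-Bouldin index averages, over all k clusters, each cluster's worst-case ratio of within-cluster scatter to between-centroid distance (the standard definition from [3], not BigQuery-specific notation):

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d(c_i, c_j)}

Here s_i is the average distance of the points in cluster i to their centroid c_i, and d(c_i, c_j) is the distance between the centroids of clusters i and j. Lower values therefore mean more compact and better separated clusters.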

Another great feature is the loss curve that BigQuery provides for the training run.

With
SELECT * FROM ML.TRAINING_INFO(MODEL DATA.iowa_bottle_clusters);
and
SELECT davies_bouldin_index FROM ML.EVALUATE(MODEL DATA.iowa_bottle_clusters);
you can query the training statistics and evaluation metrics again later on, if needed.
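If you are unsure whether three clusters is really the right choice, one simple approach is to train a second model with a different num_clusters and compare the Davies-Bouldin index of both models. The model name iowa_bottle_clusters_4 below is hypothetical, just a sketch reusing the statements from above:

-- Sketch: train an alternative model with four clusters (hypothetical model name)
CREATE OR REPLACE MODEL DATA.iowa_bottle_clusters_4
OPTIONS (model_type='kmeans', num_clusters=4, distance_type='euclidean') AS
SELECT
  item_description,
  AVG(state_bottle_cost) AS state_bottle_cost
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  EXTRACT(YEAR FROM date) = 2018
GROUP BY
  item_description;

-- Compare its Davies-Bouldin index with the three-cluster model
SELECT davies_bouldin_index FROM ML.EVALUATE(MODEL DATA.iowa_bottle_clusters_4);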
Step 3: Predict
Via ML.PREDICT we can see which cluster a particular brand belongs to.
SELECT
  centroid_id,
  item_description,
  state_bottle_cost
FROM
  ML.PREDICT(MODEL DATA.iowa_bottle_clusters, (
    SELECT
      item_description,
      AVG(state_bottle_cost) AS state_bottle_cost
    FROM
      `bigquery-public-data.iowa_liquor_sales.sales`
    WHERE
      date <= '2018-02-02'
      AND date >= '2018-01-01'
    GROUP BY
      item_description))
ORDER BY
  centroid_id;
Step 4: Examine the Results
Now let’s check if the results make any sense. Here are examples from the three clusters:
There is only one item in the first cluster, which seems to be the high-end product:
1 – Remy Martin Louis Xiii Cognac – 1599.19
In the second group we have far more bottles, which seem to fall into a middle price class, e.g.:
2 – The Dalmore Cigar Malt Scotch – 93.33
2 – Kavalan Sherry Oak Single Malt – 73.33
2 – Jim Beam Distillers Masterpiece – 104.91
And the third cluster contains products you would probably mix with coke or other soft drinks:
3 – Smirnoff Grape – 8.25
3 – Smirnoff Sours Green Apple – 8.25
3 – Burnett’s Fruit Punch – 4.48
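Beyond these spot checks, a small aggregation over the same ML.PREDICT call gives a quick overview of how many items fall into each cluster and what their average bottle cost is. This is just a sketch reusing the query from Step 3:

-- Sketch: count items and average bottle cost per cluster
SELECT
  centroid_id,
  COUNT(*) AS items,
  ROUND(AVG(state_bottle_cost), 2) AS avg_state_bottle_cost
FROM
  ML.PREDICT(MODEL DATA.iowa_bottle_clusters, (
    SELECT
      item_description,
      AVG(state_bottle_cost) AS state_bottle_cost
    FROM
      `bigquery-public-data.iowa_liquor_sales.sales`
    WHERE
      date <= '2018-02-02'
      AND date >= '2018-01-01'
    GROUP BY
      item_description))
GROUP BY
  centroid_id
ORDER BY
  centroid_id;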
So the results are not that bad in the end, but they could of course be optimized further. All you need here is BigQuery and some basic SQL.
Conclusion
In this short article, I wanted to provide some basic theoretical knowledge about the paradigm of moving Machine Learning or algorithms to the data. The advantages over classical approaches are clear, especially in terms of setup and maintenance, but also in the time savings within the actual data analytics process. BigQuery ML democratizes the use of Machine Learning: Data Scientists and Analysts can use it to build and run models without having to access new business intelligence tools and tables, and predictive analytics can help with business decision making. Finally, I demonstrated how easily you can develop a machine learning model using only SQL and Google’s BigQuery. For more information, click on the links below.
Sources and Further Readings
[1] Downs, B. N., Opheim, D. M., Hale, W., Xi, L., Donehower, L. A., & Kalra, D. (2014). A Practical Example of Bringing Computation to Data. Journal of Biomolecular Techniques: JBT, 25(Suppl), S5.
[2] Google, What is BigQuery ML? (2020)
[3] Davies, D. L., & Bouldin, D. W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), pp. 224–227.