The traditional approach of extracting data from a Data Warehouse or Data Lake, then cleaning it, transforming it, and building a model on top, is slowly being overtaken by a newer approach [1] called
Bring the Compute to the Data or
Bring Machine Learning/Algorithms to the Data
Especially since major providers like Google, Amazon, and Microsoft now link services such as databases and data warehouses with machine learning services, it is no longer necessary to move the data into other platforms or systems. In services like Google's BigQuery, traditional database systems are even extended internally with ML tools [2].
Advantages of the new Paradigm
Advantages of the new approach include a simplified infrastructure. Looking at the simplified architecture below, it becomes clear that if the services can already communicate with each other in a cloud environment, or are integrated into a single service, no further interfaces to third-party systems are necessary. This significantly shortens the setup and maintenance of these environments.

Another important factor is that the Data Science process can be streamlined remarkably. Every data scientist and engineer knows how time-consuming this process can be, so having everything you need in one cloud environment, or even in a single service, simplifies it significantly.

The third point is the simplification of programmability: analysts can, for example, perform Machine Learning tasks using only SQL. In the following, I would like to illustrate this with a simple example in Google BigQuery. If you want to dive deeper into the topic of Data Platform modernization, you might find this article interesting.
Example with BigQuery ML
As a small showcase, and to demonstrate how such solutions simplify the data analytics process, I will use BigQuery and its ML functionality. Here, I use the public dataset that contains every wholesale purchase of liquor in the State of Iowa.
In this example, I just wanted to do some clustering. The example is highly simplified, but it illustrates quite well the possibilities and the steps you have to take.
Step 1 – Create the Model
CREATE OR REPLACE MODEL DATA.iowa_bottle_clusters
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 3,
  distance_type = 'euclidean'
) AS
SELECT
  item_description,
  AVG(state_bottle_cost) AS state_bottle_cost
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  EXTRACT(YEAR FROM date) = 2018
GROUP BY
  item_description;
Above, you can see that I used the k-means algorithm and set the parameter for the number of clusters to three. My idea for this, as I said, highly simplified example is simply to use the variable state_bottle_cost (the amount the Alcoholic Beverages Division paid for each bottle of liquor ordered) to divide the items into three price categories.
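To make the idea concrete, here is a minimal sketch of 1-D k-means (Lloyd's algorithm) on a handful of invented bottle prices. This only illustrates the principle BigQuery ML applies to the single price feature; it is not BigQuery's actual implementation.

```python
# Minimal 1-D k-means sketch (Lloyd's algorithm); prices are invented.
def kmeans_1d(values, k, iters=20):
    # Initialize centroids by spreading them over the sorted values.
    vs = sorted(values)
    centroids = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(centroids[i] - v))
            clusters[idx].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

prices = [4.48, 8.25, 8.25, 73.33, 93.33, 104.91, 1599.19]
print(sorted(kmeans_1d(prices, 3)))
```

With three clusters, the centroids settle into a cheap, a mid-priced, and a premium price band, which is exactly the grouping the SQL above asks BigQuery to learn.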
Step 2 – Evaluate Model
After creating the model, BigQuery automatically provides some metrics that allow you to evaluate the clustering. For the Davies-Bouldin index, the goal is the lowest possible value [3].
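For intuition, the Davies-Bouldin index can be computed by hand. The sketch below, on invented 1-D clusters, shows that tight, well-separated clusters score lower than loose, overlapping ones; it follows the original definition [3], not BigQuery's internal code.

```python
# Sketch of the Davies-Bouldin index on toy 1-D clusters; data invented.
def davies_bouldin(clusters):
    # clusters: list of lists of 1-D points, one inner list per cluster
    centroids = [sum(c) / len(c) for c in clusters]
    # S_i: average distance of each cluster's points to its centroid
    scatter = [sum(abs(p - m) for p in c) / len(c)
               for c, m in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # D_i: worst-case (S_i + S_j) / centroid-distance over other clusters
        total += max((scatter[i] + scatter[j]) / abs(centroids[i] - centroids[j])
                     for j in range(k) if j != i)
    return total / k  # mean of the D_i values

tight = [[1.0, 1.1], [5.0, 5.1], [9.0, 9.1]]
loose = [[1.0, 3.0], [4.0, 6.0], [7.0, 9.0]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

The tight clustering scores far lower, which is why a low index reported by BigQuery indicates a good separation of the price bands.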

Another great feature is the loss curve that is provided during training.

With
SELECT * FROM ML.TRAINING_INFO(MODEL DATA.iowa_bottle_clusters);
and
SELECT davies_bouldin_index FROM ML.EVALUATE(MODEL DATA.iowa_bottle_clusters);
you can query these results later on if needed.
Step 3 – Predict
Via ML.PREDICT we can see which cluster a particular brand belongs to.
SELECT
  centroid_id,
  item_description,
  state_bottle_cost
FROM
  ML.PREDICT(MODEL DATA.iowa_bottle_clusters, (
    SELECT
      item_description,
      AVG(state_bottle_cost) AS state_bottle_cost
    FROM
      `bigquery-public-data.iowa_liquor_sales.sales`
    WHERE
      date BETWEEN '2018-01-01' AND '2018-02-02'
    GROUP BY
      item_description))
ORDER BY
  centroid_id;
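Conceptually, ML.PREDICT for a k-means model simply assigns each input row to the nearest learned centroid. The sketch below illustrates this with invented centroid values (the real centroids live inside the BigQuery model):

```python
# Hypothetical centroids: cluster id -> learned average bottle price.
centroids = {1: 1600.0, 2: 90.0, 3: 8.0}

def predict(state_bottle_cost):
    # Return the id of the closest centroid; with a single feature the
    # Euclidean distance reduces to the absolute difference.
    return min(centroids, key=lambda c: abs(centroids[c] - state_bottle_cost))

rows = [("Remy Martin Louis XIII Cognac", 1599.19),
        ("The Dalmore Cigar Malt Scotch", 93.33),
        ("Smirnoff Grape", 8.25)]
for item, cost in rows:
    print(predict(cost), item, cost)
```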
Step 4 – Examine the Results
Now let's check whether the results make any sense (it is, of course, a very simplified example, but let's see). Here are examples from the three clusters:
There is only one item in the first cluster, which seems to be the high-end product:
1 – Remy Martin Louis XIII Cognac – 1599.19
In the second cluster we have many more bottles, which seem to form the middle class, e.g.:
2 – The Dalmore Cigar Malt Scotch – 93.33
2 – Kavalan Sherry Oak Single Malt – 73.33
2 – Jim Beam Distillers Masterpiece – 104.91
And finally a cluster with products you would probably mix with coke or other mixers:
3 – Smirnoff Grape – 8.25
3 – Smirnoff Sours Green Apple – 8.25
3 – Burnett’s Fruit Punch – 4.48
So the results are not that bad in the end, though they could of course be optimized. And all you need for this is BigQuery and some basic SQL.
Conclusion
In this short article I wanted to provide some basic theoretical knowledge about the paradigm Move the Machine Learning or Algorithm to the Data. The advantages over classical approaches are clear: especially the setup and maintenance effort, but also the time saved in the actual data analytics process. Finally, I demonstrated how easily you can now develop a machine learning model using only SQL and Google's BigQuery. For more information, see the links below.
Sources and Further Readings
[1] Downs, B. N., Opheim, D. M., Hale, W., Xi, L., Donehower, L. A., & Kalra, D. (2014). A Practical Example of Bringing Computation to Data. Journal of Biomolecular Techniques : JBT, 25(Suppl), S5.
[2] Google, What is BigQuery ML? (2020)
[3] Davies, D.L., Bouldin, D.W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (2), pp. 224–227.