Machine Learning in SQL — it actually works!

With the newest version of dask-sql it is possible to use the huge Python ML ecosystem in SQL

Nils Braun
Towards Data Science



Sometimes it is hard to believe that a world before ML existed. So many modern data analyses are built on top of ML techniques, and this will not change in the foreseeable future. However, not everyone is able to benefit from these vast advances, because using ML techniques mostly involves using Python, developing code, and understanding many new technologies. Especially when Big Data and distributed systems enter the game, things get messy.

This is a problem that SQL query engines are trying to solve. They make it possible to use the complicated world of distributed systems with simple, yet powerful SQL commands. dask-sql is a new SQL query engine (disclaimer: I am the author) built on top of the Python-native distributed library Dask. This new library allows you to combine Python with SQL and distributed, scalable computation! (Read more about it in one of my other posts.)

The newest version 0.3.0 of dask-sql has experimental support for Machine Learning, so today we will see what we can do with it. Of course, we will use the well-known Iris dataset for this — the hello world of data science. Even though this particular data sample is quite small, everything shown in this post will easily scale up to large amounts of data by just adding more computation nodes to your cluster.

Side note: ML with dask-sql is still experimental :-) Feel free to try it out, but use it with caution.

Prerequisites and Preparations

In this post, I will assume you (or your Data Engineers) have dask-sql already set up and running. Refer to the documentation or this blog post for more information. I will also assume that you connect to dask-sql via its SQL server interface — for example via a BI tool like Apache Hue, which even comes with native support for dask-sql in its newest version.
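If you prefer scripted access to the SQL server over a BI tool, any presto client should do, as the dask-sql server speaks the presto wire protocol. Here is a minimal sketch assuming the pyhive package is installed and that the server is reachable under the host and port shown below (both are assumptions — adjust them to your setup):

from pyhive.presto import connect

# Connect to the running dask-sql server (host and port are assumptions,
# check your server configuration) and fire a test query.
cursor = connect(host="localhost", port=8080).cursor()
cursor.execute("SELECT 1 + 1")
print(cursor.fetchall())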

If you are using the Python interface to dask-sql, you can still follow along. Just initialize your Jupyter notebook with the following cell

from IPython.core.magic import register_line_cell_magic
from dask_sql import Context

# Create a context to store the tables and models
c = Context()

# Small helper function to make our life easier
@register_line_cell_magic
def sql(line, cell=None):
    if cell is None:
        cell = line
        line = None

    if not line:
        line = {}

    return c.sql(cell, return_futures=False, **line)

In the following code examples, prepend every SQL command with %%sql, e.g.

%%sql
SELECT 1 + 1
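By the way, the promise from the introduction that everything scales to larger data holds as long as your computations run on a Dask cluster. A minimal sketch of how that could look when using the Python interface (the scheduler address is just a placeholder for your own cluster):

from dask.distributed import Client

# Connect to an already running Dask cluster before issuing any queries.
# The address below is a placeholder - replace it with your scheduler.
# Without a client, Dask (and therefore dask-sql) runs on your local machine.
client = Client("tcp://my-dask-scheduler:8786")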

Loading and Preparing the Data

After these things are out of the way, let's begin by importing our data. As dask-sql makes use of the large Python and Dask ecosystem, you can read in data samples in many different formats and from many different locations. In this example, we will fetch the data from the Internet in the rather standard CSV format:

CREATE OR REPLACE TABLE iris WITH (
    location = 'https://datahub.io/machine-learning/iris/r/iris.csv',
    persist = True
)

The data set is loaded and stored as a table with the name “iris”. Setting persist = True makes dask-sql cache the data in memory.

Now, check if the data was loaded correctly

DESCRIBE iris

As you can see, the data types were assigned automatically. We can show the first ten rows of the data with this standard SQL command:

SELECT * FROM iris LIMIT 10

A short introduction to the Iris data set: as shown in the output above, the dataset describes measurements of Iris flowers together with their species. It is one of the standard datasets in Machine Learning and can be used as an example for many types of ML methods. In this example we will apply an unsupervised clustering algorithm.

Before we come to the training, let’s first do another important step in every ML pipeline: feature extraction. Since I am not a biologist, I will just introduce a very simple and naive new feature as an example here: multiplying the sepal length with the petal length. It is also possible to generate more complex features using SQL statements and functions. If this is not enough, dask-sql allows you to register user-defined functions (UDFs) written in Python (a short sketch follows after the next code block).

SELECT
    *,
    sepallength * petallength AS new_feature
FROM iris
LIMIT 10
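If simple arithmetic is not enough for your feature engineering, you can register any Python function with the context and call it from SQL afterwards. The following is a minimal sketch using the Python interface; the function length_product and its implementation are made up purely for illustration:

import numpy as np

# A made-up feature function - any Python/numpy logic operating on the
# columns would do here.
def length_product(sepallength, petallength):
    return sepallength * petallength

# Register the function together with its parameter names/types and the
# return type, so that the SQL layer knows how to call it.
c.register_function(
    length_product,
    "length_product",
    [("sepallength", np.float64), ("petallength", np.float64)],
    np.float64,
)

# Afterwards it can be used like any built-in SQL function:
c.sql(
    "SELECT length_product(sepallength, petallength) AS new_feature FROM iris LIMIT 10",
    return_futures=False,
)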

To make our life a bit easier, let’s introduce an alias for this enhanced table

CREATE OR REPLACE TABLE transformed_data AS (
    SELECT
        *,
        sepallength * petallength AS new_feature
    FROM iris
)

We are now ready to apply Machine Learning!

Training a Machine Learning Model

There exists a large variety of possible Machine Learning models — from simple models like Linear Regression, over robust Boosted Decision Trees, to state-of-the-art research models like Transformers. Many of them are implemented in the famous scikit-learn Python package and are therefore (among many other libraries) usable in dask-sql.

In this example, we apply the k-Means clustering algorithm. In short, it groups the dataset into clusters of rows with similar features. If everything goes well, we hope it will cluster flowers of the same species together — without us telling the algorithm about it. Let’s see how the algorithm performs (spoiler: not very well). If you want to learn more about the models compatible with dask-sql and their settings, I encourage you to have a look at the documentation.
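For reference, the CREATE MODEL statement below is roughly a shorthand for the following plain Python code — a sketch only, as dask-sql additionally takes care of fetching the data and wiring the trained model into later PREDICT calls:

from sklearn.cluster import KMeans

# Fetch the feature columns as a pandas dataframe ...
features = c.sql("""
    SELECT sepallength, sepalwidth, petallength, petalwidth, new_feature
    FROM transformed_data
""", return_futures=False)

# ... and fit a k-means model with three clusters on them.
model = KMeans(n_clusters=3)
model.fit(features)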

So, let’s apply our clustering algorithm to the data!

CREATE OR REPLACE MODEL clustering WITH (
    model_class = 'sklearn.cluster.KMeans',
    wrap_predict = True,
    n_clusters = 3
) AS (
    SELECT sepallength, sepalwidth, petallength, petalwidth, new_feature
    FROM transformed_data
)

As you can see, we use a new SQL construct CREATE MODEL for the training, which gets some parameters to specify which model to train. In our case, we choose the k-Means algorithm from scikit-learn and set the number of groups or clusters we expect to three (as we have three species). The algorithms in the scikit-learn package work very well on medium-sized data; if you need to go beyond that, have a look at dask-ml.
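As a sketch of how the dask-ml variant could look (assuming dask-ml is installed; the statement mirrors the one above, only the model class changes, and the model name distributed_clustering is made up for illustration):

# Train the distributed k-means implementation from dask-ml instead of
# scikit-learn - dask-ml models work on Dask data natively, so
# wrap_predict should not be needed here.
c.sql("""
    CREATE OR REPLACE MODEL distributed_clustering WITH (
        model_class = 'dask_ml.cluster.KMeans',
        n_clusters = 3
    ) AS (
        SELECT sepallength, sepalwidth, petallength, petalwidth, new_feature
        FROM transformed_data
    )
""", return_futures=False)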

The training should be finished in no time (as the total dataset is rather small), so we can go ahead and check the prediction.

Checking the Performance

SELECT * FROM PREDICT (
    MODEL clustering,
    SELECT * FROM transformed_data
    LIMIT 10
)

This SQL statement will apply the trained model to the given data and add a new column “target” containing the model’s prediction. Judging from the first ten lines, it looks rather good (all the “setosa” rows have the same predicted target). So let’s introduce an alias again to do some more calculations:

CREATE OR REPLACE TABLE iris_results AS (
    SELECT class AS label, target AS predicted FROM PREDICT (
        MODEL clustering,
        SELECT * FROM transformed_data
    )
)

To keep it short, we will only take a quick glance at the results and check the distribution of species and predicted clusters.

SELECT
    label, predicted, COUNT(*) AS numbers
FROM iris_results
GROUP BY label, predicted

Not perfect at all, but fortunately this is not a post on ML, so I will skip the optimization step :-). Your BI tool might be able to plot these numbers automatically, and as a Python user you can use

import matplotlib.pyplot as plt

df = c.sql("""
    SELECT
        label, predicted, COUNT(*) AS numbers
    FROM iris_results
    GROUP BY label, predicted
""", return_futures=False)

df = df.set_index(["label", "predicted"])
df.numbers.unstack(0).plot.bar(ax=plt.gca())

Summary

Thanks for following this post until the end! We have covered quite a lot of material, so here is a quick recap:

  • SQL query engines are cool because they let you query your data on complex distributed systems with simple SQL syntax (and a lot of magic).
  • dask-sql is one of them, which has a particularly simple interaction with the Python ecosystem (as it is built on top of Dask, which is a library to scale Python applications).
  • One benefit is the easy integration of a large variety of Python ML libraries, for example the k-Means algorithm from the scikit-learn package shown in this blog post. And the best part: everything works with nothing but SQL!

If you want to learn more, head over to the documentation, and start SQL-ing your data.
