Mix SQL and Machine Learning and leverage your computation cluster

Learn how to deploy a Python SQL Engine to your k8s cluster and run complex Python functions from SQL

Nils Braun
Towards Data Science


Photo by Fitore F on Unsplash

Even the best data does not matter if no one can access it. Therefore, it is crucial to make data accessible to everyone, without the need for them to learn complicated programming languages. This is a problem that SQL query engines try to solve: they allow everyone with some SQL knowledge to access data stored in typical big data systems.

Problem solved? Maybe not. Because in many cases, your data does not live in a nice-to-access data lake like S3. And the computations you want to apply to it are machine learning algorithms or other (typically Python-based) complex operations. That does not play well with the (typically) very strict and often Java-based SQL query engines, such as Presto, Apache Impala, or Apache Hive.

And now?

dask-sql (disclaimer: I am the author) is a young SQL query engine built on top of dask, a distributed computation framework written purely in Python that integrates very well with the Python ecosystem, such as the ML frameworks you love. It allows you to use normal SQL, e.g. from your favorite BI tool, to query data living in the Python space. In the background, your calculations are distributed over a cluster of machines, but this complexity is carefully hidden from you.

In this blog post, we will deploy dask-sql to a Kubernetes (k8s) cluster and use it to query parts of the NYC taxi data and apply some Machine Learning to it. In addition to dask-sql and the Dask cluster (to run the calculations), we will deploy Apache Hue as an example of a BI tool. Dask plays well with a variety of computation systems (YARN, batch systems, k8s, etc.). We are using k8s, as it allows us to create a cluster with a public cloud provider very quickly for testing.

The final deployment we will produce in this blog post. Created by the author.

Deploy a k8s cluster

For this blog post, we are going to deploy a k8s cluster on the Google Cloud. However, you can use any k8s cluster you want (on AWS, Azure, your own infrastructure).

Update: As correctly pointed out by kernelcorn, you should take into account where your data lives when choosing a location for your k8s cluster (otherwise you might pay both in performance and in real money). We are querying data on AWS from GCP, which is fine for this example use case, but for any larger-scale use case you want to move your computation as close as possible to your data.

To follow along, log into your Google Cloud console (or register for a new account) and enable the Kubernetes Engine API. After that, create a reasonably sized k8s cluster, e.g. by running the following in the cloud console (or on your local computer after having installed and configured the gcloud utility and kubectl):

gcloud container clusters create \
  --machine-type n1-standard-2 \
  --num-nodes 4 \
  --zone <compute zone> \
  --cluster-version latest \
  <cluster name>

Make sure to choose a compute zone near you and a reasonable cluster name (e.g. “dask-sql”). After the deployment, you should be able to see four running nodes in your cluster with

kubectl get node

Also, install helm (it is already installed on the Google Cloud console). If you don’t know about helm, make sure to also check out its project page (but in short: it is to k8s what pip is to Python).
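If you are working on your local computer, one way to install helm is the official installer script (check the helm documentation for the current instructions):

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh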

Deploy dask-sql and access it

Now we are ready to deploy our Dask cluster, dask-sql, and Apache Hue as a frontend. To simplify the deployment, I have packaged everything up into a git repository. Clone it and deploy it to the cluster using helm:

git clone https://github.com/nils-braun/dask-sql-k8s-deployment
cd dask-sql-k8s-deployment
helm dependency update dask-sql
helm upgrade --cleanup-on-fail --install dask-sql dask-sql

The deployment will take some time (as it needs to download some data from S3). You can check with

kubectl get pods

whether all pods are already in the “ready” state. Once this is the case, access the Apache Hue BI tool via the external IP printed by

kubectl get services hue

on port 8888 or, if you are deploying from your local computer, by setting up a port forward

kubectl port-forward svc/hue 8888:8888

and accessing http://localhost:8888. The first username and password you type in will be used as the admin account. Now wait for the database to show up in the “PrestoSQL” view on the left and start querying it! For example, you could run:

SELECT
    FLOOR(trip_distance / 5) * 5 AS "distance",
    AVG(tip_amount) AS "given tip",
    AVG(predict_price(total_amount, trip_distance, passenger_count)) AS "predicted tip"
FROM "nyc-taxi"
WHERE
    trip_distance > 0 AND trip_distance < 50
GROUP BY
    FLOOR(trip_distance / 5) * 5

The result of our SQL query in Apache Hue. Screenshot created by the author.

What is going on?

Let’s take a step back to understand what is actually going on here. We have deployed a small cluster of six Dask workers, a scheduler to control them, a dask-sql server, and Apache Hue. If you are interested, have a look into the repository to find out more.

When you now enter a SQL query into Hue, it gets forwarded to the dask-sql server (which happens to speak the Presto wire protocol, even though it is not a Presto server). dask-sql parses the query and sends it to the Dask cluster for computation, which will eventually fetch the data.
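By the way, the server exposes the same functionality as the dask-sql Python API, which you can also use directly. A minimal sketch of that API (with a small stand-in dataframe, since the real taxi data is only registered inside the deployment) could look like this:

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

# Stand-in dataframe; in the real deployment this is the taxi data from S3
df = dd.from_pandas(pd.DataFrame({"tip_amount": [1.0, 2.5, 3.0]}), npartitions=1)

# The context stores all registered tables and functions
c = Context()
c.create_table("nyc-taxi", df)

# c.sql() only builds a lazy dask dataframe ...
result = c.sql('SELECT AVG(tip_amount) AS "avg_tip" FROM "nyc-taxi"')

# ... the actual (distributed) computation happens on .compute()
print(result.compute())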

The Dask cluster working hard to solve our query. The screenshot shows the dask-scheduler. You can access it by port-forwarding port 8787 from the “dask-scheduler” pod. Each bar is a part of the computation, each row a parallel task. Screenshot created by the author.

Well, fine. But why is this so cool? Because you used SQL not only to trigger a complex distributed computation on a cluster of machines (which, by itself, is already cool), but also to trigger a Python computation. That is important because it opens up the possibility to (re)use your favorite Python packages and functions.

Take a closer look at the SQL example above. Do you see the predict_price function? That does not look like a standard SQL function! And indeed, if you take a look into the dask-sql/files/run.py file in the repository, you will see that it is a Python function:

import dask.dataframe as dd
import dask_xgboost

# "client" (the connection to the Dask cluster) and "bst" (the
# already trained xgboost model) are defined earlier in the same file

# Our custom function for tip prediction
# using the already loaded xgboost model
def predict_price(total_amount, trip_distance, passenger_count):
    # Create a dataframe out of the three columns
    # and pass it to dask-xgboost to predict in a distributed fashion
    X = dd.concat([total_amount, trip_distance, passenger_count],
                  axis=1).astype("float64")
    return dask_xgboost.predict(client, bst, X)

It uses an already trained xgboost model, which is evaluated on the data read from S3 (the data is preregistered, as you can see in the same file). And all of this is controlled from SQL, nice and clean, without you even noticing it.
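For SQL to know about this function, it has to be registered with the dask-sql context, together with its parameter and return types. A sketch of such a registration (the types here are my assumption of what fits the taxi data; the repository may use slightly different ones):

import numpy as np

# "c" is the dask-sql Context from the server setup.
# Expose the Python function under the SQL name "predict_price",
# together with its parameter names/types and its return type.
c.register_function(
    predict_price,
    "predict_price",
    [("total_amount", np.float64),
     ("trip_distance", np.float64),
     ("passenger_count", np.float64)],
    np.float64,
)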

What can you do with it? Imagine that all your data engineers need to do is define the data samples and databases registered in dask-sql. After that, everyone in your organization who knows a bit of SQL can start querying the data. It does not matter whether your data lives on S3 or HDFS (which can also be done using e.g. Impala or Presto); it can come from any system and in any format, as long as you can read it with Python (which is true for basically everything). Additionally, you can trigger arbitrary, preregistered Python computations on this data from SQL: machine learning, custom computations shared between your Data Science and BI teams, complex operations that are too complicated to be implemented in SQL directly, you name it. With this, you can close the gap between Business Analysts, Data Scientists, and Data Engineers and get important business insights from your (big) data even faster.
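To give an idea of how little is needed for that: registering a dataset from S3 could look roughly like the following sketch (the bucket path is a placeholder, and reading public buckets anonymously requires the s3fs package):

import dask.dataframe as dd

# Lazily read the dataset directly from S3 ...
df = dd.read_csv(
    "s3://<bucket>/yellow_tripdata_2019-*.csv",  # placeholder path
    storage_options={"anon": True},  # public bucket, no credentials needed
)

# ... and register it, so everyone can query it as "nyc-taxi" from SQL
c.create_table("nyc-taxi", df)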

Conclusion

This blog post described how to get up and running with dask-sql on a Kubernetes cluster and how to start querying your data with SQL. If you are interested in more details, the dask-sql documentation and the deployment repository linked above are good starting points for your journey.

Make sure to delete your cluster after you are finished.
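On Google Cloud, this can be done with the same cluster name and compute zone you used during creation:

gcloud container clusters delete <cluster name> --zone <compute zone>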

Happy SQL-ing!
