What is Serverless SQL? And how to use it for Data Exploration

Data Exploration tools for Data Scientists

Adi Polak
Towards Data Science
5 min readDec 1, 2020

--

So, you are a data scientist, you work with data and need to explore it and run some analytics on the data before jumping into running extensive machine learning algorithms.

Let’s start by examining what Serverless is.

Serverless:

According to Wikipedia, serverless computing is a cloud computing execution model where the cloud provider manages the server and dynamically allocates the resource needed to finish the task.

That means, as users, we are in charge of the logic only. We don’t need to take care of the servers, capacity planning, or maintenance operations scale. It doesn’t mean they are not happening; they are, just not by us. 😁

What is SQL?

According to Wikipedia, SQL is a domain-specific programming language used for managing data held in relational databases.

So what is Serverless SQL?

Serverless SQL is a distributed computing tool that enables us to process distributed data using SQL language without managing the databases servers themselves. If we have data, or “big data” in one or more of our data lake/storages, like AWS S3 or Azure blob storage, we can run SQL query on that data without the need to build a pipeline or inject the data into distributed databases such as Cassandra or MongoDB.

This is a huge advantage, especially when we would like to interact with offline data without creating the pipelines or peek at completely fresh new data that was just sunk into the storage before replicating/transforming and saved into a dedicated Database.

As with all cloud services, there is a need to understand the cost model. with serverless, we pay-per-use. That means that we pay for the amount of processed data.

If we ran a select * query on 2TB, the prices would be calculated by the 2TB data processed. This can be more cost-efficient than having a dedicated pool of servers, depends on the company/group needs.

Let’s take a look at a use case where explore data with Serverless SQL

We will look into the famous NYC Taxi dataset from the open datasets and will try to understand what is the state of taxis in the city, what is the time during the day that people take taxies, and more. This information will help us with the exploration of the data phase.

For Serverless SQL, I use Azure Synapse workspace, you can use it as well, or take a look at Big Query or explore other tools.

First thing, let’s take a look at the data, select top 100 will return the first 100 rows from the parquet file:

This is the select operation results in Table

We can look at it in a table format or chart format. currently, the chart will not give us much information as our query was a simple select.

But, we can start exploring the data using the category column, labels, and much more.

Let’s take a look at rides per year, in between the years 2014 and 2019, this will give us the time series insights into what is happening along the years:

Query
Chart outcome

Looking at the chart, we can see that there is a significant decrease in how many people are taking taxis in NYC over the years. Starting with 165 M taxi rides in 2014 and decreasing to 44 M taxi rides in 2019.

That can be because of better city transport in the city like metros, buses, or new ride-sharing companies that came into that market during that period of time like Uber and Lyft.

Now, let’s take a look at a single year:

Running the query and looking at the results, under the chart, we can see a pattern where during the weekends there are fewer rides, we will achieve that by changing the views of the chart according to our needs:

It might be related to the fact that during public holidays, people are less likely to take taxis, we will join the table of NYC Taxi data with United State Holiday data.

All this data is under the open-source datasets and is available for you to query and use.

This query can be divided into 3 queries

  1. Define taxi_ride temp table

Here we are pulling data from the data lake, selecting, casting date information and counting the rides for 2016, and grouping rides per day.

2. Define public_holiday temp table:

Here we are pulling data from the data lake, but from a different table, selecting the holiday for the United-States during 2016.

3. Combining it:

Using the withoperator the beginning of the query and the 3rd query referring to both new temp tables, we can run join on the table with the date column as a key.

Running the query on our Serverless SQL engine will give us this chart, where we can tweak the columns to better explore and understand the data:

As long as the data is available for use in a data lake, either public or by our company, we can easily do the exploration process using Serverless SQL.

We can later save the view/output that we created and share it with our colleagues or download the image using the save as image button.

Save as Image button

Curious to learn more?

Watch the free workshop on it here.

We will look into:

  • running queries
  • cost
  • SQL functionality
  • SQL optimizations

Notes

  • Serverless SQL is not built for Real-Time analytics and workloads that require millisecond response.
  • The serverless SQL cost model is pay by usage

Thank you for reading so far. Happy to take your question, input, and feedback on @adipolak.

--

--

👩‍💻 Software Engineer 📚 Author of Scaling Machine Learning with Spark (O'Reilly) 🗣️ Keynote Speaker 💫 Databricks ambassador