Chaining Multiple SQL Queries in BigQuery

Execute multiple queries one after another in sequence in BigQuery

muffaddal qutbuddin
Towards Data Science

--

By Dakub, Souce Pixabay

Bigquery is a fantastic tool! It lets you do really powerful analytics works all using SQL like syntax.

But it lacks chaining the SQL queries. We cannot run one SQL right after the completion of another. There are many real-life applications where the output of one query depends upon for the execution of another. And we would want to run multiple queries to achieve the results.

Here is one scenario, suppose you are doing RFM analysis using BigQuery ML. Where first you have to calculate the RFM values for all the users then apply the k-means cluster to the result of the first query and then merge the output of the first query and second query to generate the final data table.

In the above scenario, every next query depends upon the output of the previous query and the output of each query also needs to be stored in data for other uses.

I this guide I will show how to execute as many SQL queries as you want in BigQuery one after another creating a chaining effect to gain the desire results.

Let me know if you need any help with BigQuery

Methods

I will demonstrate two approaches to chaining the queries

  1. The First using cloud pub/sub and cloud function: This is a more sophisticated method as it ensures that the current query is finished executing before executing the next one. This method also required a bit of programming experience so better to reach out to someone with a technical background in your company.
  2. The second using BigQuery’s own scheduler: However, the query scheduler cannot ensure the execution of one query is complete before the next is triggered so we will have to hack it using query execution time. More on this later.

And If you want to get your hands dirty yourself then here is an excellent course to start with.

Note: We will continue with the RFM example discussed above to get you the idea of the process. But the same can be applied for any possible scenario where triggering multiple SQL queries is needed.

Method 1

Method 1 uses the combination of cloud functions and pub/subs to chain the entire flow. The process starts by query scheduler which after executing the first query sends a message to pub/sub topic that triggers a cloud function responsible to trigger 2nd query and once completed sends a message to another pub/sub topic to start yet another cloud function. The process continues until the last query is executed by the cloud function.

Let’s understand the process with our RFM analysis use case.

Suppose we have three queries that are needed to be run one after another to perform RFM analysis.
First, that calculates RFM values, we will call it RFM Values.
Second, that creates the model, we will call itRFM Model.
Third, that merges model output with users RFM values, we will call it RFM Final.

Here is how the data pipeline looks like:

Chaining query in BigQuery data pipeline, by Muffaddal

Note: I will assume that tables for all three queries have already been created.

1- We start by first creating a Pub/Sub topic as it will be needed while creating RFM Values query schedular. I have named it RFM_Model_Topic as it will trigger the cloud function responsible for executing our model query (i.e RFM Model).

RFM_Model_Topic Pub/sub topic, by Muffaddal

Copy the topic name that is needed while creating RFM Values schedular.

2- Next, go to BigQuery, paste the RFM Values query that calculates RFM values for our users, in the query editor, and click the ‘Schedule query’ button to create a new query schedular.

create a scheduled query, by Muffaddal

3- Enter the required values in the scheduler creation menu to create the scheduler

query schedular creation menu, by Muffaddal

What this scheduler will do is it will execute on the specified time to calculate users' recency, frequency, and monetary values and store it in the mentioned BigQuery table. Once the schedule is done executing the query it will send a message to our RFM_Model_Topic which will trigger a cloud function to trigger our model query. So next let’s create a cloud function.

4- Go to RFM_Model_Topicpub/sub topi and click ‘Trigger Cloud Function’ Button at the top of the screen.

create cloud function from pub/sub topic, by Muffaddal

5- Enters settings as shown below and name the cloud function as RFM_Model_Function

cloud function settings, by Muffaddal

6- And paste below code in index.js file

Cloud function to trigger RFM_Model Query, by Muffaddal

Once the query is executed cloud function sends a publish message to a new pub/sub topic named RFM_Final which triggers cloud function responsible for the last query that combines both RFM values and model results in one data set.

7- Therefore, next, create RFM_Model topic in pub/sub and a cloud function as we did in the previous step. Copy-paste below code in cloud function so that it can run the last query.

Cloud function to trigger RFM_Final Query, by Muffaddal

And that is it!

We can use as many pub/sub and cloud functions as we want to chain as many SQL queries as we want.

Method 2

Now the first approach is robust but requires a bit of programming background and says it is not your strong suit. You can use method 2 to chain the BigQuery queries.

BigQuery’s query scheduler can be used to run the queries one after another.

Idea is that we start the process the same as we did in method 1, i.e. trigger the first query using a scheduler and estimate its time for completion. Let’s say the first query takes 5 minutes to complete. What we will do is trigger the 2nd query 10 mints after the first query start time. This way we are ensured that the second query is triggered after the first query is completely executed.

Let’s understand this by example

Chain queries using query scheduler, by Muffaddal

Suppose we scheduled the first query at 12:30 am. It takes 10 mints to complete. So we know at 12:40 am the first query should be completed. We will set the second query scheduler to execute at 12:50 am (keeping a 10 mint gap between two schedulers just in case). And we will trigger the third query at 1:10 am and so on.

Method 2 can also work with a combination of method 1. Meaning that we trigger Query Schedular using the cloud function to do the trick. Doing so also has the added benefit that you don’t have to run into the execution time limit of cloud function and your query can take as long as it needs to execute.
PS: Triggering on-demand query scheduler with cloud function was suggested by
Simon Thomsen.

Note: Since the query scheduler doesn’t work with BigQuery ML, therefore, method 2 won’t work for our RFM analysis case but It should get you the idea of how to use the scheduler to chain queries.

Other Methods

If you can handle more sophisticated pipelines Sébastien Morand proposed two more methods that you can employ.

Summary

Executing queries one after another helps to achieve really great results especially when the result of one query depends on the output of another and all the query results are also needed as table format as well. BigQuery out of the box doesn’t support this functionality but using GCP’s component we can streamline the process to achieve the results.

In this article, we went through two of the method to do this. First using the cloud pub/sub and cloud function and another using BigQuery’s own query scheduler.

With this article, I hope I was able to convey the idea of the process for you to pick it up and tailor it for your particular business case.

Read More on BigQuery:

--

--