Send Cloud SQL Data to BigQuery

Automated data pipeline to move Cloud SQL data to BigQuery

muffaddal qutbuddin
Towards Data Science


source pixabay, by Dimhou

Cloud SQL is a database service from Google Cloud Platform that makes it easy to set up, maintain, and manage relational databases.

However, it isn’t built for heavy analytics workloads. You need a warehouse service such as BigQuery to do advanced analytics and machine learning tasks, such as performing RFM analysis using BigQuery ML. For this, you need to move your Cloud SQL data to BigQuery.

In this guide, we will build a data pipeline that sends Cloud SQL data to BigQuery in an automated fashion.


Data Pipeline

For our data pipeline, we will make use of two GCP components:
1. Federated Query
2. Query Scheduler

This is how the data will flow:

Data flow illustration, by Muffaddal

A query scheduler will execute a federated query that fetches the Cloud SQL data into the BigQuery environment and appends it to a BigQuery table.


Cloud SQL to BigQuery Connection

Before going into the details of how to move Cloud SQL data to BigQuery, we first have to establish a connection between the two components. Follow the steps below to do so.

1- BigQuery Connection API

Enable the BigQuery Connection API to be able to connect BigQuery to Cloud SQL.

BigQuery Connection API, by Muffaddal
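If you prefer the command line, the same API can be enabled with gcloud. A sketch, assuming the Cloud SDK is installed and a default project is configured:

```shell
# Enable the BigQuery Connection API for the current project
gcloud services enable bigqueryconnection.googleapis.com
```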

2- Cloud SQL Public IP

For BigQuery to connect to Cloud SQL, we have to make sure the Cloud SQL instance is on a public network.

Go to your SQL instance > Connections tab, enable Public IP, and hit Save.

make instance public, by Muffaddal

Even if your instance has a public IP, it will remain inaccessible from the public internet because you haven’t added an authorized network address.
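The same setting can also be applied from the command line. A sketch, where `my-instance` is a placeholder for your instance name:

```shell
# Assign a public IP to the Cloud SQL instance
gcloud sql instances patch my-instance --assign-ip
```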

3- Cloud SQL Connection in BigQuery

Go to the BigQuery interface and add an external data source.

steps to create external connection, by Muffaddal

This will open a prompt asking for details about your Cloud SQL instance. Enter the appropriate information to establish the connection with Cloud SQL.

SQL detail prompt, by Muffaddal
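Alternatively, the connection can be created with the `bq` command-line tool. A sketch with placeholder project, region, instance, database, and credential values:

```shell
# Create a Cloud SQL connection resource in BigQuery (all values are placeholders)
bq mk --connection \
  --connection_type='CLOUD_SQL' \
  --properties='{"instanceId":"my-project:us-central1:my-instance","database":"mydb","type":"MYSQL"}' \
  --connection_credential='{"username":"bq_reader","password":"secret"}' \
  --project_id=my-project \
  --location=us \
  my_sql_connection
```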

4- Manage Connection

Next, provide the email addresses and roles of the users who should have access to this newly added external database.

share connection button, by Muffaddal

This also lets those users use the external source in BigQuery.

With this done, we are ready to import Cloud SQL data into BigQuery.

Federated Queries

Federated queries enable us to query data residing in Cloud SQL in real time, without copying or moving the data into BigQuery. Federated queries support both MySQL (2nd generation) and PostgreSQL instances in Cloud SQL.

They employ the EXTERNAL_QUERY function to do the trick. The function takes two arguments: first the connection ID, and second the actual SQL query to run against the external database.

EXTERNAL_QUERY( <connection id>, <sql query> )

Here is a very basic query to process SQL data in the BigQuery interface.

federated query example, by Muffaddal

The query within the EXTERNAL_QUERY function processes the data on the Cloud SQL side and passes the result to the BigQuery engine to handle.
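As a sketch, such a query could look like the following; the connection ID, table, and column names are placeholders:

```sql
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my_sql_connection',          -- connection ID (placeholder)
  'SELECT id, name, created_at FROM users;'   -- runs inside Cloud SQL
);
```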

This means two things:
First, within the EXTERNAL_QUERY function we can use filters, joins, and subqueries to fetch only the desired data from Cloud SQL.

Second, once the result is passed to the BigQuery engine (the outer SELECT in the example query above), we can merge it with other existing BigQuery tables to transform and build our dataset as desired in BigQuery.

We will employ these two capabilities to ensure we only fetch the data that doesn’t already exist in our final BigQuery table, i.e., the table where we store the Cloud SQL data in BigQuery.

Query to Move Data

We will automate our pipeline to fetch Cloud SQL data every hour.

Therefore, we only need the previous hour’s data from our SQL table, so we have to ensure that our query fetches only that subset of the data from the SQL instance.

federated query example, by Muffaddal

Whenever the above query executes, it fetches everything from the previous hour. posting_timestamp in the WHERE clause is the DateTime column in the table; replace it with your own date column’s name.
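A sketch of such an hourly query, assuming a MySQL instance and placeholder connection and table names:

```sql
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my_sql_connection',
  '''
  SELECT *
  FROM orders
  WHERE posting_timestamp >= NOW() - INTERVAL 1 HOUR;
  '''
);
```

The inner WHERE clause is evaluated by Cloud SQL itself, so only the last hour’s rows are shipped to BigQuery.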

Automate Data Insertion

Now we only have to rerun our federated query every hour and append the result to BigQuery. This is where the Query Scheduler comes into play.

1- In the BigQuery console, paste the query in the query editor, and click the ‘Schedule query’ button to create a new scheduler.

create a new scheduled query, by Muffaddal

2- Next, enter the required values in the scheduler creation menu to create the scheduler.

query scheduler creation menu, by Muffaddal

Don’t forget to select the ‘Append to table’ option in the query scheduler dialog.

query scheduler creation dialog, by Muffaddal

This scheduler will execute every hour, fetching the last hour’s data from Cloud SQL and appending it to the BigQuery table.
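The same scheduled query can also be created from the command line with `bq mk --transfer_config`. A sketch with placeholder dataset and table names; the `query` parameter would hold the full federated query from above:

```shell
# Create an hourly scheduled query that appends to a BigQuery table (placeholders throughout)
bq mk --transfer_config \
  --data_source=scheduled_query \
  --target_dataset=analytics \
  --display_name='Hourly Cloud SQL import' \
  --schedule='every 1 hours' \
  --params='{
    "query": "SELECT * FROM EXTERNAL_QUERY(...)",
    "destination_table_name_template": "orders_from_sql",
    "write_disposition": "WRITE_APPEND"
  }'
```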

And that is it. Your SQL data will be moved to BigQuery automatically on an hourly basis.

Few Things to Keep in Mind

There are a few things to keep in mind while implementing this solution:

1- Several Cloud SQL data types, such as the UUID type, are not supported by BigQuery. In such cases, convert the column to a BigQuery-readable data type within the federated query.
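For example, a MySQL UUID-typed column can be cast to a string inside the inner query so that BigQuery receives a type it understands (connection, table, and column names are placeholders):

```sql
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my_sql_connection',
  -- CAST runs on the Cloud SQL side, before the data reaches BigQuery
  'SELECT CAST(uuid_col AS CHAR) AS uuid_col, name FROM users;'
);
```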

2- Not all Cloud SQL instance regions support federated queries, so you may have to move your SQL instance to a region where federated query support is available. The screenshot below shows support for the Americas region.

Federated Query Region Support, source GCP

See here for more details.
