
MySQL data connector for your data warehouse solution

How to build one and export millions of rows in chunks, stream, capture real-time data changes or extract data and save it to the Cloud

Photo by Rita Seneca on Unsplash

Imagine you are a data engineer who has been tasked with syncing data between one of your MySQL database instances and your data warehouse. This is a very common scenario in which you need to connect one of the most popular data sources to your data warehouse solution, e.g. Snowflake, Redshift or BigQuery.

This article is a detailed summary of how to extract and save data from a relational database (MySQL) without using any third-party apps. It’s a detailed tutorial of a Node.js serverless application built with an AWS Lambda function. This microservice will extract data from a MySQL database and can be run locally on your machine and/or from your AWS account on a schedule. There is also a more advanced example of the same application which creates Node.js streams to extract and save data from MySQL to your AWS S3 data lake.

Outline

You will learn how to:

  • create a simple Node.js app with AWS Lambda.
  • use Node.js streams to optimise memory consumption.
  • extract data and save it locally in CSV and JSON formats.
  • export it into the Cloud Storage.
  • use yaml config for your queries.
  • deploy and schedule it.

The $ npm run test command would export a hundred million rows in chunks to Cloud Storage.

Image by author 💡 Mike Shakhomirov

This tutorial might be useful for data engineers and everyone who works with MySQL databases or wants to learn how to connect arbitrary data sources to data warehouse solutions.

Prerequisites, Libraries and setup

TOOLS

  • Node.js and Node package manager installed
  • Basic understanding of cloud computing (Amazon Web Services account), AWS CLI and AWS SDK
  • Google Cloud Platform or AWS account depending on which Cloud Storage you’re using.
  • Shell (Command line interface) commands and scripting (Advanced).

TECHNIQUES

  • Good knowledge of Node.js (intermediate). You will create a Lambda function.
  • You must understand basic Node.js concepts, i.e. async functions, Node packages and how the code works.
  • basic debugging (consoles, print statements)
  • loops: i.e. for
  • branches: if, if/else, switches
  • Shell commands and scripting as you would want to deploy your Lambda using AWS CLI from command line and be able to test it locally.

How to export data in chunks using MySQL native features

You would want to use [SELECT * INTO 'file.csv'](https://dev.mysql.com/doc/refman/5.7/en/select-into.html) to achieve this:
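
A minimal sketch, assuming a transactions table with an indexed id column; the output path, columns and id range are placeholders, and the path must be writable by the MySQL server (e.g. the secure_file_priv directory):

```sql
-- Export a slice of the table to a CSV file on the database server.
-- Table name, columns, id range and file path are placeholders.
SELECT id, user_id, amount, created_at
INTO OUTFILE '/var/lib/mysql-files/file.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM transactions
WHERE id BETWEEN 1 AND 1000000;
```

Repeat with the next id range to export the following chunk.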

With a bit of tweaking you’ll achieve the desired outcome.

However, this is a manual operation…

Keep in mind that you would want to limit your range in the WHERE clause using an indexed column. Otherwise your database will have a hard time.

Very basic MySQL data connector with Node.js

Skip this part if you are after more advanced examples with Node.js streaming, data transformation and S3 upload.

  • Create a new folder for your Lambda microservice: $ mkdir mysql-connector $ cd mysql-connector $ npm init

  • Use npm to install the mysql package for it: $ npm i mysql

  • Install the run-local-lambda package. We will use it to trigger and test the Lambda locally: $ npm i run-local-lambda

  • In your Lambda root directory create a config file to define the pipeline configuration and the tables to use (see the sample ./config.json after this list).

  • Create an async function in your main application file ./app.js to run the SQL queries from the ./config.json file.
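
A minimal ./config.json sketch; the connection details and the pipe shape are my own naming rather than a fixed schema:

```json
{
  "host": "localhost",
  "user": "root",
  "password": "password",
  "database": "mydatabase",
  "pipes": [
    {
      "name": "simple_select",
      "query": "SELECT * FROM transactions LIMIT 10;"
    }
  ]
}
```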

Your application folder now should look like this:
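
Roughly like this (a sketch; only app.js and config.json are written by hand):

```
mysql-connector/
├── app.js
├── config.json
├── package.json
└── node_modules/
```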

Your main application file ./app.js would look like this:
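
A minimal sketch, assuming the mysql package and the ./config.json shape above; the runQuery, processEvent and handler names are my own:

```javascript
// ./app.js -- a minimal sketch of the connector.
const mysql = require('mysql');
const config = require('./config.json');

// Run a single SQL query and return the rows as a promise.
const runQuery = (connection, sql) =>
  new Promise((resolve, reject) => {
    connection.query(sql, (error, rows) => {
      if (error) {
        return reject(error);
      }
      resolve(rows);
    });
  });

// Main unit of work: connect, run each configured pipe, log the result.
const processEvent = async (event) => {
  const connection = mysql.createConnection({
    host: config.host,
    user: config.user,
    password: config.password,
    database: config.database,
  });

  try {
    for (const pipe of config.pipes) {
      const rows = await runQuery(connection, pipe.query);
      console.log(`${pipe.name}: exported ${rows.length} rows`);
    }
  } finally {
    connection.end();
  }
};

// Lambda entry point.
exports.handler = async (event, context) => {
  try {
    await processEvent(event);
    return { statusCode: 200, body: 'OK' };
  } catch (e) {
    console.error(e);
    return { statusCode: 500, body: 'Error' };
  }
};
```

The npm run test script would typically use run-local-lambda to invoke this handler with a test event; the exact script wiring lives in package.json.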

This is the gist of how to export data from MySQL programmatically. Very simple.

Run $ npm run test in your command line and it will export data from your MySQL database. Make sure your credentials and database host address are correct.

Now you can change the SQL query in your ./config.json file and run it programmatically.

How to use Node.js stream and export data from MySQL

First of all, why use Node.js stream?

Well, if your dataset is larger than your available memory, then you would want to extract data in chunks, like we did in step 1. That’s what streams are for. They can help you optimise your app’s memory usage and export data without loading all of the rows into memory.

If you use AWS Lambda or GCP Cloud Functions, it helps you save money by not over-provisioning the memory resource.

Very often 128 MB of allocated memory is enough to export a few million rows.

So with Node.js stream you can connect MySQL to your data warehouse more efficiently.

The gist of it

Let’s say you want to extract data from MySQL row by row and save it locally as a CSV. The example below would work with npm packages mysql2 and csv-stringify.

Try to create a function and run it with your lambda’s npm run test command.

How to export data from MySQL efficiently with stream and save locally as CSV

All you need is a queryDbAndSave() function. Try to add this async example below into your processEvent() function.
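
A sketch of what such a function might look like, assuming the mysql2 and csv-stringify packages and the same config object as before (the import style below is the csv-stringify v6 one):

```javascript
// queryDbAndSave(): stream rows from MySQL and write them to a local CSV
// file without loading the whole result set into memory.
const fs = require('fs');
const mysql2 = require('mysql2');
const { stringify } = require('csv-stringify'); // csv-stringify v6 import style

const queryDbAndSave = (sql, outputFile) =>
  new Promise((resolve, reject) => {
    const connection = mysql2.createConnection({
      host: config.host,
      user: config.user,
      password: config.password,
      database: config.database,
    });

    connection
      .query(sql)
      .stream()                           // row-by-row readable stream
      .pipe(stringify({ header: true }))  // turn each row object into a CSV line
      .pipe(fs.createWriteStream(outputFile))
      .on('finish', () => {
        connection.end();
        resolve(outputFile);
      })
      .on('error', (error) => {
        connection.end();
        reject(error);
      });
  });
```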

Add this to your processEvent() function like so:
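
For example (the local file naming is arbitrary):

```javascript
// processEvent() now streams each configured pipe into its own local CSV file.
const processEvent = async (event) => {
  for (const pipe of config.pipes) {
    await queryDbAndSave(pipe.query, `./${pipe.name}.csv`);
  }
};
```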

Next in your command line run: $ npm run test.

Got the idea? Cool. Let’s continue to more examples.

How to export data from MySQL and pipe that stream to GCP’s Cloud Storage or AWS S3

This example will save your SQL query results to AWS S3 or GCP Cloud Storage with no need to over-provision memory resources.

In my scenario I would want to stream query results to my S3 data lake bucket and transform JSON into ndJSON (newline-delimited JSON).

In this way I could easily trigger further data ingestion with some other service when a file has been created in Cloud Storage. If you want to learn how to do that, check my tutorial on how to handle data loading into your data warehouse.

How to Handle Data Loading in BigQuery with Serverless Ingest Manager and Node.js

To stream-save query results you would want to add a new branch to your queryDbAndSave() function:
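
A sketch of that branch, shown here as a separate helper for readability; it assumes the aws-sdk and through2 packages listed just below, and the bucket and key are placeholders:

```javascript
// Stream rows straight from MySQL into S3 as newline-delimited JSON.
// mysql2 and config are already required at the top of ./app.js.
const AWS = require('aws-sdk');
const through2 = require('through2');

const s3 = new AWS.S3();

const queryDbAndSaveToS3 = (sql, bucket, key) =>
  new Promise((resolve, reject) => {
    const connection = mysql2.createConnection({
      host: config.host,
      user: config.user,
      password: config.password,
      database: config.database,
    });

    // Turn every row object into one ndJSON line.
    const toNdjson = through2.obj(function (row, enc, callback) {
      callback(null, JSON.stringify(row) + '\n');
    });

    const body = connection.query(sql).stream().pipe(toNdjson);

    // s3.upload() consumes the stream, so rows never pile up in memory.
    s3.upload({ Bucket: bucket, Key: key, Body: body }, (error, data) => {
      connection.end();
      if (error) {
        return reject(error);
      }
      resolve(data.Location);
    });
  });
```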

This example works with the same modules, but you would want to include these ones too:

  • npm i aws-sdk — the SDK that’s used in AWS Lambda environments at the moment of writing this article. Check the current environments here so you could then simply exclude it from your deployment artifacts.
  • npm i through2 — a wrapper library that makes it easier to construct streams.

So now, if you add another MySQL pipe to your ./config.json:
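
Something like this, with the connection details omitted for brevity; the output and bucket keys are my own naming:

```json
{
  "pipes": [
    {
      "name": "simple_select",
      "query": "SELECT * FROM transactions LIMIT 10;"
    },
    {
      "name": "transactions_to_s3",
      "query": "SELECT * FROM transactions;",
      "output": "s3",
      "bucket": "my-datalake-bucket"
    }
  ]
}
```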

Run $ npm run test in your command line.

It will export the data from MySQL and save it as one file in stream mode to your Cloud Storage.

How to export data as a stream and save it in chunks to Cloud Storage

You also might want to save data locally or to AWS S3 in chunks, i.e. in batch mode. The snippet below explains how to do it.

You would want to declare an output file batch size in rows at the top of your ./app.js: const BATCH_SIZE = process.env.BATCH_SIZE || 3000;

Then you would want to evaluate SQL query size. You could use another async function for that:
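
For example, a helper that wraps the pipe’s query in a COUNT; the function name and the wrapping trick are my own:

```javascript
// Evaluate how many rows a query would return before exporting it.
// The wrapping assumes the query carries no trailing semicolon inside the subselect.
const queryDbRowCount = (connection, sql) =>
  new Promise((resolve, reject) => {
    const countSql = `SELECT COUNT(1) AS cnt FROM (${sql.replace(/;\s*$/, '')}) AS t;`;
    connection.query(countSql, (error, rows) => {
      if (error) {
        return reject(error);
      }
      resolve(rows[0].cnt);
    });
  });
```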

Add another branch to your queryDbAndSave() function and export the data in chunks, checking each time whether it’s time to finish:
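
A sketch of that branch, again as a separate helper; BATCH_SIZE is the constant declared above, s3, mysql2 and config come from the earlier examples, and the chunk key naming is my own:

```javascript
// Buffer rows from the stream and flush a file to S3 every BATCH_SIZE rows.
const queryDbAndSaveInChunks = (sql, bucket, keyPrefix) =>
  new Promise((resolve, reject) => {
    const connection = mysql2.createConnection({
      host: config.host,
      user: config.user,
      password: config.password,
      database: config.database,
    });

    let batch = [];
    let chunkNumber = 0;
    const uploads = [];

    // Write the current batch to S3 as one ndJSON object and start a new batch.
    const flush = () => {
      if (batch.length === 0) return;
      const body = batch.map((row) => JSON.stringify(row)).join('\n');
      const key = `${keyPrefix}chunk_${chunkNumber++}.json`;
      uploads.push(s3.upload({ Bucket: bucket, Key: key, Body: body }).promise());
      batch = [];
    };

    connection
      .query(sql)
      .stream()
      .on('data', (row) => {
        batch.push(row);
        if (batch.length >= BATCH_SIZE) flush(); // time to write a chunk
      })
      .on('end', async () => {
        flush(); // the final, possibly smaller, chunk
        connection.end();
        try {
          await Promise.all(uploads);
          resolve(chunkNumber);
        } catch (error) {
          reject(error);
        }
      })
      .on('error', reject);
  });
```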

The final version of the processEvent() function in ./app.js: don’t forget to run $ npm i moment. ./app.js will use it to construct file keys when saving objects.
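
Pulled together, the final processEvent() might look roughly like this; the date-based key layout and the pipe flags (output, chunks) are my own naming:

```javascript
const moment = require('moment');

// Route each configured pipe to the right export branch and use a
// date-based key so files land in a partitioned data lake layout.
const processEvent = async (event) => {
  const fileDate = moment().format('YYYY/MM/DD/HH_mm_ss');

  for (const pipe of config.pipes) {
    const keyPrefix = `mysql/${pipe.name}/${fileDate}/`;

    if (pipe.output === 's3' && pipe.chunks) {
      // Save to Cloud Storage in chunks of BATCH_SIZE rows.
      await queryDbAndSaveInChunks(pipe.query, pipe.bucket, keyPrefix);
    } else if (pipe.output === 's3') {
      // Stream the whole result as one ndJSON object.
      await queryDbAndSaveToS3(pipe.query, pipe.bucket, `${keyPrefix}data.json`);
    } else {
      // Default: save locally as CSV.
      await queryDbAndSave(pipe.query, `./${pipe.name}.csv`);
    }
  }
};
```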

How to use yaml config

The final solution, which can be found in this repository, uses the config npm package and YAML definitions for your MySQL pipes.

I prefer using yaml simply because it is easier to read when you add those long SQL queries.

A sample ./config/staging.yaml would usually look like this:
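
Something along these lines; the keys simply mirror the JSON sketches above:

```yaml
# ./config/staging.yaml -- a sketch; keys mirror the earlier JSON config.
host: localhost
user: root
password: password
database: mydatabase

pipes:
  - name: transactions_to_s3
    output: s3
    bucket: my-datalake-bucket
    query: >
      SELECT *
      FROM transactions
      WHERE created_at >= '2021-01-01';
```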

It is also more intuitive, in my opinion, when you need to separate live and staging environments. In your command line run $ npm i config. So your final application folder, with ./config/ instead of ./config.json, would look like this:
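
Roughly like this (deploy.sh appears in the next section):

```
mysql-connector/
├── app.js
├── config/
│   ├── live.yaml
│   └── staging.yaml
├── deploy.sh
├── package.json
└── node_modules/
```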

How to deploy the solution

There are three ways to do it.

  1. Beginners would probably choose to use the web UI to deploy the solution as either an AWS Lambda or a GCP Cloud Function.
  2. More advanced users would probably want to create a deployment shell script ./deploy.sh like the one sketched after this list. Run $ ./deploy.sh in your command line and it will deploy the Lambda.

  3. Deploy with Infrastructure as Code using either Terraform or AWS CloudFormation. A handy AWS CloudFormation template can be found in this project’s repository on GitHub.
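
A minimal ./deploy.sh sketch; it assumes the Lambda function already exists, and the function name, profile and region are placeholders:

```bash
#!/usr/bin/env bash
# ./deploy.sh -- package the code and update an existing Lambda function.
set -euo pipefail

FUNCTION_NAME="mysql-connector"
PROFILE="default"
REGION="eu-west-1"

# Package the source, excluding dev-only and Lambda-provided dependencies.
rm -f package.zip
zip -r package.zip app.js config/ package.json node_modules/ \
  -x "node_modules/aws-sdk/*" "node_modules/run-local-lambda/*"

# Push the new artifact to the existing Lambda function.
aws lambda update-function-code \
  --function-name "$FUNCTION_NAME" \
  --zip-file fileb://package.zip \
  --profile "$PROFILE" \
  --region "$REGION"
```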

During the deployment make sure you configure all access roles and security groups properly. For example, when your MySQL database is in AWS, your Lambda function must be deployed in the same VPC to be able to access it. Once you enable VPC support in Lambda, your function no longer has access to anything outside your VPC, which includes S3. With S3 specifically you can use VPC endpoints to resolve this.

Conclusion

This is a simple and reliable data export solution which allows you to extract data from a MySQL database programmatically with low memory usage. This way you can create a fast and reliable MySQL data connector with some awesome features:

  • Extract MySQL data programmatically by just running $ npm run test from your command line.
  • Use yaml definitions to describe your MySQL pipes.
  • Perform a dryRun to evaluate the SQL query size.
  • Export data with Node.js streams to avoid memory over provisioning.
  • Save data exports on your local device.
  • Save data to the Cloud as one file or in chunks.
  • One-click deployment and scheduling with Infrastructure as Code.

Check Github repository for more information.

Some real-time integrations might be expensive, and often it depends on the size of the data you have. Just imagine each row inserted into MySQL triggering the Lambda to export and insert it into your data warehouse. There is a better way to monitor and control data ingestion like that.

Just a few days ago I used it to export 56 million rows from MySQL. Then I saved them in chunks to my data lake in AWS S3. For example, I have a data loading manager, and every time a file lands in the data lake it sends the data to my BigQuery data warehouse.

How to Handle Data Loading in BigQuery with Serverless Ingest Manager and Node.js

So exporting my MySQL data and loading it into my data warehouse was quick and easy.

Originally published at https://mydataschool.com.

