AWS Glue 101: All you need to know with a full walk-through

What is Glue? A Full ETL Pipeline Explained

Kevin Bok
Towards Data Science


Ever wondered how major big tech companies design their production ETL pipelines? Interested in knowing how terabytes (or even zettabytes) of data are seamlessly captured and efficiently parsed into a database or other storage so that data scientists and data analysts can use it easily?

In this post, I will explain in detail (with graphical representations!) the design and implementation of an ETL process using AWS services (Glue, S3, Redshift). Anyone without prior exposure to AWS Glue or the AWS stack (or even deep development experience) should be able to follow along easily. The walkthrough in this post should serve as a good starting guide for anyone interested in using AWS Glue.

Before we dive into the walkthrough, let’s briefly answer three commonly asked questions:

What actually is AWS Glue?

What are the features and advantages of using Glue?

What is the real-world scenario?

What is AWS Glue?

So what is Glue? AWS Glue is simply a serverless ETL tool. ETL refers to three processes that are commonly needed in most data analytics and machine learning workflows: extraction, transformation, and loading. We extract data from a source, transform it into the right shape for our applications, and then load it into a data warehouse. And AWS helps us make the magic happen: the AWS console UI offers straightforward ways to perform the whole task end to end, with no extra code scripts needed.

Components of AWS Glue

  • Data catalog: The data catalog holds the metadata and the structure of the data.
  • Database: A database is used to create or access the catalog entries for the sources and targets.
  • Table: One or more tables in the database that can be used by the source and target.
  • Crawler and Classifier: A crawler retrieves data from the source using built-in or custom classifiers. It creates/uses metadata tables that are pre-defined in the data catalog.
  • Job: A job is the business logic that carries out an ETL task. Internally, it runs as Apache Spark code written in Python or Scala.
  • Trigger: A trigger starts an ETL job execution on demand or at a specific time.
  • Development endpoint: A development endpoint provides an environment where the ETL job script can be developed, tested, and debugged.
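To make these components concrete, here is a minimal boto3 sketch (not part of the original walkthrough) that lists what a Data Catalog currently holds; the region is an assumption.

```python
# Hypothetical sketch: inspect the main Glue components via the AWS SDK.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Databases and their tables live in the Data Catalog.
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])
    for table in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
        print("  Table:", table["Name"])

# Crawlers, jobs, and triggers are the other core components.
print("Crawlers:", [c["Name"] for c in glue.get_crawlers()["Crawlers"]])
print("Jobs:",     [j["Name"] for j in glue.get_jobs()["Jobs"]])
print("Triggers:", [t["Name"] for t in glue.get_triggers()["Triggers"]])
```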

Why use AWS Glue?

How does Glue benefit us? Here are some of the advantages of using it in your own workspace or in your organization:

  • AWS Glue scans through all the available data with a crawler.
  • Final processed data can be stored in many different places (Amazon RDS, Amazon Redshift, Amazon S3, etc.).
  • It’s a cloud service. No money needs to be spent on on-premises infrastructure.
  • It’s a cost-effective option, as it’s a serverless ETL service.
  • It’s fast. It gives you the Python/Scala ETL code right off the bat.

A Production Use-Case of AWS Glue

Here is a practical example of using AWS Glue.

A piece of game software produces a few MB or GB of user-play data daily. The server that collects the user-generated data from the software pushes the data to AWS S3 once every 6 hours (a JDBC connection can be used to connect data sources and targets such as Amazon S3, Amazon RDS, Amazon Redshift, or any external database).

We, the company, want to predict the length of play given the user profile. To perform this task, the data engineering team has to collect all the raw data and pre-process it in the right way. Glue offers a Python SDK with which we can create a new Glue job script that streamlines the ETL. The code runs on top of Spark (a distributed system that makes processing faster), which AWS Glue configures automatically. Thanks to Spark, the data is divided into small chunks and processed in parallel on multiple machines simultaneously.

Extract: the script reads all the usage data from the S3 bucket into a single data frame (you can think of a DataFrame as in pandas).

Transform: let’s say the original data contains 10 different log entries per second on average. The analytics team wants the data aggregated into 1-minute windows with specific logic.

Load: the processed data is written back to another S3 bucket for the analytics team.
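To make this Extract/Transform/Load flow concrete, here is a minimal PySpark sketch of the same idea. The bucket paths and column names (user_id, event_time, play_seconds) are hypothetical placeholders, not taken from the original scenario.

```python
# Minimal PySpark sketch of the ETL described above; names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extract: read all raw usage data from the source bucket into one DataFrame.
raw = spark.read.csv("s3://my-raw-usage-bucket/", header=True, inferSchema=True)

# Transform: roll roughly 10 logs per second up into one row per user per 1-minute window.
per_minute = (
    raw.groupBy("user_id", F.window("event_time", "1 minute"))
       .agg(F.sum("play_seconds").alias("play_seconds"),
            F.count("*").alias("events"))
)

# Load: write the aggregated result to the analytics team's bucket.
per_minute.write.mode("overwrite").parquet("s3://my-analytics-bucket/usage_per_minute/")
```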

Project walkthrough

For the scope of the project, we will use a sample CSV file from the Telecom Churn dataset. The data contains 20 different columns, and the objective is binary classification: predicting whether each person will stop subscribing to the telecom service based on the information about that person. The description of the data and the dataset used in this demonstration can be downloaded by clicking this Kaggle Link.

1. Create an IAM role to access AWS Glue + EC2 + CloudWatch + S3


You need an appropriate role to access the different services you are going to be using in this process. An IAM role is similar to an IAM user, in that it is an AWS identity with permission policies that determine what the identity can and cannot do in AWS. When you assume a role, it provides you with temporary security credentials for your role session. You can find out more about IAM roles here.

  • Open the Amazon IAM console.
  • Click on Roles -> Create role.
  • Choose the Glue service from “Choose the service that will use this role”.
  • Choose Glue from the “Select your use case” section.
  • Select “AWSGlueServiceRole” from the Attach permissions policies section.
  • Click on Next: Tags. Leave the Add tags section blank and create the role.
  • Your role now has the access to AWS Glue that it needs. (The same steps can be scripted, as sketched below.)
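For reference, here is a hedged boto3 sketch of the same role setup done programmatically; the role name is an assumption, and the attached policy is the AWS-managed AWSGlueServiceRole policy (you may also need to grant S3 access for your buckets).

```python
# Hypothetical sketch: create a Glue service role with boto3 instead of the console.
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="MyGlueServiceRole",  # name is an assumption
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS-managed Glue service policy; add S3 permissions as needed.
iam.attach_role_policy(
    RoleName="MyGlueServiceRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```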

2. Upload source CSV files to Amazon S3

  • On the Amazon S3 console, click on Create bucket to create a bucket where you can store files and folders.
  • Enter a bucket name, select a Region, and click on Next.
  • The remaining configuration settings can stay empty for now. Click Next to create your S3 bucket.
  • Create a new folder in your bucket and upload the source CSV files.
  • (Optional) Before loading data into the bucket, you can compress the data into a different format (e.g. Parquet) using one of several Python libraries, as sketched below.
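Here is a minimal sketch of that optional compression-and-upload step, assuming pandas and pyarrow are installed; the file and bucket names are hypothetical.

```python
# Hypothetical sketch: convert the source CSV to Parquet and upload it to S3.
import boto3
import pandas as pd  # requires pandas and pyarrow

# Convert the CSV to the smaller, columnar Parquet format.
df = pd.read_csv("telecom_churn.csv")
df.to_parquet("telecom_churn.parquet", index=False)

# Upload the file into a folder (prefix) inside the bucket.
s3 = boto3.client("s3")
s3.upload_file("telecom_churn.parquet", "my-glue-demo-bucket", "read/telecom_churn.parquet")
```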

3. Start the AWS Glue Database


In order to add data to the Glue data catalog, which holds the metadata and the structure of the data, we need to define a Glue database as a logical container.

So we need to initialize the Glue database.
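This can be done in the Glue console (Databases -> Add database), or programmatically; here is a minimal boto3 sketch, assuming the database is named sampledb (the name used later in this walkthrough).

```python
# Minimal sketch: create the Glue catalog database used in this walkthrough.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption
glue.create_database(
    DatabaseInput={"Name": "sampledb",
                   "Description": "Catalog database for the churn demo"}
)
```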

4. Create and Run Glue Crawlers

As we have our Glue database ready, we need to feed our data into it. So what we are trying to do is this: we will create a crawler that scans all the available data in the specified S3 bucket. The crawler automatically identifies the most common formats, including CSV, JSON, and Parquet, using built-in classifiers.

  • On the left pane in the AWS Glue console, click on Crawlers, then click the blue Add crawler button.
  • Give the crawler a name, and leave “Specify crawler type” as the default.
  • In Data store, choose S3 and select the bucket you created. Drill down to select the read folder.
  • In IAM role, choose the role you created above.
  • Leave the Frequency on “Run on demand” for now. You can always schedule the crawler at an interval of your choice later.
  • In Output, specify the Glue database you created above (sampledb).
  • A Glue crawler that reads all the files in the specified S3 bucket is then generated.
  • Select the checkbox and run the crawler by clicking Run crawler.
  • Once it’s done, you should see its status as ‘Stopping’, and ‘Last runtime’ and ‘Tables added’ are populated.
  • Then, Databases -> Tables on the left pane lets you verify that the tables were created automatically by the crawler. (The same steps can be scripted with boto3, as sketched below.)
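For completeness, here is a hedged boto3 equivalent of these crawler steps; the crawler name, role name, and S3 path are assumptions.

```python
# Hypothetical sketch: create and run the crawler programmatically.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="churn-crawler",                 # name is an assumption
    Role="MyGlueServiceRole",             # the IAM role created earlier
    DatabaseName="sampledb",              # the Glue database created earlier
    Targets={"S3Targets": [{"Path": "s3://my-glue-demo-bucket/read/"}]},
)

glue.start_crawler(Name="churn-crawler")

# After the crawler finishes (its State returns to READY), list the tables it added.
if glue.get_crawler(Name="churn-crawler")["Crawler"]["State"] == "READY":
    print([t["Name"] for t in glue.get_tables(DatabaseName="sampledb")["TableList"]])
```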

5. Define Glue Jobs

With the final tables in place, we now create Glue jobs, which can be run on a schedule, on a trigger, or on demand. The interesting thing about creating Glue jobs is that it can actually be an almost entirely GUI-based activity, with just a few button clicks needed to auto-generate the necessary Python code. However, I will make a few edits in order to synthesize multiple source files and perform in-place data quality validation. By default, Glue uses DynamicFrame objects to contain relational data tables, and they can easily be converted back and forth to PySpark DataFrames for custom transforms.

Note that at this step, you have the option to spin up another database (e.g. Amazon Redshift) to hold the final data tables if the data produced by the crawler gets big. For the scope of this project, we skip this and will put the processed data tables directly back into another S3 bucket.
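To give a sense of what the edited script can look like, here is a hedged sketch of a Glue job that reads the crawled table, converts between a DynamicFrame and a PySpark DataFrame for a custom transform, and writes the result back to S3. The database, table, and bucket names are assumptions based on this walkthrough, and the transform shown (a simple dropna) is only a placeholder for your own business logic.

```python
# Hypothetical sketch of an edited Glue job script (PySpark, Glue 1.0 / Spark 2.4).
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table created by the crawler as a DynamicFrame.
dyf = glue_context.create_dynamic_frame.from_catalog(database="sampledb",
                                                     table_name="read")

# Convert to a PySpark DataFrame for custom transforms / data quality checks.
df = dyf.toDF().dropna()   # placeholder transform

# Convert back and write the processed data to the target S3 bucket as Parquet.
out = DynamicFrame.fromDF(df, glue_context, "out")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-glue-demo-bucket/processed/"},
    format="parquet",
)

job.commit()
```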

  • In the left pane, click on Jobs, then click on Add job.
  • Give the job a name, and then select the IAM role previously created for AWS Glue.
  • Select Spark for the Type and Spark 2.4, Python 3 for the Glue version.
  • You can edit the number of DPUs (data processing units) in the Maximum capacity field under Security configuration, script libraries, and job parameters (optional).
  • The remaining configuration is optional.
  • Choose a data source table from the Choose a data source section. You can choose only a single data source.
  • Add a JDBC connection to Amazon Redshift if you want to store the final processed data there, or choose your existing database if you have one. Alternatively, you can write the results back to another S3 bucket. For this tutorial, we go ahead with the default mapping; the mapping can also be modified later in the business logic.
  • Open the Python script by selecting the recently created job name, then click on Action -> Edit Script.
  • The left pane shows a visual representation of the ETL process. The right-hand pane shows the script code, and just below that you can see the logs of the running job.
  • Save and execute the job by clicking on Run job.
  • After running the script, the run history appears and the final data is populated in S3 (or is ready for SQL queries if Redshift were the final data store). (The job can also be created and run programmatically, as sketched below.)
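If you prefer to define and run the job programmatically rather than through the console, here is a hedged boto3 sketch; the job name, script location, and capacity are assumptions (Glue version 1.0 corresponds to the Spark 2.4 / Python 3 option selected above).

```python
# Hypothetical sketch: create and run the Glue job with boto3.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="churn-etl-job",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-glue-demo-bucket/scripts/churn_etl.py",
             "PythonVersion": "3"},
    GlueVersion="1.0",   # Spark 2.4, Python 3
    MaxCapacity=2.0,     # number of DPUs
)

run = glue.start_job_run(JobName="churn-etl-job")
state = glue.get_job_run(JobName="churn-etl-job", RunId=run["JobRunId"])["JobRun"]["JobRunState"]
print("Job run state:", state)
```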

Conclusion

To summarize, we’ve built one full ETL process: we created an S3 bucket, uploaded our raw data to the bucket, started the Glue database, added a crawler that browses the data in that S3 bucket, created a Glue job (which can be run on a schedule, on a trigger, or on demand), and finally wrote the processed data back to the S3 bucket.

Additional work could be done to revise the Python script provided at the Glue job stage, based on business needs (e.g. improving the pre-processing to scale the numeric variables).

Overall, the structure above will get you started on setting up an ETL pipeline in any business production environment.

Additional Resources

For more details on other data science topics, the GitHub repositories below will also be helpful. They contain easy-to-follow code, with explanations, to get you started.

  • AWS SageMaker in Production
    End-to-end examples that show how to solve business problems using Amazon SageMaker and its ML/DL algorithms.
  • PySpark
    Functions and utilities with real-world data examples. Can be used to build a complete ETL process for data modeling.
  • Recommendation System
    Production-level implementations of recommender systems in PyTorch. Clone the repo and start training by running ‘main.py’.
  • Natural Language Processing (NLP)
    Full implementation examples of several Natural Language Processing methods in Python, ordered by level of complexity.

About the Author

HyunJoon is a data geek with a degree in Statistics. He enjoys sharing data science/analytics knowledge. Message him on LinkedIn to connect.

