Table of Contents
∘ Introduction ∘ Problem Statement ∘ Data ∘ AWS Architecture ∘ Data Storage with AWS S3 ∘ Designing the Schema ∘ ETL with AWS Glue ∘ Data Warehousing with AWS Redshift ∘ Extracting Insights with AWS Redshift ∘ Visualizing data with Power BI ∘ Future Steps ∘ Conclusion ∘ References
Introduction
Air travel has become an integral part of our lives. It is a means for businesses to network and conduct commerce, and for families to visit loved ones or explore new places.
Despite its influence, the aviation industry is known for facing turbulence. It is subject to continuous change due to external factors like economic booms and busts, climate change, the COVID-19 pandemic, and a push to rely more on renewable energy sources.
To be cognizant of such changes and their impact on air travel, it is worth tracking these flights over time. Such an endeavor requires a robust strategy for data warehousing, data analysis, and data visualization.
Problem Statement
This project has 2 primary objectives. The first is to utilize the resources provided by Amazon Web Services (AWS) to build a data pipeline that facilitates the storage, transformation, and analysis of U.S. flight data.
The second is to build a visualization tool with Power BI that can effectively illustrate the key findings from the data.
Data
The dataset used for this project is obtained from the Bureau of Transportation Statistics. It primarily reports the total number of flights, delays, and cancellations across U.S. airports and carriers from 2003 to 2023.
The following is a preview of the dataset:
At a glance, there are a few issues with the raw dataset.
Firstly, the `airport_name` field comprises multiple pieces of information: it presents not only the name of the airport, but also its city and state. To make this information easily accessible, the field will have to be split into 3 separate fields.
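For illustration, here is a minimal pandas sketch of that split, assuming the field follows the BTS "City, ST: Airport Name" pattern (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("airline_delay_cause.csv")  # hypothetical file name

# e.g. "Atlanta, GA: Hartsfield-Jackson Atlanta International"
# -> city="Atlanta", state="GA", name="Hartsfield-Jackson Atlanta International"
location = df["airport_name"].str.split(": ", n=1, expand=True)
city_state = location[0].str.split(", ", n=1, expand=True)

df["airport_city"] = city_state[0]
df["airport_state"] = city_state[1]
df["airport_name"] = location[1]
```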
Secondly, this data currently adopts a flat model (i.e., 1 table). However, this is not the optimal setup since the data contains multiple entities with relationships.
These shortcomings will have to be addressed before the data can be used for analysis or visualization.
AWS Architecture
Let’s discuss the AWS architecture needed to build the data pipeline.
The required resources are best illustrated with the following diagram:
The cloud solution uses Amazon S3 to store the raw data as well as the transformed data, AWS Glue to create an ETL job that facilitates the transformation of the data, and AWS Redshift to create the cloud data warehouse that enables users to extract insights from the data with SQL.
Finally, Power BI is used to display the key metrics provided by the data in the form of a dashboard.
Data Storage with AWS S3
The project utilizes two S3 buckets: `flights-data-raw` and `flights-data-processed`.
The `flights-data-raw` bucket contains the raw dataset.
The `flights-data-processed` bucket will contain the data after it has been transformed (it is currently empty).
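For reference, the two buckets can also be created programmatically rather than through the console; here is a minimal boto3 sketch, assuming AWS credentials are already configured (the region is an assumption):

```python
import boto3

# Create the two S3 buckets used by the pipeline
# (us-east-1 is an assumption; other regions require a LocationConstraint)
s3 = boto3.client("s3", region_name="us-east-1")

for bucket in ("flights-data-raw", "flights-data-processed"):
    s3.create_bucket(Bucket=bucket)
```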
Designing the Schema
Next, it is important to determine a suitable schema for this data. The raw data is stored in a flat file, which contains a single table:
Unfortunately, this schema only has a single table, which comprises multiple entities, such as date, airport, and carrier. To optimize the database for faster retrieval of data, this flat schema can be converted into a star schema with the use of dimensional modeling:
In this new schema, the `flights` table serves as the fact table, while the `date`, `carrier`, and `airport` tables serve as the dimension tables.
ETL with AWS Glue
An ETL job created with AWS Glue can transform the raw data into the fact and dimension tables and load them into the `flights-data-processed` bucket.
The ETL job uses an imported Python script to carry out the dimensional modeling.
The script uses boto3, the AWS SDK for Python, to extract the raw dataset from the `flights-data-raw` bucket, create the 4 tables in the star schema, and load them into the `flights-data-processed` bucket as CSV files.
For instance, a snippet along the following lines can be used to create the `carrier` table (the object key and column names shown here are assumptions; the exact code is in the GitHub repository).
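```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Extract the raw dataset from the flights-data-raw bucket
# (the object key and column names are assumptions)
obj = s3.get_object(Bucket="flights-data-raw", Key="airline_delay_cause.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Build the carrier dimension: one row per unique carrier,
# with a surrogate key for joins from the fact table
carrier = df[["carrier", "carrier_name"]].drop_duplicates().reset_index(drop=True)
carrier.insert(0, "carrier_id", carrier.index + 1)

# Load the dimension table into the flights-data-processed bucket
buffer = io.StringIO()
carrier.to_csv(buffer, index=False)
s3.put_object(Bucket="flights-data-processed", Key="carrier.csv", Body=buffer.getvalue())
```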
The entire script used to create the 4 tables in the schema can be accessed in the GitHub repository.
The ETL job runs without any issues:
The dataset has been converted into 1 fact table and 3 dimension tables in the form of CSV files, which are all stored in the `flights-data-processed` bucket.
Data Warehousing with AWS Redshift
With AWS Glue, the data that was initially in a flat model can now be represented with a more fitting star schema in a data warehouse.
The cloud data warehouse for this data will be created with AWS Redshift Serverless. This entails creating a namespace named `flights-namespace` as well as a database named `dev`. In addition, it requires a workgroup named `flights-workgroup`, which will be used to write SQL queries.
Note: The workgroup has been configured to allow devices outside of the VPC to access the database. This will be useful when creating the visualization with Power BI.
Now, we can open the query editor in Redshift and start creating the fact and dimension tables in the `dev` database.
First, the 4 tables in the schema need to be created in the warehouse with CREATE TABLE commands along the following lines (the column names and types shown are assumptions based on the raw dataset; the exact DDL is in the GitHub repository):
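```sql
-- Sketch of the DDL; column names and types are assumptions
CREATE TABLE date (
    date_id INT PRIMARY KEY,
    year    INT,
    month   INT
);

CREATE TABLE carrier (
    carrier_id   INT PRIMARY KEY,
    carrier      VARCHAR(10),
    carrier_name VARCHAR(100)
);

CREATE TABLE airport (
    airport_id    INT PRIMARY KEY,
    airport       VARCHAR(10),
    airport_name  VARCHAR(150),
    airport_city  VARCHAR(100),
    airport_state VARCHAR(50)
);

CREATE TABLE flights (
    date_id       INT REFERENCES date (date_id),
    carrier_id    INT REFERENCES carrier (carrier_id),
    airport_id    INT REFERENCES airport (airport_id),
    arr_flights   INT,
    arr_del15     INT,
    arr_cancelled INT,
    arr_diverted  INT
);
```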
The four tables are now in the data warehouse, but they are all empty since the data is still in the `flights-data-processed` bucket.
The data can be copied into this data warehouse using the `COPY` command.
For instance, the data in `flights.csv` can be copied into the `flights` table using syntax along the following lines (the IAM role ARN is a placeholder):
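```sql
COPY flights
FROM 's3://flights-data-processed/flights.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<workgroup-iam-role>'  -- placeholder
FORMAT AS CSV
IGNOREHEADER 1;
```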
Note: the `IAM_ROLE` parameter should be assigned whichever IAM role was selected when creating the workgroup.
By executing the `COPY` command for each of the CSV files in the `flights-data-processed` bucket, the 4 tables should be filled with the necessary data.
As an example, here is a preview of the `airport` table:
Extracting Insights with AWS Redshift
Now that all tables are loaded with data, we can perform analyses with SQL queries!
Since the data has previously been transformed into a star schema with dimensional modeling, queries can retrieve data efficiently with little run time, making this setup ideal for ad hoc analyses.
Here are some examples of questions that can be answered with SQL queries (a sample query for the first appears after the list).
1. Which airports had the most flights in 2022?
2. What type of delay contributed the most to the total delay from 2019 onwards?
3. At the John F. Kennedy airport, what is the percent change in delays for each year?
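As an illustration, the first question could be answered with a query along these lines (using the assumed schema above):

```sql
-- Top 10 airports by total flights in 2022
SELECT a.airport_name,
       SUM(f.arr_flights) AS total_flights
FROM flights f
JOIN airport a ON f.airport_id = a.airport_id
JOIN date d ON f.date_id = d.date_id
WHERE d.year = 2022
GROUP BY a.airport_name
ORDER BY total_flights DESC
LIMIT 10;
```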
Visualizing data with Power BI
The current cloud data warehouse enables users to answer key questions with little time and cost.
However, we can take this a step further by creating a visualization that end-users can use to answer similar questions.
One way to achieve this is to create a dashboard with Power BI, a highly popular BI tool.
Although there are many metrics that can be unearthed through visualizations, the dashboard will focus on examining the following:
- a summary of the number of flights, delays, and cancellations
- tracking the number of flights, delays, and cancellations over time
- identifying the most used airports and carriers
- a breakdown of the different types of delays
In addition, the dashboard will include filters that allow users to target specific times and locations.
Altogether, these features can be combined in the form of the following dashboard:
Using such a tool, users without access to the data or without knowledge of SQL can answer key questions with little effort.
Such questions can include:
1. Which carrier sees the most flights at the JFK airport?
2. How many flights have been canceled in California from 2019 to 2022?
3. What type of delay contributes the most to the total delay for American Airlines?
Future Steps
The current setup in AWS and Power BI promotes quick and inexpensive data analysis and visualization. However, it’s worth considering new applications for the data in the future.
1. Incorporating new data sources
If new data sources are to be included, the schema will have to be modified accordingly. Moreover, additional ETL jobs will have to be created to seamlessly integrate the data from these sources into the existing data warehouse.
2. Performing time series analysis
The data provided by BTS is a time series. Thus, there is merit in considering the use of time series analysis and building forecasting models to predict the demand for air travel in the future.
Conclusion
Datasets with copious records, such as the one provided by the BTS, can be difficult to manage. However, with the resources provided by AWS, one can build a data pipeline that processes the data and structures it in a form that enables users to extract insights in a cost-effective manner.
Furthermore, visualizations like the Power BI dashboard created here are an effective way to contextualize key metrics and tell an impactful story to the audience.
For the code used to build the ETL job in AWS Glue or the SQL queries used to create the tables and carry out the analysis, please visit the GitHub repository:
GitHub – anair123/Tracking-U.S.-Flights-With-AWS-and-Power-BI
Thank you for reading!
References
- Airline On-Time Statistics and Delay Causes. BTS. (n.d.). https://www.transtats.bts.gov/ot_delay/OT_DelayCause1.asp?20=E