
∘ Introduction ∘ Problem Statement ∘ Understanding the Data Pipeline ∘ Understanding the Data ∘ Phase 1 – Setting Up the AWS Environment ∘ Phase 2 – Facilitating AWS Operations with Boto3 ∘ Phase 3 – Setting Up the Airflow Pipeline ∘ Phase 4 – Running the Pipeline ∘ Examining the Results ∘ Advantages of This Data Pipeline ∘ Disadvantages of This Data Pipeline ∘ Conclusion
Introduction
YouTube has become a major medium for exchanging information, thoughts, and ideas, with an average of 3 million videos uploaded each day. With content ranging from somber news stories to upbeat music videos, the platform always has a new topic of conversation ready for its audience.
That being said, with a constant influx of video content, it’s difficult to gauge what types of content attract the attention of the fickle YouTube audience the most.
In general, what types of videos get the most likes and comments? Do YouTube users from different countries favor different types of content? Does interest in specific content categories fluctuate over time?
Answering these types of questions requires a systematic approach toward collecting, processing, and storing YouTube data for subsequent analysis.
Problem Statement
The goal of the project is to create a data pipeline that collects YouTube data for the 5 countries with the most YouTube users: Brazil, India, Indonesia, Mexico, and the United States. This pipeline should collect data on a daily basis in order to keep the information up to date.
The data collected with the pipeline will be used to answer the following questions:
- What are the most popular video categories in each country?
- Does the most popular video category change over time? If so, how?
Developing and running the pipeline will require a large number of steps. Thus, the project will be split into 4 phases:
1. Setting up the AWS environment
2. Facilitating AWS operations with boto3
3. Setting up the Airflow pipeline
4. Running the Airflow pipeline
Understanding the Data Pipeline
Before beginning the project, it’s worth discussing the setup of the pipeline, which can be represented by the following diagram:

There’s a lot going on here, so let’s break it down.
Amazon Web Services (AWS) will provide all of the compute and storage services needed for this project, whereas Apache Airflow will schedule the relevant processes to run daily.
All of the processes carried out within the AWS environment will be facilitated by Python scripts that leverage boto3, the AWS software development kit (SDK) for Python. These scripts are to be scheduled with an Apache Airflow Directed Acyclic Graph (DAG), which is deployed into an EC2 instance.
The DAG stored in the EC2 instance will pull data using the YouTube API and store it in an S3 bucket as csv files. The program will then use Amazon Athena to query the data in that bucket. The results of the query will be stored within the same S3 bucket.
Understanding the Data
Next, understanding the YouTube data provided by the API will provide some insight into what processes it needs to be subject to in the pipeline.
The response of the API call contains information on up to 50 of the most popular videos in a given country at a given time. The raw data comes in a JSON format:
{
  "kind": "youtube#videoListResponse",
  "etag": etag,
  "nextPageToken": string,
  "prevPageToken": string,
  "pageInfo": {
    "totalResults": integer,
    "resultsPerPage": integer
  },
  "items": [
    video Resource
  ]
}
For more information, feel free to visit the YouTube documentation.
In order to query this with SQL later on, this raw data will be transformed into csv format before being stored in AWS. The processed data will also omit features that are not relevant to this use case.
This transformation from JSON data to tabular data is summarized by the following:

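As a rough sketch (not the repository's actual code), the flattening step might look like the following in Python. The helper name videos_to_rows is an assumption, and the output columns mirror the Athena table defined later. Note that the API's snippet.categoryId is a numeric ID, so the real pipeline presumably also maps it to a readable category name via the videoCategories endpoint.
def videos_to_rows(response, country, date_of_extraction):
    """Flatten the 'items' of a videos.list response into rows matching the Athena table.

    'response' is the parsed JSON dict shown above. The 'category' field here holds the
    raw categoryId; mapping it to a readable name would require an extra API call.
    """
    rows = []
    for item in response.get("items", []):
        snippet = item.get("snippet", {})
        stats = item.get("statistics", {})
        rows.append({
            "date_of_extraction": date_of_extraction,
            "country": country,
            "video_id": item.get("id"),
            "video_name": snippet.get("title"),
            "channel_id": snippet.get("channelId"),
            "channel_name": snippet.get("channelTitle"),
            "category": snippet.get("categoryId"),
            "view_count": stats.get("viewCount"),
            "like_count": stats.get("likeCount"),
            "favorite_count": stats.get("favoriteCount"),
            "comment_count": stats.get("commentCount"),
        })
    return rows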
Moreover, an API call only reveals information on a single country. In other words, obtaining data on popular videos in N countries will require N API calls. Since the goal is to extract YouTube data for 5 countries, each iteration will require making 5 API calls.
After storing the data in AWS S3 as a csv file, the data will be queried to determine the most popular video category for each country and date.
The query output will contain the following fields: date, country, video category, and number of videos.

Phase 1 – Setting Up the AWS Environment

Let’s first create/deploy the AWS resources that will be required to store and query the YouTube data.
1. An IAM User
An IAM user will be created just for this project. This user will be granted permission to use Amazon EC2, Amazon S3, and Amazon Athena.
The user account will also be provided with access keys, which will be needed to use these resources with Python scripts.
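These access keys are what boto3 uses to authenticate calls made from the Python scripts. Below is a minimal, hedged example; in practice the keys would be supplied through aws configure or environment variables rather than hard-coded, and the region is an assumption.
import boto3

# Placeholder credentials: in practice these come from ~/.aws/credentials or
# environment variables rather than being written into the script.
session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    region_name="us-east-1",  # assumed region
)

s3 = session.client("s3")          # used to upload the csv files
athena = session.client("athena")  # used to run the queries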
2. EC2 Instance
We will set up a t2.small instance with an Ubuntu AMI. This is where the Airflow pipeline will be deployed.
The instance has a key pair file named "youtube_kp.pem", which will be needed to access the EC2 instance with secure shell (SSH) and copy files to the instance with secure copy protocol (SCP).
In addition, this instance will have an inbound rule added to enable the user to view the Airflow webserver (the user interface).
After connecting to the instance with SSH, the following installations will be made:
sudo apt-get update
sudo apt install python3-pip
sudo apt install python3.10-venv
Next, a virtual environment named youtube, in which the project will run, is set up.
python3 -m venv youtube
After that, a directory named airflow will be added to the instance. This is where all of the required files, including the DAG, will be stored.
mkdir airflow
3. S3 Bucket
The S3 bucket will store the data pulled with the YouTube API as well as the outputs of the queries that are executed. The bucket will be named youtube-data-storage.

There is very little that needs to be done in terms of configuration. The bucket will simply contain two folders, named data and query-output, which will hold the pulled data and the query outputs, respectively.
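If you prefer to create the bucket programmatically instead of through the console, a hedged boto3 sketch might look like this (S3 bucket names are globally unique, so the exact name may need adjusting; the region is an assumption):
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # assumed region

# In us-east-1, create_bucket does not take a LocationConstraint.
s3.create_bucket(Bucket="youtube-data-storage")

# S3 has no real folders; zero-byte objects with a trailing slash act as
# folder placeholders in the console.
s3.put_object(Bucket="youtube-data-storage", Key="data/")
s3.put_object(Bucket="youtube-data-storage", Key="query-output/")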

4. Amazon Athena
Next, Amazon Athena will be configured to query the data stored in the data folder and store the output in the query-output folder. By default, we start with an empty database and no tables.

First, a new database named youtube is created in the query editor.
CREATE DATABASE youtube;
Next, an external table named youtube_videos is created. The table’s fields should match those of the csv files that will be loaded into the S3 bucket, and the statement should specify the location of the csv files that will be queried.
CREATE EXTERNAL TABLE youtube_videos (
  date_of_extraction STRING,
  country STRING,
  video_id STRING,
  video_name STRING,
  channel_id STRING,
  channel_name STRING,
  category STRING,
  view_count INT,
  like_count INT,
  favorite_count INT,
  comment_count INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://youtube-data-storage/data/'
TBLPROPERTIES ('skip.header.line.count'='1');
Finally, the query result location is set to the query-output folder in the youtube-data-storage bucket in the Athena settings.

Phase 2 – Facilitating AWS Operations with Boto3

With the AWS environment set up, we can now focus on developing the scripts that will facilitate the transfer and processing of data.
First, we pull the data from the YouTube API, convert it to a tabular format, and store it in the S3 bucket as a csv file with a function defined in the script pull_youtube_data.py.
Note: It might be worth remembering the names of these Python functions, as it will make it easier to follow along when the Airflow DAG is configured.
The only parameter of this function is region_code, which is the two-letter code that denotes the country of interest.
For instance, pull_youtube_data('US') will return data on the most popular videos in the United States. To obtain information on the most popular videos in all 5 countries, this function will need to be run 5 times.
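The actual implementation lives in the repository; the sketch below is only a hedged approximation of what pull_youtube_data could look like, calling the videos.list endpoint with the requests library, reusing the videos_to_rows helper sketched earlier, and uploading the result to the data folder. The API key placeholder, object key pattern, and bucket region are assumptions.
import csv
import io
from datetime import date

import boto3
import requests

API_KEY = "YOUR_YOUTUBE_API_KEY"  # placeholder
BUCKET = "youtube-data-storage"


def pull_youtube_data(region_code):
    """Pull the most popular videos for one country and store them in S3 as a csv file."""
    response = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={
            "part": "snippet,statistics",
            "chart": "mostPopular",
            "regionCode": region_code,
            "maxResults": 50,
            "key": API_KEY,
        },
        timeout=30,
    ).json()

    today = date.today().isoformat()
    rows = videos_to_rows(response, region_code, today)  # helper sketched earlier

    # Write the rows to an in-memory csv and upload it to the data/ folder.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"data/youtube_{region_code}_{today}.csv",  # assumed naming scheme
        Body=buffer.getvalue(),
    )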
Next, we write a function that uses boto3 to analyze the data in the S3 bucket with Amazon Athena. This entails running an SQL query against the data folder and storing the result in the query-output folder.
These steps are carried out in the function named run_query.
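Again, only a hedged sketch: run_query could use boto3's Athena client roughly as follows, with an illustrative GROUP BY query that produces the date/country/category/video-count output described earlier. The exact SQL in the repository may differ.
import boto3


def run_query():
    """Run the aggregation query in Athena and write the result to the query-output folder."""
    athena = boto3.client("athena")

    # Illustrative query: count videos per category for each country and extraction date.
    query = """
        SELECT date_of_extraction, country, category, COUNT(*) AS video_count
        FROM youtube_videos
        GROUP BY date_of_extraction, country, category
        ORDER BY date_of_extraction, country, video_count DESC
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "youtube"},
        ResultConfiguration={
            "OutputLocation": "s3://youtube-data-storage/query-output/"
        },
    )
Note that start_query_execution is asynchronous: Athena writes the result csv to the configured output location once the query completes.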
Phase 3 – Setting Up the Airflow Pipeline

After creating the Python functions, we can build the Airflow pipeline and schedule the functions to run on a daily basis. There are a number of steps in this phase, so brace yourself.
1. Write the DAG script
First, we write a Python script named youtube_dag.py that instantiates the DAG, defines the tasks, and establishes the dependencies between them. In short, the DAG is configured to execute the pull_youtube_data function 5 times (once for each country) before running the run_query function, and the DAG runs are scheduled on a daily basis. An illustrative sketch of such a script is shown below.
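Here is a hedged, illustrative version of such a DAG; the module paths, dag_id, start date, and task IDs are assumptions, and the repository's youtube_dag.py is the authoritative version.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed module layout: the functions from Phase 2 sit next to the DAG file.
from pull_youtube_data import pull_youtube_data
from run_query import run_query

COUNTRIES = ["BR", "ID", "IN", "MX", "US"]  # Brazil, Indonesia, India, Mexico, US

with DAG(
    dag_id="youtube_dag",
    start_date=datetime(2023, 1, 1),  # assumed start date
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # One extraction task per country.
    pull_tasks = [
        PythonOperator(
            task_id=f"pull_youtube_data_{code}",
            python_callable=pull_youtube_data,
            op_args=[code],
        )
        for code in COUNTRIES
    ]

    # The Athena query runs only after every pull task has finished.
    query_task = PythonOperator(
        task_id="run_query",
        python_callable=run_query,
    )

    pull_tasks >> query_task
The list-to-task dependency (pull_tasks >> query_task) is what makes the query task wait for all five pull tasks.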
Now, we can deploy the Airflow DAG in the created EC2 instance.
2. Access the EC2 instance with SSH
ssh -i "youtube_kp.pem" ubuntu@<instance_public_dns>
3. Enter the created virtual environment
source youtube/bin/activate
4. Install the required dependencies (including Apache Airflow)
pip install -r requirements.txt
python3 -m pip install cryptography==38.0.4
5. Specify the location of the airflow home directory
On Ubuntu, this can be done with the following command:
sudo vim /etc/environment
To set the Airflow home directory to the created airflow folder, add the following line to the file (then reconnect to the instance so the new environment variable takes effect):
AIRFLOW_HOME='/home/ubuntu/airflow'
6. Initialize an Airflow database
airflow db init
7. Store the YouTube DAG (and other dependencies) in the required location
The airflow.cfg file indicates the directory (the dags_folder setting) that the DAG must be placed in to be detected.

The youtube_dag.py file, along with all other Python files, is copied into this directory from the local machine using secure copy (SCP). The command for executing the copy has the following format:
scp -i </path/key-pair-name.pem> </path/file_to_copy> <username>@<instance_public_dns>:<destination_path>
After executing the SCP commands, the dags directory should have all the necessary files.

Finally, Airflow should have access to the created DAG. To confirm this, we can use the command:
airflow dags list

Phase 4 – Running the Pipeline
Finally, we have set up the AWS resources, the Python scripts, and the Airflow DAG! The data pipeline is now ready to run!
We can look at the workflow using the Airflow webserver.
airflow webserver -p 8080
Although the DAG is scheduled to run automatically, we can trigger it manually for the first iteration by running the following in the command line interface:
airflow dags trigger <dag_id>
Alternatively, it can be triggered to run by clicking the "play" button in the webserver.

For a better grasp of the constituents of the workflow, we can view the DAG’s graph:

As shown in the graph, the YouTube data containing the most popular videos for Brazil, Indonesia, India, Mexico, and the United States are first pulled with the YouTube API and stored in the S3 bucket. After the data for all countries is ingested, AWS Athena runs the pre-defined query to determine the most popular categories for each country on each given day.
Furthermore, the green edges around all of the tasks signify that the tasks have been run successfully.
Note: There is no need to manually trigger the DAG every time; it will run on the pre-defined schedule, which is daily in this case.
Examining the Results
Let’s see what the stored csv files and the query output look like after the pipeline is run twice (i.e., two dates for each country).

We can see 10 csv files (1 for each country and date) in the data folder of the youtube-data-storage bucket.

The query-output directory has a single csv file that contains the query output. Its contents are shown below:

While there are many video categories within the YouTube platform, it seems that the most popular videos in Brazil, Indonesia, India, Mexico, and the United States fall into only 4 categories: "Music", "Entertainment", "People & Blogs", and "Sports".
While this is an interesting finding, there is very little that can be inferred from it since it only comprises 2 days’ worth of information. However, after running the Airflow DAG repeatedly, the data will adequately capture how each country’s audience preferences change over time.
Advantages of This Data Pipeline
Collecting YouTube data in such a pipeline is beneficial for a few reasons.
1. Harnessing cloud technologies: AWS provides us with the means to securely store our data and respond to any sudden changes in demand, all at an inexpensive rate.
2. Automation: With the deployed Airflow pipeline, YouTube data will be pulled, transformed, stored, and queried without any manual effort.
3. Logging: Airflow’s webserver allows us to access the log and examine all of the DAG runs as well as the current status of all tasks.
4. Modifiable: Elements can be added to or removed from the AWS environment without hampering the operations in the current pipeline. Users can easily add new data sources, write new queries, or launch other AWS resources if needed.
Disadvantages of This Data Pipeline
That being said, the constructed pipeline does have limitations.
1. Lack of notifications: Currently, the data pipeline doesn’t incorporate a notification service that alerts users if a task in the Airflow DAG doesn’t run as expected. This can delay response time.
2. Lack of a data warehouse: The pipeline currently generates reports by directly querying the S3 bucket, which serves as the data lake. This is feasible with the current circumstances, but if other data sources were to be added, the pipeline would lack the tools needed to efficiently perform complex joins or aggregation needed for subsequent analyses.
Conclusion

Ultimately, tracking the most popular YouTube content in different regions over time is a massive endeavor, requiring continuous collection of data in a form suited to analysis. Although the operations needed to obtain this data can be executed manually with on-premise infrastructure, such an approach becomes increasingly inefficient and ultimately infeasible as the volume of data grows and the workflow gets more complex.
Thus, for this use case, it’s worth leveraging a cloud platform and a job scheduler to ensure that the ingestion and processing of data can be automated.
With AWS, we can create a scalable and cost-effective solution that collects, transforms, stores, and processes YouTube data. With Apache Airflow, we have the means to monitor complex workflows and debug them when necessary.
For the source code used in this project, please visit the GitHub repository:
GitHub – anair123/Building-a-Youtube-Data-Pipeline-With-AWS-and-Airflow
Thank you for reading!