This is the 4th article in a larger series on Full Stack Data Science (FSDS). In the last post, I shared a concrete example of how to build data pipelines for machine learning projects. One limitation of that example, however, was that the data pipeline had to be run manually. While this might be fine for some applications, more often than not, it’s better to automate this process than to delegate it to a person. In this article, I will walk through a simple way to do this using Python and GitHub Actions.

A few years ago, Andrej Karpathy gave a talk describing "Operation Vacation" [1]. This was what Tesla’s full self-driving engineering team called their goal of completely automating the improvement of the self-driving model.
Although this goal was somewhat facetious, it illustrates an aspiration I’ve seen in most data scientists and engineers: the desire to build a system that can operate autonomously (so they can go on vacation).
Here, I will discuss how to automate a key element of any machine learning system—the data pipeline.
2 Ways to Automate Data Pipelines
While there are countless ways to build and automate data pipelines, here I’ll categorize the approaches into two buckets: using an orchestration tool and not using an orchestration tool.
Way 1: Orchestration Tool
Orchestration tools allow developers to manage workflows with hundreds (or even thousands) of steps.
Airflow
One of the most popular orchestration tools is Airflow, which can manage complex workflows using Python. This has made it a standard among data engineers managing enterprise data pipelines.
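To give a sense of what this looks like, here is a minimal, hypothetical Airflow DAG for a three-step ETL job. This is only a sketch (it assumes a recent Airflow 2.x install, and the task functions are placeholders), not code we will use later in this article.
# etl_dag.py -- minimal, hypothetical Airflow DAG for a three-step ETL job
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull raw data from the source

def transform():
    ...  # clean and reshape the data

def load():
    ...  # write the result to its destination

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task  # run the steps in order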
A downside of Airflow, however, is that its setup and maintenance can be complicated. Consequently, it requires a strong technical understanding of how it works, which can take time to develop.
Airflow Wrappers
The complexity of Airflow has led to the rise of Airflow wrappers, which make its core functionality easier to use. Popular examples of these tools include Dagster, Mage, Astronomer, and Prefect (not a wrapper, but comparable).
While all these tools provide scalable and reliable ways to manage sophisticated data workflows, they might be overkill for many ML applications. So, let’s take a step back and ask how we can build pipelines without orchestration tools.
Way 2: Python + Triggers
As discussed in the previous article, data pipelines consist of three basic tasks: extraction, transformation, and loading. All of these can be implemented using Python.
For example, if we wanted to pull data from a single data source and store it in a single database, the workflow would look like this.

While we could surely use an orchestration tool for this, nothing stops us from consolidating the ETL scripts into a single Python file, say etl.py, and running that.
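As a rough sketch, such a consolidated script might look like the following (the extract, transform, and load functions here are hypothetical placeholders, not the code from the previous article).
# etl.py -- a minimal, hypothetical consolidation of the ETL steps
def extract_data():
    ...  # pull raw data from the source (e.g. an API)

def transform_data():
    ...  # clean and reshape the raw data

def load_data():
    ...  # write the prepared data to its destination (e.g. a database or file)

if __name__ == "__main__":
    extract_data()
    transform_data()
    load_data()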

Automating with Triggers
The above abstraction simplifies the execution of this pipeline, but it’s still not automated, since we still have to run the etl.py script by hand. To take the next step, we need to introduce a trigger.
A trigger runs a command when a specific criterion is satisfied. For example, the clock strikes 12:00 AM, or a new file appears in a directory.
This final piece allows us to fully automate the data pipeline so we can spend our time doing more productive things, e.g. reading Medium articles 😉.
Here’s what our example pipeline would look like if it ran every day at midnight.

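Just for illustration, a bare-bones time-based trigger could even be written in plain Python. This is a sketch that assumes the consolidated etl.py from above; in practice you would reach for cron or, as we do next, GitHub Actions.
# local_trigger.py -- a crude, illustrative trigger: run the pipeline every day at midnight
import datetime
import subprocess
import time

while True:
    now = datetime.datetime.now()
    # sleep until the next midnight, then run the consolidated ETL script
    next_midnight = (now + datetime.timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    time.sleep((next_midnight - now).total_seconds())
    subprocess.run(["python", "etl.py"], check=True)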
GitHub Actions
We can implement the simple workflow described above via GitHub Actions (GA). GA is a CI/CD (continuous integration and continuous delivery) platform, which is a fancy way of saying it helps you automate the testing and updating of software.
While data has traditionally been seen as sitting outside of software, the two are inseparable when it comes to machine learning. This is because ML uses data to "write" software (i.e. train models).
The biggest upside of using GA is that GitHub provides free compute to run actions for public repositories, which is great for poor developers (like me) and simple proof-of-concept projects.
Example Code: Automating ETL Pipeline for YouTube Video Transcripts
Let’s walk through a concrete example of automating a simple ETL pipeline using GitHub Actions. I’ll do this for the example code from the previous article in this series, where I extracted transcripts from all the videos on my YouTube channel.
Example code is freely available at the GitHub repository.
Create ETL Python Script
The first step is consolidating the entire ETL (Extract, Transform, Load) pipeline into a single Python script.
I do that here by defining three files: functions.py, data_pipeline.py, and requirements.txt. In functions.py, each step of the pipeline is defined as a function. These are then called in sequential order in data_pipeline.py, and requirements.txt lists all the Python dependencies for the pipeline.

# data_pipeline.py
from functions import *
import time
import datetime
print("Starting data pipeline at ", datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
print("----------------------------------------------")
# Step 1: extract video IDs
t0 = time.time()
getVideoIDs()
t1 = time.time()
print("Step 1: Done")
print("---> Video IDs downloaded in", str(t1-t0), "seconds", "\n")
# Step 2: extract transcripts for videos
t0 = time.time()
getVideoTranscripts()
t1 = time.time()
print("Step 2: Done")
print("---> Transcripts downloaded in", str(t1-t0), "seconds", "\n")
# Step 3: Transform data
t0 = time.time()
transformData()
t1 = time.time()
print("Step 3: Done")
print("---> Data transformed in", str(t1-t0), "seconds", "\n")
# Step 4: Generate text embeddings
t0 = time.time()
createTextEmbeddings()
t1 = time.time()
print("Step 4: Done")
print("---> Embeddings generated in", str(t1-t0), "seconds", "\n")
I added time printouts for each step in the pipeline to help with code observability and debugging. I won’t get into the guts of each step here since those were described in the previous article and this video, but those interested can review the code on GitHub.
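For reference, here is a heavily simplified sketch of what functions.py might contain. The bodies below are placeholders only; the real implementations (YouTube API calls, transcript downloads, and embedding generation) live in the repo.
# functions.py (simplified sketch; see the repo for the real implementations)

def getVideoIDs():
    """Extract: pull the IDs of all videos on the channel and save them locally."""
    ...

def getVideoTranscripts():
    """Extract: download the transcript for each video ID gathered in step 1."""
    ...

def transformData():
    """Transform: clean the extracted data and reshape it into its final format."""
    ...

def createTextEmbeddings():
    """Generate text embeddings from the transcripts and save the final data files."""
    ...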
Create GitHub Repo
Next, we create a GitHub repo. You can do this from the command line or the web interface. I’ll use the latter.
To do this, I go to my GitHub repositories, hit "New", and fill out the following fields.

Then, we can clone the repo locally via the following command.
>> git clone https://github.com/ShawhinT/data-pipeline-example.git
Create Workflow .yml File
We can now create our workflow via a .yml file. To do that, we will create a new folder called .github/workflows and a new file called data-pipeline.yml.

The .yml file looks like this.
name: data-pipeline-workflow

on:
  push: # run on push
  schedule:
    - cron: "35 0 * * *" # run every day at 12:35AM
  workflow_dispatch: # manual triggers

jobs:
  run-data-pipeline:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo content
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.PERSONAL_ACCESS_TOKEN }} # Use the PAT instead of the default GITHUB_TOKEN
      - name: Setup python
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run data pipeline
        env:
          YT_API_KEY: ${{ secrets.YT_API_KEY }} # import API key
        run: python data_pipeline.py # run data pipeline
      - name: Check for changes # create env variable indicating if any changes were made
        id: git-check
        run: |
          git config user.name 'github-actions'
          git config user.email '[email protected]'
          git add .
          git diff --staged --quiet || echo "changes=true" >> $GITHUB_ENV
      - name: Commit and push if changes
        if: env.changes == 'true' # if changes made push new data to repo
        run: |
          git commit -m "updated video index"
          git push
While this may seem overwhelming to those new to GitHub Actions, it consists of 4 key elements: name, triggers, jobs, and steps.
Starting with the name, this is the name of our workflow. Here, I call it "data-pipeline-workflow".
Next, we define triggers for the workflow using the "on:" syntax. Here, I define three separate triggers: 1) on push, meaning the workflow will execute when new code is pushed to the repo; 2) on schedule, meaning it will execute via a cron job (here, "35 0 * * *" means minute 35 of hour 0, i.e. 12:35 AM every day; note that GitHub Actions cron schedules run in UTC); and 3) on workflow_dispatch, which adds a "Run workflow" button to the repo’s Actions tab so we can trigger the workflow manually.
Workflows are made up of jobs. In this example, we have a single job called "run-data-pipeline". We can set the operating system for the job (I use Ubuntu here).
Jobs are then made up of steps. Here, our job consists of 6 steps, which are listed below.
- Checkout repo content: uses the pre-built checkout action to pull code from the repo. I also provide a Personal Access Token (PAT) to allow the action to push code to the repo, which we will create momentarily.
- Setup Python: install the desired Python version with the pre-built action.
- Install dependencies: install Python libraries listed in requirements.txt.
- Run data pipeline: executes the data_pipeline.py script. I add a secret environment variable to allow the action to use my YouTube API key without exposing it in a public repo. I will show you how to do this next.
- Check for changes: checks whether any changes were made to the repo. This is necessary because if we try to push code to the repo when there are no differences, git will throw an error and the job will fail. The git diff --staged --quiet command exits with a non-zero status only when there are staged changes, so the echo runs (and sets changes=true) only in that case.
- Commit and push if changes: if there are changes to the repo, they will be committed and pushed.
One last thing I’ll do is create a folder called "data" where we can save the final data files.

Add Repo Secrets
Notice that in the data-pipeline.yml file I referenced two strange-looking variables: ${{ secrets.PERSONAL_ACCESS_TOKEN }} and ${{ secrets.YT_API_KEY }}.
These are repository secrets that are accessible to GitHub Actions as environment variables. To create them, we go to our repository settings, click "Secrets and variables", select "Actions", and click "New repository secret".

To create the PERSONAL_ACCESS_TOKEN variable, click on your profile icon in the top right-hand corner > open Settings in a new tab > scroll to the bottom and select "Developer settings" > click "Personal access tokens" > click "Generate new token" (classic). This will allow you to create a token that gives the Action write access to your repo.
Here are the details I used.
- Name: data-pipeline-example-PAT
- Expiration: No expiration (feel free to set this to end at some point)
- Select scopes: repo
Then click "Generate token" at the bottom of the page. This will display a long string of text, which you can copy and paste into your GitHub repository secret.

I then create another secret variable, YT_API_KEY, in the same way.
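Inside the pipeline code, these secrets show up as ordinary environment variables (the workflow’s env: block injects YT_API_KEY at run time). A minimal sketch of how the script might read it:
import os

# YT_API_KEY is injected by the workflow's env: block; never hard-code the key itself
yt_api_key = os.getenv("YT_API_KEY")
if yt_api_key is None:
    raise RuntimeError("YT_API_KEY is not set")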
Commit and push
With our secrets in place, we can commit and push our code.
>> git add .
>> git commit -m "adding data pipeline code"
>> git push
Once pushed, we can go to the repo’s "Actions" tab and watch the workflow run!
GitHub – ShawhinT/data-pipeline-example: Example data pipeline automation with GitHub Actions
What’s next?
While orchestration tools are commonplace in data engineering, they might be overkill for some ML use cases. Here, we saw a free and simple way to automate a data pipeline using Python and GitHub Actions.
In the next article of this series, we will continue going down the data science tech stack and discuss how we can integrate this data pipeline into a semantic search system for my YouTube videos.
More in this series 👇
Resources
Connect: My website | Book a call
Socials: YouTube 🎥 | LinkedIn | Twitter
Support: Buy me a coffee ☕️