
Often, we need to collect data over a certain period of time. It can be data from an IoT sensor, statistics from social networks, or something else. As an example, the YouTube Data API allows us to get the current number of views and subscribers for any channel, but the analytics and historical data are available only to the channel owner. Thus, if we want weekly or monthly summaries for these channels, we need to collect this data ourselves. In the case of an IoT sensor, there may be no API at all, and we again need to collect and save the data on our own. In this article, I will show how to configure Apache Airflow on a Raspberry Pi, which allows tasks to run over a long period of time without involving any cloud provider.
Obviously, if you’re working for a large company, you will probably not need a Raspberry Pi. In that case, if you need an extra cloud instance, just create a Jira ticket for your MLOps department 😉 But for a pet project or a low-budget startup, it can be an interesting solution.
Let’s see how it works.
Raspberry Pi
What actually is a Raspberry Pi? For readers who have not followed hardware for the last 10 years (the first Raspberry Pi model was introduced in 2012), I can briefly explain that it is a single-board computer running full-fledged Linux. A typical Raspberry Pi has a 1 GHz, 2–4-core ARM CPU and 1–8 GB of RAM. It is small, cheap, and silent; it has no fans and no disk drive (the OS runs from a micro SD card). A Raspberry Pi needs only a standard USB power supply; it can be connected to a network via Wi-Fi or Ethernet and run tasks for months or even years.
For my data science pet project, I wanted to collect YouTube channel statistics over 2 weeks. For a task that requires only 30–60 seconds twice per day, a serverless architecture can be a perfect solution, and we could use something like Google Cloud Functions for that. But every tutorial from Google started with the phrase "enable billing for your project". Google provides a free initial credit and free quotas, but I did not want the extra headache of monitoring how much money I spent and how many requests I made, so I decided to use a Raspberry Pi instead. It turned out that the Raspberry Pi is an excellent data science tool for collecting data. This $50 single-board computer consumes only about 2 W of power; it is small, silent, and can be placed anywhere. Last but not least, I already had a Raspberry Pi at home, so my cost was literally zero. I just plugged the power supply into the socket, and the problem of cloud computing was solved 😉
There are different Raspberry Pi models on the market, and at the time of writing this article, the Raspberry Pi 4 and Raspberry Pi 5 are the most powerful. But for tasks that do not require a lot of requests or "heavy" postprocessing, the earliest models will also do the job.
Apache Airflow
For my pet project, I had a list of 3,000 YouTube channels, and the YouTube Data API limit is 10,000 requests per day, so I decided to collect data twice per day. If we want to run Python code twice per day, we can use a cron job or just add time.sleep(12*60*60) to the main application loop. But a much more effective and fun way is to use Apache Airflow. Apache Airflow is a professional workflow management platform, and it is also open-source and free to use.
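For comparison, the cron alternative would boil down to a single crontab line (the script path here is hypothetical, just for illustration):
# run the collection script at 00:00 and 12:00 every day
0 0,12 * * * /usr/bin/python3 /home/pi/get_statistics.py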
It is easy to install Airflow on a Raspberry Pi using a pip command (here, I used Apache Airflow 2.7.1 and Python 3.9):
sudo pip3 install "apache-airflow==2.7.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.9.txt"
It is generally not recommended to use sudo with pip, but on my Raspberry Pi, the Python Airflow libraries were not found when I started Airflow as a service (services are running as root), and using sudo pip3 was the easiest fix for that.
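A cleaner alternative, which I have not tested on this particular setup, is to install Airflow into a virtual environment and point the systemd units (shown below) at that environment's airflow binary:
# untested alternative to sudo pip: a dedicated virtual environment
python3 -m venv /home/pi/airflow/venv
/home/pi/airflow/venv/bin/pip install "apache-airflow==2.7.1" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.1/constraints-3.9.txt"
# the service files would then use ExecStart=/home/pi/airflow/venv/bin/airflow ...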
When Apache Airflow is installed, we need to initialize it and create a user:
cd ~
mkdir airflow && cd airflow
export AIRFLOW_HOME=/home/pi/airflow
# initialize the Airflow metadata database (SQLite by default)
airflow db init
# create an admin user for the web UI
airflow users create --role Admin --username airflow --password airflow --email admin --firstname admin --lastname admin
# folder for DAG files
mkdir dags
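As an optional sanity check, the standard Airflow CLI can confirm that the home folder, DAGs folder, and metadata database were picked up correctly:
airflow info
airflow users list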
Now Airflow is installed, but I wanted it to start automatically as a service after boot. First, I created an /etc/systemd/system/airflow-webserver.service file to run the Apache Airflow web server as a service:
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
EnvironmentFile=/home/pi/airflow/env
User=pi
Group=pi
Type=simple
ExecStart=/bin/bash -c 'airflow webserver --pid /home/pi/airflow/webserver.pid'
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
In the same way, I created an /etc/systemd/system/airflow-scheduler.service file for the Airflow scheduler:
[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
EnvironmentFile=/home/pi/airflow/env
User=pi
Group=pi
Type=simple
ExecStart=/usr/bin/bash -c 'airflow scheduler'
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
We also need a /home/pi/airflow/env file:
AIRFLOW_CONFIG=/home/pi/airflow/airflow.cfg
AIRFLOW_HOME=/home/pi/airflow/
Now we can start the new services, and Apache Airflow is ready to use:
sudo systemctl daemon-reload
sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-scheduler.service
sudo systemctl start airflow-webserver.service
sudo systemctl start airflow-scheduler.service
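If something does not start as expected, the usual systemd tools show the service state and logs:
sudo systemctl status airflow-webserver.service
sudo journalctl -u airflow-scheduler.service -f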
If everything was done correctly, we can log in to the Apache Airflow web panel (by default, the web server listens on port 8080, so the panel is available at http://<raspberry-pi-address>:8080; credentials: "airflow" / "airflow"):

Apache Airflow DAG
A DAG (Directed Acyclic Graph) is a core concept of Apache Airflow. Combining different tasks into a graph allows us to organize pretty complex data processing pipelines. A DAG itself is created in the form of a Python file, which has to be placed in the "dags" folder of Apache Airflow (this path is specified by the dags_folder parameter in the "airflow.cfg" file). On startup, or after pressing the "Refresh" button, Apache Airflow imports these Python files and gets all the needed information from them.
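With the AIRFLOW_HOME used earlier, the relevant line in airflow.cfg should look roughly like this (the exact value on your system may differ):
[core]
dags_folder = /home/pi/airflow/dags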
In my case, I created a process_channels method located in a get_statistics.py file:
from pyyoutube import Api


def process_channels(requests_limit: int,
                     data_path: str):
    """ Get data for YouTube channels and save it in CSV file """
    ...
(getting the YouTube data itself is out of the scope of this article; it can just be any method we want to run periodically)
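For those curious what such a method could look like, here is a minimal, hypothetical sketch built on the pyyoutube Api client; the API key placeholder, the example channel ID, the CSV layout, and the helper logic are all assumptions for illustration, not the actual code from my project:
import csv
from datetime import datetime, timezone

from pyyoutube import Api


def process_channels(requests_limit: int, data_path: str) -> int:
    """ Hypothetical sketch: fetch statistics for a channel and append them to a CSV file """
    api = Api(api_key="YOUR_API_KEY")  # placeholder; a real key would come from a config or secret
    # A single example channel ID (Google Developers); a real implementation would loop
    # over the full channel list and stop once requests_limit is reached
    response = api.get_channel_info(channel_id="UC_x5XG1OV2P6uZZ5FSM9Ttw")
    saved = 0
    with open(data_path + "channels.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for channel in response.items:
            stats = channel.to_dict().get("statistics", {})
            writer.writerow([datetime.now(timezone.utc).isoformat(),
                             channel.id,
                             stats.get("subscriberCount"),
                             stats.get("viewCount")])
            saved += 1
    return saved
The DAG below only relies on process_channels(limit, data_path) returning the number of processed channels.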
A DAG file for running our code in Apache Airflow is simple:
from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable
from datetime import datetime, timedelta

data_path = "/home/pi/airflow/data/"

default_args = {
    "depends_on_past": False,
    "email": [],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=60),
}


def create_dag():
    """ Create a DAG object """
    return DAG(
        "dag_youtube",
        default_args=default_args,
        description="YouTube Retrieve",
        schedule_interval=timedelta(hours=12),
        start_date=datetime(2021, 1, 1),
        catchup=False,
        tags=["youtube"]
    )


@task(task_id="collect_channels_stats_gr1")
def get_channels_stats_gr1():
    import get_statistics as gs
    limit = int(Variable.get("RequestLimit"))
    ret = gs.process_channels(limit, data_path)
    return f"GR1: {ret} channels saved"


@task(task_id="collect_channels_stats_gr2")
def get_channels_stats_gr2():
    import get_statistics as gs
    limit = int(Variable.get("RequestLimit"))
    ret = gs.process_channels(limit, data_path)
    return f"GR2: {ret} channels saved"


@task(task_id="collect_channels_stats_gr3")
def get_channels_stats_gr3():
    import get_statistics as gs
    limit = int(Variable.get("RequestLimit"))
    ret = gs.process_channels(limit, data_path)
    return f"GR3: {ret} channels saved"


# Create the DAG
with create_dag() as dag:
    get_channels_stats_gr1()
    get_channels_stats_gr2()
    get_channels_stats_gr3()
As we can see, here I made a create_dag method, and the most important part is the schedule_interval parameter, which is set to 12 hours. In total, I have 3 tasks in my DAG; they are represented by the three almost identical get_channels_stats_gr1..3 methods. Each task is isolated and will be executed separately by Apache Airflow. I also created a variable RequestLimit. The YouTube API is limited to 10,000 requests per day, and during debugging, it makes sense to keep this parameter low. Later, this value can be changed at any time via the "Variables" page of the Apache Airflow web panel.
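The same variable can also be created or updated with the standard Airflow CLI (the value here is just an example):
airflow variables set RequestLimit 100
airflow variables get RequestLimit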
Running the DAG
Our task is ready. We can press the "Refresh" button, and the new DAG will appear in the list and will be executed according to the programmed schedule. As we can see, installing Apache Airflow and creating a DAG is not rocket science, but it still requires some effort. What is it for? Couldn't we just add one line to a cron job instead? Well, even for a simple task like this, Apache Airflow provides a lot of functionality:
- We can see the task status, the number of completed and failed tasks, the time for the next run, and other parameters:

- If the task failed, it is easy to click it and see what is going on and when the incident happened:

- I can even click on the failed task and see its crash log. In my case, a crash occurred during the retrieval of YouTube channel data. One of the channels was probably removed or disabled by the owner, and no data was available anymore:

- I can see a calendar with a pretty detailed log of previous and future tasks:

- I can also see a duration log that can give some insights about the task execution time:

So, Apache Airflow provides far more functionality than a simple cron job. Last but not least, knowledge of Airflow is also a nice skill that is often required in the industry 😉
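The web UI is not the only way to work with the scheduler, by the way; the same DAG can be listed and triggered manually with standard Airflow 2.x CLI commands:
airflow dags list
airflow dags trigger dag_youtube
airflow tasks list dag_youtube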
Conclusion
In this article, I installed and configured Apache Airflow and ran it on a Raspberry Pi. Apache Airflow is a professional workflow management platform; according to 6sense.com, it holds a 29% market share and is used by large companies such as Ubisoft, SEB, and Hitachi. But as we can see, even at a "nano" scale, Apache Airflow can be successfully used on microcomputers like the Raspberry Pi.
Those who are interested in using a Raspberry Pi for data science-related projects are welcome to read my other TDS articles:
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. If you want to get the full source code for this and my next posts, feel free to visit my Patreon page.
Thanks for reading.