Airflow in Docker Metrics Reporting
Use Grafana on top of the official Apache Airflow image to monitor queue health and much more.
An unsettling yet likely familiar situation: you deployed Airflow successfully, but find yourself constantly refreshing the webserver UI to make sure everything is running smoothly.
You rely on certain alerting tasks to execute upon upstream failures, but if the queue is full and tasks are stalling, how will you be notified?
One solution: deploying Grafana, an open source reporting service, on top of Airflow.
The Proposed Architecture
To start, I’ll assume a basic understanding of Airflow functionality and containerization using Docker and Docker Compose. More resources can be found in the official Airflow, Docker, and Docker Compose documentation.
Reference the code to follow along: https://github.com/sarahmk125/airflow-docker-metrics
Now, the fun stuff.
Used Services
To get Airflow metrics into a visually appealing dashboard that supports alerting, the following services are spun up in Docker containers declared in the `docker-compose.yml` file:

- Airflow: Airflow runs tasks within DAGs, defined in Python files stored in the `./dags/` folder. One sample DAG declaration file is already there. Multiple containers are run, with particular nuances accounting for the use of the official `apache/airflow` image. More on that later.
- StatsD-Exporter: The StatsD-Exporter container converts Airflow’s metrics from StatsD format to Prometheus format, the datasource for the reporting layer (Grafana). The container definition includes the command executed on startup, which defines how the exposed ports are used:
```yaml
statsd-exporter:
  image: prom/statsd-exporter
  container_name: airflow-statsd-exporter
  command: "--statsd.listen-udp=:8125 --web.listen-address=:9102"
  ports:
    - 9123:9102
    - 8125:8125/udp
```
- Prometheus: Prometheus is a service commonly used for time-series data reporting. It is particularly convenient when using Grafana as a reporting UI, since Prometheus is a supported datasource. The volumes mounted in the container definition indicate how data flows to and from Prometheus:
```yaml
prometheus:
  image: prom/prometheus
  container_name: airflow-prometheus
  user: "0"
  ports:
    - 9090:9090
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    - ./prometheus/volume:/prometheus
```
- Grafana: Grafana is a reporting UI service commonly used on top of time-series datasources. In the code described, Grafana uses Prometheus as the datasource for its dashboards. The container definition includes an admin user for the portal, as well as the volumes defining the pre-configured datasources and dashboards:
```yaml
grafana:
  image: grafana/grafana:7.1.5
  container_name: airflow-grafana
  environment:
    GF_SECURITY_ADMIN_USER: admin
    GF_SECURITY_ADMIN_PASSWORD: password
    GF_PATHS_PROVISIONING: /grafana/provisioning
  ports:
    - 3000:3000
  volumes:
    - ./grafana/volume/data:/grafana
    - ./grafana/volume/datasources:/grafana/datasources
    - ./grafana/volume/dashboards:/grafana/dashboards
    - ./grafana/volume/provisioning:/grafana/provisioning
```
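For context on what the StatsD-Exporter container is listening for: Airflow ships each metric to UDP port 8125 as a plain-text StatsD datagram. A small sketch of that wire format, assuming the default `airflow` prefix (the helper function here is illustrative, not part of Airflow):

```python
def format_statsd_metric(name: str, value, metric_type: str, prefix: str = "airflow") -> str:
    """Build a StatsD datagram: <prefix>.<name>:<value>|<type>.

    metric_type is "c" (counter), "g" (gauge), or "ms" (timer), matching
    what the statsd-exporter container listens for on UDP port 8125.
    """
    return f"{prefix}.{name}:{value}|{metric_type}"

# A gauge like the ones Airflow's scheduler emits for queue health:
print(format_statsd_metric("executor.queued_tasks", 3, "g"))
# -> airflow.executor.queued_tasks:3|g

# To actually send one by hand, you would write the string to UDP 8125, e.g.:
#   import socket
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.sendto(b"airflow.executor.queued_tasks:3|g", ("localhost", 8125))
```

The StatsD-Exporter turns each such datagram into a Prometheus series (dots become underscores, so `airflow.executor.queued_tasks` surfaces as `airflow_executor_queued_tasks`).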
Make It Go
To start everything up, the following tools are required: Docker, docker-compose, Python3, Git.
Steps (to be run in a terminal):

- Clone the repository: `git clone https://github.com/sarahmk125/airflow-docker-metrics.git`
- Navigate to the cloned folder: `cd airflow-docker-metrics`
- Start up the containers: `docker-compose -f docker-compose.yml up -d` (Note: they can be stopped or removed by running the same command with `stop` or `down` at the end, respectively.)
The result:
- Airflow webserver UI: http://localhost:8080
- StatsD metrics list: http://localhost:9123/metrics
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (login: username=admin, password=password)
The repository includes an Airflow Metrics dashboard, which can be set up with alerts, showing the number of running and queued tasks over time.
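Before building dashboards, it can be handy to sanity-check that metrics are actually flowing by inspecting the StatsD-Exporter endpoint at http://localhost:9123/metrics. Below is a sketch of parsing its Prometheus text format; the sample payload is hand-written for illustration, not captured output:

```python
def parse_prometheus_text(payload: str) -> dict:
    """Parse simple Prometheus text-exposition lines ("name value") into a
    dict, skipping comments (# HELP / # TYPE) and blank lines."""
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative excerpt of what the statsd-exporter endpoint might return:
sample = """\
# HELP airflow_executor_queued_tasks Metric autogenerated by statsd_exporter.
# TYPE airflow_executor_queued_tasks gauge
airflow_executor_queued_tasks 2
airflow_executor_running_tasks 1
"""
metrics = parse_prometheus_text(sample)
print(metrics["airflow_executor_queued_tasks"])  # -> 2.0

# In practice you would fetch the payload first, e.g.:
#   import urllib.request
#   payload = urllib.request.urlopen("http://localhost:9123/metrics").read().decode()
```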
Steps Explained
How does Prometheus actually get the metrics?
Prometheus is configured on startup by the `./prometheus/prometheus.yml` file, which is mounted as a volume:

```yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 10s
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['airflow-prometheus:9090']
  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['airflow-statsd-exporter:9102']
    tls_config:
      insecure_skip_verify: true
```
In particular, the `scrape_configs` section declares the targets Prometheus scrapes: the `airflow-prometheus` container itself and, crucially, the `airflow-statsd-exporter` container carrying the Airflow metrics.
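Once scraping is underway, those same series can be read back through Prometheus’s HTTP API (`GET /api/v1/query`). The sketch below builds an instant-query URL and extracts a value from the JSON shape Prometheus returns; the sample response is hand-written for illustration, not captured output:

```python
import json
from urllib.parse import urlencode

def instant_query_url(base: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def first_value(response_json: str) -> float:
    """Extract the first sample value from a Prometheus query response.

    Instant queries return:
      {"status": ..., "data": {"result": [{"value": [<ts>, "<val>"]}, ...]}}
    """
    data = json.loads(response_json)
    return float(data["data"]["result"][0]["value"][1])

url = instant_query_url("http://localhost:9090", "airflow_executor_queued_tasks")
print(url)  # -> http://localhost:9090/api/v1/query?query=airflow_executor_queued_tasks

# Hand-written sample of the response shape:
sample = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"__name__":"airflow_executor_queued_tasks"},'
          '"value":[1600000000,"2"]}]}}')
print(first_value(sample))  # -> 2.0
```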
How are dashboards and alerts created in Grafana?
Provisioning is your friend!
Provisioning in Grafana means using code to define the datasources, dashboards, and alerts that exist on startup. When the containers start, there is a `Prometheus` datasource already configured at localhost:3000/datasources and an Airflow Metrics dashboard listed at localhost:3000/dashboards.
How to provision:

- All the relevant data is mounted as volumes onto the `grafana` container defined in the `docker-compose.yml` file (described above).
- The `./grafana/volume/provisioning/datasources/default.yaml` file contains a definition of all datasources:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
```

- The `./grafana/volume/provisioning/dashboards/default.yaml` file contains information on where dashboards are mounted in the container:

```yaml
apiVersion: 1
providers:
  - name: dashboards
    folder: General
    type: file
    editable: true
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /grafana/dashboards
      foldersFromFilesStructure: true
```

- The `./grafana/volume/dashboards/` folder contains `.json` files, each representing a dashboard. The `airflow_metrics.json` file results in the dashboard shown above.
The JSON for an existing dashboard can be exported from the Grafana UI.
Alerts can also be set up in the UI, including notification channels such as Slack, and can be provisioned in the same way as dashboards and datasources.
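Since each provisioned dashboard is just a `.json` file, dashboards can also be generated programmatically. Below is a heavily stripped-down sketch of one graph panel charting queued tasks; a real export from the Grafana UI carries many more fields, and the titles here are made up:

```python
import json

def queued_tasks_dashboard() -> dict:
    """Build a minimal Grafana dashboard dict with one Prometheus graph panel.

    Only the skeleton (title, panels, targets) is shown; Grafana fills in
    sensible defaults for everything else when it loads the file.
    """
    return {
        "title": "Airflow Metrics (generated)",
        "panels": [
            {
                "title": "Queued tasks",
                "type": "graph",
                "datasource": "Prometheus",
                "targets": [{"expr": "airflow_executor_queued_tasks"}],
            }
        ],
        "schemaVersion": 22,
    }

# Written into ./grafana/volume/dashboards/, Grafana would pick the file up
# on its next provisioning scan (see updateIntervalSeconds above):
print(json.dumps(queued_tasks_dashboard(), indent=2))
```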
Bonus Topic: The Official Airflow Image
Before there was an official Docker image, Matthieu “Puckel_” Roisil released Docker support for Airflow. Starting with Airflow version 1.10.10, the Apache Software Foundation released an official image on DockerHub, which is the only current and continuously updated image. However, many still rely on the legacy, unofficial `docker-airflow` repository.
Why is this a problem? Relying on the legacy repository means capping Airflow at version 1.10.9. Airflow 1.10.10 began supporting some cool features such as running tasks on Kubernetes. The official repository is also where the upcoming (and highly anticipated) Airflow 2.0 will be released.
The new `docker-compose` declaration for the `webserver` found in the described repository looks something like this:
```yaml
webserver:
  container_name: airflow-webserver
  image: apache/airflow:1.10.12-python3.7
  restart: always
  depends_on:
    - postgres
    - redis
    - statsd-exporter
  environment:
    - LOAD_EX=n
    - EXECUTOR=Local
    - POSTGRES_USER=airflow
    - POSTGRES_PASSWORD=airflow
    - POSTGRES_DB=airflow
    - AIRFLOW__SCHEDULER__STATSD_ON=True
    - AIRFLOW__SCHEDULER__STATSD_HOST=statsd-exporter
    - AIRFLOW__SCHEDULER__STATSD_PORT=8125
    - AIRFLOW__SCHEDULER__STATSD_PREFIX=airflow
    - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
    - AIRFLOW__CORE__FERNET_KEY=pMrhjIcqUNHMYRk_ZOBmMptWR6o1DahCXCKn5lEMpzM=
    - AIRFLOW__CORE__EXECUTOR=LocalExecutor
    - AIRFLOW__CORE__AIRFLOW_HOME=/opt/airflow/
    - AIRFLOW__CORE__LOAD_EXAMPLES=False
    - AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
    - AIRFLOW__WEBSERVER__WORKERS=2
    - AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800
  volumes:
    - ./dags:/opt/airflow/dags
  ports:
    - "8080:8080"
  command: bash -c "airflow initdb && airflow webserver"
  healthcheck:
    test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
    interval: 30s
    timeout: 30s
    retries: 3
```
A few changes from the `puckel/docker-airflow` configuration to highlight:

- Custom parameters such as `AIRFLOW__CORE__SQL_ALCHEMY_CONN` that were previously found in the `airflow.cfg` file are now declared as environment variables in the `docker-compose` file.
- The `airflow initdb` command to initialize the backend database is now declared as a command in the `docker-compose` file, as opposed to an entrypoint script.
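Those environment-variable overrides follow a fixed convention: `AIRFLOW__{SECTION}__{KEY}` overrides `key` in the `[section]` block of `airflow.cfg`. A small illustrative sketch of that mapping (the helper function is mine, not part of Airflow):

```python
def airflow_env_to_config(var_name):
    """Split an AIRFLOW__SECTION__KEY variable into (section, key),
    mirroring how Airflow maps environment variables onto airflow.cfg."""
    prefix = "AIRFLOW__"
    if not var_name.startswith(prefix):
        raise ValueError(f"not an Airflow config override: {var_name}")
    section, _, key = var_name[len(prefix):].partition("__")
    return section.lower(), key.lower()

print(airflow_env_to_config("AIRFLOW__CORE__SQL_ALCHEMY_CONN"))
# -> ('core', 'sql_alchemy_conn'): overrides sql_alchemy_conn in [core]
```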
Voila!
There you have it. No more worrying if your tasks are infinitely queued and not running. Airflow running in Docker, with dashboards and alerting available in Grafana at your fingertips. The same architecture can be run on an instance deployed in GCP or AWS for 24/7 monitoring just like it was run locally.
The finished product can be found here: https://github.com/sarahmk125/airflow-docker-metrics
It’s important to note, there’s always room for improvement:
- This monitoring setup does not capture container or instance failures; a separate or extended solution is needed to monitor at the container or instance level.
- The current code runs using the LocalExecutor, which is less than ideal for large workloads. Further testing with the CeleryExecutor can be done.
- There are many more metrics available in StatsD that were not highlighted (such as DAG or task duration, counts of task failures, etc.). More dashboards can be built and provisioned in Grafana to leverage all the relevant metrics.
- Lastly, this article focuses on a self-hosted (or highly configurable cloud) deployment for Airflow, but this is not the only option for deploying Airflow.
Questions? Comments?
Thanks for reading! I love talking data stacks. Shoot me a message.