Airflow in Docker Metrics Reporting

Use Grafana on top of the official Apache Airflow image to monitor queue health and much more.

Sarah Krasnik Bedell
Towards Data Science


An unsettling yet likely familiar situation: you deployed Airflow successfully, but find yourself constantly refreshing the webserver UI to make sure everything is running smoothly.

You rely on certain alerting tasks to execute upon upstream failures, but if the queue is full and tasks are stalling, how will you be notified?

One solution: deploying Grafana, an open source reporting service, on top of Airflow.


The Proposed Architecture


To start, I’ll assume a basic understanding of Airflow functionality and of containerization using Docker and Docker Compose. The official documentation for Airflow, Docker, and Docker Compose covers the fundamentals.

Reference the code to follow along: https://github.com/sarahmk125/airflow-docker-metrics

Now, the fun stuff.

Used Services

To get Airflow metrics into a visually appealing dashboard that supports alerting, the following services are spun up in Docker containers declared in the docker-compose.yml file:

  • Airflow: Airflow runs tasks within DAGs, defined in Python files stored in the ./dags/ folder; one sample DAG declaration file is already there. Multiple containers are run, with a few nuances that come from using the official apache/airflow image. More on that later.
  • StatsD-Exporter: The StatsD-Exporter container converts Airflow’s metrics from StatsD format to Prometheus format, the datasource for the reporting layer (Grafana). More information is available in the prom/statsd-exporter documentation. The container definition includes the command executed on startup, which defines how the exposed ports are used:
statsd-exporter:
  image: prom/statsd-exporter
  container_name: airflow-statsd-exporter
  command: "--statsd.listen-udp=:8125 --web.listen-address=:9102"
  ports:
    - 9123:9102
    - 8125:8125/udp
  • Prometheus: Prometheus is a service commonly used for time-series data reporting. It is particularly convenient when using Grafana as a reporting UI, since Prometheus is a supported datasource. More information is available in the Prometheus documentation. The volumes mounted in the container definition indicate how data flows to and from Prometheus:
prometheus:
  image: prom/prometheus
  container_name: airflow-prometheus
  user: "0"
  ports:
    - 9090:9090
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    - ./prometheus/volume:/prometheus
  • Grafana: Grafana is a reporting UI service that is often used to connect to non-relational databases. In the code described, Grafana uses Prometheus as a datasource for dashboards. The container definition includes an admin user for the portal, as well as the volumes defining datasources and dashboards that are already pre-configured.
grafana:
  image: grafana/grafana:7.1.5
  container_name: airflow-grafana
  environment:
    GF_SECURITY_ADMIN_USER: admin
    GF_SECURITY_ADMIN_PASSWORD: password
    GF_PATHS_PROVISIONING: /grafana/provisioning
  ports:
    - 3000:3000
  volumes:
    - ./grafana/volume/data:/grafana
    - ./grafana/volume/datasources:/grafana/datasources
    - ./grafana/volume/dashboards:/grafana/dashboards
    - ./grafana/volume/provisioning:/grafana/provisioning
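Under the hood, Airflow ships its metrics to the statsd-exporter as plain StatsD packets over UDP. The sketch below (not part of the repository; the metric name is illustrative) shows what those packets look like on the wire, assuming the `airflow` prefix configured later in the compose file:

```python
import socket

def format_statsd_metric(name: str, value: int, metric_type: str = "c",
                         prefix: str = "airflow") -> bytes:
    """Build a StatsD packet, e.g. b'airflow.ti_successes:1|c' (c = counter)."""
    return f"{prefix}.{name}:{value}|{metric_type}".encode()

def send_metric(packet: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget UDP send, matching --statsd.listen-udp=:8125 above."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))

if __name__ == "__main__":
    # Hypothetical metric emission; Airflow does this internally when
    # AIRFLOW__SCHEDULER__STATSD_ON is enabled.
    send_metric(format_statsd_metric("ti_successes", 1))
```

Because the transport is UDP, sends are fire-and-forget: Airflow never blocks on (or notices) a missing metrics backend.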

Make It Go

To start everything up, the following tools are required: Docker, Docker Compose, Python 3, and Git.

Steps (to be run in a terminal):

  • Clone the repository: git clone https://github.com/sarahmk125/airflow-docker-metrics.git
  • Navigate to the cloned folder: cd airflow-docker-metrics
  • Start up the containers: docker-compose -f docker-compose.yml up -d (Note: they can be stopped or removed by running the same command with stop or down at the end, respectively)
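Once the containers are up, a quick sanity check beats refreshing three browser tabs. This is a hypothetical helper (not part of the repository) that probes the ports declared in docker-compose.yml; the health endpoint paths are assumptions based on each service’s defaults:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def service_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers an HTTP request at all."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        return True   # the server responded, just with an error status
    except (URLError, OSError):
        return False  # connection refused, timeout, DNS failure, ...

if __name__ == "__main__":
    # Ports as exposed in docker-compose.yml; paths are assumed defaults.
    checks = [
        ("airflow", "http://localhost:8080/health"),
        ("statsd-exporter", "http://localhost:9123/metrics"),
        ("prometheus", "http://localhost:9090/-/healthy"),
        ("grafana", "http://localhost:3000/api/health"),
    ]
    for name, url in checks:
        print(f"{name}: {'up' if service_is_up(url) else 'down'}")
```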

The result:

  • Grafana: http://localhost:3000 (login: username=admin, password=password)
    The repository includes an Airflow Metrics dashboard that can be set up with alerts, showing the number of running and queued tasks over time.

Steps Explained

How does Prometheus actually get the metrics?

Prometheus is configured upon startup in the ./prometheus/prometheus.yml file which is mounted as a volume:

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 10s
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['airflow-prometheus:9090']

  - job_name: 'statsd-exporter'
    static_configs:
      - targets: ['airflow-statsd-exporter:9102']
    tls_config:
      insecure_skip_verify: true

In particular, the scrape_configs section declares a destination (the airflow-prometheus container) and a source (the airflow-statsd-exporter container) to scrape.
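What Prometheus actually pulls from the statsd-exporter is plain text in the Prometheus exposition format, served at port 9102 under /metrics. A minimal sketch of that format and one way to parse it (the metric names here are illustrative, not a guaranteed list of what Airflow emits):

```python
def parse_metrics(text: str) -> dict:
    """Parse simple (unlabeled) Prometheus exposition-format lines into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip # HELP / # TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Shape of a statsd-exporter scrape response (values made up for the example)
sample = """\
# HELP airflow_executor_queued_tasks Gauge converted from StatsD
# TYPE airflow_executor_queued_tasks gauge
airflow_executor_queued_tasks 3
airflow_executor_running_tasks 1
"""

print(parse_metrics(sample))
# {'airflow_executor_queued_tasks': 3.0, 'airflow_executor_running_tasks': 1.0}
```

Every 30 seconds (the scrape_interval above), Prometheus fetches this text, timestamps it, and stores it as time series that Grafana can query.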

How are dashboards and alerts created in Grafana?

Provisioning is your friend!

Provisioning in Grafana means using code to define datasources, dashboards, and alerts to exist upon startup. When starting the containers, there is a Prometheus datasource already configured in localhost:3000/datasources and an Airflow Metrics dashboard listed in localhost:3000/dashboards.

How to provision:

  • All the relevant data is mounted as volumes onto the grafana container defined in the docker-compose.yml file (described above)
  • The ./grafana/volume/provisioning/datasources/default.yaml file contains a definition of all data sources:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  • The ./grafana/volume/provisioning/dashboards/default.yaml file contains information on where to mount dashboards in the container:
apiVersion: 1

providers:
  - name: dashboards
    folder: General
    type: file
    editable: true
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /grafana/dashboards
      foldersFromFilesStructure: true
  • The ./grafana/volume/dashboards/ folder contains .json files, each representing a dashboard. The airflow_metrics.json file results in the dashboard shown above.
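Each provisioned dashboard is just JSON. The fragment below is a heavily trimmed, illustrative skeleton of what such a file contains; the field names follow Grafana’s dashboard JSON model, but a file exported from the UI (like the repository’s airflow_metrics.json) is the reliable source, not this sketch:

```python
import json

# Hypothetical minimal dashboard skeleton: one graph panel charting a
# Prometheus query against the provisioned datasource.
dashboard = {
    "title": "Airflow Metrics",
    "panels": [
        {
            "title": "Queued tasks",
            "type": "graph",
            "datasource": "Prometheus",
            "targets": [{"expr": "airflow_executor_queued_tasks"}],
        }
    ],
    "schemaVersion": 22,
}

print(json.dumps(dashboard, indent=2))
```

Dropping a file with this shape into ./grafana/volume/dashboards/ is what the file provider above picks up on its 10-second update interval.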

The JSON can be exported from the Grafana UI via a dashboard’s settings (JSON Model).

Alerts can be set up in the Grafana UI on individual dashboard panels, with notification channels such as Slack. Alerts can also be provisioned in the same way as dashboards and datasources.

Bonus Topic: The Official Airflow Image

Before there was an official Docker image, Matthieu “Puckel_” Roisil released Docker support for Airflow. Starting with Airflow version 1.10.10, the Apache Software Foundation released an official image on DockerHub which is the only current and continuously updated image. However, many still rely on the legacy and unofficial docker-airflow repository.

Why is this a problem? Relying on the legacy repository means capping Airflow at version 1.10.9. Airflow 1.10.10 began supporting some cool features, such as running tasks on Kubernetes. The official repository is also where the upcoming (and highly anticipated) Airflow 2.0 will be released.

The new docker-compose declaration found in the described repository for the webserver looks something like this:

webserver:
  container_name: airflow-webserver
  image: apache/airflow:1.10.12-python3.7
  restart: always
  depends_on:
    - postgres
    - redis
    - statsd-exporter
  environment:
    - LOAD_EX=n
    - EXECUTOR=Local
    - POSTGRES_USER=airflow
    - POSTGRES_PASSWORD=airflow
    - POSTGRES_DB=airflow
    - AIRFLOW__SCHEDULER__STATSD_ON=True
    - AIRFLOW__SCHEDULER__STATSD_HOST=statsd-exporter
    - AIRFLOW__SCHEDULER__STATSD_PORT=8125
    - AIRFLOW__SCHEDULER__STATSD_PREFIX=airflow
    - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
    - AIRFLOW__CORE__FERNET_KEY=pMrhjIcqUNHMYRk_ZOBmMptWR6o1DahCXCKn5lEMpzM=
    - AIRFLOW__CORE__EXECUTOR=LocalExecutor
    - AIRFLOW__CORE__AIRFLOW_HOME=/opt/airflow/
    - AIRFLOW__CORE__LOAD_EXAMPLES=False
    - AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
    - AIRFLOW__WEBSERVER__WORKERS=2
    - AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800
  volumes:
    - ./dags:/opt/airflow/dags
  ports:
    - "8080:8080"
  command: bash -c "airflow initdb && airflow webserver"
  healthcheck:
    test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
    interval: 30s
    timeout: 30s
    retries: 3

A few changes from the puckel/docker-airflow configuration to highlight:

  • Custom parameters such as the AIRFLOW__CORE__SQL_ALCHEMY_CONN that were previously found in the airflow.cfg file are now declared as environment variables in the docker-compose file.
  • The airflow initdb command to initialize the backend database is now declared as a command in the docker-compose file, as opposed to an entrypoint script.
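Airflow derives these environment variable names mechanically from config sections: AIRFLOW__{SECTION}__{KEY}, upper-cased with double underscores as separators. A small sketch of that mapping (the helper function is illustrative, not Airflow’s own API):

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the env var name Airflow reads for a given airflow.cfg option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"

# The [core] sql_alchemy_conn option from airflow.cfg becomes:
print(airflow_env_var("core", "sql_alchemy_conn"))
# AIRFLOW__CORE__SQL_ALCHEMY_CONN
```

This is why no custom airflow.cfg needs to be baked into the image: any option can be overridden from the compose file alone.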

Voila!

There you have it. No more worrying about whether your tasks are sitting in the queue indefinitely instead of running. Airflow runs in Docker, with dashboards and alerting available in Grafana at your fingertips. The same architecture can be deployed on an instance in GCP or AWS for 24/7 monitoring, exactly as it was run locally.

The finished product can be found here: https://github.com/sarahmk125/airflow-docker-metrics

It’s important to note that there’s always room for improvement:

  • This monitoring setup does not capture container or instance failures; a separate or extended solution is needed to monitor at the container or instance level.
  • The current code runs using the LocalExecutor, which is less than ideal for large workloads. Further testing with the CeleryExecutor can be done.
  • There are many more metrics available in StatsD that were not highlighted (such as DAG or task duration, counts of task failures, etc.). More dashboards can be built and provisioned in Grafana to leverage all the relevant metrics.
  • Lastly, this article focuses on a self-hosted (or highly configurable cloud) deployment for Airflow, but this is not the only option for deploying Airflow.

Questions? Comments?

Thanks for reading! I love talking data stacks. Shoot me a message.
