A journey to Airflow on Kubernetes

… or how I got it to work, piece by piece, in a logical way

Marcelo Rabello Rossi
Towards Data Science

--

When the Apache Airflow task landing chart looks more like a Pollock painting (source: author)

A brief introduction

My humble opinion on Apache Airflow: basically, if you have more than a couple of automated tasks to schedule, and you are fiddling around with cron jobs that run even when some of their dependencies fail, you should give it a try. But if you are not willing to just take my word for it, feel free to check these posts.¹ ² ³ Delving into Airflow concepts and how it works is beyond the scope of this article. For that matter, please check these other posts.

Long story short, its task definitions are code-based, which means they can be as dynamic as you wish. You can create tasks and define your task dependencies based on variables or conditionals. It has plenty of native operators (definitions of task types) that integrate your workflow with lots of other tools, allowing you to run anything from the most basic shell scripts to parallel data processing with Apache Spark, and a plethora of other options. Contributed operators are also available for a great set of commercial tools, and the list keeps growing every day. These operators are Python classes, so they are extensible and can be modified to fit your needs. You can even create your own operators from scratch, inheriting from the BaseOperator class. Also, it makes your workflow scalable to hundreds or even thousands of tasks with little effort using its distributed executors, such as Celery or Kubernetes.

We have been using Airflow at iFood since 2018. Our first implementation was really nice, based on Docker containers to run each task in an isolated environment. It has undergone a lot of changes since then, evolving from a simple tool to serve our team's workload into a task scheduling platform that serves more than 200 people, with a lot of abstractions on top of it. In the end, it does not matter if you are a software engineer with years of experience or a business analyst with minimal SQL knowledge: you can schedule your task on our platform by writing a YAML file with three simple fields: the ID of your task, the path of the file containing your queries, and the names of its table dependencies (e.g. to run my task I depend on the tables orders and users). Voilà, your task is scheduled to run daily. But this, unfortunately, is a topic for a future article.
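Just to give a rough idea of what such a file might look like, here is a hypothetical sketch. The field names are illustrative only (the real ones belong to that future article), but the three pieces of information are the ones described above:

```yaml
# Hypothetical task definition -- field names are illustrative, not the real ones
task_id: daily_revenue_report
query_file: queries/daily_revenue_report.sql
depends_on:
  - orders
  - users
```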

Piece of a giant DAG with more than 1000 tasks. Most of them are Apache Spark / Hive jobs scheduled by business analysts or data analysts (source: author)

Why Kubernetes?

It was obvious that we would need to scale our application from an AWS t2.medium EC2 instance to something more powerful. Our first approaches were to scale vertically, first to an r4.large instance and then to an r4.xlarge, but memory usage kept increasing.

Our company is growing fast. Dozens of tasks are being created every day, and soon enough we would be running on an r4.16xlarge instance. We needed a way to scale the application horizontally and, more than that, to scale it up during peak hours and down at dawn to minimize needless costs. At that point, we were migrating all our platforms to run on a Kubernetes cluster, so why not Airflow?

I searched the internet, from the official Apache Airflow documentation to Medium articles, digging for information on how to run a simple implementation of Airflow on Kubernetes with the KubernetesExecutor (I was aware of the CeleryExecutor, but it would not fit our needs, considering that you need to spin up your workers upfront, with no native auto-scaling). I found a lot of people talking about the benefits of running Airflow on Kubernetes, the architecture behind it and a bunch of Helm charts, but little information on how to deploy it, piece by piece, in a logical way for a Kubernetes beginner. And that is the main point of this article. Assuming that you know Apache Airflow and how its components work together, the idea is to show you how you can deploy it on Kubernetes, leveraging the benefits of the KubernetesExecutor, with some extra information on the Kubernetes resources involved (yaml files). The examples will be AWS-based, but I am sure that with little research you can port the information to any cloud service you want, or even run the code on-prem.

Considerations

To fully understand the sections below and get things running, I am assuming that you have:

  1. An AWS EKS cluster available, or another type of Kubernetes cluster, locally or in a cloud environment.
  2. Basic “hands-on” knowledge of Kubernetes and the kubectl tool, at least on how to deploy resources and check their descriptions and logs.
  3. Solid knowledge of Apache Airflow and its components (configuration, scheduler, webserver, database, DAGs and tasks).

If you do not, I recommend that you play a little with Kubernetes and Airflow locally first. You can find awesome tutorials on the internet, even on the official websites. For Kubernetes, you can start with the Katacoda tutorials. Regarding Apache Airflow, this was the first tutorial I ever tried.

The starting point

As a starting point, I found a way to get the Kubernetes resource yaml files from the official Helm chart available in the Airflow git repository.¹⁰ That brought me a lot of resources. Some of them came out empty (probably because I used the base values.yaml to fill the templates used by Helm) and some of them were useless for the KubernetesExecutor approach (e.g. I do not need a Redis cluster, a Flower resource or a result back-end, as these are specific to Celery). Removing those, I ended up with around 15 resource files, some of which I did not even recognize at the time. Kinda overwhelming! I also removed all resources related to the PostgreSQL instance (e.g. pgbouncer), because I knew that I would use an AWS RDS instance, external to the Kubernetes cluster.

How to export the Kubernetes resource yaml files from Apache Airflow helm chart
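In case you want to reproduce the export, a minimal sketch of it looks like the commands below, assuming Helm 3 and the chart/ folder of the Airflow repository (release name and output path are placeholders):

```sh
git clone https://github.com/apache/airflow.git
cd airflow/chart
# Render the chart templates locally, without installing anything,
# and dump the resulting resource yaml files to a folder
helm template airflow . --output-dir ../../resources
```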

Note: I had these charts locally, so when I executed the helm template command, helm complained about not finding the PostgreSQL charts (this will not happen if you are using the Helm repositories). If that is your case, just create the path charts/ inside the folder containing your Helm chart and put the postgresql/ chart folder inside it (available at the official Helm charts GitHub repository). It is also important to note that the Apache Airflow Helm chart available at https://github.com/helm/charts will give you a different set of resources compared to the chart I used.¹⁰

After all the cleaning, I ended up with these 12 resources:

resources
├── configmap.yaml
├── dags-persistent-volume-claim.yaml
├── rbac
│   ├── pod-launcher-rolebinding.yaml
│   └── pod-launcher-role.yaml
├── scheduler
│   ├── scheduler-deployment.yaml
│   └── scheduler-serviceaccount.yaml
├── secrets
│   └── metadata-connection-secret.yaml
├── statsd
│   ├── statsd-deployment.yaml
│   └── statsd-service.yaml
├── webserver
│   ├── webserver-deployment.yaml
│   └── webserver-service.yaml
└── workers
    └── worker-serviceaccount.yaml

Thinking of volumes

Most of the articles I found describe two ways to store DAG information: storing the DAGs on a persistent volume accessible from multiple AWS availability zones, such as the AWS Elastic File System (EFS), or syncing them from a git repository to an ephemeral volume mounted inside the cluster. In the latter case, if the pod dies, the one created to replace it syncs with the repository again to get the latest modifications.

Given our current workflow, we need to build our DAGs dynamically from lots of tasks written in yaml files, which means our DAGs are not ready when the files are versioned in a git repository. A simple git-sync would not work for us, but it could be a starting point. Considering that we also needed some kind of persistence for our logs, we decided to go for the EFS approach too, using a kind of hybrid of what we found online: git-sync our yaml files to a PersistentVolume mounted on top of an EFS, and have another pod process them and throw the freshly built DAGs into the folder that the scheduler and the webserver constantly watch to fill the DagBag.

PersistentVolume configuration to store Apache Airflow DAG files
PersistentVolumeClaim configuration to store Apache Airflow DAG files
StorageClass configuration for an AWS EFS-based storage
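To give you an idea of how these three resources fit together, here is a minimal sketch for the DAGs volume, assuming the EFS CSI driver is installed in the cluster. The names and sizes are illustrative, and fs-12345678 is a placeholder for your EFS file system ID:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-dags
spec:
  capacity:
    storage: 5Gi               # EFS does not enforce this, but the field is required
  accessModes:
    - ReadWriteMany            # scheduler, webserver and task pods all mount it
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678  # placeholder EFS file system ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags
  namespace: airflow-on-k8s
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```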

As shown above, to mount the EFS inside the EKS cluster, I used the official AWS EFS CSI driver,¹¹ which must be installed in the cluster. Beyond the driver, this approach accounts for five Kubernetes resources:

  • 2 PersistentVolume(s): DAGs, logs
  • 2 PersistentVolumeClaim(s): DAGs, logs (analogous to the previous ones)
  • 1 StorageClass

These resources were not present in the initial list, as the original deployments used emptyDir volumes instead.

Where should I store my airflow.cfg file?

Anyone who has worked with Apache Airflow for some time knows that the airflow.cfg file (and maybe the webserver_config.py file) is pretty important to set things up. But throwing it inside the EFS volume did not seem wise, because of the sensitive information it contains (database passwords, fernet key). Then, I found out that the Kubernetes way to store configuration files is the ConfigMap, a kind of “volume” that you mount inside the pods to expose configuration files to them. And there is the Kubernetes Secret too, to store sensitive data. They work together: I can expose a Secret to a container as an environment variable, and the configuration mounted from the ConfigMap picks the value up from there. Mission accomplished!

As you learn a little more about Kubernetes, you will notice that “plain” Secrets are somewhat unsafe to version in a repository. They contain base64 strings that can be readily “decrypted” in your terminal using the base64 -d command. Take a look at the ExternalSecrets API,¹² which lets you store your secrets in the AWS Parameter Store and retrieve them from there.

If you check the list of files above, you will notice that the ConfigMap is already there; you just have to customize it.
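Just as an illustration, a Secret holding the metadata database connection string and the fernet key, and the way a container consumes it, could look like the sketch below. The resource name and keys are placeholders; the environment variable names are the standard Airflow configuration overrides:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: airflow-metadata        # placeholder name
  namespace: airflow-on-k8s
type: Opaque
stringData:                     # stringData accepts plain text; Kubernetes stores it base64-encoded
  connection: postgresql+psycopg2://user:password@my-rds-host:5432/airflow
  fernet-key: replace-me-with-a-real-fernet-key
---
# Fragment of a container spec consuming the Secret as environment variables,
# which override the corresponding values of the mounted airflow.cfg
env:
  - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
    valueFrom:
      secretKeyRef:
        name: airflow-metadata
        key: connection
  - name: AIRFLOW__CORE__FERNET_KEY
    valueFrom:
      secretKeyRef:
        name: airflow-metadata
        key: fernet-key
```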

OK, but what about the deployments?

My little experience with Kubernetes was enough at that point to make me think that I would need at least two deployments: one for the scheduler and one for the webserver. And they were there, lying inside the scheduler and webserver folders generated by the Helm chart explosion. There was a third deployment, of a statsd application, which I later found to be related to metrics collection inside the application. Cool, one less thing to worry about! Prometheus will be happy to scrape it.

I opened the files and noticed that they had some familiar environment variables, related to the fernet key and the database connection string. I filled them with the data retrieved from the Kubernetes Secrets. I also needed to tweak the volumes part a little bit, to match my EFS PersistentVolume and PersistentVolumeClaim.

Deployment configuration for the Airflow scheduler

It is easy to notice the shell scripts being executed as init containers. They are related to the database migrations that happen when Airflow starts: the scheduler pod runs the migrations as soon as it starts, and the webserver pod waits for them to finish before starting the webserver container. The webserver deployment has a very similar structure, so I took the liberty of omitting it.
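For reference, an abridged sketch of the overall structure of the scheduler Deployment is shown below. The image tag, paths and resource names are placeholders, and the migration command differs between Airflow 1.10 (upgradedb) and 2.x (db upgrade); the files exported from the chart contain quite a bit more detail:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler
  namespace: airflow-on-k8s
spec:
  replicas: 1
  selector:
    matchLabels:
      component: scheduler
  template:
    metadata:
      labels:
        component: scheduler
    spec:
      serviceAccountName: airflow-scheduler    # the ServiceAccount discussed later on
      initContainers:
        - name: run-airflow-migrations
          image: apache/airflow:2.0.0           # placeholder image/tag
          args: ["db", "upgrade"]               # database migrations run before the scheduler starts
          # environment variables with the DB connection and fernet key
          # (from the Secret shown earlier) omitted for brevity
      containers:
        - name: scheduler
          image: apache/airflow:2.0.0
          args: ["scheduler"]
          # same environment variables as the init container, omitted for brevity
          volumeMounts:
            - name: dags
              mountPath: /opt/airflow/dags
            - name: config
              mountPath: /opt/airflow/airflow.cfg
              subPath: airflow.cfg
      volumes:
        - name: dags
          persistentVolumeClaim:
            claimName: airflow-dags             # the EFS-backed claim from the volumes section
        - name: config
          configMap:
            name: airflow-config                # placeholder ConfigMap name
```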

Webserver is a… server! There should be a service related to it!

There was: a Kubernetes Service resource exposing port 8080 of the container. Later, I included an Ingress resource to give it a friendly AWS Route53 DNS name.

Service configuration for Airflow webserver
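A minimal sketch of such a Service (a plain ClusterIP, later exposed through the Ingress) could look like this, with names and labels as placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: airflow-webserver
  namespace: airflow-on-k8s
spec:
  type: ClusterIP
  selector:
    component: webserver       # matches the labels on the webserver pods
  ports:
    - name: airflow-ui
      port: 8080
      targetPort: 8080
```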

The statsd application also listens on an endpoint and has a service associated with it. Both services were included in the files exported by the helm template command.

Well, it should be working, right?

I tried to apply those configurations to the cluster. The scheduler and webserver were up, both connected to my external RDS PostgreSQL database. I thought: “If I throw some DAGs into the dags folder, it should work, right?” And it kinda did! I created a simple DAG with one task based on the KubernetesPodOperator, using a container image stored in the AWS Elastic Container Registry (ECR). I double-checked that my pod would be allowed to access the ECR repository.

Then, I triggered the DAG, but it failed (you really didn’t think it would be that easy, right?). Checking the logs, I noticed that it happened due to some kind of permission issue: my scheduler did not have permission to spawn new pods. And then I understood the need for those ServiceAccount resources scattered among the folders, and the ClusterRole and ClusterRoleBinding stuff. These guys are there to allow your resources to spawn new resources. After all the configuration, I could make my task run successfully. The KubernetesPodOperator also has a service_account_name parameter, which should be filled with the name of a ServiceAccount able to spawn pods, because that is what it will do: spawn another pod with the image you passed as an argument to the image parameter. That pod will be in charge of running your task.

ServiceAccount configuration for Airflow scheduler
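Conceptually, the RBAC pieces look like the sketch below: a ServiceAccount, a Role allowed to manage pods, and a RoleBinding tying the two together (a ClusterRole/ClusterRoleBinding pair works the same way, just cluster-wide). All names here are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow-scheduler
  namespace: airflow-on-k8s
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-launcher-role
  namespace: airflow-on-k8s
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-launcher-rolebinding
  namespace: airflow-on-k8s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-launcher-role
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler
    namespace: airflow-on-k8s
```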

If you want to run tasks directly from your webserver, clicking that “Run” button inside the task menu, you must give your webserver ServiceAccount the permissions to watch and spawn pods too. If you forget that, your tasks will be triggered, but they will never run.

Service accounts as trusted entities

If you are running your stuff on AWS, you need to make sure your pods will be able to access all the AWS resources they need, such as S3, DynamoDB tables, EMR, and so on. To do so, you need to bind your ServiceAccount resource to an AWS IAM role with the IAM policies attached that grant all the access you need. Just give your IAM role an assume-role policy:

Policy to allow Kubernetes ServiceAccounts to assume an IAM role with the required permissions

The ServiceAccount for your workers and tasks should be linked to the IAM role attached to the policy above. You can do it using annotations:

Example of a ServiceAccount resource annotating an AWS IAM Role
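On EKS, this follows the IAM Roles for Service Accounts (IRSA) pattern. A sketch of such an annotated ServiceAccount looks like the one below, with the role ARN as a placeholder:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow-worker
  namespace: airflow-on-k8s
  annotations:
    # Placeholder ARN of the IAM role that trusts this ServiceAccount
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/airflow-worker-role
```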

And now, let there be light!

If you are following this journey as a tutorial, after all the tweaking you can just create all the above resources in your cluster:

kubectl apply -f resources/ --recursive

But, wait! Do not apply them yet! If you are a watchful reader, you noticed that most of the resources above make reference to the airflow-on-k8s namespace. A Namespace is a way to tell Kubernetes that all resources in the same namespace are somewhat related (i.e. they are part of the same project) and is a nice way to organize things inside the cluster. You should declare your Namespace resource inside the resources/ folder and apply it before applying everything else, otherwise you will get an error.

Declaration of the airflow-on-k8s namespace. Piece of cake, isn’t it?
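It really is just this:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: airflow-on-k8s
```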

FYI, not every resource on Kubernetes is namespaced (e.g. PersistentVolume, StorageClass and other low-level resources), which is why some of them do not have any reference to a namespace.

Architecture

A minimalistic representation of how this approach works (source: author)

Epilogue

And that was a fast-forward take on my journey to deploy Airflow on Kubernetes. I tried to cover all kinds of resources generated by the Helm chart export, but feel free to ask your questions in the comments section if you think I left something behind.

The yaml resources above were taken from a functional deployment I made. Some of them I built from scratch, and others I adapted from the ones I exported. I suggest that you take your time understanding them and making changes for better organization and performance.

There is a lot more you can do to get the most out of this implementation. You can set the limits and requests fields for the containers inside the deployments, to make sure they will have the resources they need to work properly. Going further into the benefits of Kubernetes, you will see that the KubernetesPodOperator allows you to label your pods and pass a lot of Kubernetes configurations to them, such as affinities, tolerations and a bunch of other stuff. If you have tainted nodes, you can ensure that only specific pods will run on them, reserving the most powerful nodes for the most critical tasks.
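As a hypothetical illustration, a fragment of a pod spec with resource requests and limits, plus a toleration for a tainted node pool, could look like this (the values and the taint key are arbitrary):

```yaml
spec:
  tolerations:
    - key: critical-tasks        # hypothetical taint key on the reserved nodes
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: base
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 2Gi
```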

If you tried this setup and have something to add, something that worked like a charm or turned out to be a bad choice, please tell us in the comments.

The Apache Airflow logo is either registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks. The Kubernetes logo files are licensed under a choice of either Apache-2.0 or CC-BY-4.0 (Creative Commons Attribution 4.0 International).
