A journey to Airflow on Kubernetes
… or how I got it to work, piece by piece, in a logical way
A brief introduction
My humble opinion on Apache Airflow: basically, if you have more than a couple of automated tasks to schedule, and you are fiddling around with cron jobs that run even when one of their dependencies fails, you should give it a try. But if you are not willing to just take my word for it, feel free to check these posts.¹ ² ³ Delving into Airflow concepts and how they work is beyond the scope of this article. For that matter, please check these other posts.⁴ ⁵
Long story short, its task definitions are code based, which means they can be as dynamic as you wish. You can create tasks and define your task dependencies based on variables or conditionals. It has plenty of native operators (definitions of task types)⁶ that integrate your workflow with lots of other tools and allow you to run anything from the most basic shell scripts to parallel data processing with Apache Spark, and a plethora of other options. Contrib operators⁷ are also available for a great set of commercial tools, and the list keeps growing every day. These operators are Python classes, so they are extensible and can be modified to fit your needs. You can even create your own operators from scratch, inheriting from the BaseOperator class. Also, it makes your workflow scalable to hundreds or even thousands of tasks with little effort using its distributed executors, such as Celery or Kubernetes.
We have been using Airflow at iFood since 2018. Our first implementation was really nice, based on Docker containers to run each task in an isolated environment. It has undergone a lot of changes since then, from a simple tool serving our team’s workload to a task scheduling platform serving more than 200 people, with a lot of abstractions on top of it. In the end, it does not matter whether you are a software engineer with years of experience or a business analyst with minimal SQL knowledge: you can schedule your task using our platform by writing a yaml file with three simple fields: the ID of your task, the path of the file containing your queries and the names of its table dependencies (i.e. to run my task I depend on the tables orders and users), and voilà, you have your task scheduled to run daily. But this, unfortunately, is a topic for a future article.
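As an illustration only — the actual schema of our platform is internal, so every field name below is hypothetical — such a task file could look like this:

```yaml
# Hypothetical task definition; field names are illustrative,
# not the actual iFood platform schema.
task_id: daily_orders_summary
query_path: queries/daily_orders_summary.sql
dependencies:
  - orders
  - users
```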
Why Kubernetes?
It was obvious that we would need to scale our application from an AWS t2.medium EC2 instance to something more powerful. Our first approach was to scale vertically, to an r4.large instance and then to an r4.xlarge, but the memory usage kept growing.
Our company grows fast. There are dozens of tasks being created every day, and suddenly we would be running on an r4.16xlarge instance. We needed a way to scale the application horizontally and, more than that, to upscale it during peak hours and downscale it at dawn to minimize needless costs. At that point, we were migrating all our platforms to run on a Kubernetes cluster, so why not Airflow?
I searched on the internet, from the official Apache Airflow documentation to Medium articles, digging for information on how to run a simple implementation of Airflow on Kubernetes with the KubernetesExecutor (I was aware of the CeleryExecutor’s existence, but it would not fit our needs, considering that you need to spin up your workers upfront, with no native auto-scaling). I found a lot of people talking about the benefits of running Airflow on Kubernetes, the architecture behind it and a bunch of Helm charts, but little information on how to deploy it, piece by piece, in a logical way for a Kubernetes beginner. And that is the main point of this article. Assuming that you know Apache Airflow and how its components work together, the idea is to show you how you can deploy it to run on Kubernetes, leveraging the benefits of the KubernetesExecutor, with some extra information on the Kubernetes resources involved (yaml files). The examples will be AWS-based, but I am sure that with a little research you can port the information to any cloud service you want, or even run the code on-prem.
Considerations
To fully understand the sections below and get things running, I am assuming that you have:
- An AWS EKS cluster available, or another type of Kubernetes cluster, locally or in a cloud environment.
- Basic “hands-on” knowledge of Kubernetes and the kubectl tool, at least on how to deploy resources and check their descriptions and logs.
- Solid knowledge of Apache Airflow and its units (configuration, scheduler, webserver, database, DAGs and tasks).
If you do not, I recommend that you play a little with Kubernetes and Airflow locally. You can find awesome tutorials on the internet, even on the official websites. For Kubernetes, you can start with the Katacoda tutorials.⁸ Regarding Apache Airflow, this was the first tutorial I ever tried.⁹
The starting point
As a starting point, I found a way to get the Kubernetes resource yaml files from the official Helm chart available in the Airflow git repository.¹⁰ That brought me a lot of resources: some of them came empty (probably because I used the base values.yaml to fill the templates used by Helm) and some of them were useless for the KubernetesExecutor approach (i.e. I do not need a Redis cluster, a Flower resource or a result back-end, as these are specific to Celery). Removing those useless resources, I ended up with around 15 resource files, some of which I did not even know at the time. Kinda overwhelming! I also removed all resources related to the PostgreSQL instance (i.e. pgbouncer), because I knew that I would use an AWS RDS instance, external to the Kubernetes cluster.
Note: I had these charts locally, so when I executed the helm template command, helm whined about not finding the PostgreSQL charts (this will not happen if you are using the Helm repositories). If that is your case, just create the path charts/ inside the folder containing your Helm chart and put the postgresql/ Helm chart folder inside it (available at the official Helm charts GitHub repository). It is also important to notice that the Apache Airflow Helm chart available at https://github.com/helm/charts will bring you a different set of resources when compared to the chart I used.¹⁰
After all the cleaning, I ended up with these 12 resources:
resources
├── configmap.yaml
├── dags-persistent-volume-claim.yaml
├── rbac
│ ├── pod-launcher-rolebinding.yaml
│ └── pod-launcher-role.yaml
├── scheduler
│ ├── scheduler-deployment.yaml
│ └── scheduler-serviceaccount.yaml
├── secrets
│ └── metadata-connection-secret.yaml
├── statsd
│ ├── statsd-deployment.yaml
│ └── statsd-service.yaml
├── webserver
│ ├── webserver-deployment.yaml
│ └── webserver-service.yaml
└── workers
└── worker-serviceaccount.yaml
Thinking of volumes
Most of the articles I found describe two ways to store DAG information: storing the DAGs on a persistent volume accessible from multiple AWS availability zones, such as AWS Elastic File System (EFS), or syncing them from a git repository to an ephemeral volume mounted inside the cluster. With the latter, if the pod dies, the one created to replace it syncs with the repository again to get the latest modifications.
Given our current workflow, we need to build our DAGs dynamically from lots of tasks written in yaml files, meaning that our DAGs are not ready when the files are versioned in a git repository. A simple git-sync to bring in the information would not work for us, but it could be a starting point. Considering that we also needed some kind of persistence for our logs, we decided to go for the EFS approach too, using a kind of hybrid of what we found online: git-sync our yaml files to a PersistentVolume mounted on top of an EFS, and have another pod processing them and throwing the freshly built DAGs into the folder that the scheduler and the webserver are constantly watching to fill the DagBag.
To mount the EFS inside the EKS cluster, I used the official AWS EFS CSI driver,¹¹ which must be installed in the cluster. Beyond the driver, this approach accounts for five Kubernetes resources:
- 2 PersistentVolume(s): DAGs, logs
- 2 PersistentVolumeClaim(s): DAGs, logs (analogous to the previous ones)
- 1 StorageClass
These resources were not present in the initial list, as the original deployments used emptyDir volumes instead.
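If you want to build these by hand, a minimal sketch of the trio, following the static provisioning approach of the EFS CSI driver, could look like this (the file system ID, names and sizes below are placeholders):

```yaml
# StorageClass handled by the EFS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
# PersistentVolume backed by the EFS file system (replace fs-12345678)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: airflow-dags
spec:
  capacity:
    storage: 5Gi              # required by the API, effectively ignored by EFS
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany           # pods in multiple AZs can mount it simultaneously
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678
---
# Claim referenced by the scheduler and webserver deployments
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags
  namespace: airflow-on-k8s
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```

An analogous PersistentVolume/PersistentVolumeClaim pair covers the logs.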
Where should I store my airflow.cfg file?
Anyone who has worked with Apache Airflow for some time knows that the airflow.cfg file (and maybe the webserver_config.py file) is pretty important to set things up. But throwing it into the EFS volume did not seem wise, because of the sensitive information it contains (database passwords, the fernet key). Then I found out that the Kubernetes way to store configuration files is to use a ConfigMap, a kind of “volume” that you mount inside the pods to expose a configuration file to them. And there is the Kubernetes Secret too, to store sensitive data. They work together, so I can reference a Secret inside a ConfigMap, or even pass a Secret to an environment variable. Mission accomplished!
As you learn a little more about Kubernetes, you will notice that “plain” secrets are somewhat unsafe to version in a repository. They contain base64 strings that can be readily decoded in your terminal using the base64 -d command (base64 is an encoding, not encryption). Take a look at the ExternalSecrets API¹² to store your secrets in the AWS Parameter Store and retrieve them from there.
If you check the list of files above, you will notice that the ConfigMap is already there; you just have to customize it.
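To make the idea concrete, here is a minimal sketch — the resource names, keys and values are placeholders, not the chart’s actual ones — of a Secret holding the sensitive values:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: airflow-secrets        # placeholder name
  namespace: airflow-on-k8s
type: Opaque
stringData:                    # stringData spares you from hand-encoding base64
  fernet-key: "replace-me"
  connection: "postgresql+psycopg2://user:password@my-rds-host:5432/airflow"
```

Inside a deployment’s container spec, those values can then be exposed as environment variables that Airflow reads in place of the corresponding airflow.cfg entries:

```yaml
env:
  - name: AIRFLOW__CORE__FERNET_KEY
    valueFrom:
      secretKeyRef:
        name: airflow-secrets
        key: fernet-key
  - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
    valueFrom:
      secretKeyRef:
        name: airflow-secrets
        key: connection
```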
OK, but what about the deployments?
My limited experience with Kubernetes was enough at that point to make me think that I would need at least two deployments: one for the scheduler and one for the webserver. And there they were, lying inside the scheduler and webserver folders generated by the Helm chart explosion. There was a third deployment, of a statsd application, which I later found to be related to metrics collection inside the application. Cool, one less thing to worry about! Prometheus will be happy to scrape it.
I opened the files and noticed that they have some familiar environment variables, related to the fernet key and the database connection string. I filled them with data retrieved from the Kubernetes secrets. I needed to tweak the volumes part a little to match my EFS PersistentVolume and PersistentVolumeClaim.
Inside the scheduler deployment, it is easy to notice shell scripts being executed as init containers. They are related to the database migrations that happen when Airflow starts. The scheduler pod runs the migrations as soon as it starts, and the webserver pod waits for them to finish before starting the webserver container. The webserver deployment has a very similar structure, so I took the liberty of omitting it.
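For reference, a heavily trimmed sketch of that structure — image tags, commands and names here are illustrative, not the exact chart output:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-scheduler
  namespace: airflow-on-k8s
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow-scheduler
  template:
    metadata:
      labels:
        app: airflow-scheduler
    spec:
      serviceAccountName: airflow-scheduler
      initContainers:
        - name: run-migrations              # runs once, before the scheduler starts
          image: apache/airflow:1.10.12     # illustrative tag
          command: ["airflow", "upgradedb"] # "airflow db upgrade" on Airflow 2.x
      containers:
        - name: scheduler
          image: apache/airflow:1.10.12
          command: ["airflow", "scheduler"]
          volumeMounts:
            - name: dags
              mountPath: /opt/airflow/dags
      volumes:
        - name: dags
          persistentVolumeClaim:
            claimName: airflow-dags         # the EFS-backed claim
```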
Webserver is a… server! There should be a service related to it!
There was. A Kubernetes Service resource exposing port 8080 of the container. Later, I included an Ingress resource to give it a friendly AWS Route53 DNS name.
The statsd application also listens on an endpoint and has a Service associated with it. Both services were included in the files exported by the helm template command.
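The webserver Service boils down to something like this (names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: airflow-webserver
  namespace: airflow-on-k8s
spec:
  type: ClusterIP              # the Ingress routes external traffic to it
  selector:
    app: airflow-webserver     # must match the webserver pod labels
  ports:
    - name: web
      port: 8080
      targetPort: 8080
```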
Well, it should be working, right?
I tried to apply those configurations to the cluster. The scheduler and webserver were up, both connected to my external RDS PostgreSQL database. I thought: “If I throw some DAGs into the dags folder, it should work, right?” And it kinda did! I created a simple DAG with one task based on the KubernetesPodOperator, using a container image stored in the AWS Elastic Container Registry. I double-checked that my pod would be allowed to access the ECR repository.
Then I triggered the DAG, but it failed (you really didn’t think it would be that easy, right?). Checking the logs, I noticed that it happened due to some kind of permission issue: my scheduler did not have permission to spawn new pods. And then I understood the need for those ServiceAccount resources scattered among the folders, and the ClusterRole and ClusterRoleBinding stuff. These guys are there to allow your resources to spawn new resources. After all the configuration, I could make my task run successfully. The KubernetesPodOperator also has a service_account_name parameter, which should be filled with the name of a ServiceAccount able to spawn pods, because that is what it will do: spawn another pod with the image you passed as an argument to the image parameter. That pod will be in charge of running your task.
If you want to run tasks directly from your webserver, clicking that “Run” button inside the task menu, you must give your webserver ServiceAccount the permissions to watch and spawn pods too. If you forget that, your tasks will be triggered, but they will never run.
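A hedged sketch of those permissions — resource and ServiceAccount names below are illustrative — granting the scheduler and webserver the right to manage pods in the namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-launcher
  namespace: airflow-on-k8s
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]       # needed to fetch task logs from worker pods
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-launcher-binding
  namespace: airflow-on-k8s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-launcher
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler
    namespace: airflow-on-k8s
  - kind: ServiceAccount
    name: airflow-webserver
    namespace: airflow-on-k8s
```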
Service accounts as trusted entities
If you are running your stuff on AWS, you need to make sure your pods will be able to access all the AWS resources they need, such as S3, DynamoDB tables, EMR, and so on. To do so, you need to bind your ServiceAccount resource to an AWS IAM role with IAM policies attached, granting all the access you need. Just give your IAM role an assume role policy:
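With EKS and IAM Roles for Service Accounts (IRSA), that trust policy looks roughly like this — the account ID, OIDC provider ID, namespace and ServiceAccount name are all placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234567890"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234567890:sub": "system:serviceaccount:airflow-on-k8s:airflow-worker"
        }
      }
    }
  ]
}
```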
The ServiceAccount for your workers and tasks should be linked to the IAM role attached to the policy above. You can do it using annotations:
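A sketch of an annotated ServiceAccount (the role ARN is a placeholder):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow-worker
  namespace: airflow-on-k8s
  annotations:
    # IRSA: pods using this ServiceAccount can assume the IAM role below
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/airflow-worker-role
```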
And now, let there be light!
If you are following this journey as a tutorial, after all the tweaking you can just create all the above resources in your cluster:
kubectl apply -f resources/ --recursive
But, wait! Do not apply them yet! If you are a watchful reader, you noticed that most of the resources above make reference to the airflow-on-k8s namespace. A Namespace is a way to tell Kubernetes that all resources in it are somewhat related (i.e. they are part of the same project) and is a nice way to organize things inside the cluster. You should declare your Namespace resource inside the resources/ folder and apply it before applying everything else, otherwise you will get an error.
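The Namespace resource itself is tiny:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: airflow-on-k8s
```

Applying it first (e.g. kubectl apply -f resources/namespace.yaml, assuming that is where you saved it) guarantees the namespaced resources have somewhere to land.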
FYI, not every Kubernetes resource is namespaced (i.e. PersistentVolume, StorageClass and other low-level resources), and that is why some of them do not have any reference to a namespace.
Architecture
Epilogue
And that was a fast-forward take on my journey to deploy Airflow on Kubernetes. I tried to cover all kinds of resources generated by the helm chart export, but feel free to ask your questions in the comments section if you think I left something behind.
The yaml resources above were taken from a functional deployment I made. Some of them I built from scratch, and others I adapted from the ones I exported. I suggest that you take your time understanding them and making changes for better organization and performance.
There is a lot more you can do to get the most out of this implementation. You can set the limits and requests fields for the containers inside your deployments, to make sure they will have the resources they need to work properly. Going further into the benefits of Kubernetes, you will see that the KubernetesPodOperator allows you to label your pods and pass a lot of Kubernetes configurations to them, such as affinities, tolerations and a bunch of other stuff. If you have tainted nodes, you can ensure that only specific pods will run on them, reserving the most powerful nodes for the most critical tasks.
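As a small sketch of the limits and requests part — the numbers are placeholders you should tune for your own workload — inside a deployment’s container spec:

```yaml
containers:
  - name: scheduler
    image: apache/airflow:1.10.12   # illustrative tag
    resources:
      requests:               # baseline the Kubernetes scheduler reserves for the pod
        cpu: 500m
        memory: 1Gi
      limits:                 # hard ceiling for the container
        cpu: "1"
        memory: 2Gi
```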
If you tried this setup and have something to add, something that worked like a charm or turned out to be a bad choice, please tell us in the comments.
[1]: 10 Benefits to using Airflow: https://medium.com/analytics-and-data/10-benefits-to-using-airflow-33d312537bae
[2]: Why we switched to Apache Airflow: https://www.solita.fi/en/blogs/why-we-switched-to-apache-airflow/
[3]: Why Robinhood uses Airflow: https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8
[4]: Airflow: How and when to use it: https://towardsdatascience.com/airflow-how-and-when-to-use-it-2e07108ac9f5
[5]: Getting started with Apache Airflow: https://towardsdatascience.com/getting-started-with-apache-airflow-df1aa77d7b1b
[6]: Airflow native operators: https://airflow.apache.org/docs/stable/_api/airflow/operators/index.html
[7]: Airflow contrib operators: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/index.html
[8]: Kubernetes on Katacoda: https://www.katacoda.com/courses/kubernetes
[9]: hgrif Airflow Tutorial: https://github.com/hgrif/airflow-tutorial
[10]: Airflow Helm Chart: https://github.com/apache/airflow/tree/master/chart
[11]: Amazon EFS CSI driver: https://github.com/kubernetes-sigs/aws-efs-csi-driver
[12]: Go-daddy Kubernetes External Secrets: https://github.com/godaddy/kubernetes-external-secrets
The Apache Airflow logo is either registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks. The Kubernetes logo files are licensed under a choice of either Apache-2.0 or CC-BY-4.0 (Creative Commons Attribution 4.0 International).