Apache Airflow: 10 Rules to Make It Work (at Scale)

Airflow is very permissive by default, and without strict rules you are likely to create a chaotic code base that is impossible to scale and administer.

raphaelauv
Towards Data Science


If you are not careful, your shortcuts will cost you a lot later.

Airflow's permissive approach lets you schedule any custom code (jobs), but you will end up with a spaghetti stack if you do not enforce a strict SEPARATION OF CONCERNS between your Airflow DAGs and your jobs.

Airflow lets you run your jobs without any isolation from the framework itself.

Originally, Airflow was a sort of “super cron”, so the way jobs were run was heavily coupled to the framework itself. Today, the biggest challenge you must overcome is removing the coupling between scheduling jobs and running them.

Nowadays (2023), GCP, AWS, and Azure all provide managed Airflow 2. That is proof enough that if you set up and use the framework correctly, it works!

The 10 rules:

Follow them so that your Airflow deployment scales only with the number of DAGs and tasks to run, not with what those tasks actually do.

1) Airflow is an orchestration framework, not an execution framework:

For your jobs, you should only use operators that trigger the computation outside of Airflow, in a separate processing solution such as:

  • a container: K8S, GKE, EKS, ECS, Cloud Run …
  • a serverless function: AWS Lambda, Cloud Functions …
  • a Spark job: EMR, Dataproc …
  • a query: BigQuery, Redshift, ClickHouse, Trino, Databend …

Running your jobs directly inside Airflow (performing substantial CPU, memory, or IO work) would put pressure on your airflow_workers. Airflow should only run operators that launch tasks in a separate processing solution and assert their status.
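As an illustration, here is a minimal sketch of such a DAG using a KubernetesPodOperator (the image name, registry, and namespace are assumptions, and the import path depends on your cncf-kubernetes provider version): Airflow only describes what to launch, while the actual work runs in the container.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="daily_crawler",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    crawl = KubernetesPodOperator(
        task_id="crawl_api",
        name="crawl-api",
        namespace="jobs",                            # hypothetical namespace
        image="registry.example.com/crawler:1.4.2",  # the job code lives in this image
        arguments=["--date", "{{ ds }}"],            # pass the scheduling context as arguments
        get_logs=True,
    )
```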

2) Do not use the PythonOperator (and other magic) for your jobs:

If you use a PythonOperator, it should only run very simple code that does nothing more than simple IO operations (like transforming a small XCom); otherwise, run your job with one of the operators from rule 1.
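For instance, a PythonOperator should stay at this level of triviality (a rough sketch to be placed inside a DAG definition; the task ids are hypothetical): it only reshapes a small XCom pushed by an upstream task.

```python
from airflow.operators.python import PythonOperator


def extract_job_id(ti):
    # pull a small XCom pushed by an upstream "submit_job" task and keep only the id
    submit_result = ti.xcom_pull(task_ids="submit_job")
    return submit_result["job_id"]


format_job_id = PythonOperator(
    task_id="format_job_id",
    python_callable=extract_job_id,
)
```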

Do not use the PythonVirtualenvOperator, ExternalPythonOperator, BashOperator, or DockerOperator (unless you trigger the run on a host that is not part of your Airflow deployment); use an operator like the KubernetesPodOperator or EmrCreateJobFlowOperator instead.

Do not use TaskFlow decorators (@task.kubernetes …) either: they are nice magic for a POC, but they lack separation of concerns. Having your job logic (code) inside the scheduling logic (code) is tight coupling that will reduce your ability to administer your Airflow DAGs.

3) Choose your operators wisely:

3.1) Check the existing operators before creating one

In 99% of cases you should not create a new Airflow operator; look very carefully at the existing ones first. Check whether what you are trying to do is possible with a combination of existing operators (for example, a first task using a SubmitXXOperator followed by a second task using a SensorXXOperator, as sketched below).
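For example, on EMR the submit + sensor combination looks roughly like this (the cluster id, step definition, and S3 path are assumptions):

```python
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = [{
    "Name": "compute_daily_aggregates",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/jobs/aggregates.py", "{{ ds }}"],
    },
}]

# first task: submit the step to an existing EMR cluster
submit_step = EmrAddStepsOperator(
    task_id="submit_step",
    job_flow_id="j-XXXXXXXX",   # hypothetical cluster id
    steps=SPARK_STEP,
)

# second task: wait for the step to finish
wait_for_step = EmrStepSensor(
    task_id="wait_for_step",
    job_flow_id="j-XXXXXXXX",
    step_id="{{ ti.xcom_pull(task_ids='submit_step')[0] }}",
)

submit_step >> wait_for_step
```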

If you really feel the need for a new operator, ask the Airflow community on GitHub or Slack; in most cases they will show you a way to do it with existing operators.

If you absolutely need to customize an existing operator, extend the existing class; do not recreate everything from scratch.

In my current company we created only one operator, a BigQueryInsertJobToXCOMOperator (to store the small first-row result of a query inside the XCom return_value).
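The sketch below is not our exact implementation, but it shows that kind of extension: it builds on BigQueryInsertJobOperator, assumes the job configuration contains a standard "query" block, and for simplicity re-issues the query to fetch the first row (a production version would read the finished job's results instead).

```python
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


class BigQueryInsertJobToXCOMOperator(BigQueryInsertJobOperator):
    """Run the query through the parent class, then return the first row as the XCom."""

    def execute(self, context):
        # the parent operator submits the job and waits for its completion
        super().execute(context)
        # fetch the (small) first row of the result
        hook = BigQueryHook(gcp_conn_id=self.gcp_conn_id, use_legacy_sql=False)
        return hook.get_first(self.configuration["query"]["query"])
```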

3.2) Do not use every existing operator (many are inefficient)

Some open source Airflow operators are just helpers (they run Python code directly instead of delegating the operation to an external solution).

For example, the SimpleHttpOperator or the GCSToGoogleDriveOperator run the operation directly as Python code on your airflow_worker(s) if you are not using the KubernetesExecutor or the CeleryKubernetesExecutor.

In that case it is better to use a KubernetesPodOperator (running a container that performs the desired operation) rather than these kinds of operators.
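For example, instead of a SimpleHttpOperator you can run the HTTP call in a throwaway container (a minimal sketch; the endpoint and namespace are assumptions, and the import path depends on your provider version):

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

call_api = KubernetesPodOperator(
    task_id="call_api",
    name="call-api",
    namespace="jobs",                     # hypothetical namespace
    image="curlimages/curl:8.5.0",        # a public curl image, no custom code needed
    cmds=["curl"],
    arguments=["-sf", "https://api.example.com/refresh"],   # hypothetical endpoint
    get_logs=True,
)
```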

3.3) Use Deferrable mode (AsyncOperators)

All long-running operators (KubernetesPodOperator …) occupy an airflow_worker slot for the entire run of the triggered task. So if you want to run 50 KubernetesPodOperator tasks at the same time and each airflow_worker can run at most 6 operators, then you are going to need 9 airflow_worker nodes.

But if you run those 50 KubernetesPodOperator tasks in deferrable mode, then you only need 1 airflow_triggerer node and 1 airflow_worker node.
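A minimal sketch (deferrable mode on the KubernetesPodOperator requires a running airflow_triggerer and a recent cncf-kubernetes provider; names and image are hypothetical):

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

run_job = KubernetesPodOperator(
    task_id="run_job",
    name="run-job",
    namespace="jobs",
    image="registry.example.com/my-job:1.0.0",
    deferrable=True,   # the airflow_worker slot is released while the pod is running
    get_logs=True,
)
```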

4) Do not install any custom dependency in your Airflow deployment:

The only allowed dependencies are the community-supported Airflow providers (apache-airflow-providers-XXX), such as:

- apache-airflow-providers-cncf-kubernetes
- apache-airflow-providers-google
- apache-airflow-providers-slack

These are the only packages for which the Airflow community ensures good compatibility with Airflow itself. Installing any other dependency puts the deployment in a fragile state and can lead to dependency hell when upgrading.

I do not recommend installing custom Airflow providers that are not part of the official list (also because most of them are inefficient helper operators that run code directly in your airflow_worker(s)):

- https://pypi.org/project/airflow-dbt/
- https://github.com/great-expectations/airflow-provider-great-expectations

5) Airflow is NOT a data lineage solution:

Airflow is a scheduler running tasks defined in operators. Currently, Airflow only has very limited (beta) lineage capabilities, which allow it to integrate with third-party solutions through the OpenLineage standard (such as Marquez).
You should absolutely avoid re-implementing a custom, home-made solution for this, no matter how tempting.

6) Airflow is NOT a data storage solution:

Airflow operators can return data that Airflow stores in its internal database, the airflow_db (backed by a traditional RDBMS such as PostgreSQL). The data stored in the airflow_db is called an XCom.

But the airflow_db is not supposed to store custom data, only very small metadata (like our BigQueryInsertJobToXCOMOperator usually returning a single value, such as a timestamp).

The best practice is NOT to return data from operators. If you have a special case where an operator must return data (like a complex JSON of multiple lines), you need to use a custom xcom_backend to write the data somewhere other than the airflow_db (like S3 or GCS).

Your custom xcom_backend should only be used for the tasks that you explicitly choose (be careful: by default, a custom xcom_backend will be used by Airflow for all XComs).
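A minimal sketch of such a backend (it assumes Airflow 2.3+, the Amazon provider, a hypothetical bucket, and an opt-in per XCom key): only values pushed under the key "large_result" go to S3, everything else stays in the airflow_db. It is then enabled through the xcom_backend option in the [core] section of the Airflow configuration.

```python
import json
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "my-xcom-bucket"   # hypothetical bucket
PREFIX = "s3-xcom://"


class S3XComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value, *, key=None, **kwargs):
        if key == "large_result":            # explicit opt-in for chosen tasks/keys
            s3_key = f"xcom/{uuid.uuid4()}.json"
            S3Hook(aws_conn_id="aws_default").load_string(
                json.dumps(value), key=s3_key, bucket_name=BUCKET
            )
            value = PREFIX + s3_key          # only the pointer goes to the airflow_db
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith(PREFIX):
            data = S3Hook(aws_conn_id="aws_default").read_key(
                key=value[len(PREFIX):], bucket_name=BUCKET
            )
            return json.loads(data)
        return value
```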

In any case, if you run a job with an external processing solution like a KubernetesPodOperator, then the job itself is responsible for writing its result to storage (like S3 or GCS), at a path that depends on the context provided by the scheduler.

7) Do not put secrets in Airflow variables or connections:

Airflow can store variables that DAGs can access via templating, so a DAG always retrieves the latest version of the variable at run time; the same goes for connections.

But if a variable or a connection stores a secret value (like a private key), do NOT register it directly in Airflow; you MUST register it in a secret store (Vault, GCP Secret Manager …) and use a secrets backend in Airflow.

Your custom secrets backend should only look up the variables and connections that you explicitly defined in it (be careful: by default, Airflow first tries to find every variable and connection in the custom secrets backend, which produces warning logs for normal Airflow variables and connections).
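For example, with the GCP Secret Manager backend you can restrict lookups to explicitly prefixed names (a minimal airflow.cfg sketch; the prefixes and project id are assumptions):

```ini
[secrets]
backend = airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
backend_kwargs = {"connections_prefix": "airflow-connections", "variables_prefix": "airflow-variables", "project_id": "my-gcp-project"}
```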

8) Your jobs must be as agnostic of the scheduling as possible:

If you rerun a job, it must be idempotent, and you must also be able to run your job outside of Airflow.

Example: if you have a small job that crawls an API and writes the result to S3, you must be able to run that job in a container on your own computer, because the job takes all of its context from environment variables or arguments. This ensures complete decoupling between Airflow and the actual computation (clear boundaries between the scheduler and the task implementation).
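A minimal sketch of such a job (all names, URLs, and buckets are hypothetical): everything it needs comes from arguments and environment variables, so it runs the same way from Airflow, from a docker-compose, or from your laptop.

```python
import argparse
import json
import os

import boto3
import requests


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", required=True)   # provided by the scheduler, e.g. "{{ ds }}"
    args = parser.parse_args()

    api_url = os.environ["API_URL"]                # context comes from env vars, not from Airflow
    bucket = os.environ["OUTPUT_BUCKET"]

    payload = requests.get(f"{api_url}?date={args.date}", timeout=30).json()

    # idempotent: rerunning the same date overwrites the same S3 key
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"crawler/{args.date}.json",
        Body=json.dumps(payload),
    )


if __name__ == "__main__":
    main()
```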

A good practice is to keep the code of your jobs in a separate repository from your DAGs (because the DAGs only need to know the name and version of the container that implements each job).

Then, in the jobs repository, associate a small docker-compose file with every job so it can be run in a fixed context.

9) Do not deploy the different Airflow components together:

If you deploy Airflow on K8S with the official Helm chart, this is the default behavior.

But if you go for a manual deployment, then you must isolate the airflow_scheduler, the airflow_webserver, and the airflow_workers so you can scale them depending on your use cases.

10) LocalExecutor and KubernetesExecutor are not flexible:

The LocalExecutor is only vertically scalable, because tasks run in the same place as the scheduler (having HA with multiple schedulers does not mean you should run N schedulers to horizontally scale your Airflow stack). It is therefore discouraged for production environments.

The KubernetesExecutor puts the pressure directly on the K8S cluster (a new pod for every task is overkill), and it is harder to detect and solve scaling problems with that executor.

The CeleryExecutor is a good trade-off: all operators run directly on the airflow_workers, and pods are only created in your K8S cluster when you use a KubernetesPodOperator.

More details in the -> Airflow executors documentation

Thank you and a big thanks to all Apache Airflow committers
My Github -> https://github.com/raphaelauv


