Airflow design pattern to manage multiple Airflow projects

Explore data engineering techniques and code to continuously deploy multiple projects on an Airflow instance or as an ECS service

Bhavin
Towards Data Science


Photo by Ian Dooley on Unsplash

Airflow is a great tool for scheduling, visualizing, and executing data pipelines. But if, like me, you have to manage 100+ different pipelines, you quickly realize that developing and maintaining them takes a bit of engineering.

We will explore three aspects of a design pattern that keep the development process simple and manageable.

  1. Separate repository for each project DAG.
  2. Use CI/CD to deploy code.
  3. Use containers to execute code.

1. Project Separation: How do you maintain individual repositories per pipeline?

Airflow automatically initializes all the DAGs under the DAGs home directory specified in the config. But to initialize a DAG.py from separate projects, we will have to install DAGFactory.

DAGFactory: DAGFactory collects and initializes DAGs from the individual projects under a specific folder (in my case, airflow/projects).

airflow
├── README.md
├── airflow-scheduler.pid
├── airflow.cfg
├── dags
│   └── DAGFactory.py
├── global_operators
├── logs
├── projects
│   └── test_project
│       └── DAG.py
├── requirements.txt
├── scripts
├── tests
└── unittests.cfg

  • Install it under airflow/dags/DAGFactory.py.

Airflow initializes all the DAGs under the airflow/dags directory, so we will install DAGFactory there.

airflow
├── README.md
├── airflow-scheduler.pid
├── airflow.cfg
└── dags
    └── DAGFactory.py

Code for DAGFactory.py: traverse all the projects under airflow/projects and load the DAGs from each project's DAGS variable in DAG.py into Airflow's global namespace.
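Below is a minimal sketch of that idea (not the author's original gist), assuming the directory layout shown above: it walks airflow/projects, imports each project's DAG.py, and registers every DAG found in its DAGS list in this module's globals so the scheduler picks them up.

# dags/DAGFactory.py
import importlib.util
import os

from airflow.models import DAG

# Assumption: projects live next to dags/, i.e. airflow/projects/<project>/DAG.py
PROJECTS_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "projects")


def _load_dags_from(dag_file, project_name):
    """Import a project's DAG.py as a module and return its DAGS list."""
    spec = importlib.util.spec_from_file_location(f"{project_name}.DAG", dag_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, "DAGS", [])


for project in sorted(os.listdir(PROJECTS_DIR)):
    dag_file = os.path.join(PROJECTS_DIR, project, "DAG.py")
    if not os.path.isfile(dag_file):
        continue
    for dag in _load_dags_from(dag_file, project):
        if isinstance(dag, DAG):
            # Adding the DAG to this module's namespace is what makes the
            # Airflow DagBag (and therefore the scheduler) pick it up.
            globals()[dag.dag_id] = dag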

This allows us to keep each project's code in a separate repository. All we have to do is:

  1. Have a DAG.py file at the root level of your repo, and
  2. Have a list named DAGS with all the main DAGs in it. (Yes, you can pass multiple DAGs; each will appear as a separate DAG in the UI, and all of them can be maintained from the same code base. A sketch of such a DAG.py follows this list.)
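For reference, here is a minimal sketch of a project-level DAG.py, assuming Airflow 2.x import paths; the DAG id, schedule, and bash command are placeholders:

# projects/test_project/DAG.py
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="test_project",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(task_id="hello", bash_command="echo 'hello from test_project'")

# DAGFactory reads this list; export more than one DAG here if the project needs it.
DAGS = [dag]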

2. How do you get a project's code into the production Airflow service?

  1. If you run Airflow on a VM, you can git checkout the project under airflow/projects.
  2. You can use a configuration management tool like Chef to deploy new projects on the Airflow VM.
  3. If you run Airflow as a service on EKS (or GKE on GCP), you can use CodeBuild or CodePipeline on AWS to build and deploy your Kubernetes service.

I will write an article on each of these soon.

3. Create complex DAGs vs. write complex code and execute it in containers.

As the number of DAGs grows and their complexity increases, it takes more resources to execute them. That means that if your production Airflow environment runs on a VM, you will have to scale up to keep up with the increased demand. (If you use the Kubernetes executor or another remote executor, this won't apply to you.)

To overcome this, instead of executing all the DAGs on a single machine, I recommend creating a separate container for each project and executing it in ECS via Airflow.

This approach lets me keep my Airflow instance size to a minimum; the containers execute on Fargate in ECS, and I only pay for the time they need to run.

Because DAGFactory lets me keep each codebase separate, I can have DAG.py and the Dockerfile in the same repository, put DAG.py on the Airflow instance, and push the Docker image to ECR.
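As a sketch of what such a DAG can look like, here is one that launches the project's container on Fargate using the Amazon provider's EcsRunTaskOperator (called ECSOperator in older provider versions). The cluster, task definition, container name, and subnet are placeholders, not values from this article:

# projects/test_project/DAG.py
from datetime import datetime

from airflow.models import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="test_project_on_fargate",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = EcsRunTaskOperator(
        task_id="run_etl_container",
        cluster="my-ecs-cluster",             # placeholder cluster name
        task_definition="test-project-task",  # placeholder; its image lives in ECR
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "test-project", "command": ["python", "etl.py"]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],  # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
    )

# Only this thin DAG lives on the Airflow instance; the heavy lifting runs in ECS.
DAGS = [dag]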

A drawback of this approach is that you won't be able to use Airflow operators inside your ETL or other processing. But you can still build your own library of modules and use it across different pipelines.

So why keep Airflow?

Well, Airflow comes with a great UI and a scheduler, which make it easy to track job failures, kick off a job (a container in this case), and look at the logs in one place.

In short, no need to log in to an instance to debug issues.
