The world's leading publication for data science, AI, and ML professionals.

Why you should try something other than Airflow for data pipeline orchestration

Let's evaluate AWS Step Functions, Google Workflows, and Prefect alongside Airflow

Fan [digital image] by Rajat Sarki, https://unsplash.com/photos/Gx2SU87s4WY

While Airflow has dominated the market in usage and community size as a data pipeline orchestrator, it's a fairly old project and wasn't initially designed to meet some of the needs we have today. Airflow is still a great product, but this article's goal is to raise awareness of the alternatives and of what the right orchestration tool would be for your data use case. Let's evaluate AWS Step Functions, Google Workflows, and Prefect alongside Airflow.

So what are the criteria for a good data orchestration tool nowadays?

API-First design ⚙️

As the cloud providers are API-first, you want your orchestration tool to be the same. Ideally, you want to be able to do a couple of things through the API:

  • Create/delete workflows
  • Serialize & deserialize DAGs easily for non-static/evolving workflows
  • Run parameterized workflows
  • Handle access management
  • Deploy the orchestration tool (if not serverless) through IaC frameworks (Terraform/Pulumi)
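
To make the first three points concrete, here is a minimal sketch of what API-first looks like with AWS Step Functions: the whole workflow is a JSON payload you can build, create, and run through plain API calls. The state names, ARNs, and account IDs below are placeholders, not real resources.

```python
import json

def build_definition(job_name: str, timeout_seconds: int = 300) -> str:
    """Build a minimal Amazon States Language definition as a JSON string.

    Everything here is illustrative; a real pipeline would have more states.
    """
    definition = {
        "Comment": f"Parameterized pipeline for {job_name}",
        "StartAt": "RunJob",
        "States": {
            "RunJob": {
                "Type": "Task",
                # Placeholder ARN -- substitute your own Lambda/Batch resource.
                "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:run-job",
                "TimeoutSeconds": timeout_seconds,
                "End": True,
            }
        },
    }
    return json.dumps(definition)

# Creating and running the workflow through the API then looks roughly like
# this (requires boto3 and valid AWS credentials, hence commented out):
#
# import boto3
# sfn = boto3.client("stepfunctions")
# arn = sfn.create_state_machine(
#     name="my-pipeline",
#     definition=build_definition("daily-load"),
#     roleArn="arn:aws:iam::123456789012:role/sfn-exec",
# )["stateMachineArn"]
# sfn.start_execution(stateMachineArn=arn, input=json.dumps({"date": "2021-01-01"}))
```

Because the definition is just data, generating evolving workflows from config is a serialization step, not a code-deployment step.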

All these features will enable you to connect to all your existing cloud services while using event-driven pipelines to their full potential.

Airflow DAG creation is pretty static, and the API is still quite limited compared to the other tools. While you can have a strategy to automate the deployment of your DAGs, you are still tied to generating a static file somewhere at the end.
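
The "generate a static file" pattern usually boils down to a deployment script that renders a DAG module from a config and drops it into the `dags/` folder Airflow scans. A rough sketch (DAG id, schedule, and task names are illustrative):

```python
# Template for a generated Airflow DAG module. The deploy step renders it and
# writes the result to e.g. dags/my_pipeline.py -- a static file in the end.
DAG_TEMPLATE = '''\
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG("{dag_id}", start_date=datetime(2021, 1, 1), schedule_interval="{schedule}") as dag:
{tasks}
'''

def render_dag(dag_id: str, schedule: str, task_ids: list) -> str:
    """Render a DAG source file from a config-driven list of task ids."""
    tasks = "\n".join(
        f'    BashOperator(task_id="{t}", bash_command="echo {t}")'
        for t in task_ids
    )
    return DAG_TEMPLATE.format(dag_id=dag_id, schedule=schedule, tasks=tasks)
```

This works, but note the contrast with an API-first tool: you are generating source code and waiting for the scheduler to pick it up, rather than posting a workflow definition to an endpoint.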

Prefect does not seem fully suited to dynamic DAG creation either, and follows a similar pattern to Airflow for DAG creation; see the issue here.

Serverless & separation of concerns with the runtime ☁️

There’s always a paradox in serverless. On one hand, we don’t want to manage services and would rather focus on our use case. On the other hand, when something goes wrong or we need a custom feature, it’s a black-box hell.

Nevertheless, managing an Airflow cluster has been a pain in the past, and Kubernetes together with Airflow 2 has solved many issues. Still, we should not underestimate the maintenance cost of a Kubernetes cluster. Aside from that, you will still need to add a couple of things to make sure it's running smoothly: authentication, secrets management, and monitoring of the K8s cluster, for example. With a serverless orchestration tool from the cloud provider you are on, all of this is pretty smooth and built in. With a Kubernetes cluster, you are on your own to maintain or enable these.

Another benefit of a serverless orchestration tool is that you are forced into a clear separation of concerns: use it ONLY for orchestrating tasks, not for actually running them. A dangerous path with Airflow is to use it as a runtime. Again, Kubernetes helped a lot to solve this (see the article here), but it's still your cluster, and the maintenance depends on the tooling you have put in place to monitor it.
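
The "orchestrate, don't run" idea can be sketched as a generic trigger-and-poll pattern: the orchestrator task only submits work to an external runtime and checks its status. The `client` interface below (with `submit` and `status` methods) is an assumption for the sketch, standing in for e.g. a Batch or Kubernetes job client.

```python
import time

def trigger_and_wait(client, job_name, poll_interval=1.0, max_polls=60):
    """Submit a job to an external runtime and poll until it finishes.

    `client` is any object exposing submit(job_name) -> job_id and
    status(job_id) -> "RUNNING" | "SUCCEEDED" | "FAILED". The heavy lifting
    happens in the runtime; the orchestrator only coordinates.
    """
    job_id = client.submit(job_name)
    for _ in range(max_polls):
        state = client.status(job_id)
        if state == "SUCCEEDED":
            return job_id
        if state == "FAILED":
            raise RuntimeError(f"job {job_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

Serverless orchestrators bake this pattern in; with Airflow, the discipline of keeping task code this thin is up to you.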

Integration capabilities ⛓️

What do you want to trigger? Is there any "connector" that enables you to trigger the target runtime without any custom layer?

Airflow has a lot of operators, and if you don't find what you need, you can always build your own. Prefect also has a good list of integrations.

Step Functions has a number of integrations with AWS services, even offering synchronous job execution or wait-for-callback patterns.
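
For illustration, here is roughly what those two patterns look like inside a state machine definition (queue URLs, job names, and account IDs are placeholders): the `.sync` suffix makes the state block until the Batch job finishes, and `.waitForTaskToken` pauses the workflow until an external system calls back with the task token.

```json
{
  "SubmitBatchJob": {
    "Type": "Task",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
      "JobName": "my-job",
      "JobQueue": "my-queue",
      "JobDefinition": "my-job-def"
    },
    "Next": "WaitForApproval"
  },
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
      "QueueUrl": "https://sqs.eu-west-1.amazonaws.com/123456789012/approvals",
      "MessageBody": { "TaskToken.$": "$$.Task.Token" }
    },
    "End": true
  }
}
```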

Google Workflows has also started to add connectors for GCP services.

UI features 🔮

When running complex workflows, it’s essential to have a clear place to observe what went wrong and quickly take action.

You would also like to easily roll back or retry a specific task/sub-task, especially in a data pipeline context. Note that, as a best practice, your data operations should be idempotent.
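
Idempotency is what makes retries safe. A common sketch of the pattern: write by fully overwriting a partition keyed on the run date, so re-running a task for the same date can never duplicate data. Here a plain dict stands in for a partitioned table or object-store prefix.

```python
def write_partition(store: dict, run_date: str, rows: list) -> None:
    """Idempotent write: the partition for run_date is fully replaced,
    so retrying the task for the same date cannot duplicate data.
    """
    store[run_date] = list(rows)  # overwrite, never append

# Running the same day twice leaves exactly one copy of the data:
table = {}
write_partition(table, "2021-01-01", [{"id": 1}, {"id": 2}])
write_partition(table, "2021-01-01", [{"id": 1}, {"id": 2}])  # retry
```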

However, nowadays, a pipeline dashboard alone is not enough. The problem is that you may have silent failures, and you may need other alerts/info to be fed into a central tool. For example, say you have a pipeline that always discards any data that doesn't fit the schema. In such a case, you need another monitoring tool, such as a data quality tool, for data observability.
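
To make the silent-failure example concrete, here is a toy validator that counts discarded rows and raises a flag when the discard rate crosses a threshold. The schema check and threshold are illustrative; a real setup would emit the rate as a metric and wire it to an alert in your observability tool.

```python
def validate_rows(rows, schema_keys, discard_alert_ratio=0.05):
    """Split rows into valid/discarded against the expected keys and flag
    when the silent-discard rate exceeds discard_alert_ratio.
    """
    valid = [r for r in rows if set(r) == set(schema_keys)]
    discarded = len(rows) - len(valid)
    ratio = discarded / len(rows) if rows else 0.0
    # Without this flag, the pipeline "succeeds" while quietly losing data.
    return valid, ratio > discard_alert_ratio
```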

Testing 🏗️

As a developer, you want an easy way to test your pipelines and a development cycle that is as short as possible. At first thought, we may assume that a data orchestrator that can run anywhere (like Airflow/Prefect) would provide the best and smoothest testing experience, right? Well, not really, because running them locally will probably torture your laptop's CPU and boost your fan, generating insane air flows (sorry for the joke, I had to 😁).

With managed Airflow (AWS/Astronomer), you have the possibility to create an Airflow instance on the fly (and automate it through code) for development purposes, but the startup time is not negligible. Yes, even a minute or two is a lifetime for a developer.

So in the end, fully serverless orchestrators like AWS Step Functions/GCP Workflows enable you to test your pipelines rapidly if you leverage IaC frameworks. Besides, you are testing directly in the target environment, with little to no side effects.

Note that AWS provides an emulator of Step Functions for testing purposes here.

In conclusion, let's put some stars! 🌟

This table is just a high-level evaluation. Of course, you should consider other factors, like your team's current knowledge and your existing infrastructure, and make your own benchmark! While I really like Airflow and have been (and still am) a long-time user, I must say that my go-to for most new use cases would be AWS Step Functions or GCP Workflows, depending on the use case.

Dagster is another great tool to consider, but having no prior experience with it, and as they don't provide a cloud-hosted version (though one is in progress according to their website), I didn't take the time to invest in it.

Mehdi OUAZZA aka mehdio 🧢

Thanks for reading! 🤗 🙌 If you enjoyed this, follow me on 🎥 YouTube, ✍️ Medium, or 🔗 LinkedIn for more data/code content!

Support my writing ✍️ by joining Medium through this link
