Dagster, Airflow and Prefect: A Deep Dive

Pedram Navid
Towards Data Science
8 min read · Jan 10, 2022

Editor’s note: this post was updated on May 17, 2024.

One of the great things about the Modern Data Stack is the interoperability with all the different components that make up the stack, but the unfortunate truth is that orchestration is still a tricky problem. Data pipelines still involve custom scripts and logic that don’t fit perfectly into the ETL workflow. Whether it’s custom internal services, or something as simple as downloading a file, unzipping it, and reading its contents, there’s still a need for orchestration tools.

While the Modern Data Stack provides a lot of flexibility with its various components that play well together, orchestrating workflows across the stack still isn’t a solved problem.

Enter orchestration tools like Dagster, Apache Airflow, and Prefect. These tools are the bread and butter of data engineering teams. Apache Airflow, the oldest of the three, is a battle-tested and reliable solution that was born out of Airbnb and created by Maxime Beauchemin. Data engineering was a different world back then, largely focused on regularly scheduled batch jobs that often involved unwieldy systems with names like Hive and Druid. You still see a lot of this heritage in Airflow today.

Prefect and Dagster are newer products supported by their cloud offerings, Prefect Cloud and Dagster Plus. Prefect Cloud is free to start and hosts a scheduler, while the hybrid architecture allows you to run your tasks locally or on your infrastructure. Dagster Plus offers both hybrid and serverless options, which allows you to have the flexibility of either running workers on your internal clusters or offloading that work to Dagster Plus.

What is Airflow and what are its top alternatives?

Airflow is a workflow orchestration tool used for orchestrating distributed applications. It works by scheduling jobs across different servers or nodes using DAGs (Directed Acyclic Graphs). Apache Airflow provides a rich user interface that makes it easy to visualize the flow of data through the pipeline. You can also monitor the progress of each task and view logs.

If you’ve ever been confused by how dates work in Airflow, you’ll have seen some of that heritage. It’s rare that someone new to Airflow doesn’t get confused by what an execution date is and why their DAG isn’t running when they expect it to. All of this stems from the days of running daily jobs sometime after midnight.
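The confusion is easiest to see once the interval semantics are spelled out: for a scheduled DAG, the execution date marks the *start* of a data interval, and the run only fires once that interval has closed. A minimal sketch in plain Python (the dates are illustrative, not tied to any real DAG):

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)

# For a daily DAG, the run stamped with this execution date...
execution_date = datetime(2022, 1, 9)  # start of the data interval
# ...does not actually start until the interval has closed:
actual_run_time = execution_date + schedule_interval

print(actual_run_time)  # 2022-01-10 00:00:00 -- a full day "later" than the stamp
```

So a newcomer looking for their "January 10th run" finds it labeled January 9th, which is exactly the surprise described above.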

A look at Airflow 1.4 from many years ago (source: Airflow on GitHub)

Airflow is now an Apache project, and its adoption by the Apache Foundation has cemented the project's status as the status-quo open-source orchestration and scheduling tool. Today, thousands of companies use Airflow to manage their data pipelines, and you’d be hard-pressed to find a major company that doesn’t have a little Airflow in their stack somewhere. Companies like Astronomer and AWS even provide managed Airflow as a Service, so that the infrastructure around deploying and maintaining an instance is no longer a concern for engineering teams.

That said, with the changing data landscape, Airflow often encounters hurdles when it comes to testing, non-scheduled workflows, data modeling, parameterization, data transfer, and storage abstraction.

The Benefits of Airflow

Before we dig into some of those pitfalls, it’s important to mention some of the benefits of Airflow: Without a doubt, a project that has been around for over a decade, has the support of the Apache Foundation, is entirely open-source, and used by thousands of companies is a project worth considering.

For example, there are two orders of magnitude more questions on Stack Overflow for Airflow than any of the other competitors. Odds are if you’re having a problem, you’re not alone, and someone else has hopefully found a solution. There are also Airflow Providers for nearly any tool you can think of, making it easy to create pipelines with your existing data tools.

The Problems with Airflow

As the data landscape continued to evolve, data teams were doing much more with their tools. They were building out complex pipelines to support data science and machine learning use cases, ingesting data from various systems and endpoints to collect them into warehouses and data lakes, and orchestrating workflows for end-user tools across multiple data systems. Airflow was the only real orchestration tool available for a while, so many teams tried to squeeze their increasingly complex demands into Airflow, often hitting a brick wall.

The main issues we’ve seen with Airflow deployments fall into one of several categories:

  • local development, testing, and storage abstractions
  • one-off and irregularly scheduled tasks
  • the movement of data between tasks
  • dynamic and parameterized workflows

We’ll dive into each of these issues by exploring how two alternative tools, Dagster and Prefect, address them.

A Look at Dagster

Dagster is a relatively young project, started in April 2018 by Nick Schrock, who previously co-created GraphQL at Facebook. Similarly, Prefect was founded in 2018 by Jeremiah Lowin, who used his learnings as a PMC member of Apache Airflow to design Prefect.

Both projects are approaching a common problem but with different driving philosophies. Dagster takes a first-principles approach to data engineering. It is built with the full development lifecycle in mind, from development to deployment to monitoring and observability.

Prefect, on the other hand, adheres to a philosophy of negative engineering, built on the assumption that the user knows how to code. It makes it as simple as possible to build that code into a distributed pipeline, backed by its scheduling and orchestration engine. However, that simplicity can sometimes come at a cost when trying to address more complex data pipelines.

Both projects have gained considerable momentum and are rapidly improving. Let’s examine how these two projects handle some of the challenges Airflow struggles with.

Local Development and Testing

With Airflow, local development and testing can be a nightmare. If your production Airflow instance uses Kubernetes as an execution engine, then your local development will need Kubernetes locally. A task written with the S3Operator requires a connection to AWS S3 to run, which is not ideal for local development.

import pandas as pd
from io import StringIO

from dagster import Definitions, asset
from dagster_aws.s3 import S3Resource

@asset
def upload_to_s3(s3: S3Resource):
    path_file = "test.csv"
    df = pd.read_csv(path_file)
    csv_buffer = StringIO()
    df.to_csv(csv_buffer)
    # S3Resource wraps a boto3 client; put_object uploads the CSV contents
    s3.get_client().put_object(
        Body=csv_buffer.getvalue(), Bucket="my-bucket", Key="test.csv"
    )

defs = Definitions(assets=[upload_to_s3], resources={"s3": S3Resource()})

With Dagster, you define your pipelines using an asset-centric approach rather than as tasks. Dagster’s modular architecture lets you provide custom resources as inputs into your pipelines that can be swapped based on environments. You may want to use S3 in production but minio in development for faster iteration.
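The underlying pattern is ordinary dependency injection, and it can be sketched in plain Python without Dagster at all. Everything here is illustrative: `InMemoryStorage` stands in for a cheap dev resource like minio, and a production deployment would inject an S3-backed object with the same `put` interface.

```python
from dataclasses import dataclass

class InMemoryStorage:
    """Stand-in for a dev resource such as minio or a local filesystem."""

    def __init__(self):
        self.blobs = {}

    def put(self, key, data):
        self.blobs[key] = data

@dataclass
class Pipeline:
    # In production, this would be an S3-backed resource with the same interface.
    storage: InMemoryStorage

    def upload(self, key, data):
        self.storage.put(key, data)

# In dev, inject the cheap local resource; the pipeline logic never changes.
dev_pipeline = Pipeline(storage=InMemoryStorage())
dev_pipeline.upload("test.csv", "a,b\n1,2")
print(dev_pipeline.storage.blobs["test.csv"])
```

Because the pipeline only depends on the resource's interface, tests and local runs never need AWS credentials or network access.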

Prefect also supports a level of abstraction over storage, albeit through RunConfigs.

from prefect import Flow
from prefect.run_configs import KubernetesRun

# Set run_config as part of the constructor
with Flow("example", run_config=KubernetesRun()) as flow:
    ...

However, this doesn’t provide the same level of abstraction as Dagster, which can make local development trickier. For Prefect, parametrization is the focus of local development: by parametrizing your Flows, you can provide smaller datasets for local development and larger ones for production use.

Scheduling Tasks

In Airflow, off-schedule tasks can cause a lot of unexpected issues, and every DAG needs some type of schedule. Running multiple instances of a DAG with the same execution time is not possible.

With Prefect, a Flow can be run at any time, as workflows are standalone objects. While an Airflow DAG often waits 5–10 seconds after its scheduled time before running, due to the way the scheduler works, Prefect allows incredibly fast scheduling of flows and tasks by taking advantage of tools like Dask.

Similarly, Dagster allows a lot of flexibility for both manual runs and scheduled DAGs. You can even modify the behavior of a particular job based on the schedule itself, which can be incredibly powerful: for example, providing different run-time configurations on weekends versus weekdays.

from dagster import RunRequest, ScheduleEvaluationContext, schedule

@schedule(job=configurable_job, cron_schedule="0 0 * * *")
def configurable_job_schedule(context: ScheduleEvaluationContext):
    scheduled_date = context.scheduled_execution_time.strftime("%Y-%m-%d")
    return RunRequest(
        run_key=None,
        run_config={"ops": {"configurable_op": {"config": {"scheduled_date": scheduled_date}}}},
        tags={"date": scheduled_date},
    )

And running a job in Dagster is as simple as:

dagster job execute -f hello_world.py

Data Flow in Airflow, Prefect, and Dagster

One of the biggest pain points with Airflow is the movement of data between related tasks. Traditionally, each task would have to store data in some external storage device, pass information about where it is stored using XComs (let’s not talk about life before XComs), and the following task would have to parse that information to retrieve the data and process it.
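That round-trip is easier to see in a minimal sketch, with dictionaries standing in for external storage and for Airflow’s XCom table (the task names and path are made up for illustration):

```python
external_storage = {}  # stand-in for S3, GCS, or another storage system
xcom = {}              # stand-in for Airflow's XCom table

def extract_task():
    data = [1, 2, 3]
    external_storage["s3://bucket/data.json"] = data  # write the data out
    xcom["extract"] = "s3://bucket/data.json"         # push only the path

def transform_task():
    path = xcom["extract"]         # pull the path from XCom
    data = external_storage[path]  # re-read the data from storage
    return [x * 2 for x in data]

extract_task()
print(transform_task())  # [2, 4, 6]
```

Every handoff pays the cost of a write, a reference push, a reference pull, and a re-read, which is exactly the boilerplate Dagster and Prefect aim to remove.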

In Dagster, the inputs and outputs of jobs can be made much more explicit.

import csv

import requests

from dagster import get_dagster_logger, job, op

@op
def download_cereals():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    return [row for row in csv.DictReader(lines)]

@op
def find_sugariest(cereals):
    sorted_by_sugar = sorted(cereals, key=lambda cereal: cereal["sugars"])
    get_dagster_logger().info(
        f'{sorted_by_sugar[-1]["name"]} is the sugariest cereal'
    )

@job
def serial():
    find_sugariest(download_cereals())

The above example clearly shows that the download_cereals op returns an output and the find_sugariest op accepts an input. Dagster also provides an optional type hinting system to enhance the testing experience, something not possible in Airflow tasks and DAGs.

from dagster import In, Out, get_dagster_logger, op

# SimpleDataFrame is a custom Dagster type defined elsewhere
@op(out=Out(SimpleDataFrame))
def download_csv():
    response = requests.get("https://docs.dagster.io/assets/cereal.csv")
    lines = response.text.split("\n")
    get_dagster_logger().info(f"Read {len(lines)} lines")
    return [row for row in csv.DictReader(lines)]

@op(ins={"cereals": In(SimpleDataFrame)})
def sort_by_calories(cereals):
    sorted_cereals = sorted(cereals, key=lambda cereal: cereal["calories"])
    get_dagster_logger().info(
        f'Most caloric cereal: {sorted_cereals[-1]["name"]}'
    )

In Prefect, inputs and outputs are also clear and easy to wire together.

with Flow("Aircraft-ETL") as flow:
    reference_data = extract_reference_data()
    live_data = extract_live_data()
    transformed_live_data = transform(live_data, reference_data)
    load_reference_data(reference_data)
    load_live_data(transformed_live_data)

The transform function accepts the outputs from both reference_data and live_data. For large files and expensive operations, Prefect even offers the ability to cache and persist inputs and outputs, improving development time when debugging.
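The caching idea itself is plain memoization, and can be sketched without Prefect (Prefect configures this on the task rather than with a decorator like this; the URL and counter here are only for illustration):

```python
import functools

call_count = {"n": 0}

@functools.lru_cache(maxsize=None)
def expensive_extract(url):
    # Imagine a slow download or query here; because the result is
    # cached, repeated debugging runs skip the expensive work.
    call_count["n"] += 1
    return f"data from {url}"

expensive_extract("https://example.com/big.csv")
expensive_extract("https://example.com/big.csv")  # served from cache
print(call_count["n"])  # 1 -- the expensive body ran only once
```

During debugging, this means re-running a flow replays downstream logic against cached inputs instead of re-downloading large files on every iteration.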

Dynamic Workflows

Another great feature of both Dagster and Prefect that is missing in Airflow is an easy interface for creating dynamic workflows.

In Prefect, parameters can be explicitly specified in the Cloud interface or provided to the Flow runner. This makes scaling out to large, complex computations easy, allowing for sane initial development as you work on your pipelines.
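A parameterized pipeline in the spirit of Prefect’s Parameters can be sketched in plain Python; the function and parameter names here are illustrative, not Prefect’s API:

```python
def run_pipeline(row_limit=100):
    """row_limit plays the role of a Prefect Parameter: small for local
    development, large (or None for everything) in production runs."""
    rows = list(range(1_000_000))            # stand-in for a real extract
    subset = rows[:row_limit] if row_limit else rows
    return sum(subset)                       # stand-in for a real transform

print(run_pipeline(row_limit=10))    # fast local run over a tiny slice: 45
print(run_pipeline(row_limit=None))  # full "production" run
```

The same pipeline code serves both environments; only the parameter value changes between a quick local iteration and a full production run.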

In Dagster, you can define a graph and then parametrize it to allow for dynamic configurations, which allows you to fully customize resources, configurations, hooks, and executors.

from dagster import ResourceDefinition, graph, op

@op(required_resource_keys={"server"})
def interact_with_server(context):
    context.resources.server.ping_server()

@graph
def do_stuff():
    interact_with_server()

# In practice these would be distinct real resources; mocks keep the example self-contained.
prod_server = ResourceDefinition.mock_resource()
local_server = ResourceDefinition.mock_resource()

prod_job = do_stuff.to_job(resource_defs={"server": prod_server})
local_job = do_stuff.to_job(resource_defs={"server": local_server})

Wrapping Up

I hope this was a helpful exploration of some of the new orchestration tools that have started to gain traction in the data landscape. Despite the shortcomings of Airflow, it is still a solid and well-architected platform that serves many people well. However, competition in this space will only help improve all the tools as they learn and improve from each other. I’m excited to see how this space develops and would love to know what you think.

Which scheduler are you using? Do you have plans on migrating off Airflow? Let me know by commenting below or reaching out on Twitter or by email!
