This topic is not new. I’ve seen quite a few articles about generating different levels of abstraction for Airflow DAGs. For example, this article explains how to generate DAGs programmatically, and this library allows you to generate DAGs using YAML. However, both of those abstractions are still aimed at data practitioners. I was looking for an abstraction that would remove the complexities and technicalities of creating Airflow DAGs and allow the author to concentrate on data-source-specific configuration. To do this, the DAG factory needs to make a lot of assumptions and be highly opinionated, so that’s exactly what I did.
This post will show you how I generated Airflow factories to handle different data sources that allow me to create DAGs using explicit YAML files. My goal was to create an abstraction that could potentially enable stakeholders to generate Airflow DAGs to extract their data and load it into our data warehouse.
Let’s talk about Google Sheets
The concept I will show you applies to any data source, and if you implement it, you will reap the benefits regardless of which one you pick. However, a niche example of this is Google Sheets. Whether you like it or not, most businesses run on either Excel or Google Sheets (or both!), and sooner rather than later, you will encounter stakeholders who request to load data from a Google Sheets file into your data warehouse. Now, this is not a blog post about data philosophy, and I have no intention of discussing the (very sensitive) topic of Excel and Google Sheets. I will say, though, that if you opt to go down the route of allowing your stakeholders to load Google Sheets into your data warehouse, you’ll pretty soon see many requests coming in: changing a column name, changing a column type, adding a column, removing a column, changing the sheet name and so on.

Imagine an ideal world in which your stakeholders have the power to create their pipelines themselves. In this world, you as a data practitioner would be free from all those requests and would stop being a bottleneck for your stakeholders. At the same time, with great power comes great responsibility, and as your stakeholders gain the power to manage their own Google Sheets pipelines, they become owners of that data. If changing a column breaks the ingestion, you are not accountable for that anymore (sort of); it is their responsibility, and they have the power to fix it.

I know what you are thinking, and yes, this is, in fact, a double-edged sword. The ability to create their own Google Sheets pipelines could mean that your data warehouse becomes a black hole of endless Google Sheets data with poor structure, meaning and/or naming conventions. Again, I’m not arguing for or against here. If you implement this, I trust that you will maintain high governance over what gets pushed to production. Your critical thinking will ultimately set you free.
Are you not buying into the idea of letting stakeholders load their Google Sheets files? Fine. Abstract a bit from the specific use case and think about DAG factories as a way to help your team be more productive, stay DRY and avoid mistakes. So it doesn’t matter whether it’s Google Sheets or data from transactional APIs; your team could probably benefit from using Airflow DAG factories.
With all of that rant out of the way, we can explore how to load Google Sheets data using Airflow DAG factories. Now, it’s also important to mention that this is definitely not the only way you can load data from a Google Sheets file. For example, GitLab implemented their own tool, which they call SheetLoad (seriously, great name), and there are also tools like Fivetran and Stitch. However, if your team is already working with Airflow, or if you want an extra dose of customization, this post will be helpful.
I digress. The real point here is that it’s not about Google Sheets; it’s about ease of use. This is important even if it’s just your team using this tool. For example, I’d rather write YAML files for loading data from Salesforce than write Python scripts for every single object out there. Essentially, I’m just advocating for ease of use and DRY (don’t repeat yourself).
What is a DAG factory anyway?
So at this point, you might be wondering, what in the world is a DAG factory? In a more general way, a DAG factory is a script that generates Airflow DAGs programmatically. However, in this case, I add another condition. The factory generates DAGs programmatically from YAML configuration files.

More concretely, inside our /dags directory in Airflow, I would like to have a /dags/gsheets directory that will hold all of the YAML files. So we would have something like /dags/gsheets/invoices.yaml and /dags/gsheets/products.yaml, and the goal is to have Airflow create corresponding DAGs for those configuration files. Using this structure, there should be a 1:1 relation between YAML configuration files and Airflow DAGs. However, nothing is stopping you from having a single YAML configuration file that generates many DAGs.
What does a configuration file look like?
Taking into account that ease of use is one of our main goals, this is a crucial question. The configuration file ends up being the interface that analysts and other stakeholders will interact with. In this case, the configuration file is the equivalent of the front-end in a web application, so we need to make sure it has a good "UX". In this sense, simplicity is vital, and any complication you can hide away is highly appreciated.
So here’s what a configuration file might look like:
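Something along these lines; every field name here is illustrative, and you can reshape it however you like:

```yaml
# dags/gsheets/invoices.yaml -- all field names are illustrative
dag_id: gsheets_invoices
schedule_interval: "0 6 * * *"

# Which spreadsheet and worksheet to pull from
spreadsheet_id: 1aBcD3FgH1jKlMn0pQrStUvWxYz
worksheet: invoices

# Where the data should land in the warehouse
destination:
  database: analytics
  schema: gsheets
  table: invoices

# Columns to keep, their SQL-friendly names and their types
mapper:
  Invoice Number:
    name: invoice_number
    type: varchar
  Amount:
    name: amount
    type: float
  Created At:
    name: created_at
    type: timestamp
```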
Is this the most straightforward configuration file we can have? Certainly not! We can simplify even more if we make certain assumptions. For example, does the creator of these config files need to decide where the data will be stored in the data warehouse, such as schema name, table, and database? This will all depend on your use case, of course.
However, let’s get straight to the point and address the elephant in the room. What is that mapper dictionary?
That’s a way we can deal with the renaming of columns and the casting of types. It will all become more evident when we look at the DAG factory, but its essence is that we can explicitly define the columns from the Google Sheets file that we want to load and rename them to be SQL compliant (for example, rename them to snake_case). For instance, if our Google Sheets file has a column named Product Name, it would be annoying to have a column with that name in our data warehouse, and it could cause all sorts of problems. We want to change the name of that column to something like product_name. Using this mapper, we would do something like this:
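Sticking with the illustrative mapper shape from the config above, that entry would look like:

```yaml
mapper:
  Product Name:
    name: product_name
    type: varchar
```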
Let’s build a factory
So it all starts with a good old-fashioned /dags/gsheets_factory.py script. This file will import all of the configuration files from /dags/gsheets, creating an Airflow DAG for each one of them.
And that’s exactly where we start: by getting all of those YAML files:
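A minimal sketch, assuming the factory script lives in /dags and the configs in /dags/gsheets:

```python
import glob
import os

# /dags/gsheets relative to this factory script
CONFIG_DIR = os.path.join(os.path.dirname(__file__), "gsheets")

# Every YAML configuration file in that directory becomes a DAG
config_files = glob.glob(os.path.join(CONFIG_DIR, "*.yaml"))
```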
Now we can iterate them and use them to create DAGs:
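Roughly like this (create_gsheets_dag is the function we’ll look at next):

```python
import yaml

for config_file in config_files:
    # PyYAML turns the file into a plain Python dictionary
    with open(config_file) as f:
        config = yaml.safe_load(f)

    # Build a DAG from the configuration...
    dag = create_gsheets_dag(config)

    # ...and add it to the module's global scope so Airflow discovers it
    globals()[dag.dag_id] = dag
```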
All we are doing here is iterating over those files and opening them with PyYAML, which returns a dictionary version of the YAML file we discussed previously. We store that configuration dictionary in config. Finally, we pass that configuration dictionary to a function called create_gsheets_dag, and we add the DAG that function returns to the Airflow global scope.
As I said, create_gsheets_dag returns an Airflow DAG, so this is essentially where the magic actually happens. Let’s take a look at it.
Here is the entire function:
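The version below is a condensed sketch rather than a drop-in implementation: I’m assuming gspread as the Google Sheets client, and clean_data and load_df are our in-house helpers (more on those in a moment).

```python
import gspread  # assumption: any Google Sheets client would work here
import pandas as pd
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical in-house helpers: clean_data applies the mapper,
# load_df writes a DataFrame to the warehouse (Snowflake in our case).
from plugins.helpers import clean_data, load_df


def create_gsheets_dag(config: dict) -> DAG:
    """Build one Airflow DAG from a single YAML configuration dictionary."""
    dag = DAG(
        dag_id=config["dag_id"],
        schedule_interval=config.get("schedule_interval", "@daily"),
        start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
        catchup=False,
    )

    def _sync_data():
        # 1. Get the data from Google Sheets
        client = gspread.service_account()
        worksheet = client.open_by_key(config["spreadsheet_id"]).worksheet(
            config["worksheet"]
        )
        df = pd.DataFrame(worksheet.get_all_records())

        # 2. Clean it: keep, rename and cast the columns listed in the mapper
        df = clean_data(df, config["mapper"])

        # 3. Load it into the data warehouse
        load_df(df, **config["destination"])

    PythonOperator(
        task_id="sync_data",
        python_callable=_sync_data,
        dag=dag,
    )

    return dag
```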
But let’s break it down and understand what each part is doing.
The first part is simply defining the DAG’s properties, such as the ID, the schedule interval, etc. Notice that all of this is coming from config, which is a dictionary representation of the YAML configuration file we discussed before.
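In the sketch above, that’s this part:

```python
dag = DAG(
    dag_id=config["dag_id"],
    schedule_interval=config.get("schedule_interval", "@daily"),
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
)
```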
For this task, we will use Airflow’s PythonOperator. This operator will take a Python function as an argument (python_callable) and execute it. So we first need to define that function. In our case, we will call it _sync_data.
The first thing this function does is get the data from Google Sheets:
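I’m using gspread here, but any Google Sheets client (the Google API client, or Airflow’s own GSheetsHook) would work just as well:

```python
# Read the whole worksheet into a Pandas DataFrame
client = gspread.service_account()
worksheet = client.open_by_key(config["spreadsheet_id"]).worksheet(config["worksheet"])
df = pd.DataFrame(worksheet.get_all_records())
```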
The next step is to clean the data we got:
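Inside _sync_data that’s a single call (the exact signature is whatever you make it):

```python
# Keep only the columns listed in the mapper, rename them and cast their types
df = clean_data(df, config["mapper"])
```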
We are calling a helper function that we wrote, clean_data, but this function’s details are not super relevant or exciting; it simply takes the mapper dictionary and renames the columns accordingly.
After that, we are ready to insert the data into our data warehouse. In our case, we use Snowflake, and we already have a function, load_df, that takes a Pandas DataFrame and writes it to Snowflake. This is based on the Snowflake Python library, which has a similar function. However, this part can be adapted to whatever destination you are working with:
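In the sketch, the destination block from the YAML config is simply unpacked into that helper:

```python
# database, schema and table come straight from the YAML's destination block;
# load_df is built on the Snowflake connector's own DataFrame-loading function
load_df(df, **config["destination"])
```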
As described before, we then add a task to our DAG using the PythonOperator, which will execute the function we just defined (_sync_data).
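Which is the last piece of the factory (the task_id is arbitrary):

```python
PythonOperator(
    task_id="sync_data",
    python_callable=_sync_data,
    dag=dag,
)

return dag
```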
Conclusion
Hopefully, I managed to show how we can drastically improve the way we create Airflow DAGs by introducing a level of abstraction. It doesn’t matter whether you are loading data from Google Sheets, HubSpot or Salesforce; these concepts still apply.