
Managing data models at scale is a common challenge for data teams using dbt (data build tool). Initially, teams often start with simple models that are easy to manage and deploy. However, as the volume of data grows and business needs evolve, the complexity of these models increases.
This progression often leads to a monolithic repository where all dependencies are intertwined, making it difficult for different teams to collaborate efficiently. To address this, data teams may find it beneficial to distribute their data models across multiple dbt projects. This approach not only promotes better organisation and modularity but also enhances the scalability and maintainability of the entire data infrastructure.
One significant complexity introduced by handling multiple dbt projects is the way they are executed and deployed. Managing library dependencies becomes a critical concern, especially when different projects require different versions of dbt. While dbt Cloud offers a robust solution for scheduling and executing multi-repo dbt projects, it requires a significant investment that not every organisation can afford or justify. A common alternative is to run dbt projects using Cloud Composer, Google Cloud’s managed Apache Airflow service.
Cloud Composer provides a managed environment with a substantial set of pre-installed dependencies. In my experience, however, this setup poses a significant challenge: installing any additional Python library without running into dependency conflicts is often difficult. When working with dbt-core, I found that installing a specific version of dbt within the Cloud Composer environment was nearly impossible due to conflicting version dependencies. This experience highlighted the difficulty of running any dbt version on Cloud Composer directly.
Containerisation offers an effective solution. Instead of installing libraries within the Cloud Composer environment, you can containerise your dbt projects using Docker images and run them on Kubernetes via Cloud Composer. This approach keeps your Cloud Composer environment clean while allowing you to include any required libraries within the Docker image. It also provides the flexibility to run different dbt projects on various dbt versions, addressing dependency conflicts and ensuring seamless execution and deployment.
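To make this concrete, here is a minimal, hypothetical Dockerfile sketch for such a setup. It is not the exact Dockerfile used in this article: the base image, the dbt-bigquery adapter and its version pin are assumptions, and the /home/my_dbt_project path simply mirrors the path the Airflow DAG later in this article expects. Adapt it to your own stack.
# Hypothetical Dockerfile kept at the repository root;
# base image, adapter and paths are illustrative assumptions.
FROM python:3.8.12-slim

# Name of the dbt project to bake into the image,
# passed by CI via --build-arg project_name=my_dbt_project
ARG project_name

# Each image pins its own dbt adapter/version, so projects on
# different dbt releases never share a Python environment.
RUN pip install --no-cache-dir "dbt-bigquery~=1.7.0"

# Copy the dbt project (models, macros, profiles, ...) into the image
COPY projects/${project_name}/ /home/my_dbt_project/
WORKDIR /home/my_dbt_project/

# Pull in dbt package dependencies at build time
RUN dbt deps

ENTRYPOINT ["dbt"]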
With the complexities of managing multiple dbt projects addressed, we now move on to the technical implementation of deploying these projects at scale on Google Cloud. The diagram below outlines the process of containerising dbt projects, storing the Docker images in Artifact Registry, and automating the deployment with GitHub Actions. Additionally, it illustrates how these projects are executed on Cloud Composer using the open-source Python package dbt-airflow, which renders dbt projects as Airflow DAGs. The following section will guide you through each of these steps, providing a comprehensive approach to effectively scaling your dbt workflows.

Deploying containerised dbt projects on Artifact Registry with GitHub Actions
In this section, we will define a CI/CD pipeline using GitHub Actions to automate the deployment of a dbt project as a Docker image to Google Artifact Registry. This pipeline will streamline the process, ensuring that your dbt projects are containerised and consistently deployed on a Docker repo where Cloud Composer will then be able to pick them up.
First, let’s start with a high-level overview of how the dbt project is structured within the repository. This will help you follow along with the definition of the CI/CD pipeline, since we will be working in certain sub-directories to get things done. Note that Python dependencies are managed via Poetry, hence the presence of the pyproject.toml and poetry.lock files (a minimal, illustrative pyproject.toml is sketched right after the directory tree). The rest of the structure shared below should be straightforward to understand if you have worked with dbt in the past.
.
├── README.md
├── dbt_project.yml
├── macros
├── models
├── packages.yml
├── poetry.lock
├── profiles
├── pyproject.toml
├── seeds
├── snapshots
└── tests
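As an illustration of how each project pins its own dbt version, a minimal pyproject.toml for this project might look like the sketch below. The adapter, version constraints and metadata are assumptions rather than a prescription; the point is that every dbt project declares its own, independent dbt dependency.
# Illustrative pyproject.toml -- names and pins are placeholders
[tool.poetry]
name = "my-dbt-project"
version = "0.1.0"
description = "Containerised dbt project (illustrative example)"
authors = ["Data Team <data@example.com>"]

[tool.poetry.dependencies]
python = "~3.8"
dbt-core = "~1.7"
dbt-bigquery = "~1.7"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"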
With the project structure in place, we can now move on to defining the CI/CD pipeline. To ensure everyone can follow along, we’ll go through each step in the GitHub Action workflow and explain the purpose of each one. This detailed breakdown will help you understand how to implement and customise the pipeline for your own projects. Let’s get started!
Step 1: Creating triggers for the GitHub Action workflow
The upper section of our GitHub Action workflow defines the triggers that will activate the pipeline.
name: dbt project deployment
on:
  push:
    branches:
      - main
    paths:
      - 'projects/my_dbt_project/**'
      - '.github/workflows/my_dbt_project_deployment.yml'
Essentially, the pipeline is triggered by push events to the main branch whenever there are changes in the projects/my_dbt_project/** directory or modifications to the GitHub Action workflow file itself. This setup ensures that the deployment process runs only when relevant changes are made, keeping the workflow efficient and up to date.
Step 2: Defining some environment variables
The next section of the GitHub Action workflow sets up environment variables, which will be used throughout the subsequent steps:
env:
  ARTIFACT_REPOSITORY: europe-west2-docker.pkg.dev/my-gcp-project-name/my-af-repo
  COMPOSER_DAG_BUCKET: composer-env-c1234567-bucket
  DOCKER_IMAGE_NAME: my-dbt-project
  GCP_WORKLOAD_IDENTITY_PROVIDER: projects/11111111111/locations/global/workloadIdentityPools/github-actions/providers/github-actions-provider
  GOOGLE_SERVICE_ACCOUNT: [email protected]
  PYTHON_VERSION: '3.8.12'
These environment variables store critical information needed for the deployment process, such as the Artifact Registry repository, the Cloud Composer DAG bucket, the Docker image name, service account details and workload identity federation.
💡 At a high level, Google Cloud’s Workload Identity Federation allows external workloads, such as GitHub Actions runners, to authenticate to Google Cloud securely without managing long-lived service account keys.
For more details, refer to the Google Cloud documentation.
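If you have not set up Workload Identity Federation yet, the gcloud sketch below illustrates the general shape of the setup. The pool, provider, project, repository and service account names are all hypothetical placeholders, and the exact flags (notably the attribute condition) may vary with your gcloud version and security requirements, so treat this as a starting point rather than a copy-paste recipe.
# Illustrative sketch only -- every name, project ID and repository below is a placeholder
gcloud iam workload-identity-pools create "github-actions" \
  --project="my-gcp-project-name" \
  --location="global" \
  --display-name="GitHub Actions pool"

gcloud iam workload-identity-pools providers create-oidc "github-actions-provider" \
  --project="my-gcp-project-name" \
  --location="global" \
  --workload-identity-pool="github-actions" \
  --issuer-uri="https://token.actions.githubusercontent.com" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository == 'my-org/my-repo'"

# Allow the GitHub repository to impersonate the deployment service account
gcloud iam service-accounts add-iam-policy-binding "my-service-account@my-gcp-project-name.iam.gserviceaccount.com" \
  --project="my-gcp-project-name" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/projects/11111111111/locations/global/workloadIdentityPools/github-actions/attribute.repository/my-org/my-repo"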
Step 3: Checking out the repository
The next step in the GitHub Action workflow is to check out the repository:
- uses: actions/[email protected]
This step uses the actions/checkout action to pull the latest code from the repository. This ensures that the workflow has access to the most recent version of the dbt project files and configurations needed for building and deploying the Docker image.
Step 4: Authenticating to Google Cloud and Artifact Registry
The next step in the workflow handles authentication with Google Cloud and Artifact Registry:
- name: Authenticate to Google Cloud
  id: google_auth
  uses: google-github-actions/[email protected]
  with:
    token_format: access_token
    workload_identity_provider: ${{ env.GCP_WORKLOAD_IDENTITY_PROVIDER }}
    service_account: ${{ env.GOOGLE_SERVICE_ACCOUNT }}

- name: Login to Google Artifact Registry
  uses: docker/[email protected]
  with:
    registry: europe-west2-docker.pkg.dev
    username: oauth2accesstoken
    password: ${{ steps.google_auth.outputs.access_token }}
First, the workflow authenticates with Google Cloud using the google-github-actions/auth action. This step retrieves an access token by leveraging the provided workload identity provider and service account.
The access token from the authentication step is then used to log Docker in to the specified registry (europe-west2-docker.pkg.dev) on Artifact Registry. This login enables the workflow to push the Docker image of the dbt project to the Artifact Registry in subsequent steps.
Step 5: Creating a Python environment
The next set of steps involves setting up the Python environment, installing Poetry, and managing dependencies.
- name: Install poetry
  uses: snok/[email protected]
  with:
    version: 1.7.1
    virtualenvs-in-project: true

- name: Set up Python ${{ env.PYTHON_VERSION }}
  uses: actions/[email protected]
  with:
    python-version: ${{ env.PYTHON_VERSION }}
    cache: 'poetry'

- name: Load cached venv
  id: cached-poetry-dependencies
  uses: actions/[email protected]
  with:
    path: projects/my_dbt_project/.venv
    key: venv-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('projects/my_dbt_project/poetry.lock') }}

- name: Install dependencies
  if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
  working-directory: './projects/my_dbt_project/'
  run: poetry install --no-ansi --no-interaction --sync
We first install poetry, the dependency management tool we will use to install Python dependencies. We then set up the specified Python version and attempt to load the virtual environment from cache. On a cache miss (i.e. the poetry lock file has changed since the last workflow run), the Python dependencies are installed from scratch; on a cache hit, they are simply restored from the cache.
Step 6: Compiling the dbt project
The following step involves cleaning the dbt environment, installing dbt dependencies, and compiling the dbt project.
- name: Clean dbt, install deps and compile
  working-directory: './projects/my_dbt_project/'
  run: |
    echo "Cleaning dbt"
    poetry run dbt clean --profiles-dir profiles --target prod
    echo "Installing dbt deps"
    poetry run dbt deps
    echo "Compiling dbt"
    poetry run dbt compile --profiles-dir profiles --target prod
This step will also generate the manifest.json file, a metadata file for the dbt project. This file is essential for the dbt-airflow package, which Cloud Composer will use to automatically render the dbt project as an Airflow DAG.
Step 7: Building and Pushing Docker Image on Artifact Registry
The next step in the workflow is to build and push the Docker image of the dbt project to the Google Artifact Registry.
- name: Build and Push Docker Image
  run: |
    FULL_ARTIFACT_PATH="${ARTIFACT_REPOSITORY}/${DOCKER_IMAGE_NAME}"
    echo "Building image"
    docker build --build-arg project_name=my_dbt_project --tag ${FULL_ARTIFACT_PATH}:latest --tag ${FULL_ARTIFACT_PATH}:${GITHUB_SHA::7} -f Dockerfile .
    echo "Pushing image to artifact"
    docker push ${FULL_ARTIFACT_PATH} --all-tags
Note how we build the image with two tags, namely latest as well as the short commit SHA. This approach serves two purposes: it makes it easy to identify which Docker image version is the latest, and it ties each Docker image back to the commit it was built from. The latter can be extremely useful when debugging needs to take place for one reason or another, as illustrated below.
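For instance, a debugging session against the image built from a specific commit might look like the sketch below. The registry path mirrors the environment variables defined earlier, and the short SHA a1b2c3d is a placeholder for whatever commit you are investigating.
# Hypothetical debugging session: pull the image built from a specific commit
# (tagged with its short SHA) and open a shell inside it to inspect the project.
docker pull europe-west2-docker.pkg.dev/my-gcp-project-name/my-af-repo/my-dbt-project:a1b2c3d
docker run --rm -it --entrypoint /bin/bash \
  europe-west2-docker.pkg.dev/my-gcp-project-name/my-af-repo/my-dbt-project:a1b2c3d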
Step 8: Syncing manifest file with Cloud Composer GCS bucket
The next step involves synchronising the compiled dbt project, specifically the manifest.json file, with the Cloud Composer DAG bucket.
- name: Synchronize compiled dbt
  uses: docker://rclone/rclone:1.62
  with:
    args: >-
      sync -v --gcs-bucket-policy-only
      --include="target/manifest.json"
      projects/my_dbt_project/ :googlecloudstorage:${{ env.COMPOSER_DAG_BUCKET }}/dags/dbt/my_dbt_project
This step uses the rclone Docker image to synchronise the manifest.json file with the Cloud Composer bucket. This is crucial to ensure that Cloud Composer always has the latest metadata available, so that the dbt-airflow package can pick it up and render the latest changes made to the dbt project.
Step 9: Sending Slack alerts in case of a failure
The final step is to send a Slack alert if the deployment fails. All you need to do in order to replicate this step is to issue a webhook (SLACK_WEBHOOK), as specified in the documentation, and store it as a repository secret.
- name: Slack Alert (on failure)
  if: failure()
  uses: rtCamp/[email protected]
  env:
    SLACK_CHANNEL: alerts-slack-channel
    SLACK_COLOR: ${{ job.status }}
    SLACK_TITLE: 'dbt project deployment failed'
    SLACK_MESSAGE: |
      Your message with more details with regards to
      the deployment failure goes here.
    SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
The full definition of the GitHub Action is shared below. In case you have any trouble running it, please do let me know in the comments and I will do my best to help you out!
name: dbt project deployment

on:
  push:
    branches:
      - main
    paths:
      - 'projects/my_dbt_project/**'
      - '.github/workflows/my_dbt_project_deployment.yml'

env:
  ARTIFACT_REPOSITORY: europe-west2-docker.pkg.dev/my-gcp-project-name/my-af-repo
  COMPOSER_DAG_BUCKET: composer-env-c1234567-bucket
  DOCKER_IMAGE_NAME: my-dbt-project
  GCP_WORKLOAD_IDENTITY_PROVIDER: projects/11111111111/locations/global/workloadIdentityPools/github-actions/providers/github-actions-provider
  GOOGLE_SERVICE_ACCOUNT: [email protected]
  PYTHON_VERSION: '3.8.12'

jobs:
  deploy-dbt:
    runs-on: ubuntu-22.04
    permissions:
      contents: 'read'
      id-token: 'write'

    steps:
      - uses: actions/[email protected]

      - name: Authenticate to Google Cloud
        id: google_auth
        uses: google-github-actions/[email protected]
        with:
          token_format: access_token
          workload_identity_provider: ${{ env.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ env.GOOGLE_SERVICE_ACCOUNT }}

      - name: Login to Google Artifact Registry
        uses: docker/[email protected]
        with:
          registry: europe-west2-docker.pkg.dev
          username: oauth2accesstoken
          password: ${{ steps.google_auth.outputs.access_token }}

      - name: Install poetry
        uses: snok/[email protected]
        with:
          version: 1.7.1
          virtualenvs-in-project: true

      - name: Set up Python ${{ env.PYTHON_VERSION }}
        uses: actions/[email protected]
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          cache: 'poetry'

      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/[email protected]
        with:
          path: projects/my_dbt_project/.venv
          key: venv-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('projects/my_dbt_project/poetry.lock') }}

      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        working-directory: './projects/my_dbt_project/'
        run: poetry install --no-ansi --no-interaction --sync

      - name: Clean dbt, install deps and compile
        working-directory: './projects/my_dbt_project/'
        run: |
          echo "Cleaning dbt"
          poetry run dbt clean --profiles-dir profiles --target prod
          echo "Installing dbt deps"
          poetry run dbt deps
          echo "Compiling dbt"
          poetry run dbt compile --profiles-dir profiles --target prod

      - name: Build and Push Docker Image
        run: |
          FULL_ARTIFACT_PATH="${ARTIFACT_REPOSITORY}/${DOCKER_IMAGE_NAME}"
          echo "Building image"
          docker build --build-arg project_name=my_dbt_project --tag ${FULL_ARTIFACT_PATH}:latest --tag ${FULL_ARTIFACT_PATH}:${GITHUB_SHA::7} -f Dockerfile .
          echo "Pushing image to artifact"
          docker push ${FULL_ARTIFACT_PATH} --all-tags

      - name: Synchronize compiled dbt
        uses: docker://rclone/rclone:1.62
        with:
          args: >-
            sync -v --gcs-bucket-policy-only
            --include="target/manifest.json"
            projects/my_dbt_project/ :googlecloudstorage:${{ env.COMPOSER_DAG_BUCKET }}/dags/dbt/my_dbt_project

      - name: Slack Alert (on failure)
        if: failure()
        uses: rtCamp/[email protected]
        env:
          SLACK_CHANNEL: alerts-slack-channel
          SLACK_COLOR: ${{ job.status }}
          SLACK_TITLE: 'dbt project deployment failed'
          SLACK_MESSAGE: |
            Your message with more details with regards to
            the deployment failure goes here.
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
Running dbt projects with Cloud Composer and dbt-airflow
In the previous section, we discussed and demonstrated how to deploy dbt project Docker images on Google Artifact Registry using GitHub Actions. With our dbt projects containerised and stored securely, the next crucial step is to ensure that Cloud Composer can seamlessly pick up these Docker images and execute the dbt projects as Airflow DAGs. This is where the dbt-airflow package comes into play.
In this section, we will explore how to configure and use Cloud Composer alongside the dbt-airflow package to automate the running of dbt projects. By integrating these tools, we can leverage the power of Apache Airflow for orchestration while maintaining the flexibility and scalability provided by containerised deployments.
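One prerequisite worth calling out: the dbt-airflow package must be available in the Cloud Composer environment itself so that the Airflow scheduler can import it. A sketch of one way to do this with gcloud is shown below; the environment name and location are assumptions, you may prefer to pin a specific version, and PyPI packages can equally be managed through the Cloud Console or Terraform.
# Illustrative: install dbt-airflow as a PyPI package on the Composer environment
# (environment name and location are placeholders)
gcloud composer environments update my-composer-environment \
  --location europe-west2 \
  --update-pypi-package dbt-airflow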
To ensure the dbt-airflow package can render and execute our containerised dbt project on Cloud Composer, we need to provide the following:
- Path to the manifest file: the location of the manifest.json file on the Cloud Composer Google Cloud Storage (GCS) bucket, pushed there during the CI/CD process by our GitHub Action
- Docker image details: the relevant details of the Docker image residing on Artifact Registry, enabling dbt-airflow to run it using the KubernetesPodOperator
Here’s the full definition of the Airflow DAG:
import functools
from datetime import datetime
from datetime import timedelta
from pathlib import Path

from airflow import DAG
from airflow.operators.empty import EmptyOperator

from dbt_airflow.core.config import DbtAirflowConfig
from dbt_airflow.core.config import DbtProfileConfig
from dbt_airflow.core.config import DbtProjectConfig
from dbt_airflow.core.task_group import DbtTaskGroup
from dbt_airflow.operators.execution import ExecutionOperator

# The dbt target to run against; assumed to match the target compiled in CI
dbt_profile_target = 'prod'


with DAG(
    dag_id='test_dag',
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['example'],
    default_args={
        'owner': 'airflow',
        'retries': 1,
        'retry_delay': timedelta(minutes=2),
        'on_failure_callback': functools.partial(
            our_callback_function_to_send_slack_alerts  # your Slack-alert callback, defined elsewhere
        ),
    },
) as dag:
    t1 = EmptyOperator(task_id='extract')
    t2 = EmptyOperator(task_id='load')

    tg = DbtTaskGroup(
        group_id='transform',
        dbt_airflow_config=DbtAirflowConfig(
            create_sub_task_groups=True,
            execution_operator=ExecutionOperator.KUBERNETES,
            operator_kwargs={
                'name': 'dbt-project-1-dev',
                'namespace': 'composer-user-workloads',
                'image': 'gcp-region-docker.pkg.dev/gcp-project-name/ar-repo/my-dbt-project:latest',
                'kubernetes_conn_id': 'kubernetes_default',
                'config_file': '/home/airflow/composer_kube_config',
                'image_pull_policy': 'Always',
            },
        ),
        dbt_project_config=DbtProjectConfig(
            project_path=Path('/home/my_dbt_project/'),  # path within the Docker container
            manifest_path=Path('/home/airflow/gcs/dags/dbt/my_dbt_project/target/manifest.json'),  # path on the Cloud Composer GCS bucket
        ),
        dbt_profile_config=DbtProfileConfig(
            profiles_path=Path('/home/my_dbt_project/profiles/'),  # path within the Docker container
            target=dbt_profile_target,
        ),
    )

    t1 >> t2 >> tg
Once the Airflow scheduler picks up the file, your dbt project will be seamlessly transformed into an Airflow DAG. This automated process converts your project into a series of tasks, visually represented in the Airflow UI.
As shown in the screenshot below, the DAG is structured and ready to be executed according to the defined schedule. This visualisation not only provides clarity on the flow and dependencies of your dbt tasks but also allows for easy monitoring and management, ensuring that your data transformations run smoothly and efficiently.

While this overview gives you a foundational understanding, covering the full capabilities and configuration options of the dbt-airflow package is beyond the scope of this article.
For those eager to explore further and unlock the full potential of dbt-airflow, I highly recommend reading the detailed article linked below. It covers everything you need to know to master the integration of dbt with Airflow, ensuring you can elevate your data transformation workflows.
Final Thoughts
By containerising your dbt projects and leveraging Artifact Registry, Cloud Composer, GitHub Actions, and the dbt-airflow package, you can streamline and automate your data workflows in a scalable and effective way.
This guide aimed to help you build such a deployment process, providing you with the tools and knowledge to efficiently manage and execute your dbt projects at scale. I truly hope it has been insightful and has equipped you with the confidence to implement these strategies in your own projects.