Running dbt using Gitlab CI/CD

The easiest way to deploy, run and schedule dbt for free

Photo by Samuel Sianipar on Unsplash

My first implementation of dbt was while running the Data Team at Sirena, and I immediately became a huge fan of it. We used it to run all of our transformations and create our models in our data warehouse (Snowflake). It was literally a game-changer to be able to apply dimensional modelling to our data, especially with lots of sources producing unstructured data. What’s better is that all of our models and business logic were version controlled, changes were fully visible, and collaboration was extremely easy.

Probably my favourite part, though, is the abstraction layer it creates between raw data and business entities. For example, we can create a dim_customers model and make it the Single Source of Truth for our company’s customers, completely abstracted from the raw data that generates the model. To our stakeholders, it doesn’t matter whether the table they use in our BI tool comes from our CRM, product DBs, our ERP, or our customer success software. All that matters is that in this table they can find all the information about our customers; it’s completely transparent to them. If things change upstream in our data sources, we as a data team can fix those through this abstraction layer. It even allows us to absorb dramatic changes in our data sources (such as a CRM migration or having multiple CRMs running at the same time) while minimising the impact on our BI layer (hopefully no one else has to deal with this, but yes, I have a few battle scars). Oh, and also, all of this is powered by the simplicity of SQL. Just amazing!

As I mentioned, we version-controlled our dbt project, and for that we used Gitlab, which made it all the more interesting when I came across this video from the Gitlab data team explaining how they run dbt using Gitlab CI. On top of that, Gitlab being an open-source company means all the code for their dbt project is fully available. I suggest you spend a bit of time over there; there are lots of useful things.

In this post, I will go into the details about how to "deploy" your dbt project using Gitlab’s CI/CD. Note that a similar workflow is possible using Github Actions.

Getting Started

The CI/CD tool from Gitlab is extremely easy to use: all you need to do is create a file called .gitlab-ci.yml in the root of your repository. This file is basically a recipe for how Gitlab should execute pipelines. In this post, we’ll go over the simplest workflow we can implement, with a focus on running the dbt models in production. I’ll leave it to later posts to discuss how to do actual CI/CD (including testing), generate docs, and store metadata. Right now, our goal is to run dbt build every day so that all of our models are up to date.

This is an example of a .gitlab-ci.yml file for one of the easiest setups to run dbt using Gitlab’s CI/CD:
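A minimal sketch of such a file could look like the following. The job and variable names are the ones discussed below; the dbt-postgres adapter and the exact install commands are assumptions on my part, since the example target here is Postgres:

```yaml
stages:
  - deploy-production

# Shared definition: Docker image plus dependency installation.
.dbt_run: &dbt_jobs
  image: python:3.8.1-slim-buster
  before_script:
    - pip install dbt-postgres  # assumes a Postgres warehouse; swap in your adapter
    - export CI_PROFILE_TARGET="--profiles-dir . --target $TARGET_NAME"

# Installs dbt packages and builds models, tests, snapshots and seeds.
.deploy:
  <<: *dbt_jobs
  script:
    - dbt deps $CI_PROFILE_TARGET
    - dbt build $CI_PROFILE_TARGET

# The actual job: only runs when triggered manually from Gitlab's UI.
deploy-production:
  extends: .deploy
  stage: deploy-production
  when: manual
  variables:
    TARGET_NAME: postgres
```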

We start by defining the stages that we want to run in our pipeline. In this case, we will only have one stage called deploy-production. If we ignore the middle part of the .gitlab-ci.yml file for now and jump straight to the bottom, we can see that we define what this stage actually does:
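In the sketch above, that is the job at the very bottom:

```yaml
deploy-production:
  extends: .deploy
  stage: deploy-production
  when: manual
  variables:
    TARGET_NAME: postgres
```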

Because we set the when parameter (docs) to manual, this stage will only run after it’s triggered through Gitlab’s UI. Besides that, the stage defines a variable called TARGET_NAME whose value is postgres (more about this later). As for what this stage actually does, it simply extends the definition in .deploy. In turn, .deploy pulls in the dbt_jobs definition, which is .dbt_run; this is specified by the line <<: *dbt_jobs, which essentially inserts the dbt_jobs definition in place.
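For reference, this is the .dbt_run definition from the sketch:

```yaml
.dbt_run: &dbt_jobs
  image: python:3.8.1-slim-buster
  before_script:
    - pip install dbt-postgres  # hypothetical adapter choice for a Postgres target
    - export CI_PROFILE_TARGET="--profiles-dir . --target $TARGET_NAME"
```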

Looking at .dbt_run, we see that we define the image in which everything will run, which in this case is the python:3.8.1-slim-buster Docker image (docs). After that, we run a few scripts to install dependencies and define a variable called CI_PROFILE_TARGET. For this, we use the TARGET_NAME variable we defined before. In this case, the CI_PROFILE_TARGET variable will look something like --profiles-dir . --target postgres. You can probably see where this is going: we can use this variable when running dbt commands to specify both the path to our profiles.yml and which target we want to use.

In terms of the profiles.yml, we can use a setup similar to the one used for running dbt with Docker, getting our data warehouse credentials from environment variables.
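As a rough sketch, assuming a Postgres warehouse and credentials exposed through environment variables (the profile name, variable names, and schema below are placeholders that would have to match your dbt_project.yml and your Gitlab settings):

```yaml
# profiles.yml at the root of the repo (picked up by --profiles-dir .)
default:                # must match the profile name in dbt_project.yml
  target: postgres
  outputs:
    postgres:           # must match the --target passed via CI_PROFILE_TARGET
      type: postgres
      host: "{{ env_var('DBT_DB_HOST') }}"
      user: "{{ env_var('DBT_DB_USER') }}"
      password: "{{ env_var('DBT_DB_PASSWORD') }}"
      dbname: "{{ env_var('DBT_DB_NAME') }}"
      schema: analytics
      port: 5432
      threads: 4
```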

Setting up these variables is extremely easy using Gitlab’s CI/CD variables:

Adding CI/CD variables in Gitlab (image by author)

After that, we can finally take a look at .deploy. Thanks to the dbt build command introduced in dbt 0.21.0, we can simply install our package dependencies and then run dbt build, which takes care of running models, tests, snapshots, and seeds in the order dictated by our dbt DAG.
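In the sketch above, .deploy boils down to two commands:

```yaml
.deploy:
  <<: *dbt_jobs
  script:
    - dbt deps $CI_PROFILE_TARGET
    - dbt build $CI_PROFILE_TARGET
```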

That’s it. If we push our code to Gitlab, we should see a new pipeline that we can trigger manually (remember the when: manual config). Once we do, our dbt project will run.

Checking CI/CD pipelines in Gitlab (image by author)

Scheduling dbt to run every day

Using this setup, there are at least two ways I came across to schedule dbt to run every day: the first is using Gitlab’s built-in pipeline schedules, and the second is using Gitlab’s API to trigger the pipeline.

Pipeline Scheduling

This is probably the easiest and most straightforward way to do it. Under the CI/CD section in Gitlab, there is a section for creating pipeline schedules.

You can follow the instructions to run the pipeline every day or however frequently your dbt models should be run.

Trigger Pipeline via API

Another way to schedule the pipeline is to use whichever orchestration tool you are currently using, such as Airflow or Dagster. This has the added advantage of being integrated with the rest of your data pipeline. Using this approach, you could trigger your dbt build as soon as your data has finished loading into your warehouse.

You can find the API documentation for triggering pipelines here, but essentially all that is required is to create a trigger token and use it in a POST request to the URL: https://gitlab.example.com/api/v4/projects/<project_id>/trigger/pipeline. An amazing feature of this approach is that you can pass variables to pipelines, which means you could trigger a dbt build whenever different parts of your data pipeline have finished loading.

Here’s an example of how to trigger the pipeline using Python:
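The sketch below uses the requests library; the Gitlab URL, project ID, trigger token, and the example pipeline variable are all placeholders you would replace with your own:

```python
import requests

GITLAB_URL = "https://gitlab.example.com"   # or https://gitlab.com
PROJECT_ID = "12345678"                     # placeholder project ID
TRIGGER_TOKEN = "your-trigger-token"        # placeholder token created under Settings > CI/CD > Pipeline triggers


def trigger_dbt_pipeline(branch: str = "main") -> dict:
    """Trigger the pipeline on the given branch and pass an example variable to it."""
    response = requests.post(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/trigger/pipeline",
        data={
            "token": TRIGGER_TOKEN,
            "ref": branch,
            # Variables passed here are available to the pipeline's jobs,
            # e.g. to tell dbt which part of the data pipeline finished loading.
            "variables[PIPELINE_SOURCE_LOADED]": "crm",
        },
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(trigger_dbt_pipeline())
```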

Bear in mind that for this to work, you will need to remove the when: manual condition.
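Alternatively, a rules block along these lines (again, just a sketch) keeps the manual trigger for regular pushes while letting scheduled and API-triggered pipelines run automatically:

```yaml
deploy-production:
  extends: .deploy
  stage: deploy-production
  variables:
    TARGET_NAME: postgres
  rules:
    # Run automatically when the pipeline comes from a schedule or a trigger token.
    - if: '$CI_PIPELINE_SOURCE == "schedule" || $CI_PIPELINE_SOURCE == "trigger"'
    # Otherwise (e.g. a push), keep the job manual.
    - when: manual
```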

Further improvements

Hopefully, this post has shown you how Gitlab CI/CD (or Github Actions) can be used in combination with dbt. I also hope it serves as inspiration, as there are many more ways this can be used. For example, at Sirena and Zenvia we used this workflow to do actual CI/CD, taking advantage of Snowflake zero-copy clones: whenever a merge request (or pull request) was created, it would trigger a pipeline that cloned our Snowflake databases and ran the code changes and tests on the clone. This gave us a lot of confidence that our changes were not introducing any issues. Another thing we implemented using this workflow was storing metadata and generating dbt docs to document our data warehouse. If this all sounds interesting, consider following me on Medium, as I will be creating more content about this in the future.

