
5 dbt Modelling Tricks To Learn

Getting the most out of dbt (and a quick intro)

Photo by Kevin Ku on Unsplash

dbt is a tool which has gained a lot of popularity recently as part of the ‘Modern Data Stack’. It’s a fantastic tool which enables data transformations to be performed directly in SQL, letting analysts and analytics engineers focus more on turning business logic into models, and less on the pipelines, orchestration, and read/write nuances that data engineers tend to focus on.

It also supports the jinja templating language which, as we’ll see below, opens up a whole host of additional options when creating our data models. In this article we’ll focus on dbt features which can be used to streamline and enhance a data modelling project, creating useful and repeatable logic which can be used across the project.


New to dbt?

If you’re reading this article you’ve probably used or at least heard of dbt in some form before, but if you’re new to using dbt I’d recommend first reviewing and understanding their common best practices. These aren’t the be-all and end-all of dbt modelling, but following these will help the structure of your project and reduce the need for refactoring in the future.

I’d pay particular attention to:

  1. Folder management – this defines the end structure of your data models both in the file tree and in the database. Do you want ‘dim’ and ‘fact’ file prefixes? Folders based on schemas or product areas?
  2. SQL and style guide – create a convention for the SQL which makes up your models (casing, tabbing, joins, etc.), as well as for the dbt models themselves (how CTEs and subqueries are used). If you’re interested in automating these checks, you can take a look at something like sqlfluff.
  3. Model referencing – to enable dbt to do its work, reference other models with the {{ ref() }} jinja so that dbt can infer model dependencies and build correctly from upstream tables. The {{ source() }} jinja should only be used once to select from the raw database, and direct table references should never be used (see the sketch after this list).
  4. Reusing model logic – logic should be moved as far upstream as possible so as not to repeat it (e.g. column renaming). If you often reuse the same logic in multiple data models, maybe you can turn that logic itself into a single data model and move it upstream so it can simply be referenced with {{ ref() }} .
  5. Analytical layers – once a dbt project has developed you may run into errors due to ‘cyclical dependencies’, where two models depend on each other. The best way I’ve found to solve this is to have multiple model layers, where each layer only references models in the layer below it. There are some docs on how you might approach this.
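As a minimal sketch of points 3 and 4, a staging model can select from the raw source once, handle the renaming upstream, and then be referenced everywhere else with {{ ref() }} (the source, model, and column names below are made up for illustration):

-- models/staging/stg_product_sales.sql
with source as (
    -- the only place the raw table is selected from
    select * from {{ source('raw', 'product_sales') }}
),

renamed as (
    select
        id as product_sale_id,
        product_id,
        sale_month as month,
        sale_amount as amount
    from source
)

select * from renamed

Any downstream model then simply selects from {{ ref('stg_product_sales') }} and inherits the renaming for free.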

Now onto the main dbt tricks you may not have yet encountered…

Photo by Samuel Regan-Asante on Unsplash

1. Snapshots

dbt snapshots are an easy way to implement type-2 slowly changing dimensions, enabling analysts to more easily perform data checks such as when and why a data value or analytical metric may have changed.

These changes are implemented via row snapshots with dbt_valid_from and dbt_valid_to columns added, where the currently valid row has dbt_valid_to=null. However, these columns are only updated each time the snapshot is run, so if the snapshot runs every hour and the status changes multiple times in that period, only the latest value will be captured as the change. Therefore this doesn’t replace actual data monitoring solutions such as an audit database or writing product columns such as status_value_at to the database.

Below, a snapshot is created from the product_sales table. Rather than the commonly used updated_at (timestamp) strategy, the check strategy (with all columns) enables the snapshot to include new table columns automatically. Conversely, if columns are removed from the source table, the snapshot will retain those columns.
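A minimal sketch of what that snapshot file might look like (the source name and unique key here are assumptions):

-- snapshots/product_sales_snapshot.sql
{% snapshot product_sales_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='check',
      check_cols='all'
    )
}}

select * from {{ source('raw', 'product_sales') }}

{% endsnapshot %}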

2. Surrogate Key

The dbt-utils package has a useful surrogate_key macro which generates a primary key by hashing a set of columns. This is useful both for creating dbt tests for unique primary keys, and when using the model in a BI tool such as Looker, which requires the primary key to be explicitly defined in order to perform aggregations correctly.

e.g. if the product_sales table we created a snapshot for above had no id column, we could easily create this primary key from the two grouping columns.

select
  {{ dbt_utils.surrogate_key(['product_id', 'month']) }} as product_sale_id,
  *
from {{ ref('product_sales') }}

3. Pivot

Back to the dbt-utils package again, this time for the pivot macro. In SQL we often use a CASE statement to segment totals for further analysis, such as when creating cohort views of product sales to look at month-to-month performance.

select
  product_id,
  sum(case when month = 'jan' then amount else 0 end) as amount_jan,
  sum(case when month = 'feb' then amount else 0 end) as amount_feb,
  sum(case when month = 'mar' then amount else 0 end) as amount_mar
from product_sales
group by 1
order by product_id

But this syntax can quickly become cumbersome as the number of columns we want to pivot increases. If we have 20 months, I don’t really want to copy and paste the same SQL logic for every column, and if the logic changes that’s 20 lines I need to update.

We can instead use the dbt_utils pivot macro to do this in just a few lines. The downside is that this doesn’t work on CTEs, only on relations which already exist in the warehouse (the column values are typically fetched with get_column_values, which queries the table when the model is compiled). Therefore if the model we want to reference doesn’t already exist we need to either create it or use another solution (like Snowflake’s PIVOT function).

Here the rows are selected (product_id), then the dbt_utils.pivot macro takes as arguments the name of the column to pivot and the row values to turn into columns (month). The macro is implemented via a CASE WHEN statement which defaults to then 1 else 0 end, but here we set the then_value so that the result is then amount else 0 end.
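A sketch of what this could look like, following the dbt-utils documentation (product_sales is assumed to already be materialized, and the month values are fetched with get_column_values):

select
  product_id,
  {{ dbt_utils.pivot(
      'month',
      dbt_utils.get_column_values(ref('product_sales'), 'month'),
      agg='sum',
      then_value='amount',
      else_value=0,
      prefix='amount_'
  ) }}
from {{ ref('product_sales') }}
group by product_id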

4. Tests

dbt comes with four generic tests which help to cover the general data integrity checks you might perform. These are the unique, not_null, accepted_values and relationships tests. Both generic tests and documentation are easily implemented via the project yaml files, and the tests will be checked alongside the data model on each run.

Here’s a generic primary key check which should generally be applied to every data model.

version: 2
models:
  - name: product_sales
    description: |
      Area: Product
      Table of product sales each month.
    columns:
      - name: product_sale_id
        description: "Primary key"
        tests:
          - unique
          - not_null

What can cause issues is that models downstream of a failing dbt test won’t be built. This would make it risky to add sanity tests such as accepted_values or not_null on non-primary-key columns, if it weren’t possible to also set the test severity.

columns:
  - name: month
    description: "Month of sale."
    tests:
      - not_null
      - accepted_values:
          values: ['jan', 'feb', 'mar']
          config:
            severity: warn

5. Macros

Aside from the super useful dbt-utils package, we can also create our own macros. To use them, add them to your project’s macros folder (the macro-paths configured in dbt_project.yml). Macros are useful for creating pieces of code which can be reused in multiple places, just as we saw with the dbt-utils pivot macro.

An example of such a use case is getting all non-deleted rows from a table. Maybe it’s a snapshot of raw data synced by Fivetran which has a deleted_at column. It would be better to implement this with {{ active_table(table) }} than to repeat the same WHERE clauses in multiple places in the code.
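A minimal sketch of what such a macro could look like (the macro name comes from the paragraph above; the filter column is an assumption and depends on your source):

-- macros/active_table.sql
{% macro active_table(table) %}
    (
        select *
        from {{ table }}
        -- for a dbt snapshot you might also filter on dbt_valid_to is null
        where deleted_at is null
    )
{% endmacro %}

It can then be used in a model as from {{ active_table(ref('stg_product_sales')) }}.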

Another example is that although min/max ignore nulls in Snowflake, greatest/least return null if any of the arguments is null. This behaviour can be annoying if we want to, e.g., take the earliest of several dates to get a customer’s first activity date. To alleviate this we’d have to coalesce every argument with a dummy value… or we can create a macro to do this for us.

The dummy value for such a dbt_least macro would be something like '12/31/9999', so it would keep working for at least another 7977 years at the time of writing this!
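A sketch of what this could look like (the macro name and dummy value follow the paragraph above; the ISO date format is used to avoid depending on session date settings):

-- macros/dbt_least.sql
{% macro dbt_least(columns) %}
    least(
        {% for col in columns %}
        coalesce({{ col }}, '9999-12-31'::date){% if not loop.last %},{% endif %}
        {% endfor %}
    )
{% endmacro %}

which could be called in a model as {{ dbt_least(['signup_date', 'first_order_date']) }} as first_activity_date (the column names here are made up).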


Final Thoughts

dbt is a powerful analytics engineering solution for building data models, enabling the user to focus on creating robust business logic instead of worrying about the specifics of loading and writing data. We went through a quick introduction to dbt and some important initial project considerations, then moved on to some helpful features and macros in the dbt-utils package. These help to improve general syntax, avoiding repeated blocks of code and reducing possible errors in our final data models.


If you enjoyed this article you can find more articles and drop me a follow on my profile!

