
What is dbt?

Your guide to analytics engineering and the tool that created it

Photo by Randy Fath on Unsplash

A little over a year ago, I transferred into a new data engineering role in a different part of my company. I knew going in that I would be working with a lot more Python and SQL than in my previous role. As a lover of data and analytics, I was excited. I had never actually used SQL in my engineering roles there, and this was my opportunity to improve my skills.

A few weeks in, I was introduced to a tool my manager had just begun implementing called dbt (which stands for data build tool, hence the construction cover image). At first I didn’t really understand its purpose or why we were using it. But as I explored its features, I became obsessed, to put it lightly. Now here I am at a different company, working as an analytics engineer and helping to build out an entire data stack centered around dbt.

What it is

dbt is a data modeling tool that makes the lives of analysts and engineers a whole lot easier. It lets you write SQL queries without having to worry about managing dependencies yourself. I like to think of it as splitting queries into containerized blocks of code: you never have to keep repeating parts of queries that you’ve already written.

Example DAG (image from dbt)

For example, looking at this DAG you can see that the orders model depends on two models, base_orders and base_payments. If you run orders together with its upstream dependencies, dbt builds the two base models first. These base models are completely independent of orders and can be used by other models as well, which keeps the SQL code modular and reusable across models.
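
If you want a single command that builds orders along with everything upstream of it, dbt’s graph selectors can do that; the + prefix selects a model plus all of its ancestors:

# builds base_orders and base_payments, then orders
dbt run --models +orders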

How it’s organized

While there are a few different ways you can organize the folders within a dbt project, the company that created the tool suggests the following structure. Within the project’s models directory, you create three subdirectories: one for base models, another for intermediate models, and one for your mart.
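
Laid out on disk, that structure looks something like this:

models/
├── base/
├── intermediate/
└── mart/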

The base directory contains all of the models that read directly from a source, which is typically a schema in your data warehouse. In dbt, a source is referenced using this notation:

{{ source('campaigns', 'channel') }}

This refers to the source name and table defined in your .yml file. These base models look a lot like the raw data they read from; best practice here is to clean up the column names and cast columns to different data types if needed.
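
For illustration, here is what that definition and a matching base model might look like; the schema, column names, and casts below are assumptions (the :: cast syntax shown is Snowflake/Postgres-style):

# models/sources.yml (defines the 'campaigns' source used above)
version: 2

sources:
  - name: campaigns
    schema: raw_campaigns    # warehouse schema holding the raw data (assumed)
    tables:
      - name: channel

A base model reading from it would then clean up names and types:

-- models/base/base_channel.sql (hypothetical)
select
    id::varchar as channel_id,
    name as channel_name,
    created_at::timestamp as created_at
from {{ source('campaigns', 'channel') }}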

Next is the intermediate directory. These models always reference the base models we just talked about rather than reading from a source directly. You can think of them as the middleman between the base models and the final models that your analysts will query. The references here always use this notation:

{{ ref('base_campaign_types') }}

These intermediate models are typically used for joins between multiple base models.
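
As a sketch, an intermediate model joining two base models might look like this (the second base model and all column names are hypothetical):

-- models/intermediate/int_campaigns.sql
select
    ct.campaign_id,
    ct.campaign_type,
    ch.channel_name
from {{ ref('base_campaign_types') }} as ct
left join {{ ref('base_channel') }} as ch
    on ct.channel_id = ch.channel_id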

Lastly, the mart directory is where the "final" models used by the analytics team are configured. These use the same ref notation, but point to an intermediate model rather than a base model. This is where any calculations happen, as well as any other fine-tuning to prepare the model to be queried.
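
Continuing the hypothetical example, a mart model might add a calculation on top of the intermediate model:

-- models/mart/campaign_summary.sql
select
    campaign_type,
    channel_name,
    count(*) as campaign_count
from {{ ref('int_campaigns') }}
group by campaign_type, channel_name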

Other features

dbt also includes some other cool features like seeds, tests, and tags. Seeds are CSV files that you add to your dbt project to be uploaded to your data warehouse. This is helpful if you have a list of unique codes or employee IDs that you need in your analysis but that isn’t present in your current data. In my previous role, we used these to help test our code.
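
A seed is just a CSV checked into your project’s seed directory (data/ in older dbt versions, seeds/ in newer ones). As a made-up example, a file called data/employee_ids.csv:

employee_id,department
101,marketing
102,finance

Running dbt seed loads the file into your warehouse as a table, and from there it can be referenced like any other model with {{ ref('employee_ids') }}.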

Tests allow you to check certain properties of your data sources to ensure your data meets your standards. The two most common tests check that a table’s primary key is unique and that it is not null. You can also test primary and foreign keys across tables to make sure a value that appears in one table isn’t missing from the other. This is really helpful for teams looking to follow strong data governance practices.
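
These tests live in a .yml file alongside your models. A minimal sketch, with model and column names assumed for illustration:

version: 2

models:
  - name: base_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
  - name: base_payments
    columns:
      - name: order_id
        tests:
          # every order_id in payments must also exist in base_orders
          - relationships:
              to: ref('base_orders')
              field: order_id

Running dbt test then executes all of them and reports any failures.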

Another feature I like to utilize that is unique to dbt is tags. This is exactly what it sounds like: you can tag your models with certain keywords. The best part is that you can then run a group of models using their tags. For example, let’s say I have a few key models that I want to run daily. I can schedule the following command to run every morning at 6 a.m., and it will execute all of the models tagged daily.

dbt run --models tag:daily

This is also helpful if you’re looking to run models for a certain line of business. You can tag them with marketing, finance, sales, etc.
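
Applying a tag is a one-line config at the top of the model file (the tag names here are just examples):

{{ config(tags=["daily", "marketing"]) }}

Tags can also be applied to entire folders of models at once in dbt_project.yml.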

As I dive deeper into dbt, I will be sharing my findings here. Be sure to follow for articles on naming your dbt models, using dbt with Snowflake, and best practices for documentation.

For more on Analytics Engineering, the modern data stack, and dbt, subscribe to my free weekly newsletter.

