Setting up a CI using dbt Cloud and Snowflake

Merging to master and sleeping at night

Massimo Belloni
Towards Data Science

At HousingAnywhere, we use dbt and Snowflake to model our data in order to offer meaningful views and materializations to our business analysts and data scientists. Our data warehouse is populated with raw structured data coming from a wide variety of sources, the majority of which are replicated via Stitch, an easy and scalable plug-and-play managed ETL tool. Others required a more personal touch and are integrated using Airflow with ad-hoc periodical tasks. In both cases, it can be frustrating to extract meaningful insights from these sources, considering how much of the data is dirty, incoherent and/or stored in over-engineered structures that were never designed with ease of access and information extraction in mind.

A simplified representation of the ETL architecture at HousingAnywhere

dbt itself is a command line tool that allows data engineers to efficiently transform data in their data warehouse. It uses SQL enriched with Jinja templates when more complex logic is needed, or to access information about the job's execution or the targeted schema/table. Data is read from the warehouse, transformed and pushed back into the same warehouse, though into a different schema or table. The results of dbt are usually easier to understand and maintain, often consisting of multiple joins over tables possibly originating from different data sources. It isn't uncommon for BI tools (e.g. Mode, Looker, etc.) to only have visibility of these resulting tables, as opposed to the original raw sources.
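To make this concrete, here is a toy dbt model; the model and column names are made up for illustration and are not part of our project. The config block tells dbt how to materialise the result, while ref() is resolved by dbt to the schema and table where the upstream model lives.

-- models/payments_per_user.sql
-- Toy example: model and column names are illustrative only.
{{ config(materialized='table') }}

select
    user_id,
    count(*)    as n_payments,
    sum(amount) as total_amount
from {{ ref('stg_payments') }}  -- resolved by dbt to the underlying schema.table
group by user_id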

To give an example: imagine a table containing various information about a property, with a latitude and longitude describing its location. It isn't easy to query this kind of data directly. We created a model using dbt that adds a 'city' column, computed via the haversine distance between the property's location and the cities of interest for our business: the distances are calculated and the closest city is assigned to the property for reporting purposes. Business analysts can simply query the second table without having to worry about the complex logic behind the scenes.
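A sketch of such a model could look as follows; the table and column names are hypothetical, and Snowflake's HAVERSINE function is used to get the great-circle distance in kilometres between two coordinate pairs.

-- models/properties_with_city.sql
-- Sketch: upstream models and columns are illustrative, not our actual schema.
with distances as (

    select
        p.property_id,
        c.city_name,
        haversine(p.latitude, p.longitude, c.latitude, c.longitude) as distance_km,
        row_number() over (
            partition by p.property_id
            order by haversine(p.latitude, p.longitude, c.latitude, c.longitude)
        ) as rn
    from {{ ref('properties') }} as p
    cross join {{ ref('cities') }} as c

)

select
    property_id,
    city_name as city,
    distance_km
from distances
where rn = 1  -- keep only the closest city per property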

dbt, being a Python module distributed on PyPI, can be deployed on any machine running Python. That said, it is definitely handier to use dbt Cloud, which offers an efficient deployment of dbt for free (initially, of course, with a pay-as-you-grow model coming into play later). dbt Cloud also adds a lot of interesting features to the core product, and the user-friendly web interface makes it a real no-brainer as the way to run dbt in a production setting. Thanks to the seamless integration with GitHub, all the models and configurations can be version controlled with git, and the development process can be carried out as in any other classical software engineering project, with multiple branches opened in parallel and pull requests (PRs) to merge changes into master. At scheduled times, dbt Cloud picks up the master branch and runs the selected models against production data, transforming them as defined in the code and delivering the changes.

This allows for an efficient, observable and easy-to-maintain environment, though having master directly connected to the production warehouse isn't the most reliable thing to do. The dbt syntax is quite error-prone, and even though models can be compiled locally (the Jinja templates have to be translated to SQL), it isn't easy to assess whether a change breaks some semantic constraint in the data. In a naive implementation, nothing prevents you from merging a commit to master that breaks something, only for you to discover the error of your ways when the scheduled task runs sometime during the night. Not ideal.

“Run on Pull Requests” feature on dbt Cloud

Luckily, dbt Cloud comes to the rescue, allowing jobs to be run against custom branches when triggered by a PR opened on a connected repository. In other words: an exact copy of the production jobs can be run using the models found on the branch every time a PR is opened, materialised against a temporary schema on the same production environment. The temporary schema is deleted when the PR is closed or merged, but while the PR is open it can be accessed like any other schema on the database, to run tests or to manually check things if needed. This feature is particularly useful to check the syntax and high-level consistency of the branch's content, preventing contributors from merging breaking changes. If the dbt pipeline in production already contains tests, these are also run against the temporary schema, checking its content on a semantic level.
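As an example, a simple data test could live in the project's tests/ directory as a plain SQL query that is expected to return no rows; dbt runs it both in production and against the temporary PR schema. The model and column names below are, again, illustrative.

-- tests/assert_no_future_conversations.sql
-- Sketch of a dbt data test: the check fails if this query returns any rows.
select *
from {{ ref('stg_conversations') }}
where created_at > current_timestamp()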

A failed CI check, preventing developers from merging

However, while having the whole production environment replicated on the temporary schema is appealing in theory, it is not feasible to materialise tables with hundreds of millions of rows on every commit. The main goal of a CI is to give developers quick feedback on their work, while a complete re-population of some tables can take a few hours in the best case. Fortunately, when executing dbt jobs it is quite easy to distinguish production runs from CI ones, thanks to the strict naming pattern followed by the temporary schemas used for this purpose (dbt_cloud_pr_xx at the time of this post). Removing hardcoded references to raw tables in dbt models and reducing the size of the data those references return while executing CI jobs will do the trick, allowing CI runs to use a small subset of the production data and drastically reducing the time needed to have reliable checks on each PR.
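A small Jinja macro is enough to tell the two kinds of runs apart; this sketch simply relies on the dbt_cloud_pr_ prefix mentioned above.

-- macros/is_ci_run.sql
-- Returns true when the job is materialising into a dbt Cloud PR schema.
{% macro is_ci_run() %}
    {{ return(target.schema.startswith('dbt_cloud_pr')) }}
{% endmacro %}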

First step: remove hardcoded references to raw tables in dbt models
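One way to do this, sketched below with hypothetical names (a raw database, a stitch_mysql schema and a conversations table), is to route every reference to a raw table through a single macro, so that no model spells out the physical location itself and CI-specific behaviour can later be injected in one place.

-- macros/raw_table.sql
-- Sketch: a single place where raw tables are referenced, instead of
-- hardcoding raw.stitch_mysql.conversations (or similar) inside every model.
{% macro raw_table(schema_name, table_name) %}
    raw.{{ schema_name }}.{{ table_name }}
{% endmacro %}

-- models/staging/stg_conversations.sql
select * from {{ raw_table('stitch_mysql', 'conversations') }} as conversations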

Reducing the scope of the tables without tampering with the external references of the data structures is syntactically an easy job, but there isn't a single rule of thumb to do it properly. What we decided to do is restrict the time horizon on one master table and populate the others so as to preserve the joins and references.

Second step: reduce the scope of the raw tables when run against CI schemas
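Building on the two sketches above, the same macro can apply a time filter only when the job targets a CI schema; the choice of master table, the created_at column and the one-month window are illustrative assumptions.

-- macros/raw_table.sql (extended)
-- Sketch: CI jobs only see a one-month slice of the master table,
-- while production runs still see the full table.
{% macro raw_table(schema_name, table_name) %}
    {% if is_ci_run() and table_name == 'conversations' %}
        (
            select *
            from raw.{{ schema_name }}.{{ table_name }}
            where created_at >= '2019-01-01' and created_at < '2019-02-01'
        )
    {% else %}
        raw.{{ schema_name }}.{{ table_name }}
    {% endif %}
{% endmacro %}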

Here’s an example not very far from reality: imagine a database schema with conversations between tenants and advertisers about housing properties (listings). The most logical decision is to select only the conversations created in a specific time interval (e.g. January 2019, some tens of thousands of rows), and to enforce the existence of the listings those conversations refer to and, of course, of the users involved on both sides. The resulting schema is reduced in size but still consistent. Reasoning in this direction, we can close all the transitive dependencies of this part of the schema, possibly constraining to the same period other tables that have no direct dependencies on these.
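Sticking to the hypothetical names used so far, the dependent staging models can filter on the existence of a related conversation when running in CI, so that every join downstream still resolves; in production the guard is skipped and the full table is used.

-- models/staging/stg_listings.sql
-- Sketch: in CI runs, keep only the listings referenced by the reduced
-- set of conversations; in production, keep everything.
select l.*
from {{ raw_table('stitch_mysql', 'listings') }} as l
{% if is_ci_run() %}
where exists (
    select 1
    from {{ raw_table('stitch_mysql', 'conversations') }} as c
    where c.listing_id = l.listing_id
)
{% endif %}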

Thanks to Julian Smidek and Stephen O’Farrell for their contributions to this project and this post.
