
How to Make Your Data Models Modular

Avoid highly coupled systems and unexpected production bugs with these steps

Photo by T.J. Breshears on Unsplash

It’s every engineer’s worst nightmare to discover that something in production is broken. Even worse is when it’s a simple change that broke everything. When this happens, chances are your system is highly intertwined and one tweak here can cause a domino effect way over there.

No change is safe! You never know what it could end up affecting.

This is exactly what we want to avoid when we build data models. You don’t want them to be tightly coupled; you want to be able to easily debug and make changes when necessary without breaking the entire pipeline.

Building your data models with modularity in mind is the solution to this. Modular data models exist independently from one another. They are like pieces in a bigger puzzle. Together, the pieces create something beautiful and show you the full picture. However, each puzzle piece can still be pulled out and exist on its own without breaking the entire image.

We want our data models to be puzzle pieces that can be easily removed, changed, and added back without any performance impacts to the data pipeline.

How do we do this? It starts with reassessing your current data models and how you can rewrite them to be more modular.

Create base models for each of your raw data sources.

Your raw data should never be directly referenced by any of your data models. You always want to keep an untouched copy of your raw data in your data warehouse. This way, if your data is ever compromised, you can always revert to the raw copy.

Base models exist to protect your raw data from being transformed in any way. They are typically views in your data warehouse that sit on top of the raw data source, so they don’t take up any storage space. They reference the raw data while standardizing its basic features to make it cleaner for the analytics engineer to use.

Base models include:

  • data type casting
  • column name changes
  • timezone conversions
  • simple case when statements

Base models shouldn’t include any fancy calculations, joins, or aggregations. They are just a standardized version of your raw data that makes it easier for you to reference in your data models downstream.
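Since the example that follows focuses on casting and renaming, here is a minimal sketch of the other two items from the list above, a timezone conversion and a simple case when statement. The source table, column names, and warehouse syntax (Snowflake’s convert_timezone) are assumptions for illustration only; the config line at the top is what materializes the model as a view in dbt:

{{ config(materialized='view') }}

select
  id AS event_id,
  -- convert_timezone is Snowflake syntax; swap in your warehouse's equivalent
  convert_timezone('UTC', 'America/New_York', created_at) AS created_at_est,
  -- collapse raw channel codes into a cleaner grouping
  case when channel in ('fb', 'ig') then 'social' else channel end AS channel_group
from {{ source('web_traffic', 'events') }}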

For example, let’s say you have a raw data source that looks like this:

select 
  id,
  user_id, 
  created_date,
  source
from web_traffic.first_visits

To make this easier to understand and reference in my downstream models, I would write my base model for this using dbt like so:

select 
  id AS first_visit_id,
  user_id, 
  CAST(created_date AS date) AS created_date,
  created_date::timestamp_ntz AS created_at,
  source AS referrer_source
from {{ source('web_traffic', 'first_visits') }}

I renamed id to first_visit_id so that it references the table name, which makes future joins easier to read.

I cast created_date to an actual date, and also used that column to create created_at, cast to the correct timestamp type.

Lastly, I changed the source column name to be more descriptive.
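One note on the {{ source() }} function used in that base model: dbt resolves it from a source definition declared in a YAML file in your project. A minimal version covering the example above might look like the following; the file location and any additional properties, such as descriptions or tests, are up to you:

version: 2

sources:
  - name: web_traffic
    tables:
      - name: first_visits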

Identify common code between your data models.

Now that you’ve created a base model for each of your raw data sources, you want to dig deeper into the more complex data models that already exist. If you aren’t using a transformation tool like dbt, chances are you have long files of SQL code written for each model. Read through each of these carefully and see if you can find code that is repeated in multiple different files.

This repeated code could be a currency conversion calculation used in multiple places, mapping logic, or tables that are frequently joined together.

Let’s look at two separate models.

The first is looking for all promo orders to the state of New York in the month of October.

-- model looking for all promo orders to the state of New York in the month of October
with order_information_joined AS (
  select
    orders.order_id,
    orders.ordered_at, 
    order_types.order_type_name, 
    order_addresses.city, 
    states.state_name
  from orders 
  left join order_types 
    on orders.order_type_id = order_types.order_type_id 
  left join order_addresses 
    on orders.order_address_id = order_addresses.order_address_id 
  left join states 
    on order_addresses.state_id = states.state_id 
) 

select 
  order_id 
from order_information_joined 
where CAST(ordered_at AS date) between '2022-10-01' and '2022-10-31' 
  and order_type_name = 'promo' 
  and state_name = 'New York'

The second is finding the number of subscription orders placed in 2022.

-- model finding the number of subscription orders placed in 2022
with subscription_orders AS (
  select * 
  from orders 
  where order_type_id = 1
),

order_information_joined AS (
  select
    subscription_orders.order_id,
    subscription_orders.ordered_at, 
    order_types.order_type_name, 
    order_addresses.city, 
    states.state_name
  from subscription_orders 
  left join order_types 
    on subscription_orders.order_type_id = order_types.order_type_id 
  left join order_addresses 
    on subscription_orders.order_address_id = order_addresses.order_address_id 
  left join states 
    on order_addresses.state_id = states.state_id 
)

select count(*) 
from order_information_joined 
where YEAR(ordered_at) = 2022

What do these models have in common? Both join the orders, order_types, order_addresses, and states tables together. Because this code is used at least twice, chances are it’s a helpful query to write as its own model. This way, whenever it is needed, it can simply be referenced in another model someone is writing.

Also, notice how the second model filters the orders table before joining it to the other tables. In our new model, we wouldn’t want to do this because it would limit where the model can be used. Instead, you will be able to filter on order_type_id when referencing the new model.

Write this code as its own data model.

Now that you’ve identified code that is repeated in multiple data models, you want to turn it into its own data model. This is what makes your models modular!

By turning repeated code into its own model, you are saving on compute costs and run time. Your models will run a lot faster when they’re written this way! You are no longer running the same code multiple times, but rather just once and then referencing its output in different models.
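If you use dbt, one way to realize those savings is to materialize the shared model as a table so its output is computed once and downstream models simply read the stored result. This is a sketch of the idea rather than a required setting; an incremental materialization may fit better for very large tables:

-- materialize the shared model as a table so downstream models read
-- precomputed results instead of re-running the joins each time
{{ config(materialized='table') }}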

Be sure to take out any pieces of code that are specific to a certain data model. For example, you may be filtering one model for new users but all of the other models look at all users. Instead of including the filter in this modular data model, add it to the code of the one specific model where you reference the modular model.

The goal is for these pieces to be generic and reusable. There’s always room to add the specifics back when you reference the model again.

Another thing to note: make sure you give the model a descriptive name. You want others to be able to read the name of the model and know what it does. This way, they can also reference it in their code rather than rewriting the code all over again. One of the big benefits of writing modular data models is the reduction of redundancy in the data team’s work. Modular models make it easy to use the code your teammates have already written, saving you time and effort!

If we look at the example above, I would call the data model order_details_joined_location. This signifies that I am joining different order-related tables to location-related tables. The model would look like this:

select
  orders.order_id,
  orders.user_id, 
  orders.product_quantity,
  orders.ordered_at, 
  order_types.order_type_id, 
  order_types.order_type_name, 
  order_addresses.street_line_1, 
  order_addresses.street_line_2, 
  order_addresses.city,
  order_addresses.zipcode,
  order_addresses.country,
  states.state_id,
  states.state_name
from orders 
left join order_types 
  on orders.order_type_id = order_types.order_type_id 
left join order_addresses 
  on orders.order_address_id = order_addresses.order_address_id 
left join states 
  on order_addresses.state_id = states.state_id

Notice how I included all of the columns from each table rather than just the ones the specific models from earlier used. This allows for more flexibility in where the model can be used.

Reference this model in the other models.

Now, after you pull out the shared code and make it its own data model, you can reference this new model in your original two models. dbt makes this easy by allowing you to use a simple {{ ref() }} function to call other data models.

Rewriting the first model to reference our new order_details_joined_location model would look like this:

select 
  order_id 
from {{ ref('order_details_joined_location') }}
where CAST(ordered_at AS date) between '2022-10-01' and '2022-10-31' 
  and order_type_name = 'promo' 
  and state_name = 'New York'

The second would look like this:

select 
  count(*)
from {{ ref('order_details_joined_location') }}
where order_type_id = 1
  and YEAR(ordered_at) = 2022

Now we have two simple models that produce the same output as before but reference another model that can now be used in many other models moving forward.

Conclusion

Modularity is a key characteristic of a set of data models that will stand the test of time. When data models are modular, you reduce compute costs, decrease the time it takes to run your entire pipeline, and make your team’s life easier by creating a collaborative coding process. The future of data modeling is modularity. There’s a reason that dbt has become such a popular data transformation tool! They understand this need and they solve the problem well.

For more on analytics engineering, subscribe to my free weekly newsletter where I share learning resources, tutorials, best practices, and more.

Check out my first ebook, The ABCs of Analytics Engineering, an all-in-one guide to getting started in the role of analytics engineering.
