Recently, I’ve been thinking a lot about dimensional modeling, specifically how we represent different kinds of history in the warehouse / lakehouse. There are many articles that describe how to build an SCD2 table across many languages and platforms. Instead, I want to focus on something more nuanced and less commonly discussed: the semantics of SCD2 and how various design choices have meaningful consequences on use cases.
The two major topics I want to cover are:
- The dates you choose to row-version your dimensions matter quite a bit. The choice should never be arbitrary, and your most common use cases should be top-of-mind in your design.
- How you row-version records will determine the access patterns against your tables. To some extent this is strictly ergonomic, but I would argue that ergonomics are an important aspect of data quality; making it easy for users to do the right thing should be our goal as data modelers.
Choosing reference dates
The most common pattern for creating an SCD2 table is to use some date or timestamp in your data. Once you’ve established that a row has changed meaningfully, either via direct comparison of columns or comparison of hash values, you will have to establish dates to "retire" existing records and insert new records.
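The hash-comparison flavor of change detection can be sketched in a few lines of Python; the column names and values here are hypothetical, chosen only for illustration:

```python
import hashlib

def row_hash(row: dict, tracked_cols: list[str]) -> str:
    """Hash the tracked columns so changed rows can be detected cheaply."""
    # A delimiter guards against collisions like ("ab", "c") vs ("a", "bc").
    payload = "||".join(str(row.get(col)) for col in tracked_cols)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

tracked = ["customer_id", "plan", "contract_value"]
existing = {"customer_id": 42, "plan": "basic", "contract_value": 1000}
incoming = {"customer_id": 42, "plan": "premium", "contract_value": 1200}

# A differing hash means the record changed meaningfully on a tracked column.
changed = row_hash(existing, tracked) != row_hash(incoming, tracked)
```

Comparing one hash per row is often cheaper and less error-prone than comparing every tracked column individually, especially for wide dimensions.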
But which dates do we use? For many types of data, we’ll be able to choose from one of three options:
- Extract timestamps
- Source system timestamps
- Business (entity / event) timestamps
Let’s dive in.
Extract timestamps
This method takes the perspective of, "What the raw data looked like when we captured it." The source of truth is your warehouse and the processes that load it, as opposed to any essential attributes of the data itself.
Consider the following example:
From this record, we would generate the following SCD2 record (omitting some metadata fields like a surrogate key, hash, dimension insertion timestamp, etc. for readability):
Nothing super exciting so far; the big thing to note is that our record is valid from the time we extracted, not some other date. So, what does this look like when we get a new record? For example:
This new record will generate the following changes in our dimension table:
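The retire-and-insert step under the extract-timestamp approach can be sketched as follows; the contract fields, statuses, and timestamps are hypothetical stand-ins for the example records above:

```python
from datetime import datetime

# Hypothetical dimension rows keyed by contract_id; column names are assumptions.
dim_contracts = [
    {"contract_id": 1, "status": "pending",
     "valid_from": datetime(2023, 1, 1, 1, 0, 0), "valid_to": None},
]

def apply_change(dim, new_row, extracted_at):
    """Retire the current version and insert the new one, bounded by extract time."""
    for row in dim:
        if row["contract_id"] == new_row["contract_id"] and row["valid_to"] is None:
            row["valid_to"] = extracted_at  # retire the old version
    dim.append({**new_row, "valid_from": extracted_at, "valid_to": None})

apply_change(
    dim_contracts,
    {"contract_id": 1, "status": "active"},
    extracted_at=datetime(2023, 1, 2, 1, 0, 0),
)
# The old row is now bounded by the extract time; the new row is open-ended.
```

Note that both boundaries come from warehouse load times, which is exactly the semantic trade-off discussed below.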
We’ll talk a bit more about our choice of `valid_to` and `valid_from` timestamps later, but for now, let’s focus on the semantics of our table. When a user queries `dim_contracts` using our valid metadata dates, what exactly does that mean?
```sql
select *
from dim_contracts
where
    valid_from <= '2023-01-01 01:00:00'
    and (valid_to > '2023-01-01 01:00:00'
         or valid_to is null)
```
As I mentioned before, this approach considers the landing of data in the warehouse to be our primary point of reference. Our users will need to know (and express to their consumers) that records are valid not according to business processes or even source system processes, but rather data warehouse processes. Personally, I think this approach can get in the way of user intuition, and I would rather use one of the other methods; however, some sources may not have other timestamps available, in which case, you’re stuck with extract timestamps. Just make sure you have good onboarding and documentation so that your users can interpret results correctly.
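To make the point-in-time predicate concrete, here is a runnable sketch using SQLite; the two versions of the contract are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dim_contracts (
        contract_id integer, status text, valid_from text, valid_to text
    );
    -- Hypothetical versions of one contract, bounded by extract timestamps.
    insert into dim_contracts values
        (1, 'pending', '2023-01-01 01:00:00', '2023-01-02 01:00:00'),
        (1, 'active',  '2023-01-02 01:00:00', null);
""")

point_in_time = "2023-01-01 01:00:00"
rows = conn.execute(
    """
    select status from dim_contracts
    where valid_from <= ?
      and (valid_to > ? or valid_to is null)
    """,
    (point_in_time, point_in_time),
).fetchall()
# rows holds only the version that was current at the chosen extract time.
```

The strict `>` on `valid_to` is what prevents the retiring row and its replacement from both matching at the boundary timestamp.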
Source system timestamps
This method takes the perspective of, "What the raw data looked like when the source system created or updated it." Extending our example from above, let’s look at the table after our insert and update:
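One way to picture the shift: derive each version’s bounds from the source system’s update timestamp rather than the extract time. The `updated_at` field and its values below are hypothetical:

```python
from datetime import datetime

# Hypothetical source rows for one contract; updated_at is the source system's clock.
source_history = [
    {"contract_id": 1, "status": "pending", "updated_at": datetime(2023, 1, 1, 0, 30)},
    {"contract_id": 1, "status": "active",  "updated_at": datetime(2023, 1, 2, 0, 45)},
]

# Bound each version by source timestamps, not by when the warehouse loaded it.
versions = []
for current, nxt in zip(source_history, source_history[1:] + [None]):
    versions.append({
        "contract_id": current["contract_id"],
        "status": current["status"],
        "valid_from": current["updated_at"],
        "valid_to": nxt["updated_at"] if nxt else None,  # open-ended current version
    })
```

The resulting rows carry the same changes as the extract-timestamp version; only the meaning of the boundaries differs.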
Fundamentally, we’re representing the same changes here as we did above; the key difference is that our semantics have shifted subtly. Instead of considering the record validity with respect to when we landed the data in the warehouse, now we’re concerned with when the source system updated these records. The advantage here is that users have an easier story to tell; records are valid when they were valid in the source system!
This method can be especially valuable for source system history tables that aren’t true dimensions. You’ll have a representation of the source system that can be referenced when building other dim and fact tables, a necessity if bitemporal history¹ is important to your analytics capabilities.
Business timestamps
This approach takes the perspective of, "What the business entity looked like in relation to a business date." A key advantage here is that users can join directly on the table with business dates; the effective dates map to "real-world" events, such as when an invoice was created or a contract became effective. These kinds of tables give you the full business history of an entity without the overhead of having to know how the source system generated records or how the warehouse captured them.
Let’s consider what a dimension looks like if we use business dates to bound our records:
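A minimal sketch of a business-dated version and an as-of lookup, assuming hypothetical effective dates that match the contract term:

```python
from datetime import date

# Hypothetical contract version bounded by business effective dates, not system times.
dim_contracts = [
    {"contract_id": 1, "status": "active",
     "valid_from": date(2023, 1, 1), "valid_to": date(2023, 12, 31)},
]

def version_as_of(dim, contract_id, business_date):
    """Return the version in force on a business date (inclusive bounds)."""
    for row in dim:
        if (row["contract_id"] == contract_id
                and row["valid_from"] <= business_date
                and (row["valid_to"] is None or business_date <= row["valid_to"])):
            return row
    return None

in_term = version_as_of(dim_contracts, 1, date(2023, 6, 15))   # matches
after_term = version_as_of(dim_contracts, 1, date(2024, 1, 1))  # no version in force
```

Users can pass any business date, such as an invoice date or contract effective date, straight into the lookup with no knowledge of load or source-system timing.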
In this case, we can actually provide a firm date boundary for the record, since the contract will no longer be effective on 2023-12-31, and thus the record shouldn’t be effective from that date forward, either. This approach can be extremely intuitive for users, as the record reflects the business process.
This simplicity does come at a cost. Unlike the other methods, you will need to give more thought to backdating and post-hoc corrections. Will you perform "Type 1" updates (i.e., overwrite the affected fields for the records active over the backdated dates), or will you maintain these "incorrect" records for audit purposes but somehow indicate they aren’t active? If the latter, how will you help users avoid these records for everyday analytics? Views? Entries in the data dictionary? Sample queries?
Choosing the format of valid_to and valid_from
In our examples, we used a popular strategy for picking our record effective dates based on some update column. dbt snapshots provide this functionality out of the box via their timestamp strategy.
The subtle note on usage is that when the `valid_to` of the "old" record and the `valid_from` of the record that’s replacing it are equal, our query patterns require a strict inequality, as seen above.
If instead you go through the additional step of offsetting these dates / timestamps, you can use slightly different and perhaps more ergonomic access patterns:
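The offsetting step itself is trivial; a sketch, assuming second-level timestamp granularity:

```python
from datetime import datetime, timedelta

# When retiring a row, close it one second before the replacement becomes valid,
# so inclusive (BETWEEN-style) predicates never match two versions at once.
new_valid_from = datetime(2023, 1, 2, 1, 0, 0)
retired_valid_to = new_valid_from - timedelta(seconds=1)  # 2023-01-02 00:59:59
```

The offset unit should match your timestamp precision; with millisecond data, subtract a millisecond instead, or a gap can hide records that changed within the same second.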
This small change allows us to query the data in two different ways:
```sql
select *
from dim_contracts
where
    valid_from <= '2023-01-01 00:00:01'
    and (valid_to >= '2023-01-01 00:00:01'
         or valid_to is null)
```
Or:
```sql
select *
from dim_contracts
where
    '2023-01-01 00:00:01' between
        valid_from and valid_to
```
Our latter example, however, only works for retired records; "current" records will fail due to the NULL value in `valid_to`. One popular method for getting around this is to COALESCE() the NULL with some far-future date, like 2999-12-31 or 9999-12-31. Whether this additional work is worth it to enable these query patterns is up to you and your users.
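A runnable sketch of the COALESCE() pattern paired with BETWEEN, using SQLite and hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dim_contracts (
        contract_id integer, status text, valid_from text, valid_to text
    );
    -- Offset boundaries: the old row closes one second before the new one opens.
    insert into dim_contracts values
        (1, 'pending', '2023-01-01 00:00:00', '2023-01-01 23:59:59'),
        (1, 'active',  '2023-01-02 00:00:00', null);
""")

rows = conn.execute(
    """
    select status from dim_contracts
    where ? between valid_from and coalesce(valid_to, '9999-12-31')
    """,
    ("2023-01-02 12:00:00",),
).fetchall()
# rows holds the currently open version, thanks to the coalesced sentinel date.
```

The sentinel date makes the open-ended row behave like any other bounded row, at the cost of a magic value users must learn to recognize.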
Bonus Round: SCD2 vs. dimensional snapshots
As you can see, SCD2 introduces a lot of complexity to your data models, and there’s an open question whether this modeling exercise is always worth it. In one of data engineering’s most seminal works², Maxime Beauchemin discusses this idea in some depth. To summarize, should we be maintaining complex logic to retire existing rows and insert new ones, or is it worth the storage overhead to simply snapshot our dimension tables on a daily basis?
This is a math problem. You can figure out how many records your dims contain, how wide the tables are, and then calculate the monthly storage costs. If those costs are acceptable, you can greatly simplify your design; just add a snapshot at the end of your dim update process, and you’re done. Users will love it, too, since they can join directly on a snapshot date rather than trying to wrestle with effective date ranges.
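A back-of-envelope version of that math, with every figure assumed purely for illustration:

```python
# Hypothetical inputs: a 5M-row dimension, ~200 bytes/row compressed,
# one snapshot per day, priced at commodity object-storage rates.
rows_per_snapshot = 5_000_000
bytes_per_row = 200
snapshots_per_month = 30
cost_per_gb_month = 0.023

# Storage added by one month of daily snapshots, and its incremental cost.
gb_per_month_added = rows_per_snapshot * bytes_per_row * snapshots_per_month / 1e9
monthly_cost_of_new_snapshots = gb_per_month_added * cost_per_gb_month
# roughly 30 GB added per month, well under a dollar at these assumed rates
```

Remember that snapshots accumulate, so total cost grows over time; partition pruning and retention policies on old snapshots keep both cost and query scans in check.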
Wrapping Up
Dimensional modeling is a powerful tool in any data engineer’s or analytics engineer’s toolbox. Being able to track history is crucial to certain analytics use cases, and history can provide you with valuable insights into operational workflows. While there are many different ways you can approach SCD2, you need to be conscious of the decisions you make. These small changes can seem abstract and inconsequential, but in actual usage, these distinctions will become crystal clear. The first time you have to explain why a "missing" record isn’t actually missing, just not valid when a user expects it to be, you’ll know exactly how important these choices are.
¹ Martin Fowler. (April 7, 2021). Bitemporal History. https://martinfowler.com/articles/bitemporal-history.html
² Maxime Beauchemin. (January 18, 2018). Functional Data Engineering – a modern paradigm for batch data processing. https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a