For a long time, the Kimball method has been a standard for dimensional data modeling. As Kimball puts it, "The notion of time pervades every corner of the data warehouse." What does this mean in the context of data analytics? At a high level, modern analytics can be seen as the aggregation of constantly changing data over time. The problem is that constantly changing data includes not only new additions but also changes to previously delivered datasets.
Overall dimensional data modeling groups data into two major categories:
Facts – This data represents ever-growing data sets that store measurements of business events. It contains the data essential for quantitative analysis and decision-making. Fact tables frequently contain columns that join to other tables (dimensions) for reference.
Dimensions – This data represents relatively limited data sets that provide descriptive information about the measurements stored in the fact tables. Compared to fact tables, dimensions evolve at a much slower pace, which is why they are commonly referred to as "slowly changing dimensions".
Kimball’s approach involves the creation of a star schema based on facts and dimensions. Because of its denormalized structure, the star schema is well suited for analytics use cases: there is no need for complex join conditions. For this reason, the star schema has for many years been the de facto standard for modeling in traditional data warehouses.

For many years, data practitioners have faced the challenge of handling slowly changing dimensions without losing their previous history, while preserving the relational references to fact tables. The Kimball method proposes several techniques for dealing effectively with slowly changing dimensions, or SCDs for short. In practice, once a particular SCD method is chosen, it is relatively easy to implement in a data warehouse, where support for SQL and ACID transactions makes updates straightforward.
Unfortunately, implementing the same in the data lake is a different story. There are a couple of reasons for this:
- The first problem is immutability. As per best practices, data in a data lake should not be changed.
- Secondly, for several years it was not possible to perform atomic writes in a data lake. This meant you had to rewrite the entire table even for a minor edit.
The Delta Lake framework addresses the problems above. Support for ACID (atomicity, consistency, isolation, and durability) transactions now makes it possible to implement SCD with the same ease as a data warehouse. In this article, we will learn how to implement the most common methods for addressing slowly changing dimensions using the Delta Lake framework.
Consider an example case scenario below:
"A company wants to keep track of the customer dimension for changes happening over time. They have requested their data engineering group to suggest a few alternatives. After careful consideration, data engineering has come up with three options to manage the slowly changing dimensions: SCD Type 1, SCD Type 2, and SCD Type 3."
Before we move on to each option, let us understand the data structure of the customer dimension. Throughout this article, we will use the sample datasets below, which show some sample customer records. To explain the different options for dealing with slowly changing dimensions, our focus will remain on the record for the customer named Magee Cash.

Magee Cash has recently changed her address. Change records are delivered as CDC (change data capture) records to the OLAP system. In the context of data engineering, the CDC process captures incremental data sets from sources and merges them into enterprise data lakes. Following is the change record for Magee Cash; notice that the address differs from the original record above.
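The original article shows these records as screenshots, which are not reproduced here. As a stand-in, the following is a hypothetical version of the customer rows and the CDC change record; the column names and the second customer are assumptions for illustration only.

```python
# Hypothetical sample of the bronze-layer customer dataset.
# Column names (customer_id, name, address) are assumptions; the
# original article shows these rows as a screenshot.
customers_bronze = [
    {"customer_id": 1001, "name": "Magee Cash",
     "address": "4522 Vincent Place"},
    {"customer_id": 1002, "name": "Another Customer",
     "address": "87 Poplar Street"},
]

# CDC change record delivered to the OLAP system: Magee Cash has
# moved, so the address differs from the bronze record above.
change_record = {"customer_id": 1001, "name": "Magee Cash",
                 "address": "9833 Crescent Avenue"}
```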

The core capabilities of Delta Lake make it an extremely suitable platform for building a modern data lakehouse architecture. In the Lakehouse architecture, Delta Lake can be used for merging change records into a common data layer (silver). Once created, the silver layer acts as the foundational data layer for your analytical workloads, including BI, data science, machine learning, and artificial intelligence. For this reason, the silver layer is often referred to as the "single source of truth".
Let’s get back to the core objective of this article. Now that we have a clear understanding of the datasets, we are ready to explore the first SCD method.
SCD Type 1
This type is often referred to as the "Overwrite" method: any change to dimension data simply overwrites the previous state of the data with the same key. Although very simple to implement, this method suffers from a major drawback. Due to the overwrite mechanism, you lose not only the previous history of the dimension but also the historical context of the fact table rows that reference it. The before and after image of the customer dimension using the SCD Type 1 method is shown below.

Notice how the new home address simply overwrites the previous address; the history of the previous address is lost. The repercussions of losing history can be severe when a fact table aggregation is affected by the change in the dimension. In such a case, without the history, it becomes extremely difficult to trace why the aggregation value changed.
We will now learn how SCD Type 1 can be implemented using the Delta Lake framework. Start by creating the silver layer customer dimension table (_customer_silver_scd1_) using the raw customer dataset in the bronze layer of the Lakehouse.

Create a new dataframe using the change record for Magee Cash.

Finally, merge the change of address record into the _customer_silver_scd1_ silver layer dimension table.
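The article's original listing (a screenshot) performs this step with Delta Lake's MERGE API (`DeltaTable.merge(...)` with `whenMatchedUpdateAll()` and `whenNotMatchedInsertAll()`). Since that listing is not reproduced here, the following pure-Python sketch illustrates the same SCD Type 1 upsert semantics; the key and column names are assumptions.

```python
def scd1_merge(dimension, changes, key="customer_id"):
    """SCD Type 1: overwrite matching rows, insert unmatched ones.
    Mirrors the semantics of a Delta Lake MERGE that updates all
    columns on match and inserts all columns otherwise."""
    by_key = {row[key]: dict(row) for row in dimension}
    for change in changes:
        by_key[change[key]] = dict(change)  # overwrite or insert
    return list(by_key.values())

# Hypothetical silver-layer row and change record.
silver = [{"customer_id": 1001, "name": "Magee Cash",
           "address": "4522 Vincent Place"}]
changes = [{"customer_id": 1001, "name": "Magee Cash",
            "address": "9833 Crescent Avenue"}]
merged = scd1_merge(silver, changes)
# The previous address is gone -- the core drawback of SCD Type 1.
```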

After querying the silver layer dimension table, you will notice that the change of address has overwritten the previous state. The problem is that the previous state of this record is nowhere to be seen.
Consider a scenario where Magee Cash may have placed an e-commerce order using the previous version of the address. The product has not been shipped yet, but the address has changed in the meantime. Where should the product be shipped: the old address or the new one?

Let me introduce you to a very useful feature of the Delta Lake framework. Delta Lake maintains a chronological history of changes, including inserts, updates, and deletes. In the example above, version 0 of the table was generated when the _customer_silver_scd1_ silver layer table was created. Similarly, version 1 was created when we merged the change of address record. In addition, Delta Lake tables can easily be restored to any previous version as desired.
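Delta Lake implements this versioning natively through its transaction log ("time travel"). To illustrate just the concept, here is a toy pure-Python sketch of a table that keeps every committed version and can be restored to any of them; it is not how Delta stores data, only an analogy.

```python
# Conceptual sketch of table versioning ("time travel").
# Delta Lake does this via its transaction log; this toy class only
# illustrates the idea of restoring a table to an earlier version.
class VersionedTable:
    def __init__(self, rows):
        self.versions = [list(rows)]      # version 0: initial load
    def commit(self, rows):
        self.versions.append(list(rows))  # each merge adds a version
    def restore(self, version):
        return list(self.versions[version])

table = VersionedTable([{"customer_id": 1001,
                         "address": "4522 Vincent Place"}])
# The SCD Type 1 merge produces version 1 with the overwritten address.
table.commit([{"customer_id": 1001,
               "address": "9833 Crescent Avenue"}])
original = table.restore(0)  # recover the pre-merge state
```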

Due to the deficiencies cited above, SCD Type 1 is rarely used in modern data platforms. We therefore need better methods, ones that let us change dimensions while preserving the previous references for active usage. In short, use SCD Type 1 only if your computations do not depend on the previous state of the data or the repercussions a change causes down the line.
SCD Type 2
Also known as the "Add a New Record" method. Here, the change record is added as a new record to the dimension table and marked as "Current" or "Active", while the previous version of the record is marked as "Expired" or "Inactive". The various versions (current and historical) of a record are tied together using a surrogate key. At the table level, SCD Type 2 is implemented by adding StartDate and EndDate timestamp columns to the dimension table. Additionally, a Status column is added to mark whether the record is current or expired. The before and after image of the customer dimension using the SCD Type 2 method is shown below.

We will now learn how SCD Type 2 can be implemented using the Delta Lake framework. Start by creating the silver layer customer dimension table (_customer_silver_scd2_) using the raw customer dataset in the bronze layer of the Lakehouse.

Now merge the change of address record into the _customer_silver_scd2_ silver layer dimension table.
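The original listing for this merge is a screenshot and is not reproduced here. As a stand-in, the following pure-Python sketch shows the SCD Type 2 semantics the merge must implement: expire the current row and insert the change as the new current row. The column names (status, startdate, enddate) follow the article's description; everything else is an assumption.

```python
def scd2_merge(dimension, change, key="customer_id",
               today="2021-06-15"):
    """SCD Type 2: mark the matching current row as Expired and
    insert the change record as the new Current row, stamping
    startdate/enddate so the versions chain chronologically."""
    result = []
    for row in dimension:
        row = dict(row)
        if row[key] == change[key] and row["status"] == "Current":
            row["status"] = "Expired"   # expire the previous version
            row["enddate"] = today
        result.append(row)
    new_row = dict(change)
    new_row.update(status="Current", startdate=today, enddate=None)
    result.append(new_row)
    return result

silver = [{"customer_id": 1001, "address": "4522 Vincent Place",
           "status": "Current", "startdate": "2020-01-01",
           "enddate": None}]
change = {"customer_id": 1001, "address": "9833 Crescent Avenue"}
merged = scd2_merge(silver, change)
```

In a real Delta Lake implementation both steps would go through a single MERGE statement so the expiry and the insert commit atomically.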

Notice that the previous record was marked as Expired and its enddate was updated. Also, a new record with the latest address was inserted, with a startdate equal to the enddate of the previous record. Using this approach, Magee Cash will surely get her e-commerce order shipped to the correct address.

Using the SCD Type 2 approach, you can track the full history of changes over time while maintaining chronologically correct references to fact tables. I must admit that the implementation is a little tricky compared to SCD Type 1.
As a cautionary note, the application that maintains the dimension table needs to add the new current version and expire the previous version in one single transaction. Also, every query against the dimension table needs to filter on status = Current.
There is a simpler alternative. Next, we explore yet another method, which in some ways is simply an extension of the SCD Type 1 method.
SCD Type 3
Also known as the "Add a New Field" method. For every change, the prior version and the current version are stored as two distinct columns within the same row of the dimension table. SCD Type 3 is easier to implement than SCD Type 2, but the history covers only the current and previous versions.

We will now learn how SCD Type 3 can be implemented using the Delta Lake framework. Start by creating the silver layer customer dimension table (_customer_silver_scd3_) using the raw customer dataset in the bronze layer of the Lakehouse.

Notice that every tracked column in the dimension table maintains a current and a previous state. When the dimension table is created, the current state is populated with the latest data, while the previous state is left blank.
Now merge the change of address record into the _customer_silver_scd3_ silver layer dimension table.
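Again, the original merge listing is a screenshot. The pure-Python sketch below illustrates the SCD Type 3 semantics that the merge implements: move the current value of each tracked column into its previous_ counterpart, then overwrite with the new value. Column names follow the article's description; the helper itself is hypothetical.

```python
def scd3_merge(dimension, change, key="customer_id",
               tracked=("address",), modified="2021-06-15"):
    """SCD Type 3: for each tracked column, shift the current value
    into previous_<column>, overwrite with the changed value, and
    stamp modifieddate to record when the shift happened."""
    result = []
    for row in dimension:
        row = dict(row)
        if row[key] == change[key]:
            for col in tracked:
                row["previous_" + col] = row[col]  # keep one prior version
                row[col] = change[col]
            row["modifieddate"] = modified
        result.append(row)
    return result

silver = [{"customer_id": 1001, "address": "4522 Vincent Place",
           "previous_address": None, "modifieddate": "2020-01-01"}]
change = {"customer_id": 1001, "address": "9833 Crescent Avenue"}
merged = scd3_merge(silver, change)
```

Note that a second change would overwrite previous_address, so only one generation of history survives: the defining limitation of SCD Type 3.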

Moving ahead, let's check the state of the record after the Delta Lake merge.

Notice that the address field is now populated with the changed value and the previous version of the address has been moved to the _previous_address_ field. Similarly, the modifieddate field has been updated to maintain a change chronology.
The fact that only a limited amount of history is available makes SCD Type 3 suitable for a narrower set of use cases, but its ease of implementation makes it somewhat desirable. It's a good trade-off if you cannot accept the loss of history in SCD Type 1 but find SCD Type 2 hard to implement and manage.
All the code used in this article can be found at the link below:
In many respects, SCD Type 2 is considered the primary technique for implementing slowly changing dimensions. It should be clearly understood that the primary objective of SCD is not merely to store the history of records over time; rather, it is to maintain an accurate association with the fact tables. Slowly changing dimensions also require you to update records, which in general goes against the immutable nature of the data lake. However, advances made by frameworks like Delta Lake have made it possible to implement SCD scenarios with ease and simplicity.
I hope this article was helpful. SCD is covered as part of the AWS Big Data Analytics course offered by Datafence Cloud Academy. I teach the course online on weekends.