The world’s leading publication for data science, AI, and ML professionals.

It’s Time to Change Dimensions

To everything there a season, and a time to every purpose… Let's keep track of those changes with a slowly changing dimension (SCD).

Photo by Anthony Garand on Unsplash
Photo by Anthony Garand on Unsplash

To everything there a season, and a time to every purpose… Let’s keep track of those changes with a slowly changing dimension (SCD).

What is the slowly changing dimension? It is a table that changes gradually over time, like a group of elected officials. Suppose we have the following set of US Senators loaded into our database.

https://theunitedstates.io/congress-legislators/legislators-current.csv

Even though many have been in office for a long time, some are about to change. According to our constitution Article I, Section 3, clause 2, one-third of these senators will be subject to change.

Immediately after [the Senate of the United States] shall be assembled in Consequence of the first Election, they shall be divided as equally as may be into three classes. The Seats of the Senators of the first Class shall be vacated at the Expiration of the second Year, of the second Class at the Expiration of the fourth Year, and of the third Class at the Expiration of the sixth Year, so that one third may be chosen every second Year

There are many ways to implement a slowly changing dimension. Keeping a simple pattern can simplify the process of deciding how you will change.

Determine the columns that identify a unique row and the columns you want to track.

For example, I want to have a unique senator for each state and Class. State senators do not overlap by Class so that only one can be up for election in an election cycle for a given State. Right now, Georgia is an exception, but we won’t go into that.

Say we have staged the raw data for the above CSV file into a table. To process the raw data into a table for the current senate, we insert rows into a dimension table.

Note that I’ve identified the unique rows as my MATCH_GUID and the CHANGE_HASH as has values. I use the hash to minimize CPU processing time on SQL.

The MATCH_GUID encodes any overall changes in the row and will update the dimension with a new record should anything change.

insert into OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT(CONGRESS_SEAT_GUID,
                                               LAST_NAME,
                                               FIRST_NAME,
                                               FULL_NAME,
                                               BIRTHDAY,
                                               GENDER,
                                               TYPE,
                                               STATE,
                                               DISTRICT,
                                               SENATE_CLASS,
                                               PARTY,
                                               ADDRESS,
                                               PHONE,
                                               CONTACT_FORM,
                                               DW_MATCH_HASH)
with source as (
    SELECT hash(STATE,
                SENATE_CLASS)    MATCH_GUID,
           LAST_NAME,
           FIRST_NAME,
           FULL_NAME,
           BIRTHDAY,
           GENDER,
           TYPE,
           STATE,
           DISTRICT,
           SENATE_CLASS,
           PARTY,
           ADDRESS,
           PHONE,
           CONTACT_FORM,

           hash(
                   LAST_NAME,
                   FIRST_NAME,
                   FULL_NAME,
                   BIRTHDAY,
                   GENDER,
                   TYPE,
                   DISTRICT,
                   PARTY,
                   ADDRESS,
                   PHONE,
                   CONTACT_FORM) CHANGE_HASH
    from OWLMTN.STAGE.CONGRESS_MEMBERS sor
    where type = 'sen'
),
     target as (
         select CONGRESS_SEAT_GUID,
                DW_MATCH_HASH match_hash
         from OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT tar
         where DW_ACTIVE = 'T'
     )
select source.MATCH_GUID,
       source.LAST_NAME,
       source.FIRST_NAME,
       source.FULL_NAME,
       source.BIRTHDAY,
       source.GENDER,
       source.TYPE,
       source.STATE,
       source.DISTRICT,
       source.SENATE_CLASS,
       source.PARTY,
       source.ADDRESS,
       source.PHONE,
       source.CONTACT_FORM,
       source.CHANGE_HASH
from source
         left outer join target
                         on target.CONGRESS_SEAT_GUID = source.MATCH_GUID
where target.match_hash is null
   or target.match_hash <> source.CHANGE_HASH;

The update query will find any new rows and mark them as active.

-- Arrange the dimensions on the SCD.
update OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT new_data
    set new_data.DW_ACTIVE = update_logic.DW_ACTIVE,
        new_data.DW_FROM_DATE = update_logic.new_from_dw_date,
        new_data.DW_TO_DATE = update_logic.new_to_dw_date,
        DW_UPDATE_DATE = current_timestamp()
from (

    with updated_gui as (
        select DIM_CONGRESS_SEAT_KEY, CONGRESS_SEAT_GUID
        from OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT
        where dw_from_date is null
        )
    select current_row.DIM_CONGRESS_SEAT_KEY,
           current_row.dw_active as old_active,
           case when current_row.DW_FROM_DATE is null
               then TRUE
               else FALSE end as DW_ACTIVE,

           current_row.DW_FROM_DATE,
            case when current_row.DW_FROM_DATE is null
               then CURRENT_TIMESTAMP()
               else current_row.DW_FROM_DATE end as new_from_dw_date,

           current_row.DW_TO_DATE,

           case when current_row.DW_FROM_DATE is null
                 then to_timestamp_ltz('2099-12-31 00:00:00')
               else CURRENT_TIMESTAMP() end as new_to_dw_date,
           current_row.DW_MATCH_HASH
        from updated_gui
        inner join OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT current_row
            on updated_gui.CONGRESS_SEAT_GUID = current_row.CONGRESS_SEAT_GUID
                   and (DW_FROM_DATE is NULL or current_row.DW_ACTIVE=TRUE)
        left outer join OWLMTN.WAREHOUSE.DIM_CONGRESS_SEAT old
            on current_row.CONGRESS_SEAT_GUID = old.CONGRESS_SEAT_GUID
                   and old.dw_ACTIVE

      ) update_logic
where new_data.DIM_CONGRESS_SEAT_KEY = update_logic.DIM_CONGRESS_SEAT_KEY;

So what happens after November 3, and there is a change in the seat status? Let’s track the election from my favorite state, Colorado, to project a seat change hypothetically.

According to the data projection on Azri.us, Hickenlooper has a 10 point advantage over Gardner, who currently holds the seat.

Image by Google GeoChart based on 538 Raw Data
Image by Google GeoChart based on 538 Raw Data

Viewing the hash comparison in the previous insert query, we see that the MATCH_GUID stays the same (same seat), but the CHANGE_HASH changes (new senator).

SCD Hash Match Strategy
SCD Hash Match Strategy

After the insert detects a changed record, it is not yet active, and we need to activate that record in a transactional manner that sets the old record inactive when the new row replaces it.

Running the query that shifts the date on the currently active row will move the inactive to active.

Items declared unique will always return one row for any query given with a point in time.

If I want to know who was the senator of class 2 for Colorado at a specific time, I can specify the time, and it will never change.

Note that the DW_FROM_DATE is less than the date, and the DW_TO_DATE is greater than or equal to the date.

select CONGRESS_SEAT_GUID, FULL_NAME, DW_MATCH_HASH, DW_ACTIVE, DW_FROM_DATE, DW_TO_DATE
from owlmtn.WAREHOUSE.DIM_CONGRESS_SEAT
where state = 'CO' and SENATE_CLASS = 2
  and DW_FROM_DATE < '2020-10-30 21:51:58.166000000 -07:00'
  and DW_TO_DATE >= '2020-10-30 21:51:58.166000000 -07:00';

Conclusion

A slowly changing dimension table is the foundation of Data Warehousing. It allows the analyst to track trends, to view history, and observes the principle of preserving all data.

Some of the SQL constructs can be tricky, and some databases support SCD natively. However, using a base SQL SCD pattern will give your code portability and clarity.


Related Articles