Dimensional Data Modeling

Why do you need dimensional data modeling and how to implement it?

Aviral Srivastava
Towards Data Science


What is dimensional data modeling?

Dimensional modeling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball, which includes a set of methods, techniques, and concepts for use in data warehouse design. The approach focuses on identifying the key business processes within a business and modeling and implementing these first before adding additional business processes, a bottom-up approach. [1]

What are the objectives of Data Warehouse Modeling?

The objectives, laid out by Ross and Kimball, are straightforward:

  • make information easily accessible
  • present information consistently
  • adapt to change
  • present information in a timely way
  • protect information assets
  • serve as an authoritative and trustworthy foundation for improved decision making (single source of truth in Data Engineering language)
  • the VIPs must accept your system

If you have worked on or used an ETL system, you will have noticed that information consistency is achieved by conforming measures, timeliness is provided by ETL cycles, and adaptability also depends largely on the ETL design.

Why Modeling?

As a data engineer, you know SQL very well and can probably write SQL queries all day long. But you cannot assume that the typical end user is an expert at writing SQL queries. So, our objective is to build a DW that makes it easy for analysts to write analysis queries quickly and effectively.

Things you would not want your analysts to do:

  • Queries based on IDs
  • Cascading outer joins (even you would not want to do that)
  • Grouping or joining multiple subqueries
  • Recursive subqueries (just visit HackerRank SQL and you will understand the pain)
  • Correlated subqueries: fetching data across multiple columns in different subqueries
  • Joins without PK/FK: even for me (1.5 years of DE experience), it's hard to visualize.

Things you could and should expect from your analysts:

  • simple joins
  • columns with meaningful names and comprehensible text
  • simple aggregation
  • analytic window functions
  • DISTINCT

Note that even the above operations are not trivial, and your system should be scalable enough to handle all these kinds of queries.
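As a concrete illustration, here is the kind of query a well-modeled warehouse enables. This is only a sketch: fact_sales, dim_date, and dim_store, along with all of their columns, are hypothetical names invented for the example.

    -- Hypothetical star-schema query: monthly sales per region.
    -- Nothing but simple joins on surrogate keys, readable column
    -- names, and a plain aggregation.
    SELECT
        d.month_name,
        s.region_name,
        SUM(f.sales_amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_date  AS d ON f.date_key  = d.date_key
    JOIN dim_store AS s ON f.store_key = s.store_key
    WHERE d.year_number = 2019
    GROUP BY d.month_name, s.region_name;

No IDs, no cascading outer joins, no correlated subqueries: exactly the query surface you want to hand to analysts.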

To meet the aforementioned objectives, the OLTP databases are transformed into facts and dimensions.

Transitioning from transactional databases to facts

Most businesses measure their success and efficiency by measuring certain types of data. This data captures real business activities and progress. This data is called facts.

OLTP-oriented databases record one transaction at a time, somewhat like event streaming but centered on transactions. The DW is different. A DW does not need to record details at the transactional level. It needs facts across the different criteria of your business, and it needs to aggregate (or let analysts aggregate) the information required to improve the business. And so, redundancy is an unforgivable sin in a DW.

What are measures and why should you fill your fact tables with them?

In a data warehouse, a measure is a property on which calculations can be made.[2]

The facts we derive from the operational data stores come with some additional data which is typically summed in our analysis. These are the aspects of a fact that allow the analyst, or the executive viewing the analysis, to see the value in the fact.
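To make this concrete, here is a minimal sketch of a fact table; the table and column names are assumptions for illustration, not anything prescribed by Ross and Kimball.

    -- Hypothetical fact table, declared grain: one row per product,
    -- per store, per day. The numeric columns are the measures;
    -- every other column is a foreign key to a dimension.
    CREATE TABLE fact_sales (
        date_key      INTEGER NOT NULL REFERENCES dim_date (date_key),
        store_key     INTEGER NOT NULL REFERENCES dim_store (store_key),
        product_key   INTEGER NOT NULL REFERENCES dim_product (product_key),
        quantity_sold INTEGER NOT NULL,         -- measure: additive
        sales_amount  DECIMAL(12, 2) NOT NULL   -- measure: additive
    );

Because both measures are fully additive, analysts can sum them across any combination of the dimensions.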

Maintaining Consistent Grain

Why do you need to maintain a consistent grain?

So that you can ensure that your system can legitimately correlate and aggregate across facts.

But it's not always possible to have data at the atomic level. So, to bridge this gap, there are two methods:

  • Periodic Snapshot Fact Tables
  • Accumulating Snapshot Fact Tables

Periodic Snapshot Fact Tables

As the name suggests, these are collected at regular time intervals. Gas consumption, audits, and inspections are some examples of data collection that lend themselves to periodic snapshots.
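A sketch of what this might look like for the gas-consumption example, assuming a daily grain; all names here (including dim_meter) are invented for illustration.

    -- Hypothetical periodic snapshot: one row per meter per day,
    -- inserted by the nightly ETL cycle whether or not anything changed.
    CREATE TABLE fact_gas_consumption_daily (
        snapshot_date_key INTEGER NOT NULL REFERENCES dim_date (date_key),
        meter_key         INTEGER NOT NULL REFERENCES dim_meter (meter_key),
        volume_consumed   DECIMAL(10, 3) NOT NULL, -- measure for the period
        reading_count     INTEGER NOT NULL         -- measure for the period
    );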

Accumulating Snapshot Fact Tables

When a business performance indicator is the rate of completing a multi-step business process, you may want to capture the entirety of the process at one grain and record the start, the completion, and the steps in between. This may look transaction-grained, but it accumulates a lot of measures along the way. So, you use accumulating snapshot fact tables to answer complex questions in business intelligence where time passes between facts. One good example would be a single fact table row that covers both your ordering a chicken sandwich and the bag being handed through the window of the car at the McDonald's drive-thru.
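Sticking with the drive-thru example, here is a hedged sketch of what such a table could look like: one row per order, with one date key per milestone, updated in place as the process advances. All names are assumptions.

    -- Hypothetical accumulating snapshot: one row per order, revisited
    -- by the ETL as each milestone of the process completes.
    CREATE TABLE fact_order_fulfillment (
        order_key           INTEGER NOT NULL,
        order_placed_key    INTEGER NOT NULL REFERENCES dim_date (date_key),
        food_prepared_key   INTEGER REFERENCES dim_date (date_key), -- NULL until it happens
        order_handed_key    INTEGER REFERENCES dim_date (date_key), -- NULL until it happens
        minutes_to_prepare  INTEGER,  -- lag measure, filled in when known
        minutes_to_hand_off INTEGER   -- lag measure, filled in when known
    );

    -- When the bag goes through the window, the ETL updates the existing row:
    UPDATE fact_order_fulfillment
    SET order_handed_key    = 20190614,
        minutes_to_hand_off = 4
    WHERE order_key = 1001;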

Locating Facts via Dimensions

You and your analysts need to know how to query and filter facts to derive business intelligence from them. This purpose is served by dimensions.

Drawing Dimensions from Reference and Raw Metadata

Dimensions are almost always created with a surrogate key; the surrogate key, naturally, is referenced by a foreign key (or keys) in the fact table. We search the fact table by searching the dimensions in which we are interested. All of the other data describing our facts, such as timestamps, customer agents, store location, product, and customer, is what we turn into dimensions.
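A minimal sketch of such a dimension, with illustrative (not prescribed) names:

    -- Hypothetical dimension: a surrogate key plus descriptive attributes.
    -- The fact table's store_key foreign key points at this surrogate key.
    CREATE TABLE dim_store (
        store_key    INTEGER PRIMARY KEY,   -- surrogate key, no business meaning
        store_number VARCHAR(10) NOT NULL,  -- natural key from the OLTP system
        store_name   VARCHAR(100) NOT NULL,
        city_name    VARCHAR(100) NOT NULL,
        region_name  VARCHAR(100) NOT NULL
    );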

The beauty of dimensional modeling is that facts are not defined by primary keys or any other sort of unique identifier; instead, they are defined by the combination of dimensions. This gives rise to the star schema.

It is very important that our dimension combinations be unique. When we get to queries across facts, duplicates among dimension combinations will turn into a disaster. If you can't guarantee this, then add or aggregate dimensions to make 'em unique.
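One simple sanity check, reusing the hypothetical fact_sales names from above with its declared grain of one row per date, store, and product:

    -- At a consistent grain, each combination of dimension keys should
    -- identify at most one fact row; any hits here signal trouble.
    SELECT date_key, store_key, product_key, COUNT(*) AS row_count
    FROM fact_sales
    GROUP BY date_key, store_key, product_key
    HAVING COUNT(*) > 1;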

Hierarchy in Dimensions

Consider the following two images.

[Images omitted: two versions of a dimension table from Ross and Kimball, the second with explicit hierarchy columns. [3]]

An analyst will have an easy life if you set up the second dimension table for them.

So, with the second table, you have the following hierarchy:

[Image omitted: the resulting dimension hierarchy, from Ross and Kimball. [3]]
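In SQL terms, the analyst-friendly version flattens the hierarchy into explicit columns of a single dimension table; a sketch with illustrative names:

    -- Hypothetical date dimension with a flattened hierarchy: each level
    -- of day -> month -> quarter -> year is a plain, queryable column,
    -- so rolling up is a GROUP BY on a higher-level column, not a join.
    CREATE TABLE dim_date (
        date_key     INTEGER PRIMARY KEY,  -- surrogate key, e.g. 20190614
        full_date    DATE NOT NULL,
        month_name   VARCHAR(20) NOT NULL,
        quarter_name VARCHAR(10) NOT NULL, -- e.g. 'Q2'
        year_number  INTEGER NOT NULL
    );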

There are all kinds of hierarchies — multiple hierarchies, single hierarchy, etc. I am not addressing them in this blog post.

One thing I would like to point out is that the time dimension is a real pain in the neck. You have to take care of the magic days, the fiscal calendar, the time zones, and the cycles (profit-over-quarter types). And don't be sloppy or over-confident about this; even time-series databases won't help you with hierarchies if your ETL is messed up. You might want to take a look at outrigger dimensions. There are occasions in which one dimension is naturally dependent on another. In such a case, designers may put a foreign key from the one to the other. This is what constitutes an "outrigger dimension". In calendar dimensions, this is very common.

You cannot use a date at a different grain in an outrigger from the dates you use in the fact table. You cannot allow aggregation over outrigger dimensions. If need be, mask numeric values in the outrigger with textual prefixes or suffixes to obstruct this. [4]
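A hedged sketch of the calendar case: a product dimension whose launch date points at the date dimension instead of duplicating calendar attributes. The names are invented for illustration.

    -- Hypothetical outrigger: dim_product references dim_date for its
    -- launch date. Facts still join to dim_date directly at their own
    -- grain; the outrigger date is descriptive only, never aggregated over.
    CREATE TABLE dim_product (
        product_key     INTEGER PRIMARY KEY,  -- surrogate key
        product_name    VARCHAR(100) NOT NULL,
        category_name   VARCHAR(100) NOT NULL,
        launch_date_key INTEGER REFERENCES dim_date (date_key) -- the outrigger
    );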

Slowly Changing Dimensions

As much as I would like to write about it, I still think that it’s better for my readers to understand this concept thoroughly from here.

I am not discussing snowflake dimensions, but just to point it out: they are still in use with OLAP databases.

Integrate your Big Data Into Your ETL System

You will treat your tabular big data as having been acquired through one of your standard Extract phases. Thus, you will apply to it the same Transform steps (a sketch of a few of them follows the list):

  1. Data cleansing
  2. Conforming units and formats
  3. De-duplication
  4. Restructuring
  5. Staging
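As a sketch of steps 1 through 3 in SQL, assuming a hypothetical raw staging table of events with text timestamps and amounts in cents (all names are made up):

    -- Hypothetical transform: cleanse, conform units and formats, and
    -- de-duplicate raw events before they reach the warehouse proper.
    INSERT INTO stg_events_clean (event_id, event_time_utc, amount_usd)
    SELECT DISTINCT                        -- step 3: de-duplication
        event_id,
        CAST(event_time AS TIMESTAMP),     -- step 2: conform formats
        amount_cents / 100.0               -- step 2: conform units
    FROM stg_events_raw
    WHERE event_id IS NOT NULL;            -- step 1: basic cleansing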

Summary

I wanted to understand the theoretical aspects of database design, which led me to read the book by Ross and Kimball. I then became curious to draw differences and analogies between their methods and those of today's leading data-driven companies like Netflix, Airbnb, Uber, etc.

In this quest, I can fairly say that the structured format of dimensional modeling is preferred over just hardcore ETL. This way, you remove the dependency on you: your BI team doesn't ring you up on Slack to create a new DAG for every other insight; instead, with correct modeling, you enable them to act and explore freely without needing you.

Please leave feedback on how I could improve, I am sure this was not your best read. Thank you for your time.

Footnotes

[1] https://en.wikipedia.org/wiki/Dimensional_modeling

[2] https://en.wikipedia.org/wiki/Measure_(data_warehouse)

[3] Ross and Kimball, ch. 2 and ch. 18

[4] Ross and Kimball, pp. 103–109
