The world’s leading publication for data science, AI, and ML professionals.

Data Literacy Series: Dimensional Modeling

This educational series aims to explain various data topics in a simple but effective way.

Photo by Honey Yanibel Minaya Cruz on Unsplash

Introduction

In general, a Data model is a representation of how data is organized into database tables. I know, that's not a rigorous academic definition, but my only goal here is to be clear. For a given collection of data representing a Business Process, there are a number of possible ways to organize and store it: do we store all the data in 1 big table? Or in 2, in 100, ...? It depends. For any Data Analytics-driven subject, ranging from classic Business Intelligence to the wildest Data Science applications, a very comfortable and efficient way is using Dimensional Modeling.

The topics will be:

  • What is Dimensional Modeling?
  • Concepts of Dimensional Data Models
  • Why should Data Science, ML, and BI use Dimensional Models?

What is Dimensional Modeling?

Dimensional Modeling is a technique to build a data model (a bunch of tables, usually around 10-15) optimized for data analysis. The purpose of dimensional modeling is to allow data historicization, simplify the understanding of business data, maximize the speed of data retrieval, and provide a user-friendly model for analytics compared to the operational databases used by business processes (a.k.a. Online Transaction Processing systems, or "OLTP").

Simply put, an OLTP system is any database that directly stores the data produced by business events, such as the checkout transactions in a shop, which simply record every product scanned. It's quite clear that native OLTP data is never in good enough shape, or complete enough, for serious analytics.

Imagine that you're a Data Scientist and need to build a predictive model based on the features found in 10 of the tables in the schema below. A pure nightmare...

Image from https://meta.wikimedia.org

Here is where a dimensional model can help: it's designed for quickly reading, summarizing, and analyzing numeric information like values, balances, counts, and weights, and for aggregating those measures across any business attribute, e.g. a Sales analysis that calculates the Total Turnover by customer country and product category.

In contrast, the relational models used for real-time OLTP systems are optimized for adding, updating, and deleting data, to "record" all the events of the business process they serve.

These dimensional and relational models store data in their own distinctive ways, each with specific advantages. The relational model aims at efficient data capture and storage by minimizing redundancy in the data. That implies complicated models, with lots of tables and unreadable schemas, often as clear as Egyptian hieroglyphs.

On the other side, the dimensional model aims to arrange data so that it is easy to understand where and how data is stored, with very quick retrieval for agile business analytics and reporting. The sacrifice for all that is poorer storage-space optimization, but who cares if the resulting model is as simple and clear as this one:

Screenshot by Author: example of Dimensional Model (Star Schema)

Concepts of Dimensional Data Models

Fact & Fact Tables

Facts are the quantitative measurements of the business process.

Example: For a Sales business process, a fact could be the total monthly sale amount.

To model facts, we need to build a table with:

  • One or more Facts (amounts, ratios, weights, quantities, etc.)
  • Relationship columns pointing to the related dimension tables (explained next).
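As a minimal sketch of what such a fact table could look like (table and column names are hypothetical, chosen for the Sales example), here declared through Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical fact table for a Sales process:
-- numeric facts plus one relationship column per dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER,   -- relationship to the Date dimension
    customer_key INTEGER,   -- relationship to the Customer dimension
    product_key  INTEGER,   -- relationship to the Product dimension
    quantity     INTEGER,   -- fact
    amount       REAL       -- fact
);
""")
```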

Dimension & Dimension Tables

Dimensions provide the context surrounding a business process event. All the non-numerical data of a process can be considered dimensions. In simple terms, they tell the "when", "who", "what" and "where" of a fact.

To model Dimensions, we build a dimension table for each business concept in our process. Example: For the Sales business process, the dimensions would be:

  • When: Sale transaction date
  • Who: Customer information (Name, Address, Type, Subtype, etc.)
  • Where: Store information (Name, Address, Division, Size, etc.)
  • What: Product information (Description, Category, Sub-category, Segment, etc.)

How to build a Dimensional Model?

  1. Analyze the Business Process: find out what measures can be interesting to know (facts) and their context (dimensions). Example: the Sales process mentioned earlier can be fully modeled by looking at the content of its transactions. Amount sold, price, product, customer, etc.
  2. Define the Grain: it describes the finest level of detail that answers any questions about the business process. Example: the Grain for Sales could be the single sale transaction. This way our model will be able to represent both single sales details and any aggregation we might want (stats by week, month, by country, etc.)
  3. Identify Dimensions and Facts data. Example: sale transaction’s Facts can be transaction amount, number of products, quantity, cost per unit, etc. Dimensions: customer identifier, date, channel, market category, product identifier, product segment.
  4. Build the model: organize all the measures in one fact table per business process, and usually several (10–15) dimension tables linked to the fact table. Each dimension table should cover a business concept. Example: a dimension table Customer (with its identifier, name, address, etc.), one for Product, another for Calendar (for dates), etc.
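Following the four steps above, a toy star schema for the Sales process (all table and column names are hypothetical) might be assembled like this, using Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One dimension table per business concept...
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);

-- ...and one fact table at the chosen grain (the single sale transaction),
-- holding the measures plus the links to the dimensions.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    amount       REAL
);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['dim_customer', 'dim_date', 'dim_product', 'fact_sales']
```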

The fact table in a star schema is normalized, whereas dimension tables are heavily de-normalized, meaning that the information in a dimension is heavily redundant across records. But that is exactly what we want in order to improve data retrieval performance: with just a simple two-level join (fact and dimension tables), analytics queries run more efficiently than with multi-level joins.

Screenshot by Author: example of Dimensional Model (Star Schema)

Star Schema isn’t the only possibility when talking about dimensional modeling. A common variation is called Snowflake schema, in which dimensions can be partially normalized.

Why should Data Science, ML, and BI use Dimensional Models?

  • Dimensions and Facts are simple to use for analytics. They offer an intuitive representation of any complex business process. Also, each dimension can be used in multiple star schemas: imagine a very typical Customer dimension; it can be used both for a simple descriptive analysis of customers and for more sophisticated predictive analyses of their future behavior, segmentation, and clustering, all using the same core table.
  • Dimensional Models allow data historicization, which operational databases usually don't provide. Imagine transactional, or real-time, data: once they change, their older values are lost. A dimensional model, by using the Time dimension, can capture all the historical changes (actually, that's one of the reasons it was first invented).
  • If a new Dimension must be introduced in the model, the fact table only needs a minor change, with no impact on the existing dimensions.
  • Dimensional models are optimized for fast data retrieval. More complex topics, not described in this article, such as indexing and table partitioning, are always used in Dimensional Models, to increase dramatically the query performance.

How to use Dimensional Models

As a user of a Dimensional Model, what do you need to know? Whether you need the data for a dashboard, a report, to feed ML pipelines, or any other sort of data-hungry analytics, the principle is the same: take one or more facts, and all the dimensions you want to aggregate the facts by. Combine them in a way (in SQL, R, Python, it doesn't matter, the same concepts apply) that treats the Fact(s) as the main data and the Dimensions as secondary data.

In short: "Facts always LEFT JOIN Dimensions"!

For example, using the star schema model in the previous picture, a typical query would join the Sales fact table to all the dimensions and aggregate the facts by the attributes of interest. If we wanted to do some analytics by Year and by customer's country, the SQL could be the following:

Conceptually it'd be the same principle in R / Python: the fact table is the master table, and the dimensions must be joined as secondary tables.

Is it really that simple? Almost; just a couple more things.

  1. Type of facts: when you analyze Facts, it's quite natural to aggregate them, to calculate e.g. the monthly Total Turnover, or the Average Sale by product segment, by Market, or by any other dimension. But facts cannot always be combined in any way we want: you need to understand Additive, Semi-Additive, and Non-Additive Facts.
  • Additive facts can be summed across any of the dimensions associated with the fact table. E.g. Sale Amounts can be aggregated by Customer, Period, Product.
  • Semi-additive facts can be aggregated across just some dimensions. E.g. the mobile credit balance is additive across all dimensions except time.
  • Non-additive facts cannot be aggregated at all. E.g Ratios.
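A small illustration of why additivity matters (the numbers are made up): summing an additive fact like an amount is always safe, while averaging a ratio across records gives the wrong overall figure.

```python
# Two sales lines: (amount, cost). Margin ratio = (amount - cost) / amount.
lines = [(100.0, 50.0), (300.0, 270.0)]

# Additive facts: amounts and costs can simply be summed.
total_amount = sum(a for a, _ in lines)             # 400.0
total_cost = sum(c for _, c in lines)               # 320.0

# Non-additive fact: the margin ratio of each line...
ratios = [(a - c) / a for a, c in lines]            # [0.5, 0.1]
# ...cannot be averaged to obtain the overall margin.
wrong = sum(ratios) / len(ratios)                   # 0.3
right = (total_amount - total_cost) / total_amount  # 0.2
```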
  2. Fact historicization: when we query fact tables, we need to know how the measures are stored. There are 2 possible ways, depending on the nature of the process you are trying to measure. You need to know them to avoid incorrect aggregations when analyzing the data.
  • Periodic Snapshot Fact Tables: the facts data are captured periodically, to measure the situation at the end of that period. E.g. If you look at your bank account at the end of each month, and you collect the balance into a table, you’re doing a Periodic Snapshot historicization. The Period automatically becomes a part of your grain. Periodic Snapshots are semi-additive (we can’t aggregate by time).
  • Accumulating Snapshot Fact Tables: the facts are captured when a certain event happens. E.g. If you want to see your bank transactions over a month, you capture a fact every time a movement is done, and you can even calculate the total amount by summing them.
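A tiny illustration of the difference (the figures are made up): with a periodic snapshot you pick a point in time rather than summing across it, while accumulated transaction facts do sum up.

```python
# Periodic snapshot: month-end balances of a bank account (semi-additive).
balances = {"2024-01": 1000.0, "2024-02": 1200.0, "2024-03": 900.0}
# Summing balances across time is meaningless; take the latest period instead.
latest_balance = balances[max(balances)]  # 900.0

# Accumulating snapshot: one fact per transaction (additive across time).
transactions = [1000.0, 200.0, -300.0]
net_movement = sum(transactions)          # 900.0
```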

3. Dimension historicization: how do we reflect changes to the dimension attributes? In other words: if an attribute gets a new value in our source data, how do we propagate it to the dimension tables?

E.g. if a customer changes country or a new product replaces an old one in the product catalog, how do we allow our analytics of "Sale by Country", and "Turnover by Product", to work on the past, present, and future? There are in total 8 ways to handle the changes in the dimensions, also called "Types" of dimensions. I introduce just the first 3 types, which are the only ones I have actually seen in real environments (there are 5 more in the theory).

  • Dimension "Type 0": it retains the original value, no need to update. E.g. An insurance company could assign an Initial Risk Score to a customer, which reflects the very initial situation, and needs to be kept that way forever.
  • Dimension "Type 1": it replaces the dimension’s attribute value when a new value is loaded. E.g. the same insurance company could have an attribute "Most recent Risk Score" of a customer, which must give the latest risk classification for each customer, with no interest in the previous values.
  • Dimension "Type 2": the new value in coming to the dimension table creates a new row in the dimension, and does not delete the previous one. Some additional columns allow to flag what is the currently valid value and indicate the history dates.

In a nutshell…

Dimensional Modeling is an easy way to model business data, separating all the business quantifications (figures) into one table and the qualifications (descriptions, attributes) into other tables. This brings many advantages for any data analytics application.

What’s next? In this article I just mentioned the essential concepts. If you want the details about the actual design, a good place to start is here and on Kimball’s books (he’s the father of modern dimensional modeling).

Thanks for reading.


Feel free to subscribe to my "Sharing Data Knowledge" Newsletter.

If you wish to subscribe to Medium, feel free to use my referral link https://medium.com/@maw-ferrari/membership : it costs the same for you, but it contributes indirectly to my stories.

