

Data Engineering for Big Data Warehouses


A modern update of dimensional data modeling techniques for big data warehouses

(Image by author)

Dimensional data modeling has been around for a long time, but it wasn’t until Ralph Kimball popularized the concept in the late 1980s and early 1990s that it really picked up steam. Many aspects of dimensional data modeling assumed the data warehouse would be confined to a single-server database (either an RDBMS or an OLAP engine). However, with the rise of distributed computing and storage systems over the last 15 years, many of these assumptions no longer apply. The methods and techniques of dimensional data modeling that Kimball popularized 25 years ago require some revision.

Many seasoned data engineers will remember the old days when Ralph Kimball and Bill Inmon battled it out in the data warehouse wars. For those not familiar, these were two very different approaches to building a data warehouse, the details of which are beyond the scope of this article. While Kimball had a larger following, many of Inmon’s ideas survive to this day, and the majority of modern data warehouses are built on ideas from both camps. That being said, I believe it was Kimball’s use of dimensional data modeling that had the biggest impact on how data warehouses are designed and operate today.

The overarching theme of dimensional data modeling is simple: organize data so that it is quick and easy to understand and query for analysis and reporting. To this day, this theme (mostly) still applies; the only thing that has changed is that today’s data warehouse has far more uses than analysis and reporting. Data science, machine learning, and algorithm engineering are just a few of the emerging uses for big data stored in modern data warehouses. Of course, this shift doesn’t require us to come up with an entirely new way of modeling data; it simply means that, with some minor tweaks, dimensional data modeling can satisfy the (big) data needs of today’s much wider audience.


What Should We Change?

I’m proposing a list of revisions and updates to bring dimensional data modeling into the world of big data. We may need to add to or revise this list in the future, but it is a pretty good start:

1 – Ending the War on Star Schemas vs. Snowflake Schemas

From a data modeling point of view, snowflake schemas have basically won, but that doesn’t mean star schemas simply gave up. All it means is that we should have different tables for many of the different levels of granularity a dimension might hold (the snowflake approach). However, we should still de-normalize lower-granularity attributes into parent dimensions (the star approach) to reduce the number of shuffle stages a query requires when it executes in a distributed computing environment.

Consider a scenario where we have product_type_d and product_subtype_d dimensions. We’ll want all of the attributes of product_subtype_d included (via de-normalization) in product_type_d. Of course, this raises the question: if we have all of the information in product_type_d, why do we need the product_subtype_d table at all? The answer is that we still need a dimension at the granularity of product subtype so that fact tables with a product_subtype_id foreign key can still perform a simple join for dimension attributes.
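As a minimal sketch of what this merged layout might look like in Spark SQL (the column names and the Parquet storage format here are assumptions for illustration, not part of the original example):

```python
# Sketch only: the merged star/snowflake layout for the product_type_d /
# product_subtype_d example. Column names and storage format are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merged-schema-sketch").getOrCreate()

# Snowflake aspect: keep a dedicated subtype-grain dimension so that fact
# tables carrying a product_subtype_id can join to it directly.
spark.sql("""
    CREATE TABLE IF NOT EXISTS product_subtype_d (
        product_subtype_id STRING,
        subtype_name       STRING,
        subtype_category   STRING
    ) USING PARQUET
""")

# Star aspect: product_type_d also carries the subtype attributes,
# de-normalized in, so queries at this grain avoid an extra shuffle join.
spark.sql("""
    CREATE TABLE IF NOT EXISTS product_type_d (
        product_type_id    STRING,
        type_name          STRING,
        product_subtype_id STRING,
        subtype_name       STRING,
        subtype_category   STRING
    ) USING PARQUET
""")
```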

Basically, merging the two techniques gives us the best of both worlds. The only cost is storage, so while you may want to be careful doing this for extremely large dimensions, it is well worth the cost for the majority of use cases.

2 – Do Not Use Surrogate Keys

As cool as I thought surrogate keys were when I first learned about them, they have turned out to be nothing but trouble and can be extremely difficult to maintain in today’s environment. Using natural keys is perfectly fine. If your dimension has a composite primary key and you want to simplify querying (so that the join doesn’t need to be made on two columns), just concatenate the components with a reasonable delimiter to produce a single key from multiple natural keys. Surrogate keys can make re-stating data a total nightmare, and my next point eliminates the only good reason they were ever really needed in the first place.
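For example, a hedged sketch in PySpark (the sales_f and product_d table names, the store_code/sku composite key, and the "||" delimiter are all assumptions for illustration):

```python
# Sketch: build a single join key from a composite natural key instead of
# minting a surrogate key. Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.table("sales_f")      # assumed fact table with store_code, sku
product = spark.table("product_d")  # assumed dimension with the same natural keys

# Pick a delimiter that cannot appear inside either key component.
sales = sales.withColumn("product_key", F.concat_ws("||", "store_code", "sku"))
product = product.withColumn("product_key", F.concat_ws("||", "store_code", "sku"))

# A single-column join, and the key stays human-readable and easy to re-state.
enriched = sales.join(product, on="product_key", how="left")
```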

3 – Avoid the Use of Type-2 SCDs (Slowly Changing Dimensions)

The vast majority of the time, a Type-0 or Type-1 SCD will do the trick. Unless there is an exceptionally critical reason, I avoid using Type-2 SCDs. In over 15 years of data modeling, I have only come across a handful of real use cases for Type-2 SCDs. While they can be a very powerful feature in a data warehouse, they are very seldom actually used (even when data consumers swear they need them).

If you need to implement a Type-2 SCD, do not use surrogate keys. Instead, use the natural key in combination with date/timestamp fields representing effective and expiration dates in the SCD. This is only slightly more complicated to query but is far more flexible, much easier to implement, and avoids the need for surrogate keys.
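A minimal sketch of that query pattern, assuming a customer_scd_d table with effective_ts and expiration_ts columns (these names, and the orders_f fact, are illustrative assumptions):

```python
# Sketch: resolving the "as it was" dimension row using the natural key plus
# effective/expiration timestamps, with no surrogate key involved.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.table("orders_f")              # assumed fact with customer_id, order_ts
customer_scd = spark.table("customer_scd_d")  # natural key + effective_ts / expiration_ts

# Pick the dimension row whose validity window covers the fact's event time.
as_it_was = orders.join(
    customer_scd,
    on=[
        orders.customer_id == customer_scd.customer_id,
        orders.order_ts >= customer_scd.effective_ts,
        orders.order_ts < customer_scd.expiration_ts,
    ],
    how="left",
)
```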

4 – Introducing: Snapshot Dimensions

This is a somewhat new concept that I think needs to be added to the data modeling toolbox. Snapshot dimensions wouldn’t have made sense in an RDBMS but make perfect sense in a big data environment. What are they? Take a dimension (for example, customer_d) and add a partition column (for example, as_of_date). Each day that your ETL runs, all you need to do is pick up the data in the previous as_of_date partition, read the event data about inserts/updates/deletes to customers, process the combined dataset, and write the result to a new as_of_date partition.
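A sketch of one daily run might look like the following; the customer_changes change feed, its op column, and the warehouse path layout are all assumptions made for illustration:

```python
# Sketch of a single snapshot-dimension run. Schemas of customer_d and the
# change feed are assumed to line up apart from the bookkeeping columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

prev_date, run_date = "2024-01-01", "2024-01-02"  # normally injected by the scheduler

# 1. Read yesterday's snapshot partition.
prev = spark.table("customer_d").where(F.col("as_of_date") == prev_date)

# 2. Read the day's inserts/updates/deletes (assumed to carry an 'op' column).
changes = spark.table("customer_changes").where(F.col("change_date") == run_date)
upserts = changes.where(F.col("op").isin("insert", "update")).drop("op", "change_date")
touched = changes.select("customer_id").distinct()

# 3. Keep untouched rows, overlay the new versions; deleted rows simply never return.
merged = (
    prev.drop("as_of_date")
        .join(touched, "customer_id", "left_anti")
        .unionByName(upserts)
)

# 4. Write the result as a brand-new partition; older snapshots stay untouched.
merged.write.mode("overwrite").parquet(f"/warehouse/customer_d/as_of_date={run_date}")
```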

Some of the benefits to this approach are:

  1. If downstream users detect a problem with the data in your snapshot dimension, it becomes far easier to pinpoint when the error was introduced. It also means you still have access to the snapshot of your data from before it was corrupted, making it possible to correct the bug and re-state all of the partitions written since then.
  2. You get a much simpler variation of a Type-2 SCD (if you need it). It isn’t exactly the same thing because you can’t capture intra-snapshot changes; but if you’re dealing with multiple changes within a day, then it isn’t a very slowly changing dimension in the first place, and I would suggest considering some other way of modeling the data.

Is this a waste of space? Sure. But the extra space this requires is probably a lot cheaper than having a customer dimension with corrupted data and no quick and easy way to fix it. Think of it less as a waste of space and more as an insurance policy. Additionally, it is relatively simple to create a clean-up job that drops old partitions that are no longer needed.
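That clean-up job can be very small; here is a hedged sketch, assuming a 90-day retention window and the as_of_date partition column from above:

```python
# Sketch: drop snapshot partitions older than a chosen retention window.
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cutoff = (date.today() - timedelta(days=90)).isoformat()  # retention window is an assumption

# SHOW PARTITIONS returns rows like "as_of_date=2024-01-01".
for row in spark.sql("SHOW PARTITIONS customer_d").collect():
    as_of = row["partition"].split("=", 1)[1]
    if as_of < cutoff:  # ISO dates compare correctly as strings
        spark.sql(
            f"ALTER TABLE customer_d DROP IF EXISTS PARTITION (as_of_date = '{as_of}')"
        )
```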

5 – Denormalization Isn’t A Dirty Word

Denormalizing some attributes into fact tables isn’t as taboo as it once was. Part of the reason this was frowned upon in the past was how an RDBMS stored data in tables. With the advent of columnar storage formats like Parquet and ORC, this is no longer a huge concern. For a field to be denormalized, it should meet one of the following requirements:

  • If the value of the attribute is immutable (i.e., it is static and will not change), then it doesn’t matter if you cement its current value in your fact table.
  • If the value of the attribute is mutable but you’re interested in representing the "value as it was" in the fact table (like a Type-2 SCD), then storing the attribute in the fact table turns out to be a huge advantage because it reduces your need to implement a Type-2 SCD. Of course, if you go this route, I would strongly suggest that when you denormalize the attribute, you pull the value from a snapshot dimension (so that if your ETL job is re-stating or backfilling a previous day’s data, it joins to the correct "value as it was" in the snapshot dimension), as sketched right after this list.
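A minimal sketch of that backfill-safe join (the orders_staging table, the customer_tier attribute, and the path layout are assumptions for illustration):

```python
# Sketch: when building or re-stating a day of facts, denormalize the attribute
# from the matching snapshot partition so the "value as it was" is reproducible.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

run_date = "2024-01-02"  # the (possibly historical) day being built or re-stated

orders = spark.table("orders_staging").where(F.col("order_date") == run_date)
customer_snapshot = (
    spark.table("customer_d")
         .where(F.col("as_of_date") == run_date)  # that day's snapshot, not "latest"
         .select("customer_id", "customer_tier")  # the attribute being denormalized
)

orders_f = orders.join(customer_snapshot, "customer_id", "left")
orders_f.write.mode("overwrite").parquet(f"/warehouse/orders_f/order_date={run_date}")
```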

If one of these requirements is met, then I weigh the advantages and disadvantages of denormalizing the attribute into a fact table:

  • Advantage: There is significant value in having attributes denormalized in a fact table if it prevents downstream data consumers from performing an expensive join operation (i.e., analyze usage patterns to see how often, and how expensive, such a join would be).
  • Disadvantage: If the value of the attribute is large and/or variable in size, I may avoid denormalizing because this could cause data skew in the fact table.
  • Disadvantage: If the fact table already contains a large number of columns, I would avoid adding more fields that are simple enough to derive. What counts as a large number of columns? That is difficult to answer, but I generally think a table with 60 to 100 columns is pushing the limit. At that point, it becomes difficult for data consumers to remember all of the columns in a table. This is, of course, arbitrary and depends upon whether you have good metadata tooling, but it is a rule of thumb that has held up for me over the last 10–15 years.

It is important to understand that this list of advantages and disadvantages is just a starting point and that each point may carry a different weight. In the end, this is one of those situations where the artistic side of data modeling becomes just as important as the scientific side.

6 – The "Accumulating Fact Table" Is Dead

Like surrogate keys, I thought these were cool when I first learned about them, but in practice I never found much use for them. In addition, by their very nature, they required you to perform UPDATE operations on a large number of random rows in a fact table. Random UPDATE operations and big data are sworn enemies, and their battleground is your cluster. Unless you need the heat from your cluster to make it through the winter, find some other way to model your fact tables.

7 – Don’t Be Afraid of Complex Data Types!

While complex data types violate the most important rule of normal forms (atomic columns), much of the traditional teaching of dimensional data modeling broke several of these rules anyway. With a bit of common sense and caution, complex data types can be extremely powerful in a modern data warehouse.

Here are some examples of powerful use cases where complex types should probably be used.

  • Internal metadata about ETL processing can easily be stored in a map. For example, if you want to keep track of processing details such as the ID/version of the job that wrote a row, how some complex transformation came to its conclusion, or other debugging details that will probably only be used by data engineers, it is probably easier to store these in a map or struct/row than to create dedicated columns.
  • When you know details about an entity will change in the near future, a map or even a string of JSON may make sense to hold these fields until the schema can be solidified (and you have a better idea of what the analysis requirements may be).
  • In cases where you need to represent a very basic 1-to-many relationship but it is only for a single attribute, an array might be a better solution than creating a whole separate table. For example, if you would like to store a list of tags in your product_d dimension, rather than keeping the tags in a product_tag table, they can be stored in an array within the product_d table.
  • For the dreaded scenario of representing parent-child relationships, an array is a great asset. It can be used to store the IDs of parents in order, making it possible to determine lineage with a simple self-join instead of relying on recursive queries (see the sketch after this list).
  • Unstructured data is common in big data. This is not so much a use case as it is a reality. Generally, it is our job as data engineers to re-organize unstructured datasets into structured (or semi-structured) datasets so that our stakeholders can make sense of them. This means that, at the very least, several of our data sources will utilize complex data types.
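To make a couple of these concrete, here is a hedged sketch of the metadata map and the array-based lineage idea; the category table, its column names, and the etl_meta field are all invented for illustration:

```python
# Sketch: a map column for ETL metadata and an ancestor-ID array that replaces
# recursive hierarchy queries with a simple self-join. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

categories = spark.createDataFrame(
    [
        (1, "Electronics", []),
        (2, "Computers",   [1]),
        (3, "Laptops",     [1, 2]),  # ancestors stored in order: root first
    ],
    "category_id INT, name STRING, ancestor_ids ARRAY<INT>",
)

# A map column carrying ETL bookkeeping that only data engineers will read.
with_meta = categories.withColumn(
    "etl_meta",
    F.create_map(F.lit("job_version"), F.lit("1.4.2"), F.lit("run_id"), F.lit("abc123")),
)

# All descendants of a given category: an array lookup instead of WITH RECURSIVE.
descendants_of_1 = categories.where(F.array_contains("ancestor_ids", 1))

# Full lineage via a self-join: pair every row with each of its ancestors.
parents = categories.select(
    F.col("category_id").alias("parent_id"), F.col("name").alias("parent_name")
)
lineage = categories.join(parents, F.expr("array_contains(ancestor_ids, parent_id)"))
```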

Of course, with great power comes great responsibility, so here are some points of caution to consider about complex types:

  • Complex data types can lead to data skew. This often happens when there isn’t a consistent schema used for the complex types. Data skew will happen when some rows only consume a few bytes in the complex fields while others consume a massive amount of storage.
  • Insight into the inner schema is lost and can become messy over time. While this isn’t likely an issue when you’re simply keeping track of some ETL processing metadata (since you control that metadata), it is common in situations where you’re using complex types because you don’t want to nail down a specific schema. In the end, depending too much on JSON in your data model will come back to haunt you eventually.
  • Most complex types do not benefit from the capabilities of columnar-stored fields. While some systems offer limited benefits for certain types, they still do not compare to what atomic columns provide. Features like predicate push-down will not work on information embedded in a JSON field (at least, not yet).

Closing Thoughts

Much can be learned by reviewing Ralph Kimball’s dimensional data modeling techniques. While the technology may have changed (which inspired me to write this article), the core idea remains the same: a data warehouse should be designed with the data consumer’s requirements as a top priority.

