Notes from Industry
Data modelling has been around for 40+ years. There are a few established approaches to how data should be modelled for consumption by different types of users and systems. But there is no arguing that technology has changed massively in 40 years, and the last 5–10 years in the data space have been particularly significant. Are those approaches still applicable today?

A little bit more on history
Dr. E. F. Codd published the paper "A Relational Model of Data for Large Shared Data Banks" in June 1970. Codd’s model is now accepted as the definitive model for relational database management systems (RDBMS). The Structured English Query Language (SEQUEL) was developed by IBM to work with Codd’s model, and SEQUEL later became SQL. In 1979, Relational Software, Inc. (now Oracle) introduced the first commercially available implementation of SQL. Today, SQL is accepted as the standard RDBMS language.
A few things happened after that, as you can expect.
In the 1990s, two major books on data warehousing were published – one by Bill Inmon (1992) and another by Ralph Kimball (1996). They became the two well-known competing options for modelling the layer where the data is stored: either you model according to Ralph Kimball, with fact tables and conformed dimensions (a star schema), or you model according to Bill Inmon, with the database normalised.
In 2000, Dan Linstedt released Data Vault modelling as an alternative to Inmon and Kimball, designed to provide long-term historical storage of data coming in from multiple operational systems, with emphasis on tracing, auditing and resilience to change in the source data model. The data vault model effectively provides a "graph-based" model of hubs and relationships (links) in a relational database world.
In the 2010s, Apache Hadoop and the ecosystem around it were developed and became popular. Those technologies were very different from relational databases, even though some of them allowed querying the data using SQL. Schema-on-read allowed loading data as-is and querying it later – so no upfront data modelling was required. When data was processed for later usage, the approach was to "join everything together" into a wide denormalised file. Data storage was no longer an issue, but joins were computationally very expensive.
It is 2021, and a new generation of data technologies keeps growing in adoption, in particular cloud-native columnar data warehouses with MPP (massively parallel processing) architectures such as Snowflake, Amazon Redshift and Google BigQuery.
Does it make sense to model the data using approaches from the 1990s? Should new data or analytics engineers invest time in learning them today?
Data Vault vs Traditional vs Flatten that Crap Out
I came across this video recently where a panel of data professionals discussed these questions.
Data Vault vs. Traditional vs. Flatten that Crap Out: Great Data Minds
It is a long and interesting discussion, and here is my summary if you choose not to watch it.
Ultimately, there is no single answer – different use cases and workloads require different approaches. Traditional approaches are called traditional for a reason – they are time-tested and they work. And sometimes there is room for all of those approaches within a single architecture.
Any solution should be purpose-built, and here are some commonly chosen options (just some examples):
- Ad-hoc analytics – Star schema (Kimball); a minimal sketch follows this list
- Data science – Flat tables
- Enterprise data warehouse with tracing and auditing requirements – Data Vault
- Operational databases – Third normal form (3NF)
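To make the star schema option more concrete, here is a minimal sketch in plain SQL. The table and column names (dim_date, dim_customer, fact_sales and their fields) are hypothetical and not taken from any system discussed here – the point is only the shape: a central fact table of measures and foreign keys surrounded by dimension tables.

```sql
-- Minimal Kimball-style star schema sketch (hypothetical names).
create table dim_date (
    date_key      int primary key,   -- surrogate key, e.g. 20210131
    calendar_date date,
    month_name    varchar(20),
    year_number   int
);

create table dim_customer (
    customer_key  int primary key,   -- surrogate key
    customer_name varchar(100),
    country       varchar(50)
);

create table fact_sales (
    date_key      int references dim_date (date_key),
    customer_key  int references dim_customer (customer_key),
    quantity      int,
    sales_amount  decimal(18, 2)
);

-- A typical ad-hoc query: aggregate the measures, slice by dimension attributes.
select d.year_number, c.country, sum(f.sales_amount) as total_sales
from fact_sales f
join dim_date d     on f.date_key = d.date_key
join dim_customer c on f.customer_key = c.customer_key
group by d.year_number, c.country;
```

This is why business users tend to find the star schema approachable: every question is some combination of "aggregate these measures, grouped and filtered by those attributes".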
The horsepower provided by the modern data stack can overcome many design deficiencies, and queries can perform well even with a less than optimal modelling choice. However, apart from query speed, there is also maintenance cost, or so-called technical debt, and this cost can be unnecessarily high if a suboptimal technique is used.
In fact, when we talk about speed, we should consider several different things:
- speed, or performance, of queries
- speed of delivery
- speed of change (in other words, agility, or how quickly we can respond to change in the requirements within our data model)
Modern data tools utilising cloud elasticity, combined with good development practices like automated data quality testing and CI/CD (which I wrote about here), allow us to be much more agile in our development process and iterate on requirements as they come up. When the speed of delivery and change is fast, you can start simple, focus on delivering business value first and improve as you go. Alternatively, some data models (such as Data Vault) are designed to be resilient to change, and that can be a better approach in certain scenarios.
Another significant consideration is data management. When we look at the data, we need to know what it is, where it came from and how it holds together. Good data models help with that, but so do tools that help keep track of data lineage, data quality, data dictionaries, etc. It is very easy to end up with a huge data mess if the data is poorly modelled and badly documented.
Final thoughts
I am a data engineer, and my typical use cases are ad-hoc analytics, BI and data science. I like to think about the data I work with in terms of facts and dimensions (even when it’s flat), and I prefer to model the data in a Kimball-like style when it makes sense. I find a star schema easy to reason about, and business users usually like it too. Do I always physically model the data this way these days? No.
If requirements allow, the data model can be "logical" rather than physical and can be changed when requirements change. In some cases, it is possible to skip up-front modelling altogether or to model as little as needed. I must emphasise that this works when the speed of delivery and the speed of change are fast, and when data management is in place. And of course, the assumption is that the original data is available in the data warehouse in its raw form, so the final models can always be rebuilt – a common practice today, when data storage is not a concern, usually referred to as ELT (extract – load – transform).
For example, in the past, I had to be very careful when adding columns and choosing data types for my multi-billion-row fact tables in MS SQL Server, as 4 bytes saved per record when choosing int over bigint meant 10–20 GB saved overall – that was significant for a mid-range on-premises server 10 years ago. Today, I can save JSON logs directly into blob storage, read them in a Snowflake SQL view straight away with schema-on-read, and then build a dashboard on top of it – all of that in a matter of days, without any upfront thinking about schemas or data types. And then, within days, it can be redesigned into a star schema with referential integrity – when it becomes valuable.
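For illustration, here is a rough sketch of that schema-on-read pattern in Snowflake SQL. The stage, table, view and JSON field names are hypothetical, and the exact stage and file-format setup will differ in practice – this is a sketch of the pattern, not the original implementation.

```sql
-- Load raw JSON logs as-is into a VARIANT column (no upfront schema),
-- then expose only the fields the dashboard needs through a view.
create table raw_logs (payload variant);

copy into raw_logs
from @logs_stage/2021/                 -- hypothetical stage over blob storage
file_format = (type = 'json');

create view page_views as
select
    payload:event_id::string   as event_id,
    payload:user_id::string    as user_id,
    payload:url::string        as page_url,
    payload:ts::timestamp_ntz  as event_time
from raw_logs
where payload:event_type::string = 'page_view';
```

And because the raw JSON stays in the warehouse, the same table can later feed a properly typed star schema once that redesign becomes worthwhile – which is the ELT point above.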
What is the answer?
So, does it make sense to model the data using approaches from the 1990s?
You don’t have to be dogmatic about Kimball’s approach or any other approach mentioned here, but it is often helpful to be aware of classic design patterns and common pitfalls, especially when you try to solve some tricky design challenge with no obvious answer. Data modelling has been around for 40+ years – so most likely, there is already an answer to that tricky question.
Where do I learn more?
If you want to read Kimball’s "The Data Warehouse Toolkit" today, I highly recommend this article (and other data modelling resources on holistics.io) that explains which chapters are still relevant and which you can safely skip.
I have no practical experience with data vault (it seems to be better suited for enterprise data integration), but I found this review interesting if you want to learn more.