The world’s leading publication for data science, AI, and ML professionals.

Keep your friends close, keep your entities closer

Entity relationship diagrams: not cool, not modern, still super valuable

In its purest form the Data Warehouse is a mirror. Done right, that mirror reflects the real-world entities and events from your business or domain and articulates them in a smattering of tables, metrics, and dimensions. Done really right – your internal stakeholders and other analysts can navigate that ecosystem without fear of nuance or error – getting there starts with the lowly entity relationship diagram (ERD).

Consider the Jaffle Shop

dbt’s Jaffle Shop demo project is a fictional ecommerce business that has two core entities: customersand orders. Every customer is a unique individual (a human) and every order reflects a transaction made by one of those customers. As such we’d say that Customers:Orders are One:Many and every order inherits the attributes of those upstream customers making the transaction. In that way, we can easily answer questions such as "What dollar value of transactions are from customers with the last name Smith?"

The ERD would look something like below:

This is an intentionally reductive model that’s useful for helping get up and running with dbt – it’s so simple that it wouldn’t really merit making an ERD. Real world examples, however, engender exponentially higher complexity and are where ERDs begin to earn their keep.

Democratizing information and reducing cognitive overhead

During my years working in data at YouTube, one of my colleagues often remarked on the cognitive overhead that was required to use our internal data warehouse correctly. That cognitive overhead – how many logical connections or jumps your brain has to make in order to understand or contextualize the thing you’re looking at – was driven by the complexity of the business, entities, and events it was attempting to mirror.

In the course of their work, analysts needed to navigate across entities such as channels, content owners, videos, digital assets, copyright claims, partners, viewers, users, etc… Seemingly simple questions such as "How much watch time did we have in Germany last month?" could be inconspicuous landmines if not attached to the right entity (e.g., do we care about the viewer‘s country? the channel‘s declared country? the content owner‘s country?). Many-to-many relationships between these entities and non-summable metrics made analysis even trickier.

There were a handful of analysts that knew how all those pieces worked together and could ask the right questions (in the form of SQL queries) – others queried at their own peril.

In retrospect we would have greatly benefited from a well-articulated semantic layer to help our analysts navigate this ecosystem without the fear of non-sense. But we were missing an even simpler solution: a well-articulated entity relationship diagram.

Even if you have a semantic layer at your disposal, you need to figure out how to put the pieces together – that’s why this lo-tech, notably non-modern approach to data modeling is still so critical today. It serves as a forcing function for the creator, requiring that they think critically about the real-world events and entities they are trying to model (and how they relate to one another). For the data consumer, the ERD is a powerful enabler, immediately democratizing knowledge that is otherwise walled-off in many mature organizations.

An ERD for your troubles

When I first joined Mux in 2021, we had the benefit of building our data warehouse from scratch. This meant that we needed to figure out what core entities and events needed to be represented in that data warehouse. I spent hours with the raw data in our Data Lake and in our public API reference and docs just trying to wrap my head around this ecosystem – and those hours have been some of the most high ROI of my tenure here. The result of that time is the crude document below:

This demonstrative view captures less-than-half of the complexity of our ecosystem (i.e., it does not account for newer products, invoices, viewer sessions, GTM systems, etc.) -still putting this all down on paper helped us quickly understand what our end-state data warehouse would look like in order to reflect reality, and how to connect the dots in our semantic layer.

ERD in hand, we started building our dbt models and LookML and have arrived at a stable set of tables that reflect our core entities and events, while we continue to build out newer frontiers.

Closing thoughts

In summary, if your data team doesn’t have one already, invest the time to make a thorough entity-relationship-diagram (ERD). While it may feel less-productive than writing SQL or dash-boarding, the time spent here will be some of your most valuable.


Related Articles