The Basics of a Good Analytics Data Warehouse

Properly Layering Your Data into L1, L2, and L3 Layers

Elvis
Towards Data Science


[Image by Author: the L1 (raw), L2 (data model), and L3 (user) layers of an analytics data warehouse]

An analytics data warehouse stores essentially the same data generated by transactional databases — nothing more, nothing less. But whereas a transactional database may be designed to be as fast as possible, an analytics database must be designed to be as flexible and consistent as possible, with the business logic embedded directly in the warehouse. The goal of the analytics data warehouse is to enable a data scientist (or, interchangeably, analyst) to rapidly mix data to answer complex questions and discover useful information that is not obvious.

However, the analytics data warehouse has another, equally important goal: make it highly unlikely that two different data scientists tasked with the same question will arrive at different answers. Different answers are very likely if analytics is performed using transactional databases, because business definitions and business logic are not embedded in transactional databases. If your company tends to have meetings where different departments arrive with dueling answers to very simple questions, then you probably do not have a correctly designed analytics data warehouse. A properly designed analytics data warehouse will make it difficult for data scientists to arrive at different conclusions; they would have to actively try to do so.

Imagine how much time the company would save if consistent answers were the norm. Instead of wasting hours arguing over which number is “right,” you could spend the time deciding what course of action to take (or perhaps discussing the logic that was used to arrive at the conclusion). The dollar value of that time is enormous, not to mention the opportunity cost of delayed action because you can’t agree on basic questions.

The key to creating a good analytics data warehouse is to deconstruct the business into a series of fact, dimension and user tables.

  1. Fact tables: This is exactly what the naive, plain English interpretation would be. A fact table is a collection of facts around a single entity. For example, if your business sells widgets using the internet and advertises only on television, then you would probably have the following fact tables: orders, customers, products, website, and marketing. The orders fact table would have an order ID, a value for each order, the date it was placed, the date it was shipped, a customer ID for the customer who ordered it and so on. The customers fact table would have a customer ID, the name of the customer, the address of the customer, the lifetime number of orders, the lifetime sales value, their business segment, and so on. The marketing fact table would have a television commercial ID, the name of the television commercial, the station ID for the station it aired on, the time it aired, the length of the commercial and so on. There would be additional fact tables to capture information about stations such as the average number of viewers, geographic location and so on; that data would not be stored in the marketing fact table because these are facts about the stations and not the television commercials the business runs. (A sketch of what two of these tables might look like in SQL appears after this list.) This is the most difficult aspect of creating a proper analytics data warehouse: correctly separating the fact tables by identifying the right logical entities that the data must be segmented into.
  2. Dimension tables: You can identify dimensions by using plain English statements; anything that comes after the word “by” is a dimension. For example, “How many orders did we have by week? How many orders did we have by month? How many orders did we have by day of the week?” In this example, the dimension is the “date”. So any analytics database worth its salt will have a “date dimension” table. This table will have one primary key — a date such as 2015-01-10 — and many, many columns that provide information about that date. For example, it will have the day of the year (10), the week of the year (2), the day of the week (Saturday), the month of the year (1), the week start date (2015-01-05), the week end date (2015-01-11), whether it’s a weekend (Yes), whether it’s a federal government holiday (No), whether it’s a company holiday (No) and so on. Another example is “How many orders did we have by business segment?” Thus, business segment is a dimension and we would have a “business segment dimension” table. Yet another example is “How many sales did we have by zip code? By city? By state?” Thus, geography is also a dimension and the analytics data warehouse would have a “geography dimension” table. (A sketch of the date dimension table appears after this list.)
  3. User tables: A user table typically answers a very specific question by combining facts from several different entities with dimensions. For example, “How many non-canceled, non-fix orders did we have by week?” Or “What was the average order value by business segment by month?” Most businesses create dozens if not hundreds of user tables straight from their transactional databases. The problem with creating user tables straight from transactional databases is that they are inflexible (you can’t recombine the data to generate a different view without going back to the source systems and beginning from scratch) and they are inherently inconsistent, because every time a user table is created the creator decides what logic to apply. However, user tables derived from fact and dimension tables are incredibly useful because they can be used to quickly answer the basic questions that the business asks all the time. Instead of paying a data scientist to run a dozen SQL queries across transactional tables, pull the data into an Excel spreadsheet and apply a pivot to answer the question, the analytics data warehouse can automatically create a user table performing the same steps at 5am every day and email the results to the data scientist for his or her review at 7am when he or she arrives. This both saves incredible amounts of time and reduces errors by using computer automation in place of manual processing. (A sketch of such a user table appears after this list.)
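
To make the fact table idea concrete, here is a minimal sketch of two of the fact tables described above. The table and column names (f_orders, f_customers, is_canceled, and so on) are illustrative assumptions for the widget-business example, not a prescribed standard; the point is simply that each table holds facts about exactly one entity.

```sql
-- Facts about one entity: one row per order.
CREATE TABLE f_orders (
    order_id     BIGINT PRIMARY KEY,
    customer_id  BIGINT NOT NULL,          -- who placed the order (see f_customers)
    order_value  NUMERIC(12, 2) NOT NULL,  -- value of the order
    order_date   DATE NOT NULL,            -- joins to the date dimension
    ship_date    DATE,
    is_canceled  BOOLEAN NOT NULL DEFAULT FALSE
);

-- Facts about one entity: one row per customer.
CREATE TABLE f_customers (
    customer_id          BIGINT PRIMARY KEY,
    customer_name        TEXT NOT NULL,
    customer_address     TEXT,
    business_segment_id  INT,              -- joins to the business segment dimension
    lifetime_orders      INT NOT NULL DEFAULT 0,
    lifetime_sales_value NUMERIC(14, 2) NOT NULL DEFAULT 0
);
```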
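
Likewise, here is a minimal sketch of the date dimension table, again with assumed names (d_date, is_weekend, and so on). There is one row per calendar date, keyed on the date itself, with the descriptive columns precomputed once so every analyst slices dates the same way.

```sql
-- One row per calendar date, keyed on the date itself.
CREATE TABLE d_date (
    calendar_date      DATE PRIMARY KEY,   -- e.g., 2015-01-10
    day_of_year        INT NOT NULL,       -- 10
    week_of_year       INT NOT NULL,       -- 2
    day_of_week        TEXT NOT NULL,      -- 'Saturday'
    month_of_year      INT NOT NULL,       -- 1
    week_start_date    DATE NOT NULL,      -- 2015-01-05
    week_end_date      DATE NOT NULL,      -- 2015-01-11
    is_weekend         BOOLEAN NOT NULL,   -- TRUE
    is_federal_holiday BOOLEAN NOT NULL,   -- FALSE
    is_company_holiday BOOLEAN NOT NULL    -- FALSE
);
```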
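
Finally, a sketch of how the “orders by week” user table might be rebuilt each morning from the fact and dimension tables above rather than from the transactional source. The is_canceled flag and the table names are the same assumptions as above; the important property is that the business definition of a countable order lives in one query that every analyst inherits.

```sql
-- A user table built entirely from the fact and dimension tables,
-- e.g., rebuilt on a schedule at 5am every day.
CREATE TABLE u_orders_by_week AS
SELECT
    d.week_start_date,
    d.week_end_date,
    COUNT(o.order_id)  AS order_count,
    SUM(o.order_value) AS total_order_value
FROM f_orders o
JOIN d_date d ON d.calendar_date = o.order_date
WHERE o.is_canceled = FALSE              -- the business definition of a countable order
GROUP BY d.week_start_date, d.week_end_date;
```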

There is one key component of any analytics data warehouse, though, that is easy to overlook but absolutely critical: a well-maintained, accurate, and documented data model. Building a good analytics data warehouse isn’t just writing a bunch of SQL code; it’s just as much about doing the boring but critical work of creating and maintaining the data model. How do you know if you have done a good job creating and maintaining documentation on the data model? There is an easy test any business can apply: print the data model onto paper, tape all that paper to a wall, and ask a new analyst to quickly write out SQL pseudo-code for how he or she would extract the answer to a basic question. If it takes the analyst more than 2 minutes to review the papers on the wall and write out the basic pseudo-code (or if you know the pseudo-code is wrong because the documentation is out of date), then you don’t have a good data model or proper documentation. It’s critical for your business that you do have them, because that is when you will really unleash and notice the power of a properly built analytics data warehouse. Questions that used to take hours or days to answer will be answered in minutes. Imagine what a competitive advantage that could provide to your business if you can react faster than competitors because your analysts are not wasting time wading through your transactional database to uncover key findings!
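
As a rough illustration of the 2-minute test, this is the kind of SQL pseudo-code a new analyst should be able to write from the printed data model alone, using table names like the ones assumed earlier, to answer a basic question such as “how many orders did we have on weekends, by month?”

```sql
-- Sketched from the documented data model alone: weekend orders by month.
SELECT
    d.month_of_year,
    COUNT(o.order_id) AS weekend_order_count
FROM f_orders o
JOIN d_date d ON d.calendar_date = o.order_date
WHERE d.is_weekend = TRUE
  AND o.is_canceled = FALSE
GROUP BY d.month_of_year
ORDER BY d.month_of_year;
```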

In a future article we will delve further into each layer that should exist in a well-designed analytics data warehouse. In the image above, those layers are L1 (i.e., raw), L2 (i.e., the data model described above), and L3 (i.e., user), where you reuse the components from L2 as if they were Lego blocks. Sometimes there is a misunderstanding that layering your data creates more work; in fact, it’s the opposite. Layering your data will actually reduce the amount of work your team spends on getting the data they need and on maintaining that data. We will soon discuss how!
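
As a small preview of that discussion, one way to express the layering is to give each layer its own schema and build each layer only from the layer beneath it. The schema names (l1, l2, l3) and the raw column names below are assumptions for illustration, not the only way to do it:

```sql
-- L2: a modeled fact table, cleaned and conformed from the raw L1 landing table.
CREATE VIEW l2.f_orders AS
SELECT
    order_id,
    customer_id,
    CAST(order_total AS NUMERIC(12, 2)) AS order_value,
    CAST(created_at  AS DATE)           AS order_date
FROM l1.raw_orders
WHERE order_id IS NOT NULL;

-- L3: a user table that reads only from L2 building blocks, never from L1.
CREATE VIEW l3.u_orders_by_week AS
SELECT d.week_start_date, COUNT(o.order_id) AS order_count
FROM l2.f_orders o
JOIN l2.d_date d ON d.calendar_date = o.order_date
GROUP BY d.week_start_date;
```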

