The world’s leading publication for data science, AI, and ML professionals.

Fundamentals of Data Warehouses for Data Scientists

DATA SCIENCE

Or what are the Dimensional Design principles to make a good Data Warehouse?

Photo by Waldemar Brandt on Unsplash
Photo by Waldemar Brandt on Unsplash

For data scientists, machine learning models and visualization packages are essential tools. But more importantly, data scientists rely on data and data infrastructures to do their analytics and modeling. Without data and databases, all developed analytics tools and techniques are useless. Many data scientists get their data in raw formats from several sources of information. But, for many data scientists as well as business decision-makers, especially in large enterprises, the main sources of information are corporate data warehouses. A data warehouse is a structured organization of all available data (ideally) in the company. Using data warehouses, data scientists can answer important business questions and analyze the business performance.

Usually, data scientists are not concerned about the processes and methods behind data warehouses, and they only use these resources to do their analytics jobs. Normally the tasks regarding provisioning, creating, and maintaining data warehouses fall under Data Engineers’ responsibilities. But, understanding the basics of data warehousing help data scientists and business decision-makers to communicate with data engineers more effectively.

This short article covers the basics of data warehouses and their design, known as Dimensional Design. The majority of concepts in this article are borrowed from an excellent book by Christopher Adamson titled "Star Schema The Complete Reference."

Star Schema The Complete Reference

Types of Information Systems

Before diving into data warehouses, let’s have a quick review of two major categories of information systems (or simply databases).

Information systems fall into two major categories:

  1. Transaction/Operational Systems (OLTP)
  2. Analytics Systems (OLAP)

Here we quickly review both systems.

Transaction/Operational Systems

These information systems support the execution of business processes. These systems, sometimes called "Online Transaction Processing" (OLTP) systems, mostly concern with database interactions such as inserts, updates, and deletes.

As an example, consider a database for users of a social media website. This database must take care of user registration, login, posting activities, etc. To keep the business running, the transaction/operation information system should continuously read and write to the database. It must add users, updates their passwords (if they asked for), adds their posts to the database, records the interaction of other users with their posts, and so on. This system must be very efficient in doing transactions with the database(s).

Analytics Systems

These systems support the analysis of business processes. These systems are also called "Online Analytical Processing" (OLAP) systems or Data Warehouses. Analytics systems mostly interact with databases through queries and retrieving aggregated data.

As an example, let’s go back to our social media business. To have a good idea of our business, we need to review and understand our data and measure our business performance. For example, we need to know how many people signed up monthly. How many people have posted recently? Simply we need data to give us insight into our business performance. Information systems and databases that deal with these types of tasks are different from transaction systems. These types of systems require an architecture that makes querying easy, accurate, and efficient.

What is Dimensional Design?

Dimensional Design is a set of techniques, principles, and rules to design information systems and databases suitable for analytics systems. Remember that the goal of analytics systems is different from transaction systems. Since we normally record data in our transaction system, to use it for an analytics system, we must transform its architecture and design. Dimensional design helps us find the best architecture and design for our analytics system (i.e., data warehouse).

When we find our best design, then through a process called ETL (Extraction, Transform, and Loading), we can get data and information from the transaction/operational systems into analytics systems.

Center of Dimensional Design: Facts and Dimensions

The most important concept in Dimensional Design is to understand that any types of data in your database fall into one of these groups:

  1. Facts
  2. Dimensions

Facts are measurements that we would like to monitor and get answers from them. For example, if you want to know the number of users registered monthly in 2020, your desired fact (the one you are looking for) is the "number of users." Most of the time, you are looking for facts in some contexts. For example, in my previous question (i.e., "what is the number of users registered monthly in 2020?"), I am looking for the number of users based on "months of registration." Therefore, the month of registration provides some context for my fact (i.e., number of registered users). Dimensions provide context for facts, and without them, facts are almost meaningless.

Sometimes it is easy to find facts and dimensions; sometimes, it is hard. As soon as you could separate facts from dimensions, you can build a table for facts. For your dimensions, you must do a little bit more job and divide them into a few major groups based on your analytics purposes. But remember, you don’t want to make many tables for your dimension data as most people do for transaction systems. Your new dimension tables can now be linked to your facts table. Now, if you look at your database design, what do you see? See the next section for your answer.

Star Schema

The most simple view of an analytics system with one fact table surrounded by a few dimension tables is more or less like a star. That’s why we call it a Star Schema. Star Schema is a dimensional model for a simple relational database. In other words, applying the principles of dimensional design to a simple relational database leads to a database model that is like a star and is called Star Schema.

Star Schema provides an efficient and accurate way to query against a database (especially large databases) for analytical purposes. Remember, Star Schema is a part of Dimensional Design, not the whole concept.

Type of Data Warehouse Architectures

Although data warehouses vary in their architectures, they usually fall into one for these three groups:

  1. Stand-Alone Data Marts architecture
  2. Corporate Information Factory architecture (by Bill Inmon)
  3. Dimensional Data Warehouse architecture (by Ralph Kimball)

Before explaining each architecture, let me quickly introduce you to the concept of "Data Marts." Data Marts are small (sometimes filtered and aggregated) databases built for a specific subject area or business need. For example, we might build a data mart for analyzing customer behavior or a data mart for analyzing our inventory issues. All three architectures normally deliver data in form of data marts to data scientists and decision-makers (shown as users in the following images). In some architectures, data marts are getting constructed from transaction/operational databases directly. In some others, they are made from a central data warehouse. By knowing the concept of data marts, let’s review each of these three data warehouse architectures.

An abstract of the Stand-Alone Data Mart Architecture (image from the author).
An abstract of the Stand-Alone Data Mart Architecture (image from the author).

Stand-Alone Data Marts are the most simple data warehouse architecture. In this architecture, for a specific business need or subject area, you make a focused data mart from your transaction/operational databases. The process is simple, straightforward, and inexpensive. Although, you should create a separate data mart for another subject area from scratch. As you see, the approach works very well if you only need to build a limited number of data marts. The process of building and maintaining a large number of stand-alone data marts becomes too difficult, and for that reason, many large organizations avoid this approach.

Unlike Stand-Alone Data Marts architecture, two other architectures suggest creating a central data warehouse first and then create subject-area data marts from the central data warehouse. As a result, you don’t need to repeat the process of extracting and transforming data from transaction/operational databases every time that you want to build a data mart. In addition, this approach lets you maintain and update your data marts easier and more accurately since they are connected to a single source of information.

An abstract of Inmon's Corporate Information Factory architecture (image by the author).
An abstract of Inmon’s Corporate Information Factory architecture (image by the author).

In Corporate Information Factory architecture, developed by Bill Inmon, first, we make a central data warehouse, called Enterprise Data Warehouse. As you can guess, the Enterprise Data Warehouse gets its information from transaction/operation systems. In technical terms, an Enterprise Data Warehouse is an integrated repository of atomic data in a normalized format. The atomic feature suggests that we have data in its lowest level at this central data warehouse (no aggregation). Also, we have data in a normalized format which is similar to operational systems (no dimensional design). Finally, data marts are being created for different business analytics purposes from this central data warehouse. It worth mentioning that in this architecture, the principles of dimensional designs are not applied to the central data warehouse (i.e., Enterprise Data Warehouse) but the data marts.

An abstract of Kimball's Dimensional Data Warehouse Architecture (image by the author).
An abstract of Kimball’s Dimensional Data Warehouse Architecture (image by the author).

Finally, we have the Dimensional Data Warehouse architecture that is developed mainly by Ralph Kimball. Like the previous architecture, here we have a central data warehouse called Dimensional Data Warehouse that gets its information from transaction/operational systems. This central data warehouse feeds multiple data marts used by data scientists and business decision-makers for their analytics.

You may wonder, what is the difference between this architecture and the previous one? The main difference is the step that we apply principles of dimensional design to our architecture. In Dimensional Data Warehouse architecture, we apply principles of dimensional design to the central data warehouse (i.e., Dimensional Data Warehouse). In Corporate Information Factory architecture, we don’t apply principles of Dimensional Design to the central data warehouse (i.e., Enterprise Data Warehouse). However, Corporate Data Factory architecture applies principles of Dimensional Design to build data marts. The other interesting difference between these two architectures is the difference in Data Mart’s concept. In Corporate Information Factory architecture, a data mart is a physical entity separated from the central data warehouse. But in Dimensional Data Warehouse architecture, data marts are logical entities that live physically inside our central data warehouse.

Summary

Analytics systems are different from transaction/operational systems. Their different nature requires a different design and architecture. In this article, we discussed three main types of data warehouse architectures. We also learned about Dimensional Design and how it is applied in each of these major architectures.


Related Articles