Why Does Data Need a Massage?

A bird's-eye overview of the basic principles of data transformation

Andrey Shalitkin
Towards Data Science

--

Image by Adeolu Eletu on Unsplash

This post is part of the “Analytics for startup” series, so it might be helpful to those who are just starting to figure out how to build their own BI solution. Here I try to answer the most common question I hear at the beginning of the BI implementation stage: why do we need to do any data transformation at all? A very fair question, given that data transformation can be the most challenging and the most time- and money-consuming part of the implementation.

This is not any kind of tutorial but a bird's-eye overview of the basic principles.

Why do we need to transform the data?

We don’t need to, actually. If the source is just one simple system or a couple of Google Sheets, then we don’t. In reality, though, we normally have many sources from the very beginning: some kind of CRM, Google Analytics, Google Sheets, the database of our product, a website, and so on.

The main ideas of the transformation are:

  • Join the data
  • Denormalize the data

I believe the value of joining data from different sources is obvious. We need to join GA and CRM data to understand, for example, whether there is any correlation between a client’s device and the amount they pay, or between a client’s location and their preferences or product type. We would probably also want to perform some kind of geo-based cohort analysis, and so on.
It’s a bit less obvious, though, why we need to “denormalize” the data, and what “denormalization” even means in our case.

Denormalization

The source systems are normally optimized for processing user requests: operational events of one kind or another. In the majority of cases, one operation is isolated from the others. In other words, when the system creates a user order, we are only interested in the information about this particular user and don’t need (or almost don’t need) information about other users. From the data point of view, what is normally used here is the so-called normalized form.

At this point I must add that not all systems work like that, and not all of them use relational databases, but for the sake of simplicity let’s assume that we are dealing with the relational database of something like an average-size e-commerce shop.

The ER diagram of the user-related data that we need to collect in our case can look like this:

Diagram by author
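
To make this more concrete, here is a rough SQL sketch of what such a normalized schema could look like. The table and column names are purely illustrative, not the actual schema from the diagram:

```sql
-- A simplified, hypothetical normalized schema for an e-commerce shop.
-- Each entity lives in its own table; addresses and orders reference users by key.
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    email      TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL
);

CREATE TABLE addresses (
    address_id INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users (user_id),
    country    TEXT NOT NULL,
    region     TEXT,
    city       TEXT
);

CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users (user_id),
    status     TEXT NOT NULL,
    total      NUMERIC(10, 2) NOT NULL,
    created_at TIMESTAMP NOT NULL
);
```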

There are certain predefined operations the system can perform: user creation, change of the user’s address, user deletion or deactivation, placing an order, etc. All these predefined operations normally become clear during the application development stage and can be coded by the developers. Of course, the development phase can feel like it lasts forever, but that is another story. The point is that we define the operations, we code them, we use a proper development methodology, and everybody is happy.

The situation is different from an analytical point of view. We might need to calculate the count of users broken down by country or region, by action, or by first/last action name, or the count of orders and other user actions across different pages or websites. An analyst normally can’t formulate in advance which specific slice-and-dice analysis will be needed, because the analysis itself is a data-driven process. In other words, the set of operations an analyst might need to perform is not predefined during the BI system implementation stage.

One can think about requests from an analyst as queries to the BI system or, to be more technical, as queries to the analytical database (the Data Warehouse, or DWH). Not all analysts are very technical, and analysts are not the only ones interested in the data: any business user might need access to it to make thoughtful, data-driven decisions. This means we need some level of abstraction, and the industry standard for this purpose is the Dimensional Fact Model. The idea is to define a simple model consisting of facts (user actions, in our case) and dimensions (anything to slice and dice the data by). Here is what it can look like:

Diagram by author
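
Again as a purely illustrative sketch, a star-schema version of the same data could consist of one fact table of user actions plus two dimension tables:

```sql
-- A hypothetical dimensional (star-schema) layout for the same data.
-- The fact table records user actions; dimensions describe how to slice them.
CREATE TABLE dim_users (
    user_key INTEGER PRIMARY KEY,
    country  TEXT,
    region   TEXT,
    device   TEXT
);

CREATE TABLE dim_dates (
    date_key    DATE PRIMARY KEY,
    year        INTEGER,
    month       INTEGER,
    day_of_week INTEGER
);

CREATE TABLE fact_user_actions (
    action_id   BIGINT PRIMARY KEY,
    user_key    INTEGER REFERENCES dim_users (user_key),
    date_key    DATE REFERENCES dim_dates (date_key),
    action_name TEXT,           -- e.g. 'page_view', 'order_placed'
    order_total NUMERIC(10, 2)  -- NULL for non-purchase actions
);
```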

In this case, any analytical query is just a join of these three tables with different aggregation functions and filters. We don’t need to code all the possible queries up front; the only thing we need to implement is a set of measures (roughly, aggregation functions in SQL) and a set of dimensions.
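
For example, against the hypothetical schema sketched above, “count of orders and total revenue per country and month” is nothing more than a join, a filter, and two aggregations:

```sql
-- Orders and revenue per country and month.
-- Any other slice is just a different set of GROUP BY columns and filters.
SELECT
    u.country,
    d.year,
    d.month,
    COUNT(*)           AS orders,
    SUM(f.order_total) AS revenue
FROM fact_user_actions AS f
JOIN dim_users AS u ON u.user_key = f.user_key
JOIN dim_dates AS d ON d.date_key = f.date_key
WHERE f.action_name = 'order_placed'
GROUP BY u.country, d.year, d.month;
```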

This is a very basic example, and the Dimensional Fact Model has its own complications and caveats, but it should be enough for a basic understanding of why we need to transform the data, which is the point right now.

How to do it

The transformations I’ve mentioned earlier can be performed in many different ways, and each BI system has its own mechanism, but one common approach is to perform the majority of the transformations at the DWH level, i.e. inside the database. If the database supports the SQL standard, then everything we need is just SQL queries. Well, not just the queries, but also a place to store them, a mechanism to execute them, and some way to organize the sequence of execution.
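
As a rough illustration using the hypothetical tables from above, one such transformation step could be a plain SQL statement that denormalizes the source data into the fact table:

```sql
-- Denormalize source data into the fact table inside the DWH.
-- Table names are the illustrative ones used earlier, not a real pipeline.
INSERT INTO fact_user_actions (action_id, user_key, date_key, action_name, order_total)
SELECT
    o.order_id,
    o.user_id,                   -- reused as the user dimension key here
    CAST(o.created_at AS DATE),
    'order_placed',
    o.total
FROM orders AS o;
```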

One can use the built-in ability of many databases to store objects (views, stored procedures) inside the database itself together with an execution scheduler, or use an external tool. I personally prefer an external tool like dbt. It lets you store everything in Git, provides a mechanism for reusing existing code, combines queries with automated tests and documentation, and has many other great features. Plus, it’s an open-source tool. This is my choice, though; I have had successful projects where everything was done by means of MS SQL or PostgreSQL without any additional tools.
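
For illustration, a minimal dbt model for the fact table above could be a single SQL file like this; the model and source names (e.g. stg_orders) are just examples:

```sql
-- models/marts/fact_user_actions.sql (hypothetical dbt model)
{{ config(materialized='table') }}

SELECT
    o.order_id                 AS action_id,
    o.user_id                  AS user_key,
    CAST(o.created_at AS DATE) AS date_key,
    'order_placed'             AS action_name,
    o.total                    AS order_total
FROM {{ ref('stg_orders') }} AS o
```

dbt builds the dependency graph from the ref() calls, so the sequence of execution mentioned above comes for free, and `dbt run` materializes the model as a table in the warehouse.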
