A Well-Designed Analytics Data Warehouse, Leveraging Automation and Self-Service Analytics

The purpose of Data Science (or, interchangeably, Analytics) is to enable business users to make better decisions guided by data. Decisions based on quantitative data are typically better than decisions based on guesses or gut feeling, but one should not slavishly follow the data: especially in a business setting, it must be understood within its business context and within the limitations of the data itself. In business, data is never as pristine as it is in physics, chemistry, engineering or other hard sciences. We have neither the luxury of running highly controlled experiments nor the advantage of unchanging physical laws! Therefore, when data science techniques borrowed from the hard sciences are applied to business, they should be applied with the following general understanding:
- The data will never be pristine and will almost never support more than one decimal place of precision (e.g., do not calculate conversion rates to 5 decimal places). Business is not Physics.
- The primary purpose of a data scientist (or, interchangeably, analyst) is to provide the subject matter expertise needed to judge how good the data is, which data science technique should be used, and how confident we can be in what the data is telling us.
- Data Science (or analysis) that cannot be understood by business users is pointless; the other primary purpose of a data scientist is to communicate the results of the analysis in a way that business users can understand. Talking about the first-order moment of the data (i.e., the average) to the Director of Marketing is about as useful as talking in Cantonese to someone who only speaks English. A data scientist must have deep experience with mathematics but, just as importantly, must be able to communicate with non-experts in plain English.
- Reporting does not equal Data Science. The only purpose of reporting is to count things (e.g., How many orders did we receive this week? How many dollars of widgets did we sell last month?). If you are paying a data scientist just to count things, you are wasting both their brains and your money. Data Science takes the raw, basic data and draws more advanced, less obvious inferences from it by borrowing the mathematical techniques long used in the physical sciences. However, do not forget the previous point: those inferences are only useful if business users can understand them.
- Fancy mathematical techniques often will not yield better results than simpler techniques when applied to business data, because the data is rarely clean enough to reward them (see the first point). A simple linear model may yield just as good an answer as a neural net on most business data. There is a time and a place for more advanced methods, but try the simple ones first and use judgement before attempting something more complex (see the second point).
And more data is not always better! Believing otherwise is a fundamental misunderstanding. Adding bad data is like adding random noise to a music track: it's data, but not data that will help you enjoy the song.
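To make the first point about decimal places concrete, here is a minimal Python sketch. The visitor and order counts are made up for illustration, the function names are hypothetical, and the margin uses the textbook normal approximation for a proportion, which is only one of several reasonable choices.

```python
import math


def conversion_rate_with_margin(conversions, visitors, z=1.96):
    """Return (rate, margin) as fractions.

    The margin is a rough 95% interval half-width from the normal
    approximation; it is an illustrative choice, not the only valid one.
    """
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate, margin


def format_rate(rate, margin):
    """Report the rate to one decimal place, alongside its margin."""
    return f"{rate * 100:.1f}% (+/- {margin * 100:.1f} points)"


# Hypothetical example: 37 orders from 1,000 visitors.
rate, margin = conversion_rate_with_margin(conversions=37, visitors=1000)
print(format_rate(rate, margin))  # 3.7% (+/- 1.2 points)
```

With an uncertainty of more than a full percentage point, quoting this rate as 3.70000% would claim a precision the data cannot support.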
The Three Pillars
There are three pillars to a well-run, efficient and useful data science department within any business:
- A properly structured analytics data warehouse that deconstructs the business into a series of fact, dimension and user tables. It is important to understand the fundamental differences between these three types of tables; one of the major challenges with this approach is the temptation to blur the lines between them! A fact table is exactly what it means in plain English: a collection of facts around a single entity (e.g., orders). A user table typically contains a specific combination of facts across entities (e.g., the specific definition of an order for the purposes of financial planning versus operational planning: the operational planning user table may include fixes to problem orders because it's work that has to be done, whereas the financial user table may exclude fixes because they do not generate revenue). A dimension is anything that comes after the word "by" in the English language. E.g., number of orders by channel (i.e., channel is a dimension); number of orders by state (i.e., geography is a dimension). The dimensions are how you "segment" the data in order to understand it.
- Automation so that data automatically flows from sources into the analytics data warehouse and out of the analytics data warehouse to users. Basic, standard analysis should be applied automatically, as close to the data warehouse as possible, to avoid wasting the valuable time of a data scientist. They are paid to think, not to copy data across spreadsheets and apply simple calculations.
- Use of self-service analytics to push curated information to business users using business intelligence tools such as QlikView or Tableau. A business user should own and understand their own data so that they are empowered to make rapid information-driven decisions. Note that we say ‘curated information’ purposely, because a mountain of poorly understood or poorly structured data is useless and even harmful; more data isn’t always better.
These three things are necessary for a well-run data science function because most of the time (or effort) in any data science project is not spent on the data science itself! It is spent on acquiring the data you need, repeatedly executing useful data science, and publishing the results to business users so they can act on them. We will expand on each of these in a future post.
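As a small preview of the first pillar, the distinction between fact, dimension and user tables can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than a real schema: the sample rows, the column names (order_id, channel_id, state, amount, is_fix) and the helper functions are all assumptions, and a real warehouse would hold these as database tables rather than in-memory lists.

```python
from collections import Counter

# Fact table: one row of facts per order (hypothetical sample rows).
fact_orders = [
    {"order_id": 1, "channel_id": "W", "state": "CA", "amount": 120.0, "is_fix": False},
    {"order_id": 2, "channel_id": "W", "state": "NY", "amount": 80.0, "is_fix": False},
    {"order_id": 3, "channel_id": "P", "state": "CA", "amount": 0.0, "is_fix": True},
    {"order_id": 4, "channel_id": "P", "state": "TX", "amount": 45.0, "is_fix": False},
]

# Dimension table: describes the values you can segment "by".
dim_channel = {"W": "Web", "P": "Phone"}


def operational_orders(orders):
    """User table for operations: fixes count, because they are work to do."""
    return list(orders)


def financial_orders(orders):
    """User table for finance: fixes are excluded, because they earn no revenue."""
    return [order for order in orders if not order["is_fix"]]


def count_by_channel(orders):
    """Number of orders by channel, using the channel dimension for labels."""
    return dict(Counter(dim_channel[order["channel_id"]] for order in orders))


def count_by_state(orders):
    """Number of orders by state (geography as a dimension)."""
    return dict(Counter(order["state"] for order in orders))


print(count_by_channel(operational_orders(fact_orders)))  # {'Web': 2, 'Phone': 2}
print(count_by_channel(financial_orders(fact_orders)))    # {'Web': 2, 'Phone': 1}
print(count_by_state(financial_orders(fact_orders)))      # {'CA': 1, 'NY': 1, 'TX': 1}
```

The same fact table feeds both user tables; only the business definition of an "order" differs, which is why the Phone count changes between the operational and financial views.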