
7 steps to take your BI reporting infrastructure to the next level

How to bring your BI team to the next level using a Cloud Data Warehouse, dimensional data modelling and other best practices

Photo by Oleg Illarionov on Unsplash

Over the last decade, data-driven decision making has become a top priority in most organizations. The idea is to use data to track well-defined KPIs that measure success and performance at the organizational, business-unit and even individual level. The toolset needed to accomplish this is called a Business Intelligence (BI) system, and providing it to the organization requires a well-designed BI reporting infrastructure. In many cases, particularly for companies coming from the startup ecosystem, this reporting infrastructure starts from the simplest foundation needed to deliver insights from day one. As companies grow and become more mature, their reporting infrastructure should go through the same maturing process. There are several steps BI teams can take to improve their reporting infrastructure, and some of the most common ones are discussed in this article.

1. Centralise data in a Data Warehouse

A reliable data warehouse is the foundation of every successful BI infrastructure. The concept is rather simple: create a platform that acts as a central location for all data sources and fuels all analytics functions. Data warehouses are sometimes also referred to as OLAP (OnLine Analytical Processing) systems. OLAP systems are optimized for read-heavy scenarios and for running complex queries on large amounts of data. In contrast, OLTP (OnLine Transaction Processing) systems capture and maintain transactional data in a database. There are two flavours of data warehouse: on-premises (on-prem) and cloud. Cloud-hosted data warehouses are rapidly replacing on-premises systems and are especially interesting for startups thanks to their easy setup and on-demand scalability. The most notable cloud data warehouse products are Snowflake, Amazon Redshift, Google BigQuery and Microsoft Azure Synapse Analytics.
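To make the OLAP/OLTP distinction concrete, here is a minimal sketch of the kind of read-heavy analytical query a data warehouse is optimized for. The orders table and its columns are hypothetical, and the syntax is Postgres/Snowflake-flavoured:

```sql
-- Hypothetical OLAP-style query: scans a large fact table and aggregates
-- it -- the workload cloud data warehouses are built for. An OLTP query,
-- by contrast, would read or update a handful of rows by key.
SELECT
    DATE_TRUNC('month', order_date) AS order_month,
    country,
    COUNT(*)          AS order_count,
    SUM(order_amount) AS revenue
FROM orders
WHERE order_date >= '2020-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;
```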

2. Facilitate Data Warehouse dimensional modelling

A dimensional model is a design technique to support end-user queries in a data warehouse. The main goal of dimensional data modelling can be summed up as abstracting and centralizing complex logic. Spending time on this in the early stages of your data team will save you from accumulating a lot of technical debt and can therefore significantly decrease the rework and maintenance required further down the line. It comes with many benefits:

  • Increase delivery speed by letting Analysts write queries more quickly and effectively
  • Give someone with basic SQL knowledge the chance to write ad-hoc queries
  • Reflect business rules/business processes that don’t exist in the database structure
  • Possibility to integrate with a version control system (Git) to maintain change history
  • Improve maintainability due to increased adaptability

The 3 main dimensional modelling approaches are Kimball, Inmon and Data Vault and their differences are explained well in this research paper.
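To make this more tangible, here is a sketch of a query against a Kimball-style star schema; the fact and dimension tables (fct_sales, dim_date, dim_customer) and their keys are hypothetical names chosen for illustration:

```sql
-- Hypothetical star schema: a central fact table joined to conformed
-- dimensions on surrogate keys. The business logic lives in the model,
-- so an analyst with basic SQL can answer questions in a few lines.
SELECT
    d.calendar_year,
    c.customer_segment,
    SUM(f.sales_amount) AS total_sales
FROM fct_sales AS f
JOIN dim_date     AS d ON f.date_key     = d.date_key
JOIN dim_customer AS c ON f.customer_key = c.customer_key
GROUP BY d.calendar_year, c.customer_segment
ORDER BY d.calendar_year;
```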

3. Serialize data definitions

Building on the previous step, we already have a solid system in place for dimensional data modelling, which is a prerequisite for serializing internal business definitions. Let's look at the example of defining MAU (Monthly Active Users) for your business. "Active user" can have many different meanings here: it could mean visiting your website, making a purchase or completing any other specific event. A shared understanding of what MAU (or any other KPI) actually means is crucial for consistent reporting across an organization and for measuring how effective the company is at achieving its business objectives.

There are two different aspects that need to be highlighted here.

  • Formally document Key Performance Indicators (KPIs) and other business metrics: storing well-defined KPI definitions in a centralized repository is a great way to prevent confusion and conflicting results, as you can easily align definitions across the entire organisation (see the sketch after this list).
  • Store static data in seeds: the seed feature provided by dbt is an extremely simple yet very useful tool for working with static data that changes infrequently.
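As a minimal sketch of what serializing the MAU definition could look like as a dbt model, assuming a hypothetical stg_events staging table and treating a purchase event as the agreed definition of "active":

```sql
-- models/monthly_active_users.sql (hypothetical dbt model)
-- One version-controlled place that encodes the agreed MAU definition:
-- a user is "active" in a month if they completed at least one purchase.
SELECT
    DATE_TRUNC('month', event_timestamp) AS activity_month,
    COUNT(DISTINCT user_id)              AS monthly_active_users
FROM {{ ref('stg_events') }}
WHERE event_name = 'purchase'
GROUP BY 1
```

Seeds plug into the same mechanism: a static CSV, such as a list of qualifying event names, becomes available through ref() just like any other model.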

4. Set up a CI/CD for your data models

Continuous Integration (CI) is a software development practice where code changes are regularly integrated into a shared repository. Each integration can then be verified by an automated build and automated tests. The main goal of CI is to detect and locate bugs more efficiently. Having CI/CD in place for all your data models is a huge step that can save you a lot of headaches, as we experienced at HousingAnywhere. CI/CD will drastically improve your development process and, sticking with the dbt example, can be described like this:

When Pull Request builds are enabled, dbt Cloud will listen for webhooks from GitHub indicating that a new PR has been opened or updated with new commits. When one of these webhooks is received, dbt Cloud will enqueue a new run of the specified job. Crucially, this run will be configured to build into a special, temporary schema using the prefix dbt_cloud. The name of these temporary schemas will be unique for each PR, and is shown in the Run Details page for the given run.

When the run is complete, dbt Cloud will update the PR in GitHub with a status message indicating the results of the run. The temporary schema created for the run will remain in your warehouse until the PR is closed, allowing you to inspect the relations built by dbt Cloud. Once the PR is closed, dbt Cloud will delete the temporary schema.

Source: https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/cloud-enabling-continuous-integration-with-github/

Another consideration is to use it not just during development but also to test your models on the master branch, detecting new data in your data warehouse that doesn't adhere to the tests you defined. Just think of all the times you expected to find only numeric characters in a column, yet alphabetical and special characters somehow managed to find their way in. Automating this process helps you avoid these situations.
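The "only numeric characters" expectation above, for instance, can be codified as a dbt singular test: a SQL file in the tests/ directory whose returned rows count as failures, so the run on master goes red as soon as offending records appear. The model and column names here are hypothetical:

```sql
-- tests/assert_postal_code_is_numeric.sql (hypothetical singular test)
-- dbt treats every row this query returns as a test failure.
SELECT
    customer_key,
    postal_code
FROM {{ ref('dim_customer') }}
WHERE postal_code !~ '^[0-9]+$'  -- Postgres/Redshift regex operator;
                                 -- other warehouses use REGEXP_LIKE etc.
```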

5. Define the scope of your reporting tools

Over time, data teams and other departments that use analytics will adopt many different tools with likely overlapping use cases. A common case that leads to apparent inconsistencies is comparing the web UI of web analytics tools (Google Analytics, Google Ads, Mixpanel) with the corresponding data after it has been ETL-ed into your data warehouse and possibly post-processed. To avoid these discrepancies, define a strict scope of use cases for each reporting tool and educate all users about its limitations.
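One lightweight way to keep such discrepancies visible, rather than letting users stumble over them, is a scheduled reconciliation query comparing daily totals between the raw loaded data and the post-processed model; both table names below are hypothetical:

```sql
-- Hypothetical reconciliation check: flag days where session counts in
-- the raw Google Analytics export and the modelled table diverge, so
-- known gaps can be documented instead of rediscovered by end users.
SELECT
    src.session_date,
    src.sessions                AS sessions_raw,
    dwh.sessions                AS sessions_modelled,
    src.sessions - dwh.sessions AS diff
FROM raw_ga_sessions_daily AS src
JOIN fct_sessions_daily    AS dwh USING (session_date)
WHERE src.sessions <> dwh.sessions
ORDER BY src.session_date;
```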

6. Implement code reviews

It’s advisable to always implement code reviews, no matter how "simple" a task seems or how experienced your contributors are. Yes, code reviews take some development time, but if done right they will actually save time in the long run. Code reviews help share knowledge, mentor less experienced team members and enforce coding best practices. In the end they lead to more maintainable, higher-quality code. Is there anything more to ask for?

7. Implement Identity and Access Management policies

Identity and Access Management (IAM) is about defining and managing users and the level of access they are granted. These privileges are usually defined through roles: a role is granted certain access privileges and can then be granted to users to create, modify or view certain objects. There are many best practices describing how to establish a solid IAM policy to improve security. Lacking appropriate Identity and Access Management can leave sensitive assets (dashboards, reports, raw data) accessible to users outside the intended audience and, in the worst case, result in data breaches that cost millions and cause reputational damage.
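As a minimal sketch of what role-based access can look like in practice (Snowflake-flavoured SQL; all role, user and object names are hypothetical):

```sql
-- Hypothetical role-based access setup: analysts can read the modelled
-- marts, but raw data and write privileges stay with the BI engineers.
CREATE ROLE analyst;
GRANT USAGE  ON DATABASE analytics       TO ROLE analyst;
GRANT USAGE  ON SCHEMA   analytics.marts TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE analyst;

-- Grant the role to a user; access to raw schemas is simply never granted.
GRANT ROLE analyst TO USER jane_doe;
```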

Conclusion

There are many different paths to a reliable reporting infrastructure; which one is right very much depends on the specifics of your organization, your current state and your goals for the future. A technology roadmap can be used as a flexible planning technique to support strategic, long-term planning and bring your infrastructure from its current state closer to your ideal state. This will lead to increased delivery speed, better adaptability to change and more accurate reporting, which in the end will support your organization in data-driven decision making.

