As in any other IT system, you will accumulate technical debt over the years. In this article, I want to provide a list of typical debts within Data Warehouses and Data Lakes, their causes, and possible solutions.
Reasons for Technical Debt
There are two main reasons: unwanted debt due to a lack of skill or knowledge, and consciously accepted debt, e.g. due to time pressure. But there are also many technical reasons behind both categories [1]:
- Architectural principles are not adhered to, such as deviating from the layer model
- Poor technical infrastructure
- Lack of development standards and design patterns
- Imprecise business requirements that are misunderstood and implemented incorrectly
- Insufficient testing, which results in unwanted errors
- Constantly changing or additional requirements, etc…
Typical Technical Debts
Due to the reasons above, you will run into some typical forms of technical debt. I have experienced the following situations myself:
Technical Debt 1: Outdated and Inefficient Infrastructure
Outdated, non-scalable servers and software for your Data Warehouse and ETL/ELT processes can really slow down your data projects and analytics capabilities. For example, if your business departments are frustrated because data integration projects take longer than expected and the backlog keeps growing, this could be the result of a non-agile, out-of-date toolset. The same problems can occur in data analytics when modern technologies are missing, for example when your dashboard tool is not suitable for Big Data analytics because the system cannot scale with the data.
Technical Debt 2: No Test Scenarios and Monitoring
Maybe the logic in your ETL process, the transformations within the database, or the reporting service is not completely correct. This is often the result of an uncontrolled and unmonitored data process. Data and reports that are not properly tested (both IT and the business should check the data) can then lead to wrong numbers. Besides incorrect information, error tickets will fill your backlog and block other important development tasks.
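Such checks do not have to be complicated. The following is a minimal sketch of an automated quality check that a monitoring job could run after each load; all table, field, and function names are hypothetical and only illustrate the idea:

```python
# Minimal sketch of automated data-quality checks on a loaded batch.
# Assumes rows arrive as plain dicts from the ETL/ELT process; names
# like "order_id" and check_batch() are hypothetical examples.
from datetime import date

def check_batch(rows, expected_min_rows, required_fields):
    """Run basic quality checks and collect human-readable findings."""
    findings = []
    if len(rows) < expected_min_rows:
        findings.append(f"row count {len(rows)} below minimum {expected_min_rows}")
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            findings.append(f"row {i}: missing fields {missing}")
    return findings

batch = [
    {"order_id": 1, "amount": 19.99, "order_date": date(2020, 12, 1)},
    {"order_id": 2, "amount": None, "order_date": date(2020, 12, 1)},
]
issues = check_batch(batch, expected_min_rows=2, required_fields=["order_id", "amount"])
# A monitoring job could raise an alert or open a ticket when issues is non-empty.
```

Even a handful of such assertions catches broken loads before the business sees wrong numbers in a report.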
Technical Debt 3: Slow Data Integration
Besides inefficient infrastructure and errors in your data pipelines, the architecture of the pipelines themselves can lead to problems. I once experienced an ETL process realized with Talend hosted on-premises. Loading the data took very long, and when errors occurred, the correction was very time-consuming because the data often had to be loaded again, which also takes time. The process was then reimplemented as ELT instead of ETL, which resulted in significantly lower loading times. The raw data was now loaded into the Data Lake first, so any errors occurred in the T of ELT (realized within the cloud-based Data Lake) and could be fixed quickly.
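The core of the ELT idea can be sketched in a few lines. Here SQLite stands in for a cloud warehouse, and all table and column names are hypothetical; the point is only the order of the steps, with the raw load happening before any transformation:

```python
# Minimal ELT sketch: SQLite stands in for a cloud warehouse.
# Raw data is loaded unchanged first, then transformed with SQL
# inside the database. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")

# Extract + Load: dump the raw records as-is, no transformation yet.
con.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, "19.99"), (2, "5.00"), (3, "bad-value")],
)

# Transform: runs inside the database. If this step fails, only the
# SQL has to be fixed and re-run; the raw load stays untouched, so
# nothing has to be transferred from the source systems again.
con.execute("""
    CREATE TABLE clean_orders AS
    SELECT order_id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount GLOB '[0-9]*.[0-9]*'
""")
total = con.execute("SELECT SUM(amount) FROM clean_orders").fetchone()[0]
```

Because the raw layer is preserved, a broken transformation never forces a reload from the source systems, which was exactly the time sink in the ETL setup described above.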
Technical Debt 4: Using the Data Warehouse as a Data Hub
The ETL infrastructure of a Data Warehouse is repeatedly misused as a Data Hub between operational systems. The Data Warehouse team then also becomes responsible for operational data flows, which are always given the higher priority. As a result, resources for operating and further developing the Data Warehouse are missing [1].
Technical Debt 5: Shadow IT
Very common is the misuse of Data Warehouses, databases, and Data Lakes as a data dump/swamp. Data is loaded into them without much governance and then queried by tools like Excel, R scripts, or BI tools [2]. This leads to a shadow IT structure around the actual system. The shadow IT then has to be kept running, and no time is left for urgently needed new data projects and architecture adjustments.
Technical Debt 6: Loading Unnecessary Data and/or Using Your System as an Archive
Data should never be loaded just to have it: there should always be a use case for it. Especially with newer cloud technologies, you pay for every byte you store and query, so you should load only the data you need. You can use your Warehouse or Lake as an archive, but then you should think about partitioning and clustering your data [3]; otherwise, queries will be expensive and inefficient.
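The effect of partitioning can be illustrated without any warehouse at all. In the pure-Python sketch below (all structures and names are hypothetical), a date filter only scans the matching partition, which is essentially what partition pruning in a warehouse such as BigQuery achieves:

```python
# Sketch of why partitioning keeps archive queries cheap: a query with
# a date filter only scans the matching partition, not the full table.
# The structures and names here are hypothetical, for illustration only.
from collections import defaultdict

table = defaultdict(list)  # partition key (month) -> rows
for day, amount in [("2020-11-03", 10), ("2020-12-01", 20), ("2020-12-15", 30)]:
    table[day[:7]].append({"day": day, "amount": amount})

def query_month(table, month):
    """Scan only the requested partition (partition pruning)."""
    scanned = table.get(month, [])
    return len(scanned), sum(r["amount"] for r in scanned)

rows_scanned, total = query_month(table, "2020-12")
# Only the December partition is read; the November rows are never touched.
```

In a pay-per-byte-scanned model, reading two rows instead of three is a trivial saving, but on a multi-year archive the same pruning cuts query cost by orders of magnitude.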
Technical Debt 7: Old School BI Layer
If SQL or Excel is the tool for your reports and dashboards, the business will always need the help of IT when something changes in the database or when a new report is needed. With newer BI technologies, you can shift that part fully to the business departments, because no programming skills are required: reports can be built via drag and drop.
Technical Debt 8: Deviations from the Layer Model
In the classic layer model, data is only transported upwards, i.e. from the data acquisition layer (the source systems) to a higher layer. Unfortunately, there are repeated deviations from this, with bidirectional data flows between the layers or between operational systems and the Data Warehouse [1].
Technical Debt 9: Inadequate Patch/Release Management
Business requirements are often given higher priority than technical requirements. This means there is often no time to apply the latest patches and close security gaps. It is also really frustrating if, in 2020, you still have to work with an MS Power BI version from 2019 because IT cannot handle the update and the required underlying MS Server update due to non-existent prioritization.
Best Practices for Solving the Shortcomings
What are the possible solutions to reduce this technical debt? Here, I see three important aspects you should look at.
Strategic Aspects
Firstly, there should be something like an IT and/or data governance function, and a department that manages and develops the field of Data Warehousing and BI. Furthermore, you need to stay up-to-date and follow the newest trends in this area. With new paradigms like the public cloud and the shift from ETL to ELT, your data projects can be accelerated and IT infrastructure tasks simplified. A product owner is a good way to manage these strategic decisions and to keep the balance between technical and business requirements.
Data Integration and Monitoring Aspects
While strategically choosing the ELT over the ETL approach and using modern cloud services, you should technically implement stable and monitored data integration processes. Additionally, no unnecessary data should be loaded, and the data should be checked regularly by both IT and the business departments.
Architectural and Technical Aspects
The pipelines and databases should not be misused as a Data Hub. Moreover, shadow IT should be avoided, and functions shifted from workarounds in Excel or other solutions into the Data Warehouse and BI tools. When planning the architecture for your Data Warehouse or Lake, you should also stick to best practices and follow accepted layer models.
Modern cloud providers like Google often already provide proven technical concepts, as you can see in the figure below for a Data Lake based on different services:
![Google Data Lake Architecture - Source:Google [4]](https://towardsdatascience.com/wp-content/uploads/2020/12/19OfsBuaHkPTZb4pBeOHLCw.png)
Conclusion
There are many mistakes you can make when building up a data analytics platform. Over the years, technical debt will emerge; a few examples were discussed in this article. The way to solve it, or better, avoid it in the first place, is to follow best practices. To accelerate your projects, you should keep yourself up-to-date and use proven trends like cloud-based Data Warehouses, hybrid Data Lakes or Data Lakehouses, and an ELT-based data process. As already mentioned, a product owner is a good way to manage strategic decisions and to keep the balance between technical and business requirements. Together with other experts, it is always a good idea to develop a roadmap and to optimize your systems constantly.
Sources and Further Readings
[1] BARC, Technical Debt – der langsame Tod für jedes Data Warehouse (2020)
[2] BIQ, Warum ein Data Lake ohne Konzept zu einem Daten-Tümpel verkommt (2017)
[3] Google, Partitioning and Clustering in BigQuery (2020)
[4] Google, Cloud Storage als Data Lake