When integrating data from system A to system B, data engineers and other stakeholders should focus not only on the data process itself, e.g. via ETL/ELT, but also on the source system. The following are the circumstances I learned to take into account in earlier projects:
Availability
When is the source system available? You have to consider maintenance cycles, downtimes, etc. Otherwise, if the system is not available, the data integration process will fail or capture only part of the data. Here, it makes sense to implement monitoring of the source system and to work with retry mechanisms (read more about this in [1]). Another topic is the time window that must be considered, which leads directly to the next subject:
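A retry mechanism like the one mentioned above can be sketched as follows. This is a minimal illustration, not a production implementation; `flaky_fetch` is a hypothetical stand-in for whatever call actually reads from the source system:

```python
import time

def with_retries(fetch, attempts=3, base_delay=2.0):
    """Call fetch(); on failure, wait with exponential backoff and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError as exc:
            if attempt == attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)  # e.g. 2s, 4s, 8s, ...
            print(f"Source unavailable ({exc}), retrying in {delay:.2f}s")
            time.sleep(delay)

# Example: a source that is down for the first two calls (e.g. maintenance)
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("maintenance window")
    return ["row1", "row2"]

rows = with_retries(flaky_fetch, attempts=5, base_delay=0.01)
print(rows)  # ['row1', 'row2']
```

Combined with monitoring, this turns a short downtime into a delayed load instead of a failed or incomplete one.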
Load Times
How long does it take to load data from the source, and for how long can my process access the source system at full load? If you are doing a batch load of a large amount of data, this usually takes longer with large databases. If it takes too long or puts too much load on the source system, you should work with CDC (change data capture) techniques or database replication services. You should also make sure that the source system does not receive unnecessary load from a data transfer, e.g. during peak times. One example I experienced: the monthly financial statement at the end of the month put a heavy load on the system, so data processes are better avoided there as additional overhead.
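One common way to reduce load on the source, short of full CDC, is an incremental extract: pull only rows changed since the last run, tracked via a watermark column, and read them in small batches. A minimal sketch, assuming an `updated_at` column exists on the source table (names here are illustrative):

```python
def extract_incremental(source_rows, last_watermark, batch_size=2):
    """Pull only rows changed since the last run, in small batches,
    so the source is never scanned in full during peak times."""
    changed = [r for r in source_rows if r["updated_at"] > last_watermark]
    changed.sort(key=lambda r: r["updated_at"])
    batches = [changed[i:i + batch_size]
               for i in range(0, len(changed), batch_size)]
    # Remember the newest timestamp seen, as the starting point for the next run
    new_watermark = changed[-1]["updated_at"] if changed else last_watermark
    return batches, new_watermark

source = [
    {"id": 1, "updated_at": "2021-01-05"},
    {"id": 2, "updated_at": "2021-01-20"},
    {"id": 3, "updated_at": "2021-01-31"},
]
batches, wm = extract_incremental(source, last_watermark="2021-01-10")
print(batches)  # only rows 2 and 3 are extracted
print(wm)       # 2021-01-31
```

In a real pipeline the filter would of course run as a `WHERE updated_at > :watermark` query on the database, not in Python.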
Different Data Structures
In the new world of Big Data, column-based non-relational databases and data warehouses are on the rise. But if you have a classic relational database as a source system, you have to take this into account in the ETL/ELT process. For example, if data comes normalized from the source system, it makes sense to denormalize it in state-of-the-art data warehouse services like Google's BigQuery or Amazon's Redshift and process it in nested array structures. This affects both speed and cost [2].
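The denormalization step can be illustrated with a small sketch: two normalized tables (orders and their line items, hypothetical names) are joined into one nested record per order, which is the shape that column stores with repeated fields handle well:

```python
from collections import defaultdict

def denormalize(orders, items):
    """Join two normalized tables into one nested record per order,
    mirroring the nested/repeated structures used in columnar warehouses."""
    items_by_order = defaultdict(list)
    for item in items:
        items_by_order[item["order_id"]].append(
            {"sku": item["sku"], "qty": item["qty"]}
        )
    return [
        {**order, "items": items_by_order[order["order_id"]]}
        for order in orders
    ]

orders = [{"order_id": 1, "customer": "ACME"}]
items = [
    {"order_id": 1, "sku": "A-100", "qty": 2},
    {"order_id": 1, "sku": "B-200", "qty": 1},
]
print(denormalize(orders, items))
```

Loading records of this shape avoids the repeated joins at query time that [2] warns about.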
Different Data Types
Source and target systems may also use different data types. A classic example is that Oracle databases work with 0 and 1 or char 'y' and 'n' [3], while other technologies use Booleans. To enable easier data processing later, e.g. via BI tools, it may make sense to convert these as well.
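Such a conversion is simple but worth making explicit, so that unexpected flag values fail loudly instead of silently ending up in the target system. A minimal sketch:

```python
def to_bool(value):
    """Map Oracle-style flag values (0/1, 'y'/'n') to real Booleans."""
    truthy = {1, "1", "y", "Y"}
    falsy = {0, "0", "n", "N"}
    if value in truthy:
        return True
    if value in falsy:
        return False
    # Fail loudly on anything unexpected (NULLs, typos, new codes)
    raise ValueError(f"Unexpected flag value: {value!r}")

print([to_bool(v) for v in (1, 0, "y", "N")])  # [True, False, True, False]
```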
Business Context
In addition to these technical points, you should also have some understanding of the business behind the source system. What exactly does it do, what are the business processes behind it, and by whom or by what is the data generated?
This has always helped me, for example, to quickly understand data errors, to design data transformations in a meaningful way and, if necessary, to adjust the source system in order to improve the subsequent analysis process.
Summary
Through experience in data integration projects, it became clear to me that it always makes sense to focus not only on the data process and the target system, but also to pay attention to the source system. As a project manager or product owner, it is essential to understand the business process anyway, and this also applies to the system that implements it. The points above are the ones I consider most important and keep in mind for any data integration project.
Sources and Further Readings
[1] Christian Lauer, Five Best Practices for stable Data Processing (2021)
[2] Google, Avoid repeated joins and subqueries (2021)
[3] O’REILLY, Oracle PL/SQL Programming (2021)