The following five things are the basics when it comes to implementing data processes such as Elt or ETL.
Prevent Errors
In case of failure a rollback should be done – similar to SQL: If a job aborts with errors, then all changes should be rolled back. Otherwise only X% of the transaction is transmitted and a part is missing. It will be very hard for you to find out what that missing data is.
Fair Processing Times
How long does a job take to process X data rows? This provides important insights for the process. How often and how long a process has to run? Which data actuality can I assure my department? What happens when data has to be reloaded?, etc.
Data Quality Measurement Jobs
Are my source and target systems compliant? How to be sure that all data has been transferred? Here, I recommend to build up a monitoring. It’s always a good idea to measure data quality and to detect errors in a fast manner, otherwise a lack of trust from the consumer can be the result. Find some inspiration here [1].
Transaction Security
When using database replication software in your process for example Amazon DMS instead of a direct connection between System A and B you can run into trouble: I once had a replication job that loaded data from table A and table B at the same time. Both, are further processed by an ETL process but if a data set from table B is not available due to high latency and the data set from table A is processed, the information from table B is missing. Here, I also recommend a monitoring or the dispensing of too many additional components in the process.
Consideration of dependency to other systems
Various circumstances must be taken into account for the source systems:
Availability: When is a source system available? Consideration of maintenance cycles, downtimes, etc.
High Data Load: The target systems must not receive any unnecessary changes (CDC) from the source system, e.g. during peak times. Following the point above, as a consequence, e.g. the data transfer can take place in a batch job e.g. in the night time.
Unwanted behavior of other Systems: Like described above, database replication services could ruin your ETL processes but also other problems can occur for example duplicate and inconsistent data. Here, it is important to get to know the source systems and their pitfalls.
Conclusion
For me, these are the five most important building blocks for establishing a stable and secure data process. You should always keep in mind, that data quality is a very important topic. Otherwise you might experience a lack of trust from users and business departments.
Sources and Further Readings
[1] Christian Lauer, How to improve your Data Warehouse Quality (2020)