Data Quality (DQ) continues to be a major challenge for many organisations, especially those trying to modernise their data stack. Years of underinvestment in data programmes are now costing companies eye-watering millions of pounds in regulatory fines.
But that’s the stick – what is the carrot? How can you convince your senior leadership team to invest in DQ before the regulator waves its stick? Propose a framework that doesn’t cost millions to implement, yet improves company efficiency, creates new revenue opportunities, and mitigates risk.
Today, we will look at five core components that a modern DQ framework should include to achieve these goals.
Let’s dive in:
1. Data Observability
Isn’t Data Observability taking over DQ? Not entirely. Robust Data Observability ensures that basic technical DQ checks are carried out as part of the data flow. I see Data Observability as a subset of DQ, one that reduces the number of common technical errors that surface to business end users.
Defining basic checks, such as reconciliation, uniqueness and schema-change checks, can catch a large portion of issues upfront, before the data ends up in an analytical or ML model and disrupts business-critical processes.
For example: the business end user will not know that partial data was loaded into your staging tables; they will simply complain that their "numbers don’t look right". A Data Observability check for data completeness can catch the problem close to its source.
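Here is a minimal sketch of what such checks could look like, assuming the staging extract is available as a pandas DataFrame and the expected row count is known from the source system. The table and column names are hypothetical.

```python
import pandas as pd

def observability_checks(staging_df: pd.DataFrame, expected_rows: int,
                         key_column: str) -> list[str]:
    """Basic technical DQ checks: reconciliation, completeness, uniqueness."""
    issues = []

    # Reconciliation: did the full extract land in staging?
    if len(staging_df) < expected_rows:
        issues.append(f"Partial load: {len(staging_df)} of {expected_rows} expected rows")

    # Completeness: missing values in a business-critical key column
    null_count = staging_df[key_column].isna().sum()
    if null_count:
        issues.append(f"{null_count} null values in '{key_column}'")

    # Uniqueness: duplicate keys often indicate a double-load
    dupes = staging_df[key_column].duplicated().sum()
    if dupes:
        issues.append(f"{dupes} duplicate values in '{key_column}'")

    return issues

# Hypothetical usage against a staging extract
orders = pd.DataFrame({"order_id": [1, 2, 2, None], "amount": [10, 20, 20, 5]})
for issue in observability_checks(orders, expected_rows=10, key_column="order_id"):
    print("DQ alert:", issue)
```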
2. Self-Healing Pipelines
Each time an alert is raised about a potential DQ issue and an engineer has to fix the pipeline, the time to remediation depends on that engineer’s workload. Sometimes an expected DQ issue can be handled as part of the ETL (extract, transform, load) process. The best approach is to fix the problem at the source, but occasionally that is impossible, and you have to be pragmatic about the solution.
A more innovative approach is self-healing, where you either set rules that handle bad data or train an ML model to spot poor-quality data. If a row of bad data is filtered out of the pipeline, it must be logged in an exception table so that end-to-end auditing of the data remains available.
For example: a rule can be created in the pipeline where rows that are duplicated across every column are auto-filtered to an exception table. This ensures the pipeline job does not fail and that a steward/engineer is made aware of the filtered data, should they wish to investigate it.
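A minimal sketch of that duplicate-filtering rule, assuming a pandas-based pipeline step; the exception table is represented here as a simple append-only CSV, and all names are hypothetical.

```python
import pandas as pd
from datetime import datetime, timezone

def self_heal_duplicates(df: pd.DataFrame, exception_path: str) -> pd.DataFrame:
    """Filter rows duplicated across all columns into an exception table
    instead of failing the pipeline, keeping an audit trail."""
    duplicate_mask = df.duplicated(keep="first")
    exceptions = df[duplicate_mask].copy()

    if not exceptions.empty:
        # Log filtered rows for steward/engineer review (end-to-end auditability)
        exceptions["logged_at"] = datetime.now(timezone.utc).isoformat()
        exceptions["reason"] = "exact duplicate row"
        exceptions.to_csv(exception_path, mode="a", index=False)

    # Pipeline continues with clean data only
    return df[~duplicate_mask]

# Hypothetical usage inside an ETL step
raw = pd.DataFrame({"order_id": [1, 1, 2], "amount": [10, 10, 20]})
clean = self_heal_duplicates(raw, exception_path="dq_exceptions.csv")
print(clean)
```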
3. Intelligent Triaging
Finding a problem is the easier part of the battle; there are only so many places where things can go wrong. Finding someone who understands the problem, and can then fix it at either the source or the target, is a far bigger challenge.
A robust governance framework, with aligned accountabilities across engineering, analytics, and business teams, can help triage DQ issues intelligently, whether a technical problem is raised via a Data Observability alert or a business problem is raised through traditional DQ rules. An additional ML layer could be added so that the alerting model learns the accountable team over time, essentially removing the need for human intervention.
For example: a workflow can be created where, depending on the kind of alert produced, it is auto-assigned to the engineering, analytics, or business team. If the auto-assignment is wrong, the team can reject it, and the ML model learns from the false positive. The model would also be trained initially on a large amount of historical data to reduce false assignments.
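A minimal sketch of the auto-assignment idea, assuming alerts arrive as short text descriptions and scikit-learn is available. The team labels, training examples, and reject-feedback record are all illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative historical alerts labelled with the team that resolved them
training_alerts = [
    "schema change detected in orders table",
    "row count mismatch between source and staging",
    "revenue figures inconsistent with finance report",
    "customer segment definitions look wrong in dashboard",
    "duplicate primary keys in billing extract",
    "KPI trend does not match business expectations",
]
teams = ["engineering", "engineering", "business",
         "analytics", "engineering", "analytics"]

# Simple text classifier acting as the triage/routing model
router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(training_alerts, teams)

new_alert = "unexpected schema change in customer table"
assigned_team = router.predict([new_alert])[0]
print(f"Auto-assigned to: {assigned_team}")

# If the team rejects the assignment, record the corrected label as feedback
# so the model can be retrained and learn from the false positive
feedback = [(new_alert, assigned_team, "rejected", "analytics")]
```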
4. Implementing Automated Lineage
Lineage has many benefits, including an overview of your data flows from source to target. Overlaying the lineage with DQ issues is the cherry on the cake: visualising where DQ issues arise, while intelligently triaging and assigning them to the right team, will drastically reduce the time to remediate.
Lineage should also transcend your data stores and start from the business process layer, following the entire cycle of data capture > data transfer > data storage > data transformation > data consumption. This matters because it lets you pinpoint when a business process needs to change to stop bad data entering the lifecycle.
For example: for a B2C business, a visual data lineage graph can be created that integrates with each part of the data flow and looks explicitly at the capture stage through the CRM/billing tool. The minute bad data enters the flow, an alert can be produced using (1), dealt with using (2), or assigned using (3).
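A minimal sketch of overlaying DQ issues on a lineage graph, assuming networkx is available. The node names model the capture > transfer > storage > transformation > consumption flow and are purely illustrative.

```python
import networkx as nx

# Directed lineage graph: business process layer through to consumption
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("crm_signup_form", "billing_tool"),              # data capture
    ("billing_tool", "staging.orders"),               # data transfer
    ("staging.orders", "warehouse.fact_orders"),      # data storage / transformation
    ("warehouse.fact_orders", "revenue_dashboard"),   # data consumption
])

# Attach a DQ issue raised by an observability check (section 1)
nx.set_node_attributes(lineage, {"staging.orders": ["partial load detected"]}, "dq_issues")

# Walk upstream from the flagged node to pinpoint the business process to fix,
# and downstream to see which consumers are impacted
flagged = "staging.orders"
upstream = nx.ancestors(lineage, flagged)
downstream = nx.descendants(lineage, flagged)
print(f"Issue at {flagged}; likely source: {upstream}; impacted consumers: {downstream}")
```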
5. Scoring Data Health
The modern DQ framework needs to be transparent; there are too many downstream dependencies to hide or misrepresent poor-quality data. Implementing a scoring mechanism for the health of your data allows the analytics and data science teams to understand whether it’s worth investing time and effort in using this data to answer critical business questions.
Scoring also provides a further visual representation of the data and of the areas that need remediation if the business sees them as a priority. As part of the overall framework, a weighting should be applied to each DQ dimension (completeness, accuracy, consistency, etc.), and from this a single score can be produced to represent the health of the data.
For example: a score of 0–10 can be applied to a table. If a table has multiple uniqueness issues but its accuracy checks have passed, it might be given a score of 5; if it has passed most of the checks, it might be given a 9. Data science teams can agree on which score is acceptable for downstream business cases, and data management teams can devise a plan to improve the health of the data with lower scores.
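A minimal sketch of a weighted 0–10 health score, assuming each DQ dimension reports a pass rate between 0 and 1. The dimensions and weights are illustrative and would be agreed with the business.

```python
def data_health_score(pass_rates: dict[str, float],
                      weights: dict[str, float]) -> float:
    """Weighted average of per-dimension pass rates, scaled to 0-10."""
    total_weight = sum(weights.values())
    weighted = sum(pass_rates[dim] * weights[dim] for dim in weights)
    return round(10 * weighted / total_weight, 1)

# Illustrative results for one table: uniqueness issues, accuracy fine
pass_rates = {"completeness": 0.95, "uniqueness": 0.40, "accuracy": 1.0, "consistency": 0.90}
weights = {"completeness": 0.3, "uniqueness": 0.3, "accuracy": 0.25, "consistency": 0.15}

print(data_health_score(pass_rates, weights))  # 7.9 with these illustrative numbers
```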
Conclusion
The basics of a DQ framework, such as the ability to profile source and target data and to create complex business DQ rules, should always be included. However, what makes the process more modern, more automated, and less painful in the face of big data challenges are the five components highlighted above. Would you add any more to this list? Feel free to share them in the comments below.
Want to learn more about Data Quality? Check out the FREE Ultimate Data Quality Handbook. By claiming your copy, you’ll also become part of our educational community, receiving valuable insights and updates via our email list.
If you are not subscribed to Medium, consider subscribing using my referral link. It’s cheaper than Netflix and objectively a much better use of your time. If you use my link, I earn a small commission, and you get access to unlimited stories on Medium, win-win.
I also write regularly on Twitter; follow me here.