
Should I Check the Quality of My Data Every Day?

Improving Data Quality Is a Balancing Act between Time and Cost

Photo by Photoholgic on Unsplash

When I was running a data quality training session, an attendee asked how frequently Data Quality (DQ) checks should be run. I realised the question does not have a straightforward answer.

Not all data is created equal; hence, DQ checks vary based on criticality, time, and cost. Although it would be excellent to check every attribute in your data lake or warehouse after each update or insert, such an operation is usually not feasible.

For this reason, each organisation adopts a DQ framework where these specifics are defined. So although the question is like asking "how long is a piece of string?", we can agree on some basic principles that help define this framework.


Defining Critical Data

This is an essential first step in any DQ framework. Don't try to boil the ocean. What data does your organisation treat as critical? What is the definition of critical?

Critical data is high-risk, high-value data: data that allows your organisation to:

  1. Improve revenue generation opportunities
  2. Avoid regulatory/reputational risk
  3. Realise efficiencies ultimately leading to cost savings

If the data is not doing any of the above, you should ask yourself why you are collecting it. Once critical data has been defined, it is essential to agree on the level of DQ checks you need to apply to these data buckets.

Now that you know which data is critical, rank that list by DQ priority. This ensures you spend your limited DQ improvement budget on the right attributes.

You can create a DQ prioritisation based on a weighted average of several factors: How often is this data captured, transformed or transferred? How much human intervention happens on this critical data? What DQ score threshold is required for this data?

You should prioritise the attributes with the highest weighted average score for DQ checks above others. Once you have the list, decide which DQ checks you want to apply and how frequently.
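To make the weighted average concrete, here is a minimal sketch in Python. The weights, the 1 to 5 rating scale and the attribute names are all assumptions for illustration; your own framework should define them.

```python
from dataclasses import dataclass

# Hypothetical weights for the three factors discussed above (assumed, not prescribed).
WEIGHTS = {
    "change_frequency": 0.4,    # how often the data is captured, transformed or transferred
    "human_intervention": 0.3,  # how much manual handling the data receives
    "required_threshold": 0.3,  # how high the required DQ score is
}


@dataclass
class Attribute:
    name: str
    change_frequency: int    # assumed scale: 1 (rarely) to 5 (constantly)
    human_intervention: int  # assumed scale: 1 (fully automated) to 5 (mostly manual)
    required_threshold: int  # assumed scale: 1 (low bar) to 5 (near-perfect required)


def priority_score(attr: Attribute) -> float:
    """Weighted average of the factor ratings for one attribute."""
    weighted = sum(getattr(attr, factor) * w for factor, w in WEIGHTS.items())
    return weighted / sum(WEIGHTS.values())


def prioritise(attrs):
    """Sort attributes so the highest weighted score comes first."""
    return sorted(attrs, key=priority_score, reverse=True)


# Illustrative attributes with made-up ratings.
attributes = [
    Attribute("customer_email", 4, 5, 3),
    Attribute("transaction_amount", 5, 2, 5),
    Attribute("marketing_opt_in", 2, 3, 2),
]
ranked = prioritise(attributes)
for attr in ranked:
    print(f"{attr.name}: {priority_score(attr):.2f}")
```

The attributes at the top of `ranked` are the ones to spend your DQ budget on first. Changing the weights is how you encode what your organisation cares about most.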

If you are unfamiliar with Data Quality dimensions, I suggest you read item 7 in this article first:

25 Terms to Help You in Your Career in Data Science


Technical DQ Checks

Technical DQ checks should be largely automated and applied to all known critical data. Each time an ETL process/data capture takes place, the Uniqueness, Completeness & Integrity checks should be carried out as part of this process.

If ETL jobs run multiple times a day, you should carry out these checks multiple times a day too. This ensures that duplicate, incomplete and unreconciled data is flagged to the Engineering team immediately.
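As a rough illustration of what these three checks look like, here is a sketch in plain Python that works on rows represented as dictionaries. The function names, table shapes and sample rows are hypothetical; in practice you would run the equivalent logic inside your ETL tool or as SQL.

```python
def uniqueness_violations(rows, key):
    """Return the key values that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


def completeness_ratio(rows, column):
    """Share of rows where the column is populated (not None or empty string)."""
    if not rows:
        return 1.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def integrity_violations(child_rows, foreign_key, parent_rows, primary_key):
    """Return foreign-key values in the child table with no matching parent row."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return {row[foreign_key] for row in child_rows if row[foreign_key] not in parent_keys}


# Made-up sample data: one duplicate id, one missing email, one orphaned order.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@example.com"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

print(uniqueness_violations(customers, "id"))
print(completeness_ratio(customers, "email"))
print(integrity_violations(orders, "customer_id", customers, "id"))
```

Calling these at the end of every ETL run, and alerting when a result crosses an agreed threshold, is one way to get the immediate feedback described above.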

Timeliness checks could fall under Technical or Business DQ checks, depending on why you need the data to be available on time. Either way, it makes sense to run this check daily so that updates spanning multiple tables do not go out of sync and cause more data problems downstream.
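A daily timeliness check can be as simple as comparing the last load time of each table. The sketch below assumes you can obtain a last-loaded timestamp per table; the table names, the 24-hour freshness limit and the 1-hour sync tolerance are illustrative assumptions.

```python
from datetime import datetime, timedelta


def stale_tables(last_loaded, as_of, max_age=timedelta(hours=24)):
    """Return the names of tables not refreshed within max_age of as_of."""
    return sorted(name for name, ts in last_loaded.items() if as_of - ts > max_age)


def out_of_sync(last_loaded, max_gap=timedelta(hours=1)):
    """True when the newest and oldest load times differ by more than max_gap."""
    times = list(last_loaded.values())
    return max(times) - min(times) > max_gap


# Hypothetical load log: order_items missed the latest run.
last_loaded = {
    "orders": datetime(2024, 1, 2, 6, 0),
    "order_items": datetime(2024, 1, 1, 5, 0),
    "customers": datetime(2024, 1, 2, 6, 5),
}
as_of = datetime(2024, 1, 2, 9, 0)

print(stale_tables(last_loaded, as_of))
print(out_of_sync(last_loaded))
```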

Reversing bad data inserts from an end table is complicated and costly, so for this scenario you should carry out DQ checks daily.


Business DQ Checks

If poor data breaks a critical business report or impacts a board metric, you will want to check its quality daily. For Business DQ checks, the frequency depends on the business priority.

Once you have created a baseline profile, you should test the top critical attributes weekly or monthly, depending on the business priority. Accuracy, Validity and Consistency checks would be essential to ensure the information being fed downstream can be relied upon.

Your business priority could be that Accuracy and Validity checks are carried out daily, whereas the Consistency check is only carried out weekly or monthly. There is less risk of poor Consistency as Integrity checks are being carried out already in your Technical DQ section.
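One way to record such a decision is a small schedule that maps each check to a frequency. This is only a sketch: the schedule values mirror the example above, while running weekly checks on Mondays and monthly checks on the first of the month are my own assumptions.

```python
from datetime import date

# Example schedule from the paragraph above; adjust to your business priority.
SCHEDULE = {
    "accuracy": "daily",
    "validity": "daily",
    "consistency": "weekly",
}


def checks_due(run_date, schedule=SCHEDULE, weekly_day=0):
    """Return the checks to run on run_date (weekly_day 0 means Monday)."""
    due = []
    for check, frequency in schedule.items():
        if frequency == "daily":
            due.append(check)
        elif frequency == "weekly" and run_date.weekday() == weekly_day:
            due.append(check)
        elif frequency == "monthly" and run_date.day == 1:
            due.append(check)
    return due


print(checks_due(date(2024, 1, 1)))  # a Monday
print(checks_due(date(2024, 1, 2)))  # a Tuesday
```

Keeping the schedule as data rather than hard-coding it makes it easy to revisit when the business priority changes.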


Other DQ Checks

Apart from regular DQ checks for Technical and Business reasons, you may want to run ad hoc DQ checks for specific use cases, for example a Data Science initiative or the start of a new Data Engineering project. This activity can be irregular and depends on project priority.

If the project creates new critical data, it needs to be added to the critical data element inventory. Once the project moves into business as usual, ensure this data falls under either the Business or Technical DQ checks.

Ad hoc DQ profiling can be powerful, as it points out data areas that have not been checked or monitored for a while. This gives you an opportunity to improve overall data health. Ad hoc DQ checks are use-case dependent and could run daily or weekly.
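For a quick ad hoc profile, a few summary statistics per column often reveal the problem areas. Here is a minimal sketch in plain Python; the function name, the chosen statistics and the sample rows are assumptions for illustration, and a profiling tool will give you far more.

```python
def profile_column(rows, column):
    """Basic profile of one column: row count, missing values, distinct values, range."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Made-up sample rows with one missing age.
sample = [{"age": 34}, {"age": None}, {"age": 34}, {"age": 51}]
print(profile_column(sample, "age"))
```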


Conclusion

So, as you can see, the answer unfortunately isn't straightforward: it depends on the end goal of checking the quality of the data. I recommend starting with the above principles and adjusting your strategy based on what you learn.

If you found the article helpful, feel free to let me know by leaving a comment below. Check out my other post here on Medium:

Apply Data Quality Checks at These 5 Points in Your Data Journey


