According to a Gartner survey, nearly 60% of organisations don’t measure the annual financial cost of poor-quality data. I think the other 40% are being liberal with the truth. In my experience, the loss due to poor data quality is rarely quantified by organisations, even though it hits them in the face daily.
I don’t think this state of affairs is due to a lack of trying; it’s more that we don’t know where to start. As with bad habits, trying to fix every issue at once, or in a single year-long project, will lead to failure. What’s needed is a culture shift towards accountability, clear processes, and a bit of help from technology.
Today, we will dive into five ways to find and resolve issues. Let’s go!
1. Vetting the data coming from source systems
If, like most large organisations, you have old, archaic source systems feeding your data warehouse/lake, then you know source data is a big problem.
If the systems are old, they may be too inflexible to accept changes. In that case, applying duplicate and reconciliation checks when data is received or staged in your data warehouse will ensure you catch issues before they pollute your wider data estate.
Once an issue is found, you can either stop the bad records from flowing any further or handle the issue in your pipeline design. As long as you know the data issue exists, you can alert users and help them avoid making wrong decisions.
As an example, when you receive a customer data file from a source, perform a completeness check to ensure that essential fields such as last_name, date_of_birth, and address are fully populated. If data is missing, you can discard those records from your analytics bucket, or add default values for the missing information. For instance, you might set 01/01/1800 as the default date_of_birth; this makes records that lack critical information easy to identify, which supports better decision-making during analysis.
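Here is a minimal sketch of such a completeness check, assuming the feed is a CSV loaded with pandas and that the file name and column names (customer_feed.csv, last_name, date_of_birth, address) are illustrative:

```python
import pandas as pd

REQUIRED_FIELDS = ["last_name", "date_of_birth", "address"]
DEFAULT_DOB = "1800-01-01"  # sentinel date for records missing a date of birth

# Hypothetical source file used for illustration only
customers = pd.read_csv("customer_feed.csv")

# Completeness check: flag records with any required field missing
incomplete = customers[REQUIRED_FIELDS].isna().any(axis=1)
print(f"{incomplete.sum()} of {len(customers)} records fail the completeness check")

# Option 1: reject incomplete records from the analytics layer
clean_customers = customers[~incomplete].copy()

# Option 2: keep them, but substitute a sentinel default so they are easy to spot later
customers["date_of_birth"] = customers["date_of_birth"].fillna(DEFAULT_DOB)
```

Either option works; the important part is that the incomplete records are flagged before they reach downstream consumers.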
2. Fixing data issues in existing tables
Over time, data quality deteriorates due to a lack of governance processes: keys get recycled, duplicate information is added, or patches are applied that make things worse.
A simple data profile can show the current state of data in a given table. From there, focus on the core attributes/columns that have issues; the key is to isolate the problem as much as possible. Once the attribute(s) have been determined, apply a one-time fix. For example, if data is duplicated, agree with the Data Stewards on how to arrive at a single record. Or, if the data is inaccurate, such as dates of birth or start and end dates, agree on the correct replacement and apply the fix.
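As a minimal sketch, a quick profile of a table with pandas might look like the following (the file name and the customer_id key column are illustrative assumptions):

```python
import pandas as pd

# Hypothetical customer table loaded for profiling
customers = pd.read_csv("customer_table.csv")

# Per-column profile: type, missing values, and cardinality
profile = pd.DataFrame({
    "dtype": customers.dtypes.astype(str),
    "null_count": customers.isna().sum(),
    "null_pct": (customers.isna().mean() * 100).round(2),
    "distinct_values": customers.nunique(),
})
print(profile)

# Duplicate check on the business key
print("Duplicate customer_id rows:", customers.duplicated(subset=["customer_id"]).sum())
```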
Once the fix is applied, you must operationalise the process to avoid further deterioration of data quality. A cleansing job can run daily and fix the data by running update statements, or the fix can be a manual intervention by an end user reviewing an audit table.
As an example, if your customer data table has duplicate customer records, you can use a data quality tool to profile your data. This will help you identify the duplicates and determine why they occur: the duplicates could be caused by the source sending the same information multiple times, by poor data pipeline code, or by a business process. Once you have identified the duplicates and their root cause, you can merge the records or delete the redundant ones. If you cannot resolve the root cause, set up a cleansing job that regularly performs a duplicate check, matches customers, merges them, and deletes the redundant records (in effect, master data management).
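As a minimal sketch of such a recurring dedupe step, assuming duplicates share the same email address and that the most recently updated record survives (both the matching rule and the last_updated column are illustrative choices that should be agreed with the Data Stewards):

```python
import pandas as pd

customers = pd.read_csv("customer_table.csv")

# Find rows involved in duplicate groups, matched on a business key (email here)
dupes = customers[customers.duplicated(subset=["email"], keep=False)]
print(f"{len(dupes)} rows involved in duplicate groups")

# Survivorship rule: keep the most recently updated record per email
customers["last_updated"] = pd.to_datetime(customers["last_updated"])
master = (
    customers.sort_values("last_updated")
    .drop_duplicates(subset=["email"], keep="last")
)

# Write the merged "golden" records back for downstream use
master.to_csv("customer_table_deduped.csv", index=False)
```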
3. Re-creating poorly designed data pipelines
Data issues can sometimes stem from poorly designed or inefficient data pipelines. You can enhance data flow, transformations, and integration processes by re-evaluating and reconstructing these pipelines.
Poorly designed pipelines may suffer from bottlenecks that delay the timely processing of data, while convoluted transformations and integration steps can introduce errors and inconsistencies. Analysing the pipeline is paramount to isolating the issue and applying a fix.
For bottlenecks, the pipeline can be redesigned to execute on multiple nodes; for transformation issues, the pipeline can be broken down into stages (avoiding redundant joins, querying large tables multiple times, etc.) to reduce overall complexity.
As an example, if you’re experiencing long update times with your customer data table, evaluate the pipeline by breaking it down into its components. Upon closer inspection, you may discover that the pipeline design is complex because it relies on multiple tables and reference lookups to generate a master record output. To improve performance and isolate the issue, design and test each pipeline component separately. This process may reveal that specific table joins are taking longer than expected. At that point, examine the join and determine whether it produces a cartesian (cross) join or reads a large table multiple times because of the join design. Once you’ve pinpointed the issue, remove the problematic joins or create staging tables to simplify the pipeline.
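As a minimal sketch of breaking one complex join into explicit staging steps (all table and column names are illustrative, and pandas stands in for whatever engine your pipeline actually runs on):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")
addresses = pd.read_csv("addresses.csv")
transactions = pd.read_csv("transactions.csv")

# Stage 1: reduce the large transactions table once, instead of joining it repeatedly
latest_txn = (
    transactions.sort_values("txn_date")
    .drop_duplicates(subset=["customer_id"], keep="last")
    [["customer_id", "txn_date", "amount"]]
)

# Stage 2: join customers to addresses on an explicit key
# (an explicit key avoids an accidental cartesian join)
cust_addr = customers.merge(addresses, on="customer_id", how="left")

# Stage 3: build the master record output from the two staged results
master = cust_addr.merge(latest_txn, on="customer_id", how="left")
master.to_csv("customer_master.csv", index=False)
```

Each stage can now be timed and tested in isolation, which is what makes the slow join easy to pinpoint.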
4. Leveraging data visualisation dashboards
The hard part of resolving issues is finding them in the first place. You will hear the usual spiel from vendors about how their technology is the next big thing that will single-handedly resolve your data issues. The truth is, you need somewhere to visualise the issues.
A simple data profile with an even more straightforward dashboard showing outliers, gaps, inconsistencies and skewness will do the job. An outlier data point indicating an unusually large transaction can be easily identified in a scatter plot showing customer transaction amounts over time.
A line chart of daily website traffic that shows sudden drops or periods of zero activity could indicate missing data points or drastic changes in the data.
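As a minimal sketch of the scatter-plot idea (synthetic data and an illustrative three-standard-deviation threshold), unusually large transactions can be highlighted like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic daily transaction amounts with a few injected outliers
dates = np.arange("2024-01-01", "2024-04-01", dtype="datetime64[D]")
amounts = rng.normal(100, 20, size=len(dates))
amounts[[10, 45, 70]] = [900, 1200, 850]  # unusually large transactions

# Simple rule of thumb: flag points more than 3 standard deviations above the mean
threshold = amounts.mean() + 3 * amounts.std()
outliers = amounts > threshold

plt.scatter(dates[~outliers], amounts[~outliers], label="normal")
plt.scatter(dates[outliers], amounts[outliers], color="red", label="outlier")
plt.ylabel("Transaction amount")
plt.legend()
plt.show()
```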
As an example, if you’re having trouble with incomplete customer data in your table, consider using a visualisation dashboard to highlight NULL values in specific columns like the date of birth. A daily bar graph can track this attribute and detect any sudden spikes in the data, as shown below.
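A minimal sketch of how such a daily NULL-count chart could be produced, assuming the table has a load_date column identifying each daily batch (the file and column names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical customer table with a load_date column identifying each daily batch
customers = pd.read_csv("customer_table.csv", parse_dates=["load_date"])

# Count NULL date_of_birth values per load day
null_dob_per_day = (
    customers["date_of_birth"].isna()
    .groupby(customers["load_date"].dt.date)
    .sum()
)

null_dob_per_day.plot(kind="bar")
plt.ylabel("NULL date_of_birth count")
plt.title("Daily NULL counts for date_of_birth")
plt.tight_layout()
plt.show()
```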

5. Machine learning for detection and resolution
As the automation age advances, ML can be used to improve Data Quality. By training models on historical data, machine learning algorithms can learn patterns and anomalies, enabling automated identification and resolution of data issues.
Machine learning can also automate data cleansing processes by identifying and rectifying common data issues. For instance, models can impute missing values, correct formatting errors, or standardise inconsistent data.
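As a minimal sketch of automated imputation, assuming numeric customer attributes with gaps (the columns are made up for illustration), scikit-learn's SimpleImputer can fill missing values from the observed data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical customer attributes with missing values
data = pd.DataFrame({
    "annual_income": [52000, None, 61000, 58000, None],
    "credit_score": [710, 640, None, 700, 680],
})

# Impute missing numeric values with the column median learned from the data
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(imputed)
```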
For example, an anomaly detection model can be trained on a historical customer data table. The model learns the patterns and characteristics of normal birth dates from the distribution and statistical properties of the training data, and it establishes a normality threshold for the date_of_birth column, typically based on measures such as the mean, standard deviation, or range of the observed birth dates. In the detection phase, the model evaluates new customer records and compares their birth dates to this threshold; if a birth date falls outside the threshold or deviates significantly from the expected patterns, it is flagged as an anomaly.
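As a minimal sketch of that idea, using a simple statistical threshold on birth year rather than a full ML model (the data and the three-sigma cut-off are illustrative):

```python
import pandas as pd

# Hypothetical historical "training" data with plausible birth dates
history = pd.DataFrame({
    "date_of_birth": pd.to_datetime(
        ["1965-04-12", "1978-09-30", "1984-02-17", "1990-06-05", "1971-11-23"]
    )
})

# Learn a normality threshold from the historical distribution of birth years
years = history["date_of_birth"].dt.year
mean, std = years.mean(), years.std()
lower, upper = mean - 3 * std, mean + 3 * std

# New records to evaluate; 1800-01-01 is the sentinel default mentioned earlier
new_records = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "date_of_birth": pd.to_datetime(["1982-07-14", "1800-01-01", "2040-01-01"]),
})

new_years = new_records["date_of_birth"].dt.year
new_records["dob_anomaly"] = (new_years < lower) | (new_years > upper)
print(new_records)
```

The sentinel 1800 date and the implausible future date both fall outside the learned range and are flagged, while the ordinary birth date passes.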
Conclusion
Investing in fixing data issues will pay dividends for all your downstream analytics and AI use cases. Garbage in will result in garbage out. These five ways should help you start your journey of fixing data issues.
However, if you want to learn how to implement all the core aspects of Data Quality, check out my FREE Ultimate Data Quality Handbook. By claiming your copy, you’ll also become part of our educational community, receiving valuable insights and updates via our email list.
If you are not subscribed to Medium, consider subscribing using my referral link. It’s cheaper than Netflix and objectively a much better use of your time. If you use my link, I earn a small commission, and you get access to unlimited stories on Medium, win-win.