What Is Data Quality?

Discover Methodologies to Ensure Accuracy, Consistency and Completeness of Supply Chain Data

Samir Saci
Towards Data Science


What is Data Quality? — (Image by Author)

Data Quality defines how data sets can be trusted, understood and utilized effectively for their intended purpose.

In Supply Chain Management, data is crucial in detecting issues and making informed decisions.

Supply Chain Systems Creating and Exchanging Information — (Image by Author)

Ensuring that data is accurate, consistent and fit for its intended purpose is a critical task to ensure smooth and efficient operational management.

What are the processes in place to ensure good data quality in your organization?

In this article, we will delve into the concept of data quality by exploring its dimensions and understanding its importance in the context of supply chain management.

💌 New articles straight in your inbox for free: Newsletter
📘 Your complete guide for Supply Chain Analytics: Analytics Cheat Sheet

Summary
I. The Pillars of Data Management
1. Why is it key?
2. Quality vs. Integrity vs. Profiling
II. What are the 6 Dimensions of Data Quality?
1. Completeness: do we have all the data?
2. Uniqueness: are features all unique?
3. Validity: is the format respecting business requirements?
4. Timeliness: is data up-to-date?
5. Accuracy: does it reflect reality correctly?
6. Consistency: is data consistent across systems?
III. Next Steps
1. Data Quality for Environmental Reporting: ESG & Greenwashing
2. Generative AI: Automate Data Quality Checks with a GPT Agent
3. Conclusion

I. What are the Pillars of Data Management?

Why is Data Management key?

High-quality data can be the difference between the success and failure of your supply chain operations.

From planning and forecasting to procurement and logistics, every facet of supply chain management relies on data to function effectively.

Supply Chain Systems Creating and Exchanging Information — (Image by Author)
  • Planning and forecasting algorithms rely on WMS and ERP data for historical sales, inventory levels and store orders.
  • Transportation Management Systems rely on WMS data to track shipments from warehouses to stores properly.
Four Types of Supply Chain Analytics — (Image by Author)

Supply Chain Analytics solutions can be divided into four types that provide different levels of insights and visibility.

Descriptive solutions usually represent the first step in your digital transformation: collecting, processing and visualizing data.

It all starts with data collection and processing to analyze your past performance.

Let us take the example of the store distribution process of a fashion retail company.

Analysis of Time Stamps for the Distribution Process of a Retail Company — (Image by Author)

A simple indicator to pilot your operations is the percentage of orders delivered on time and in full: OTIF (On Time In Full).

Simple Example of Data Processing for OTIF Calculation— (Image by Author)

This indicator can be built by merging transactional tables from your ERP, WMS and transportation systems.

If you don’t ensure that the data is correct, it is impossible to trust the OTIF value you report.

An indicator you cannot measure confidently cannot be analyzed and improved.
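
As a minimal sketch, OTIF can be computed with Pandas by joining order lines with their delivery records; the file and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical extracts: order lines (ERP) and delivery confirmations (WMS/TMS)
orders = pd.read_csv("order_lines.csv", parse_dates=["requested_date"])
deliveries = pd.read_csv("deliveries.csv", parse_dates=["delivery_date"])

# Join each order line with its delivery record
otif = orders.merge(deliveries, on="order_line_id", how="left")

# On time: delivered on or before the requested date
otif["on_time"] = otif["delivery_date"] <= otif["requested_date"]

# In full: delivered quantity covers the ordered quantity
otif["in_full"] = otif["delivered_qty"] >= otif["ordered_qty"]

# OTIF: share of order lines that are both on time and in full
otif_rate = (otif["on_time"] & otif["in_full"]).mean() * 100
print(f"OTIF: {otif_rate:.1f}%")
```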

What are the differences between Quality, Integrity and Profiling?

Before we delve into data quality, it is essential to understand how data quality differs from related concepts like data integrity or data profiling.

Data Integrity as a Subset of Data Quality — (Image by Author)

While all three are interconnected, they have distinct areas of focus:

  • Data integrity is concerned with maintaining the accuracy and consistency of data over its entire life cycle.
  • Data profiling involves examining data to assess its structure and quality.

Data profiling should not be confused with data mining.

Data Mining vs. Data Profiling — (Image by Author)
  • Data profiling focuses on the structure and quality of the data, looking at outliers, distributions or missing values
  • Data mining focuses on extracting business and operational insights from the data to support decision-making or continuous improvement initiatives

Now that we have clarified these differences, we can focus on defining data quality.

💡 Follow me on Medium for more articles related to 🏭 Supply Chain Analytics, 🌳 Sustainability and 🕜 Productivity.

II. What are the 6 Dimensions of Data Quality?

Data quality is evaluated based on several dimensions that play a crucial role in maintaining the reliability and usability of data.

6 Dimensions of Data Quality — (Image by Author)

Data Completeness: do we have all the data?

The objective is to ensure the presence of all necessary data.

Missing data can lead to misleading analyses and poor decision-making.

Example: missing records in the master data
In a company, Master Data Management (MDM) is a crucial aspect of ensuring consistent, accurate, and complete data across different departments.

Example Master Data Management — (Image by Author)

Master data specialists enter product-related information into the ERP during the item creation process:

  • Product information: net weight, dimensions, etc.
  • Packaging: total weight, dimensions, language, etc.
  • Handling units: number of items per carton or pallet, pallet height
  • Merchandising: supplier name, cost of purchase, pricing per market

These data specialists can make mistakes, and missing data can be found in the master data.

What kind of issues can we face with missing data?

For example, a missing net weight or carton dimension prevents the correct calculation of transportation costs or pallet-building rules.

And many other issues can appear along the value chain, from raw materials sourcing to store delivery.

💡 How can we check it?

  • Null Value Analysis: identify and count the number of null or missing values in your dataset
  • Domain-Specific Checks: confirm that every expected category of data is present.
    For instance, if a column is supposed to contain five distinct categories and only four appear, this indicates a lack of completeness.
  • Record Counting: compare the number of records in a dataset with the expected number of records
  • External Source Comparison: use external data sources that are known to be complete as a benchmark
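
A minimal sketch of these checks with Pandas, assuming a hypothetical master data extract with illustrative column names and expected values:

```python
import pandas as pd

# Hypothetical master data extract; file and column names are illustrative
df = pd.read_csv("master_data.csv")

# Null Value Analysis: count missing values per column
print(df.isnull().sum())

# Domain-Specific Check: confirm every expected category is present
expected_markets = {"FR", "DE", "IT", "ES", "UK"}
missing_markets = expected_markets - set(df["market"].dropna().unique())
print(f"Missing markets: {missing_markets}")

# Record Counting: compare against the expected number of records
expected_records = 15000
print(f"Records: {len(df)} / {expected_records} expected")
```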

Data Uniqueness: are features all unique?

The objective is to ensure that each data entry is distinct and not duplicated.

Ultimately, we want to represent the data landscape accurately.

Example: transport shipments used for CO2 reporting
The demand for transparency in sustainable development from investors and customers has grown.

Therefore, companies are investing in analytics solutions to assess their environmental footprint.

The first step is to measure the CO2 emissions of their transportation network.

Data processing for CO2 reporting — (Image by Author)

Shipment and master data records are extracted from the ERP and the WMS.

They cover the scope of orders shipped from your warehouses to stores or final customers.

What kind of issues can we face with duplicated data?
If we have duplicated shipment records, we may overestimate the CO2 emissions of transportation, as we may count the same emissions several times.

💡 How can we check it?

  • Duplicate Record Identification: use Python’s Pandas or SQL functionalities to identify duplicate records
  • Key Constraint Analysis: verifying that the primary keys of your database are unique
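
A minimal sketch with Pandas, assuming a hypothetical shipment extract with a shipment_id key:

```python
import pandas as pd

# Hypothetical shipment extract from the ERP/WMS; names are illustrative
shipments = pd.read_csv("shipments.csv")

# Duplicate Record Identification: flag fully duplicated rows
duplicates = shipments[shipments.duplicated(keep=False)]
print(f"{len(duplicates)} duplicated shipment records found")

# Key Constraint Analysis: shipment_id should be a unique key
if not shipments["shipment_id"].is_unique:
    print("Duplicate shipment IDs detected")

# Keep only the first occurrence before computing CO2 emissions
shipments_clean = shipments.drop_duplicates(subset="shipment_id", keep="first")
```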

Validity: does the format respect business requirements?

The objective is to verify that data conforms to the required formats and business rules.

Example: Life Cycle Assessment
Life Cycle Assessment (LCA), a method for evaluating the environmental impacts of a product or service over its entire life cycle, relies heavily on data quality.

In the example below, we collect data from different sources to estimate the usage of utilities and natural resources to produce and deliver T-shirts.

Data Requirements for Life Cycle Assessment for Fashion Retail — (Image by Author)
  • The Production Management System provides the number of T-shirts produced per period
  • Waste inventory, utilities and emissions come from flat Excel files
  • Distance, routing and CO2 emissions come from carriers' APIs

The final result, the overall environmental footprint of a t-shirt, depends on the reliability of each data source.

What kind of issues can we face with invalid data?

  • The total evaluation will be wrong if the fuel consumption is in (L/Shipment) for some records and (Gallons/Shipment) for others.
  • If you don’t ensure that all utility consumptions are reported per month, you cannot evaluate the consumption per unit produced.

💡 How can we check it?

  1. Data Type Checks: verify that each field in your data is of the expected data type
  2. Range Checks: compare values with an expected range
  3. Pattern Matching: for data like emails or phone numbers, you can use regular expressions to match the expected pattern
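
A minimal sketch of these three checks with Pandas, assuming a hypothetical utility consumption file:

```python
import pandas as pd

# Hypothetical utility consumption records collected for the LCA
records = pd.read_csv("utilities.csv")

# 1. Data Type Check: fuel consumption must be numeric
if not pd.api.types.is_numeric_dtype(records["fuel_consumption"]):
    print("Non-numeric fuel consumption values detected")

# 2. Range Check: flag values outside a plausible range
out_of_range = records[~records["fuel_consumption"].between(0, 500)]
print(f"{len(out_of_range)} records outside the expected range")

# 3. Pattern Matching: every record must use the same unit (L/Shipment)
invalid_units = records[~records["unit"].str.fullmatch(r"L/Shipment")]
print(f"{len(invalid_units)} records with an unexpected unit")
```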

Timeliness: is data up-to-date?

The objective is to ensure that data is available within the expected timeframe.

Example: Process Mining
Process mining is an analytics discipline focused on discovering, monitoring and improving operational and business processes.

Time stamps collection — (Image by Author)

In the example above, we collect status with timestamps (from different systems) at each step of the order-to-delivery process.

What issues can we face if we don’t get the data on time?

  • The status may not be updated correctly.
    This can create “holes” in the tracking of your shipments.
    For example, my shipment was delivered at 12:05 am, but the status is still “Packing in progress.”
  • Incidents may be reported too late to take corrective action.

💡 How can we check it?

  • Timestamp Analysis: check that all timestamps fall within the expected time range
  • Real-time Data Monitoring: monitor the data flows and trigger alerts when an interruption occurs
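
A minimal sketch with Pandas, assuming hypothetical event records and an illustrative reporting window:

```python
import pandas as pd

# Hypothetical order-to-delivery events collected from several systems
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Timestamp Analysis: all events should fall within the reporting window
window_start, window_end = pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-31")
outside = events[~events["timestamp"].between(window_start, window_end)]
print(f"{len(outside)} events outside the expected window")

# Freshness check: alert if the latest event is older than 24 hours
lag = pd.Timestamp.now() - events["timestamp"].max()
if lag > pd.Timedelta(hours=24):
    print(f"ALERT: no new events received for {lag}")
```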

Accuracy: does it reflect reality correctly?

The objective is to ensure the correctness of data values.

This is mandatory to maintain trust in data-driven decisions.

Example: Machine Learning for Retail Sales Forecasting
These algorithms use historical sales data records to predict future sales by store and item code for the next X days.

Machine Learning for Retail Sales Forecasting — (Image by Author)

For this kind of business case, data accuracy is way more important than the level of sophistication of your forecasting model (tree-based, deep learning, statistical forecast, …).

What kind of issues can we face with inaccurate data?

  • Incorrect historical sales data due to a data entry error or system glitch could impact your model's performance.
  • This might lead to overstock or stockouts with financial and commercial implications.

💡 How can we check it?

  • Source Validation: cross-verify your data with other authoritative sources to ensure that the information is accurate
  • Data Auditing: periodic auditing of the data can help detect inaccuracies by manually checking a sample of data records for errors
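
A minimal sketch with Pandas, assuming a hypothetical point-of-sale extract serves as the trusted reference:

```python
import pandas as pd

# Hypothetical sales history vs. an authoritative point-of-sale extract
sales = pd.read_csv("sales_history.csv")
pos = pd.read_csv("pos_extract.csv")  # assumed trusted reference source

# Source Validation: compare quantities per store and item against the reference
merged = sales.merge(pos, on=["store", "item"], suffixes=("_hist", "_pos"))
gaps = merged[merged["quantity_hist"] != merged["quantity_pos"]]
print(f"{len(gaps)} records diverge from the reference source")

# Data Auditing: draw a random sample for manual review
print(sales.sample(n=20, random_state=42))
```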

Consistency: is data consistent across systems?

The objective is to evaluate records from different datasets to ensure consistent trends and behaviours.

💡 How to enforce it?

  1. Data Standardization: enforce strict data entry and format guidelines to ensure consistency.
  2. Automated Data Cleansing: implement automated tools or scripts to clean and standardize data.
  3. Error Reporting: establish a robust error reporting and resolution process.
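
A minimal sketch of points 2 and 3 with Pandas, using an illustrative country field coming from two systems:

```python
import pandas as pd

# Hypothetical country field coming from two systems with different conventions
orders = pd.DataFrame({"country": ["FR", "France", "fr", "DE", "Germany", "U.K."]})

# Automated Data Cleansing: map free-text variants to a single standard code
standard = {"fr": "FR", "france": "FR", "de": "DE", "germany": "DE"}
orders["country_std"] = orders["country"].str.lower().map(standard)

# Error Reporting: surface the values that could not be standardized
unmapped = orders[orders["country_std"].isna()]
print(f"{len(unmapped)} records need manual review")
```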

These examples should give you enough insight to start implementing data quality checks in your organization.

III. Next Steps

Data Quality for Sustainability Reporting: ESG and Greenwashing

Data quality will impact all of your company's analytics and data products.

Among them are strategic reports impacting the financial and legal aspects of your organisation.

Example of Strategic Report: ESG — (Image by Author)

The Environmental, Social and Governance (ESG) approach is a methodology companies use to report their environmental footprint, societal impacts and governance structures.

Data Processing Capabilities to Generate ESG Report — (Image by Author)

It relies on collecting, processing and harmonising datasets from multiple sources.

This kind of report usually involves a complete audit of the data and assumptions before publication.

Five sins of greenwashing — (Image by Author)

To fight greenwashing, auditors may review:

  • Data sources covering end-to-end supply chain
  • Data processing and harmonization
  • Final calculation of environmental footprint and governance KPIs

Greenwashing is the practice of making misleading claims about the environmental benefits of a product.

Therefore, your sustainability department relies on high data quality to avoid miscalculations that may lead to compliance issues.

💡 For more details about greenwashing and ESG reporting, have a look at my dedicated article.

Generative AI: Automate Data Quality Checks with a GPT Agent

OpenAI released the first version of ChatGPT at the end of 2022.

Since then, Generative AI has become an opportunity to improve the user experience of data and analytics products using Large Language Models (LLMs).

Supply Chain Control Tower Agent with LangChain SQL Agent [Article Link] — (Image by Author)

As a first attempt to explore this technology, I shared my experimental journey in this article.

Prototype of smart agent boosted by GPT — (Image by Author)

The objective was to create a smart agent that

  1. Collects user requests formulated in natural language (English)
  2. Automatically queries a database to extract insights
  3. Formulates a proper answer with a professional tone

The initial results are impressive.

The agent can answer basic and advanced operational questions using transactional data stored in a database.

Can we create a “Data Quality” smart agent?

This approach can be adapted to our Data Quality problem, as sketched below:

  • Connect a smart agent to our different data sources
  • Equip the agent with advanced Python scripts to perform specific analyses
  • Teach the agent the basics of data quality
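
As a rough sketch of the idea (not the exact implementation from the article above), a LangChain SQL agent can be pointed at your database and asked data quality questions in plain English; the database URI, model name and question are hypothetical, and the exact imports depend on your LangChain version:

```python
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase

# Hypothetical database holding the transactional tables to audit
db = SQLDatabase.from_uri("sqlite:///supply_chain.db")
llm = ChatOpenAI(model="gpt-4", temperature=0)

# SQL agent that translates natural-language questions into queries
agent = create_sql_agent(llm=llm, toolkit=SQLDatabaseToolkit(db=db, llm=llm), verbose=True)

# Example data quality question formulated in plain English
agent.run("How many shipment records share the same shipment_id?")
```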

💡 For more details on how to create a smart agent with LangChain, have a look at the article linked above.

Conclusion

If your company invests in a Supply Chain digital transformation, data quality is no longer a luxury but a necessity.

It should be included in your strategic roadmap to ensure the right level of quality to make informed decisions, streamline operations and achieve business goals.

Only with a trusted data source can you confidently invest capital and energy in advanced analytics solutions.

About Me

Let’s connect on LinkedIn and Twitter; I am a Supply Chain Engineer using data analytics to improve logistics operations and reduce costs.

If you are interested in Data Analytics and Supply Chain, have a look at my website.

💡 Follow me on Medium for more articles related to 🏭 Supply Chain Analytics, 🌳 Sustainability and 🕜 Productivity.
