
Data Quality for Everyday Analysis

What data quality is, why it matters and how to do it right!

Photo by Randy Fath on Unsplash


In Data We Don’t Trust!

A while ago, a friend of mine presented a compelling analysis that convinced the managers of a mid-size company to make a series of decisions based on the recommendations of the newly established data science team. However, not long after, a million-dollar loss revealed that the insights were wrong. Further investigation showed that while the analysis was sound, the data it used was corrupt.

The realization that the data, and not the analytical approach, was the problem came as a relief for my friend. However, the report led to a negative perception of his capabilities and damaged his reputation, something that took him a while to rebuild.

The story of my friend can happen to anyone using data. Excited about state-of-the-art tools and techniques and the impact we want to make, or sometimes rushed by deadlines, we may be tempted to skip the fundamental step that should precede any analysis: assessing Data Quality. No matter how advanced the methodology, bad data guarantees wasted time and resources and wrong, misleading outcomes. Data analysis and predictive models are the true manifestation of garbage in, garbage out! In Thomas Redman’s words,

If Your Data Is Bad, Your Machine Learning Tools Are Useless.

Bad data happens for many reasons: bugs in the ETL processes, manual entry, data integrations, loss of expertise, changing business logic and legacy data can all lead to data quality issues.

No matter how rigorous the ETL process is, it still cannot catch all the data issues. One reason is that many of these issues, like inaccuracies, can be discovered only when the data is used. Therefore, very few data scientists, if any, are lucky enough to start their analysis with a clean and ready dataset. Datasets are usually full of errors and inconsistencies that need to be fixed. Consequently, while assessing data quality is not as sexy as training a machine learning model, it is the most crucial part of any analysis.

Therefore, any user of data needs to have a fundamental understanding of the variety of issues that can happen to data and to be equipped with tools and techniques to detect, resolve and communicate these issues. That’s why assessing that data is complete, unique, timely, consistent, valid and accurate should be a prerequisite to any analysis.

Here I will introduce the basic concepts behind Data Quality, discuss the cost of bad data, review six dimensions of Data Quality assessment and go through tools and techniques that can be used to deal with quality issues when they arise.


What is Data Quality?

Data quality is the assessment of how useful and reliable data is for serving its purpose. To make this clearer, consider the following scenarios:

  • When a dataset does not include the most recent records.
  • When revenue numbers read from two different sources do not match.
  • When a dataset has many names for the same country, like "US", "United States", "United States of America", … plus all the typos.
  • When a dataset has missing values.

In the first example above, the data is not timely (up to date). In the second, the data is not consistent, and in the third and fourth cases, the data is not valid and not complete, respectively. We will get back to these concepts in the context of data quality shortly.


The Cost of Bad Data

The consequences of using bad data go way beyond some incomplete rows or inconsistent records. Simply put,

It costs 10x as much to complete a unit of work when the data are flawed in any way as it does when they are perfect.

Ringlead visualizes the cost of bad data based on the number of data records. According to this estimate, a poor record costs approximately $100, and 100k bad records cost around $2 million annually, ignoring the new records added to the data. These numbers are significantly higher for organizations like Google that hold tremendous amounts of data.

Sources 1 and 3 in the chart are from Sirius Decisions - The Impact of Bad Data on Demand Creation and Small Business Association 2013 Census, respectively.

Another study by IBM estimated that the annual cost of bad data in the United States is around $3.1T (T for trillion)! Thomas Redman explains why:

The reason bad data costs so much is that decision makers, managers, knowledge workers, data scientists, and others must accommodate it in their everyday work. And doing so is both time-consuming and expensive.

Another eye-opening number is from Gartner: poor data quality is responsible for the failure of 40% of all business initiatives, and it also reduces labor productivity by as much as 20%.


Who Is Responsible for Data Quality?

It is the organization’s overall responsibility to provide a holistic view of the data ecosystem by defining the roles, processes and technologies that support data quality. Nevertheless, the numbers show that organizations are not doing a great job in this regard! According to Experian, 29% of organizational data is inaccurate, and at the same time 95% of businesses struggle to implement a data governance program. The same report states that 75% of respondents believed that data quality checks are the business users’ responsibility, not IT’s. So if you think you can trust the checks already in place and that assessing the quality of data is not your responsibility, think again!

The other aspect of the problem is that not all issues can be detected when data is created or stored. Two important moments in the life of data are when it is created and when it is consumed. While issues are expected to be fixed at the moment of creation, in practice data is often not assessed until it is used. That’s why analysts, as the consumers of data, play an important role in detecting and resolving flaws.

Therefore, I argue that data quality assessment is everyone’s job; data users are encouraged to use their technical expertise and domain knowledge to detect inconsistencies and make sure that the analyses are based on reliable sources that pass quality control measures.


Six Dimensions of Quality Data

So far we have reviewed the significance of data quality assessment and quantified the cost of bad data. We emphasized the critical role that analysts, or more generally data users, play in making sure data meets standards before it is consumed. In this section, we introduce the requirements that data should meet.

To assess the quality of a dataset, evaluate it against six dimensions:

  1. Quality data is complete: Completeness means that the main features/attributes required for the analysis have no missing values. Missing values can skew the analysis and lead to misleading trends.

  2. Quality data is unique: Duplicates have detrimental impacts. They take up storage space, extend the running time of the code and skew the analysis. Checking for uniqueness guarantees that there is one and only one record for each observation in the data.

  3. Quality data is timely: Addresses, phone numbers, emails and credit cards are constantly changing, and consequently a record that was valid last month may not be useful now. Timeliness ensures that data is up to date and that the most recent records reflect the most recent changes.

  4. Quality data is consistent: Consistency refers to the lack of contradictions within a dataset or between different datasets. A 5-feet-tall newborn or mismatching revenue between the sales and usage tables are both examples of inconsistency in the data.

  5. Quality data is valid: Validity means that data has the right type, format and range. Data has the right type and format if it conforms to a set of pre-defined standards and definitions: 08–12–2019 is invalid when the standard format is defined as YYYY/mm/dd. Validity also requires that data falls within an acceptable range; outliers, an important concept in Statistics and Data Science, are datapoints that do not meet this requirement.

  6. Quality data is accurate: Accuracy reflects the correctness of the data, such as the correct birth date or the accurate number of units sold. Note that accuracy is different from validity: while the latter focuses on type, format and range (the form), the former is about the content. As discussed in Data Quality: The Accuracy Dimension, accuracy is about correctness, while validity is about being among the possible accurate values: for an employee with a $50k salary, a recorded $45k salary is inaccurate but valid (within an acceptable range).

Before proceeding, we should note that meeting 100% of the data quality requirements with real-life data is almost impossible. One workaround is to define a threshold for each of the dimensions above, below which the data is assessed as poor quality. For example, a 90% completeness threshold allows the data to be used only if at most 10% of the values are missing. The threshold depends to some extent on the business context: the higher the threshold, the higher the quality of the data you have for your use case.
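As a minimal sketch of how such a threshold might be applied in SQL (the customers table, the industry column and the 90% cut-off are all hypothetical):

SELECT
 COUNT(industry) * 1.0 / COUNT(*) AS completeness_ratio,  -- COUNT(industry) ignores NULLs
 CASE
  WHEN COUNT(industry) * 1.0 / COUNT(*) >= 0.9 THEN 'pass'
  ELSE 'fail'
 END AS completeness_check
FROM
 customers;  -- hypothetical table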


Assessing & Improving Data Quality with Analytical Techniques

In this section, we go through some of the tools and techniques that analysts can use in order to check and improve the quality of their data. Examples in this section are in SQL.

Completeness

Incomplete data presents itself in the form of missing values. The metric for measuring it is the ratio of missing records for the features that are necessary (not optional) for the analysis. For example, if 30 out of 100 records in your analysis are missing the industry feature that you need for your marketing campaign, the completeness of your data is 70%.

First, check the ratio of missing values:

SELECT 
 1 - COUNT(col_A) * 1.0 / COUNT(*) AS missing_ratio  -- COUNT(col_A) ignores NULLs; * 1.0 avoids integer division
FROM
 table_name;

Alternatively, to count the number of empty (rather than NULL) cells:

SELECT 
 COUNT(col_A)
FROM
 table_name
WHERE 
 col_A = '';

If the ratio is beyond your acceptable threshold, you may want to use one of the options below:

  1. Use other data sources, if available: When I was working on an unsupervised segmentation model for outbound marketing, I needed to know whether a business was engaged in e-commerce and, in particular, which payment providers it used. The problem was that 90% of these records were missing in the database. We ended up testing datasets from multiple providers and merging one with our data. The whole process of fixing the incomplete records took about a month.

  2. Imputation: A well-known approach in Machine Learning, imputation infers missing values from the existing ones, e.g. by replacing them with the mean, median or mode of the data.

SELECT
 COALESCE(col_A, AVG(col_A) OVER ()) AS col_A_imputed  -- fill NULLs with the column mean (window form; some dialects may need a subquery)
FROM
 table_name;

You can use IFNULL() as well, in dialects that support it. In Python, you can use dropna(), fillna(), interpolate(), etc. For an overview of more advanced imputation methods, check out Data Imputation with examples.
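If the data is skewed and you prefer the median over the mean, a hedged sketch in a dialect that supports PERCENTILE_CONT (e.g. PostgreSQL), reusing the table_name/col_A placeholders, could look like this:

SELECT
 COALESCE(
  col_A,
  (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col_A)
   FROM table_name)
 ) AS col_A_imputed  -- fill NULLs with the column median
FROM
 table_name;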

  3. Manual fix: Handling missing data manually is risky, non-scalable and generally not recommended; use it only when there are a handful of missing values and you know the business logic well.

  4. Dropping missing values: Drop null values only when you have a large dataset and so few missing values that dropping them does not impact the analysis.

SELECT
 *
FROM 
 table_A
WHERE
 col_A IS NOT NULL;

Be aware that dropping records with some missing values, including important ones like outliers, can lose information and distort trends.

Uniqueness

Statistical packages are well-equipped to handle duplicate values. To check for duplicate values in SQL:

SELECT
 col_A, 
 col_B, 
 col_C, 
 COUNT(*) AS cnt
FROM
 table_A
GROUP BY 
 col_A, col_B, col_C
HAVING COUNT(*) > 1;  -- some dialects also accept the alias: HAVING cnt > 1

To remove duplicates:

SELECT
 col_A, 
 col_B, 
 col_C
FROM
 table_A
GROUP BY 
 col_A, col_B, col_C;

Other approaches for removing duplicates include DISTINCT, ROW_NUMBER() or even a self-join; a ROW_NUMBER() sketch is shown below. In Python, duplicated() tells you about the duplicate values and drop_duplicates() takes care of removing them.
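For example, a minimal ROW_NUMBER() sketch that keeps one arbitrary row per duplicate group (using the same placeholder table and columns as above) might look like this:

WITH ranked AS (
 SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY col_A, col_B, col_C ORDER BY col_A) AS rn
 FROM
  table_A
)
SELECT
 *
FROM
 ranked
WHERE
 rn = 1;  -- keep only the first row of each duplicate group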

Timeliness

Data analysts need to check the timeliness of data to make sure it is current. Using a table that is not being updated, or a record that has changed without the change being reflected in the data, leads to undesired outcomes. While there are no particular technical tools for this, checking the last time the data was updated, or the minimum and maximum dates in the table, should help.

SELECT
 MIN(date), MAX(date)
FROM
 table_A;

If your data is partitioned, you can replace the date above with partition_date.
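A simple freshness check along the same lines; the one-day staleness threshold is an assumption, and the date arithmetic syntax varies by dialect:

SELECT
 MAX(date) AS last_record_date,
 CASE
  WHEN MAX(date) >= CURRENT_DATE - 1 THEN 'fresh'  -- adjust the lag to your refresh schedule
  ELSE 'stale'
 END AS freshness_check
FROM
 table_A;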

Consistency

Inconsistency in data can happen within a dataset or between datasets. Programming languages make it easy to check datasets against one another and detect inconsistencies. Ultimately, you will have to use your business judgment, consult a domain expert or compare your data against a trustworthy source.
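As a minimal sketch of a within-dataset check, echoing the 5-feet-tall-newborn example above (the patients table and its columns are hypothetical):

-- Hypothetical table and columns: flag internally contradictory records,
-- e.g. patients younger than one year but taller than 100 cm.
SELECT
 patient_id,
 age_years,
 height_cm
FROM
 patients
WHERE
 age_years < 1
 AND height_cm > 100;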

As a personal experience, I studied declined transactions in the Payment Industry where I had to resolve the inconsistencies between datasets from two different organizations. In the absence of a unique source of truth, I ended up researching the organizations that were maintaining these datasets and picked the one that had well-known customers and a better reputation.

Validity

1. Outliers: Several important statistics such as the mean and standard deviation are sensitive to extreme values, a.k.a. outliers. Therefore, the data analyst needs to pay special attention to identifying and treating them: are they legitimate extreme values, or were they recorded as a result of flaws in the measurement tools?

There are myriad techniques for detecting and treating outliers, from business knowledge and visualization techniques such as scatter plots and box plots to analytical techniques such as density-based clustering and Isolation Forests. Check out this article for a review.

The boxplot is a simple approach to identifying outliers. Boxplots visualize a distribution by plotting five important numbers: the minimum, maximum, 25th percentile, median and 75th percentile. To find outliers, look for datapoints that lie more than 1.5 IQR below the 25th percentile or above the 75th percentile, where IQR (the interquartile range) is the distance between the 75th and 25th percentiles.

(Photo credit: Understanding Boxplots)
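A hedged sketch of the 1.5 IQR rule in SQL, assuming a dialect that supports PERCENTILE_CONT (e.g. PostgreSQL) and reusing the table_A/col_A placeholders from earlier:

WITH quartiles AS (
 SELECT
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY col_A) AS q1,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY col_A) AS q3
 FROM
  table_A
)
SELECT
 t.*
FROM
 table_A AS t
CROSS JOIN quartiles AS q
WHERE
 t.col_A < q.q1 - 1.5 * (q.q3 - q.q1)     -- below the lower fence
 OR t.col_A > q.q3 + 1.5 * (q.q3 - q.q1);  -- above the upper fence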

Generally speaking, removing outliers is not recommended, as not all outliers are the same. To distinguish legitimate from illegitimate datapoints, the former are sometimes labelled anomalies; most of the time, however, anomalies and outliers are used interchangeably.

  2. Type and Format

For each attribute/feature in your data, check the format and data type (date, integer, string) to make sure it complies with the standards. To check the data type in SQL:

SELECT COLUMN_NAME, DATA_TYPE 
FROM INFORMATION_SCHEMA.COLUMNS
WHERE 
 TABLE_NAME = 'table_A';

In Python, you can simply use

df.dtypes

where df is your Pandas dataframe.

Skimming through the output is a quick way to check data values and formats. In SQL, you can use something like

SELECT 
 col_A, 
 COUNT(*)
FROM 
 table_A
GROUP BY 
 col_A
ORDER BY 
 col_A ASC
LIMIT 100;

where ORDER BY and LIMIT allow you to skim the extreme values, which is often where errors show up. In Python, value_counts() can give you some idea about your inputs.
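For a quick format check, you can also count the values that do not match the expected pattern. Here is a minimal sketch for a YYYY-MM-DD date stored as text; the date_string column is illustrative, and a regular-expression check would be stricter where your dialect supports one:

SELECT
 COUNT(*) AS badly_formatted_dates
FROM
 table_A
WHERE
 date_string NOT LIKE '____-__-__';  -- each underscore matches any single character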

Accuracy

Accuracy in the data can be assessed using domain knowledge as well as by running your numbers against another trustworthy source. Conduct exploratory analysis and use visualizations and statistical approaches to identify trends and numbers that do not make sense, and discuss them with a domain expert.
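As one hedged example of running your numbers against a trusted source, you might compare monthly revenue in your analytical table against a finance-owned reference; both table names and the 1% tolerance are hypothetical:

SELECT
 a.month,
 a.revenue AS analytics_revenue,
 f.revenue AS finance_revenue
FROM
 analytics_monthly_revenue AS a
JOIN finance_monthly_revenue AS f
 ON a.month = f.month
WHERE
 ABS(a.revenue - f.revenue) > 0.01 * f.revenue;  -- months where the sources disagree by more than 1%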


Final Words

In academia, before submitting an article to a journal for the peer-review process, we used to wait a week or two and then review the manuscript once again. This lag normally led to substantial improvements in the quality of the work: we fixed errors we had not noticed before, improved the visualizations, made some paragraphs clearer and revisited the data and calculations. In the long run, it helped us build a reputation for quality work and, in the meantime, reduced the time the review process took, which in turn led to quicker publication of the paper.

We are the best judges of our own work, and better than anyone else we can identify the flaws and strengths of what we have created. The assessment of Data Quality is probably the most important step in delivering high-quality, impactful work that can help improve the business overall. I hope this short article helps raise awareness of the tools and techniques at our disposal and makes your analytical work even better!

