Is Machine Learning the future of Data Quality?

Some Machine Learning techniques for data quality

Jyoti Dhiman
Towards Data Science


Photo by Sigmund on Unsplash

In the data world we often hear the phrase “garbage in, garbage out”, which means that if your data is “bad”, you can never make “good” decisions (bet you didn’t see this one coming :P).

The journey from “bad” to “good” is what Data Quality is about. Now, bad data can mean a lot of things, such as:

  • Data is not up to date (Timeliness)
  • Data is not accurate (Accuracy)
  • Data has different values for different users, or there is no single source of truth (Consistency)
  • Data is not accessible (Usability)
  • Data is not available (Availability)

This paper nicely defines the various dimensions of data quality; please read it to find out more.

Data quality is important and pivotal across all domains of work, but as data engineers it becomes our primary responsibility to ensure that the data we deliver is “good” data.

My experience:

To ensure data quality, I have implemented rule-based solutions to take care of:

  • Bad schema
  • Duplicate data
  • Late data
  • Anomalous data

These solutions mainly revolved around having a clear understanding of what kind of data I would be feeding into the system and, in turn, generalizing the same checks for the whole data pipeline framework.
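As a minimal sketch, here is roughly what such hand-written rules can look like in pandas (the schema, column names, and thresholds below are illustrative assumptions, not the exact rules from my pipelines):

```python
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "event_time", "amount"}  # illustrative schema
MAX_DELAY = pd.Timedelta(hours=6)                       # illustrative lateness threshold

def run_quality_checks(df: pd.DataFrame, now: pd.Timestamp) -> dict:
    """Run a few hand-written data quality rules and report violations."""
    issues = {}
    # Bad schema: columns missing from (or unexpected in) the incoming batch.
    issues["schema_diff"] = EXPECTED_COLUMNS.symmetric_difference(df.columns)
    # Duplicate data: exact duplicate rows.
    issues["duplicates"] = int(df.duplicated().sum())
    # Late data: events that arrive long after they occurred.
    issues["late_rows"] = int((now - df["event_time"] > MAX_DELAY).sum())
    # Anomalous data: values outside a hand-picked sane range.
    issues["bad_amounts"] = int((~df["amount"].between(0, 10_000)).sum())
    return issues
```

Notice that every check encodes a human decision: the expected columns, the six-hour lateness cutoff, the sane range for amounts. That is exactly where the maintenance burden comes from.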

Though an automated system helps us move from a reactive approach to a proactive one, the problems with a rule-based system are:

  • It can need too many rules for high-cardinality, multi-dimensional data.
  • For every new error and every new anomaly, the Data Quality Framework needs some custom implementation, i.e. human intervention is inevitable in such a solution.

To overcome human intervention in the rule-based scenario, we need to look for a fully automated system. With many recent developments, ML is one of the domains that might help in achieving that.

Let’s see how machines can help us ensure automated data quality and look beyond the obvious.

Before discussing how, let’s discuss why.

Why machine learning for Data Quality?

  • ML models can learn from tremendous amounts of data and find hidden patterns in it.
  • They can take care of repetitive tasks.
  • There are no rules to maintain.
  • They can evolve as the data evolves.

But I would also like to point out that though the above list looks like an election banner for ML as a candidate, whether to use it varies from use case to use case; also, ML generally doesn’t work well with small datasets or datasets that don’t exhibit any pattern.

Having said that, let’s look at some of the ML applications wrt Data Quality:

  • Identify wrong data
  • Identify incomplete data
  • Identify sensitive data for compliance (say, PII identification)
  • Deduplicate data by using fuzzy matching techniques, since sometimes just doing a unique on the data doesn’t work (see the sketch after this list)
  • Fill in missing data by assessing historical patterns
  • Alert on a potential SLA breach by using historical information (say the system detects a drastic increase in the volume of data that might impact the SLA)
  • Help develop new business rules efficiently (defining apt thresholds)
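As a minimal sketch of the fuzzy-matching idea, here is a naive pairwise pass using Python’s standard-library difflib (the names, normalization, and 0.85 threshold are illustrative assumptions):

```python
from difflib import SequenceMatcher

names = ["John Smith", "john  smith", "Jon Smith", "Mary Jones"]

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: ratio of matching characters between normalized strings."""
    a, b = " ".join(a.lower().split()), " ".join(b.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Naive pairwise pass: flag likely duplicates that a plain unique would miss.
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if similar(a, b):
            print(f"possible duplicate: {a!r} ~ {b!r}")
```

In practice you would block records first (e.g. by zip code) so you don’t compare every pair, and a dedicated fuzzy-matching library scales much better; SequenceMatcher is just the dependency-free way to show the idea.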

Some ML techniques that can be used for Data Quality

Dimensionality Reduction

Dimensionality reduction is an ML method used to identify patterns in data and to deal with computationally intensive problems. It covers a set of algorithms that aim to reduce the number of input variables in a dataset by finding out how important each column is.

It can be used to identify and remove columns that bring little to no information among a plethora of columns, thereby curing the Curse of Dimensionality.

It can often be used as a first step before feeding the data into any other data quality algorithm. Dimensionality reduction is also helpful when working with visual and audio data involving speech, video, or image compression.

Example: UMAP, PCA
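As a minimal sketch, here is PCA from scikit-learn compressing a redundant feature set (the synthetic data, the column counts, and the 95% variance threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 1,000 rows with 50 numeric columns, 40 of them redundant.
rng = np.random.default_rng(42)
base = rng.normal(size=(1000, 10))
X = np.hstack([base, base @ rng.normal(size=(10, 40))])  # 40 derived columns

# Standardize, then keep just enough components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"{X.shape[1]} columns reduced to {pca.n_components_} components")
```

Columns that contribute almost nothing to the retained components are the ones bringing “little to no information” and are candidates for dropping.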

Clustering

Clustering organizes data into groups (clusters) based on similarity and dissimilarity. For data quality, records that don’t fall into any dense group are natural candidates for “bad” data.

Example: DBSCAN
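As a minimal sketch, DBSCAN from scikit-learn labels points that belong to no dense cluster as noise (-1), which we can surface as suspect rows (the synthetic data and the eps/min_samples values are illustrative and would need tuning per dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Illustrative data: two dense blobs of "good" records plus a few strays.
good = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(5, 0.3, (200, 2))])
bad = rng.uniform(-3, 8, (5, 2))
X = StandardScaler().fit_transform(np.vstack([good, bad]))

# DBSCAN assigns the label -1 to points that belong to no dense cluster.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
print("suspect rows:", np.flatnonzero(labels == -1))
```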

Anomaly Detection

Anomaly detection is not independent in itself; it often goes along with dimensionality reduction and clustering algorithms. By using a dimensionality reduction algorithm as a pre-stage for anomaly detection, we first transform the high-dimensional space into a lower-dimensional one. Then we can figure out the density of the major data points in this lower-dimensional space, which may be identified as “normal.” Data points located far away from the “normal” space are outliers, or “anomalies.”

Example: ARIMA
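As a minimal sketch of the time-series flavor, here is ARIMA from statsmodels forecasting daily row counts and flagging a drastic volume jump, the kind of signal the SLA alerting above relies on (the synthetic series, the (1, 0, 0) order, and the 3-sigma threshold are illustrative assumptions):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative series: 60 days of row counts, with a drastic spike on the last day.
rng = np.random.default_rng(1)
counts = 10_000 + rng.normal(0, 200, 60)
counts[-1] = 18_000  # the volume jump we want to catch

# Fit on history, forecast the latest point, and flag large deviations.
model = ARIMA(counts[:-1], order=(1, 0, 0)).fit()
forecast = model.forecast(steps=1)[0]
sigma = np.std(model.resid)

if abs(counts[-1] - forecast) > 3 * sigma:
    print(f"SLA alert: expected ~{forecast:.0f} rows, saw {counts[-1]:.0f}")
```

The point is that the “rule” (what counts as a drastic increase) is learned from history rather than hand-coded.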

Association Rule Mining

Association rule mining is an unsupervised ML technique used to identify items in large datasets that frequently occur together, exposing hidden relationships in the data.

This algorithm is commonly used to identify patterns and associations in transactional, relational, or any similar database. For example, it is possible to build an ML algorithm that analyzes the market basket by processing data from barcode scanners and defines which goods are purchased together.
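As a minimal sketch, here is the classic Apriori algorithm run over a tiny one-hot basket table, using the mlxtend library (an assumption on my part; any Apriori implementation works):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative one-hot basket data: each row is one transaction.
baskets = pd.DataFrame(
    [
        {"bread": 1, "butter": 1, "milk": 1},
        {"bread": 1, "butter": 1, "milk": 0},
        {"bread": 0, "butter": 0, "milk": 1},
        {"bread": 1, "butter": 1, "milk": 1},
    ],
    dtype=bool,
)

# Find itemsets bought together often, then turn them into "if X then Y" rules.
itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

For data quality, such learned rules can double as validation checks: a record that violates a near-certain association (bread almost always appears with butter) is worth a second look.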

Well, that’s about it! Having discussed the various approaches to data quality with machine learning, I want to emphasize that there is no one-size-fits-all solution to all your DQ needs. It varies from use case to use case: in some cases a rule-based system might work perfectly, but as the data grows and changes, eventually moving towards a machine learning approach to data quality might help us look beyond the obvious. Welcome to the future!

I will also do a follow-up article on how the tech giants are using machine learning for data quality.

Until next time,
JD

References:

https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation
