Is Machine Learning the future of Data Quality?

Some Machine Learning techniques for data quality

Jyoti Dhiman
Towards Data Science


Photo by Sigmund on Unsplash

In the data world we often hear the phrase “garbage in, garbage out”, which means that if your data is “bad”, you can never make “good” decisions (bet you didn’t see this one coming :P).

The journey from “bad” to “good” is what Data Quality is about. Now, bad data can mean a lot of things, such as:

  • Data is not up to date (Timeliness)
  • Data is not accurate (Accuracy)
  • Data has different values for different users, or there is no single source of truth (Consistency)
  • Data is not accessible (Usability)
  • Data is not available (Availability)

This paper nicely defines the various dimensions of data quality; please read it to find out more.

Data quality is important and pivotal across all domains of work, but as data engineers it becomes our primary responsibility to ensure that the data we deliver is “good” data.

My experience:

To ensure data quality, I have implemented rule-based solutions to take care of:

  • Bad schema
  • Duplicate data
  • Late data
  • Anomalous data

These solutions mainly revolved around having a clear understanding of what kind of data I would be feeding into the system and, in turn, generalizing the same checks for the whole data pipeline framework.
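As a minimal sketch, here is roughly what such hand-written rules can look like in pandas (the schema, column names, and thresholds below are illustrative assumptions, not the exact rules from my pipelines):

```python
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "event_time", "amount"}  # illustrative schema
MAX_DELAY = pd.Timedelta(hours=6)                       # illustrative lateness threshold

def run_quality_checks(df: pd.DataFrame, now: pd.Timestamp) -> dict:
    """Run a few hand-written data quality rules and report violations."""
    issues = {}
    # Bad schema: columns missing from (or unexpected in) the incoming batch.
    issues["schema_diff"] = EXPECTED_COLUMNS.symmetric_difference(df.columns)
    # Duplicate data: exact duplicate rows.
    issues["duplicates"] = int(df.duplicated().sum())
    # Late data: events that arrive long after they occurred.
    issues["late_rows"] = int((now - df["event_time"] > MAX_DELAY).sum())
    # Anomalous data: values outside a hand-picked sane range.
    issues["bad_amounts"] = int((~df["amount"].between(0, 10_000)).sum())
    return issues
```

Notice that every check encodes a human decision: the expected columns, the six-hour lateness cutoff, the sane range for amounts. That is exactly where the maintenance burden comes from.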

Though an automated system helps us move from a reactive approach to a proactive one, the problems with a rule-based system are:

  • It can need too many rules for high-cardinality, multi-dimensional data.
  • For every new error and every new anomaly, the Data Quality Framework needs some custom implementation, i.e. human intervention is inevitable in such a solution.

To overcome human intervention in the rule-based scenario, we need to look for a fully automated system. With many recent developments, ML is one of the domains that might help in achieving that.

Let’s see how machines can help us ensure automated data quality and look beyond the obvious.

Before discussing how, let’s discuss why.

Why machine learning for Data Quality?

  • ML models can learn from tremendous amounts of data and find hidden patterns in it.
  • They can take care of repetitive tasks.
  • There are no rules to maintain.
  • They can evolve as the data evolves.

But I would also like to point out that though the above list looks like an election banner for ML as a candidate, whether to use it varies from use case to use case; also, ML generally doesn’t work well with small datasets or datasets that don’t exhibit any pattern.

Having said that, let’s look at some of the ML applications wrt Data Quality:

  • Identify wrong data
  • Identify incomplete data
  • Identify sensitive data for compliance (say, PII identification)
  • Deduplicate data by using fuzzy matching techniques, since sometimes just doing a unique on the data doesn’t work (see the sketch after this list)
  • Fill in missing data by assessing historical patterns
  • Alert on a potential SLA breach by using historical information (say the system detects a drastic increase in the volume of data that might impact the SLA)
  • Help develop new business rules efficiently (defining apt thresholds)
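As a minimal sketch of the fuzzy-matching idea, here is a naive pairwise pass using Python’s standard-library difflib (the names, normalization, and 0.85 threshold are illustrative assumptions):

```python
from difflib import SequenceMatcher

names = ["John Smith", "john  smith", "Jon Smith", "Mary Jones"]

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: ratio of matching characters between normalized strings."""
    a, b = " ".join(a.lower().split()), " ".join(b.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Naive pairwise pass: flag likely duplicates that a plain unique would miss.
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if similar(a, b):
            print(f"possible duplicate: {a!r} ~ {b!r}")
```

In practice you would block records first (e.g. by zip code) so you don’t compare every pair, and a dedicated fuzzy-matching library scales much better; SequenceMatcher is just the dependency-free way to show the idea.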

Some ML techniques that can be used for Data Quality

Dimensionality Reduction

Dimensionality reduction is an ML method used to identify patterns in data and to deal with computationally intensive problems. It covers a set of algorithms that aim to reduce the number of input variables in a dataset by finding out how important each column is.

It can be used to identify and remove columns that bring little to no information among a plethora of columns, thereby curing the Curse of Dimensionality.

It can often be used as a first step before feeding the data into any other data quality algorithm. Dimensionality reduction is also helpful when working with visual and audio data involving speech, video, or image compression.

Example: UMAP, PCA
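As a minimal sketch, here is PCA from scikit-learn compressing a redundant feature set (the synthetic data, the column counts, and the 95% variance threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 1,000 rows with 50 numeric columns, 40 of them redundant.
rng = np.random.default_rng(42)
base = rng.normal(size=(1000, 10))
X = np.hstack([base, base @ rng.normal(size=(10, 40))])  # 40 derived columns

# Standardize, then keep just enough components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"{X.shape[1]} columns reduced to {pca.n_components_} components")
```

Columns that contribute almost nothing to the retained components are the ones bringing “little to no information” and are candidates for dropping.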

Clustering

Clustering organizes data into groups (clusters) based on similarity and dissimilarity. For data quality, records that don’t fall into any dense group are natural candidates for “bad” data.

Example: DBSCAN
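As a minimal sketch, DBSCAN from scikit-learn labels points that belong to no dense cluster as noise (-1), which we can surface as suspect rows (the synthetic data and the eps/min_samples values are illustrative and would need tuning per dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Illustrative data: two dense blobs of "good" records plus a few strays.
good = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(5, 0.3, (200, 2))])
bad = rng.uniform(-3, 8, (5, 2))
X = StandardScaler().fit_transform(np.vstack([good, bad]))

# DBSCAN assigns the label -1 to points that belong to no dense cluster.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
print("suspect rows:", np.flatnonzero(labels == -1))
```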

Anomaly Detection

Anomaly detection is not independent in itself; it often goes along with dimensionality reduction and clustering algorithms. By using a dimensionality reduction algorithm as a pre-stage for anomaly detection, we first transform the high-dimensional space into a lower-dimensional one. Then we can figure out the density of the major data points in this lower-dimensional space, which may be identified as “normal.” Data points located far away from the “normal” space are outliers, or “anomalies.”

Example: ARIMA
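As a minimal sketch of the time-series flavor, here is ARIMA from statsmodels forecasting daily row counts and flagging a drastic volume jump, the kind of signal the SLA alerting above relies on (the synthetic series, the (1, 0, 0) order, and the 3-sigma threshold are illustrative assumptions):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Illustrative series: 60 days of row counts, with a drastic spike on the last day.
rng = np.random.default_rng(1)
counts = 10_000 + rng.normal(0, 200, 60)
counts[-1] = 18_000  # the volume jump we want to catch

# Fit on history, forecast the latest point, and flag large deviations.
model = ARIMA(counts[:-1], order=(1, 0, 0)).fit()
forecast = model.forecast(steps=1)[0]
sigma = np.std(model.resid)

if abs(counts[-1] - forecast) > 3 * sigma:
    print(f"SLA alert: expected ~{forecast:.0f} rows, saw {counts[-1]:.0f}")
```

The point is that the “rule” (what counts as a drastic increase) is learned from history rather than hand-coded.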

Association Rule Mining

Association rule mining is an unsupervised ML technique used to identify items in large datasets that frequently occur together, exposing hidden relationships in the data.

This algorithm is commonly used to identify patterns and associations in transactional, relational, or any similar database. For example, it is possible to build an ML algorithm that analyzes the market basket by processing data from barcode scanners and defines which goods are purchased together.
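As a minimal sketch, here is the classic Apriori algorithm run over a tiny one-hot basket table, using the mlxtend library (an assumption on my part; any Apriori implementation works):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative one-hot basket data: each row is one transaction.
baskets = pd.DataFrame(
    [
        {"bread": 1, "butter": 1, "milk": 1},
        {"bread": 1, "butter": 1, "milk": 0},
        {"bread": 0, "butter": 0, "milk": 1},
        {"bread": 1, "butter": 1, "milk": 1},
    ],
    dtype=bool,
)

# Find itemsets bought together often, then turn them into "if X then Y" rules.
itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

For data quality, such learned rules can double as validation checks: a record that violates a near-certain association (bread almost always appears with butter) is worth a second look.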

Well, that’s about it! Having discussed the various approaches to data quality with machine learning, I want to emphasize that there is no one-size-fits-all solution to all your DQ needs. It varies from use case to use case: in some cases a rule-based system might work perfectly, but as the data grows and changes, eventually moving towards a machine learning approach to data quality might help us look beyond the obvious. Welcome to the future!

I will also do a follow-up article on how the tech giants are using machine learning for data quality.

Until next time,
JD

References:

https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation
