
The ultimate goal of every data scientist or Machine Learning practitioner is to build better models with higher predictive accuracy. However, in the pursuit of fine-tuned hyperparameters or improved modeling algorithms, the data themselves might actually be the culprit. There is a famous Chinese saying, "工欲善其事,必先利其器", which roughly translates to: to do a good job, an artisan needs the best tools. So if the data are of poor quality, then regardless of how good a Machine Learning model is, the results will be subpar at best.
Why is data preparation so important?

It is no secret that data preparation in the data analytics process is 'an essential but unsexy' task, and more than half of data scientists regard cleaning and organizing data as the least enjoyable part of their work.
Multiple surveys of data scientists and experts have indeed confirmed the common 80/20 trope: 80% of the time is mired in the mundane janitorial work of prepping data, from collecting and cleaning to exploring the data (data wrangling or munging), leaving only 20% for the actual analytic work of modeling and building algorithms.
Thus, the Achilles' heel of the data analytics process is in fact the unjustifiable amount of time spent on data preparation alone. For data scientists, this is a major drag on the productivity of building a meaningful model. For businesses, it is a huge drain on resources, as only the remaining one-fifth of the investment in data analytics goes toward its original intent.

Heard of GIGO (garbage in, garbage out)? This is exactly what happens here. A data scientist arrives at a task with a given dataset and the expectation of building the best model to fulfill the goal of the task. But halfway through the assignment, they realize that no matter how good the model is, it can never achieve better results. After going back and forth, they discover lapses in data quality and start scrubbing through the data to make them "clean and usable". By the time the data are finally fit for use, the deadline is creeping in, resources are drying up, and they are left with a limited amount of time to build and refine the actual model they were hired for.
This is akin to a product recall. When defects are discovered in products already on the market, it is often too late to remedy them, and the products have to be recalled to ensure consumer safety. In most cases, the defects result from negligent quality control of the components or ingredients in the supply chain: laptops recalled due to battery issues, for example, or chocolates recalled due to contamination in the dairy produce. Be it a physical or a digital product, the striking similarity is that it is always the raw material that takes the blame.
But if data quality is a problem, why not just improve it?
To answer this question, we first have to understand what is data quality.
There are two aspects to the definition of data quality. The first is intrinsic quality: a measure of the agreement between the data as presented and the same data in the real world, based on inherent characteristics and features. The second is application-dependent quality: a measure of how well the data conform to user needs for the intended purpose.
Let’s say you are a university recruiter trying to recruit fresh graduates for entry-level jobs. You have a fairly accurate contact list, but as you go through it you realize that most of the contacts are people over 50 years old, making it unsuitable for you to approach them. By the definition above, this scenario fulfills only the first half: the list is accurate and consists of good data. But it does not meet the second criterion, because the data, no matter how accurate, are not suitable for the application.
In this example, accuracy is the dimension we use to assess the intrinsic quality of the data, but there are many other dimensions. To give you an idea of which dimensions are most commonly studied in the peer-reviewed literature, here is a histogram showing the top 6 dimensions, drawn from a study of 15 different data quality assessment methodologies involving 32 dimensions.
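To make these dimensions concrete, here is a minimal sketch of how two commonly cited dimensions, Completeness and Uniqueness, might be computed on a small contact list. The data, field names, and metric definitions below are illustrative assumptions, not part of any standard methodology:

```python
# Hypothetical toy contact list; records and fields are made up for illustration.
contacts = [
    {"name": "Amy", "email": "amy@example.com"},
    {"name": "Bob", "email": None},               # missing value
    {"name": "Amy", "email": "amy@example.com"},  # duplicate record
]

# Completeness: share of records with no missing fields.
complete = [c for c in contacts if all(v is not None for v in c.values())]
completeness = len(complete) / len(contacts)

# Uniqueness: share of distinct records among all records.
distinct = {tuple(sorted(c.items())) for c in contacts}
uniqueness = len(distinct) / len(contacts)
```

Each dimension reduces to a ratio between 0 and 1, which makes it easy to compare dimensions and to set thresholds on them later in the assessment process.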

A systematic approach to Data Quality Assessment

If you fail to plan, you plan to fail. A systematic approach cannot succeed without good planning, and a good plan requires a thorough understanding of the business, especially of the problems associated with data quality. In the earlier example, one should be aware that the contact list, albeit accurate, has a data quality problem: it is not applicable to the goal of the assigned task.
Once the problems are clear, the data quality dimensions to be investigated should be defined. This can be done empirically, for example with surveys among stakeholders to find out which dimensions matter most with respect to the identified data quality problems.
A set of assessment steps should follow, designed so that the assessment along the selected dimensions maps onto the actual data. The following five requirements can serve as an example:
[1] Timeframe – Decide on the interval during which the data under investigation are collected.
[2] Definition – Define a standard for differentiating good data from bad.
[3] Aggregation – Decide how to quantify the data for the assessment.
[4] Interpretability – Specify a mathematical expression for assessing the data.
[5] Threshold – Select a cut-off point for evaluating the results.
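The five requirements above can be sketched as a minimal, self-contained check. Everything here, the sample records, the "good data" rule, and the 0.8 cut-off, is an illustrative assumption rather than a prescribed methodology; it reuses the recruiter scenario from earlier:

```python
from datetime import date

# Hypothetical contact list with a collection date per record.
records = [
    {"email": "amy@example.com", "age": 23, "collected": date(2024, 3, 1)},
    {"email": "",                "age": 24, "collected": date(2024, 3, 5)},
    {"email": "bob@example.com", "age": 57, "collected": date(2024, 3, 9)},
    {"email": "cat@example.com", "age": 22, "collected": date(2023, 11, 2)},
]

# [1] Timeframe: only assess records collected within the chosen interval.
start, end = date(2024, 1, 1), date(2024, 12, 31)
in_frame = [r for r in records if start <= r["collected"] <= end]

# [2] Definition: a record is "good" if it has an email address and the
#     age fits the intended purpose (recruiting fresh graduates).
def is_good(r):
    return bool(r["email"]) and r["age"] <= 30

# [3] Aggregation and [4] Interpretability: express quality as the ratio
#     of good records to all assessed records.
score = sum(is_good(r) for r in in_frame) / len(in_frame)

# [5] Threshold: the data are fit for use only above a chosen cut-off.
THRESHOLD = 0.8
fit_for_use = score >= THRESHOLD
```

In this toy run only one of the three in-timeframe records passes the rule, so the score falls below the threshold and the data would be flagged for revision, which is exactly the feedback loop described in the next step.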
Once the assessment methodology is in place, it is time to get hands-on and carry out the actual assessment. Afterwards, a reporting mechanism can be set up to evaluate the results: if the data quality is satisfactory, the data are fit for further analytic purposes; otherwise, the data have to be revised and potentially collected again. An example can be seen in the following illustration.

Conclusion
There is no one-size-fits-all solution to data quality problems; as the definition above outlines, half of the data quality question is highly subjective. However, in the process of data quality assessment, we can always use a systematic approach to evaluate and assess data quality. While this approach is largely objective and relatively versatile, some domain knowledge is still required, for instance in the selection of data quality dimensions: Accuracy and Completeness might be critical aspects of the data for use case A, but for use case B these dimensions might be less important.