Unconventional Methods for Data Cleaning to Improve Data Accuracy

Creative ways to conduct data cleaning on string values


Chanel Diggs / Huize Wei / Shijun Wei / Zhitao Zheng / Nicholas Zhu

Photo by Markus Spiske from Pexels

Data analytics is increasingly becoming a central part of business decisions and everyday processes. The question is no longer how we collect information but what we do with it. For companies entering the phase of data-driven decision-making, many challenges can arise, and establishing an objective and preparing for common issues are essential to deriving useful and profitable outcomes: knowing the relationships and shortcomings in your data, checking for areas that need unification, and weighing your capabilities when deciding between simple and complex models. This tutorial offers a guide to help you think critically when analyzing large datasets and starting a data analytics project. If you know little about data analytics, are interested in learning non-traditional ways of cleaning data, or are trying to solve data cleaning problems with more advanced techniques, this tutorial will come in handy.

“The use of analytics is no longer limited to big companies with deep pockets. It’s now widespread, with 59% of enterprises using analytics in some capacity.” (Kale Panoho)

The era of data analytics has arrived. With data analytics bringing continuous success to many big companies and more tools becoming accessible, it has become a necessity for any business. A recent Deloitte survey found that nearly half of all respondents (49%) assert that data analytics is a key factor in better decision-making capabilities, and nearly two-thirds say that analytics plays an important role in driving business strategy. Moreover, the influence of data analytics on marketing is rising: 55 percent of respondents said their marketing and sales groups invest in analytics, second only to finance operations. As Master of Science in Business Analytics (MSBA) students from the University of California, Davis, we are at the forefront of this trend. Over a year of practicum work with BlueMatrix, a leading technology solutions provider for the global investment research space, we completed several analytical projects and learned how to leverage the power of data analytics.

BlueMatrix’s clients have an abundance of data they can use in their decision-making process, and BlueMatrix engaged in this data analytics project to surface new insights from that data. Given that clean data is a ‘must have’ before conducting any type of analysis, companies should seek more efficient ways to speed up the cleaning process while improving accuracy, so that insights can be generated in a timely fashion. We leveraged similarity score computation and machine learning to automate the data cleaning process, which greatly improved data quality, and we believe these techniques can be easily adopted by companies in situations similar to ours. We have seen more and more companies trying to be creative and innovative about improving their workflows without knowing exactly how to achieve that goal. The two solutions introduced in this tutorial will walk you through the thought process and show examples of potential use cases.

Problem Formulation

Text data is sometimes more difficult to deal with than numerical data: simply deleting duplicates or truncating text is not enough, and more sophisticated methods are needed. For example, a company may want to see whether there are any trends in the customers who purchase their products and the companies they work for. If customers are able to enter custom text into a field, one might enter ‘ABC Co.’, another might enter ‘ABC Company’, and another might simply list ‘ABC’ (Figure 1). Identifying those customized entries and transforming them into a unified standard can be a challenge. Another issue arises when customers sign up with multiple profiles to take advantage of a new customer promotion; failing to remove the duplicated customers will lower the accuracy of any analysis. The purpose of using more advanced data cleaning methods is to increase the accuracy of the data and, in turn, the accuracy of the analysis.

Figure 1: The same company entered as ‘ABC Co.’, ‘ABC Company’, and ‘ABC’ (image by author)

Method 1: Similarity Score to Locate Duplicate Records

Why use similarity score for data cleaning

A similarity score is a versatile tool. In general, it measures what percentage of your text is found within sources in a comparison database; most popularly, it is used to detect plagiarism and to build clusters for data segmentation. While it is a more sophisticated method, it is surprisingly useful during the data cleaning process. Suppose your firm has just started using data analytics and has an abundance of data that needs to undergo a quality check; you might find duplicate records. In easy cases, exact string matching is sufficient to locate the duplicates. Increasingly, however, tasks require more than a simple join in SQL. For example, if you are matching records from two different tables by name, exact matching does not account for names that are spelled or punctuated differently, carry a prefix or suffix, or contain a typo. In those cases you need a more flexible method, such as a similarity score calculation.

How to calculate Similarity Scores

There are a few different ways to calculate a similarity score for strings. They all generally involve determining the distance between two strings, but depending on your specific case, one method might be better than another. Levenshtein distance, for example, is one of the most common and widely used measures, but its raw output can be hard to interpret systematically and it does not handle patterns such as transposed characters well. Jaro-Winkler is another distance measure that is more forgiving of character transpositions, picking up approximate matches such as ‘Kiera’ and ‘Keira’ or ‘Marc’ and ‘Mark’, and it adjusts the weighting for common prefixes. In practice, it is best to test a few different methods to find the right solution to your problem. In our case, the Python package FuzzyWuzzy performed the best.
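To make the distance idea concrete, here is a minimal, illustration-only implementation of Levenshtein distance in plain Python (the libraries discussed below use optimized C code). It counts the insertions, deletions, and substitutions needed to turn one string into the other, which is why a simple transposition like ‘Kiera’/‘Keira’ costs two edits.

#Minimal sketch of the edit-distance idea behind Levenshtein-based scores.
#Plain dynamic programming, for illustration only.
def levenshtein(s1: str, s2: str) -> int:
    """Count the insertions, deletions, and substitutions turning s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          #deletion
                               current[j - 1] + 1,       #insertion
                               previous[j - 1] + cost))  #substitution
        previous = current
    return previous[-1]

print(levenshtein("Kiera", "Keira"))  #2: a transposition counts as two edits
print(levenshtein("Marc", "Mark"))    #1: a single substitution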

FuzzyWuzzy performs fuzzy string matching that recognizes patterns, using Levenshtein distance to calculate the differences between sequences. SeatGeek developed the package and provides instructions on how to install it. One tip: make sure to also install the python-Levenshtein package, as it drastically improves performance, speeding up the process by roughly four to ten times. FuzzyWuzzy reports results as a similarity ratio between two sequences, expressed as a percentage: a score of 100 denotes a complete match, and the lower the score, the more different the two sequences are.

#Install FuzzyWuzzy (and python-Levenshtein for the speedup)
pip install fuzzywuzzy
pip install fuzzywuzzy[speedup]
#OR install directly from the GitHub repository
pip install git+git://github.com/seatgeek/fuzzywuzzy.git@0.18#egg=fuzzywuzzy

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzz.partial_ratio("Nicki Jonas", "Nicki Jones")
#Match = 91

After installing the packages and gathering the datasets, you are all set to calculate similarity scores for your own project.

choices = ["Nick Jonas", "Nicki Jonas", "Nick Jerry Jonas", "Joe Jonas"]
#Return the two closest matches and their scores
process.extract("Nick Jonas", choices, limit=2)

Tips for calculating similarity scores:

Below are a few tips for using the FuzzyWuzzy package. We share these with the goal of helping you streamline your process and avoid potential errors.

1.) Make sure your input data is as accurate as possible:

Essentially, the cleaner your data is, the faster and more reliable your string matching process will be. For example, get rid of prefixes and suffixes, job titles, and any unwanted symbols before matching.
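As a sketch of what such pre-cleaning might look like, the helper below lowercases names and strips a few assumed company suffixes, punctuation, and extra whitespace. The suffix list and patterns are illustrative and would need to be tailored to your own data.

import re

#Illustrative pre-cleaning step before fuzzy matching.
#The suffix list and patterns here are assumptions; adjust them to your data.
SUFFIXES = r"\b(inc|co|corp|company|llc|ltd)\b\.?"

def clean_name(name: str) -> str:
    name = name.lower().strip()
    name = re.sub(SUFFIXES, "", name)         #drop common company suffixes
    name = re.sub(r"[^\w\s]", "", name)       #drop punctuation and symbols
    return re.sub(r"\s+", " ", name).strip()  #collapse extra whitespace

print(clean_name("ABC Co."))      #"abc"
print(clean_name("ABC Company"))  #"abc"
print(clean_name("ABC"))          #"abc"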

2.) Set a threshold:

For your own business problem, you should choose a threshold for categorizing the similarity scores. If you only want to remove exact duplicates and keep unique records, you could discard any candidate match scoring below 100 and treat only perfect matches as duplicates. In other cases, however, you might want to set a lower threshold and accept some false positives or false negatives to accommodate other mistakes. For example, if you are comparing email addresses to find the same record, a lower threshold might be better when false positives cause less harm than false negatives.

#Similarity Ratio
a = fuzz.ratio("Nick Jonas", "Nick Jerry Jonas")
b = fuzz.ratio("Nick Jonas", "Nicki Jones")
c = fuzz.ratio("Nick Jonas", "Joe Jonas")
d = fuzz.ratio("Nick Jonas", "Nick Jonas")
e = fuzz.ratio("Nick Jonas", "Kevin Jonas")
lis = [a, b, c, d, e]
print("Here is a list of similarity scores: " + str(lis))
passed = [number for number in lis if number >= 80]
print("Here is a list of similarity scores that passed the threshold: " + str(passed))
#Output of similarity scores: [77, 86, 63, 100, 67]
#Output of similarity scores that passed the threshold: [86, 100]

3.) Choose a ratio:

There are four ratio methods you can use. Make sure to check the examples in the documentation for the details of each ratio.

Simple ratio: the simple and straightforward method of comparing two strings.

fuzz.ratio("Nick Jonas", "Nicki Jones")
#Match = 86

Partial ratio: this builds on the simple ratio by matching the shorter string against the best-matching substring of the longer one, which helps when a string contains extra characters such as punctuation.

fuzz.partial_ratio("Nicki Jonas", "Nicki Jones")
#Match = 91

Token sort ratio: this method sorts the words in each string before comparing, so it is useful when the same words appear in a different order (transpositions).

fuzz.token_sort_ratio("Nick Jonas", "Nick Jerry Jonas")
#Match = 77

Token set ratio: this method compares the common set of words and ignores duplicates, so it is useful when there are repeated or extra elements in the strings.

fuzz.token_set_ratio("Nick Jonas", "Nick Jerry Jonas")
#Match = 100
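Putting the pieces together, here is a sketch of how these scores can map free-text entries such as those in Figure 1 onto a canonical list. The canonical names, the choice of fuzz.partial_ratio, and the score_cutoff of 80 are illustrative assumptions, not the exact settings from our project.

from fuzzywuzzy import fuzz, process

#Map free-text company entries to a canonical list, mirroring the 'ABC' example
#from Figure 1. The canonical list and the cutoff of 80 are assumptions; tune
#both to your own data.
canonical = ["ABC Company", "XYZ Corporation"]
entries = ["ABC Co.", "ABC Company", "ABC", "XYZ Corp", "Totally Different Inc."]

for entry in entries:
    match = process.extractOne(entry, canonical,
                               scorer=fuzz.partial_ratio,
                               score_cutoff=80)
    print(entry, "->", match)  #(best match, score), or None if below the cutoff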

Method 2: Predicting Missing Values — Machine Learning Model

Why machine learning is helpful

Imagine you are managing a client information database with fields such as age, gender, company, and area of interest. When clients register for your product, they provide personal details such as age and gender, but some fields, area of interest for example, can be skipped at registration, so that information is not filled in by all clients.

Later, you plan to build a model to predict each client’s likelihood of churning and realize that about 15% of the values in the interest column are missing. Since it wouldn’t make sense to fill in the ‘mean’ or ‘median’ of a categorical column, one option is to put these clients into an ‘unspecified’ category and proceed with your model. However, we found that using a machine learning model to classify users into likely interest groups can be a better approach when the model’s accuracy is high: there is often enough other information available to make an educated guess about each user’s interest, which can in turn noticeably improve the accuracy of the churn predictions.
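Below is a minimal sketch of this idea, assuming a pandas DataFrame with hypothetical age, gender, company, and interest columns; the random forest and the 0.8 accuracy bar are illustrative choices rather than the exact setup from our project.

#Minimal sketch of filling a partially missing 'interest' column with a classifier.
#The column names, the random forest, and the 0.8 accuracy bar are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def fill_missing_interest(df: pd.DataFrame) -> pd.DataFrame:
    features = pd.get_dummies(df[["age", "gender", "company"]])
    known = df["interest"].notna()

    #Hold out part of the labeled rows to check whether the guesses are trustworthy.
    X_train, X_test, y_train, y_test = train_test_split(
        features[known], df.loc[known, "interest"], test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    if accuracy >= 0.8:  #only trust the model if it is accurate enough
        df.loc[~known, "interest"] = model.predict(features[~known])
    else:                #otherwise fall back to an 'unspecified' bucket
        df.loc[~known, "interest"] = "unspecified"
    return df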

That said, we suggest starting with an evaluation of the columns in your model. Running machine learning models can be time consuming, so examine the available information first. If your model’s performance would not change much after applying this method, a simpler approach such as the ‘unspecified’ category would be the better solution.

A few steps to help you get started:

1.) Deciding on a model:

If you have decided to proceed with a machine learning model, we suggest you first take a closer look at its use cases. When will the model be used? How often? Where will it be implemented? Asking these questions will guide your thinking about your input data and your capabilities. In our case, we needed to consider what information would actually be available at the time the customer’s interests are classified for the churn prediction.

Additional considerations in choosing a model include how quickly you need decisions and how much you value accuracy. For example, if you want to improve accuracy by using a deep learning model rather than a tree model, evaluating the tradeoff between them is very important. Tree models tend to have lower costs and require less tuning than deep learning models. Deep learning models might perform better, but they can have a higher development cost depending on the situation, they are harder to explain than a tree model, and their performance may not improve if they are not tuned correctly. As James Montantes put it in his article, “in an industry setting, we need a model that can give meaning to a feature/variable to stakeholders. And these stakeholders will likely be anyone other than someone with a knowledge of deep learning or machine learning.” The use case of the model, the amount of resources available, and the optimal balance between cost and benefit are all things to consider before you start building.

2.) Data preparation:

At this stage, you have chosen the models to test and will begin preparing the data for them. Beyond the typical checks and steps, we suggest you start thinking about how to support the model’s accuracy. In supervised learning, if it makes sense for the business problem, consider reducing the number of categories in the target. In our earlier example, too many distinct interest categories would make a reasonable prediction difficult, so we can group the interests by hierarchy, for instance putting music, movies, and TV shows together under a larger category called entertainment. In our practicum project, we grouped small, similar subsets into larger sets to reduce the effect of imbalanced data and improve the model’s predictive power by keeping the number of categories reasonable, as in the sketch below.
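A sketch of that grouping step might look like the following; the mapping dictionary is an assumption for illustration, and in practice you would build it from your own category hierarchy.

#Illustrative grouping of fine-grained interests into broader categories,
#following the 'entertainment' example above. The mapping is an assumption.
import pandas as pd

interest_groups = {
    "music": "entertainment",
    "movies": "entertainment",
    "tv shows": "entertainment",
    "running": "sports",
    "soccer": "sports",
}

df = pd.DataFrame({"interest": ["music", "soccer", "movies", "cooking"]})
#Anything not in the mapping keeps its original label.
df["interest_group"] = df["interest"].map(interest_groups).fillna(df["interest"])
print(df["interest_group"].value_counts())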

3.) Model evaluation:

Once the models are built, the final step is to evaluate the performance of the models. There are several commonly used metrics such as classification accuracy, Area Under the Curve, and F1 score. There is a significant amount of information on these metrics and use cases on the internet, so we would like to focus on two things we learned through our own experiences.

The first is handling missing information. In the previous sections, we discussed the importance of identifying the input data that will actually be available when the model is applied. We recommend testing the model without one of its features to see how it performs when that information is missing. For example, if ‘Company’ is one of the most important features for determining interests, the model might perform much worse when a new set of clients is added whose companies are not yet in the training database. Running the model again without one of the most important columns helps evaluate its performance in those situations, as in the sketch below.
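One way to run that check is sketched below, reusing the hypothetical columns and random forest from the earlier sketch; cross-validated accuracy with and without the ‘company’ column gives a rough sense of how much the model depends on it.

#Compare model accuracy with and without one important feature (here 'company',
#echoing the 'Company' example above). Assumes the same hypothetical columns
#and random forest as in the earlier sketch.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def accuracy_without(df: pd.DataFrame, dropped=None) -> float:
    cols = ["age", "gender", "company"]
    if dropped:
        cols = [c for c in cols if c != dropped]
    X = pd.get_dummies(df[cols])
    y = df["interest"]
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

#baseline = accuracy_without(df)
#degraded = accuracy_without(df, dropped="company")
#A large gap warns you the model may struggle when 'company' is unknown.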

The last thing is to adjust the model regularly as new information comes in. Models need to be maintained in order to continually provide meaningful insights. Maybe you will start collecting a new feature of your customers and want to add it to your models, or maybe the model needs to be retrained as time passes to improve its performance. Don’t just leave your model there and think it will work forever. Monitor its performance and put in effort to maintain it.

Conclusion

By now, you should have a solid understanding of two data cleaning methods to try out on your own text data. While calculating similarity scores and using machine learning models may be more complicated, sometimes a more complex method is needed to address the complexity and idiosyncrasies of the data. This tutorial provides a starting point for thinking through your objectives, working through the process, and hopefully obtaining a cleaned, transformed, and unified data set prepped for analysis. So whether this is your first analytics project or you are just now encountering the difficulties of string variables, we hope our tips and suggestions will help you avoid some common issues and save time troubleshooting throughout the process.
