About ten years ago, a family friend had an interesting problem. He was a medical examiner and was responsible for cleaning up spreadsheets of data for reporting. He wanted to aggregate statistics by key columns such as primary and secondary causes of death. The issue he had was that these entries were hand-typed, and variations in format and spelling were sending him over the edge. He was spending hours each month cleaning up all of these datasets. Not one to waste his time, he wanted a solution.
At the time, I only knew of rules-based classification to assist. These days, I could easily recommend different techniques for solving this problem. With just a few lines of code and some basic human validation, a great majority of this work can be fully automated. With the correct setup, unmatched fields could be flagged for human review.
The Problem
Rather than explore such a dark dataset focused on causes of death, let’s examine a much brighter scenario. A data store (in this case, an Excel spreadsheet) contains a text field denoting a US holiday. Employees hand-type this field and use a variety of spellings and abbreviations. You want to match these text values against a master US holiday list. You can then go back and update your data source as you see fit.
The Data
For this scenario, I created two spreadsheets. One spreadsheet contains a column of the holiday values that the employees have typed in. I have also included a Validation column so that I can compare the results against my expectations.
The other spreadsheet contains the master list of US holidays with the preferred spelling.

I loaded up the data and created both pandas DataFrames and Python lists.
import pandas as pd
import numpy as np
from platform import python_version
print('python version: ' + python_version())
path = '<path to your files>'
# create dataframes
official_list_df = pd.read_excel(path + 'holiday official list.xlsx')
hand_typed_list_df = pd.read_excel(path + 'holidays.xlsx')
# create lists
official_list = official_list_df['Holiday Name'].fillna('*').to_list()
hand_typed_list = hand_typed_list_df['Holiday'].fillna('*').to_list()
validation_list = hand_typed_list_df['Validation'].fillna('*').to_list()
Possible Solutions
I’m going to review two Python packages, fuzzymatcher and fuzzywuzzy. Each package has its advantages, and the one you choose might come down to preference.
fuzzymatcher
Fuzzymatcher was designed to fuzzy-join two pandas DataFrames on one or more text columns. This simple interface is convenient because pandas is such a popular package.
More information on fuzzymatcher:
- https://pypi.org/project/fuzzymatcher/
- https://github.com/RobinL/fuzzymatcher
- Fuzzymatcher uses SQLite’s full-text search to find potential matches. (You may need pip install pysqlite3; the ‘FTS’ extensions can be found in the anaconda/dll folder.)
- It then uses a probabilistic record linkage to score matches.
- The output is the list of the matches it has found and the associated score.
The code was straightforward – join two data frames and identify the joining columns:
import fuzzymatcher

hand_typed_list_join_on = ['Holiday']
official_list_join_on = ['Holiday Name']

matched = fuzzymatcher.fuzzy_left_join(hand_typed_list_df,
                                       official_list_df,
                                       hand_typed_list_join_on,  # left join on
                                       official_list_join_on,    # right join on
                                       left_id_col='Holiday',
                                       right_id_col='Holiday Name')
matched

The results are mixed but workable. Your Seinfeld-loving employee entering Festivus was correctly not matched. MLK, Veteran’s, and President’s Days were the most problematic.
The scoring system is not intuitive. At first glance, I assumed I should drop all matches with scores below zero, but the results show that negative values aren’t a hard cutoff.
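One practical workaround is to pick a provisional cutoff empirically and route everything below it to a human. Here is a minimal sketch of that idea. The best_match_score column name comes from fuzzymatcher’s output; the toy data and the cutoff of 0.1 are purely illustrative and would need tuning against validated rows.

```python
import pandas as pd

# toy stand-in for fuzzymatcher's output: matched rows plus a score column
matched = pd.DataFrame({
    'Holiday': ['xmas', 'Festivus', 'July 4'],
    'Holiday Name': ['Christmas Day', None, 'Independence Day'],
    'best_match_score': [0.55, -0.30, 0.21],
})

# flag anything under a provisional cutoff for human review
cutoff = 0.1  # illustrative only -- tune against validated data
matched['needs_review'] = matched['best_match_score'] < cutoff

print(matched[['Holiday', 'needs_review']])
```

Everything flagged needs_review goes to the manual queue; the rest is accepted automatically.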
Let’s take a look at the next package.
fuzzywuzzy (most fun package name ever)
Fuzzywuzzy info:
- https://pypi.org/project/fuzzywuzzy/
- "Fuzzy string matching like a boss."
- It uses Levenshtein Distance to calculate the differences between sequences.
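To make that concrete, here’s a pure-Python sketch of Levenshtein distance using the standard dynamic-programming approach. Fuzzywuzzy’s 0–100 scores are derived from this kind of edit distance (the library itself uses an optimized implementation, so this is only for intuition):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    # prev holds distances from a[:i-1] to every prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('Presidents Day', 'President’s Day'))  # 1: one inserted apostrophe
```

Small distances relative to string length translate to high match scores, which is why minor spelling variations still match well.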
I added some code to build a nicely formatted results DataFrame that evaluates each match and flags whether the matching behaved as expected, given the score cutoff I set.
After several iterations, I settled on accepting matches with a score greater than 86.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# collect result rows so we can review them
rows = []

# iterate over each element of the lists
for index, row in hand_typed_list_df.iterrows():
    result = process.extractOne(row.Holiday, official_list)
    # does the potential match value equal the validation value?
    if str(result[0]) == row.Validation:
        res = 'Yes'
    else:
        res = 'No'
    # does the match score meet the minimum threshold set?
    if result[1] > 86:
        allow = 'Yes'
    else:
        allow = 'No'
    # are my match and scoring cutoff flags consistent?
    if res == allow:
        final = 'Passed'
    else:
        final = 'Needs Review'
    # collect the data for the results dataframe
    rows.append({'Hand_Typed_Name': str(row.Holiday),
                 'Matched_Name': str(result[0]),
                 'Match_Score': str(result[1]),
                 'Allow_Flag': str(allow),
                 'Validation_value': str(row.Validation),
                 'Matched_Flag': str(res),
                 'Success_Code': str(final)})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
results_df = pd.DataFrame(rows, columns=['Hand_Typed_Name', 'Matched_Name', 'Match_Score',
                                         'Allow_Flag', 'Validation_value', 'Matched_Flag',
                                         'Success_Code'])
results_df.head(40)

The scoring scheme is much easier to understand than fuzzymatcher’s. As you can see from the scores, choosing the cutoff value is not an exact science. The MLK value (index 26) received a low score of 57, yet the match was correct. On the other hand, there are three scores of 86: one matched successfully while the other two did not.
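If you want something more systematic than eyeballing, you can sweep candidate cutoffs against a validated sample and keep the one that classifies the most rows correctly. A small stdlib-only sketch; the (score, was_correct_match) pairs below are made up to echo the results above, not real output:

```python
# made-up (score, was_correct_match) pairs standing in for validated results
sample = [(100, True), (95, True), (86, True), (86, False),
          (86, False), (72, False), (57, True), (45, False)]

def accuracy_at(cutoff, data):
    """Fraction of rows where 'score > cutoff' agrees with the validation."""
    hits = sum(1 for score, correct in data if (score > cutoff) == correct)
    return hits / len(data)

# sweep all integer cutoffs and keep the first one with the best accuracy
best = max(range(0, 101), key=lambda c: accuracy_at(c, sample))
print(best, accuracy_at(best, sample))  # → 86 0.75
```

No cutoff will be perfect when a correct match scores 57 while wrong ones score 86; the sweep just tells you where the least-bad trade-off sits, and the misfits go to human review.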
Conclusion
Using Fuzzy Matching to clean up datasets is a handy skill to have in most workplaces. There are almost always messy text fields that some poor soul is manually updating each month before reporting. Build an automated process to do a good 80% of the work before deferring to the human for the edge cases. You’ll be the office hero!