Tutorial for Matching Sequences With the FuzzyWuzzy Library

This tutorial covers how to match strings by their similarity. FuzzyWuzzy can save you a lot of time during the data science process by providing tools built on the Levenshtein distance calculation. Along with examples, I will also include some helpful tips to get the most out of FuzzyWuzzy.
String matching is useful in a variety of situations, for example, joining two tables by an athlete’s name when it is spelled or punctuated differently in each table. This is where FuzzyWuzzy comes in and saves the day! Instead of trying to format the strings so that they match exactly, FuzzyWuzzy computes a similarity ratio between two sequences and returns it as a percentage. For a more detailed description, take a look at the documentation.
Although it isn’t required, python-Levenshtein is highly recommended with FuzzyWuzzy. It makes the string matching process 4–10x faster, but the results may differ from those of difflib, the standard-library module for comparing sequences that FuzzyWuzzy otherwise relies on. Let’s start by importing the necessary libraries and going over a simple example.
#Installing FuzzyWuzzy (run these in a terminal; python-Levenshtein is optional)
pip install fuzzywuzzy
pip install python-Levenshtein
#Import
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Str_A = 'FuzzyWuzzy is a lifesaver!'
Str_B = 'fuzzy wuzzy is a LIFE SAVER.'
ratio = fuzz.ratio(Str_A.lower(), Str_B.lower())
print('Similarity score: {}'.format(ratio))
#Output
Similarity score: 93
We used the ratio() function above to calculate the Levenshtein distance similarity ratio between the two strings (sequences). The similarity ratio here is 93%: Str_B is 93% similar to Str_A when both are lowercased.
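To make the score less of a black box, here is a minimal sketch of how a ratio like this can be computed with only the standard library. It assumes the pure-Python behaviour, where the score comes from difflib's SequenceMatcher; with python-Levenshtein installed the numbers can differ. The helper name simple_ratio is hypothetical and is not part of FuzzyWuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Return a 0-100 score: 2 * matched characters / combined length."""
    matcher = SequenceMatcher(None, a, b)
    return int(round(100 * matcher.ratio()))


str_a = 'FuzzyWuzzy is a lifesaver!'
str_b = 'fuzzy wuzzy is a LIFE SAVER.'

# 25 characters line up across the two lowercased strings, whose combined
# length is 54, so the score is round(100 * 50 / 54).
score = simple_ratio(str_a.lower(), str_b.lower())
print('Similarity score: {}'.format(score))
```

The two extra spaces and the differing final punctuation mark are the only characters that fail to line up, which is why the score stays high.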
Partial Ratio
FuzzyWuzzy also has more powerful functions to help with matching strings in more complex situations. The partial_ratio() function allows us to perform substring matching. It works by taking the shorter string and comparing it with the substrings of the longer string that have the same length, keeping the best score.
Str_A = 'Chicago, Illinois'
Str_B = 'Chicago'
ratio = fuzz.partial_ratio(Str_A.lower(), Str_B.lower())
print('Similarity score: {}'.format(ratio))
#Output
Similarity score: 100
Using the partial_ratio() function above, we get a similarity ratio of 100. In the scenario of Chicago and Chicago, Illinois this can be helpful since both strings refer to the same city. This function is also useful when matching names. For example, if one sequence is someone’s first and middle name and the sequence you’re trying to match it against is that person’s first, middle, and last name, the partial_ratio() function will return a 100% match since the first and middle names are the same.
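The substring idea can be sketched as a brute-force sliding window. This is only an illustration under that assumption: the library chooses its candidate windows more selectively, so scores on messy inputs may not agree exactly. The name simple_partial_ratio is hypothetical.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against same-length windows of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window the size of the shorter string across the longer one.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(simple_partial_ratio('chicago, illinois', 'chicago'))
```

Because the first window of the longer string is exactly 'chicago', the best window is a perfect match.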
Token Sort Ratio
FuzzyWuzzy also has token functions that tokenize the strings, change capitals to lowercase, and remove punctuation. The token_sort_ratio() function sorts the tokens of each string alphabetically and then joins them back together. Then, fuzz.ratio() is calculated on the results. This can come in handy when the strings you are comparing contain the same words but not in the same order. Let’s use another name example.
Str_A = 'Gunner William Kline'
Str_B = 'Kline, Gunner William'
ratio = fuzz.token_sort_ratio(Str_A, Str_B)
print('Similarity score: {}'.format(ratio))
#Output
Similarity score: 100
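To see why the score is 100, it helps to look at what the two strings become after tokenizing and sorting. The sketch below approximates that preprocessing with the standard library; the regular expression and the helper names are my own assumptions, not the library's code.

```python
import re
from difflib import SequenceMatcher


def sorted_tokens(text: str) -> str:
    """Lowercase, replace punctuation with spaces, then sort the words."""
    cleaned = re.sub(r'\W+', ' ', text.lower())
    return ' '.join(sorted(cleaned.split()))


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Compare the two strings after normalising and sorting their tokens."""
    matcher = SequenceMatcher(None, sorted_tokens(a), sorted_tokens(b))
    return int(round(100 * matcher.ratio()))


print(sorted_tokens('Gunner William Kline'))
print(sorted_tokens('Kline, Gunner William'))
print(simple_token_sort_ratio('Gunner William Kline', 'Kline, Gunner William'))
```

Both names reduce to the same sorted string, so the comma and the word order no longer affect the comparison.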
Token Set Ratio
The token_set_ratio() function is similar to the token_sort_ratio() function above, except it pulls out the tokens the two strings have in common before calculating fuzz.ratio() between the new strings. This function is most helpful when applied to strings with a significant difference in length.
Str_A = 'The 3000 meter steeplechase winner, Soufiane El Bakkali'
Str_B = 'Soufiane El Bakkali'
ratio = fuzz.token_set_ratio(Str_A, Str_B)
print('Similarity score: {}'.format(ratio))
#Output
Similarity score: 100
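The sketch below shows one way the set-based comparison can work, based on my reading of the approach: split the tokens into a shared part and each string's leftovers, rebuild the strings, and keep the best of three comparisons. The helper names are hypothetical and the scoring uses difflib rather than the library itself.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> set:
    """Lowercase, strip punctuation, and return the set of distinct words."""
    return set(re.sub(r'\W+', ' ', text.lower()).split())


def ratio(a: str, b: str) -> int:
    """0-100 similarity score from difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def simple_token_set_ratio(a: str, b: str) -> int:
    tokens_a, tokens_b = tokens(a), tokens(b)
    common = ' '.join(sorted(tokens_a & tokens_b))
    only_a = ' '.join(sorted(tokens_a - tokens_b))
    only_b = ' '.join(sorted(tokens_b - tokens_a))
    # Rebuild each side as "shared tokens + its own leftovers".
    combined_a = (common + ' ' + only_a).strip()
    combined_b = (common + ' ' + only_b).strip()
    # Keep the best of the three pairwise comparisons.
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))


str_a = 'The 3000 meter steeplechase winner, Soufiane El Bakkali'
str_b = 'Soufiane El Bakkali'
print(simple_token_set_ratio(str_a, str_b))
```

Every token of the shorter string also appears in the longer one, so the shared part is identical to the rebuilt shorter string and that comparison scores 100, regardless of how many extra words the longer string carries.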
Process Module
FuzzyWuzzy also comes with a handy module, process, that takes a list of strings and returns them along with their similarity scores against a query string. All you need to do is call the process.extract() function.
choices = ["3000m Steeplechase", "Men's 3000 meter steeplechase", "3000m STEEPLECHASE MENS", "mens 3000 meter SteepleChase"]
process.extract("Men's 3000 Meter Steeplechase", choices, scorer=fuzz.token_sort_ratio)
#Output
[("Men's 3000 meter steeplechase", 100),
('mens 3000 meter SteepleChase', 95),
('3000m STEEPLECHASE MENS', 85),
('3000m Steeplechase', 77)]
Similar to the extract() function, you can also use the process module to extract only the string with the highest similarity score by calling the extractOne() function.
choices = ["3000m Steeplechase", "Men's 3000 meter steeplechase", "3000m STEEPLECHASE MENS", "mens 3000 meter SteepleChase"]
process.extractOne("Men's 3000 Meter Steeplechase", choices, scorer=fuzz.token_sort_ratio)
#Output
("Men's 3000 meter steeplechase", 100)
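Conceptually, both functions score every choice against the query and sort the results. The following sketch mimics that behaviour with a stand-in scorer built on difflib; rank_choices and simple_score are hypothetical names, and the scores will not match the token_sort_ratio output shown above.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Case-insensitive 0-100 similarity; a stand-in for a FuzzyWuzzy scorer."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def rank_choices(query, choices, scorer=simple_score, limit=5):
    """Score every choice against the query and return the best `limit` pairs."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["3000m Steeplechase", "Men's 3000 meter steeplechase",
           "3000m STEEPLECHASE MENS", "mens 3000 meter SteepleChase"]
ranked = rank_choices("Men's 3000 Meter Steeplechase", choices)
best_match = ranked[0]  # the single-result case is just the top of the list
print(best_match)
```

Passing a different scorer function changes the ranking, which is the same role the scorer argument plays in the calls above.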
String Replacement With FuzzyWuzzy
Take a look at the dataframes below, df_1 on the left and df_2 on the right. df_1 contains the athletes that participated in the Summer Olympic Games. This dataframe has a name column where the athlete names are strings. If I wanted to get the results for the events these athletes participated in, I would scrape the results tables and put them in a second dataframe. From there, I could left join the event results onto the dataframe on the left. To do this, I need to specify the column or index levels to join on.


Here, we run into the problem that some of the names in the first dataframe are not in the same format as in the second dataframe. If I were to left join the second dataframe to the first on the name column, those names would not find a match and their values would be missing from the joined rows. Instead, we can cast the names from each dataframe into a list, and then write a function with FuzzyWuzzy to build a dictionary holding the strings we need to replace in order to find matches.
#Casting the name column of both dataframes into lists
df1_names = list(df_1.name.unique())
df2_names = list(df_2.name.unique())
#Defining a function to return the best match and its fuzz.ratio() similarity score.
#It takes a term (name), a list of terms (list_names), and a minimum similarity score (min_score).
def match_names(name, list_names, min_score=0):
    #Returns ('', -1) when no candidate scores above min_score
    max_score = -1
    max_name = ''
    for x in list_names:
        score = fuzz.ratio(name, x)
        if (score > min_score) and (score > max_score):
            max_name = x
            max_score = score
    return (max_name, max_score)
#For loop to create a list of tuples with the first value being the name from the second dataframe (name to replace) and the second value being its match from the first dataframe (string replacing the name value). Then, casting the list of tuples as a dictionary.
names = []
for x in df2_names:
    match = match_names(x, df1_names, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)
name_dict
#Output
{'Abdelatif Chemlal': 'Abdelatif Chemlal',
'Abdelkader Hachlaf': 'Abdelkader Hachlaf',
'Abderrahim Goumri': 'Abderrahim Al-Goumri',
'Abraham Kiprotich': 'Abraham Kipchirchir Rotich',
'Abubaker Ali Kamal': 'Abubaker Ali Kamal',
'Adil Kaouch': 'Adil El-Kaouch',
'Adrián Annus': 'Adrin Zsolt Annus',
'Ahmad Hazer': 'Ahmad Hazer',
'Ahmed Faiz': 'Ahmed Ali',
'Ahmed Mohamed Dheeb': 'Mohammed Ahmed',
'Ak Hafiy Tajuddin Rositi': 'Ak Hafiy Tajuddin Rositi',
'Aleksandr Bulanov': 'Aleksandar Rakovi',
'Aleksey Lesnichiy': 'Aleksey Lesnichy',
'Alemayehu Bezabeh': 'Alemayehu Bezabeh Desta',
'Alemitu Bekele': 'Alemitu Bekele Degfa',
'Alex Schwazer': 'Alex Schwazer',
'Alicia Brown': 'Alicia Brown',
'Alissa Kallinikou': 'Alissa Kallinikou',
'Allison Randall': 'Allison Randall',
'Amaka Ogoegbunam': 'Amaka Ogoegbunam',
'Amantle Montsho': 'Amantle Montsho',
'Amina Aït Hammou': 'Amina "Mina" At Hammou',
'Amine Laâlou': 'Amine Lalou',
'Anastasios Gousis': 'Anastasios "Tasos" Gousis',
'Anastasiya Soprunova': 'Anastasiya Valeryevna Soprunova',
'Antonio David Jiménez': 'AntonioDavid Jimnez Pentinel',
'Anzhelika Shevchenko': 'Anzhelika Viktorivna Shevchenko'}
#Using the dictionary to replace the keys with the values in the 'name' column for the second dataframe
df_2.name = df_2.name.replace(name_dict)
As you can see from the output above, casting the list of tuples as a dictionary lets us easily replace each original string with its match. With that done, when I join the dataframes together, the values will line up on the matching name.
import pandas as pd
combined_dataframe = pd.merge(df_1, df_2, how='left', on='name')
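Some pairs in the dictionary output above, such as 'Ahmed Faiz' mapped to 'Ahmed Ali', look like different people, so it may be worth reviewing the replacements before trusting the join. Below is a small sketch of one way to do that: list only the pairs that actually change and sort them weakest first. The helper pairs_to_review is hypothetical, and the scores come from a difflib stand-in, so they will not equal the fuzz.ratio() values.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """0-100 similarity score; a stand-in for fuzz.ratio()."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def pairs_to_review(name_map: dict) -> list:
    """Return (score, original, replacement) for changed pairs, weakest first."""
    changed = [(simple_score(old, new), old, new)
               for old, new in name_map.items() if old != new]
    return sorted(changed)


# A few entries copied from the dictionary output above.
sample = {
    'Alex Schwazer': 'Alex Schwazer',
    'Aleksey Lesnichiy': 'Aleksey Lesnichy',
    'Ahmed Faiz': 'Ahmed Ali',
}
for score, old, new in pairs_to_review(sample):
    print(score, old, '->', new)
```

Identical pairs are skipped because replacing a name with itself cannot introduce an error; the pairs at the top of the list are the ones most likely to need a manual check or a higher minimum score.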
Conclusion
This post introduced the FuzzyWuzzy library for string matching in Python. There are many different use cases for FuzzyWuzzy, and it can definitely save you time when finding a string match. I would recommend spending some time playing around with the different functions and methods to find the best solution to your problem. Thank you so much for taking the time to check out my blog!
References
- Arias, F. (2019). Fuzzy String Matching in Python. Retrieved October 27, 2020, from https://www.datacamp.com/community/tutorials/fuzzy-string-python?utm_source=adwords_ppc_
- Gitau, C. (2018, March 05). Fuzzy String Matching in Python. Retrieved October 27, 2020, from https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe
- Fuzzywuzzy. (n.d.). Retrieved October 27, 2020, from https://pypi.org/project/fuzzywuzzy/
- Ztane. (n.d.). Ztane/python-Levenshtein. Retrieved October 27, 2020, from https://github.com/ztane/python-Levenshtein/
- Difflib – Helpers for computing deltas. (n.d.). Retrieved October 27, 2020, from https://docs.python.org/3/library/difflib.html