How To Do Fuzzy Matching on Pandas Dataframe Column Using Python?

Sankarshana Kadambari
Towards Data Science
3 min readSep 18, 2019

--

Fuzzy String Matching With Pandas and FuzzyWuzzy

Fuzzy string matching or searching is a process of approximating strings that match a particular pattern. It is a very popular add on in Excel. It gives an approximate match and there is no guarantee that the string can be exact, however, sometimes the string accurately matches the pattern. How close the string is to a given match is measured by the edit distance. FuzzyWuzzy uses Levenshtein Distance to calculate the edit distance.

How to Install FuzzyWuzzy Package

To install FuzzyWuzzy you can use the pip command as follows

Pip install fuzzywuzzy
Pip install python-Levenshtein

One can also use the conda to install FuzzyWuzzy. The following command will install the library.

conda install -c conda-forge fuzzywuzzy
conda install -c conda-forge python-levenshtein

let’s Import the packages now as we have successfully installed the above-mentioned libraries.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

For now, let me give you an insight into how it is used.

In[1]:fuzz.ratio(“Sankarshana Kadambari”,”Sankarsh Kadambari”)
Out[1]:92
In[2]:fuzz.partial_ratio("Sankarshana Kadambari","Sankarsh Kadambari")
Out[2]:83

Fuzz has various methods where you can compare strings such as ratio(),partial_ratio(),Token_Sort_Ratio(),Token_Set_Ratio().Now the next question when to use which fuzz function. You can read it here the scenarios are very well explained. Also, I will provide references in the end where it is explained further in detail along with code.

Now the agenda of the blog how to use this in Pandas Dataframe between two columns and then export it to excel?

import pandas as pd 
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
def checker(wrong_options,correct_options):
names_array=[]
ratio_array=[]
for wrong_option in wrong_options:
if wrong_option in correct_options:
names_array.append(wrong_option)
ratio_array.append(‘100’)
else:
x=process.
extractOne(wrong_option,correct_options,scorer=fuzz.token_set_ratio)
names_array.append(x[0])
ratio_array.append(x[1])
return names_array,ratio_array

In the above code snippet, I have used token_set_ratio because it suits my requirement also I have added an if block to reduce the number of iterations by checking the name exactly in the second column because at times there is a good chance of occurrence. You can try various other methods in the scorer parameter I have shared the sources at the bottom where the working of remaining fuzz methods can be studied in detail.

I won't say this is the only way to implement the fuzz logic but you can also try yourself how to increase the scalability of the code along with reducing the execution time on the other hand.

Now let us pass the arguments in the method and create an output excel file with results.

df_Original_List=pd.read_excel(“Original_list.xlsx”,sheet_name=’Sheet1',usecols=[“Original Name Column”])df_To_beMatched=pd.read_excel("To be matched.xlsx",sheet_name='Sheet1',usecols=["To be Matched Column"])

Cleaning the columns is very important before passing it to the function in order to avoid errors. The main runtime errors are created by NAN Values in the column. I handle them in the following way.

str2Match = df_To_beMatched[‘To be Matched Column’].fillna(‘######’).tolist()
strOptions =df_Original_List[‘Original Name Column’].fillna(‘######’).tolist()

Now let us pass the arguments in methods.

name_match,ratio_match=checker(str2Match,strOptions)df1 = pd.DataFrame()
df1[‘old_names’]=pd.Series(str2Match)
df1[‘correct_names’]=pd.Series(name_match)
df1[‘correct_ratio’]=pd.Series(ratio_match)
df1.to_excel(‘matched_names.xlsx’, engine=’xlsxwriter’)

Voila… there it is. I hope this article is useful for you in your respective domains. It is a very powerful function used in excel but now it can be used in python as well for text analytics or analysis.

Other sources where you can find remaining methods implemented and their respective working.

  1. https://medium.com/@categitau/fuzzy-string-matching-in-python-68f240d910fe
  2. https://www.geeksforgeeks.org/fuzzywuzzy-python-library/
  3. https://www.datacamp.com/community/tutorials/fuzzy-string-python
  4. https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49
  5. https://galaxydatatech.com/2017/12/31/fuzzy-string-matching-pandas-fuzzywuzzy/

--

--

Good implementation skills and a zeal to achieve help me in programming. Accomplishing common goal in a team has always been my priority.