How to perform approximate string matching in one line of code

Fuzzy string matching of customer name lists using Python (for everybody)

Lorenzo Tracanna
Towards Data Science

--

Image by Tolga Ulkan from Unsplash

Did you ever try to join two tables using keys or lookup values that don’t match perfectly? Maybe you work in finance or marketing and you have to match two customer lists from two different data sources. While the customer names are meant to be the same, their spelling differs, making it an insurmountable challenge for the VLOOKUP function. If you are going to match non-matching lists of strings (e.g., customer names, products, transactions, addresses), keep reading — this article might save you hours of work.

The article starts by introducing the scientific method used to perform the approximate matches, the Levenshtein distance. Then, it goes through a case study, which shows how to import data from Excel into Python, perform the matches in one line of code, and export the resulting table of 1:1 matches and similarity scores to Excel.

Methods: Levenshtein distance and FuzzyWuzzy

Fuzzy string matching or approximate string matching is a technique that, given a target string, will find its closest match from a list of non-exact matches. If you attempted to use Excel’s approximate VLOOKUP to carry out fuzzy matching, you would know that it works with a sorted list of numbers but not with strings. The reason is that the Excel function returns a match from the lookup range that is lower than the lookup value, and whose successor is greater than the lookup value. The Excel logic is difficult to apply to strings, so we turn to the Levenshtein distance to carry out approximate string matching.

The Levenshtein distance is a robust and effective measurement of the distance between two sequences. The algorithm calculates the edit distance — the minimum number of edits (insertions, deletions, or substitutions) needed for a single-word sequence to match the target sequence. The metric has never been improved since its invention in 1965, and it counts numerous applications, including the comparison of genomes of different species. This article doesn’t provide a mathematical explanation of the algorithm; hopefully, you trust MIT researchers who consider it the best edit distance algorithm we will ever get.

We could write a function that computes the Levenshtein distance but the FuzzyWuzzy package already did that for us. Once installed, the package comes with a range of methods, which ultimately compute the Levenshtein distance similarity ratio. Let’s go through a case study that will get you started on approximate string matching.

Case study: approximate matching of two customer name lists

Problem statement

Let’s assume that we are tasked with matching two lists of customer names. The customer names come from two different sources: list A is neat, while list B is polluted. The 30 customer names should match 1:1 but the ones in list B are spelled out wrong. In addition, list B contains blank cells, a punctuation mark, and duplicate names (see image below). The fictitious dataset is generated by picking company names randomly from the S&P 500 Index.

Image by Author. Fictitious lists of customer names

One option to solve this task would be to perform the matches manually (Accenture:Accent llxure, Adobe:Addobbe, etc.). Yet, if the lists are too long or the task needs to be repeated, one should opt for a more efficient solution, leveraging the Levenshtein distance through the FuzzyWuzzy package.

Step 1: Data import and data cleaning

The first step is to import the customer names table from Excel into Python and get the customer lists ready for matching. If you already know how to do this, skip to step 2, but remember to import the relevant packages.

Before importing the FuzzyWuzzy package, we have to install it. We can use the command line pip install fuzzywuzzy in Anaconda Prompt. To increase by 4–10x the speedup of the strings matching, we are suggested to install the Levenshtein Python C (use the command line pip install python-Levenshtein ).

Now, let’s import the relevant packages:

# Import packages
import os
import pandas as pd
from fuzzywuzzy import fuzz

If the location of your input file is not the same as your Python’s file, you need to change directory. Then, we use pandas.read_excel() to import the customer lists into a pandas DataFrame. Using a list comprehension, we loop through the customer names in list A to remove the blanks at the bottom of the list. Similarly, we use a list comprehension to remove the missing values from list B and retain only the unique customer names. Slimming down the lists will reduce the workload for the fuzzy string matching algorithm.

# Change directory
# os.chdir(r"Your file path")
# Import the customers data
filename = "customers lists input.xlsx"
customers = pd.read_excel(filename)
# Clean customers lists
A_cleaned = [customer for customer in customers["list_A"] if not(pd.isnull(customer))]
B_cleaned = [customer for customer in customers["list_B"].unique() if not(pd.isnull(customer))]

Voilà, the customer names are ready to be matched.

Step 2: Perform the approximate match with FuzzyWuzzy in just one line

The goal is to generate a tuples list of similarity scores and matching strings from list B in one line of code. For this purpose, we use a nested list comprehension, which is considered more “pythonic” than a nested for loop.

Pythonic describes a coding style that leverages Python’s unique features to write code that is readable and beautiful
(What Is Pythonic Style? | Udacity)

Starting from the outer loop in the snippet below, for each customer name in A_cleaned, we generate a list of tuples. Each tuple consists of a customer name from list B and its similarity score in relation to the customer name from list A. The list A_cleaned counts 30 customer names, so the loop generates 30 lists, each consisting of 31 tuples (there are 31 customer names in B_cleaned). Since we are only interested in the best-matching customer names from list B, we wrap the inner loop into the built-in function max(), so to turn the list of lists into a list of tuples. In this way, each customer name in list A is assigned just one tuple.

# Perform fuzzy string matching
tuples_list = [max([(fuzz.token_set_ratio(i,j),j) for j in B_cleaned]) for i in A_cleaned]

The similarity scores are generated by the function token_set_ratio(). After pre-processing the strings (tokenizing, lowercasing, and removing the punctuation), token_set_ratio() computes the Levenshtein distance, controlling for the intersection between the two strings. FuzzyWuzzy’s token_set_ratio() is the preferred ratio when comparing strings of differing lengths, as in our case.

When printing the list of tuples, part of the output looks like the snippet below:

Image by Author. Tuples list of similarity scores and matching strings

Step 3: Export the output to Excel

The final step would be creating a pandas DataFrame and exporting it to Excel. The data frame should consist of:
1) the customer names from list A
2) the matches from list B
3) the similarity scores
To create the data frame, we unpack the list of tuples into two lists. This can be achieved by unzipping the tuples with zip(*) and converting the tuples to lists with the built-in function map().

# Unpack list of tuples into two lists
similarity_score, fuzzy_match = map(list,zip(*tuples_list))
# Create pandas DataFrame
df = pd.DataFrame({"list_A":A_cleaned, "fuzzy match": fuzzy_match, "similarity score":similarity_score})

After creating the DataFrame, we export it to a new Excel workbook called Fuzzy String Matching:

# Export to Excel
df.to_excel("Fuzzy String Matching.xlsx", sheet_name="Fuzzy String Matching", index=False)

Check out the folder in which you saved the Jupyter Notebook, you’ll find the Excel workbook with the data frame. It seems that all the matches have been performed correctly. Part of the worksheet should look like the image below:

Image by Author. Excel worksheet output

Conclusion

Trying to match two lists of strings that don’t match exactly is a challenging task to perform in Excel. This post shows how the daunting task of approximate string matching is made easy using Python. Indeed, the matches can be performed in just one line of code by leveraging the powerful package FuzzyWuzzy and Python’s list comprehensions.

Below, you can find the Jupyter Notebook for easier replication.

About me

I am currently working as an International Finance Graduate at Novozymes. Previously, I was working as a Junior Analyst at Novo Holdings, while completing my MSc in Finance and Investments at Copenhagen Business School. I am passionate about Financial Data Science 📈🐍.

If you have any questions or suggestions regarding this post, you can write lorenzotracanna@gmail.com or connect with me on LinkedIn.

--

--