The world’s leading publication for data science, AI, and ML professionals.

Fuzzy String Matching With Python on Large Datasets, and Why You Should Not Use FuzzyWuzzy

FuzzyCouple: A Solution for Fuzzy Matching Using TF-IDF and Cosine Similarity

All Images in this article created by Author

Why do we need fuzzy string matching, and what are the use cases?

Languages are ambiguous. Text referring to the same thing can be written slightly differently, or even misspelled. Suppose you are trying to join two tables on an address column: the same location appears in table A as "520 Xavier Ave, California City" but as "520 Xavier Avenue, CA" in table B. How would you handle this?

Here are a few more examples of the same content written in different ways:

  • The Queen’s Gambit vs Netflix The Queen’s Gambit (miniseries)
  • Toronto Raptors vs Raptors
  • Los Angeles Lakers vs Lakers
  • [email protected] vs [email protected]
  • Tesla, Inc. vs TSLA

We want to treat them as the same thing before we feed the data to Machine Learning models or apply any other data analysis methods.


Why shouldn't we use FuzzyWuzzy?

When it comes to fuzzy string matching, the first solution data scientists typically reach for is FuzzyWuzzy. The FuzzyWuzzy package is a Levenshtein-distance-based method that is widely used for computing string similarity scores. So why shouldn't we use it? The answer is simple: it is way too slow.

The estimated time for computing similarity scores on a 406,000-entity dataset of addresses is 337 hours. The same dataset took only 21 minutes with another solution, FuzzyCouple, which we will introduce shortly. FYI, my box is a 2019 MacBook Pro with an Intel Core i7 processor.

337 hours vs 21 minutes on a large dataset – is that a strong enough argument to convince you to try FuzzyCouple?


How to implement FuzzyCouple with Python?

FuzzyCouple = TF-IDF + Cosine Similarity

If this is the first time you have heard of FuzzyCouple, you are absolutely right. I coined the term so that this method can be introduced as "I applied FuzzyCouple" instead of "I used TF-IDF as the vectorizer and then computed similarity scores with cosine similarity."

The ARXIV dataset used in this project contains about 41,000 research papers related to machine learning, NER, and computer vision published between 1992 and 2018.

Import libraries:
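A minimal sketch of the core imports the steps below rely on (the plotting libraries, matplotlib and WordCloud, are imported alongside the EDA step):

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
```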

Load the dataset:
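A sketch of this step, assuming the ArXiv dataset is a CSV with a TITLE column; the tiny inline CSV below is a stand-in for the real file:

```python
import io

import pandas as pd

# Stand-in for the real ArXiv CSV; in practice you would pass the
# actual file path, e.g. pd.read_csv('arxiv_papers.csv')
csv_data = io.StringIO(
    'TITLE\n'
    '"Unsupervised Learning of Disentangled Representations from Video"\n'
    '"Supporting Temporal Reasoning by Mapping Calendar Expressions"\n'
)
df = pd.read_csv(csv_data)
print(df.shape)
```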

Create an index column; keep the title column only from the original dataset:
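One way to sketch this step (the ID column name is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    'TITLE': ['Paper about neural networks', 'Paper about temporal reasoning'],
    'ABSTRACT': ['...', '...'],
})

# Keep only the TITLE column and add an explicit index column, so
# matches can be traced back to the original rows later
df = df[['TITLE']].reset_index().rename(columns={'index': 'ID'})
print(df.columns.tolist())
```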

Data Cleaning:

  • replace symbols: \n " ' / ( ) { } | @ , ; # with space
  • remove multiple spaces
  • to lower case
  • remove stop words
import re

text = 'Supporting "Temporal \'Reasoning by Mapping #@{ Calendar Expressions to Minimal\n Periodic Sets'

# Characters to replace with a space (newline, quotes, brackets, punctuation)
REPLACE_BY_SPACE_RE = re.compile(r'[\n"\'/(){}\[\]|@,;#]')
text = re.sub(REPLACE_BY_SPACE_RE, ' ', text)
text = re.sub(' +', ' ', text)   # collapse multiple spaces
text = text.lower()
print(text)
Replace the symbols in REPLACE_BY_SPACE_RE with space

Define a function for text preparation:

Test the function:
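The two steps above, defining and testing the preparation function, might look like this; the stopword list here is a small stand-in for a full list such as NLTK's:

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[\n"\'/(){}\[\]|@,;#]')
# Small stand-in stopword list; a full list (e.g. NLTK's) would be used in practice
STOPWORDS = {'a', 'an', 'and', 'by', 'for', 'in', 'of', 'on', 'the', 'to'}

def text_prepare(text):
    """Replace symbols with spaces, collapse spaces, lowercase, drop stopwords."""
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = re.sub(' +', ' ', text).lower().strip()
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

print(text_prepare('Mapping #@{ Calendar Expressions to the Minimal Periodic Sets'))
```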

Apply the text preparation function to the dataset:
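Applying the function column-wise, sketched on a one-row stand-in DataFrame:

```python
import re

import pandas as pd

REPLACE_BY_SPACE_RE = re.compile(r'[\n"\'/(){}\[\]|@,;#]')
STOPWORDS = {'a', 'an', 'and', 'by', 'for', 'in', 'of', 'on', 'the', 'to'}

def text_prepare(text):  # same function as defined in the previous step
    text = re.sub(' +', ' ', REPLACE_BY_SPACE_RE.sub(' ', text)).lower().strip()
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

df = pd.DataFrame({'TITLE': ['Supporting "Temporal Reasoning by Mapping Calendar Expressions']})
df['TITLE'] = df['TITLE'].apply(text_prepare)
print(df['TITLE'].iloc[0])
```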

As illustrated above, the TITLE column has had stopwords, symbols, and extra spaces removed and has been converted to lower case.

EDA on the paper titles with WordCloud:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud().generate(' '.join(df['TITLE']))

# plot the WordCloud image
plt.figure(figsize=(10, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Image by Author

As shown in the plot, the most common words used in research paper titles are neural, network, learning, analysis, classification, model, using, based, algorithm.

Transform text to vectors with TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=5, token_pattern=r'(\S+)')
tf_idf_matrix = tfidf_vectorizer.fit_transform(df['TITLE'])

Check the vectors:
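A sketch of inspecting the vectors on a toy corpus (max_df and min_df are omitted here only because the toy corpus is too small for them):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    'unsupervised learning disentangled representations video',
    'inferencing based unsupervised learning disentangled representations',
    'neural network classification model',
]
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'(\S+)')
tf_idf_matrix = tfidf_vectorizer.fit_transform(titles)

print(tf_idf_matrix.shape)                       # (n_titles, n_unigram_and_bigram_features)
print(sorted(tfidf_vectorizer.vocabulary_)[:5])  # a few of the learned features
```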

Compute Cosine Similarity:
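scikit-learn's TF-IDF rows are L2-normalized by default, so the cosine similarity of every pair of titles reduces to a single sparse matrix product, which is where the speed comes from. A minimal sketch on a toy corpus (for very large datasets, an optimized top-N sparse multiplication such as the sparse_dot_topn package keeps memory under control):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    'unsupervised learning disentangled representations video',
    'inferencing based unsupervised learning disentangled representations',
    'neural network classification model',
]
tf_idf_matrix = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'(\S+)').fit_transform(titles)

# Rows are already L2-normalised, so cosine similarity of all title
# pairs is just a sparse matrix product
cosine_sim = tf_idf_matrix @ tf_idf_matrix.T
print(cosine_sim.toarray().round(2))
```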

Finished in 1.4 seconds:

Create a match table to show the similarity scores:
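One way to flatten the sparse similarity matrix into a match table, sketched on the same toy corpus; the column names follow the article:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    'unsupervised learning disentangled representations video',
    'inferencing based unsupervised learning disentangled representations',
    'neural network classification model',
]
tf_idf_matrix = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'(\S+)').fit_transform(titles)

# COO format exposes the (row, col, value) triples of the similarity matrix
sim = (tf_idf_matrix @ tf_idf_matrix.T).tocoo()
mask = sim.row != sim.col  # drop trivial self-matches (score 1.0)

matches = pd.DataFrame({
    'TITLE': [titles[i] for i in sim.row[mask]],
    'SIMILAR_TITLE': [titles[j] for j in sim.col[mask]],
    'similarity_score': sim.data[mask],
})
print(matches)
```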

Check random samples from the match table:

Column TITLE holds the strings after stopwords and symbols were removed from the original paper titles. Column SIMILAR_TITLE is the closest match to TITLE, and similarity_score indicates how similar they are, ranging from 0 to 1.

The histogram of similarity_score:

Check the highest-scoring titles:
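Sorting the match table by score surfaces the most similar pairs; a sketch on an illustrative match table (the scores below are made up):

```python
import pandas as pd

# Illustrative match table in the shape built above
matches = pd.DataFrame({
    'TITLE': [
        'unsupervised learning disentangled representations video',
        'neural network classification',
    ],
    'SIMILAR_TITLE': [
        'inferencing based unsupervised learning disentangled representations',
        'classification neural networks',
    ],
    'similarity_score': [0.94, 0.71],
})

# Highest-scoring (most similar) pairs first
top_matches = matches.sort_values('similarity_score', ascending=False)
print(top_matches.head(10))
```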

As shown in the above example, the TITLE ‘unsupervised learning disentangled representations video’ is similar to ‘inferencing based unsupervised learning disentangled representations’, with a high score of 0.94.

Let’s check the corresponding titles in the original papers:
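Because we kept an index column earlier, matched rows can be joined back to the untouched titles; a sketch with hypothetical column names (ID, ORIGINAL_TITLE):

```python
import pandas as pd

# Original dataset (stand-in) keyed by ID, plus a match table that kept the IDs
original = pd.DataFrame({
    'ID': [0, 1],
    'ORIGINAL_TITLE': [
        'Unsupervised Learning of Disentangled Representations from Video',
        'Inferencing Based on Unsupervised Learning of Disentangled Representations',
    ],
})
match = pd.DataFrame({'ID': [0], 'SIMILAR_ID': [1], 'similarity_score': [0.94]})

# Join the match row back to the original titles on both sides
result = (match
          .merge(original, on='ID')
          .merge(original.rename(columns={'ID': 'SIMILAR_ID',
                                          'ORIGINAL_TITLE': 'SIMILAR_ORIGINAL_TITLE'}),
                 on='SIMILAR_ID'))
print(result[['ORIGINAL_TITLE', 'SIMILAR_ORIGINAL_TITLE', 'similarity_score']])
```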

As we have seen, they are both research papers about disentangled representations with unsupervised learning algorithms.


In this article, we introduced FuzzyCouple, a super-fast solution for fuzzy string matching at scale. As mentioned at the beginning, text and languages can be ambiguous. FuzzyCouple is an efficient and practical method for identifying the "same thing" in unstructured data.

Reference:

  • Super Fast String Matching in Python by Ven Dan

Sign up for Udemy course 🦞:

Recommender System With Machine Learning and Statistics

https://www.udemy.com/course/recommender-system-with-machine-learning-and-statistics/?referralCode=178D030EF728F966D62D
