A Journey into BigQuery Fuzzy Matching — 4 of [1, ∞) — A Tribute to FuzzyWuzzy

Brian Suk
Towards Data Science
Nov 19, 2019


This is part of an ongoing series, and there’s other goodness lurking out there! All the fuzzy goodness!

Okay, I know in the last article I said we would use this one to start going into adding address elements and match groups (and I promise they’re still on their way), but I wanted to take a slight detour and add another set of functions before we get there.

Let’s revisit the Levenshtein Distance function. Earlier we used it to calculate the edit distance between two strings. It is designed to determine the number of changes that it takes for one string to become another, helping us figure out if two strings were actually supposed to be the same. We saw how we can use it with words, but what happens when we start to use these on increasingly complex strings?

Before we go into some examples, let’s first establish how to calculate the similarity ratio to more easily understand these comparisons. It’s a simple ratio of the common characters over the total characters.

((|a| + |b|) - ldist(a, b)) / (|a| + |b|)
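
For example, "kitten" and "sitting" are an edit distance of 3 apart, so the ratio works out to:

((6 + 7) - 3) / (6 + 7) = 10 / 13 ≈ 0.77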

Let’s create a function to calculate this since we will be using this pretty frequently.

CREATE OR REPLACE FUNCTION
dq.dq_fm_ldist_ratio(a STRING, b STRING) AS (
/*
* Data Quality Function - Fuzzy Matching
* dq_fm_ldist_ratio
* input: Two strings to compare.
* returns: The Levenshtein similarity ratio.
*/
(LENGTH(a) + LENGTH(b) - `dq.dq_fm_LevenshteinDistance`(a, b))
/ (LENGTH(a) + LENGTH(b))
);
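
As a quick sanity check (this assumes the `dq.dq_fm_LevenshteinDistance` function from the earlier article is already in place), the example from above comes out as expected:

SELECT
`dq.dq_fm_ldist_ratio`('kitten', 'sitting') AS similarity_ratio;
# Returns roughly 0.77, since the edit distance is 3.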

To see how these functions behave with increasingly complex strings, we’re going to need a dataset to test this out with. We are going to use a dataset of room descriptions from Expedia and Booking.com. We’ll load this into a new dataset (I’ll be using fuzzy_matching) and take a quick look.

So what are we supposed to call these, then?

These string pairs represent the same room, but you can see that they might be called different things. When we dig deeper through the data, we see some more complex combinations.

a != b

So let’s try using the tools that we have and apply them to this dataset to see how they fare (spoiler alert: it won’t be accurate).

Well, it figured one of them out at least.

Now, before we go any further, it’s important to note that the Levenshtein Distance can be implemented in two different ways, depending on how you weight the edit operations. Some implementations treat a substitution as a single operation, weighted the same as an insertion or a deletion. Other implementations give substitutions a weight of two (one deletion plus one insertion). For example, turning “cat” into “bat” costs one edit under the first scheme and two under the second. When comparing implementations it’s important to note how substitutions are weighted, or you may end up with differences in calculations you can’t quite explain. In ours, substitutions have a weight of one.

Back to the results. As expected, the edit distance is very high since we are treating the entire string as a single token. Because of that, the match ratios are going to be quite poor. What we need to do is tokenize these strings, and perform smarter comparisons with them.

Enter FuzzyWuzzy. The string matching library, not the bear who had no hair.

FuzzyWuzzy is a Python library that takes strings, tokenizes them, and then matches them in different ways. It’s a very powerful package, but BigQuery can’t run Python. So let’s see if we can’t replicate some of this functionality in SQL!

Let’s start off with the token sort ratio. This is handy when you have string tokens that are fairly similar but out of order. The token sort ratio breaks the strings up into individual tokens, sorts them alphabetically, concatenates them back into a single string, and then calculates the ratio on that. To accomplish this, we are first going to create a couple of helper functions that split, clean, and then sort the tokens.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_clean_tokenize(a STRING) AS(
/*
* (Helper) Data Quality Function
* dq_hf_gh_clean_tokenize
* This function uppercases the input, removes all non-alphanumeric
* characters, and splits it on spaces.
* input: Uncleaned string
* returns: Array of cleaned tokens.
*/
ARRAY(
SELECT
tokens
FROM
UNNEST(
SPLIT(
REGEXP_REPLACE(
UPPER(
TRIM(a)
), '[^a-zA-Z0-9 ]+', ''
), ' '
)
) tokens
)
);
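
Running a single room description through it shows what the cleanup does; the output here is just what the regex and split above produce:

SELECT
`dq.dq_hf_gh_clean_tokenize`('Deluxe Room, 1 Queen Bed (High Floor)') AS tokens;
# ['DELUXE', 'ROOM', '1', 'QUEEN', 'BED', 'HIGH', 'FLOOR']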

Add another function to sort it.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_clean_sort_tokenize(a STRING) AS(
/*
* (Helper) Data Quality Function
* dq_hf_gh_clean_sort_tokenize
* input: Uncleaned string
* returns: Array of cleaned and sorted tokens.
*/
ARRAY(
SELECT
x
FROM
UNNEST(`dq.dq_hf_gh_clean_tokenize`(a))
AS x
ORDER BY x)) ;
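
Calling the sorted version on the same description gives us the alphabetized tokens we’ll be comparing:

SELECT
`dq.dq_hf_gh_clean_sort_tokenize`('Deluxe Room, 1 Queen Bed (High Floor)') AS sorted_tokens;
# ['1', 'BED', 'DELUXE', 'FLOOR', 'HIGH', 'QUEEN', 'ROOM']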

From here we can re-assemble the arrays, and do our comparison.

CREATE OR REPLACE FUNCTION
dq.dq_fm_ldist_token_sort_ratio(a STRING, b STRING) AS(
/*
* Data Quality Function - Fuzzy Matching
* dq_fm_ldist_token_sort_ratio
* input: Two strings to compare.
* returns: The Levenshtein similarity ratio with sorted tokens.
*/
`dq.dq_fm_ldist_ratio`(
ARRAY_TO_STRING(`dq.dq_hf_gh_clean_sort_tokenize`(a),''),
ARRAY_TO_STRING(`dq.dq_hf_gh_clean_sort_tokenize`(b), '')));

With these two in place, let’s go ahead and add this to the dataset to see how it behaves. We are looking for entries with a high edit distance (indicating that there is a high number of raw replacements as everything is out of order) and a high sort ratio, which means the individual tokens are fairly similar.
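
Here is roughly what that query could look like. I’m assuming the two description columns were loaded as expedia and booking; adjust the names to match however you loaded the file:

SELECT
  expedia,
  booking,
  `dq.dq_fm_LevenshteinDistance`(expedia, booking) AS edit_distance,
  `dq.dq_fm_ldist_ratio`(expedia, booking) AS ldist_ratio,
  `dq.dq_fm_ldist_token_sort_ratio`(expedia, booking) AS token_sort_ratio
FROM
  fuzzy_matching.room_data
ORDER BY
  token_sort_ratio DESC;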

We’re starting to get some better matches!

These results show that you can use this to match strings with out-of-order tokens, while still allowing some tolerance for character changes.

What happens when the strings have some commonality between them, but both the token length and the actual content of the rest of the strings widely vary? This is where token sets come in! The folks at SeatGeek, who developed and open sourced the FuzzyWuzzy Python package, have put up a really good description of it on their blog. I recommend reading their blog post to get a greater understanding of how this all works, but here is the gist of it.

1 — For strings a and b, tokenize both strings.
2 — Create three strings:
t0 = sorted intersection of tokens from a and b
t1 = t0 + sorted remainder of tokens from a
t2 = t0 + sorted remainder of tokens from b
3 — Return the highest ratio from ratio(t0, t1), ratio(t0, t2), and ratio(t1, t2)

Why does this work? As the folks at SeatGeek explain it, “the intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar.” Let’s try this with one pair of room descriptions, and see how this works.

a = "Deluxe Room, 1 Queen Bed (High Floor)"
b = "Deluxe Queen Room — High Floor With Free Wi-Fi"

Cleaning, sorting, and tokenizing them gives us the following:

token_a =
['1','BED','DELUXE','FLOOR','HIGH','QUEEN','ROOM']
token_b =
['DELUXE','FI','FLOOR','FREE','HIGH','QUEEN','ROOM','WI','WITH']

The tokens that intersect are ‘DELUXE’, ‘FLOOR’, ‘HIGH’, ‘QUEEN’, and ‘ROOM’. This gives us the following three strings and comparisons:

t0 = 'DELUXEFLOORHIGHQUEENROOM'
t1 = 'DELUXEFLOORHIGHQUEENROOM1BED'
t2 = 'DELUXEFLOORHIGHQUEENROOMFIFREEWIWITH'
`dq.dq_fm_ldist_ratio`(a, b) = 0.64
`dq.dq_fm_ldist_ratio`(t0, t1) = 0.92
`dq.dq_fm_ldist_ratio`(t0, t2) = 0.8
`dq.dq_fm_ldist_ratio`(t1, t2) = 0.83

So this gives us a confidence score of 0.92, which is much higher than the original 0.64. This is possible because the intersection makes up the majority of one of the original strings, so naturally, that comparison is going to score high. If the intersection is small, but the remainders of the tokens are still similar, comparing t1 and t2 still lets us reach a high confidence score. We are covered from multiple angles!

Let’s now figure out how to put this into BigQuery functions. We already have a function from earlier that cleans, tokenizes, and sorts the strings. We now need to construct something that finds the sorted intersection.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_find_array_intersection(a ARRAY<STRING>, b ARRAY<STRING>) AS(
/*
* (Helper) Data Quality Function
* dq_hf_gh_find_array_intersection
* input: Two arrays to compare
* returns: Array with the common elements
*/
ARRAY(
SELECT
*
FROM
UNNEST(a) AS a
INTERSECT DISTINCT
SELECT
*
FROM
UNNEST(b) AS b )) ;

We’re also going to need something that calculates the difference between the two sets of tokens.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_find_array_difference(a ARRAY<STRING>, b ARRAY<STRING>) AS(
/*
* (Helper) Data Quality Function
* dq_hf_gh_find_array_difference
* input: Two arrays to compare
* returns: Array with elements a - b.
*/
ARRAY(
SELECT
*
FROM
UNNEST(a) AS a
EXCEPT DISTINCT
SELECT
*
FROM
UNNEST(b) AS b )) ;
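
Before wiring everything together, we can spot-check these helpers against the earlier example pair. The intersection and the “A minus B” remainder should line up with the walkthrough above:

SELECT
`dq.dq_hf_gh_find_array_intersection`(
  `dq.dq_hf_gh_clean_sort_tokenize`('Deluxe Room, 1 Queen Bed (High Floor)'),
  `dq.dq_hf_gh_clean_sort_tokenize`('Deluxe Queen Room - High Floor With Free Wi-Fi')) AS intersection,
`dq.dq_hf_gh_find_array_difference`(
  `dq.dq_hf_gh_clean_sort_tokenize`('Deluxe Room, 1 Queen Bed (High Floor)'),
  `dq.dq_hf_gh_clean_sort_tokenize`('Deluxe Queen Room - High Floor With Free Wi-Fi')) AS a_minus_b;
# intersection: DELUXE, FLOOR, HIGH, QUEEN, ROOM
# a_minus_b: 1, BED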

And now we just need to create the strings and find the maximum value of the three comparisons.

CREATE OR REPLACE FUNCTION
dq.dq_fm_ldist_token_set_ratio(a STRING,
b STRING) AS(
/*
* Data Quality Function - Fuzzy Matching
* dq_fm_ldist_token_set_ratio
* input: Two strings to compare.
* returns: The Levenshtein similarity of the maximum ratio
* between the different token sets.
*/
ARRAY(
SELECT
MAX(x)
FROM
UNNEST( [
# First ratio: sorted intersection (t0) vs intersection + A diff B (t1)
`dq.dq_fm_ldist_ratio`(
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),'')
,
CONCAT(
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),'')
,
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),'')))
,
# Second ratio: sorted intersection (t0) vs intersection + B diff A (t2)
`dq.dq_fm_ldist_ratio`(
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),'')
,
CONCAT(
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),'')
,
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
`dq.dq_hf_gh_clean_sort_tokenize`(b),
`dq.dq_hf_gh_clean_sort_tokenize`(a)),'')))
,
# Third ratio: intersection + A diff B (t1) vs intersection + B diff A (t2)
`dq.dq_fm_ldist_ratio`(
CONCAT(
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),'')
,
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),''))
,
CONCAT(
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
`dq.dq_hf_gh_clean_sort_tokenize`(a),
`dq.dq_hf_gh_clean_sort_tokenize`(b)),'')
,
ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
`dq.dq_hf_gh_clean_sort_tokenize`(b),
`dq.dq_hf_gh_clean_sort_tokenize`(a)),'')))
] ) AS x)[OFFSET(0)]
);
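
To sanity-check it against the walkthrough above, the example pair should land at the highest of the three ratios:

SELECT
`dq.dq_fm_ldist_token_set_ratio`(
  'Deluxe Room, 1 Queen Bed (High Floor)',
  'Deluxe Queen Room - High Floor With Free Wi-Fi') AS token_set_ratio;
# Roughly 0.92, driven by the t0 vs t1 comparison.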

Let’s now run this on our room_data table and see what we get.

Hooray! Matches!

With a few functions, you’re now able to perform some more sophisticated string comparisons. By performing edit distance comparisons on token sets, you can go beyond single-word matching and look at broader entities. This is useful not only for things like products and descriptions; it can also be applied to addresses. Address matching is a huge part of entity resolution, and matching methods like this are an important part of that. Let’s take a quick peek at what that looks like.

We’ll use the address of the Empire State Building, and purposefully alter the address in a few different ways. The listed address is 20 W 34th St, New York, NY 10001.

These are okay.

These matches are pretty decent considering we are just looking at pure token string comparisons. With these examples, we are getting a lot of variance because the strings are short, and changing something as small as “St” to “Street” can have a big impact.

If the domain of the entity is known beforehand, such as “we know this is going to be address data,” then it helps to pre-process and standardize some of the tokens. For example, with addresses in the United States we can usually assume that “St” and “Street” are the same, “Ave” and “Avenue” are the same, and so on. This kind of domain-specific pre-processing can help increase the match percentages between two sets of data!
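
As a rough sketch of what that could look like, here is a hypothetical helper (it isn’t part of the function set above, and it only handles two abbreviations) that expands a couple of common suffixes before matching:

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_standardize_street_suffix(a STRING) AS(
/*
* (Helper) Data Quality Function - hypothetical sketch
* dq_hf_gh_standardize_street_suffix
* Expands a couple of common US street suffix abbreviations so
* that "St"/"Street" and "Ave"/"Avenue" compare as equals.
* input: Uncleaned address string
* returns: Address string with expanded suffixes
*/
REGEXP_REPLACE(
  REGEXP_REPLACE(
    UPPER(TRIM(a)), r'\bST\b', 'STREET'),
  r'\bAVE\b', 'AVENUE')
);

Feeding the standardized strings into the match functions, instead of the raw ones, should tighten up the ratios for pairs that only differ in those abbreviations.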

And with that, we’ve got a few more additions to the fuzzy matching toolbox. Hopefully, this was a worthy tribute to FuzzyWuzzy. With these building blocks, we can start to apply them to complex datasets and use them to build some basic match groups. We’ll start to look at those in the next round, so stay tuned!
