This scenario came about when I had to match product reviews from a few Malaysia-based online shopping platforms. And what makes us Malaysians special is how we spell and give product reviews.

I have decided to share this challenge that I came across, together with a few scenarios that may make this topic relevant for others.
Imagine if today, you had a requirement to join 2 or more tables from different sources because of one of these scenarios:
- 2 companies merged and there is a need to combine their product data for a stock count.
- You are comparing reviews of a list of products from different apps.
- You are collecting feedback from a form and trying to combine similar words into one.
And one of the issues is that you have no control over the data: you could not get the sources to re-index their product IDs or keys because the business requires the results asap, or perhaps it is just for reporting purposes and the business has no intention of re-indexing, to avoid service disruption.
Magic portal to help you jump to your desired topic on this page
- Today’s challenge
- What is Levenshtein Distance and Jaccard Index
- What each algorithm solves and fails to solve
- UDF in BigQuery for Jaccard Index and Levenshtein
- Today’s challenge Jaccard Index use case
- Takeaway points
Today’s challenge
To give you more insight into our challenge today, here is an example of the 2 tables we will be trying to join. We want to get the sum of occurrences for each of the reviews. And instead of Levenshtein Distance, we will be focusing on the Jaccard Index for the sake of this introduction. While these may seem nonsensical, they are legit reviews that I pulled from Malaysia's top few online shopping platforms.


Doing a quick Google search will turn up a common way to solve this challenge, which is to use an algorithm called Levenshtein Distance. While Levenshtein Distance works great for most scenarios, I would like to introduce another algorithm called the Jaccard Index (also known as Jaccard similarity). This is not a debate about the "better" or "perfect" algorithm, but an introduction to another approach. Before diving into the practical part, we need to know what both the Levenshtein Distance and Jaccard Index algorithms solve.
What is Levenshtein Distance and Jaccard Index
In layman's terms, Levenshtein calculates how many steps/edits it takes to change one string into the other, and Jaccard calculates how much the first string intersects with the second string. I have illustrated the difference between how both algorithms function below.

In the illustration above, Levenshtein has to perform a deletion on the extra "a" in "thaanks", thus incurring one edit to make the first string identical to the second string. Jaccard, on the other hand, just compares whether each unique character exists in the other string. Jaccard does not take the number of "a"s into consideration, therefore the two "a"s in "thaanks" are treated as a single "a".
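To make the set-based comparison concrete, here is a minimal JavaScript sketch of the character-level Jaccard similarity described above (the function name `jaccardChars` is my own, for illustration only):

```javascript
// Character-level Jaccard similarity: |A ∩ B| / |A ∪ B|,
// where A and B are the sets of unique characters in each string.
function jaccardChars(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  const intersection = [...setA].filter((ch) => setB.has(ch)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 0 : intersection / union;
}

console.log(jaccardChars("thaanks", "thanks")); // extra "a" is ignored -> 1
```

Because both words collapse to the same set of unique characters, the extra "a" has no effect on the score.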
What each algorithm solves and fails to solve
Levenshtein distance is one of the many algorithms that some auto-correct software uses.

"thank" and "thaank" have an edit distance of 1, so the auto-correct software suggests "thank" as the alternative. But in the event the user decides to input "thaaaaaaaaaaanks", auto-correct software using Levenshtein distance may not be able to solve it due to the large edit distance (a 10-step deletion of "a"s to make it "thanks").

As shown in the screenshot above, my browser did not suggest any alternatives, as the word I input probably has a huge edit distance from every word in its dictionary. I am using my Chromium-based browser as an example of how Levenshtein behaves; however, I am not suggesting that they are using Levenshtein Distance as their main algorithm, and they may well have more complicated algorithms to handle grammar and different languages.
So how will Jaccard handle "thaaaaaaaaaaanks" against "thanks"? First, both words are made into sets of unique characters, and the set of unique characters for both strings is "thanks". Intersecting "thanks" with "thanks" results in all the characters intersecting, or a similarity of 1. However, Jaccard may not return a desirable result when two different words have the same set of unique characters. Take "good" and "god" as an example: both strings have the unique characters "god", so we get 1 as the result since all the characters intersect.
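To see both failure modes side by side, here is a small dynamic-programming Levenshtein sketch in JavaScript (my own illustration, not the bqutil implementation):

```javascript
// Classic dynamic-programming Levenshtein edit distance:
// dp[i][j] = minimum edits to turn a[0..i) into b[0..j).
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1; // substitution cost
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // substitution or match
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(levenshtein("thaaaaaaaaaaanks", "thanks")); // 10 deletions
console.log(levenshtein("good", "god")); // 1 edit, yet Jaccard calls them identical
```

The long "thaaaaaaaaaaanks" is expensive for Levenshtein but trivial for Jaccard, while "good" vs "god" is the exact opposite.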
Now that you have an idea of how both algorithms function, let us see how we can apply the Jaccard algorithm for today’s challenge.
UDF in BigQuery for Jaccard Index and Levenshtein
Or you can use the same logic and apply it to other databases.
What is a UDF in BigQuery? User Defined Functions (UDFs), which can be written in SQL or JavaScript, allow you to create functions in BigQuery much like in programming languages.
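As a quick illustration (a trivial sketch of my own, not part of the challenge), a temporary SQL UDF in BigQuery looks like this:

```sql
-- A minimal temporary SQL UDF: prepends a greeting to a name.
CREATE TEMP FUNCTION greet(name STRING) AS (
  CONCAT('Hello, ', name)
);

SELECT greet('Malaysia');  -- 'Hello, Malaysia'
```

JavaScript UDFs use the same `CREATE TEMP FUNCTION` statement with a `LANGUAGE js` clause, which is what we will use for the Jaccard Index below.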
We are going to create the Jaccard Index algorithm in JavaScript and compare the results with Levenshtein Distance. bqutil is a project from Google Cloud Platform that hosts several pre-built functions that we can use. Their GitHub repository can be found here. At the time of writing, I have created a pull request to add the Jaccard Index as a bqutil UDF for public use.
A temporary function called jaccard will be created that accepts 2 strings and returns a float ranging from 0–1. If a character from the first string sa exists in the second string sb, the intersection size is increased by 1. Once that loop is completed, the index is calculated by adding the sizes of both strings' unique-character sets and subtracting the intersection size to get the union size, then dividing the intersection size by the union size to get the proportion of characters intersecting. Finally, a 2 decimal point float is returned.
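Since the original snippet is not reproduced here, below is a sketch of what such a temporary JavaScript UDF could look like, following the logic described above (my own reconstruction, not the exact bqutil code):

```sql
CREATE TEMP FUNCTION jaccard(sa STRING, sb STRING)
RETURNS FLOAT64
LANGUAGE js AS r"""
  // Build sets of unique characters for both strings.
  var setA = new Set(sa);
  var setB = new Set(sb);
  // Count how many unique characters of sa also appear in sb.
  var intersectSize = 0;
  setA.forEach(function (ch) {
    if (setB.has(ch)) intersectSize += 1;
  });
  if (intersectSize === 0) return 0.0;
  // Union size = |A| + |B| - |A ∩ B|; Jaccard = intersection / union.
  var unionSize = setA.size + setB.size - intersectSize;
  return parseFloat((intersectSize / unionSize).toFixed(2));
""";

SELECT jaccard('thanks', 'thaaaaaaaaaaanks') AS jaccard_similarity;  -- 1.0
```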

4 test cases are then run to compare the results between the Jaccard Index and Levenshtein Distance. As shown, Jaccard is able to match the first and second examples (the "thanks" and "best product" examples) with a similarity of 1, or 100%, whereas Levenshtein states that 10 and 8 edits respectively are required to correct the second string into the first. On the third example of "doge" and "dodge", Jaccard indicates that both are identical words, which we know they are not, while Levenshtein indicates that 1 edit is required to correct "dodge" to "doge" and vice versa.
Today’s challenge Jaccard Index use case
- Create a UDF
- Create the 2 tables to emulate our challenge today
- Cross join both tables so we can calculate the distance of each review from the second table against each review from the first table
For those who are still confused about cross joins, I have illustrated one below based on our current challenge.

A cross join matches each record from one table with each record from another table. We then run the jaccard function against each pair and only keep records that have a Jaccard index of more than 0.8, or 80% similarity.
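Putting the steps together, the join itself could look something like this (the table and column names `t1`, `t2`, `review`, and `occurrence` are my assumptions, since the original query is not reproduced here):

```sql
-- Match every review in t2 against every review in t1,
-- keep only pairs that are more than 80% similar,
-- then sum the occurrences per canonical review.
SELECT
  t1.review AS canonical_review,
  SUM(t2.occurrence) AS total_occurrence
FROM t1
CROSS JOIN t2
WHERE jaccard(t1.review, t2.review) > 0.8
GROUP BY t1.review;
```

Note that a cross join produces |t1| × |t2| pairs, so this approach is only practical when at least one of the tables is small, such as a slang dictionary.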

Using the Jaccard Index with a threshold of 80% similarity worked well for our dataset. Each record from the second table has its occurrence count added correctly to the first table.
Takeaway points
- There is no competition for the "best" algorithm
- Each algorithm has its own use case
- The Jaccard Index handles very specific cases and should not be generalized as a "go-to" algorithm for every Fuzzy Matching project without consideration
Using the Jaccard Index, I was able to selectively correct at least three-quarters of the reviews in my dataset based on a customized "Malaysian slang" dictionary that I defined as my t1 in the challenge above. I hope this introduces you, the readers, to another option to look into if Levenshtein does not work in your favor, just like it did not for me.