Finding Needles in a Haystack — Search Indexes for Jaccard Similarity

From basic concepts to exact and approximate indexes

Published in

Towards Data Science

15 min readAug 18, 2023

Finding Needles in a Haystack. Image by the author using Midjourney.

Vector databases are in the news for being the external memory of large language models (LLMs). The vector databases today are new systems built on decade-old research called approximate nearest neighbor (ANN) indexes. These indexing algorithms takes many high-dimensional vectors (e.g., float32[]), and built a data structure that supports finding the approximate neighbors of a query vector in the high-dimensional space. It is like Google Map finds your neighbors’ houses given your house’s latitude and longitude, except the ANN indexes operate in a much higher dimensional space.

This line of research has a history that dates back decades. In the late 90s, ML researchers were hand-crafting numerical features for multimedia data such as image and audio. Similarity search based on these feature vectors became a natural problem. For a while, researchers crowded this area. This academic bubble broke when a seminal paper, When is “Nearest Neighbor” Meaningful?, basically told everyone stop wasting time because nearest neighbor in high-dimensional space of hand-crafted features is mostly not meaningful — an interesting topic for another post. Still to this day, I keep seeing research papers and vector database benchmarks publishing performance numbers on the SIFT-128 dataset, which consists of exactly the hand-crafted feature vectors with meaningless similarity.

Despite the noise around hand-crafted features, there has been a fruitful line of research focusing on one high-dimensional data type with meaningful similarity: set and Jaccard.

In this post I cover search indexes for Jaccard similarity over sets. I will start with the basic concepts, and then move on to exact and approximate indexes.

Set and Jaccard

A set is just a collection of distinct elements. The songs you liked on Spotify is a set; the tweets your retweeted last week is a set; and the distinct tokens extracted from this blog post also form a set. Set is a natural way to represent data points in application scenarios like music recommendation, social network, and plagiarism detection.

Let’s say on Spotify, I follow these artists:

[the weekend, taylor swift, wasia project]

and my daughter follows these artists:

[the weekend, miley cyrus, sza]

A reasonable way to measure the similarity of our music tastes is to see how many artists we both follow — the intersection size. In this case we both follow the weekend, so the intersection size is 1.

Each set represents a user’s following list. The intersection shows the common followings shared by both users. Image by the author.

However, you could imagine another pair of users each follows 100 artists, and the intersection size is also 1, but the similarity in their tastes should be much smaller than the similarity between my daughter’s and mine. To make the measurement comparable across different pairs of users, we normalize the intersection size with the union size. This way, the similarity between my daughter’s and my followings is 1 / 5 = 0.2, and the similarity between the other pair of users’ followings is 1 / 199 ~= 0.005. This is called Jaccard similarity.

For a set A and a set B, the formula for Jaccard similarity is:

Jaccard similarity formula for set A and B.

Why is set a high-dimensional data type? A set can be encoded as a “one-hot” vector, whose dimensions 1-to-1 map to all possible elements (e.g., all artists on Spotify). A dimension in this vector has a value of 1 if the set contains the element corresponding to this dimension, and 0 otherwise. So, the vectorized set of my followed artists look like the following:

High-dimensional vector representation of the set of followings. Image by the author.

where the second, third, and the third from the last dimensions are the weekend, taylor swift, and wasia project. There are over 10 million artists on Spotify, so a vector like this is extremely high dimensional and very sparse — most dimensions are 0s.

Inverted Indexes for Jaccard Search

People want to find things quickly, so computer scientists invented data structures called indexes to make search performance satisfactory for software applications. Specifically, a Jaccard search index is built on a collection of sets, given a query set, it returns k sets that have the highest Jaccard with the query set.

Search index for Jaccard is based on a data structure called inverted index. An inverted index has an extremely simple interface: input a set element, say the weekend, it returns a list of IDs of sets that contain the input element, e.g., [ 32, 231, 432, 1322, ...]. The inverted index is essentially a lookup table whose keys are all possible set elements, and values are lists of set IDs. In this example, each list in the inverted index represents the IDs of the followers of an artist.

Inverted index contains lists of set IDs matching the query set. Image by the author.

Original sets are stored in a separate table for lookup by their set IDs. Image by the author.

You can see the reason why this is called “inverted index”: it allows you to go from a set element to find sets that contain the element.

Exact Search Algorithm

Inverted index is incredibly powerful data structure for speeding up search. Using inverted index, when search, instead of going through all the sets and compare each with the query set — very expensive if you have millions of sets, you only need to process the IDs of the sets that share at least one element with the query set. You can obtain the set IDs directly from the inverted index lists.

This idea is implemented by the following search algorithm:

def search_top_k_merge_list(index, sets, q, k):
    """Search top-k Jaccard using inverted index.

    Args:
        index: an inverted index, key is set element
        sets: a lookup table for sets, key is set ID
        q: a query set
        k: search parameter k

    Returns:
        list: at most k set IDs.
    """
    # Intialize an empty lookup table for candidates.
    candidates = defaultdict(0)

    # Iterate over set elements in q.
    for x in q:
        ids = index[x]  # Get a list of set IDs from the index.
        for id in ids:
            candidates[id] += 1  # Increment count for intersection size.

    # Now candidates[id] stores the intersection size of set with ID id.

    # A simple routine for calculating Jaccard using intersection size and
    # set sizes, based on Inclusion-Exclusion principle.
    jaccard = lambda id: candidates[id] / (len(q) + len(sets(id) - candidates[id]))

    # Find the top-k candidates order by Jaccard.
    return sorted(list(candidates.keys()), key=jaccard, reverse=True)[:k]

In plain English, the algorithm walks through every inverted index list matched by elements in the query set and uses a candidate table to keep track of the number of times each set ID appears. If a set ID appears n times, the indexed set has n overlapping elements with the query set. In the end, the algorithm uses all the information in the candidate table to calculate Jaccard similarities, and then returns the IDs of the top-k most similar sets.

The candidate table in the **search_top_k_merge_list** algorithm is used to keep track of the overlap counts of indexed sets found through the inverted index.

The search_top_k_merge_list algorithm can be fast when: (1) the number of elements in the query set is small and (2) the number of IDs in the inverted index lists for the query elements are small. In the Spotify scenario, this could be the case if most people follow a few artists (likely true) and all artists have more-or-less the same number of followers (not true). We all know it is a fact that a few top artists are followed by most people and most artists have few followers. After all, the music industry follows the Pareto Distribution.

Taylor Swift has 78 million followers on Spotify, and The Weekend has 67 million. Having them on my following list means the search_top_k_merge_list algorithm will need to walk through at least 145 million set IDs and the candidate table candidates will grow to this astronomical size. Despite the fact that computers today are fast and powerful, on my Intel i7 machine, creating a table of this size still takes at least 30 seconds (Python) and dynamically allocates 2.5 GB of memory.

Most people follow some of these super star artists. So, if you use this algorithm in your search application, you will for sure get a gargantuan cloud hosting bill for large resource usage and terrible user experience for high search latency.

Branch and Bound Optimization

Intuitively, the previous algorithm search_top_k_merge_list processes all potential candidates in a breadth-first fashion because it only uses the inverted index to compute intersection. This algorithm performs poorly due to super star artists with millions of followers.

Another approach is to be more selective about potential candidates. Imagine interviewing candidates for a job and you are the hiring manager. You cannot afford to interview all potential candidates who send you CVs, so you sort the candidates into buckets based on your job criteria and start interviewing candidates who hit the criteria your care about the most. As you are interviewing them one by one you evaluate whether each one actually hit all or most of your criteria and stop interviewing when you found someone.

This approach also works when it comes to finding similar sets of followed artists. The idea is that you want to start from the artist with the least number of followers in your query set. Why? Simply because those artists give you fewer candidate sets to work with, so that you can process less lists of artists from the inverted index and find your best k candidates more quickly. In my Spotify’s following list, wasian project only has 1 million followers — way less than taylor swift. Those much smaller number of people that follows wasian project has the same potential to be in the best k candidates than those much larger number of people that follow taylor swift.

The key insight here is that we do not want to process all potential candidate lists but stop when we processed enough. The tricky part is to know when to stop. The following is a modified version of the previous algorithm that implements the idea.

import heapq


def search_top_k_probe_set(index, sets, q, k):
    # Initialize a priority heap to store the current top-k candidates.
    heap = []

    # Initialize a set for tracking probed candidates.
    seen_ids = set()

    # Iterate over elements in q from the least to the most frequent based
    # on the lengths of their lists in the inverted index.
    for i, x in enumerate(sorted(q, key=lambda x: len(index[x]))):
        ids = index[x] # Get a list of set IDs from the index.
        for id in ids:
            if id in seen_ids:
                continue  # Skip seen candidate.
            s = sets[id]
            intersect_size = len(q.intersection(s))
            jaccard = intersect_size / (len(q) + len(s) - intersect_size)
            # Add the candidate to the priority heap.
            if len(heap) < k:
                heapq.heappush(heap, (jaccard, id))
            else:
                # Only candidates with higher Jaccard than the k-th
                # current candidate will be added in this operation.
                heapq.heappushpop(heap, (jaccard, id))
            seen_ids.add(id)
        # If any new candidate from the remaining lists cannot have higher
        # Jaccard than any of the current best k candidates, we do not need
        # to do any more work.
        if (len(q) - i - 1) / len(q) (<= min(heap)[0]:
            break

    # Return the best k candidates.
    return [id for _, id in heapq.nlargest(k, heap)]

The search_top_k_probe_set algorithm computes Jaccard similarity for every new candidate it found. It keeps track of the current best k candidates at all times, and it stops when the upper bound Jaccard of any new candidate is no greater than the minimum Jaccard of the current best k candidates.

The **search_top_k_probe_set** algorithm walks through the inverted index lists and computes the Jaccard similarity for every candidate set it encounters and keeps track of the current top-k candidate sets. It stops when the maximum Jaccard similarity of any set in the unseen lists is no greater than the minimum similarity of the current top-k candidates. Image by the author.

How to compute the upper bound Jaccard? After processing n lists of candidates, for any unseen candidate, their maximum intersection with the query set is at most equal to the number of remaining unprocessed lists: |Q|-n. We are giving it the most benefit of doubt by saying that such candidate may show up in every single one of the |Q|-n remaining lists. Now we can use simple math to derive the upper bound Jaccard of such candidate X.

The formula for calculating the upper bound of Jaccard similarity between an unseen candidate indexed set X and query set Q, after walking through n lists of candidates.

This clever technique is called Prefix Filter in set similarity search research literature. I wrote a paper about it that goes into much more details and further algorithmic optimizations. I also created a Python library SetSimilaritySearch that implements a much more optimized version of the the search_top_k_probe_set algorithm that also supports cosine and containment similarity measures.

Approximate Index for Jaccard Search

In the last section, I explained two search algorithms that work on inverted index. These algorithms are exact, meaning that the k best candidates they return are the true k best candidates. Sounds trite? Well, this is a question we should ask ourselves whenever we design search algorithms on large-scale data, because in many scenarios it is not necessary to get the true k best candidates.

Think about the Spotify example again: do you really care if the result of a search may miss a few people with similar taste as yours? Most people understand that in everyday applications (Google, Spotify, Twitter, etc.), the search is never exhaustive or exact. These applications are not mission critical enough to justify exact search. This is why the most widely used search algorithms are all approximate.

There are mainly two benefits of using approximate search algorithms:

Faster. You can cut many corners if you no longer need exact results.
Predicable resource consumption. This one is less obvious, but for several approximate algorithms, their resource usage (e.g., memory) can be configured a priori independent of data distribution.

In this post, I write about the most widely used approximate index for Jaccard: Minwise Locality Sensitive Hashing (MinHash LSH).

What is LSH?

Locality Sensitive Hashing indexes are true wonders in computer science. They are algorithmic magic powered by number theory. In machine learning literature, they are k-NN models, but unlike typical machine learning models, LSH indexes are data-agnostic so their accuracy conditioned on similarity can be determined a priori before ingesting new data points or changing data distribution. So, they are more similar to inverted indexes than a model.

An LSH index is essentially a set of hash tables each with a different hash function. Just like a typical hash table, an LSH index’s hash function takes a data point (e.g., a set, a feature vector, or an embedding vector) as input, and outputs a binary hash key. Except for this, they cannot be more different.

A typical hash function outputs keys that are pseudo-randomly and uniformly distributed over the entire key space for any input data. For example, MurmurHash is a well-known hash function that outputs near-uniformly and randomly over a 32-bit key space. This means that for any two inputs, such as abcdefg and abcefg, as long as they are different, their MurmurHash keys should not be correlated and should have the same probability to be any one of the keys in the entire 32-bit key space. This is a desired property of a hash function, because you want even distribution of keys over hash buckets to avoid chaining or constantly resizing your hash table.

An LSH’s hash function does something opposite: for a pair of similar inputs, with similarity defined through some metric space measure, their hash keys should be more likely to be equal, than another pair of hash keys of dis-similar inputs.

What does this mean? It means that an LSH hash function has higher probability of hash key collision for data points that are more similar. Effectively, we are utilizing this higher collision probability for similarity-based retrieval.

MinHash LSH

For every similarity/distance metric, there is an LSH hash function. For Jaccard, the function is called Minwise Hash Function, or MinHash function. Given an input set, a MinHash function consumes all elements with a random hash function and keeps track of the minimum hash value observed. You can build an LSH index using a single MinHash function. See the diagram below.

The mathematical theory behind MinHash function states that the probability of two sets having the same minimum hash value (i.e., hash key collision) is the same as their Jaccard.

h(A) is the hash values of all elements in A by random hash function h.
min(h(A)) is the minimum hash value of all elements in A.

It is a magical result, but the proof is quite simple.

A MinHash LSH index with a single MinHash function does not give you satisfactory accuracy because the collision probability is linearly proportional to the Jaccard. See the following plot to understand why.

Collision probability of a single MinHash function over Jaccard between query set and indexed set. The Y-axis is the collision probability and the X-axis is the Jaccard between the query set and an indexed set. For example, an indexed set having Jaccard = 0.8 with the query set has 80% probability to be retrieved by the index; another indexed set having Jaccard 0.2 with the query set has 20% probability to be retrieved. Image by the author.

Imagine we draw a threshold at Jaccard = 0.9: results with higher Jaccard than 0.9 with the query set is relevant, wheres results with lower than 0.9 Jaccard are irrelevant. In the context of search, the notion of “false positive” means that irrelevant results are returned, wheres the notion of “false negative” means that relevant results are not returned. Based on the plot above and looking at the area corresponding to false positive: if the index only uses a single MinHash function, it is going to produce false positives at a very high probability.

Boosting the Accuracy of MinHash LSH

This why we need another LSH magic: a process called boosting. We can boost the index to be much more attuned to the relevancy threshold specified.

Instead of only one, we use m MinHash functions generated through a process called Universal Hashing — basically m random permutations of the same hash function of 32-bit or 64-bit integer. For every indexed set, we generate m minimum hash values using universal hashing.

Imagine you list the m minimum hash values for an indexed set. We group every r number of hash values into a band of hash values, and we make b such bands. This requires m = b * r.

Minimum hash values of an indexed set in MinHash LSH with m= 16, b = 4 and r= 4. Image by the author.

The probability that two sets having “band collision” — all the hash values in a band collide between two sets, or r contiguous hash collisions, is Jaccard(A, B)^r. That’s a lot smaller than a single hash value. However, the probability of having at least one “band collision” between two sets is 1 — (1-Jaccard(A, B)^r)^b.

Why do we care about 1 — (1-Jaccard(A, B)^r)^b? Because this function has a special shape:

Boosted probability function for retrieval over Jaccard for MinHash LSH Index using b = 32 and r = 32. Image by the author.

In the plot above, you can see by using m MinHash functions, the “at-least-one band collision” probability is an S-curve function with a steep rise around Jaccard = 0.9. Assuming the relevancy threshold is 0.9, the false positive probability of this index is much smaller than the index that uses only one random hash function.

Because of this, an LSH index always uses b bands of r MinHash functions to boost accuracy. Each band is a hash table storing pointers to indexed sets. During search, any indexed set collides with the query set on any band is returned.

A MinHash LSH Index using b = 4 and r = 4. Each band is a hash table whose hash key is a concatenation of minimum hash values from 4 MinHash functions. Image by the author.

To build an MinHash LSH index, we can specify a prior a relevancy threshold and acceptable false positive and negative probabilities conditioned on Jaccard similarity, and calculate the optimal m, b and r, before indexing any data points. This is a great advantage of using LSH over other approximate indexes.

You can find my implementation of MinHash LSH in the Python package datasketch. It also has other MinHash-related algorithms like LSH Forest and Weighted MinHash.

Final Thoughts

I have covered a lot of topics in this post, but I barely scratched the surface of search indexes for Jaccard similarity. If you are interested in reading more about these topics, I have a list of further readings for you:

Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman. The 3rd chapter goes into detail about MinHash and LSH. I think it is a great chapter for gaining the intuition of MinHash. Be aware the application described in the chapter is focused on n-gram based text matching.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. The preliminary section of this paper explains the intuitation behind the search_top_k_merge_list and search_top_k_probe_set algorithms. The main section explains how to take cost into consideration when input sets are large, such as a table column.
Datasketch and SetSimilaritySearch libraries respectively implement the state-of-the-art approximate and exact Jaccard similarity search indexes. The issues list of the datasketch project is a treasure trove of application scenarios and practical considerations when applying MinHash LSH.

What about Embeddings?

In recent years, due to breakthroughs in representation learning using deep neural networks like Transformers, similarity between learned embedding vectors is meaningful when the input data is part of the same domain that the embedding model trained on. The main differences between that scenario and the search scenario described in this post are:

Embedding vectors are dense vectors with typically 60 to 700 dimensions. Every dimension is non-zero. In contrast, sets, when represented as one-hot vectors are sparse: 10k to millions of dimensions, but most dimensions are zeros.
Cosine similarity (or dot-product on normalized vectors) is typically used for embedding vectors. For sets we use Jaccard similarity.
It is hard to specify a relevancy threshold on similarity between embedding vectors, because the vectors are learned black-box representations of the original data such as image or text. On the other hand, Jaccard similarity threshold for sets is much easier to specify because sets are the original data.

Due to the above differences, it is not straightforward to compare embeddings and sets because they are distinctively different data types, even though you could classify both of them as high-dimensional. They are suitable for different application scenarios.