Locality Sensitive Hashing in NLP

A hands-on tutorial on how to speed up document retrieval by reducing the search space through Locality Sensitive Hashing (LSH)

Raj Sangani
Towards Data Science


Photo by Markus Winkler on Unsplash

The problem at hand

During a recent project, I needed to analyze whether medical problems similar to the complaints presented by a patient had been diagnosed and documented earlier, and what ailments those complaints indicated. To compare the current patient’s symptoms with past records, I had to compare the current problem text with every other document in the dataset and find the ones most similar to it. For n documents, this means n comparisons per query. Moreover, since I used DistilBERT to encode these sentences, each comparison (cosine similarity) also scales with the embedding dimension (768) of these documents. The medical dataset I have used is available here. This is what the data looks like.

Can we improve traditional search?

Instead of comparing the vector with all the documents, wouldn’t it be nice if we could reduce the search space somehow? This is exactly what locality sensitive hashing (LSH) helps us do. Here is how we can use LSH to reduce the search space.

The intuition behind LSH using Hyperplanes

Using lines to split the search space (Image By Author)

In the diagram above, we have various data points denoted by ‘X’s. Say we want to find the point most similar to point A. First, we randomly draw two lines (the red and blue lines). A data point can lie on one side of a line (side 0) or on the other side (side 1), so we record which side of each line every data point falls on. Note that I have only shown 4 points in the table below to minimize clutter. Also, since our setting is 2-dimensional, I have used lines; in higher dimensions, these lines become hyperplanes. More formally, in an n-dimensional space, a hyperplane has (n-1) dimensions.

Here is how our current setting looks in a table

Position of the points with respect to planes (Image By Author)

From the table, we can see that points A and B lie on the same side of both planes. We now need to quantify this table and assign each point a hash value. Here is how a hash value is calculated.

Hash Value (Image By Author)
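
In words (assuming the standard hyperplane-LSH construction, which is what the formula above expresses), each plane contributes one bit, namely the side value, and the bits are combined into a single integer. A minimal sketch in Python, where sides is the list of 0/1 side values for one point, one entry per plane:

# sides: side values (0 or 1) for one point, one entry per plane
hash_value = sum(2**i * side for i, side in enumerate(sides))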

Here is what we get when we apply this formula to our setting.

Hash Values (Image By Author)

We can see that points A and B have the same hash value and hence would land in the same entry (hash bucket) if we were to construct a hash table. In simple terms, hash tables are data structures that allow instant lookups. In Python, they are usually implemented as dictionaries, where the keys are the hash values and the values are our vectors (data points). Let us construct a hash table using the 4 points above.

Hash Table (Image By Author)

Now we can clearly see that points A and B are in the same entry in our hash table. This indicates that these points are similar. If we want to find points similar to point A, all we need to do is look at points that have the same hash value as A. This simple intuition works very well in practice, but there are a few technical details that we must talk about before moving on to the implementation.
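
In code, such a hash table is just a Python dictionary keyed by hash value. A generic sketch, where point_hashes stands for the point-to-hash-value mapping shown in the table above (the actual values depend on the randomly drawn lines):

# point_hashes maps each point to its hash value from the table above, e.g. {"A": ..., "B": ...}
hash_table = {}
for point, h in point_hashes.items():
    hash_table.setdefault(h, []).append(point)  # points sharing a hash value share a bucket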

How shall I determine the number of planes?

In the simple setting above, we randomly drew two lines, but with more data, how many planes should we use? There is a tradeoff between the number of planes and speed: the more planes we have, the longer LSH takes to execute. Here is how to work out how many planes to use.

When we have n documents (or n vectors), we would ideally want each hash entry (bucket) to contain no more than 16 vectors. In that case, we would need n/16 buckets. Each plane divides the search space into two parts (same side and opposite side), so with p planes we want 2^p to equal n/16, and solving for p gives the number of planes. For example, with the roughly 26,000 documents in this dataset, n/16 ≈ 1,600 buckets, so p = ⌈log₂(1600)⌉ = 11 planes. Note that the threshold of 16 can be changed depending on the problem and the size of the search space; ideally, one should try different values to see what gives the best results.

Mathematically determining which side of the plane a vector lies on

Above, I indicated the two sides manually. In higher dimensions, the dot product tells us which side of a plane a vector lies on. Every plane has a normal vector, a vector perpendicular to every vector lying in the plane. To find which side of the plane a vector lies on, we take its dot product with the normal vector: a non-negative dot product means the vector lies on the same side as the normal, and a negative dot product means it lies on the opposite side. In general, a plane represented as Ax+By+Cz=D has the normal <A, B, C>.

Here is a simple diagram to help you visualize this process.

Determining where a vector lies with respect to a hyperplane (Image By Author)

In the diagram above, we can see that vector v1 lies on one side of the plane, whereas vector v2 lies on the other side.
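
In code, the side test is a single dot product with the plane’s normal vector. A minimal sketch (the sign convention matches the description above):

import numpy as np

def side_of_plane(v, normal):
    # 1 if v lies on the same side as the normal vector, 0 otherwise
    return 1 if np.dot(v, normal) >= 0 else 0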

Improving accuracy by creating multiple hash tables

Since we generate the planes randomly, it is possible that some similar vectors are assigned to different buckets. To reduce the effect of this randomness, we can create multiple hash tables, each with a different set of planes. When searching for a vector, we then look for vectors that share its hash value in any of these hash tables. This enlarges the candidate set somewhat but greatly improves accuracy (note that the candidate set still remains much smaller than the original search space). In the code, I have created 25 hash tables.

Let’s dive in!

Before we start, here is an overview of what the code will do. Also, note that I have not shown the entire code since it is verbose; feel free to check out the entire code here.

A visual overview of the code (Image By Author)

1. Encode the entire text dataset into vectors (768 dimensional) using DistilBERT.
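
One way to produce the 768-dimensional embeddings is with the sentence-transformers library; the model name and the problem_texts list below are illustrative assumptions, since the article only states that DistilBERT was used:

import numpy as np
from sentence_transformers import SentenceTransformer

# problem_texts: list of the medical problem strings from the dataset (assumed name)
model = SentenceTransformer('distilbert-base-nli-mean-tokens')  # a 768-dim DistilBERT variant
document_embeddings = np.asarray(model.encode(problem_texts))   # shape: (n_docs, 768)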

2. Determine the number of planes to use, and set the number of hash tables (denoted by n_repeats).
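
A sketch of this step, using the values from the article (≈26k documents gives 11 planes, and 25 hash tables) and the embeddings from step 1:

n_docs = len(document_embeddings)                # 25971 in this dataset
n_planes = int(np.ceil(np.log2(n_docs / 16)))    # ceil(log2(25971 / 16)) = 11
n_repeats = 25                                   # number of independent hash tables
embedding_dims = document_embeddings.shape[1]    # 768 for DistilBERT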

3. Now, randomly generate planes and create a function to find the hash of a vector.

# Randomly generate n_repeats sets of planes, one set per hash table.
# Each set is a 768 x 11 matrix: one 768-dimensional normal vector per plane.
planes_l = [np.random.normal(size=(embedding_dims, n_planes)) for _ in range(n_repeats)]
print(len(planes_l))       # 25 sets of planes
print(planes_l[0].shape)   # (768, 11)
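
The hashing function itself is not shown here; below is a minimal sketch that applies the side test and the bit-combination formula from earlier (the function name is my own, not necessarily the one used in the full code):

def hash_vector(v, planes):
    # v: (768,) embedding; planes: (768, n_planes) matrix of normal vectors
    sides = (np.dot(v, planes) >= 0).astype(int)               # side bit for each plane
    return int(np.sum(sides * (2 ** np.arange(len(sides)))))   # hash = sum_i 2^i * side_i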

4. Now, create multiple hash tables for improved accuracy
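
A sketch of this step, building one dictionary-based table per set of planes (names are illustrative):

def make_hash_table(embeddings, planes):
    table = {}
    for idx, v in enumerate(embeddings):
        table.setdefault(hash_vector(v, planes), []).append(idx)  # bucket document indices
    return table

hash_tables = [make_hash_table(document_embeddings, planes) for planes in planes_l]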

5. Create a function that takes as input our problem text and finds vectors similar to it using LSH
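
A minimal sketch of this lookup (again with illustrative names, not the article’s exact code):

def lsh_candidates(query_vec, planes_l, hash_tables):
    candidate_ids = set()
    for planes, table in zip(planes_l, hash_tables):
        h = hash_vector(query_vec, planes)
        candidate_ids.update(table.get(h, []))  # everything in the query's bucket for this table
    return candidate_ids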

Notice how we collect potential candidates that share the query vector’s hash value in any of the 25 hash tables.

6. From this reduced set of vectors, find the three most similar vectors using cosine similarity
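
A sketch of this ranking step, assuming query_vec is the encoded query and using scikit-learn’s cosine_similarity:

from sklearn.metrics.pairwise import cosine_similarity

ids = list(lsh_candidates(query_vec, planes_l, hash_tables))
sims = cosine_similarity(query_vec.reshape(1, -1), document_embeddings[ids])[0]
top_3 = [ids[i] for i in np.argsort(sims)[-3:][::-1]]  # indices of the 3 most similar documents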

7. Compare results between using the entire search space and using the reduced search space after LSH

Here is our query vector for which we want to find historical documents containing similar complaints and conditions.

Here is the search using the entire space

Here is the search using the reduced space after LSH

Wow, we got exactly the same results in drastically less time! We reduced the wall time from 576ms to 53ms!

Conclusion

Using LSH, we reduced the search space from 25971 documents to 2042 documents, cut the search time from 576ms to 53ms, and still got the same results! Note that the results may not always be identical, since the reduced search space is an approximation, but the retrieved documents should still be fairly similar to our query vector.


Check out my GitHub for some other projects. You can contact me here. Thank you for your time!
