
Using transformer-based models for searching text documents is awesome; nowadays it is easy to implement using the huggingface library, and the results are often very impressive. Recently I wanted to understand why a given result was returned. My initial thoughts went to various papers and blog posts on digging into the attention mechanisms inside the transformers, which seemed a bit involved. In this post I test out a very simple approach: using some basic vector math to get a glimpse into the context similarities these models pick up on when doing contextual search. Let's try it out.
For the purpose of this post I’ll use a model from the [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
library, which has been specifically optimized for doing semantic textual similarity searches. The model essentially creates a 1024-dimensional embedding for each sentence passed to it, and the similarity between two such sentences can then be calculated as the cosine similarity between the corresponding two vectors. Say we have two questions A and B that get embedded into 1024-dimensional vectors A and B, respectively; the cosine similarity between the sentences is then calculated as follows:
$$\text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
That is, a cosine similarity of 1 means the questions are identical (the angle between the vectors is 0), while a cosine similarity of -1 means they point in opposite directions and are very different. For demonstration purposes, I embedded a set of 1700 questions from the ARC question classification dataset. The full notebook can be found on Google Colab here. The essential part of doing the sentence embeddings can be seen in the following snippet.
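A minimal sketch of that step could look as follows; the choice of the 1024-dimensional stsb-roberta-large model and the `questions` list holding the ARC question strings are assumptions here, not necessarily the exact setup used in the notebook:

```python
from sentence_transformers import SentenceTransformer

# A 1024-dimensional STS model from sentence-transformers; the exact model name
# is an assumption, any model producing 1024-dim sentence embeddings works the same way
model = SentenceTransformer("stsb-roberta-large")

# `questions` is assumed to be a list of the 1700 ARC question strings
embeddings = model.encode(questions, convert_to_numpy=True, show_progress_bar=True)
print(embeddings.shape)  # (1700, 1024)
```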
With this, we can easily perform searches in our database of questions; say we have a database of 1700 questions, which we've embedded into a 1700×1024 matrix using the above snippet. The first step would be to L2-normalize each row, which means scaling each question vector to have a length of 1; this simplifies our previous equation such that the cosine similarity between A and B is simply the dot product of the two vectors. Following the embeddings created in the previous snippet, we can pretend that the first question in our dataset is our query and try to find the closest matching entry among the rest of the questions.
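A sketch of how that search could look, continuing from the (assumed) `embeddings` matrix and `questions` list from the previous snippet:

```python
import numpy as np

# L2-normalize each row so that a dot product equals the cosine similarity
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pretend the first question is the query and search among the remaining ones
query_vec = embeddings[0]        # shape (1024,)
candidates = embeddings[1:]      # shape (1699, 1024)

scores = candidates @ query_vec  # cosine similarities, shape (1699,)
best_idx = int(np.argmax(scores)) + 1  # +1 because we sliced off the query row

print("Query:     ", questions[0])
print("Best match:", questions[best_idx])
print("Cosine similarity:", scores[best_idx - 1])
```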
In my sample dataset the first question (query) was "Which factor will most likely cause a person to develop a fever?" and the identified most similar question was "Which best explains why a person infected with bacteria may have a fever?". This is a pretty good match 🙌 – both sentences relate to a person developing a fever. However, how do we know that the reason the algorithm picked that specific match was not just because they both start with the word "Which"?
The thing to remember is that, by design, the transformer models actually output a 1024-dimensional vector for each token in our sentences; these token embeddings are then mean-pooled to generate our sentence embeddings. As such, to get more information on the context used for finding the match to our search query, we can calculate the cosine similarity between each token in our query and each token in our search match and plot the resulting 2D matrix.
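A sketch of that token-level comparison, again assuming the `questions`, `model` and `best_idx` variables from the snippets above; sentence-transformers can return the per-token embeddings via `output_value="token_embeddings"`, and the plotting details are just illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

query = questions[0]
match = questions[best_idx]

# Per-token embeddings, shape (n_tokens, 1024) for each sentence
tok_query = model.encode(query, output_value="token_embeddings", convert_to_tensor=True).cpu().numpy()
tok_match = model.encode(match, output_value="token_embeddings", convert_to_tensor=True).cpu().numpy()

# L2-normalize each token vector so the dot product is the cosine similarity
tok_query /= np.linalg.norm(tok_query, axis=1, keepdims=True)
tok_match /= np.linalg.norm(tok_match, axis=1, keepdims=True)

# (n_query_tokens x n_match_tokens) matrix of token-to-token cosine similarities
sim_matrix = tok_query @ tok_match.T

# Token strings (including the special start/end tokens) for the axis labels
query_tokens = model.tokenizer.convert_ids_to_tokens(model.tokenizer(query)["input_ids"])
match_tokens = model.tokenizer.convert_ids_to_tokens(model.tokenizer(match)["input_ids"])

sns.heatmap(sim_matrix, xticklabels=match_tokens, yticklabels=query_tokens, cmap="viridis")
plt.xlabel("best match tokens")
plt.ylabel("query tokens")
plt.show()
```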
This results in the following plot:
[Heatmap of the token-to-token cosine similarities between the query and the best-matching question]
Now we can see the cosine similarity between each token in the query and each token in the best search result. It is clear that the "fever" keyword is indeed picked up and forms a major part of the "semantic context" that produced the search result. However, it is also clear that additional components enter into the semantic context: for example, "to develop a" and "have a" align with a high cosine similarity score, and the keyword "person" is also picked up, while the initial word "which", which is present in both sentences, is less important.
This simple technique of calculating the cosine similarity between all token embeddings gives insight into the contribution of each token towards the final similarity score, and is thus a quick way of explaining what the model is doing when it returns a given search result. It must be noted that when doing the actual search we take the mean of all the token embeddings before calculating the cosine similarity between sentence embeddings, which is different from what we are doing here; even so, seeing how each token embedding from the query aligns with the token embeddings in the best match gives insight into what composes the sentence embeddings used for the semantic search.