Quick Semantic Search using Siamese-BERT Networks

Use the Sentence-BERT library to create fixed-length sentence embeddings suitable for semantic search on a large corpus.

Aneesha Bakharia
Towards Data Science

--

Image by PDPhotos, downloaded from Pixabay.

I’ve been working on a tool that needed BERT-based semantic search. The idea was simple: get a BERT encoding for each sentence in the corpus and then use cosine similarity to match each sentence against a query (either a word or another short sentence). Depending on the size of the corpus, computing a BERT embedding for every sentence takes a while. I was tempted to use a simpler approach (e.g. ELMo or BERT-As-A-Service) until I came across “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, a 2019 EMNLP paper by Nils Reimers and Iryna Gurevych. The source code for the paper is available on GitHub, and the Sentence-BERT library is on PyPI, ready to be pip installed (if you use Python). In this blog post, I’ll briefly describe the problem the paper solves and present a Google Colab notebook that illustrates just how easy it is to achieve semantic search functionality using a SOTA BERT model.
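For reference, installing the library is a one-liner (this assumes the sentence-transformers package on PyPI and a Colab/Jupyter cell for the pip command):

```python
# In a Colab / Jupyter cell: install the Sentence-BERT library from PyPI
!pip install sentence-transformers
```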

Why Siamese and Triplet Network Structures?

BERT uses a cross-encoder: both sentences are passed together to the transformer network, which then predicts a target value. This lets BERT achieve SOTA performance on Semantic Textual Similarity tasks, but every pair of sentences must be passed through the full network. In a corpus of 10,000 sentences, finding the most similar pairs requires about 50 million inference computations, taking approximately 65 hours even with serious compute (e.g. a V100 GPU). Sentence-BERT, impressively, can produce the embeddings for those 10,000 sentences in approximately 5 seconds!
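For context, the ~50 million figure is simply the number of unordered sentence pairs in a 10,000-sentence corpus:

```python
# Pairwise comparisons a cross-encoder must run for an n-sentence corpus: n * (n - 1) / 2
n = 10_000
print(n * (n - 1) // 2)  # 49995000, i.e. roughly 50 million forward passes through BERT
```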

Sentence-BERT fine-tunes a pre-trained BERT network using Siamese and triplet network structures and adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding vector. The resulting embedding is much better suited to sentence similarity comparisons within a vector space (i.e. it can be compared directly using cosine similarity).
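As a minimal sketch of what this gives you in practice, the snippet below loads one of the pretrained Sentence-BERT checkpoints (the model name and example sentences are illustrative; any released checkpoint will do) and shows that each sentence comes back as a fixed-length vector:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pretrained Siamese-BERT model (BERT backbone + mean pooling over token outputs)
model = SentenceTransformer('bert-base-nli-mean-tokens')

sentences = ['The cat sits on the mat.', 'A feline rests on a rug.']
embeddings = model.encode(sentences)  # one fixed-length vector per sentence

print(np.asarray(embeddings).shape)  # (2, 768) for a BERT-base backbone

# Cosine similarity between the two sentence vectors
a, b = embeddings
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
```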

Why not BERT-As-A-Service?

The popular BERT-As-A-Service library runs a sentence through BERT and derives a fixed-size vector by averaging the model’s outputs. However, this averaging strategy has not been evaluated for whether it actually produces useful sentence embeddings.

Semantic Search with the Sentence-BERT Library

I’ve created a Google Colab notebook that uses the Sentence-BERT library to build a semantic search example. The GitHub repo is available here: SiameseBERT_SemanticSearch.ipynb

A gist version of the Colab notebook is displayed below:
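Since the embedded gist may not render here, the following is a minimal sketch of the same workflow; the model name, corpus sentences, and query are illustrative placeholders rather than the notebook’s exact contents:

```python
from scipy.spatial.distance import cdist
from sentence_transformers import SentenceTransformer

# Toy corpus; in practice this is the collection of sentences you want to search
corpus = [
    'A man is eating food.',
    'A man is riding a horse.',
    'A cheetah is running behind its prey.',
]

model = SentenceTransformer('bert-base-nli-mean-tokens')
corpus_embeddings = model.encode(corpus)

query = 'Someone on a galloping horse'
query_embedding = model.encode([query])

# Cosine distance between the query and every corpus sentence
distances = cdist(query_embedding, corpus_embeddings, 'cosine')[0]

# Print corpus sentences ranked by similarity (1 - cosine distance)
for idx in sorted(range(len(corpus)), key=lambda i: distances[i]):
    print(f'{corpus[idx]}  (score: {1 - distances[idx]:.4f})')
```

Note that the corpus embeddings only need to be computed once and can be cached; each new query then costs a single forward pass plus a cheap cosine-similarity lookup, which is where the speed-up over a cross-encoder comes from.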


--

Data Science, Topic Modelling, Deep Learning, Algorithm Usability and Interpretation, Learning Analytics, Electronics — Brisbane, Australia