
Neural Search with Haystack – Semantic Search At Its Finest

A quick and easy guide to building your own custom search engine with minimal setup

Photo by Luca Huter on Unsplash

Semantic search is pretty common in our everyday lives, especially given how often we turn to Google for answers to most, if not all, of our questions. Until a few years ago, building a moderately good search engine without huge computational resources was a pain. Simply put, it was far too hard and required significant expertise in advanced NLP concepts.

The advent of the BERT transformer model and its successors changed that. Although we still need a good GPU to fine-tune these models, awesome libraries like Haystack have now made it extremely easy to build a production-ready application around a dataset of our choice.

In this article, I’ll show you how to quickly spin up your own custom search engine from a publicly available dataset, for instance from Kaggle.

Let’s get started! 👇

But first, what exactly is Haystack?

Simply put, it is an awesome library for building production-ready, scalable search and question answering systems.

It utilises the latest transformer models, which makes it one of the best NLP toolkits available to both researchers and developers.

It provides the different components needed to build such systems, and they are quite intuitive to learn and experiment with. I’ll go over some of them in this article in the hopes of giving you a good starting point!

Getting the dataset

We will be using a subset of this dataset from Kaggle. It contains about 1.7 million arXiv research papers from STEM fields, with various features for each, such as their abstract, authors, publication date, and title. It’s available in JSON format, and I’ve gone ahead and compiled a smaller version of it.

The smaller version is available here, at my GitHub repo. Go ahead and download the arxiv_short.csv onto your machine.

I’ve made sure to only keep 50,000 documents and only three columns, for simplicity:

  • authors – the authors of the papers
  • abstract – the paper abstract
  • title – the paper title

Read it with pandas:

import pandas as pd

# load the shortened arXiv dataset and peek at the first few rows
df = pd.read_csv('arxiv_short.csv')
df.head()

Let’s take a look at the data:

arxiv dataset head rows

We are now ready to set up our Haystack engine.

Installing Haystack

This step requires only two commands. Make sure you’re in a virtual environment first.

pip install git+https://github.com/deepset-ai/haystack.git
pip install sentence-transformers

We’ll use haystack to build the actual search engine and sentence-transformers to create sentence embeddings from the abstract column, on which our search engine will be based.

Now, let’s go ahead and build different components for our haystack search engine!

Initialising the Document Store and Retriever

A Document Store stores our searchable text and its metadata.

For example, here our text will be the abstract column from our dataset, and the remaining two columns – title and authors – will make up our metadata.

It’s fairly simple to initialise and build:

# import path may be haystack.document_store.faiss in older Haystack versions
from haystack.document_stores import FAISSDocumentStore

document_store_faiss = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    return_embedding=True
)

FAISS is a library for efficient similarity search and clustering of dense vectors, and because it requires no additional setup, I prefer to use it here over Elasticsearch.
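As an aside, "Flat" gives exact (brute-force) search. If your corpus grows much larger, FAISS also offers approximate indexes; here’s a minimal sketch, assuming your Haystack version accepts other FAISS index factory strings such as "HNSW":

# Hypothetical alternative: an approximate HNSW index for bigger corpora.
# It trades a little recall for much faster search than "Flat".
document_store_hnsw = FAISSDocumentStore(
    faiss_index_factory_str="HNSW",
    return_embedding=True
)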

Now comes the Retriever.

A Retriever is a filter that can quickly scan your full document store and pick out a set of candidate documents from it, based on their similarity to a given query.

At its heart, we are building a semantic search system, so getting relevant documents from a query is the core of our project.

Initialising a Retriever is just as easy:

# import path may be haystack.retriever.dense in older Haystack versions
from haystack.nodes import EmbeddingRetriever

retriever_faiss = EmbeddingRetriever(
    document_store_faiss,
    embedding_model='distilroberta-base-msmarco-v2',
    model_format='sentence_transformers'
)

Here, distilroberta is simply a transformer model – a variation of the BERT model – that we use to create embeddings for our text. It’s made available as part of the sentence-transformers package.
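If you’re curious what the retriever does under the hood, you can try the embedding model directly; a quick sketch, assuming the model name still resolves in your sentence-transformers version:

# A peek under the hood (illustrative only): the retriever encodes
# text into dense vectors using this sentence-transformers model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilroberta-base-msmarco-v2')
emb = model.encode(['Neural networks for particle physics'])
print(emb.shape)  # one fixed-size vector per input text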

Now, we want to write documents into our document store.

Building the Document Store

We simply pass the columns of our dataframe to the document store.

document_store_faiss.delete_all_documents()  # a precaution

# authors passes through unchanged and ends up in the document metadata
document_store_faiss.write_documents(
    df[['authors', 'title', 'abstract']].rename(
        columns={
            'title': 'name',
            'abstract': 'text'
        }
    ).to_dict(orient='records')
)

A little bit of renaming is needed here because haystack expects our documents to be written in this format:

{'text': DOCUMENT_TEXT, 'meta': {'name': DOCUMENT_NAME, ...}}, ... and so on.
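Equivalently, you could build the documents by hand instead of renaming dataframe columns; a small sketch of the same structure (extra fields like authors end up in the document’s meta):

# Equivalent explicit construction of the documents (illustrative only)
docs = [
    {
        'text': row['abstract'],
        'meta': {'name': row['title'], 'authors': row['authors']}
    }
    for _, row in df.iterrows()
]
document_store_faiss.write_documents(docs)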

Finally, after the documents are written, we compute their embeddings through our retriever.

document_store_faiss.update_embeddings(retriever=retriever_faiss)

These two steps take longer the larger your dataset is.
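Since recomputing embeddings is the slow part, it may be worth persisting the index between runs; a sketch, assuming your Haystack version exposes a save/load API on FAISSDocumentStore:

# Hypothetical persistence step -- check your Haystack version's API
document_store_faiss.save('arxiv_faiss_index.faiss')
# ...and in a later session, reload instead of re-embedding:
# document_store_faiss = FAISSDocumentStore.load('arxiv_faiss_index.faiss')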

We’re almost done! The only thing left is to write a function that retrieves documents matching a query!

Concluding – Fetching results and trying it out

We define a simple function to fetch 10 relevant documents (research paper abstracts) from our data.

def get_results(query, retriever, n_res=10):
    # retrieve the top n_res most similar documents for the query
    return [(item.text, item.to_dict()['meta'])
            for item in retriever.retrieve(query, top_k=n_res)]

Finally, we test it!

query = 'Poisson Dirichlet distribution with two-parameters'
print('Results: ')
res = get_results(query, retriever_faiss)
for r in res:
    print(r)

You can see the results like this:

search results

And there you have it – a custom semantic search engine built on a dataset of your choice!

Go ahead and play with it! Tweak some parameters – for example, try changing the transformer model in the retriever object.
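For instance, here’s a sketch of swapping in a different sentence-transformers model (the model name here is just one possible choice; any compatible semantic search model should work):

# Hypothetical model swap -- any sentence-transformers model trained
# for semantic search (e.g. the msmarco family) should work here.
retriever_alt = EmbeddingRetriever(
    document_store_faiss,
    embedding_model='msmarco-distilbert-base-v2',
    model_format='sentence_transformers'
)
# embeddings must be recomputed for the new model
document_store_faiss.update_embeddings(retriever=retriever_alt)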

The entire code is also available as a notebook here:

yashprakash13/haystack-search-engine

Thanks for reading! 🙂


Learning Data Science alone can be hard; follow me and let’s make it fun together. Promise. 😎

Also, here is the codebase of all my Data Science stories. Happy learning! ⭐️


Here is another article of mine that you might want to give a read:

Deploying An ML Model With FastAPI – A Succinct Guide

