
Search (Pt 2) – Semantic Horse Race

Cutting-edge NLP vs traditional search power

Photo by Noah Silliman on Unsplash

In this non-technical article, we will compare contextual search to the keyword-based approach. For the former, we will utilise some of the recent developments in NLP to search through a large corpus of news. We will focus on the differences between the two approaches, along with their respective pros and cons.

This is a three-part series on Search.

In Pt 1 – A gentle introduction, we provided an overview of the basic building blocks of search.

Finally, Pt 3 (Elastic Transformers) covers the purely technical considerations of building such a search index with Elasticsearch and contextual text embeddings. For the current discussion, we will use some results from that search index.

In this article, we will

  • Understand how keyword and contextual searches compare and where the latest in NLP can help us with search
  • Consider some examples, testing out different queries and seeing how the two approaches differ
  • Finally, weigh the pros and cons of each approach
Image by the author, using gifox

In the previous article we took a bird’s-eye view of how search works, its building blocks and how they differ. Here, we will make a practical comparison between contextual and keyword search. By contextual search, we specifically mean text embeddings produced by NLP Transformers, which we will dive into in a bit.

Traditional keyword search tends to use specific keyword frequencies to identify a good fit for a search query. This, however, can be limiting if the keywords we use are not representative of the documents we are searching. For instance, we might look for "natural disasters", but the available documents might contain many examples of specific disasters such as "hurricane", "earthquake", etc. without ever using our explicit word choice "natural disaster". This is where contextual embeddings can help.

Enter text embeddings for contextual search

Text embeddings are mathematical representations of words, sentences or documents as vectors. This means that we can represent text in ways that allow us to perform mathematical operations on it. Just as we can say that the number 2 is closer to 3 than it is to 10, and that all of them are closer to each other than they are to 100, if we can encode the meaning of a document in such a way, we can use those properties to find similarities between documents, group them into meaningful clusters, etc.
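To make the number analogy concrete, here is a minimal sketch of comparing documents as vectors with cosine similarity. The three-dimensional vectors are made up purely for illustration; real embedding models produce vectors with hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional 'embeddings' for illustration only
hurricane = np.array([0.9, 0.1, 0.2])
earthquake = np.array([0.8, 0.2, 0.3])
interest_rates = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(hurricane, earthquake))      # high (~0.98): related concepts
print(cosine_similarity(hurricane, interest_rates))  # low (~0.30): unrelated concepts
```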

In most cases, the way such representations are ‘learned’ by machines is by giving them a lot of text to read and urging them to learn which words or sentences "go together". Famously:

You shall know a word by the company it keeps – Firth, J. R.

Some pitfalls and transformers to the rescue

Earlier text-embedding tools (Word2Vec, GloVe, etc.) could mostly capture only single words, which was problematic in cases where a phrase means something different from the words within it: ‘natural’ and ‘disaster’ mean something very different when put together. Recent advances in NLP have brought forth a fleet of contextual embedding models, such as ELMo, BERT, etc. (one good overview here). Often, however, these come as pre-trained models which still need fine-tuning for a specific task, such as sentence sentiment identification, sentence similarity or question-answering. Huggingface’s transformers library has made a lot of these tools very accessible.

Here, we will use sentence-transformers — a Python library which (among other things) brings us SBERT models pre-trained for the task of sentence semantic similarity, allowing us to compare the meaning of whole sentences (definitely check out the paper too). This is the crucial enabler of what we do: without powerful document-level embeddings (in this case the news headline is the document), we cannot make meaningful comparisons between the query and the documents searched. Note that SBERT is not the only way to do this; other approaches include USE, InferSent, etc. Also, note that this approach to search tries to match the query with results that mention similar concepts. In that way it resembles keyword search, which looks for documents using the same words as the query, but it does not replace ‘question-answering’, where we are looking for a specific answer to a question.
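As a minimal sketch of what this looks like in practice (the model name and headlines below are illustrative choices, not the exact setup from Pt 3):

```python
from sentence_transformers import SentenceTransformer, util

# Any pre-trained SBERT model will do; this is a common lightweight choice
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "natural disasters"
headlines = [
    "hurricane leaves thousands homeless in coastal towns",
    "freak storm causes catastrophic flooding",
    "central bank holds interest rates steady",
]

# Encode the query and the 'documents' as vectors, then rank by cosine similarity
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(headlines, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]

for headline, score in sorted(zip(headlines, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {headline}")
```

The first two headlines should rank above the third, even though neither contains the words "natural" or "disasters" — exactly the behaviour we are after.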


And now… the actual horse race…

Here, I will run the same queries through both approaches side by side, compare the differences, and draw some conclusions.

Note that a more formal evaluation of a similar task is done in IR-BERT (the task being to find the best supporting articles for a given query article, Task 1 here). The authors demonstrate that the contextual approach (SBERT) clearly outperforms the pure keyword approach (BM25); however, results are mixed when more advanced keyword weighting techniques are used.

Firstly, in many cases, simple keyword search will suffice. Consider searching for "Apple Inc". We do not need a semantic engine for that one. As what we are looking for is a "named entity", exact matches make sense and approximate matches might be outright wrong. In a way, when we are very specific about what we are looking for – specific names, dates, ids, etc. – an exact match may do just fine. However, if we do not know the exact terms for what we need (see the ‘natural disaster’ example before), we would benefit from a result based on context rather than exact matches.

The setup: I compared the top results of keyword and contextual search side by side for a few queries. In green I highlight results clearly relevant to the search, in amber ambiguous ones, and in red irrelevant ones. Note that we used A Million News Headlines – sourced from ABC News – so the headlines have an Australian focus.

Virus threat

Let’s search (topically) for "virus threat".

Image by the author

Both approaches yield good results. Note that result #5, "WHO highlights dangers of vector borne diseases", doesn’t include any of our search terms but is still highly relevant. Here are more examples where the contextual search results do not contain ANY of the keywords. We see relevant mentions of: outbreak, infection, parasite, etc.

Image by the author

Natural Disaster

The corpus has plenty of exact matches for "natural disaster", so the first 10 results are pretty close to each other in both forms of search. However, consider the contextual results without exact keyword matches. We can see relevant mentions such as: flood, freak storm, catastrophic fire, etc.

Image by the author

"Regulatory risk banking reform"

Expanding the query further makes things more complicated. Firstly, about the query: the ask here is a bit vague, as the user is obviously interested in banking reform and regulation but not specific about what, who or when. It is likely part of wider research trying to compare and contrast cases. Hence, a good result (regardless of search approach) would offer diverse examples. Coincidentally, diversity of results is also a metric considered in the IR-BERT paper, where contextual search outperforms.

Image by the author

Looking at the contextual results, it is clear the topic is banking, but narrowing down to regulatory / reform topics leads to debatable results. There are mentions of the RBA (Australia’s central bank) which are not always related to regulatory risk, as well as mentions of the Royal Commission (a commission investigating a range of misconduct in financial services) which are relevant to finance but not always concerned with banking. On the keyword search side, despite the lack of good documents, the results returned consistently contain at least two of the query words, which often seems enough to make them "semantically relevant" – as long as a couple of the keywords are present, results are likely relevant. Notice also how the explainability of keyword search helps us rule out some cases immediately. For instance, for keyword result #8, "Govt Internet Regulatory Plan Criticised", we know exactly why this was suggested and can therefore rule it out quickly. In contrast, for contextual result #5, "RBA considers cap on credit card surcharges", it is not clear why exactly the model considers it a good hit.

The good and the bad…

What can we say about the benefits and shortcomings of contextual search when compared to keyword search?

Pros

Contextual fit – as we have seen, keyword search can sometimes be limiting. This is particularly relevant when doing research or just starting to explore a topic, as we often do not know the relevant terms, keywords and entities for our search.

Side-by-side capabilities – in the current setup, we can easily switch between keyword and contextual search. Some further work could get both result sets ranked side by side, though some assumptions would be needed. Notice that keyword search can also be used as an initial filter for good candidates to improve the overall speed of the solution – a minimal sketch of that idea follows below. More on speed considerations in Pt 3.
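A minimal sketch of that filter-then-rank idea, assuming an in-memory list of headlines (in Pt 3 the keyword stage is Elasticsearch; here a crude term overlap stands in for BM25, and the model name is again illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def keyword_filter(query: str, documents: list[str]) -> list[str]:
    """Cheap first stage: keep documents sharing at least one query term.
    A real system would use Elasticsearch/BM25 here instead."""
    terms = set(query.lower().split())
    return [doc for doc in documents if terms & set(doc.lower().split())]

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[tuple[str, float]]:
    """Expensive second stage: re-rank the small candidate set by embedding similarity."""
    if not candidates:
        return []
    query_emb = model.encode(query, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, cand_embs)[0].tolist()
    return sorted(zip(candidates, scores), key=lambda p: -p[1])[:top_k]
```

The trade-off is worth noting: the keyword pre-filter saves us from embedding the whole corpus at query time, but it reintroduces the vocabulary-mismatch problem for documents that share no terms with the query.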

Cons

Explainability – we cannot easily reverse-engineer why the model decided to show a specific result. Seeing our search words in the results helps us make clear decisions on whether we agree with them. This is an ongoing area of research and development; recently, Google released a tool for the interpretability of contextual embeddings. However, applying it here would not be straightforward, as we would also need to assess what specifically drives a search match.

Context is contextual – when talking about semantic search, meaning can be highly domain-specific. For instance, we might be interested in "vaccines" in the context of our own health, or as investors deciding which companies to support, or as researchers looking for technical details of trials, methods, etc. These aspects can partially be dealt with by refining the query, but the quality of results will be limited both by the documents available in the index and by the way the contextual embeddings have been trained. It is a particular concern that the system is not equipped to indicate to the user that the quality of a domain-specific query may be poor – it will dutifully always provide lists of ranked results. In practice, domain-specific search engines help solve this. One example would be a specialised tool for COVID searches based on domain-specific articles and research. A couple of examples of such tools are Corona Papers and Google’s COVID Research Explorer, both using CORD-19 (a COVID-specific database of research articles) to contextualise the text embeddings used.

Conclusion

We have seen an outline of what text embeddings are and how they can help contextual search.

We illustrated with some contrastive examples and saw how contextual search helps us find themes and terms which we did not anticipate with our query.

Finally, we discussed some of the shortcomings of the approach – its lack of explainability, as well as the fact that domain-specific queries may not work well.

All that remains is to show the technical implementation and considerations around building such a tool – which we cover in Part 3.


Hopefully, this was useful. Thank you for reading. If you feel like saying hi, do reach out via LinkedIn.

