Search (Pt 1) — A Gentle Introduction

From basic building blocks to DIY search engine

Published in

Towards Data Science

9 min read · Sep 6, 2020

The ability to search the entire web in less than a second for whatever we fancy knowing is one of the greatest achievements of recent history. But how does it work? What are its building blocks? And, most importantly, … can we hack together our own version of it? That last question matters because search is inevitably personal: it is all about our focus, our preferences, the resources at our disposal and even our emotions. Plus, it’s really cool!

In this three-part series, I will:

Pt 1. Provide a gentle introduction to search, using both Google and Elasticsearch as examples
Pt 2. Explain some state-of-the-art NLP techniques, compare their results to traditional approaches and discuss pros and cons
Pt 3. Provide a hacker's guide to building your own search engine with Elasticsearch, containing 1 million news headlines and employing state-of-the-art NLP for enhanced semantic searches…

Search - in a nutshell

When we talk about search nowadays we often mean semantic search. What is semantic search, you ask? Imagine searching for the phrase “virus threat”. A simple lexical search will come back with documents containing exactly those words, in no particular order of importance. Documents about "security threat" will also be considered relevant, as they contain part of the query.

Semantic search, on the other hand, is also able to pick up on the ideas of “disease”, “infection” and “corona”. This gives us a far wider and potentially more accurate search, one that reflects the "meaning" of what we are looking for instead of sticking to its specific keywords. Many ideas in this section come from Bast, Hannah; Buchhold, Björn; Haussmann, Elmar (2016), “Semantic search on text and knowledge bases”, in which they state:

Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query
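To make the lexical side of that distinction concrete, here is a minimal sketch of literal word matching. The headlines are made up for illustration, and the function names are my own rather than anything from a real search library.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_search(query, documents):
    """Return (document, matched_terms) pairs for documents sharing at least one query word."""
    query_terms = set(tokenize(query))
    hits = []
    for doc in documents:
        matched = query_terms & set(tokenize(doc))
        if matched:
            hits.append((doc, sorted(matched)))
    return hits


# Made-up headlines, for illustration only.
headlines = [
    "New virus threat worries health officials",
    "Security threat found in popular router",
    "Infection rates climb as disease spreads",
]

for doc, matched in lexical_search("virus threat", headlines):
    print(matched, "->", doc)
```

The router headline is returned because it shares the word "threat", while the infection headline is missed entirely because it shares no literal word with the query. That gap is what semantic search tries to close.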

The diagram below shows core components of a search engine of this kind

We focus on semantic search on text with some additional annotations (such as names, dates, links, etc.) as opposed to, say, search on structured databases. This is essentially the typical web search we use all the time.

Please note that this article deals with search that produces lists of relevant documents or individual facts. It does not cover additional steps such as ranking based on source quality (e.g. PageRank), results summarisation, etc.

Query Types

These can be broken down into:

Keyword - these are shorthand searches rather than proper sentences, where the set of keywords, and sometimes their order, carries semantic meaning. For instance: Neil Armstrong date of birth, pasta recipe under 10mins
Structured/Semi-structured - special syntax used in a query. It can represent either the full query or just refinements to it. For instance, this might be a restriction to only search a specific source, e.g. news from AP only. In other cases this might restrict the languages of the results or state mandatory elements of the query.
Natural language & natural questions - fully or mostly grammatically formed questions: “What is Neil Armstrong’s date of birth?”. This is the most natural way to interact with search, but it also poses many difficulties. For instance, we could be asking multiple questions at once ("Where can I park and what are opening hours?") or posing philosophical queries instead of fact-seeking ones ("What is the meaning of life?"). As you can see from the examples, the scope of questions is quite broad. While they make sense to us, algorithms tend to specialise in narrow tasks, hence the need for several algorithms working in concert to determine which results are appropriate.
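As a rough illustration of the semi-structured case, the sketch below splits a raw query into free-text keywords and restrictions. The `key:value` syntax and the `source` and `lang` keys are hypothetical conventions I made up for this example. Real search engines each define their own operators.

```python
def parse_query(raw_query):
    """Split a raw query into free-text keywords and key:value filters.

    The key:value syntax is a hypothetical convention for this sketch.
    """
    keywords, filters = [], {}
    for token in raw_query.split():
        key, separator, value = token.partition(":")
        if separator and key and value:
            # Looks like a restriction, e.g. source:AP
            filters[key.lower()] = value
        else:
            # Anything else is treated as an ordinary keyword
            keywords.append(token)
    return " ".join(keywords), filters


text, filters = parse_query("moon landing source:AP lang:en")
print(text)
print(filters)
```

The keywords would then go to the search algorithm, while the filters narrow down which documents are considered at all.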

Query processing

These are the different types of transformations the system might need to perform on the original entry before passing it on to the search algorithm. These could be:

Extractive - where specific names, entities and places are extracted to help the search, by comparing them with values in the document metadata or against knowledge bases. For instance, for the query Neil Armstrong below, the information box on the left is a result of invoking Google’s knowledge base, because the query was successfully matched with an entry in it

Filters and constraints - where semi-structured queries specify some restrictions on the results, e.g. only news in English, those restrictions are translated into a narrower search scope for the search engine
Other transformations - modifications to the search itself, e.g. for wildcard or fuzzy search, in which case the original query may be transformed into one or many variants. For instance, with fuzzy search, we might allow for some number of character modifications to the keywords entered until we find the most likely word searched. See below the result in Google when I look for Neil Armslong. Even though a gentleman by the name of Armslong probably exists and is important, the system considers it far more likely that we made a typo.
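The "number of character modifications" idea behind fuzzy search is commonly measured with the Levenshtein edit distance. Below is a minimal sketch of it, together with a toy corrector. The vocabulary and the `max_edits` threshold are assumptions for illustration, and this is certainly not how Google implements spelling correction.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def suggest(word, vocabulary, max_edits=2):
    """Return the closest vocabulary word within max_edits, else the word unchanged."""
    best = min(vocabulary, key=lambda candidate: edit_distance(word.lower(), candidate.lower()))
    if edit_distance(word.lower(), best.lower()) <= max_edits:
        return best
    return word


vocabulary = ["Armstrong", "Aldrin", "Collins"]  # hypothetical list of known terms
print(suggest("Armslong", vocabulary))
```

"Armslong" is two edits away from "Armstrong" (substitute one letter, insert another), so it falls within the threshold and gets corrected. A word that is far from everything in the vocabulary is left alone.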

Search and Rank

Finally, one or more types of search & ranking approaches may be used. These will either be able to find an answer or return a ranked list of results matching the query. Ranking makes sure that more pertinent results are higher up - those might be results that mention keywords of the search more often than other results or contain relevant information to the query in their title or opening paragraphs. There are:

Keyword searches - the most common type, where exact or very close to literal matches are made. The majority of searches are still done this way. What makes them semantic is that they use term occurrences to rank documents that appear more relevant to a keyword higher, and they recognize when some of the keywords are rare, ranking hits on those higher than hits on more 'common' words in the query. A number of algorithms are available: BM25, tf-idf, various Learning to Rank methods, etc.
Contextual searches - by this I mean any search based on text embeddings that attempts to use the entire query to find contextually relevant results, as opposed to relying on specific keywords or phrases individually. We will focus on this a bit more later, as it is central to this series. Some recent advances in NLP techniques here will help us improve the quality of search significantly.

Let's quickly have a face-off: keyword vs contextual search. Searching for “virus threat” on news headlines, the left set of results is from a keyword approach, while the ones on the right are from contextual search. The latter gives us a number of results that do not match any search term directly, like example 5 on the right: “WHO highlights dangers of vector borne diseases”
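For the keyword side of this face-off, here is a minimal sketch of BM25 scoring, one of the ranking algorithms mentioned above. The headlines are made up, and `k1=1.2` and `b=0.75` are commonly used default values that I am assuming here, not settings taken from this article's experiments.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse document frequency weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalised slightly.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up headlines, for illustration only.
headlines = [
    "New virus threat worries health officials",
    "Security threat found in popular router",
    "Bomb threat closes city airport",
    "Infection rates climb as disease spreads",
]
ranked = sorted(zip(bm25_scores("virus threat", headlines), headlines), reverse=True)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

Because "virus" appears in only one headline while "threat" appears in three, a hit on "virus" contributes more to the score. The infection headline scores zero, since it contains neither query word. A contextual approach aims to surface exactly that kind of result.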

Knowledge base - as seen above, queries can be matched directly to entries in a knowledge base and used further for generating a result. More advanced techniques can also apply, where a keyword or natural language query is transformed into a query to a knowledge base. For instance, ‘Astronauts on the moon’ would return another knowledge base result

Question-answering - traditionally, search engines have used modifications from the processing step to transform a natural question into a more keyword-like query and process it as such. More recently, advances in NLP have shown strong performance by algorithms that directly pinpoint whether and where an answer to a natural question can be found within a specific document. Unlike the other search paradigms above, question answering focuses on providing an actual (single) answer as opposed to a list of documents. Here is what happens when we ask about the moon landing as a natural question. In addition to a list of results we get a specific answer.

However, the technique works similarly for a not-so-natural question: ‘year of first moon landing’

Finally, slight modifications to the query can break the result: we no longer get an explicit answer, and may even land somewhere else completely

Putting it all together

To summarize, any query type can pass through a number of different modifications and be run through any of a number of search mechanisms to produce candidate results. Each of these approaches will express a confidence in its results. However, confidence scores may not be comparable between different algorithms. At this stage, a further decision algorithm determines which answers are well suited and "confident" enough to be passed on to the user as the final list of answers.

A functioning search engine needs at least one option from each of the three steps of the process, and may combine several. We have seen that Google uses most of them under the hood, but what about making our own...
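One simple way to picture that final decision step is sketched below. It rescales each engine's scores to a common range before merging them. The min-max scaling, the 0.5 threshold and the raw scores are all assumptions of mine for illustration. The article does not prescribe a specific decision algorithm, and real systems are likely to be far more sophisticated.

```python
def normalise(results):
    """Min-max scale one engine's {document: raw_score} results into [0, 1]."""
    low, high = min(results.values()), max(results.values())
    if high == low:
        return {doc: 1.0 for doc in results}
    return {doc: (score - low) / (high - low) for doc, score in results.items()}


def merge(engine_results, threshold=0.5):
    """Keep each document's best normalised score and drop low-confidence ones."""
    best = {}
    for results in engine_results.values():
        for doc, score in normalise(results).items():
            best[doc] = max(score, best.get(doc, 0.0))
    confident = [(doc, score) for doc, score in best.items() if score >= threshold]
    return sorted(confident, key=lambda pair: pair[1], reverse=True)


# Made-up raw scores on two very different scales.
engine_results = {
    "keyword": {"doc_a": 12.4, "doc_b": 7.9, "doc_c": 1.3},
    "contextual": {"doc_b": 0.91, "doc_d": 0.88, "doc_a": 0.42},
}
for doc, score in merge(engine_results):
    print(f"{score:.2f}  {doc}")
```

Min-max scaling is crude: the weakest result of every engine always ends up at zero, however good it actually was. It does, however, show why raw scores such as 12.4 and 0.91 cannot simply be compared side by side.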

I should reveal my secret agenda…

I actually wanted to hack my own search engine all along.

The tool of choice is Elasticsearch, primarily because it actually comes out of the box with a lot of search features. At the same time, it is very well supported and gets you a long way in terms of open source features.

Here is a diagram of what we get out of the box with Elastic for the purposes of this discussion. Note that you should not rely on this summary being complete, as I have a specific objective in mind.

You will notice that Elastic can handle any query type (even though they will all be handled by default by a keyword search mechanism) and allows you to further modify your queries into fuzzy, wildcard and quite a few other forms. If the data allows this, one can also apply any number of structured conditions on the results: date of publishing, source, etc.

In terms of search & ranking, there is a lot of flexibility in keyword search but not much else.

Overall, this is a very impressive list of out of the box features. As it turns out, with some extra legwork we can even add contextual search, which is what we will do next...
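As a sketch of what such a request could look like, the snippet below builds an Elasticsearch query body that combines a fuzzy keyword match with structured conditions. The field names `headline`, `source` and `publish_date` are hypothetical and depend entirely on how your index is mapped. The snippet only constructs and prints the JSON body; it does not talk to a cluster.

```python
import json


def build_query(text, source=None, published_after=None):
    """Build an Elasticsearch bool query body: fuzzy full-text match plus optional filters.

    The field names (headline, source, publish_date) are hypothetical.
    """
    filters = []
    if source is not None:
        # Exact match on a keyword-type field
        filters.append({"term": {"source": source}})
    if published_after is not None:
        # Only documents published on or after the given date
        filters.append({"range": {"publish_date": {"gte": published_after}}})
    return {
        "query": {
            "bool": {
                # "AUTO" lets Elasticsearch choose the allowed edits by term length
                "must": [{"match": {"headline": {"query": text, "fuzziness": "AUTO"}}}],
                "filter": filters,
            }
        }
    }


body = build_query("Neil Armslong", source="AP", published_after="2020-01-01")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the `_search` endpoint of an index. The `must` clause contributes to the relevance score, while the `filter` clauses only include or exclude documents.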

Conclusion

We have explored the major building blocks of search, how they work together and the impact on search results. Different types of queries may trigger different search algorithms with a result that is a mixture of approaches. Looking at examples from Google we see that the same user experience (typing into a simple text box) is serviced by a number of techniques.

In the following articles, we will compare contextual and keyword search side-by-side (Pt 2) and finally combine a few different tools to extend Elasticsearch with additional contextual capabilities and build our own semantic search engine (Pt 3).

…

Btw, Neil Armslong

I hope you enjoyed reading this. We will be back with more in Pt 2 next week. In the meantime, if you feel like saying hi or would just like to tell me I am wrong, feel free to reach out via LinkedIn.

Special thanks to Rich Knuszka for valuable feedback.

Please note that I have no affiliation with Google or Elasticsearch, and the opinions and analysis are my own.