How we built an AI-powered search engine (without being Google)
And how you can too!

Coauthored by Jack Pertschuk. Check out our GitHub.
In this article, I’ll recount the difficulties of building a generalizable, AI-powered search engine, and how we developed our solution, NBoost.
An exploding field
AI Information Retrieval (IR) is a booming area of research. Work in this field focuses on retrieving the most relevant results based on the meaning of the query and the documents, not just keyword overlap. Cutting-edge studies generally take existing deep neural networks (such as Google’s BERT) and train them to rank search results. However, the problems are abundant (I’ll talk about them below). Building a robust, scalable semantic search engine is no small feat, so it’s really no wonder Google makes so much money.
The hurdles
- It’s hard to beat existing solutions. Existing search engines such as Elasticsearch use text-matching algorithms such as BM25 (Best Match 25), which score documents based on term frequency, document length, and other word statistics. These algorithms work surprisingly well, which makes them hard to beat.
- Even if you do beat existing solutions, it’s hard to generalize. A frequently encountered problem in machine learning is training a model so heavily on a specific task that it cannot draw conclusions about a new one. This is called overfitting. Even if your model produces better search results for research articles than text-based search engines do, that doesn’t mean it will work as well on cooking recipes.
- State-of-the-Art (SoTA) models are often slow and unscalable. Even if you’ve got the perfect model that both beats text-matching algorithms and works across many domains, it may be too slow to use in production. Generally, SoTA models (such as BERT) have to run on special hardware (a GPU) to keep up with production workloads, and that hardware is expensive, both computationally and financially. To build a search engine that ranks millions of documents, you can’t just tell a large model to score every document one by one.
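To make the text-matching baseline concrete, here is a minimal sketch of a BM25 scorer. The corpus, query, and function name are illustrative, and this is a simplified version of the formula (k1=1.2 and b=0.75 are the common defaults), not Elasticsearch’s exact implementation:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Score each document in `corpus` against `query` with simplified BM25."""
    docs = [doc.lower().split() for doc in corpus]
    avg_len = sum(len(d) for d in docs) / len(docs)
    n_docs = len(docs)
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # document frequency: how many docs in the corpus contain the term
            df = sum(1 for d in docs if term in d)
            if df == 0:
                continue
            # rarer terms get a higher inverse-document-frequency weight
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            freq = tf[term]
            # term-frequency saturation, normalized by document length
            score += idf * freq * (k1 + 1) / (
                freq + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores
```

Note that a document sharing no words with the query scores zero no matter how relevant its meaning is; that blind spot is exactly what neural rerankers are meant to fix.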

How we did it
As I mentioned previously, there’s a massive amount of research on machine learning for search engines, with researchers competing for top spots on Information Retrieval benchmarks such as MS MARCO. Some of these models more than double the quality of search results compared to existing search engines. We used these models, learned from them, and created our own (with top benchmark scores). This is how we beat existing solutions.
We realized that none of this would be very useful if we couldn’t scale it. That’s why we built NBoost. When you deploy NBoost, you deploy a cutting-edge model that sits between the user and the search engine, a sort of proxy. Every time the user queries the search engine, the model reranks the search results and returns the best ones to the user. We also built in support for deploying NBoost to the cloud and scaling it across as many machines as needed via Kubernetes. This combats the scalability problem.
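The proxy flow above can be sketched in a few lines. This is a toy illustration of the over-fetch-then-rerank pattern, not NBoost’s actual internals; `search_engine` and `model_score` are hypothetical stand-ins for the upstream engine and the neural reranker:

```python
def rerank_proxy(query, search_engine, model_score, fetch_k=50, return_n=10):
    """Proxy a query: over-fetch candidates from the search engine,
    rerank them with the model, and return only the best few.

    `search_engine(query, k)` returns k candidate documents;
    `model_score(query, doc)` returns a relevance score.
    """
    # 1. Ask the upstream engine for more candidates than the user needs.
    candidates = search_engine(query, fetch_k)
    # 2. Score each (query, document) pair with the neural model.
    scored = [(model_score(query, doc), doc) for doc in candidates]
    # 3. Hand back the top-n documents by model score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:return_n]]
```

The key design point is that the expensive model only ever sees `fetch_k` candidates per query, not the whole index, which is what keeps the approach tractable at scale.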
From the get-go, we wanted to create a platform that could be a foundation for domain-specific search engines. Therefore, we needed to make sure that NBoost was generalizable enough to be applied to different applications and datasets within a domain of knowledge. The NBoost default model was trained on millions of Bing queries (MS MARCO). We found that our default model increased the relevance of search results by 80% over out-of-the-box Elasticsearch. To test the generalizability of the model on a different corpus, we tested it on Wikipedia queries (TREC CAR), a dataset that it had not seen before. It was a pleasant surprise when the results revealed that the default model boosted search results by 70% on this unfamiliar dataset.
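For readers wondering how a relevance gain like the one above is measured: the MS MARCO passage-ranking leaderboard scores submissions with mean reciprocal rank at a cutoff (MRR@10), so a relative improvement is computed over a metric of this shape. A minimal sketch (the function name and toy inputs are illustrative):

```python
def mrr_at_k(ranked_results, relevant, k=10):
    """Mean reciprocal rank at cutoff k.

    `ranked_results` holds one ranked list of document ids per query;
    `relevant` holds the set of relevant document ids for each query.
    """
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results[:k], start=1):
            if doc_id in rel:
                # only the first relevant hit counts toward MRR
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

If a reranker lifts the first relevant result from rank 2 to rank 1 on average, MRR roughly doubles, which is how percentage gains like those above are reported.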

You can reproduce our results here.
You can too!
While we were building NBoost, we went out of our way to make our tools open source and easy to use. We made it available via pip, Docker, and Helm (Kubernetes). Our models are hosted in Google Cloud Storage buckets and are downloaded automatically when you run NBoost via nboost --model_dir <model>. You can find the list of available models in our benchmarks table.
You can follow our tutorial to create your own AI search engine!

