
This year’s Berlin Buzzwords focused particularly on what may become the future of search: vector search, alongside other really cool topics like scaling Kafka, distributed systems tracing with OpenTelemetry, and increasing job satisfaction. A few sessions dedicated to vector search included some impressive demos of dense retrieval techniques that enable question answering over your text lake.
In the Ask Me Anything: Vector Search! session, Max Irwin and I discussed the major topics of vector search, ranging from its areas of applicability, to how it compares to good ol’ sparse search (TF-IDF/BM25), to its readiness for prime time and which specific engineering elements need further tuning before offering this to users. If you are interested in getting your hands dirty with vector search, you can start by reading the series of blog posts I wrote on this topic (in Solr, Lucene and Elasticsearch) and then jump to the GitHub repo (during the conference, participants reached out to tell me that they run internal demos using it, so you might too).
Update: the recording of the AMA session is here:

https://www.youtube.com/watch?v=blFe2yOD1WA
For this blog post I decided to pick the three most important questions from the friendly Internet audience (we circulated an online form prior to the event to collect a set of very interesting and deep questions on vector search) and give my portion of the answers, expanding slightly around the edges (and augmenting with papers and code where possible).
We see some patterns that have emerged in the space of dense retrieval both from the research side as well as in industry. What are your thoughts on what’s coming next in dense retrieval? Where are things heading and what will people need to do to prepare?
There was a recent paper from Google Research on training an embedding model at the byte level, which will help solve various daunting issues with misspelled queries. Another paper applies the Fourier Transform to speed up BERT: 7x faster with 92% accuracy. So the community is moving ahead on solving various issues with embeddings, and that will fuel further growth of dense retrieval / reranking and vector search at large. One thing to pay attention to is whether these models generalize: according to the BEIR benchmarking paper, dense retrieval methods don’t generalize well and beat BM25 only when the model has been trained on the same domain. In comparison, the fastest methods are reranking-based, like ColBERT, with one condition: prepare to allocate roughly 10x more disk space to store the index compared to a BM25 index (a specific figure: 900 GB vs 18 GB). When it comes to productizing bleeding-edge research, you need to holistically assess the impact of a particular search method on scalability, search speed and indexing footprint, besides thinking about how neural search will co-exist with your current search solution. Will you still allow pre-filtering? Will users have a say in what results they see on the screen? Is your UX going to support this search-engine paradigm shift smoothly and still keep user efficiency on par with the current level during the transition?
Also, when you think about the building blocks of vector search for your domain, pick the similarity metric carefully: the cosine metric favors shorter docs in ranking, while the dot product favors longer documents, so a combination of these, or even a dynamic metric selection process, might be needed (see the small sketch after this paragraph). This goes back to carefully designing the whole search stack and/or choosing a search vendor. Performance of vector search at large is an unsolved issue, so you need to look for the model configuration that works best for your search engine and pay less attention to the error margins reported by big players. I can also recommend reading good survey papers, like https://arxiv.org/abs/2106.04554, to see the wider picture first. For those of you who would like to practice on billion-scale datasets and see the power and limitations of existing ANN algorithms, there is an excellent Big ANN competition announced as part of NeurIPS 2021.
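To make the metric difference concrete, here is a toy NumPy sketch (not tied to any particular embedding model; the vectors are made up for illustration): the dot product rewards vectors with larger norms, which in practice often correlates with longer documents, while cosine normalizes the magnitude away.

```python
# Toy illustration of dot product vs cosine similarity (made-up vectors).
import numpy as np

query = np.array([0.3, 0.8, 0.5])

short_doc = np.array([0.3, 0.7, 0.6])        # similar direction, small norm
long_doc = 3.0 * np.array([0.2, 0.9, 0.4])   # similar direction, large norm

def dot(a, b):
    return float(np.dot(a, b))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, doc in [("short_doc", short_doc), ("long_doc", long_doc)]:
    print(name, "dot:", round(dot(query, doc), 3), "cosine:", round(cosine(query, doc), 3))

# Typical outcome: long_doc wins under dot product purely because of its larger
# norm, while cosine ranks the two documents by direction alone.
```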
A lot of ML applications use faiss/annoy/nmslib behind a simple webservice for ANN retrieval, e.g. in recommender systems. This works well for simple applications, but when efficient filtering is required it seems you need to take the leap to a full-fledged search system (Elasticsearch, Vespa etc.). Do you think there’s an unserviced niche for a "faiss plus filter" tool, or do you think the additional benefits of a search system like Vespa pay for the additional complexity it brings?
Different vendors offer different approaches to building an ANN index. In Elasticsearch world you have two options:
- Elastiknn plugin implementing LSH.
- The Open Distro (OpenSearch) fork of Elasticsearch, with the HNSW method implemented as an off-heap graph search.
Elastiknn supports pre-filtering the results using a field filter, like color:blue. Open Distro implements pre-filtering by re-using familiar Elasticsearch functionality, like script scoring and Painless extensions. By the way, I experimented with both of these in my previous blog posts (mentioned in the beginning) and implemented indexing and search components to demonstrate each approach.
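To give a flavour of the pre-filtering approach, here is a minimal sketch of a script-scored k-NN query sent from Python. It assumes an Open Distro / OpenSearch cluster with the k-NN plugin enabled and an index "products" with a vector field "embedding" and a keyword field "color"; the exact script parameters can differ between plugin versions, so treat it as illustrative rather than canonical.

```python
# Sketch: pre-filter by color:blue, then score only the filtered docs
# with the k-NN plugin's scoring script (cosine similarity space).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query_vector = [0.1, 0.2, 0.3, 0.4]  # would come from your embedding model

body = {
    "size": 10,
    "query": {
        "script_score": {
            # Pre-filter: only documents matching color:blue are scored.
            "query": {"term": {"color": "blue"}},
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "embedding",
                    "query_value": query_vector,
                    "space_type": "cosinesimil",
                },
            },
        }
    },
}

response = es.search(index="products", body=body)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```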
Whichever method you choose, you need to carefully select hyper-parameters to strike the best balance between indexing speed, recall and memory consumption. HNSW scales well to multicore architectures, and it has a bunch of heuristics to avoid local minima and to build a well-connected graph. But building a graph for each segment in Lucene might become super expensive in terms of RAM and disk usage, so you should consider merging segments into one before serving queries (so plan to allocate more time for such an optimization than you would for a pure BM25 index). I think combining filtering with ANN as one single phase in search is a wise decision, because a multi-step retrieval will likely suffer from low speed, low recall, or both. Besides, users will likely want to control the boundaries of the document space in situations where they know in advance that a particular search will yield a very high number of documents (like, for example, industry research or a patent prior-art investigation).
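As a quick illustration of those hyper-parameters, here is a small sketch using the standalone hnswlib library (not Lucene’s HNSW implementation, but the knobs are analogous): M controls graph connectivity, ef_construction trades indexing time for graph quality, and ef trades query time for recall.

```python
# Sketch of HNSW hyper-parameters with hnswlib on random data.
import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# Higher M / ef_construction -> better-connected graph and higher recall,
# at the cost of more RAM and slower indexing.
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# Higher ef -> better recall at query time, slower queries.
index.set_ef(100)
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape, distances.shape)
```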
Is there a content length "sweet spot" where dense vectors have a clear advantage over sparse vectors (plain tf*idf)?
You can attack the problem of long docs by paying more attention to what is written in them, with the help of your domain expert team. Are the first and second paragraphs the most important? Are specific chapters in the document important to specific information needs? If yes, get them annotated for significance and semantic role, load them into a multi-field inverted index, and use BM25 as your baseline. By the way, when measuring search quality, you can re-use the industry standards, like DCG@P, NDCG@P and AP@T. Choosing the right scorer to optimize search quality for is an art in itself, but if you want to start, head over to Quepid (with hosted or on-prem deployment, pure open source), connect it to Solr / Elasticsearch and start rating queries to understand your current search quality today. Believe me, this investment will pay off and generate a ton of ideas on where to improve, leading to a more structured evolution of your search engine. Here is a live demo I did for students this year on movie search using Quepid: https://www.youtube.com/watch?v=OOYsWn3LWsM&t=1068s
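If you want to compute such a metric yourself before wiring up a full tool, here is a tiny sketch of NDCG@k over graded relevance judgments (the kind of ratings you collect in Quepid). Metric definitions vary slightly; this one uses linear gains with the common log2 discount.

```python
# Sketch of NDCG@k over graded relevance judgments for one query.
import math

def dcg_at_k(relevances, k):
    # rank is 0-based here, hence the "+ 2" inside the log.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded judgments for the top results your engine returned,
# e.g. 3 = perfect match, 0 = irrelevant.
ratings = [3, 2, 0, 1, 0]
print(round(ndcg_at_k(ratings, 5), 3))
```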
Dense retrieval has a natural limit of 512 word pieces, beyond which the model simply does not perform well, both in terms of indexing speed and in terms of how much of the semantics of a long text you can compress into a single vector. "All neural approaches have limitations with document lengths as they have limit of 512 word pieces." – from the BEIR paper.
There were many more questions from the audience during the AMA session. Please watch the recording to learn more, and have fun with vector search!