Hybrid Search 2.0: The Pursuit of Better Search

A Journey of Learning, Improvement, and the Quest for the Ultimate Hybrid Search System

Noam Schwartz
Towards Data Science


Photo by bert b on Unsplash

The idea of combining the strengths of text and vector search has gained momentum in the field of search systems as a way to improve search relevancy and accuracy. I discussed using OpenSearch to build a hybrid search engine in a recent blog post. By integrating text-based lexical search with vector-based semantic search, we were able to improve both the latency and the accuracy of our search results.

I’ve recently been contemplating the drawbacks of that hybrid system and possible improvements. In this article, I’ll examine three weak areas of the prior system and suggest changes that strengthen it overall and deliver better results.

Please read my prior blog post before continuing, as I refer to the steps described there.

  1. The normalization function was slightly biased; it consistently weighted text search higher, giving it more influence on the final results.

Distance-based algorithms such as K-Nearest Neighbors (KNN) calculate distances between data points, whereas BM25 is based on the frequency of keyword occurrences; the two return scores on completely different scales. This can lead to biased results and inaccurate rankings. Our normalization procedure always produced a perfect score (1) for at least one document in the lexical search result set, so in our case the results were biased in favor of the lexical search.
To tackle this problem, let’s explore two commonly used functions: Min-Max normalization and Z-Score normalization. The Z-Score method scales the data to achieve a zero mean and unit variance, while Min-Max normalization rescales the data to fit within a specific range.
The key idea is that if I compute the parameters used in these normalizing functions beforehand, over a representative set of queries, and apply them during the normalization stage, I get a basic understanding of how scores are distributed for each search type. The two functions’ formulas are:
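In standard notation, with x a raw score from one search type:

$$x_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad x_{\text{zscore}} = \frac{x - \mu}{\sigma}$$

where x_min, x_max, μ, and σ are the minimum, maximum, mean, and standard deviation of the scores collected for that search type.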

Each has advantages of its own, and the structure of your index can help you decide between them. If your documents are similar to one another, and the top-k results of a typical query are very similar and tightly clustered within the index, as seen in the graph below, Min-Max may be the better option.

Diagram by the Author

However, Z-Score is better suited if the scores are more evenly distributed and roughly follow a normal distribution, as shown in the example below.

Diagram by the Author

Both approaches require certain parameters to be determined: the mean, standard deviation, minimum score, and maximum score. We must compute these values separately for each search type, because lexical and vector results use different scoring systems. To do this, let’s run 1000 random queries through both a lexical search and a vector search. If you don’t have queries, you can use OpenSearch’s scroll API to extract text fields from different parts of your index and use them as queries. Set k to a significant value, for example k=1000, to better understand the ratios within our score range, but be careful not to set k too high, since that can skew the Min-Max function. After collecting all of these scores, simply compute the necessary parameters.
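If you need to sample queries this way, a minimal sketch using the scroll API might look like the following; the host, index name, and the caption field are placeholders for illustration, and it assumes the opensearch-py client.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # assumed local cluster
INDEX_NAME = "my-index"  # hypothetical index name

# Pull caption texts in batches with the scroll API and use them as queries
queries = []
response = client.search(
    index=INDEX_NAME,
    body={"query": {"match_all": {}}, "_source": ["caption"], "size": 100},
    scroll="2m"
)
while response["hits"]["hits"]:
    queries.extend(hit["_source"]["caption"] for hit in response["hits"]["hits"])
    response = client.scroll(scroll_id=response["_scroll_id"], scroll="2m")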

# Lexical search: collect BM25 scores for up to 1000 results per query
import numpy as np

text_scores = []
for query in queries:
    response = client.search(
        index=INDEX_NAME,
        body={
            "query": {
                "match": {
                    "caption": query
                }
            },
            "size": 1000
        }
    )
    scores = [hit['_score'] for hit in response['hits']['hits']]
    text_scores.append(scores)

# Vector search: collect cosine-similarity scores for the same queries
vector_scores = []

# Vectorize queries using SentenceTransformer
query_embeddings = model.encode(queries)

# Perform vector search (exact k-NN via script_score)
for query_embedding in query_embeddings:
    request_body = {
        "size": 1000,
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "caption_embedding",
                        "query_value": query_embedding.tolist(),
                        "space_type": "cosinesimil"
                    }
                }
            }
        }
    }
    response = client.search(
        index=INDEX_NAME,
        body=request_body
    )
    scores = [hit['_score'] for hit in response['hits']['hits']]
    vector_scores.append(scores)

# Flatten the per-query score lists (result counts may differ between queries)
all_vector_scores = np.concatenate(vector_scores)
all_text_scores = np.concatenate(text_scores)

# Normalization parameters for the vector (semantic) scores
vector_score_mean = np.mean(all_vector_scores)        # mean
vector_score_std = np.std(all_vector_scores, ddof=1)  # standard deviation
vector_score_min = np.min(all_vector_scores)          # minimum score
vector_score_max = np.max(all_vector_scores)          # maximum score

# Normalization parameters for the lexical (BM25) scores
text_score_mean = np.mean(all_text_scores)            # mean
text_score_std = np.std(all_text_scores, ddof=1)      # standard deviation
text_score_min = np.min(all_text_scores)              # minimum score
text_score_max = np.max(all_text_scores)              # maximum score

The process is shown in the diagram below:

Diagram by the Author

Set aside the parameters you’ve extracted for the lexical and vector results. This only needs to be done once per index. Finally, in the normalization step, we’ll use these parameters in the following way:

Normalizing using the Z-Score function (Diagram by the Author)
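As a concrete illustration, here is a minimal sketch of applying the precomputed parameters at query time. The function names and the clipping of Min-Max values to [0, 1] are my own choices for the example, not part of the original post.

def min_max_normalize(score, min_score, max_score):
    # Rescale a raw score into [0, 1] using the precomputed min/max;
    # clip so that scores outside the sampled range stay in bounds
    normalized = (score - min_score) / (max_score - min_score)
    return max(0.0, min(1.0, normalized))

def z_score_normalize(score, mean, std):
    # Center and scale a raw score using the precomputed mean and standard deviation
    return (score - mean) / std

# Example: normalize the lexical and vector scores of one document
# (lexical_score and vector_score are assumed to be its raw scores)
lexical_norm = z_score_normalize(lexical_score, text_score_mean, text_score_std)
vector_norm = z_score_normalize(vector_score, vector_score_mean, vector_score_std)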

2. Results that appeared in only one result set were handled “unfairly” and weren’t given adequate consideration as potential matches.

Previously, any result that appeared in only one result set received an arbitrary low score, which significantly hurt its final ranking. Let’s try an alternative strategy to solve this issue. Instead, we will assign documents that show up in only one type of result set the lowest score from the result set they are missing from. For instance, the document c5a44d-fa8d-4bba (colored in yellow) only showed up in my lexical (keyword) search results, so we assign it the lowest semantic search score among the top k results. This way, these results remain legitimate candidates, since they receive a score that lies within the range of the other scores.

Diagram by the Author
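Here is a minimal sketch of merging the two result sets under this policy. The merge_result_sets helper and the boost parameter naming are my own for illustration, and lexical_results / semantic_results are assumed to be dicts mapping document IDs to already-normalized scores.

def merge_result_sets(lexical_results, semantic_results, boost=0.5):
    # Documents missing from one set get that set's lowest observed score
    min_lexical = min(lexical_results.values())
    min_semantic = min(semantic_results.values())

    all_doc_ids = set(lexical_results) | set(semantic_results)
    hybrid_scores = {}
    for doc_id in all_doc_ids:
        lex = lexical_results.get(doc_id, min_lexical)
        sem = semantic_results.get(doc_id, min_semantic)
        # boost controls the lexical/semantic split (0.5 = even weighting)
        hybrid_scores[doc_id] = boost * lex + (1 - boost) * sem

    # Rank documents by their combined score, best first
    return sorted(hybrid_scores, key=hybrid_scores.get, reverse=True)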

3. The old method of evaluating results lacked a strong data-driven or scientific basis, so we had no principled way to select tuning parameters such as the boost.

Let’s tackle both of these problems at once. The task at hand is to compare two sets of promising results, lexical and semantic, and show that the hybrid results outperform both. To accomplish this, we’ll experiment on the MS MARCO dataset. MS MARCO is a dataset curated by Microsoft Research comprising an extensive collection of 3.2 million web documents and over 350,000 queries sourced from real Bing web searches. Initially designed for benchmarking question-and-answer (Q&A) systems, the dataset contains queries that resemble real-world questions. Given its Q&A nature, the relevance labels are straightforward: each query has a single document labeled “relevant” (1).

In our scenario, we adopt the metric used by the MS MARCO document ranking challenge: the mean reciprocal rank over the top 100 results (MRR@100). It takes the reciprocal rank (1/rank) of the first relevant document for each query and averages this across all queries.
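Written out, the standard definition is:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where |Q| is the number of queries and rank_i is the position of the first relevant document for query i (queries whose relevant document is not retrieved in the top 100 contribute 0).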

Source: Wikipedia

We’ll proceed by performing three different kinds of searches: lexical, semantic, and hybrid. We’ll experiment with various boost levels, changing them by 0.1 each time. Since we want to assess the Mean Reciprocal Rank at 100 (MRR@100) for all our search experiments, we set our top-k to 100. For each query, we compare the returned results with the “correct” labeled document ID and determine its rank.

def find_doc_id_rank(doc_id, document_ids):
    # Return the reciprocal rank of the relevant document, or 0 if it was not retrieved
    if doc_id in document_ids:
        position = document_ids.index(doc_id) + 1
        return 1 / position
    return 0

Finally, we average the reciprocal ranks over all queries to obtain the Mean Reciprocal Rank (MRR).
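A minimal sketch of the evaluation loop might look like this. Here run_hybrid_search stands in for whichever search variant is being evaluated (lexical, semantic, or hybrid with a given boost), and qrels is assumed to map each query to its relevant document ID; both names are my own for illustration.

import numpy as np

def mrr_at_100(queries, qrels, boost):
    reciprocal_ranks = []
    for query in queries:
        # Assumed to return the top-100 document IDs, best first
        document_ids = run_hybrid_search(query, boost=boost, k=100)
        reciprocal_ranks.append(find_doc_id_rank(qrels[query], document_ids))
    return np.mean(reciprocal_ranks)

# Sweep the boost from 0.0 to 1.0 in steps of 0.1
for boost in np.arange(0.0, 1.01, 0.1):
    print(f"boost={boost:.1f}  MRR@100={mrr_at_100(queries, qrels, boost):.4f}")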

The MRR@100 scores for each boosting ratio between 0 and 1 (in increments of 0.1) are shown in the tables below:

MRR@100 using the Z-Score function (Screenshot by the Author)
MRR@100 using the Min-Max function (Screenshot by the Author)

The findings presented above show that the Min-Max approach performed best when the boost leaned somewhat toward lexical search, although combining it with the semantic results still improved its accuracy. The Z-Score function, on the other hand, produced the best overall results with an even 50-50 split between lexical and semantic search, demonstrating that the hybrid search approach is the better choice.

In our quest for a more effective hybrid search system, we’ve faced challenges head-on and made significant improvements. We’ve taken steps to address inherent challenges, and the results are clear: the combination of lexical and semantic search techniques holds immense promise. This isn’t the final destination; it’s a stepping stone towards a more efficient, user-centric search experience. With data-backed methodologies at our disposal, I plan to keep improving the search system and embrace the ever-evolving challenges of information retrieval.

Many thanks to Yaniv Vaknin and especially Daphna Idelson for all of their help putting this together!
