How BERT Determines Search Relevance

Understanding BERT’s limitations and biases will help you better understand how BERT and Search view the world and your content.

Todd Cook
Towards Data Science

--

On October 25, 2019, Pandu Nayak, Google's VP of Search, announced:

by applying BERT models to both ranking and featured snippets in Search, we’re able to do a much better job helping you find useful information. In fact, when it comes to ranking results, BERT will help Search better understand one in 10 searches in the U.S. in English, and we’ll bring this to more languages and locales over time. [1]

Google’s remarks and explanations raise some key questions:

  • How much better is BERT than prior search relevance efforts?
  • How are BERT models created? How are they fine-tuned?
  • What are the limitations and biases of BERT models?
  • How might these biases color how BERT sees webpage content?
  • Could a person use BERT to determine how well her content would perform for a particular query?
  • How does one “apply a BERT model” for a query and possible target pages to come up with a ranking?

How much better is BERT than prior search relevance efforts?

In 2015, Crowdflower (which became Figure Eight and is now part of Appen) hosted a Kaggle competition [2] in which data scientists built models to predict the relevance of search results given a query, a product name, and a product description. The winner, ChenglongChen, pocketed $10,000 when his best model took first place with a score of 72.189% [3]. Although the competition has been closed for five years, the data set is still available and the Kaggle scoring functionality still works for the private leaderboard (it just doesn't award any site points). I pulled the data, fine-tuned a BERT classification model, generated a submission, and it scored 77.327% [4].

(Image by author)

This result, although submitted years after the contest closed, shows how dramatically BERT has leapfrogged the prior state of the art. The contest winner used an ensemble of 12 (!) machine learning models to vote on the best result:

(Image by Chenglong Chen [3])

In contrast, my higher scoring result used one BERT model and a relatively simple pipeline:

(Image by author)

For my first (and last) model, featurization was simply "take the first 505 tokens" across the three pieces of data (query, product title, product description), without any special processing, and those are the results you see. We'll look at how and why BERT can perform well with such rough, dirty input later in this article.
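For the curious, here is a minimal sketch of that kind of "just truncate" featurization. It assumes the Hugging Face transformers library and the standard bert-base-uncased tokenizer; the helper name build_example is mine, and this is illustrative rather than the exact pipeline behind the submission above.

# A minimal sketch, assuming the Hugging Face transformers library;
# build_example is a hypothetical helper, not the submission's exact code.
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def build_example(query, product_title, product_description, max_tokens=505):
    # Concatenate the three fields with no special cleaning, then truncate.
    text = ' '.join([query, product_title, product_description])
    tokens = bert_tokenizer.tokenize(text)[:max_tokens]
    return bert_tokenizer.convert_tokens_to_string(tokens)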

The Kaggle Crowdflower Search Relevance data set has 20,571 labeled samples, and generating a submission requires predictions on 22,513 test samples. Although this is a small amount of data, and the domain is restricted to eCommerce products (data that the base BERT model wasn't trained on), the BERT classifier was nonetheless able to learn and predict with groundbreaking accuracy.

How are BERT models created? How are they fine-tuned?

BERT is an acronym for Bidirectional Encoder Representations from Transformers [5], and it is a language model. A language model encodes words and the log probabilities of words occurring together. The original BERT models were trained on English Wikipedia and the Toronto BookCorpus, with two training objectives: next-sentence prediction and masked-word prediction.

The next-sentence task takes pairs of neighboring sentences and gives them positive labels, then takes pairs of randomly chosen sentences and gives them negative labels; in this way, the BERT model learns to tell whether or not two sentences occurred in sequence. Many people theorize that this gives BERT a basis for some of the Natural Language Understanding (NLU) the model displays. In practice, BERT seems to know which words and sentences go together.

The masked-word task randomly hides a word and rewards BERT for predicting the missing word. This task, combined with network dropout, teaches BERT to infer a larger context from the surrounding words.

In practice, BERT is commonly used as the base layer of a more complex model; for example, an additional final layer is typically added and then fine-tuned to act as a classifier.
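As a concrete (and hedged) illustration of that last point, here is what "BERT plus a classification head" looks like with a recent version of the Hugging Face transformers library; the four labels match the relevance scale discussed later in this article, and the example inputs are made up:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Adds a randomly initialized linear layer on top of BERT; fine-tuning
# trains this new layer (and, typically, the BERT layers underneath it).
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

inputs = bert_tokenizer('ps4 console', 'Sony PlayStation 4 500GB', return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 4): one score per relevance class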

I will not explain the mechanics of the transformer model here; you can read about them in the original paper [5]. The details of the best fine-tuning techniques are still being worked out (judging by the number of arXiv papers being published), and although hyperparameter tuning depends on your data, further exploration will surely be rewarding. However, before we rush to obsess over details, let's not miss the main point: when a new model with suboptimal hyperparameter tuning beats the previous state of the art by a large margin, search engine companies adopt it. Perfect is the enemy of the good. And sometimes the new "good enough" is so good that companies immediately adopt it as a strategic advantage, even if the optimal fine-tuning regime hasn't been determined publicly.

To understand why BERT is so good at predicting search relevance, we'll have to look into some of its internals, limitations, and biases.

What are the Limitations and Biases of BERT models?

1. A limit of 512 tokens (roughly, 512 words)

The BERT baseline model accepts a maximum of 512 tokens. Although it's possible to construct a BERT model that accepts fewer tokens (e.g., 256 tokens for tweets), or to define and train one from scratch that accepts more (e.g., 1,024 tokens for larger documents), the baseline is 512 for virtually all of the commonly available BERT models.

If your page is longer than 512 tokens or words, search engines might:

  • Just take the first 512 tokens: if your page doesn't make its point in the first 512 tokens, the engine may not even see the rest (this is probably already true).
  • Reduce your page content to under 512 tokens via summarization algorithms (TextRank, deep learning, etc.) or via algorithms that drop unimportant words and sentences; but these computations are costly, so they might not be done for most pages.

Note: Although we say 512 tokens/words, in practice BERT will typically see at most about 505 tokens of your content (assuming a four-word query plus the three special separator tokens BERT requires). In practice, the number of your content's tokens under consideration by a search algorithm may be far less than 505, as we'll see.
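To make the arithmetic concrete, here is a small illustration (assuming the Hugging Face transformers library and the bert-base-uncased tokenizer; the query and page text are invented) of a query/page pair being packed into the 512-token budget:

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

query = 'waterproof hiking boots sale'        # a typical short query
page = 'Long product page text here. ' * 200  # deliberately far too long

# [CLS] query [SEP] page [SEP]; anything past the 512-token budget is cut.
encoded = bert_tokenizer(query, page, truncation='only_second', max_length=512)
print(len(encoded['input_ids']))  # 512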

2. Not all words are tokens: many common words become single tokens, but longer and unfamiliar words are broken up into subtokens.

A good illustration of this can be seen in words whose spellings differ between British and American English. Sometimes the subword tokenization can be quite costly:

bert_tokenizer.tokenize('pyjamas'), bert_tokenizer.tokenize('pajamas')
['p', '##y', '##ja', '##mas'], ['pajamas']

bert_tokenizer.tokenize('moustache'), bert_tokenizer.tokenize('mustache')
['mo', '##ust', '##ache'], ['mustache']

Sometimes, there is no difference:

['colour'], ['color']

but often the less familiar spellings yield multiple tokens:

['aero', '##plane'], ['airplane']
['ars', '##e'], ['ass']
['jem', '##my'], ['jimmy']
['orient', '##ated'], ['oriented']
['special', '##ity'], ['specialty']

Rarely, but sometimes, the British spelling variant is tokenized into fewer tokens:

['potter'], ['put', '##ter']
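These comparisons are easy to reproduce (assuming the Hugging Face transformers library and the standard uncased vocabulary; the word list here is just a sample):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

for british, american in [('pyjamas', 'pajamas'),
                          ('moustache', 'mustache'),
                          ('aeroplane', 'airplane')]:
    # Print the subword tokens for each spelling, side by side.
    print(bert_tokenizer.tokenize(british), bert_tokenizer.tokenize(american))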

3. Outright misspellings are implicitly penalized:

bert_tokenizer.tokenize('anti-establishment')
['anti', '-', 'establishment']

bert_tokenizer.tokenize('anti-establisment')
['anti', '-', 'est', '##ab', '##lism', '##ent']

Although these penalties may seem shocking, they actually show how forgiving BERT is; the model will try to make sense of just about anything you give it, rather than dropping misspelled words or ignoring something it hasn't seen before. Also, these biases are not a plot against British spelling variations but a side effect of the training data: a BERT model and its tokenizer have a limited vocabulary (typically about 30,000 entries, including subtokens), carefully chosen so that virtually any word can be encoded, with many of the most common words promoted to individual tokens. This popularity contest among words and tokens is based on the original training data. The original BERT models were trained on English Wikipedia and some additional text from the Toronto BookCorpus (11,038 books, 47,004,228 sentences). Clearly, British spelling variations weren't dominant in that corpus.

If you're analyzing documents with British English spellings, it would probably be profitable to normalize the spellings before feeding the text into a BERT model. A well-trained model can generalize to things it hasn't seen before, or has only been partially trained on, but the best performance comes with familiar data.
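A normalization pass can be as simple as a lookup table applied before tokenization; the tiny mapping below is purely illustrative (a real pipeline would use a full British-to-American spelling dictionary):

# Hypothetical, minimal spelling normalization; extend UK_TO_US as needed.
UK_TO_US = {
    'pyjamas': 'pajamas',
    'moustache': 'mustache',
    'aeroplane': 'airplane',
    'speciality': 'specialty',
}

def normalize_spelling(text):
    return ' '.join(UK_TO_US.get(word.lower(), word) for word in text.split())

print(normalize_spelling('Cotton pyjamas speciality shop'))
# Cotton pajamas specialty shop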

With many other language models and word vectors, it's easy to identify whether a word is new and whether the model has been trained on it; such words even have their own term: OOV, out of vocabulary. But it's not easy to determine whether BERT has never seen a word or has been repeatedly trained on it, since so many words are broken up into subtokens. This minor weakness, though, is a source of great strength: in practice, BERT can synthesize a word's meaning based on its history and understanding of similar neighboring tokens.

4. BERT will ignore some items entirely. Categorically, emojis are unknown to BERT.
- Typically, BERT tokenizes emojis as unknown (literally '[UNK]'), and if they aren't dropped when your page is compressed, they add no value when the model sees them.

bert_tokenizer.tokenize('😍 🐶 ❤️')
['[UNK]', '[UNK]', '[UNK]']
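A quick, hedged way to see how much of your own content falls into this bucket is to tokenize it and count the '[UNK]' tokens (again assuming the Hugging Face transformers library; the sample text is invented):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = 'Great value 😍 fast shipping 🐶'
tokens = bert_tokenizer.tokenize(text)
unknown = [t for t in tokens if t == '[UNK]']
print(f'{len(unknown)} of {len(tokens)} tokens carry no information for BERT')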

How might these biases color how BERT sees webpage content?

Fundamentally, since BERT models accept a limited number of tokens (typically fewer than 505 of yours), if your page uses unusual words or uncommon spellings, your content will be split into more tokens, and in effect the BERT model will see less of your page than it would of a similar page that uses more common words and popular spellings.

This does not mean you should aim to create pages that exactly mimic the style of Wikipedia. For a long time, search engines have preferred articles with general appeal, using common words and standardized spellings, written more like news or Wikipedia articles than an aimless wandering of verbiage. So in a sense, BERT natively supports the long-standing best practices of writing content for search engines.

Why is BERT so good at predicting search results?

Fundamentally, both of BERT's training objectives work together: word masking helps BERT build a context for understanding, and next-sentence prediction maps directly onto the task; after all, isn't the problem of content relevance often a matter of determining how well one search query "sentence" is paired with one search result "sentence"?

We have already seen how BERT can synthesize meaning from subword tokens and neighboring words. This skill gives BERT an advantage, since 15% of search queries contain words that have never been seen before [1]. That makes BERT a natural fit for predicting the meaning of unknown terms when determining search relevance.

Could a person use BERT to determine how well her content would perform for a particular query?

In short, probably not; to understand why, let's take a deep dive into how BERT is likely used to assess how well a query and a page match. At a high level, a search engine might pick a number of candidate pages and run your query against each of them to predict the relevance.

(Image by author)

Most search queries are four words or fewer, and most page summaries are under 505 words (otherwise it isn't much of a summary). Search relevance scores are commonly segmented as: 1. off topic, 2. okay, 3. good, and 4. excellent. [2]
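Here is a hedged sketch of what that query-time scoring might look like with a fine-tuned four-class relevance classifier (the model below is the untuned base checkpoint and the candidate pages are invented; in practice you would load your own fine-tuned weights):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)
model.eval()

def relevance_score(query, page_text):
    inputs = bert_tokenizer(query, page_text, truncation='only_second',
                            max_length=512, return_tensors='pt')
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Expected relevance on the 1 (off topic) .. 4 (excellent) scale.
    return float((probs * torch.arange(1, 5)).sum())

candidates = ['Sony PS4 500GB console bundle ...', 'Garden hose, 50 ft ...']
ranked = sorted(candidates, key=lambda page: relevance_score('ps4', page), reverse=True)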

When ML engineers build a model to estimate how well a query matches search results, it's common for them to train on about 1 million examples. Why so many? A deep learning model needs a lot of data to generalize well and predict things it hasn't seen before. If you're trying to build an all-purpose general search engine, you'll need a lot of data. However, if your search space is smaller, such as just eCommerce technology or just the products of a home-improvement website, then only a few thousand labeled samples may be necessary to beat the previous state of the art. Uncommon data is a regular component of search queries:

15 percent of those queries are ones we haven’t seen before
— Pandu Nayak, VP Search, Google

Several thousand labeled samples can provide some good results, and of course, a million labeled samples will likely provide great results.

How does one “apply a BERT model” for a query and a possible target page to come up with a ranking?

The Kaggle Crowdflower competition data provides interesting hints about how extra data is often used in practice. Typically, when more features are available, they are added to the model to make it more flexible and able to predict across a wider range of inputs.
For example, earlier we formulated the search ranking problem as:

(Image by author)

But in the Kaggle query data, extra information is sometimes available and sometimes missing, so the features would be formatted as:

(Image by author)

In some of the test cases, only the query and the product title are provided, and in real-world situations there might be little or no page content available.
For example, if your company has a product page for "Sony PS6 - Founders Edition" and that page has dynamic content like recent tweets, testimonials from purchasers, user images, etc., it is quite possible a search engine might use only the page title (or some type of metadata about the page) and effectively none of the page content. The lesson is clear: when publishing web content, focus first and foremost on relevant information that accurately reflects your product and content.
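In code, handling those optional fields can be as simple as dropping whatever is missing before the query/content pair is tokenized; the helper below is hypothetical and only meant to show the shape of the input:

def build_input(query, product_title, product_description=None):
    parts = [product_title]
    if product_description:  # simply omit missing fields
        parts.append(product_description)
    return query, ' '.join(parts)  # (text_a, text_b) pair for the tokenizer

text_a, text_b = build_input('ps6', 'Sony PS6 - Founders Edition')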

BERT is here to stay, and its impact on search relevance is only going to increase. Any company that provides search to its customers or in-house clients can use BERT to improve the relevance of its results. With very little data, a BERT classifier can beat the previous state of the art, and more data will help yield better results and more consistent performance.

References

[1] Pandu Nayak, Google Blog: https://blog.google/products/search/search-language-understanding-bert
[2] Kaggle: Crowdflower Search Results Relevance
[3] ChenglongChen, Crowdflower Search Results Relevance, 1st place solution (Kaggle)
[4] ML-You-Can-Use: Searching — Search Results Relevance using BERT
[5] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

(Thanks to Zhan Shi for reviewing and commenting on a draft of this article.)
