TinyBERT for Search: 10x faster and 20x smaller than BERT

Speeding up the algorithm Google uses to answer your questions so it can run on a standard CPU

Jack Pertschuk
Towards Data Science

--

Co-authored by Cole Thienes

Recently, Google introduced a new method of understanding searches and deciding which results you see. This method, based on the popular open-source transformer BERT, uses language understanding to pick up on the meaning behind searches in a way that traditional keyword methods aren’t able to.

We built NBoost to make it easy for people who are not Google to also use advanced search ranking models, and in the process developed TinyBERT for search, which I introduce in this article.

Particularly for longer, more conversational queries, or searches where prepositions like “for” and “to” matter a lot to the meaning, Search will be able to understand the context of the words in your query. You can search in a way that feels natural for you.
- Pandu Nayak, VP of Search @ Google

Making BERT Smaller and Faster

BERT has been shown to improve search results, but there’s a catch: it takes a huge number of computers to run these query understanding models. This is especially true when speed matters and millions of searches have to be processed. This challenge is so formidable that Google even built their own hardware (Cloud TPUs) to run the models on. And the code they use to run these TPUs in production is private, so anyone else who wants to run it is out of luck.

In order to run these models on standard hardware, we use knowledge distillation, a process in which a larger "teacher" network is used to train a smaller "student" network that maintains most of the teacher's accuracy while using fewer, smaller layers, making it both faster and lighter.

Knowledge distillation (image source: https://nervanasystems.github.io/distiller/knowledge_distillation.html)
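To make the idea concrete, here is a minimal sketch of a soft-target distillation loss in PyTorch. It illustrates the general technique rather than the exact objective in the training code we used; the temperature and weighting values are placeholders.

# Minimal sketch of one distillation objective (illustrative only; the
# actual training code we used differs in its details).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on the hard labels with a KL term that pushes
    the student's softened distribution toward the teacher's."""
    # Temperature > 1 softens the distributions so the student also learns
    # the teacher's relative preferences, not just the argmax.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Inside a training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(**batch).logits
# student_logits = student(**batch).logits
# loss = distillation_loss(student_logits, teacher_logits, batch["labels"])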

TinyBERT Architecture

We used the code from this repo for knowledge distillation and modified it for training and evaluation on the MS Marco dataset. We initially trained a teacher bert-base-uncased network in PyTorch on the MS Marco training triples. We then used it as a teacher to train a smaller student BERT network with only 4 hidden layers instead of the standard 12. Additionally, each of these layers has a hidden size of only 312 instead of 768, making the model even more lightweight. A feedforward binary classification layer at the end of BERT produces the scores used for search ranking.

BERT for search scores pairs of (question, answer) or (search, search result) and then ranks results based on these scores
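Below is a rough sketch of such a student reranker using Hugging Face's transformers, with hyperparameters mirroring the bert_config.json shown next. The weights here are randomly initialized (the real student is trained via distillation on MS Marco), the built-in classification head of BertForSequenceClassification stands in for the feedforward scoring layer described above, and the query and passages are made up.

# Sketch of a 4-layer, 312-dim BERT student that scores (query, passage) pairs.
import torch
from transformers import BertConfig, BertForSequenceClassification, BertTokenizerFast

config = BertConfig(
    hidden_size=312,         # vs. 768 in BERT Base
    num_hidden_layers=4,     # vs. 12 in BERT Base
    num_attention_heads=12,
    intermediate_size=1200,  # vs. 3072 in BERT Base
    num_labels=2,            # relevant / not relevant
)
student = BertForSequenceClassification(config)
student.eval()
# Reuse the standard BERT vocabulary (30522 tokens, matching vocab_size below).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def score(query: str, passage: str) -> float:
    """Return the 'relevant' logit for a single (query, passage) pair."""
    inputs = tokenizer(query, passage, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = student(**inputs).logits
    return logits[0, 1].item()

# Rerank candidates (e.g. the top results from a keyword search) by score.
query = "how does knowledge distillation work"
candidates = ["Knowledge distillation trains a small model to mimic a large one ...",
              "Distillation is a method of separating liquids by boiling ..."]
reranked = sorted(candidates, key=lambda p: score(query, p), reverse=True)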

The following is a sample bert_config.json for the TinyBERT architecture we use. The notable differences from a standard bert_config are hidden_size (312 vs. 768), intermediate_size (1200 vs. 3072), and num_hidden_layers (4 vs. 12).

{
  "attention_probs_dropout_prob": 0.1,
  "cell": {},
  "emb_size": 312,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 312,
  "initializer_range": 0.02,
  "intermediate_size": 1200,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pre_trained": "",
  "structure": [],
  "type_vocab_size": 2,
  "vocab_size": 30522
}
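If you save this file as bert_config.json, you can sanity-check it with transformers; the extra keys from the distillation repo ("cell", "structure", "pre_trained", "emb_size") should simply be carried along as extra config attributes.

# Load the config above and build the student (weights untrained here).
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig.from_json_file("bert_config.json")
config.num_labels = 2  # binary relevance head for reranking
model = BertForSequenceClassification(config)
print(sum(p.numel() for p in model.parameters()))  # student parameter count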

Evaluating the Model

[1] MRR when reranking the top 50 results from BM25. "Bing queries" refers to MS Marco. Speed measured on a K80 GPU.

MS Marco is the largest publicly available source of real-world search engine usage data, making it ideal for evaluating search and question answering models. It contains real Bing queries along with information about which results users ultimately clicked. When BERT Base was first used on MS Marco, it beat the previous state of the art by 0.05 MRR (a lot), and BERT-based solutions are still at the top of the leaderboard. Our goal was to find a way to achieve this boost with a model fast enough to use in the real world.
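For readers unfamiliar with the metric, MRR (Mean Reciprocal Rank) is just the average over queries of 1/rank of the first relevant result. A minimal sketch, with a made-up toy example (MS Marco's official metric uses a cutoff of 10):

def mean_reciprocal_rank(ranked_lists, relevant_sets, k=10):
    """ranked_lists: one ranked list of doc ids per query.
    relevant_sets: the matching set of relevant doc ids per query."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_lists)

# Toy example: first query's relevant passage is at rank 2, second at rank 1.
print(mean_reciprocal_rank([["d3", "d7"], ["d1", "d9"]],
                           [{"d7"}, {"d1"}]))  # (1/2 + 1/1) / 2 = 0.75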

Enter TinyBERT. While not as effective as BERT Base for reranking, our experiments show that it retained 90% of the MRR score of BERT Base (0.26 vs. 0.29 reranking the top 50 from BM25) while being ~10x faster and ~20x smaller. However, results based on academic benchmarks such as MS Marco often lack real-world generalizability and should be taken with a grain of salt.
