Maximizing BERT model performance

An approach to evaluating a pre-trained BERT model to increase its performance

Ajit Rajasekharan
Towards Data Science
14 min read · Nov 4, 2020


Figure 1. Training pathways to maximize BERT model performance. For application domains where generic entity types (people, locations, organizations, etc.) are dominant, pathway 1a-1d suffices: start with a publicly released BERT model (bert-base/large, cased/uncased, or one of the tiny BERT variants), optionally train it further (1c, continual pre-training), and then fine-tune it for a specific task with labeled data (1d). For a domain where people, locations, organizations, etc. are not the dominant entity types, continual pre-training of the original BERT model (1c) on a domain-specific corpus, followed by fine-tuning, may not boost performance as much as pathway 2a-2d, because the vocabulary in pathway 1a-1d is still the original BERT vocabulary, with its entity bias towards people, locations, and organizations. Pathway 2a-2d instead trains a BERT model from scratch using a vocabulary generated from the domain-specific corpus. Note: any form of training (pre-training, continual pre-training, or fine-tuning) modifies both the model weights and the vocabulary vectors; the varying shades of the model (beige) and of the vocabulary (blue/green) across the training stages, left to right, illustrate this. The box labeled "?" is the focus of this article: evaluating a pre-trained or continually pre-trained model to improve model performance. Image by Author
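
To make pathway 1c concrete, the sketch below shows one way to continue pre-training a released BERT checkpoint on a domain-specific corpus with the masked-language-model objective, using the Hugging Face transformers library. The corpus file name, hyperparameters, and output directory are illustrative assumptions, not values from the article.

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Pathway 1c: start from a released checkpoint and keep its original vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical domain-specific corpus, one sentence or document per line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="domain_corpus.txt",
    block_size=128,
)

# Standard BERT masking: 15% of tokens are selected for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-base-cased-continued",  # illustrative output location
    num_train_epochs=1,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("bert-base-cased-continued")
```

Fine-tuning (1d) would then start from the saved checkpoint, adding a task-specific head and training on labeled data.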

TL;DR

Training a BERT model from scratch on a domain-specific corpus, such as the biomedical domain, with a custom vocabulary generated from that corpus has proven critical to maximizing model performance in that domain. This is largely because the original BERT vocabulary is biased towards generic entity types such as people, locations, and organizations, whereas a vocabulary generated from the domain-specific corpus captures the terms that dominate it.
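
As one illustration of how a custom vocabulary for pathway 2a-2d might be generated, the sketch below trains a WordPiece vocabulary on a domain-specific corpus with the Hugging Face tokenizers library; the corpus file, vocabulary size, and output directory are assumptions for illustration only.

```python
import os

from tokenizers import BertWordPieceTokenizer

# Pathway 2a: build a WordPiece vocabulary from the domain-specific corpus
# instead of reusing the original BERT vocabulary.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["domain_corpus.txt"],  # hypothetical corpus, e.g. biomedical text
    vocab_size=30522,             # same size as the original BERT vocabulary
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, which from-scratch pre-training (2c) would consume.
os.makedirs("domain_vocab", exist_ok=True)
tokenizer.save_model("domain_vocab")
```

From-scratch pre-training (2c) would then initialize a BERT model with this vocabulary and train it on the same corpus with the MLM objective, before fine-tuning (2d) on a labeled task.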
