Swiss Army knife for unsupervised task solving

BERT is a prize addition to the practitioner’s toolbox

Ajit Rajasekharan
Towards Data Science
14 min read · Jan 9, 2021


Figure 1. A few reasons why BERT is a valuable addition to a practitioner's toolbox, beyond its well-known use of fine-tuning for downstream tasks. (1) BERT's learned vocabulary of vectors (in, say, a 768-dimensional space) serves as the set of targets that masked output vectors predict during training, with the model learning from its prediction errors. After training, these moving targets settle into landmarks that can be clustered and annotated (a one-time step) and then used to classify model output vectors in a variety of tasks such as NER and relation extraction. (2) A model pre-trained well enough to achieve a low next sentence prediction loss (in addition to a low masked word prediction loss) yields quality CLS vectors representing any input term/phrase/sentence. The CLS vector needs to be harvested from the MLM head, not from the topmost layer, to get the best possible representation of the input (figure below). (3) The MLM head decoder bias is a useful score of the importance of a vocabulary term, roughly the equivalent of a TF-IDF score for vocabulary terms. (4) BERT's capacity to predict, in most cases, the entity type of a word in a sentence indirectly, through the vocabulary alternatives it proposes for that position, can be quite handy in addition to its use for NER tagging. Occasionally the prediction for a position may even be the correct instance, but this is typically too unreliable for direct practical use. (5) Vector representations for any input term/phrase (and their misspelled variants), either harvested directly from BERT's learned vocabulary or created using the CLS vector, largely subsume the context-independent vectors of prior models like word2vec and fastText, making BERT a one-stop shop for harvesting both context-dependent and context-independent vector representations. The only exception is input containing characters absent from BERT's vocabulary (e.g. a custom BERT vocabulary carefully chosen to exclude characters from languages outside the application domain, such as Chinese or Tamil). Central to harvesting all of these benefits is how well the model is pre-trained with a custom vocabulary on a domain-specific corpus of interest to our application. Image created by author
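The sketch below shows one way, using the Hugging Face transformers library, to pull three of the pieces called out in Figure 1 out of a pretrained masked language model: the learned vocabulary vectors of point (1), the MLM head decoder bias of point (3), and the vocabulary alternatives for a masked position of point (4). It is an illustrative sketch, not the article's exact pipeline; the checkpoint name bert-base-cased and the example sentence are placeholder assumptions, whereas the caption actually recommends a model pre-trained with a custom vocabulary on a domain-specific corpus. Point (2), harvesting the CLS vector through the MLM head rather than from the topmost layer, is covered by the figure referenced below and is left out here.

# A minimal sketch, assuming the Hugging Face transformers library and PyTorch.
# "bert-base-cased" and the example sentence are placeholders; the caption above
# recommends a model pre-trained with a custom vocabulary on a domain-specific corpus.
import torch
from transformers import AutoTokenizer, BertForMaskedLM

model_name = "bert-base-cased"  # placeholder checkpoint, not from the article
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

# (1) Learned vocabulary vectors: one vector (768-dimensional for BERT base) per
#     vocabulary term. These are the landmarks that can be clustered and
#     annotated once, then used to classify model output vectors.
vocab_vectors = model.get_input_embeddings().weight      # shape [vocab_size, hidden_size]

# (3) MLM head decoder bias: one scalar per vocabulary term, usable as a rough
#     TF-IDF-like importance score for that term.
term_importance = model.cls.predictions.decoder.bias     # shape [vocab_size]

# (4) Vocabulary alternatives for a masked position as an indirect entity-type
#     signal. The sentence is an illustrative placeholder.
text = f"{tokenizer.mask_token} is a city on the banks of the Seine."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                       # shape [1, seq_len, vocab_size]
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top_ids = logits[0, mask_index].topk(10).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))           # expect mostly city-like alternatives

With a reasonably pre-trained model, the tokens printed for the masked slot tend to be city names or similar terms, which is the kind of indirect entity-type signal point (4) describes.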

TL;DR

Natural language processing tasks that traditionally require labeled data can be solved, entirely or in part and subject to a few constraints, without any labeled data by leveraging the self-supervised learning of a BERT model, provided those tasks lend…
