tl;dr
Language models have shown the ability to implicitly remember information from the data used for pre-training. This paper by Roberts et al. (2020) attempts to quantify that ability and shows how the scale of implicit knowledge varies with model size and the amount of pre-training.
Thanks a lot to Sundeep Teki for the feedback, and for helping me write this blog!
Introduction
Recent work from Petroni et al. (2019) has shown how language models construct internal knowledge bases from the data used for pre-training. In this paper by Roberts et al. (2020), the authors try to understand this phenomenon through "closed-book question answering". Unlike recent work in the Question Answering (QA) space, the authors do not provide the model with any context or external knowledge source to answer the questions (hence the name: closed-book question answering). Instead, the model has to look up, within its own parameters, the information it stored during pre-training. Furthermore, the authors explore how this behavior changes with model size (number of parameters) and the amount of training data, both of which have previously been shown to improve performance on downstream tasks.

Background
Question answering: Typically, the model is provided with an external source of information in which to look up details pertinent to the question. The questions can ask about historical facts, information that can be inferred from an external source, and so on. This setting is referred to as "open-book question answering". The model is expected to output either a span of the provided text (its start and end positions) or the answer text itself.
A simpler version of this task gives the model a specific context passage instead of an entire external knowledge source. Here, rather than searching through a huge external corpus, the model can learn to "look up" the answer in the given context. This version of question answering is referred to as reading comprehension.
In this paper, the authors target a much more ambitious setting they refer to as closed-book question answering. Here, the model is expected to look inside itself for memorized content to answer the question, with no oracle context passage or large external corpus. The difference between the three settings is sketched below.
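To make the distinction concrete, here is a minimal sketch (plain Python with hand-written example strings, not code from the paper) of what the model gets to see in each setting:

```python
# Illustrative only: what the model's input looks like in each QA setting.
question = "When was the Eiffel Tower built?"
context = ("The Eiffel Tower was constructed from 1887 to 1889 as the "
           "centerpiece of the 1889 World's Fair.")

# Open-book / open-domain QA: a retriever first pulls relevant passages
# from a large external corpus, and the reader answers from them.
retrieved_passages = ["...passages returned by a retrieval system..."]
open_book_input = (question, retrieved_passages)

# Reading comprehension: the gold context is handed over directly;
# the model only has to "look up" the answer span inside it.
reading_comprehension_input = (question, context)

# Closed-book QA (this paper): the question alone is the input; any facts
# needed to answer must come from the model's own parameters.
closed_book_input = question
```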
Transfer learning: Large-scale language models have been shown to achieve better performance with the help of pre-training on a large unlabelled corpus. Such pre-training is argued to provide the model with linguistic information and a certain amount of "world knowledge" in an unsupervised fashion. Recently popular transfer-learning models are derived from the Transformer (Vaswani et al., 2017), and a particular variant of encoder-only Transformers (similar to BERT (Devlin et al., 2018)) is popular in the question answering setting. This is because question answering is typically attempted with a context input or an external knowledge base, where the encoder model predicts the span of the provided text that contains the answer.
However, this is not possible for closed-book question answering, so the authors use the T5 framework (Text-to-Text Transfer Transformer), which casts every problem as a text-to-text problem (Raffel et al., 2019). That is, instead of extracting the answer from a given text, the model is expected to generate it.
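As a rough illustration of this text-to-text setup, the sketch below feeds a bare question to a T5-style seq2seq model and decodes the answer as free-form text. It assumes the HuggingFace transformers library and one of the "ssm-nq" checkpoints released alongside the paper; the checkpoint name and decoding settings are my assumptions, not the paper's own code:

```python
# Minimal sketch of closed-book QA as text-to-text generation.
# Assumes the HuggingFace `transformers` library is installed and that the
# `google/t5-large-ssm-nq` checkpoint is available on the model hub.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/t5-large-ssm-nq"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "When was Franklin D. Roosevelt born?"
inputs = tokenizer(question, return_tensors="pt")

# No context or retrieved documents are passed in; the answer has to come
# from whatever the model memorized during pre-training.
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```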
Experiment
Datasets: The study uses three datasets: Natural Questions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013), and TriviaQA (Joshi et al., 2017). The paper uses only the questions from each dataset and ignores the accompanying matching documents. Since TriviaQA has a private test set, all results on it are obtained by submitting predictions to the leaderboard.
Training: The authors use the T5 model (Raffel et al., 2019), which was not pre-trained on the question-answer datasets. Performance was measured as a function of model size across the Base, Large, 3B, and 11B variants. Results were also reported for the T5.1.1 checkpoints, which were pre-trained on unlabelled data only. For validation, 10% of each training set was held out, and the best-performing checkpoint from training on the remaining 90% was used. Predictions were decoded greedily, i.e., by selecting the most likely token at each timestep.
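To make the fine-tuning recipe concrete, here is a toy training step in the text-to-text format. This is not the paper's training code: the small checkpoint, the optimizer, and the learning rate are placeholders chosen for illustration, and the real setup trains on full QA datasets with the hyperparameters from Raffel et al. (2019):

```python
# Toy fine-tuning step for closed-book QA: question text in, answer text out.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder optimizer

question = "who wrote the opera carmen"  # question-only input, no context
answer = "Georges Bizet"                 # free-form text target

inputs = tokenizer(question, return_tensors="pt")
labels = tokenizer(answer, return_tensors="pt").input_ids

# Standard seq2seq cross-entropy loss on the answer tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# At evaluation time, decode greedily: the most likely token at each timestep.
pred_ids = model.generate(**inputs, num_beams=1, max_new_tokens=16)
print(tokenizer.decode(pred_ids[0], skip_special_tokens=True))
```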
Salient Span Masking (SSM): Masking longer, salient spans such as named entities and dates, following Guu et al. (2020), has been shown to improve the performance of BERT-based models. In this paper, the authors continue pre-training T5 in a similar fashion for 100k additional steps before fine-tuning.
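The sketch below shows what a single SSM example could look like in T5's span-corruption format. A real pipeline would use a named-entity and date tagger to pick the salient span (as in Guu et al., 2020); here the span is hard-coded to keep the example self-contained:

```python
# Rough sketch of salient span masking (SSM) in T5's sentinel-token format.
sentence = "The Eiffel Tower was completed in 1889 in Paris."
salient_span = "1889"  # a date picked by hand; normally chosen by a tagger

# Replace the salient span with a sentinel token; the target asks the model
# to regenerate exactly the masked span, which encourages memorizing the fact.
model_input = sentence.replace(salient_span, "<extra_id_0>")
model_target = f"<extra_id_0> {salient_span} <extra_id_1>"

print(model_input)   # The Eiffel Tower was completed in <extra_id_0> in Paris.
print(model_target)  # <extra_id_0> 1889 <extra_id_1>
```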
Results: The major takeaways from this study are –
- Performance increased as the model size grew from the Base to the 11B variant.
- The SSM strategy gave a significant boost in performance.
- There is a significant reduction in memory and computational cost because, unlike typical open-domain QA models that search a large knowledge corpus, the model described in this paper only looks "inside" itself.
- The model also beats the best baseline on the task with multiple answers, albeit lagging behind the SOTA model on recall.
Human Evaluations: Because the model generates free-form answers, the automatic evaluation produces multiple false negatives: if the output does not exactly match the ground truth, it is counted as a wrong prediction even when the two are semantically the same. The authors inspected 150 randomly sampled examples and found that 20 were misclassified as false, 20 had wrong annotations, and 17 were unanswerable. Ignoring the unanswerable questions, the model's score comes out to 57.8.
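The automatic metric behind these numbers is exact match after light normalization (SQuAD-style scoring; the individual benchmarks' scripts may differ in the details), which is why a correct but differently phrased answer is scored as wrong. A minimal sketch:

```python
# Sketch of exact-match scoring with SQuAD-style answer normalization.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())                # collapse whitespace

def exact_match(prediction: str, ground_truths: list[str]) -> bool:
    return any(normalize(prediction) == normalize(gt) for gt in ground_truths)

# A semantically correct paraphrase still counts as a miss (false negative).
print(exact_match("President Franklin Roosevelt", ["Franklin D. Roosevelt"]))  # False
```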

Conclusion
In this paper, the authors show that large language models pre-trained on unstructured data can attain remarkable results in "closed-book" question answering. This opens up several interesting directions for future work: building resource-constrained (smaller) models that emulate the performance of the larger ones, examining the "knowledge" being accessed for model interpretability, and, more importantly, understanding whether the models actually "learn" facts as a consequence of pre-training with a maximum likelihood loss.