Understanding ULMFiT — The Shift Towards Transfer Learning in NLP

An overview of what ULMFiT is, what it does and how it works, and the radical shift it caused in NLP

Akhilesh Ravi
Towards Data Science


Image by Hitesh Choudhary from Unsplash

Natural language processing has picked up pace over the last decade, and with the increasing ease of implementing deep learning, there have been major developments in the field. However, it lagged behind the level of maturity attained in computer vision.

The main reason for this was that transfer learning made many computer vision tasks much simpler than they would otherwise have been: pretrained models like VGGNet⁸ or AlexNet⁹ could be fine-tuned to fit most computer vision tasks. These models were pretrained on a huge corpus, ImageNet, and had learned to capture the general features and properties of images. Thus, with some tweaking, they could be reused for most tasks, and models did not have to be trained from scratch for each and every computer vision task. Moreover, because they were trained on such a large corpus, the results on many tasks were exceptional, often outperforming smaller models trained from scratch for those particular tasks.

On the other hand, models had to be trained separately and one-by-one for each NLP task. This was time-consuming and limited the scope of these models. “Recent approaches (in 2017 and 2018) that concatenate embeddings derived from other tasks with the input at different layers still train the main task model from scratch and treat pretrained embeddings as fixed parameters, limiting their usefulness.”¹ There was a lack of knowledge of how to properly fine-tune language models for various NLP tasks.

Universal Language Model Fine-Tuning (ULMFiT)

In 2018, Howard and Ruder¹ proposed a novel method for fine-tuning neural language models for inductive transfer learning: a model trained on a source task is reused, with fine-tuning, to obtain good performance on other (target) NLP tasks as well.

Figure 2: Main Stages in ULMFiT — Figure from Analytics Vidhya

Choosing the base model
Language modelling was seen as the ideal source task, the analogue of ImageNet pretraining for NLP, for the following reason: “It captures many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations, and sentiment. In contrast to tasks like machine translation (MT) and entailment, it provides data in near-unlimited quantities for most domains and languages.”¹ Moreover, a language model can be adapted to the particular features of the target task, and language modelling is already a component of various other NLP tasks.
Generally, a strong language model (LM), such as the AWD-LSTM⁷, is chosen as the base model. The expectation is that the better the base language model, the better the final model will perform on the various NLP tasks after fine-tuning.

General-domain LM pretraining
Pre-training is done on a large general-domain corpus that effectively captures the main properties and aspects of the language, the equivalent of ImageNet but for text (the ULMFiT authors used WikiText-103). This stage has to be performed only once, and the resulting pretrained model can be reused in the next stages for all tasks.
Pre-training is done so that the model already understands the general properties of language and only needs to be tweaked a little to suit the specific task. In fact, pre-training was found to be especially useful for small and medium-sized datasets.
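In practice, this stage is rarely repeated by end users. For example, the fastai library (maintained by the paper's first author) ships an AWD-LSTM pretrained on WikiText-103 that can simply be loaded. A rough sketch of what that might look like, assuming the IMDB dataset as the target corpus; the exact calls and argument names can differ between fastai versions:

```python
from fastai.text.all import *

# Target-task corpus (illustrative): the IMDB movie-review dataset.
path = untar_data(URLs.IMDB)

# Dataloaders for language modelling on the target-task text.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)

# pretrained=True loads AWD-LSTM weights pretrained on WikiText-103,
# i.e. the general-domain LM pretraining stage is reused, not redone.
learn = language_model_learner(dls_lm, AWD_LSTM, pretrained=True,
                               drop_mult=0.3, metrics=Perplexity())
```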

Target task LM fine-tuning
This stage adapts the language model to the data of the specific target task. When a pretrained model is used, convergence at this stage is faster. Two techniques, discriminative fine-tuning and slanted triangular learning rates, are used for fine-tuning the language model.
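Continuing the hedged fastai sketch from the previous stage, target-task LM fine-tuning might look roughly like this (the learning rates and epoch counts are just the ones commonly used in fastai examples, not prescriptions; note that `fit_one_cycle` uses fastai's one-cycle policy, a close relative of the slanted triangular schedule described below):

```python
# Fine-tune the parts of the LM that are not frozen by default for one epoch,
# then unfreeze everything and fine-tune the whole language model.
learn.fit_one_cycle(1, 2e-2)
learn.unfreeze()
learn.fit_one_cycle(3, 2e-3)

# Save the fine-tuned encoder so the classifier stage can reuse it.
learn.save_encoder('finetuned_lm_enc')
```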

Discriminative Fine-Tuning
“As different layers capture different types of information, they should be fine-tuned to different extents.”¹ Thus, a different learning rate is used for each layer. The learning rate of the last layer, ηᴸ, is chosen first; then ηˡ⁻¹ = ηˡ/2.6 is used to obtain the learning rates of the lower layers.

Figure 3: Weight update for each layer: θˡₜ = θˡₜ₋₁ − ηˡ · ∇θˡ J(θ), where l = 1, 2, …, L is the layer number, L is the number of layers, ηˡ is the learning rate of the lᵗʰ layer, θˡₜ are the weights of the lᵗʰ layer at iteration t, and ∇θˡ J(θ) is the gradient with respect to the model’s objective function. — Image and equation from the ULMFiT research paper¹
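A minimal PyTorch sketch of discriminative fine-tuning via per-layer parameter groups; the four-layer toy model and the base rate of 0.01 are purely illustrative, and the only substantive part is the ηˡ⁻¹ = ηˡ/2.6 geometric decay:

```python
import torch
import torch.nn as nn

# Toy stand-in for a stacked language model: index 0 is the lowest layer.
layers = nn.ModuleList([nn.Linear(100, 100) for _ in range(4)])

eta_last = 0.01              # learning rate chosen for the last (top) layer
L = len(layers)

# eta^{l-1} = eta^l / 2.6, so lower layers get geometrically smaller rates.
param_groups = [
    {"params": layer.parameters(), "lr": eta_last / (2.6 ** (L - 1 - l))}
    for l, layer in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups, lr=eta_last)

print([round(g["lr"], 5) for g in optimizer.param_groups])
# [0.00057, 0.00148, 0.00385, 0.01]
```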

Slanted triangular learning rates
The learning rate is not kept constant throughout the fine-tuning process. For a short initial fraction of the training iterations, it is increased linearly with a steep slope; for the remaining iterations, it is decreased linearly with a gradual slope. This schedule was found to give good model performance.

Figure 4: Slanted Triangular Learning Rates — Change of learning rate with each iteration — Image from Wordpress
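A small Python sketch of the slanted triangular schedule, using the formula and the default hyper-parameters suggested in the paper (cut_frac = 0.1, ratio = 32, η_max = 0.01; T is the total number of training iterations):

```python
import math

def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t out of T, following the ULMFiT STLR formula."""
    cut = math.floor(T * cut_frac)                       # iteration where the LR peaks
    if t < cut:
        p = t / cut                                      # steep linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # gradual linear decay
    return eta_max * (1 + p * (ratio - 1)) / ratio

lrs = [slanted_triangular_lr(t, T=1000) for t in range(1000)]
print(lrs[0], max(lrs), lrs[-1])   # starts low, peaks at eta_max after 10% of training, decays
```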

Target task classifier fine-tuning
“Finally, for fine-tuning the classifier, we augment the pretrained language model with two additional linear blocks.”¹ Each block has the following (a short sketch of such a head is given after the list):
1. batch normalization
2. dropout
3. ReLU activation for the intermediate layer
4. softmax activation for the output layer to predict the classes
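A hedged PyTorch sketch of such a classifier head. The intermediate size of 50 follows the paper; the 3 × 400 input assumes the concat-pooled output of an AWD-LSTM whose final layer emits 400-dimensional states (see “Concat Pooling” below); the dropout values and two-class output are illustrative:

```python
import torch
import torch.nn as nn

n_classes = 2           # e.g. positive / negative sentiment
pooled_dim = 3 * 400    # concat-pooled encoder output (see "Concat Pooling" below)

# Two linear blocks appended after the pretrained LM encoder: each uses
# batch normalization and dropout, with ReLU on the intermediate block and
# softmax on the output block to predict class probabilities.
classifier_head = nn.Sequential(
    nn.BatchNorm1d(pooled_dim),
    nn.Dropout(0.2),
    nn.Linear(pooled_dim, 50),
    nn.ReLU(),
    nn.BatchNorm1d(50),
    nn.Dropout(0.1),
    nn.Linear(50, n_classes),
    nn.Softmax(dim=1),
)

pooled = torch.randn(8, pooled_dim)    # fake batch of pooled encoder outputs
print(classifier_head(pooled).shape)   # torch.Size([8, 2])
```

In a real training loop one would usually keep the raw logits and use a cross-entropy loss instead of an explicit softmax layer; the softmax is shown here only to match the description above.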

Some Additional Steps which improve the Model Further

Concat Pooling
In text classification, the signal is usually carried by only a few important words, which may be a small part of the entire document, especially when documents are long. Thus, to avoid loss of information, the last hidden state vector is concatenated with the max-pooled and mean-pooled forms of the hidden state vectors over all time steps.
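A quick sketch of concat pooling, assuming the encoder's hidden states are available as a tensor of shape (batch, time, hidden):

```python
import torch

def concat_pool(hidden_states):
    """hidden_states: (batch, time, hidden) outputs of the RNN encoder."""
    last = hidden_states[:, -1, :]               # hidden state at the last time step
    max_pool = hidden_states.max(dim=1).values   # element-wise max over time
    mean_pool = hidden_states.mean(dim=1)        # element-wise mean over time
    return torch.cat([last, max_pool, mean_pool], dim=1)   # (batch, 3 * hidden)

h = torch.randn(8, 70, 400)     # fake batch: 8 documents, 70 time steps, hidden size 400
print(concat_pool(h).shape)     # torch.Size([8, 1200])
```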
Gradual Unfreezing
When all layers are fine-tuned at the same time, there is a risk of catastrophic forgetting. Thus, initially all layers except the last one are frozen and fine-tuning is done for one epoch. Then the layers are unfrozen one by one, with one more layer unfrozen and fine-tuned in each subsequent epoch, until all layers are unfrozen and training reaches convergence.
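A hedged sketch of gradual unfreezing in plain PyTorch, assuming the model is split into layer groups ordered from lowest to highest and that a hypothetical `train_one_epoch` function performs the actual fine-tuning:

```python
import torch.nn as nn

# Toy stand-in: layer_groups[0] is the lowest layer, layer_groups[-1] the classifier head.
layer_groups = nn.ModuleList([nn.Linear(10, 10) for _ in range(4)])

def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Freeze everything first.
for group in layer_groups:
    set_trainable(group, False)

# Unfreeze one more group from the top in each epoch and fine-tune.
for k in range(1, len(layer_groups) + 1):
    for group in layer_groups[-k:]:
        set_trainable(group, True)
    # train_one_epoch(layer_groups)   # hypothetical: fine-tune only the unfrozen layers
```

(In fastai this is exposed through `freeze_to`, which keeps only the last few layer groups of a learner trainable.)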
Using Bidirectional Model
Ensembling the predictions of a classifier built on a forward LM with those of one built on a backward LM improves performance further.
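Conceptually, the ensemble just averages the class probabilities of the two classifiers; both tensors below stand in for the outputs of the already fine-tuned forward and backward models:

```python
import torch

def ensemble_predict(probs_forward, probs_backward):
    """Average the class probabilities of the forward and backward classifiers."""
    return (probs_forward + probs_backward) / 2

p_fwd = torch.tensor([[0.8, 0.2]])     # illustrative prediction for one document
p_bwd = torch.tensor([[0.6, 0.4]])
print(ensemble_predict(p_fwd, p_bwd))  # tensor([[0.7000, 0.3000]])
```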

Figure 5: Block Diagram of ULMFiT in Text Analysis — Image by HU Berlin, from GitHub

Advantages of ULMFiT
ULMFiT-based models (which start from a pretrained LM) perform very well even on small and medium-sized datasets compared to models trained from scratch on the same data. This is because they have already captured the properties of the language during pre-training.
The newly proposed methods for fine-tuning the LM and the classifier also give better accuracy than the traditional methods of fine-tuning a model.

Figure 6: ULMFiT Performance Comparison on IMDB Dataset — Image from FastAI

ULMFiT revolutionized the field of NLP, expanding the scope of deep learning in NLP and making it possible to train models for various tasks in much less time than before. It laid the groundwork for transfer learning in NLP and paved the way for models such as ELMo, GPT, GPT-2, BERT and XLNet.

Note: I had initially written this article as an assignment in my Natural Language Processing course.

Watch videos explaining concepts in Machine Learning, AI and Data Science on my YouTube channel.

References

[1] Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146. *
[2] Figure 1: https://miro.medium.com/proxy/0*2t3JCdtfsV2M5S_B.png
[3] Figure 2: https://cdn.analyticsvidhya.com/wp-content/uploads/2018/11/ulmfit_flow_2.png
[4] Figure 4: https://yashuseth.files.wordpress.com/2018/06/stlr-formula2.jpg?w=413&h=303
[5] Figure 5: https://humboldt-wi.github.io/blog/img/seminar/group4_ULMFiT/Figure_5.png
[6] Figure 6: http://nlp.fast.ai/images/ulmfit_imdb.png
[7] Merity, S., Keskar, N. S., & Socher, R. (2017). Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.
[8] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[9] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).

* Many phrases, sentences and paragraphs in this article are summaries of sentences or paragraphs from [1]. Figure 3 was also taken from [1].


UG student at Indian Institute of Technology Gandhinagar; Pursuing a dual major in CS and EE; ML, Computer Vision and Data Science Enthusiast