Understanding Language Modelling (NLP) and Using ULMFiT

Text classification using ULMFiT with code

Himanshu Kriplani
Towards Data Science

--

A language model (LM) is a probability distribution over sequences of words. Such language models are used as base models (as in transfer learning) for various natural language processing tasks, including text classification, summarization, text generation and more.

It is recommended to have a basic idea of deep learning and transfer learning before continuing with this post.

LMs make it easier to map context. For example, the mapping from Man to Woman is equivalent to the mappings from Uncle to Aunt and from King to Queen. These mappings are converted to a joint probability distribution over the vocabulary. This is known as a statistical LM.

Given such a sequence, say of length m, an LM assigns a probability P(w1, w2, …, wm) to the whole sequence.

Statistical LMs face the problem of data sparsity because of the size of the vocabulary: most possible word sequences never appear in the training data. A neural language model (NLM), in contrast, takes a sequence of words as input and represents each word as a dense vector, which mitigates the data sparsity problem.
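To make the idea of a sequence probability and the sparsity problem concrete, here is a minimal sketch of a count-based bigram LM. The toy corpus, the function names and the bigram approximation P(w1, …, wm) ≈ ∏ P(wi | wi−1) are assumptions chosen for illustration; they are not part of ULMFiT.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Count context words and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that acts as a context
        bigrams.update(zip(tokens, tokens[1:]))  # (previous word, current word) pairs
    return unigrams, bigrams


def sequence_probability(sentence, unigrams, bigrams):
    """Approximate P(w1..wm) as a product of maximum-likelihood bigram estimates."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:   # context never seen in training
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


# Toy corpus (an assumption for illustration only).
corpus = ["the king rules", "the queen rules", "the king sleeps"]
unigrams, bigrams = train_bigram_lm(corpus)

seen = sequence_probability("the king rules", unigrams, bigrams)
unseen = sequence_probability("the queen sleeps", unigrams, bigrams)
```

The sentence "the queen sleeps" is perfectly plausible, yet it gets probability zero because the pair ("queen", "sleeps") never occurred in the corpus. That is the data sparsity problem in miniature.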

Without diving into the mathematics of language models, let's understand how ULMFiT works, which will give an idea of neural language models.

Universal Language Model Fine-tuning (ULMFiT) is a transfer learning technique which can help in various NLP tasks. It was the state-of-the-art NLP technique for a long time, but then it was dethroned by BERT (which was recently dethroned by XLNet in text classification). We will look into BERT in detail in the next part of the blog.

The pros of ULMFiT

Deep learning requires a lot of data. In transfer learning specifically, we have a large dataset on which our base model is built, and we transfer the learned parameters of the neural network to our domain-specific dataset. When the domain-specific dataset is small, the models overfit. To solve this problem, Jeremy Howard and Sebastian Ruder suggest 3 different techniques in their paper, Universal Language Model Fine-tuning for Text Classification, for fine-tuning LMs in transfer learning for NLP-specific tasks:

  • Discriminative fine-tuning
  • Slanted triangular learning rates
  • Gradual unfreezing

Let's understand the stages involved in creating a text classifier using ULMFiT, through which we will understand the 3 novel techniques suggested by the authors.

ULMFiT involves 3 major stages: LM pre-training, LM fine-tuning and Classifier fine-tuning.

The method is universal in the sense that it meets these practical criteria:

  1. It works across tasks varying in document size, number, and label type.
  2. It uses a single architecture and training process.
  3. It requires no custom feature engineering or pre-processing.
  4. It does not require additional in-domain documents or labels.

AWD-LSTM is a state-of-the-art language model: a regular LSTM (with no attention, short-cut connections, or other sophisticated additions) with various tuned dropout hyperparameters. The authors use AWD-LSTM as the LM in their architecture.

(a) LM pre-training

The LM is trained on a general-domain corpus to capture general features of the language in different layers.

We pre-train the LM on a large general-domain corpus and fine-tune it on the target task using novel techniques. The authors used Wikitext-103, a dataset of 28k preprocessed articles consisting of 103 million words. In general, the dataset should be large enough for the LM to learn the general properties of the language. This is the most expensive stage in terms of compute resources and time, so we do it just once.

(b) LM fine-tuning

In almost all the cases the target task dataset will have a different distribution w.r.t. the general domain corpus. In this stage, we fine-tune the model on the target task dataset to learn its distributions by using discriminative fine-tuning and slanted triangular learning rates.

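As different layers capture different types of information, the authors suggest fine-tuning each layer to a different extent.

In Stochastic Gradient Descent (SGD), we update the model parameters θ at each time step t with a single learning rate η.

Regular SGD: θ_t = θ_{t−1} − η · ∇θ J(θ)

In discriminative fine-tuning, we split θ into θ^1, θ^2, …, θ^L, one set of parameters for each of the L layers, and use a separate learning rate η^1, η^2, …, η^L for each layer instead of a single η.

Discriminative learning rate: θ_t^l = θ_{t−1}^l − η^l · ∇θ^l J(θ)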
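As a rough illustration of discriminative fine-tuning, the sketch below assigns one learning rate per layer and applies a plain SGD step with it. The function names and toy numbers are hypothetical. The factor of 2.6 between adjacent layers is the value the ULMFiT paper reports working well; it is not stated in this post, so treat it as an assumption here.

```python
def discriminative_lrs(top_lr, num_layers, factor=2.6):
    """Return one learning rate per layer, lowest layer first.

    The last layer uses top_lr; each layer below it uses the rate of the
    layer above divided by `factor` (2.6 is the paper's suggested value).
    """
    return [top_lr / factor ** (num_layers - 1 - layer) for layer in range(num_layers)]


def sgd_step(params, grads, lrs):
    """Apply theta^l <- theta^l - eta^l * grad^l independently for each layer l."""
    return [
        [w - lr * g for w, g in zip(layer_params, layer_grads)]
        for layer_params, layer_grads, lr in zip(params, grads, lrs)
    ]


# Toy model: 3 layers with 2 weights each, and identical gradients everywhere.
lrs = discriminative_lrs(top_lr=0.01, num_layers=3)
params = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
new_params = sgd_step(params, grads, lrs)
```

Even though every layer receives the same gradient in this toy setup, the lowest layer moves the least and the top layer moves the most, which is the intended effect: general low-level features are disturbed less than task-specific ones.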

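Using the same learning rate (LR) or an annealed learning rate throughout training is not the best way to adapt the parameters to task-specific features: ideally, the model should converge quickly to a suitable region of the parameter space at the beginning of training and then refine its parameters. Instead, the authors propose slanted triangular learning rates (STLR).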

The slanted triangular learning rate schedule used for ULMFiT as a function of the number of training iterations.

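In STLR, the authors suggest first increasing the learning rate linearly and then decaying it linearly, in the following manner:

cut = ⌊T · cut_frac⌋
p = t / cut, if t < cut
p = 1 − (t − cut) / (cut · (1 / cut_frac − 1)), otherwise
ηt = ηmax · (1 + p · (ratio − 1)) / ratio

where: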

  • T is number of training iterations
  • cut_frac is the fraction of iterations during which we increase the LR
  • cut is the iteration when we switch from increasing to decreasing the LR
  • p is the fraction of the number of iterations we have increased or will decrease the LR respectively
  • ratio specifies how much smaller the lowest LR is from the maximum LR ηmax
  • ηt is the learning rate at iteration t
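The quantities defined above can be turned into a small function. This is a sketch following the schedule given in the ULMFiT paper; the default values cut_frac = 0.1, ratio = 32 and ηmax = 0.01 are the ones the paper reports using, and the function name `stlr` is just an illustrative choice.

```python
import math


def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular learning rate at iteration t out of T iterations.

    Assumes T * cut_frac >= 1, so that `cut` is not zero.
    """
    cut = math.floor(T * cut_frac)  # iteration at which the LR peaks
    if t < cut:
        p = t / cut                 # short linear warm-up phase
    else:
        # long linear decay phase
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio


# Learning rate at a few points of a 1000-iteration run.
schedule = [stlr(t, T=1000) for t in (0, 50, 100, 500, 1000)]
```

With these defaults the LR starts at ηmax / ratio, climbs to ηmax during the first 10% of iterations, and then decays back to ηmax / ratio over the remaining 90%, which gives the "slanted" triangle its shape.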

A similar schedule, aggressive cosine annealing, has been used to achieve state-of-the-art results in computer vision (CV).

(c) Classifier Fine-tuning

Fine-tuning, being the most vital stage of transfer learning, needs to be done with maximum care. Overly aggressive fine-tuning can cause the model to forget what it learned during language modelling, while overly cautious fine-tuning leads to slow convergence and, as a result, overfitting. The authors suggest gradual unfreezing to deal with this issue.

We start by unfreezing only the last layer, as it contains the least general knowledge. After fine-tuning the unfrozen layers for one epoch, we unfreeze the next lower layer and repeat, until all layers are being fine-tuned and training converges at the last iteration.
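The unfreezing order can be sketched as a simple schedule that lists which layers are trainable in each epoch. The function name and the layer indexing (0 is the lowest layer) are assumptions for illustration; a real implementation would toggle the trainable flag on the corresponding model parameters.

```python
def gradual_unfreezing_schedule(num_layers, num_epochs):
    """For each epoch, return the indices of trainable layers (0 = lowest layer)."""
    schedule = []
    for epoch in range(num_epochs):
        # One more layer is unfrozen per epoch, starting from the top.
        num_unfrozen = min(epoch + 1, num_layers)
        schedule.append(list(range(num_layers - num_unfrozen, num_layers)))
    return schedule


# A 4-layer model trained for 5 epochs.
schedule = gradual_unfreezing_schedule(num_layers=4, num_epochs=5)
for epoch, trainable in enumerate(schedule, start=1):
    print(f"epoch {epoch}: trainable layers {trainable}")
```

In the first epoch only layer 3 (the last layer) is trained; from the fourth epoch onwards every layer is trainable.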

BPTT for Text Classification (BPT3C)

Language models are trained with back-propagation through time (BPTT) to enable gradient propagation for large input sequences. To make fine-tuning a classifier for large documents feasible, the authors propose BPTT for Text Classification (BPT3C):

  • Divide the document into fixed-length batches of size b.
  • At the beginning of each batch, initialize the model with the final state of the previous batch.
  • Keep track of the hidden states for mean and max-pooling.
  • Back-propagate gradients to the batches whose hidden states contributed to the final prediction.

In practice, the authors use variable-length backpropagation sequences.
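The forward pass of this procedure can be sketched in a few lines. This is only an illustration: `toy_rnn_step` is a hypothetical stand-in for an LSTM step, the inputs are made-up 2-dimensional "embeddings", and gradient handling is left out entirely. The final feature vector concatenates the last hidden state with the max-pooled and mean-pooled hidden states, as in the paper's concat pooling.

```python
def split_into_chunks(tokens, b):
    """Divide a document into fixed-length chunks of size b (the last may be shorter)."""
    return [tokens[i:i + b] for i in range(0, len(tokens), b)]


def toy_rnn_step(state, x):
    """Hypothetical stand-in for an LSTM step: blend previous state and input."""
    return [0.5 * s + 0.5 * xi for s, xi in zip(state, x)]


def encode_document(embedded_tokens, b, hidden_size):
    """Run the toy RNN chunk by chunk and build the pooled feature vector."""
    state = [0.0] * hidden_size
    all_states = []
    for chunk in split_into_chunks(embedded_tokens, b):
        # `state` is carried over, so each chunk starts from the
        # final state of the previous chunk.
        for x in chunk:
            state = toy_rnn_step(state, x)
            all_states.append(state)  # kept for mean and max pooling
    max_pool = [max(column) for column in zip(*all_states)]
    mean_pool = [sum(column) / len(all_states) for column in zip(*all_states)]
    return state + max_pool + mean_pool  # [last state, max pool, mean pool]


# Made-up 2-dimensional inputs for a 3-token document.
document = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = encode_document(document, b=2, hidden_size=2)
```

Because the state is carried across chunk boundaries, the features do not depend on the chunk size b in this toy version; b only limits how much of the document has to be processed at once.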

So, now we are done with the overwhelming theory. Let's dive into the code!

We will implement a text classifier on Quora's question text to find insincere questions. The dataset is available on Kaggle.

Code coming soon. Till then, try it out yourself.
