NLP Classification with Universal Language Model Fine-tuning (ULMFiT)

In this article, we will see how to build an NLP classification model using ULMFiT, which outperforms previous approaches to text classification.

Agasti Kishor Dukare
Towards Data Science


Text classification is one of the important applications of NLP. Tasks such as sentiment analysis and identifying spam, bots, and offensive comments all fall under text classification. Until now, the usual approach to these problems was to build a machine learning or deep learning model from scratch, train it on your text data, and tune its hyperparameters. Even though such models give decent results for tasks like classifying whether a movie review is positive or negative, they can perform terribly when things become more ambiguous, because most of the time there simply isn't enough labeled data to learn from.

But wait a minute: don't image classification models face the same problem? How do they manage to achieve great results? The trick is that, instead of building a model from scratch, we use a model that has already been trained to solve one problem (for example, classifying ImageNet images) as the basis for solving a different but related problem (here, text classification). Because the fine-tuned model doesn't have to learn everything from scratch, it reaches higher accuracy without needing a lot of data. This is the principle of Transfer Learning, upon which Universal Language Model Fine-tuning (ULMFiT) is built.

Today we are going to see how you can leverage this approach for sentiment analysis. You can read more about ULMFiT, its advantages, and how it compares with other approaches here.

The fastai library provides modules necessary to train and use ULMFiT models. You can view the library here.

The problem we are going to solve is sentiment analysis of tweets about US airlines. You can download the dataset from here. So without further ado, let's start!

Firstly, let’s import all the libraries.

Now we will load the CSV file of our data into a Pandas DataFrame and take a look at the data.
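As a sketch (assuming the Kaggle file is named Tweets.csv; adjust the path to wherever you saved it):

```python
# Load the tweets into a DataFrame and peek at the first rows
df = pd.read_csv('Tweets.csv')
df.head()
```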

Next, we check whether there are any nulls in the DataFrame. We observe that there are 5,462 nulls in the negativereason column. These nulls belong to the positive and neutral sentiments, which makes sense, and we can verify this by counting all non-negative tweets; both numbers match. The reason the negativereason_confidence count doesn't match the negativereason count is that the 0 values in the negativereason_confidence column correspond to blanks in the negativereason column.
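A quick sketch of that check, using the column names described above:

```python
# Count missing values per column
df.isnull().sum()

# Cross-check: the number of non-negative tweets should match the
# number of nulls in negativereason
(df['airline_sentiment'] != 'negative').sum()
```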

If we look at the total count of data samples, it's 14,640. The columns airline_sentiment_gold, negativereason_gold, and tweet_coord have large numbers of blanks, roughly in the range of 13,000–14,000. We can therefore conclude that these columns will not provide any significant information and can be discarded.
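Dropping them might look like this:

```python
# Discard the mostly-empty columns identified above
df = df.drop(columns=['airline_sentiment_gold', 'negativereason_gold', 'tweet_coord'])
```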

Now that we have the relevant data, let’s start building our model.

When building an NLP model with fastai, there are two stages:

  • Creating a language model (LM) and fine-tuning it with a pre-trained model
  • Using the fine-tuned language model as a classifier

Here I’m using TextList which is part of the data bloc instead of using the factory methods of TextClasDataBunch and TextLMDataBunch because TextList is part of the API which is more flexible and powerful.

We can see that since we are training a language model, all the texts are concatenated together (with a random shuffle between them at each new epoch).

Now we will fine-tune our model with the weights of a model pre-trained on a larger corpus, WikiText-103. That model has been trained to predict the next word in the sentence provided to it as input. As the language of tweets is not always grammatically perfect, we will have to adjust the parameters of our model. Next, we will find the optimal learning rate and visualize it. The visualization will help us spot a range of learning rates to choose from while training our model.
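A sketch of that step (AWD_LSTM with WikiText-103 weights is fastai v1's default; drop_mult=0.3 is an illustrative value):

```python
# Language-model learner initialized from the pre-trained AWD_LSTM
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

# Search for a good learning rate and plot the result
learn.lr_find()
learn.recorder.plot()
```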

By default, the Learner object comes frozen, so we first train only the embeddings. Here, instead of running the one-cycle schedule for a single epoch, I am going to run it for 6 epochs to see how the accuracy varies. I picked the learning rate with the help of the plot we got above.
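For example (1e-2 stands in for whatever rate the plot suggests):

```python
# Train only the unfrozen layer group (the embeddings/head) for 6 epochs
learn.fit_one_cycle(6, 1e-2, moms=(0.8, 0.7))
```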

We get very low accuracy, which was expected since the rest of our model is still frozen, but we can see that the accuracy is increasing.
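Continuing the sketch, we then unfreeze the full language model and fine-tune it; the epoch count and learning rate here are illustrative:

```python
# Unfreeze everything and fine-tune the whole language model
learn.unfreeze()
learn.fit_one_cycle(3, 1e-3, moms=(0.8, 0.7))
```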

We see that the accuracy improves slightly but still hovers in the same range. This is because, firstly, the model was fine-tuned from a pre-trained model with a different vocabulary and, secondly, there were no labels; we passed the data without specifying any labels.

Now we will test our model with a random input and see if it can plausibly complete the sentence.
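Something like the following, with a made-up prompt; we also save the fine-tuned encoder so the classifier can reuse it in the next stage:

```python
# Ask the language model to continue a prompt (prompt and n_words are arbitrary)
learn.predict("I hated this flight because the", n_words=10)

# Save the fine-tuned encoder for the classification stage
learn.save_encoder('fine_tuned_enc')
```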

Now, we’ll create a new data object that only grabs the labeled data and keeps those labels.

The classifier needs a little less dropout, so we pass drop_mult=0.5 to multiply all the dropouts by this amount. We don't load the pre-trained model; instead, we load our fine-tuned encoder from the previous section.
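For example (the encoder name must match whatever was used with save_encoder above):

```python
# Classifier learner with reduced dropout, loading the fine-tuned encoder
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('fine_tuned_enc')
```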

Again we perform steps similar to those for the language model. Here I am skipping the last 15 points of the learning-rate plot, as I'm only interested in rates up to 1e-1.
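A sketch of that step; the one-cycle learning rate of 2e-2 is an illustrative choice:

```python
# Find a learning rate; skip_end trims the last points of the plot
learn_clf.lr_find()
learn_clf.recorder.plot(skip_end=15)

# Train just the classifier head first
learn_clf.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))
```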

Here we see that the accuracy has improved drastically compared with the language model in stage 1, now that we provide labels.

Now we will partially train the model by unfreezing one layer group at a time and using discriminative learning rates. Here I am using a slice object, which spreads the specified learning-rate range across the model's layer groups.
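A sketch following the usual ULMFiT recipe; the specific rates and the 2.6**4 spread are illustrative:

```python
# Gradual unfreezing with discriminative learning rates via slice()
learn_clf.freeze_to(-2)   # unfreeze the last two layer groups
learn_clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2), moms=(0.8, 0.7))

learn_clf.freeze_to(-3)   # unfreeze one more layer group
learn_clf.fit_one_cycle(1, slice(5e-3 / (2.6 ** 4), 5e-3), moms=(0.8, 0.7))
```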

We see that the accuracy is improving gradually, which is expected as we gradually unfreeze the layers; more unfrozen layers give the model more depth to adapt.

Finally, we will unfreeze the whole model, visualize the learning rate once more to choose a value, and use that for the final training.
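For example (again with illustrative rates):

```python
# Unfreeze the whole classifier, re-check the learning rate, and train
learn_clf.unfreeze()
learn_clf.lr_find()
learn_clf.recorder.plot()
learn_clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3), moms=(0.8, 0.7))
```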

We see that we have achieved a maximum accuracy of about 80% by the end of this training.

For our final results, we'll compute the model's predictions on the validation set and measure accuracy. Since the samples are sorted by text length for batching, we pass the argument ordered=True to get the predictions back in the original order of the texts.
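A sketch of that evaluation:

```python
# Get validation predictions back in the original text order, then score them
preds, targets = learn_clf.get_preds(ds_type=DatasetType.Valid, ordered=True)
acc = (preds.argmax(dim=1) == targets).float().mean()
print(f'Accuracy: {acc:.4f}')
```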

We get an accuracy of 80.09%.

Now it’s time to test our model with new text inputs & see how it performs!

The databunch has converted the text labels into numeric classes. They are as follows:

  • 0 => Negative
  • 1 => Neutral
  • 2 => Positive
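Here is what such a test might look like; the tweet is made up, and predict returns the predicted category, its index, and the class probabilities:

```python
# Classify a new, made-up tweet
learn_clf.predict("The flight was delayed for three hours and the crew was rude")
```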

We see that our model has performed pretty well!!

You can test the model with negative as well as mixed sentiment text and verify results.

Hope you find this article helpful :D

Also, any suggestions/corrections are welcome.

Happy Coding!!
