News classification: fine-tuning RoBERTa on TPUs with TensorFlow

A tutorial on multiclass text classification using Hugging Face transformers.

Gabriele Sgroi, PhD
Towards Data Science


Fine-tuning large pre-trained models on downstream tasks is a common practice in Natural Language Processing. In this tutorial, we will use a pre-trained RoBERTa model for a multiclass classification task.

RoBERTa: A Robustly Optimized BERT Pretraining Approach, developed by Facebook AI, improves on the popular BERT model by modifying key hyperparameters and pretraining on a larger corpus. This leads to improved performance compared to vanilla BERT.

The transformers library by Hugging Face makes it easy to deploy pre-trained models for a variety of NLP tasks with a few lines of code. It provides a number of Auto Model classes that wrap the pre-trained models and automatically implement the architectural changes needed for common downstream tasks. Furthermore, these models can be cast as Keras models, allowing easy training through the Keras API.

In this tutorial, we will fine-tune RoBERTa on the News Category Dataset, hosted on Kaggle, to predict the category of a news item from its headline and a short description. The dataset contains around 200k news headlines published between 2012 and 2018.

Full code is available in the following public Colab notebook.

0. INSTALL AND IMPORT DEPENDENCIES

We first have to install the transformers library; this can be done with pip install transformers.

Next, we import the libraries needed for the rest of the tutorial.
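The exact imports live in the Colab notebook; as a minimal sketch, assuming TensorFlow, NumPy, pandas, and the Hugging Face classes used later in this tutorial, the setup could look like this:

```python
# In a Colab cell, install the library first:
# !pip install transformers

import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
```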

1. INSTANTIATE THE TPU

The model has been trained using Colab's free TPUs. TPUs allow us to train our model much faster and to use a larger batch size. To enable the TPU on Colab, click on “Edit”->“Notebook Settings” and select “TPU” in the “Hardware Accelerator” field. To instantiate the TPU for use with TensorFlow, we need to run the following code.
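A minimal sketch of the TPU setup (on recent TensorFlow versions; older releases use tf.distribute.experimental.TPUStrategy instead):

```python
import tensorflow as tf

# Detect the Colab TPU and build a distribution strategy around it
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.TPUStrategy(tpu)
print("Number of replicas:", tpu_strategy.num_replicas_in_sync)
```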

To make full use of the TPU's potential, we set a batch size that is a multiple of the number of TPU cores in our cluster. We will then just need to instantiate our model under tpu_strategy.scope().
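For example, with the strategy created above and an assumed per-core batch size of 16:

```python
# A Colab TPU exposes 8 cores, so the global batch size is a multiple of 8
BATCH_SIZE = 16 * tpu_strategy.num_replicas_in_sync  # e.g. 16 * 8 = 128
```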

2. DATA EXPLORATION

Let’s load the data. We will concatenate the headline and the description into a single input text that we will feed to our network later.
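A sketch of this step, assuming the Kaggle JSON-lines file name News_Category_Dataset_v2.json and its headline and short_description columns:

```python
import pandas as pd

# Each line of the Kaggle file is a JSON record
df = pd.read_json("News_Category_Dataset_v2.json", lines=True)

# Concatenate headline and short description into a single input text
df["text"] = df["headline"] + ". " + df["short_description"]
```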

The news headlines are classified into 41 categories; let's visualize how they are distributed.
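A quick way to do this, sketched with pandas and matplotlib:

```python
import matplotlib.pyplot as plt

# Number of articles per category, most frequent first
df["category"].value_counts().plot(kind="bar", figsize=(12, 4))
plt.ylabel("Number of articles")
plt.tight_layout()
plt.show()
```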

Categories distribution

We see that we have a lot of categories with few entries. Furthermore, some categories may refer to closely related or overlapping concepts. Since there is a significant number of categories to predict, let’s aggregate the categories that refer to similar concepts. This will make the classification task a little bit easier.
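The exact grouping used in the notebook is a design choice; an illustrative (not necessarily identical) mapping that merges a few overlapping categories could look like this:

```python
# Illustrative merges only; the notebook's exact grouping may differ
category_map = {
    "THE WORLDPOST": "WORLD NEWS",
    "WORLDPOST": "WORLD NEWS",
    "FOOD & DRINK": "FOOD, DRINK & TASTE",
    "TASTE": "FOOD, DRINK & TASTE",
    "GREEN": "ENVIRONMENT & GREEN",
    "ENVIRONMENT": "ENVIRONMENT & GREEN",
    "SCIENCE": "SCIENCE & TECH",
    "TECH": "SCIENCE & TECH",
    "STYLE": "STYLE & BEAUTY",
}
df["category"] = df["category"].replace(category_map)
```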

We are thus left with 28 aggregated categories distributed as follows

3. DATA PREPROCESSING

We now have to preprocess our data into a form that can be used by a TensorFlow Keras model. As a first step, we need to turn the class labels into indices. We don't need a one-hot encoding, since we will work with TensorFlow's SparseCategoricalCrossentropy loss.
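A minimal sketch of the label encoding (the helper names are mine, not necessarily the notebook's):

```python
# Map each aggregated category to an integer index
categories = sorted(df["category"].unique())
label2id = {cat: i for i, cat in enumerate(categories)}
id2label = {i: cat for cat, i in label2id.items()}
df["label"] = df["category"].map(label2id)
```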

Next, we need to tokenize the text, i.e. transform our strings into lists of indices that can be fed to the model. The transformers library provides the AutoTokenizer class, which allows us to load the pre-trained tokenizer used for RoBERTa.

RoBERTa uses a byte-level BPE tokenizer that performs subword tokenization, i.e. rare or unknown words are split into common subwords present in the vocabulary. We will see what this means in the examples below.

Here the flag padding=True will pad the sentences to the length of the longest sequence in the batch. On the other hand, truncation=True will truncate the sentences to the maximum number of tokens the model can accept (512 for RoBERTa, as for BERT).
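A sketch of the tokenization, assuming the roberta-base checkpoint:

```python
from transformers import AutoTokenizer

# Load the pre-trained RoBERTa tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Pad to the longest sequence in the batch, truncate beyond the 512-token limit
encodings = tokenizer(df["text"].tolist(), padding=True, truncation=True)
```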

Let’s visualize how the text gets tokenized.

Input: Twitter Users Just Say No To Kellyanne Conway's Drug Abuse Cure 
Subword tokenization: ['Twitter', 'ĠUsers', 'ĠJust', 'ĠSay', 'ĠNo', 'ĠTo', 'ĠKell', 'yan', 'ne', 'ĠConway', "'s", 'ĠDrug', 'ĠAbuse', 'ĠCure']
Indices: [0, 22838, 16034, 1801, 9867, 440, 598, 12702, 7010, 858, 13896, 18, 8006, 23827, 32641, 2, 1, 1, 1]
Input: Target's Wedding Dresses Are Nicer Than You Might Think (VIDEO)
Subword tokenization: ['Target', "'s", 'ĠWedding', 'ĠD', 'resses', 'ĠAre', 'ĠNic', 'er', 'ĠThan', 'ĠYou', 'ĠMight', 'ĠThink', 'Ġ(', 'VIDEO', ')']
Indices: [0, 41858, 18, 21238, 211, 13937, 3945, 13608, 254, 15446, 370, 30532, 9387, 36, 36662, 43, 2, 1, 1]
Input: Televisa Reinstates Fired Hosts, Is Investigating Sexual Harassment Claims
Subword tokenization: ['Te', 'lev', 'isa', 'ĠRe', 'inst', 'ates', 'ĠFired', 'ĠHost', 's', ',', 'ĠIs', 'ĠInvestig', 'ating', 'ĠSexual', 'ĠHar', 'assment', 'ĠClaims']
Indices: [0, 16215, 9525, 6619, 1223, 16063, 1626, 41969, 10664, 29, 6, 1534, 34850, 1295, 18600, 2482, 34145, 28128, 2]

The character Ġ appearing in the subword tokenization indicates the beginning of a new word; tokens missing it are parts of a bigger word that has been split. The RoBERTa tokenizer uses 0 for the beginning-of-sentence token, 1 for the padding token, and 2 for the end-of-sentence token.

As the last step in our data preprocessing, we create a TensorFlow dataset from our data and use the first 10% of it for validation.
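A sketch of the pipeline, reusing encodings, the label column, and BATCH_SIZE from above (the exact dataset construction in the notebook may differ):

```python
features = {
    "input_ids": encodings["input_ids"],
    "attention_mask": encodings["attention_mask"],
}
labels = df["label"].values

# First 10% held out for validation
n_val = len(labels) // 10
val_ds = tf.data.Dataset.from_tensor_slices(
    ({k: v[:n_val] for k, v in features.items()}, labels[:n_val])
).batch(BATCH_SIZE)

train_ds = tf.data.Dataset.from_tensor_slices(
    ({k: v[n_val:] for k, v in features.items()}, labels[n_val:])
).shuffle(10_000).batch(BATCH_SIZE, drop_remainder=True)  # fixed shapes suit TPUs
```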

4. LOADING THE MODEL AND TRAINING

Now that we have preprocessed the data, we need to instantiate the model. We will use the Hugging Face TensorFlow auto class for sequence classification. Using the from_pretrained method and setting num_labels equal to the number of classes in our dataset, this class will take care of all the dirty work for us: it will download the pre-trained RoBERTa weights and instantiate a Keras model with a classification head on top. We can thus use all the usual Keras methods, such as compile, fit, and save_weights. We fine-tune our model for 6 epochs with a small learning rate (1e-5) and clipnorm=1.0 to limit potentially large gradients that could destroy the features learned during pre-training.
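A sketch of the model setup and training, assuming the roberta-base checkpoint and the 28 aggregated classes (the compile/fit pattern follows the Hugging Face TensorFlow examples; details may differ slightly from the notebook):

```python
from transformers import TFAutoModelForSequenceClassification

NUM_LABELS = len(categories)  # 28 aggregated categories

# Build the model (pre-trained body + fresh classification head) on the TPU
with tpu_strategy.scope():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=NUM_LABELS
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, clipnorm=1.0),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
    )

history = model.fit(train_ds, validation_data=val_ds, epochs=6)
model.save_weights("roberta_news_classifier.h5")  # hypothetical filename
```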

We see that the validation loss saturates pretty quickly, while the training loss continues to decrease. The model is in fact quite powerful and starts to overfit if trained for longer.

5. EVALUATION

The model reaches ~77% top-1 accuracy and ~93% top-3 accuracy on the validation set, out of a total of 28 classes.

Let's visualize the confusion matrix on the validation set.
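A sketch of how the matrix could be computed, reusing the validation objects defined above (with recent transformers versions, model.predict returns an output object with a logits attribute; older versions may return the array directly):

```python
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Predicted class for each validation example
logits = model.predict(val_ds).logits
y_pred = np.argmax(logits, axis=-1)
y_true = labels[:n_val]

cm = confusion_matrix(y_true, y_pred, normalize="true")
plt.figure(figsize=(12, 10))
plt.imshow(cm, cmap="Blues")
plt.xticks(range(NUM_LABELS), categories, rotation=90)
plt.yticks(range(NUM_LABELS), categories)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.colorbar()
plt.show()
```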

Let's compute the (weighted) precision, recall, and F1 score of the model. For a quick overview of these metrics, you can look at the nice posts Multi-Class Metrics Made Simple, Part I: Precision and Recall and Multi-Class Metrics Made Simple, Part II: the F1-score.
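With scikit-learn, reusing y_true and y_pred from the previous snippet, this amounts to:

```python
from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 score: {f1:.3f}")
```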

Precision: 0.769
Recall: 0.775
F1 score: 0.769

Let’s visualize the top-3 predictions by probability for some examples in the validation set. The probability for each prediction is indicated in round brackets.
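A sketch of how such a report could be produced, reusing the logits, label mappings, and validation split from the snippets above:

```python
# Probabilities from the validation logits
probs = tf.nn.softmax(logits, axis=-1).numpy()

# Pick a few random validation examples and print their top-3 predictions
for i in np.random.choice(n_val, size=3, replace=False):
    print("HEADLINE:", df["headline"].iloc[i])
    print("SHORT DESCRIPTION:", df["short_description"].iloc[i])
    print("TRUE LABEL:", id2label[y_true[i]])
    top3 = np.argsort(probs[i])[::-1][:3]
    for rank, j in enumerate(top3, start=1):
        print(f"Prediction {rank}: {id2label[j]} ({probs[i][j]:.1%});")
```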

HEADLINE: Homemade Gift Ideas: Tart-Cherry and Dark Chocolate Bar Wrappers 
SHORT DESCRIPTION: This DIY gift only LOOKS professional.
TRUE LABEL: HOME & LIVING
Prediction 1:HOME & LIVING (77.5%);
Prediction 2:FOOD, DRINK & TASTE (19.8%);
Prediction 3:STYLE & BEAUTY (0.7%);
HEADLINE: Small Parties Claim Their Share In Upcoming Greek Elections
SHORT DESCRIPTION: Some of the country's lesser-known political players believe they've spotted their chance.
TRUE LABEL: WORLD NEWS
Prediction 1:WORLD NEWS (99.2%);
Prediction 2:POLITICS (0.4%);
Prediction 3:ENVIRONMENT & GREEN (0.1%);
HEADLINE: 46 Tons Of Beads Found In New Orleans' Storm Drains
SHORT DESCRIPTION: The Big Easy is also the big messy.
TRUE LABEL: WEIRD NEWS
Prediction 1:WEIRD NEWS (55.0%);
Prediction 2:ENVIRONMENT & GREEN (14.4%);
Prediction 3:SCIENCE & TECH (10.0%);

Conclusion

We have seen how to use the Hugging Face transformers library together with TensorFlow to fine-tune a large pre-trained model using TPUs on a multiclass classification task.

Thanks to the plethora of Auto Classes in the transformers library, other common NLP downstream tasks can be performed with minor modifications to the code provided.

I hope this tutorial was useful, thanks for reading!
