News classification: fine-tuning RoBERTa on TPUs with TensorFlow
A tutorial on multiclass text classification using Hugging Face transformers.
Fine-tuning large pre-trained models on downstream tasks is a common practice in Natural Language Processing. In this tutorial, we will use a pre-trained RoBERTa model for a multiclass classification task.
RoBERTa: A Robustly Optimized BERT Pretraining Approach, developed by Facebook AI, improves on the popular BERT model by modifying key hyperparameters and pretraining on a larger corpus. This leads to improved performance compared to vanilla BERT.
The transformers library by Hugging Face makes it easy to deploy pre-trained models for a variety of NLP tasks in a few lines of code. Its Auto Model classes wrap the pre-trained models and automatically apply the architectural changes needed for common downstream tasks. Furthermore, these models can be used as Keras models, allowing easy training through the Keras API.
In this tutorial, we will fine-tune RoBERTa on the News Category Dataset, hosted on Kaggle, to predict the category of a news item from its headline and short description. The dataset contains 200k news headlines published between 2012 and 2018.
Full code is available in the following public Colab notebook.
0. INSTALL AND IMPORT DEPENDENCIES
We first have to install the transformers library, which can be done with pip install transformers.
Next, we import the libraries needed for the rest of the tutorial.
1. INSTANTIATE THE TPU
The model was trained on Colab's free TPUs. TPUs allow us to train our model much faster and to use a larger batch size. To enable the TPU on Colab, click on “Edit”->“Notebook Settings” and select “TPU” in the “Hardware Accelerator” field. To instantiate the TPU so that it can be used with TensorFlow, we need to run the following code:
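The original code cell is not reproduced here, so what follows is a minimal sketch of the usual TensorFlow 2 TPU setup, not necessarily the exact code used for this tutorial. The helper names connect_to_tpu and global_batch_size, the fallback to the default strategy, and the per-replica batch size in the usage note are assumptions made for illustration.

```python
def global_batch_size(per_replica_batch: int, num_replicas: int) -> int:
    """Return a batch size that is an exact multiple of the replica count."""
    return per_replica_batch * num_replicas


def connect_to_tpu():
    """Connect to the Colab TPU and return a distribution strategy.

    Falls back to the default (CPU/GPU) strategy when no TPU is found,
    which is an assumption added so the sketch also runs off-TPU.
    """
    import tensorflow as tf  # lazy import: the helper above works without TF

    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    except ValueError:
        # No TPU detected in this runtime.
        return tf.distribute.get_strategy()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)
```

A typical use would be tpu_strategy = connect_to_tpu() followed by batch_size = global_batch_size(16, tpu_strategy.num_replicas_in_sync), where 16 is a hypothetical per-replica value.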
To make full use of the TPU potential, we have set a batch size that is a multiple of the number of TPUs in our cluster. We will then just need to instantiate our model under tpu_strategy.scope().
2. DATA EXPLORATION
Let’s load the data. We will concatenate the headline and the description into a single input text that we will feed to our network later.
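As a rough sketch of this step, the snippet below parses JSON-lines records and joins the two text fields. It assumes the Kaggle file stores one JSON object per line with the keys headline, short_description, and category; the function name load_news is hypothetical.

```python
import json


def load_news(lines):
    """Parse JSON-lines records into parallel lists of texts and categories."""
    texts, categories = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Concatenate headline and description into a single input string.
        text = f"{record['headline']} {record['short_description']}".strip()
        texts.append(text)
        categories.append(record["category"])
    return texts, categories


# One record built from an example shown later in this tutorial.
sample_lines = [
    json.dumps(
        {
            "category": "WEIRD NEWS",
            "headline": "46 Tons Of Beads Found In New Orleans' Storm Drains",
            "short_description": "The Big Easy is also the big messy.",
        }
    )
]
texts, categories = load_news(sample_lines)
```

With the real data, the same function can be fed an open file handle, since iterating over a file yields its lines.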
The news headlines are classified into 41 categories. Let’s visualize how they are distributed.
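Before plotting, a plain count per category already shows the imbalance. The sketch below uses only the standard library; the function name category_distribution and the toy input are illustrative and not part of the original notebook.

```python
from collections import Counter


def category_distribution(categories):
    """Return (category, count, share) tuples, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]


# Toy input, only to show the shape of the result.
toy_categories = ["POLITICS", "POLITICS", "POLITICS", "WORLD NEWS"]
distribution = category_distribution(toy_categories)
```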
We see that we have a lot of categories with few entries. Furthermore, some categories may refer to closely related or overlapping concepts. Since there is a significant number of categories to predict, let’s aggregate the categories that refer to similar concepts. This will make the classification task a little bit easier.
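One simple way to express such an aggregation is a lookup table from raw category to merged category. The table below is hypothetical and deliberately partial: it is inferred from merged label names that appear in the predictions later in this tutorial, and it is not the full mapping used to reach the final set of classes.

```python
# Hypothetical, partial merge table: raw category -> aggregated category.
MERGE = {
    "FOOD & DRINK": "FOOD, DRINK & TASTE",
    "TASTE": "FOOD, DRINK & TASTE",
    "ENVIRONMENT": "ENVIRONMENT & GREEN",
    "GREEN": "ENVIRONMENT & GREEN",
    "SCIENCE": "SCIENCE & TECH",
    "TECH": "SCIENCE & TECH",
}


def aggregate(categories, merge=MERGE):
    """Replace each category by its merged name; unmapped ones stay as-is."""
    return [merge.get(category, category) for category in categories]
```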
We are thus left with 28 aggregated categories distributed as follows
3. DATA PREPROCESSING
We now have to preprocess our data into a form that can be used by a TensorFlow Keras model. As a first step, we need to turn the class labels into integer indices. We don’t need a one-hot encoding, since we will use TensorFlow's sparse categorical cross-entropy loss, which takes integer class indices directly.
Next, we need to tokenize the text, i.e., transform our strings into lists of indices that can be fed to the model. The transformers library provides the AutoTokenizer class, which loads the pre-trained tokenizer used for RoBERTa.
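The label-to-index step can be sketched in a few lines of plain Python. The function names are illustrative, and sorting the class names is just one convenient way to make the mapping reproducible.

```python
def build_label_index(categories):
    """Map each distinct category name to an integer index (sorted order)."""
    classes = sorted(set(categories))
    return {name: index for index, name in enumerate(classes)}


def encode_labels(categories, label_to_index):
    """Turn category names into the integer ids expected by a sparse loss."""
    return [label_to_index[category] for category in categories]


labels = ["WORLD NEWS", "POLITICS", "WORLD NEWS"]
label_to_index = build_label_index(labels)
encoded = encode_labels(labels, label_to_index)
```

Keeping label_to_index around is useful later, since its inverse turns predicted indices back into readable category names.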
RoBERTa uses a byte-level BPE tokenizer that performs subword tokenization, i.e. unknown rare words are split into common subwords present in the vocabulary. We will see what this means in examples.
Here the flag padding=True pads each sentence to the length of the longest sequence in the batch, while truncation=True truncates the sentences to the maximum number of tokens the model can accept (512 for RoBERTa, as for BERT).
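The sketch below shows how the tokenizer call might look, together with a toy function that imitates the two flags on plain lists of token ids. The checkpoint name roberta-base is an assumption (the tutorial does not state which RoBERTa size was used), and pad_and_truncate is a simplified illustration: unlike the real tokenizer, it does not keep the end-of-sentence token when it cuts a sequence.

```python
def encode_with_roberta(texts, checkpoint="roberta-base"):
    """Tokenize texts with a pre-trained RoBERTa tokenizer (downloads files)."""
    from transformers import AutoTokenizer  # lazy import: needs `transformers`

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # padding=True pads to the longest sequence in the batch;
    # truncation=True cuts sequences at the model's maximum length.
    return tokenizer(texts, padding=True, truncation=True)


def pad_and_truncate(batch, max_length, pad_id=1):
    """Toy version of the two flags, applied to lists of token ids."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
```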
Let’s visualize how the text gets tokenized.
Input: Twitter Users Just Say No To Kellyanne Conway's Drug Abuse Cure
Subword tokenization: ['Twitter', 'ĠUsers', 'ĠJust', 'ĠSay', 'ĠNo', 'ĠTo', 'ĠKell', 'yan', 'ne', 'ĠConway', "'s", 'ĠDrug', 'ĠAbuse', 'ĠCure']
Indices: [0, 22838, 16034, 1801, 9867, 440, 598, 12702, 7010, 858, 13896, 18, 8006, 23827, 32641, 2, 1, 1, 1]

Input: Target's Wedding Dresses Are Nicer Than You Might Think (VIDEO)
Subword tokenization: ['Target', "'s", 'ĠWedding', 'ĠD', 'resses', 'ĠAre', 'ĠNic', 'er', 'ĠThan', 'ĠYou', 'ĠMight', 'ĠThink', 'Ġ(', 'VIDEO', ')']
Indices: [0, 41858, 18, 21238, 211, 13937, 3945, 13608, 254, 15446, 370, 30532, 9387, 36, 36662, 43, 2, 1, 1]

Input: Televisa Reinstates Fired Hosts, Is Investigating Sexual Harassment Claims
Subword tokenization: ['Te', 'lev', 'isa', 'ĠRe', 'inst', 'ates', 'ĠFired', 'ĠHost', 's', ',', 'ĠIs', 'ĠInvestig', 'ating', 'ĠSexual', 'ĠHar', 'assment', 'ĠClaims']
Indices: [0, 16215, 9525, 6619, 1223, 16063, 1626, 41969, 10664, 29, 6, 1534, 34850, 1295, 18600, 2482, 34145, 28128, 2]
The character Ġ appearing in the subword tokenization marks the beginning of a new word; tokens without it are continuations of a longer word that has been split. The RoBERTa tokenizer uses 0 for the beginning-of-sentence token, 1 for the pad token, and 2 for the end-of-sentence token.
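To make the two conventions concrete, here is a small standard-library sketch that rebuilds the text from the first tokenization shown above and derives an attention mask from the pad id. The function names are illustrative; in practice the tokenizer returns the attention mask itself.

```python
PAD_ID = 1  # pad token id used by the RoBERTa tokenizer, as stated above


def tokens_to_text(tokens):
    """Join subword tokens; each 'Ġ' marks a word boundary (a space)."""
    return "".join(tokens).replace("Ġ", " ")


def attention_mask(ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if token_id == pad_id else 1 for token_id in ids]


tokens = ['Twitter', 'ĠUsers', 'ĠJust', 'ĠSay', 'ĠNo', 'ĠTo', 'ĠKell', 'yan',
          'ne', 'ĠConway', "'s", 'ĠDrug', 'ĠAbuse', 'ĠCure']
rebuilt = tokens_to_text(tokens)
```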
As the last step in our data preprocessing, we create a TensorFlow dataset from our data and we use the first 10% of the data for validation.
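A possible shape for this step is sketched below, assuming the tokenized inputs are already padded to a common length. The split helper is plain Python; make_dataset relies on tf.data.Dataset.from_tensor_slices, and both function names as well as the feature keys input_ids and attention_mask are assumptions about how the inputs are passed to the model.

```python
def split_first_fraction(items, fraction=0.1):
    """Return (validation, training): the first `fraction` of items, then the rest."""
    n_validation = int(len(items) * fraction)
    return items[:n_validation], items[n_validation:]


def make_dataset(input_ids, attention_mask, labels, batch_size, shuffle=False):
    """Build a batched tf.data.Dataset of (features, label) pairs."""
    import tensorflow as tf  # lazy import: the split helper works without TF

    features = {"input_ids": input_ids, "attention_mask": attention_mask}
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    if shuffle:
        # Shuffle over the whole set; only sensible for the training split.
        dataset = dataset.shuffle(len(labels))
    return dataset.batch(batch_size)
```

The same split has to be applied to the inputs, the attention masks, and the labels so that the three stay aligned.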
3. LOADING THE MODEL AND TRAINING
Now that we have preprocessed the data, we need to instantiate the model. We will use the Hugging Face TensorFlow auto class for sequence classification. Using the from_pretrained method and setting num_labels equal to the number of classes in our dataset, this class takes care of all the dirty work for us: it downloads the pre-trained RoBERTa weights and instantiates a Keras model with a classification head on top. We can thus use all the usual Keras methods such as compile, fit, and save_weights. We fine-tune our model for 6 epochs with a small learning rate of 1e-5 and clipnorm=1.0 to limit potentially large gradients that could destroy the features learned during pretraining.
We see that the validation loss saturates pretty quickly while the training loss continues to decrease. The model is in fact quite powerful and starts to overfit if trained for longer.
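The following sketch shows one way the model could be built and compiled under the TPU strategy scope. It assumes a transformers release that still ships the TensorFlow classes; the Adam optimizer, the roberta-base checkpoint, and the metric names are assumptions, since the text only fixes the number of epochs, the learning rate, and the clipping norm.

```python
# Hyperparameters stated in the tutorial text.
TRAINING_CONFIG = {"epochs": 6, "learning_rate": 1e-5, "clipnorm": 1.0}


def build_model(strategy, num_labels, checkpoint="roberta-base",
                config=TRAINING_CONFIG):
    """Create and compile a RoBERTa classifier inside the strategy scope."""
    import tensorflow as tf  # lazy imports: both packages are heavy
    from transformers import TFAutoModelForSequenceClassification

    with strategy.scope():
        model = TFAutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=num_labels
        )
        # Adam is an assumption; clipnorm caps the norm of each gradient.
        optimizer = tf.keras.optimizers.Adam(
            learning_rate=config["learning_rate"], clipnorm=config["clipnorm"]
        )
        model.compile(
            optimizer=optimizer,
            # The classification head returns raw logits, not probabilities.
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=[
                tf.keras.metrics.SparseCategoricalAccuracy(name="top1"),
                tf.keras.metrics.SparseTopKCategoricalAccuracy(k=3, name="top3"),
            ],
        )
    return model
```

Training would then be a regular Keras call such as model.fit(train_ds, validation_data=val_ds, epochs=TRAINING_CONFIG["epochs"]), where train_ds and val_ds are hypothetical names for the two datasets.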
4. EVALUATION
The model reaches ~77% top-1 accuracy and ~93% top-3 accuracy on the validation set, out of a total of 28 classes.
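For reference, top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. A minimal standard-library sketch, with toy scores that are not taken from the model:

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of rows whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy scores for three examples and three classes.
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
```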
Let’s visualize the confusion matrix on the validation set
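The matrix behind such a plot is simple to compute: entry (i, j) counts the examples of true class i that were predicted as class j. A sketch in plain Python, with illustrative names and toy labels:

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Return a num_classes x num_classes matrix of counts.

    Rows are true classes, columns are predicted classes.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_class, predicted_class in zip(y_true, y_pred):
        matrix[true_class][predicted_class] += 1
    return matrix


toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2, 2]
toy_matrix = confusion_matrix(toy_true, toy_pred, num_classes=3)
```

Off-diagonal entries point to pairs of categories that the model tends to confuse.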
Let’s compute (weighted) precision, recall, and f1 metrics of the model. For a quick overview of these metrics, you can look at the nice posts Multi-Class Metrics Made Simple, Part I: Precision and Recall, and Multi-Class Metrics Made Simple, Part II: the F1-score.
Precision: 0.769
Recall: 0.775
F1 score: 0.769
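The weighted variants average the per-class scores using each class's share of the true labels as its weight. The sketch below is a plain-Python illustration of that definition, not necessarily the code that produced the numbers above; it treats a class that is never predicted as having precision 0.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return support-weighted (precision, recall, f1)."""
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = true_positives[cls]
        p = tp / predicted[cls] if predicted[cls] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1
```

With these weights, the weighted recall equals the overall accuracy, which is why the recall above matches the ~77% top-1 accuracy.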
Let’s visualize the top-3 predictions by probability for some examples in the validation set. The probability for each prediction is indicated in round brackets.
HEADLINE: Homemade Gift Ideas: Tart-Cherry and Dark Chocolate Bar Wrappers
SHORT DESCRIPTION: This DIY gift only LOOKS professional.
TRUE LABEL: HOME & LIVING
Prediction 1: HOME & LIVING (77.5%);
Prediction 2: FOOD, DRINK & TASTE (19.8%);
Prediction 3: STYLE & BEAUTY (0.7%);

HEADLINE: Small Parties Claim Their Share In Upcoming Greek Elections
SHORT DESCRIPTION: Some of the country's lesser-known political players believe they've spotted their chance.
TRUE LABEL: WORLD NEWS
Prediction 1: WORLD NEWS (99.2%);
Prediction 2: POLITICS (0.4%);
Prediction 3: ENVIRONMENT & GREEN (0.1%);

HEADLINE: 46 Tons Of Beads Found In New Orleans' Storm Drains
SHORT DESCRIPTION: The Big Easy is also the big messy.
TRUE LABEL: WEIRD NEWS
Prediction 1: WEIRD NEWS (55.0%);
Prediction 2: ENVIRONMENT & GREEN (14.4%);
Prediction 3: SCIENCE & TECH (10.0%);
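Listings like the ones above can be produced by turning the model's logits into probabilities with a softmax and keeping the three largest. The sketch below uses only the standard library; the function names and the toy logits are illustrative and do not come from the trained model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_predictions(logits, class_names, k=3):
    """Return the k most probable (class name, probability) pairs."""
    probabilities = softmax(logits)
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def format_predictions(predictions):
    """Render predictions in the style used in the listings above."""
    return [f"Prediction {rank}: {name} ({probability:.1%});"
            for rank, (name, probability) in enumerate(predictions, start=1)]


# Toy logits chosen so that the probabilities are 0.1, 0.2, 0.3 and 0.4.
toy_logits = [math.log(1), math.log(2), math.log(3), math.log(4)]
toy_names = ["POLITICS", "SCIENCE & TECH", "ENVIRONMENT & GREEN", "WEIRD NEWS"]
toy_top3 = top_predictions(toy_logits, toy_names)
```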
Conclusion
We have seen how to use the Hugging Face transformers library together with TensorFlow to fine-tune a large pre-trained model using TPUs on a multiclass classification task.
Thanks to the plethora of Auto Classes in the transformers library, other common NLP downstream tasks can be performed with minor modifications to the code provided.
I hope this tutorial was useful. Thanks for reading!