Transformers for Multi-Label Classification made simple.

BERT, XLNet, RoBERTa, etc. for multilabel classification: a step-by-step guide

Ronak Patel
Towards Data Science


As a data scientist learning the state of the art for text classification, I found that there are not many easy-to-follow examples of adapting transformers (BERT, XLNet, etc.) for multilabel classification, so I decided to try it for myself, and here it is!

As an homage to other multilabel text classification blog posts, I will be using the Toxic Comment Classification Challenge dataset.

This post is accompanied by an interactive Google Colab notebook so you can try this yourself. All you have to do is upload the train.csv, test.csv, and test_labels.csv files into the instance. Let’s get started.

In this tutorial I will be using Hugging Face’s transformers library along with PyTorch (with GPU), although this can easily be adapted to TensorFlow; I may write a separate tutorial for that later if this post picks up traction, along with tutorials for multiclass classification. Below I will be training a BERT model, but I will show you along the way how easy it is to adapt this code for other transformer models.

Import Libraries
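The Colab notebook handles the imports for you; if you are adapting the code elsewhere, the libraries used throughout this post look roughly like the following (swap in the XLNet or RoBERTa classes if that is the model you choose, and note that on recent versions of transformers the AdamW optimizer lives in torch.optim instead):

import numpy as np
import pandas as pd
import torch
from torch.nn import BCEWithLogitsLoss
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from transformers import BertTokenizer, BertForSequenceClassification, AdamW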

Load & Preprocess Training Data

The toxic dataset is already cleaned and separated into train and test sets, so we can load the train set and use it directly.
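A rough sketch of the loading step, assuming the standard Kaggle layout of train.csv (an id column, a comment_text column, and six binary label columns):

train_df = pd.read_csv('train.csv')
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
num_labels = len(label_cols)  # 6 labels in the toxic comment dataset
labels = train_df[label_cols].values.tolist()  # one multi-hot label list per comment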

Each transformer model requires different tokenization encodings, meaning the way the sentence is tokenized and attention masks are used may differ depending on the transformer model you choose. Thankfully, HuggingFace’s transformers library makes this extremely easy to implement for each model. In the code below we load a pretrained BERT tokenizer and use the method “batch_encode_plus” to get tokens, token types, and attention masks. Feel free to load the tokenizer that suits the model you would like to use for prediction, e.g.:

BERT:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
XLNet:
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=False)
RoBERTa:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=False)
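With a tokenizer loaded, the encoding step looks roughly like this (a sketch; max_length is up to you, and older versions of transformers use pad_to_max_length=True instead of padding='max_length'):

encodings = tokenizer.batch_encode_plus(
    train_df['comment_text'].tolist(),
    max_length=128,
    padding='max_length',
    truncation=True,
    return_token_type_ids=True,
    return_attention_mask=True,
)
input_ids = encodings['input_ids']
token_type_ids = encodings['token_type_ids']
attention_masks = encodings['attention_mask']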

Next, we will use 10% of our training inputs as a validation set so we can monitor our classifier’s performance as it trains. Here we want to make sure we use the “stratify” parameter so that no label combinations appear only in the validation set. In order to stratify appropriately, we will take all label combinations that appear only once in the dataset and force them into the training set. We will also need to create PyTorch data loaders to load the data for training and prediction; a sketch of all of this follows.
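One way to implement this, reusing the variable names from the sketches above (the batch size and random seed are illustrative defaults, not necessarily what the notebook uses; token type ids are left out of the loaders since the model calls later in this post pass token_type_ids=None):

from collections import Counter

# Label combinations that appear only once cannot be stratified,
# so pull them out and force them into the training set afterwards.
combo_counts = Counter(tuple(l) for l in labels)
one_freq_idx = [i for i, l in enumerate(labels) if combo_counts[tuple(l)] == 1]
one_freq_set = set(one_freq_idx)
rest_idx = [i for i in range(len(labels)) if i not in one_freq_set]

rest_ids = [input_ids[i] for i in rest_idx]
rest_masks = [attention_masks[i] for i in rest_idx]
rest_labels = [labels[i] for i in rest_idx]

# 10% validation split, stratified on the full label combination.
train_ids, val_ids, train_labels, val_labels, train_masks, val_masks = train_test_split(
    rest_ids, rest_labels, rest_masks,
    test_size=0.10, stratify=rest_labels, random_state=42)

# Add the single-occurrence examples back into the training set only.
train_ids += [input_ids[i] for i in one_freq_idx]
train_labels += [labels[i] for i in one_freq_idx]
train_masks += [attention_masks[i] for i in one_freq_idx]

batch_size = 32
train_data = TensorDataset(torch.tensor(train_ids), torch.tensor(train_masks),
                           torch.tensor(train_labels, dtype=torch.float))
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)

val_data = TensorDataset(torch.tensor(val_ids), torch.tensor(val_masks),
                         torch.tensor(val_labels, dtype=torch.float))
val_dataloader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)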

Load Model & Set Params

Loading the appropriate model can be done as shown below; each model already comes with a single dense layer for classification on top.

BERT:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
XLNet:
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=num_labels)
RoBERTa:
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=num_labels)

Optimizer params can be configured in a few ways. Here we are using customized optimizer parameter groups (which I’ve had more success with); however, you could just pass “model.parameters()” as shown in the comments.
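A common way to set up such custom parameter groups is the standard weight-decay exclusion pattern for transformer fine-tuning (a sketch; the learning rate and decay values are typical defaults, and the notebook’s exact settings may differ):

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    # Apply weight decay to everything except biases and LayerNorm weights.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
# optimizer = AdamW(model.parameters(), lr=2e-5)  # the simpler alternative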

Train Model

The HuggingFace sequence classification models are configured for multiclass classification out of the box, using “Categorical Cross Entropy” as the loss function whenever labels are passed in. In that case, the output of a transformer model would be akin to:

outputs = model(batch_input_ids, token_type_ids=None, attention_mask=batch_input_mask, labels=batch_labels)
loss, logits = outputs[0], outputs[1]

However, if we avoid passing in a labels parameter, the model will only output logits, which we can use to calculate our own loss for multilabel classification.

outputs = model(batch_input_ids, token_type_ids=None, attention_mask=batch_input_mask)
logits = outputs[0]

Below is a code snippet doing exactly that. Here we use “Binary Cross Entropy With Logits” as our loss function; we could just as easily have used standard “Binary Cross Entropy”, “Hamming Loss”, etc.
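In rough outline it looks like this (the full version is in the notebook; variable names follow the sketches above, and the number of epochs is just an example):

loss_func = BCEWithLogitsLoss()
model.cuda()

epochs = 3
for epoch in range(epochs):
    model.train()
    for batch in train_dataloader:
        batch_input_ids, batch_input_mask, batch_labels = (t.cuda() for t in batch)
        optimizer.zero_grad()
        # No labels are passed in, so the model returns only logits.
        outputs = model(batch_input_ids, token_type_ids=None, attention_mask=batch_input_mask)
        logits = outputs[0]
        loss = loss_func(logits.view(-1, num_labels), batch_labels.view(-1, num_labels))
        loss.backward()
        optimizer.step()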

For validation, we will use the micro-averaged F1 score to monitor training performance across epochs. To do so we take the logits from the model output, pass them through a sigmoid function (giving us values between 0 and 1), and threshold them at 0.50 to generate predictions. These predictions can then be scored against the true labels.
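For example, the validation pass might look like this (a sketch, following the same variable names as above):

model.eval()
val_logits, val_true = [], []
with torch.no_grad():
    for batch in val_dataloader:
        batch_input_ids, batch_input_mask, batch_labels = (t.cuda() for t in batch)
        outputs = model(batch_input_ids, token_type_ids=None, attention_mask=batch_input_mask)
        val_logits.append(outputs[0].detach().cpu())
        val_true.append(batch_labels.cpu())

val_probs = torch.sigmoid(torch.cat(val_logits)).numpy()
val_preds = (val_probs >= 0.50).astype(int)
val_true = torch.cat(val_true).numpy().astype(int)
print('Validation micro F1:', f1_score(val_true, val_preds, average='micro'))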

Voilà! We’re ready for training, so run it. My training times ranged between 20 and 40 minutes per epoch, depending on the max token length and the GPU in use.

Prediction & Metrics

Predicting on the test set works much like the validation pass. Here we will be loading, preprocessing, and predicting with the test data.
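A sketch of that step (note that in test_labels.csv the rows Kaggle did not score carry labels of -1, so they should be dropped before computing metrics):

test_df = pd.read_csv('test.csv').merge(pd.read_csv('test_labels.csv'), on='id')
test_df = test_df[test_df[label_cols].ne(-1).all(axis=1)]  # drop unscored rows

test_enc = tokenizer.batch_encode_plus(test_df['comment_text'].tolist(),
                                       max_length=128, padding='max_length', truncation=True)
test_data = TensorDataset(torch.tensor(test_enc['input_ids']),
                          torch.tensor(test_enc['attention_mask']),
                          torch.tensor(test_df[label_cols].values, dtype=torch.float))
test_dataloader = DataLoader(test_data, sampler=SequentialSampler(test_data), batch_size=32)
# Running the same no-grad loop as in validation yields test_probs and test_true.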

Output DataFrame

Creating a dataframe of outputs that shows each comment alongside its predicted labels.
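For instance (a sketch, assuming the test_probs array from the prediction step above):

pred_df = pd.DataFrame((test_probs >= 0.50).astype(int), columns=label_cols)
pred_df.insert(0, 'comment_text', test_df['comment_text'].values)
pred_df.head()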

Bonus: Optimizing the Threshold for Micro F1

Iterating through threshold values to find the one that maximizes the micro F1 score.
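A simple sweep does the job (a sketch, using the test_probs and test_true arrays assumed above):

best_threshold, best_f1 = 0.50, 0.0
for threshold in np.arange(0.05, 0.95, 0.01):
    f1 = f1_score(test_true, (test_probs >= threshold).astype(int), average='micro')
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1
print(f'Best threshold: {best_threshold:.2f}, micro F1: {best_f1:.4f}')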

That’s it! Please comment if you have any questions. Here is the link to the Google Colab notebook again in case you missed it. If you have any personal inquiries, feel free to contact me on LinkedIn or Twitter.

References:

https://electricenjin.com/blog/look-out-google-bert-is-here-to-shake-up-search-queries

https://github.com/google-research/bert

https://github.com/huggingface/transformers

https://pytorch.org/docs/stable/nn.html#bcewithlogitsloss

https://arxiv.org/abs/1706.03762
