
Achieving state-of-the-art for offensive tweet prediction using transformers

Utilising DistilBERT and fine-tuning for text classification

Photo by Claudio Schwarz on Unsplash

Whilst it’s easy to take for granted that tools like Hugging Face let us apply complex models and transfer learning to any problem we like, I thought it would be beneficial to show that these tools can actually achieve state-of-the-art (SOTA) results in an afternoon. Otherwise, what’s the point in trying?

Our task will be to predict whether a tweet is offensive or not. To do this, we’ll be using the TweetEval dataset from the paper TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. We’re only going to use the subset of this dataset called offensive, but you can check out the other subsets, which label things like emotion and stance on climate change. We are performing a type of text classification, and will be using a smaller, faster version of the BERT transformer model called DistilBERT.

Dataset

The offensive config of the TweetEval dataset has a dataset card on Hugging Face, which describes it as consisting of:

  • text: a string feature containing the tweet.
  • label: an int classification label with the following mapping: 0: non-offensive, 1: offensive

Hugging Face Datasets

Let’s load the appropriate dataset with the load_dataset() function:
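As a sketch, assuming the Hub id tweet_eval with the offensive config:

```python
from datasets import load_dataset

# Load only the "offensive" configuration of TweetEval
offensive = load_dataset("tweet_eval", "offensive")
print(offensive)
```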

The offensive object is similar to a Python dictionary, with the keys being the dataset splits (training, validation, and testing).

Using traditional Python dictionary syntax allows us to access any of these individual splits. Each one is returned as a Dataset object, a key structure in Hugging Face Datasets. Think of a Dataset as a special type of array, meaning we can index it and get its length.

The key thing to understand here is that a single item from the dataset (think of this as a row for training) is a dictionary, consisting of the keys text and label, and the values in these keys are the tweet itself, and the offensive status.
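For example, reusing the offensive object from above:

```python
train_ds = offensive["train"]   # a Dataset object
print(len(train_ds))            # number of training tweets
print(train_ds[0])              # a dict, e.g. {"text": "...", "label": 0}
```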

From datasets to DataFrames

Whilst managing dictionaries is doable in Python, it is easier to use Pandas DataFrames, particularly since most data scientists are very familiar with Pandas. Hugging Face allows us to convert between a standard Dataset object and a Pandas DataFrame. We can do this using set_format():
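A sketch of how that looks (the DataFrame name df is my choice):

```python
# Make indexing return pandas objects instead of Python dicts
offensive.set_format(type="pandas")
df = offensive["train"][:]   # the whole training split as a DataFrame
df.head()
```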


In case we forget whether 0 or 1 is the offensive label, we can convert between label integers and names. To do this, we access the dataset’s features, index into the label feature, and use its int2str() method:
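As a sketch (the helper name label_int2str is mine):

```python
# Map label integers (0/1) back to their string names
def label_int2str(row):
    return offensive["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()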


Imbalanced dataset – frequency of classes

There are many different strategies to deal with imbalanced data, where some labels appear much more often than others. With a simple histogram, it’s easy to see that our dataset is imbalanced:
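Something along these lines produces the plot below (building on the df and label_name column from earlier):

```python
import matplotlib.pyplot as plt

# Bar chart of label counts in the training split
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()
```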

Frequency of our two classes in the dataset (image by author).

Whilst I won’t go into how to address this problem here, it’s important to be aware of it. Here’s a good resource for dealing with imbalanced data in classification problems.

Length of tweets and maximum model context

Different models accept different amounts of context, where the context is the number of tokens used as an input sequence. The maximum input sequence length is known as the maximum context size. Although the number of tokens depends on how we tokenise, we can estimate whether most of our inputs (i.e. tweets) will exceed the maximum context size by examining words per tweet:
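A rough sketch of that check (the words_per_tweet column name is mine):

```python
import matplotlib.pyplot as plt

# Whitespace-separated words as a cheap proxy for token count
df["words_per_tweet"] = df["text"].str.split().apply(len)
print(df["words_per_tweet"].max())   # the longest tweet in words

df["words_per_tweet"].hist(bins=30)
plt.xlabel("Words per tweet")
plt.ylabel("Count")
plt.show()
```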

We’ll be using the DistilBERT model, which has a maximum context size of 512 tokens. With an upper limit of 70 words per tweet, we should be fine.

Tokenisation

When we use a Hugging Face model, especially a pretrained one, we need to ensure that we use the same tokeniser that the model was trained with. If you’re not sure why, think of using a different tokeniser as encrypting each token: the token that previously meant "ball" now maps to something different, which gives the model the extra job of decrypting this additional layer.

DistilBERT (the model we’re going to use) uses the WordPiece tokeniser. We don’t need to do anything fancy to instantiate it – AutoTokenizer is an HF class that gives us a pretrained model’s associated tokeniser via the from_pretrained() method:
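For example (the checkpoint name distilbert-base-uncased matches the model we fine-tune later):

```python
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
```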

Since we want to tokenise the whole dataset (including all of its splits), we’ll use the inbuilt map() method, which applies a processing function to the dataset in batches of rows. First, we define that processing function. We use padding=True so that every example in a batch is padded to the length of the longest item, with the remainder filled with 0s, and truncation=True simply ensures every example stays within the maximum context size.
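The processing function itself can be as simple as this sketch (the name tokenize is mine):

```python
def tokenize(batch):
    # Pad to the longest item in the batch and cap at the max context size
    return tokenizer(batch["text"], padding=True, truncation=True)
```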

Also note that in addition to returning the encoded tweets as input_ids, the tokeniser returns a list of attention_mask arrays. This is because we do not want the model to get confused by the additional padding tokens: the attention mask allows the model to ignore the padded parts of the input.

We finally use map() to apply the tokenising across all splits. Setting batched=True speeds up the process by encoding in batches, and batch_size=None means each batch is simply a full split (e.g. training, validation). This ensures that the inputs within a split are the same length (i.e. both the token tensors and the attention masks).
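Something like this (a sketch; offensive_encoded is my name for the result, and reset_format() just undoes the earlier switch to pandas output):

```python
# Go back to plain Python objects before tokenising
offensive.reset_format()

# Encode every split, one batch per split
offensive_encoded = offensive.map(tokenize, batched=True, batch_size=None)
```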

Note that this mapping has added new columns to the dataset.

Fine-tuning a transformer

Unfortunately, DistilBERT is only trained on predicting masked words in a sequence. Whilst the majority of the model (the body) has a deep understanding of the English language, the last few layers (the head) have been trained specifically to predict these masked words. This is not the same as our task. We simply want to output the probability that a sequence belongs to a certain class. Since there are only two classes, we only output two probabilities.

This means we have to train the hidden states in the final layers, which requires the model head to be differentiable. Instead of writing our own training loop in PyTorch as we normally would, we’ll follow a fastai-style approach and use the HF Transformers Trainer API for our training loop.

Loading the pretrained DistilBERT model

To be able to train the final few layers of a model, we need the actual model with the pretrained weights first. Hugging Face normally lets us grab any model with the AutoModel class, using the .from_pretrained() method. Since we need a classification head, we use AutoModelForSequenceClassification, which simply whacks the appropriate architecture for classification on top of the body of pretrained weights. The only thing that requires specification is the number of classes to predict (in our case two):
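A minimal sketch, loading the model onto a GPU if one is available:

```python
import torch
from transformers import AutoModelForSequenceClassification

num_labels = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=num_labels)
         .to(device))
```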

Defining performance metrics

One of the less intuitive things the Trainer API requires is a compute_metrics() function, which takes in an EvalPrediction object (a named tuple with predictions and label_ids attributes) and returns a dictionary of metric: value pairs. Since we’re doing binary classification, we’ll use accuracy and F1-score as our metrics:
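A sketch of such a function using scikit-learn’s metrics (the weighted averaging here is an assumption):

```python
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)   # pick the most probable class
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
```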

Training run

We need to log in to the HF Hub if we want to upload our model run and save our performance. This will allow us to share our model with other users. You’ll need a write API access token, which you can read about in the documentation.
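In a notebook, this is just:

```python
from huggingface_hub import notebook_login

notebook_login()   # paste a write-access token when prompted
```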


The Trainer API also requires training arguments (which include our hyperparameters). We use the TrainingArguments class from HF Transformers to define these. Importantly, output_dir specifies where all the results of our training are stored.

Here we also set the batch size, learning rate, number of epochs, and other important parameters. Finally, we instantiate a Trainer with our model, arguments, metrics, and datasets, and call .train() to fine-tune.
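A sketch of what that setup might look like; the hyperparameter values below are illustrative assumptions, not necessarily the ones behind the results shown later:

```python
from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(offensive_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-tweet_eval-offensive"

training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    push_to_hub=True,
    logging_steps=logging_steps,
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=offensive_encoded["train"],
    eval_dataset=offensive_encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```

After training, trainer.push_to_hub() uploads the final model (and a model card) to the Hub.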


We can keep training for a few more epochs to see if we can squeeze out any more juice. To check that this result is decent, we can implement a dummy classifier in scikit-learn that just predicts the most common class, and see what accuracy it gets. Here, predicting each tweet as inoffensive gives an accuracy of 65%. To use this, you’ll need to define the X_train, y_train, etc.
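One way to define those and run the check (a sketch; since DummyClassifier ignores the features entirely, placeholder arrays suffice):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels come from the dataset; features are placeholders, as they're ignored
y_train = np.array(offensive["train"]["label"])
y_valid = np.array(offensive["validation"]["label"])
X_train = np.zeros((len(y_train), 1))
X_valid = np.zeros((len(y_valid), 1))

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
print(dummy_clf.score(X_valid, y_valid))   # roughly 0.65 on this split
```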

So our model is certainly better than the baseline. However, to get an even better idea of where our performance stands, we should compare it to the available baselines for this dataset.

Note that we can get a dictionary of metrics by calling the trainer’s predict method on the validation dataset, then accessing the metrics attribute of the returned object:
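For example:

```python
preds_output = trainer.predict(offensive_encoded["validation"])
print(preds_output.metrics)   # e.g. loss, accuracy, and F1 on the validation split
```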

Results comparison

A great thing about uploading models to the HF Hub is that they’re automatically benchmarked on a website called Papers With Code. These guys bring together papers, datasets, and community models to, among other things, record evaluation metrics for thousands of different tasks across ML.

The first thing we’ll do is log in to Hugging Face in our browser and click on the particular model (it’ll be listed under whatever you named it).


This page lists the performance of the model, and also hosts an inference API so that you can try out the model for yourself. Additionally, it logs hyperparameters and training performance over time:


If we go down to the bottom right hand corner, we can go to Papers With Code to view the leaderboard for this dataset and this task.


Impressively, we did pretty well with just fine-tuning a pretrained model. We are now first in the world on this particular dataset, for both accuracy and F1 score.

F1 score compared to previous baselines. Note that we achieve SOTA on this dataset (image by author).

We even get acknowledged on the Papers With Code page for the dataset as the state-of-the-art benchmark:

Our model _distilbert-base-uncased-finetuned-tweet_eval-offensive_ is now listed on the Papers With Code website as the benchmark for this dataset (image by author).

Conclusion

This run-through was a real-life demonstration of what Jeremy Howard always says: it doesn’t take a math PhD or expensive GPUs to achieve state-of-the-art results on a task in only an afternoon. Tools like Hugging Face and fastai allow us to quickly train models and iterate on them. Transfer learning with pretrained models, which are becoming more and more readily available, makes these tools even more powerful.

If you are interested in quickly exploring the capabilities of transformers in Hugging Face, check out this demonstration I did here.

References

[1] L. Tunstall, L. Werra, and T. Wolf, Natural Language Processing with Transformers (2022), O’Reilly Media

[2] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019), arxiv

[3] F. Barbieri, J. Camacho-Collados, L. Neves, and L. Espinosa-Anke, TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification (2020), arxiv

[4] J. Howard and S. Ruder, Universal Language Model Fine-tuning for Text Classification (2018), arxiv

[5] Hugging Face Datasets Documentation (2022)
