Whilst it’s easy to take for granted that tools like Hugging Face allow us to apply complex models and transfer learning to any problem we like, I thought it would be worth showing that these tools can actually achieve state-of-the-art (SOTA) results in an afternoon. Otherwise, what’s the point in trying?
Our task will be to predict whether or not a tweet is offensive. To do this, we’ll be using the TweetEval dataset from the paper TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. We’re only going to use the subset of this dataset called offensive, but you can check out the other subsets, which label things like emotion and stance on climate change. We are performing a type of text classification, and will be using a smaller, faster version of the BERT transformer model called DistilBERT.
Dataset
The offensive config of the TweetEval dataset has a model card on Hugging Face which describes it as consisting of:
- text: a string feature containing the tweet.
- label: an int classification label with the following mapping: 0: non-offensive, 1: offensive.
Hugging Face Datasets
Let’s load the appropriate dataset with the load_dataset() function:
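A minimal sketch of this step (assuming the dataset lives on the Hub under the tweet_eval id with an offensive config):

```python
from datasets import load_dataset

# Load only the "offensive" config of the TweetEval benchmark
offensive = load_dataset("tweet_eval", "offensive")
print(offensive)
```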
The offensive object is similar to a Python dictionary, with the keys being the dataset splits (training, validation, and testing).
Using traditional Python dictionary syntax allows us to access any of these individual datasets. These datasets are returned as a Dataset class, which is a key structure in Hugging Face Datasets. Think of a Dataset as a special type of array, meaning we can index it and get its length.
The key thing to understand here is that a single item from the dataset (think of this as a row for training) is a dictionary, consisting of the keys text and label, and the values in these keys are the tweet itself, and the offensive status.
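As a rough illustration of that indexing behaviour (split names as described above):

```python
# Each split behaves like a Dataset: it has a length and supports indexing
train_ds = offensive["train"]
print(len(train_ds))

# A single row is a dict with "text" and "label" keys,
# e.g. {"text": "...", "label": 0}
print(train_ds[0])
```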
From datasets to DataFrames
Whilst managing dictionaries is doable in Python, it is often easier to work with Pandas DataFrames, particularly since most data scientists are very familiar with them. Hugging Face allows us to convert between a standard Dataset object and a Pandas DataFrame. We can do this using set_format():
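Something along these lines, where the pandas output format is applied to every split:

```python
# Switch the output format so that indexing/slicing returns pandas objects
offensive.set_format(type="pandas")

# Slicing the whole training split now gives us a DataFrame
df = offensive["train"][:]
df.head()
```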

In case we forget whether 0 or 1 is the offensive label, we can convert between label integers and names. To do this, we access the dataset’s features, and then the labels using indexing, and finally the int2str function:
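A sketch of that lookup, reusing the df DataFrame from above (the label_name column is just an illustrative name):

```python
# The "label" feature is a ClassLabel, which knows how to map ints to names
def label_int2str(row):
    return offensive["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()
```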

Imbalanced dataset – frequency of classes
There are many different strategies to deal with imbalanced data, where some labels appear much more often than others. With a simple histogram, it’s easy to see that our dataset is imbalanced:
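For example, a simple bar chart of label counts (assuming the label_name column created earlier):

```python
import matplotlib.pyplot as plt

# Plot how often each class appears in the training split
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()
```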

Whilst I won’t go into how to address this problem here, it’s important to be aware of it. Here’s a good resource for dealing with imbalanced data in classification problems.
Length of tweets and maximum model context
Different models take different amounts of context, where the context is the number of tokens used as an input sequence. The maximum input sequence length is known as the maximum context size. Although the token count depends on how we tokenise, we can estimate whether most of our inputs (i.e. tweets) will exceed the maximum context size by examining the number of words per tweet:
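A quick way to sketch this, using whitespace-separated word counts as a stand-in for token counts (the words_per_tweet column name is just illustrative):

```python
import matplotlib.pyplot as plt

# Approximate sequence length by counting whitespace-separated words
df["words_per_tweet"] = df["text"].str.split().apply(len)

# Compare the distribution per class with a boxplot
df.boxplot("words_per_tweet", by="label_name", grid=False)
plt.suptitle("")
plt.show()
```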

We’ll be using the DistilBERT model, which has a maximum context size of 512 tokens. With an upper limit of 70 words per tweet, this means we should be fine.
Tokenisation
When we use a Hugging Face model, especially a pretrained one, we need to ensure that we use the same tokeniser that the model was trained with. If you’re not sure why this matters, think of using a different tokeniser as encrypting each token: the token that previously represented "ball" now maps to something else entirely, giving the model the extra job of decrypting this additional layer.
DistilBERT (the model we’re going to use) uses the WordPiece tokeniser. We don’t need to do anything fancy to instantiate it – AutoTokenizer is a HF class that gives us a pretrained model’s associated tokeniser using the from_pretrained() method:
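For example (the distilbert-base-uncased checkpoint is an assumption; any DistilBERT checkpoint works the same way):

```python
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Quick sanity check on a single string
print(tokenizer("I love machine learning!"))
```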
Since we want to tokenise the whole dataset (including its subsets), we use the inbuilt map() function, which applies a processing function to each row in the dataset. We use padding=True so that every example is padded to the length of the longest item in its batch, with the remainder filled with 0s, and truncation=True simply ensures every example fits within the maximum context size.
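A tokenising function along these lines is enough:

```python
def tokenize(batch):
    # Pad to the longest example in the batch and truncate to the
    # model's maximum context size
    return tokenizer(batch["text"], padding=True, truncation=True)
```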
Also note that in addition to returning the encoded tweets as input_ids, the tokeniser returns a list of attention_mask arrays. This is because we do not want the model to get confused by the additional padding tokens: the attention mask allows the model to ignore the padded parts of the input.
We finally use map to apply the tokenising across all datasets. Setting batched=True speeds up the process by encoding in batches, and batch_size=None means our batches will simply be the full datasets (e.g. training, validation). This ensures that the inputs (both the input tensors and the attention masks) are the same length within each split.
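Something like the following (the offensive_encoded name is just a choice; note we reset the pandas format first so the tokeniser receives plain Python lists):

```python
# Go back to the default output format before mapping
offensive.reset_format()

# batch_size=None -> one batch per split, so padding lengths are
# consistent within each split
offensive_encoded = offensive.map(tokenize, batched=True, batch_size=None)
print(offensive_encoded["train"].column_names)
```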
Note that this mapping has added new columns to the dataset.
Fine-tuning a transformer
Unfortunately, DistilBERT is only pretrained to predict masked words in a sequence. Whilst the majority of the model (the body) has a deep understanding of the English language, the last few layers (the head) have been trained specifically to predict these masked words. This is not the same as our task: we simply want to output the probability that a sequence belongs to a certain class, and since there are only two classes, we only need two output probabilities.
This means we have to train the final layers ourselves, which requires the model head to be differentiable. Instead of writing our own training loop in PyTorch as we normally would, we’ll follow a fastai-style approach and use the HF Transformers Trainer API for our training loop.
Loading the pretrained DistilBERT model
To be able to train the final few layers of a model, we first need the actual model with its pretrained weights. Hugging Face would normally let us grab any model with the AutoModel class, using the .from_pretrained() method. Since we need a classification head, we use AutoModelForSequenceClassification instead, which simply whacks the appropriate classification architecture on top of the body of pretrained weights. The only thing that requires specification is the number of classes to predict (in our case, two):
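A minimal sketch, reusing the model_ckpt defined for the tokeniser:

```python
import torch
from transformers import AutoModelForSequenceClassification

num_labels = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=num_labels)
         .to(device))
```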
Defining performance metrics
One of the less intuitive things that the Trainer API requires is a compute_metrics() function, which takes in an EvalPrediction object (a named tuple with predictions and label_ids attributes) and returns a dictionary of metric:value pairs. Since we’re doing binary classification, we’ll use accuracy and F1-score as our metrics:
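For instance, using scikit-learn’s metric functions (the weighted F1 average is just one reasonable choice here):

```python
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    # pred is an EvalPrediction: label_ids are the true labels,
    # predictions are the raw logits
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
```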
Training run
We need to log in to the HF Hub if we want to upload our model run and save our performance. This will allow us to share our model with other users. You’ll need a write API access token, which you can read about in the documentation.
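From a notebook, this is a one-liner:

```python
from huggingface_hub import notebook_login

# Prompts for a Hugging Face access token with write permissions
notebook_login()
```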

The Trainer API also requires training parameters (which include our hyperparameters). We use the TrainingArguments class from HF Transformers to do this. Importantly, we specify where all the results of our training are stored via output_dir. Here we also set the batch size, learning rate, number of epochs, and other important parameters. Finally, we instantiate the Trainer with our model and call .train() to fine-tune.
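A sketch of the whole training setup; the output_dir name and the hyperparameter values here are illustrative rather than the exact ones used for this run:

```python
from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(offensive_encoded["train"]) // batch_size

training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-tweet-offensive",  # hypothetical name
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=offensive_encoded["train"],
    eval_dataset=offensive_encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```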

We can keep training for a few more epochs to see if we can squeeze out any more juice. To check that this result is decent, we can implement a dummy classifier in scikit-learn that just predicts the most common class, and see what accuracy it gets. Here, predicting each tweet as inoffensive gives an accuracy of 65%. To use this, you’ll need to define X_train, y_train, etc.
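A sketch of that baseline, pulling X_train, y_train and friends straight from the dataset splits:

```python
from sklearn.dummy import DummyClassifier

# Raw text and labels from the splits (the variable names match the prose above)
X_train = offensive["train"]["text"]
y_train = offensive["train"]["label"]
X_valid = offensive["validation"]["text"]
y_valid = offensive["validation"]["label"]

# Always predict the most frequent class seen in the training data
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
print(dummy_clf.score(X_valid, y_valid))  # roughly 0.65 on this split
```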
So our model is certainly better than the baseline. However, to get an even better idea where our performance stands, we should compare to available baselines on this dataset.
Note that we can get a dictionary of metrics by using the predict method of the trainer on the validation dataset, then accessing the metrics attribute of this object:
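For example:

```python
# Run prediction over the validation split and inspect the attached metrics
preds_output = trainer.predict(offensive_encoded["validation"])
print(preds_output.metrics)
```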
Results comparison
A great thing about uploading models to the HF Hub is that they’re automatically graded on a website called Papers With Code, which brings together papers, datasets, and community models to, among other things, record evaluation metrics for thousands of different tasks across ML.
The first thing we’ll do is log in to Hugging Face in our browser and click on the particular model (it’ll be under whatever you named it).

This page lists the performance of the model, and also hosts an inference API so that you can try out the model for yourself. Additionally, it logs hyperparameters and training performance over time:

Down in the bottom right-hand corner, there’s a link to Papers With Code where we can view the leaderboard for this dataset and task.

Impressively, we did pretty well by just fine-tuning a pretrained model. We are now first in the world on this particular dataset, for both accuracy and F1 score.


We even get acknowledged on the Papers With Code page for the dataset as the state-of-the-art benchmark:

Conclusion
This run-through was a real-life demonstration of what Jeremy Howard always says: you don’t need a maths PhD or expensive GPUs to achieve state-of-the-art results on a task in just an afternoon. Tools like Hugging Face and fastai allow us to quickly train models and iterate on them. Transfer learning with pretrained models, which are becoming more and more readily available, leverages these tools even further.
If you are interested in quickly exploring the capabilities of transformers in Hugging Face, check out this demonstration I did here.
References
[1] L. Tunstall, L. von Werra, and T. Wolf, Natural Language Processing with Transformers (2022), O’Reilly Media
[2] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019), arXiv
[3] F. Barbieri, J. Camacho-Collados, L. Neves, and L. Espinosa-Anke, TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification (2020), arXiv
[4] J. Howard and S. Ruder, Universal Language Model Fine-tuning for Text Classification (2018), arXiv
[5] Hugging Face Datasets Documentation (2022)