Introduction
As a Data Scientist, I have never had the opportunity to properly explore the latest progress in Natural Language Processing. With the summer and the boom of Large Language Models since the beginning of the year, I decided it was time to dive deep into the field and embark on some mini-projects. After all, there is no better way to learn than by practicing.
As my journey started, I realized it was hard to find content that takes the reader by the hand and goes, one step at a time, towards a deep comprehension of new NLP models through concrete projects. That is why I decided to start this new series of articles.
In this first article, we are going to take a deep dive into building a comment toxicity ranker. This project is inspired by the "Jigsaw Rate Severity of Toxic Comments" competition which took place on Kaggle last year.
The objective of the competition was to build a model with the capacity to determine which comment (out of two comments given as input) is the most toxic.
To do so, the model assigns a score to every comment passed as input, which determines its relative toxicity.
What this article will cover
In this article, we are going to train our first NLP classifier using PyTorch and Hugging Face transformers. I will not go into the details of how transformers work, but will rather focus on practical details and implementation, and introduce some concepts that will be useful for the next articles of the series.
In particular, we will see:
- How to download a model from Hugging Face Hub
- How to customize and use an Encoder
- How to build and train a PyTorch ranker on top of one of the Hugging Face models
This article is addressed to data scientists who would like to step up their game in NLP from a practical point of view. I will not do much theory around transformers, and even though I will go through the code in detail, I expect that you have already played a bit with PyTorch in the past.
Exploration and Architecture
The Training Dataset
We will work on a dataset that pairs comments and classifies them as one being "less toxic" and one being "more toxic".
The choice of relative toxicity has been made by a group of labelers.
The figure below shows a sample of data from the training set. The worker field represents the id of the labeler that made the classification.

Note: The dataset is available under an Open Source Licence, according to the Kaggle competition rules.
The ranking system
In any ML project, comprehending the task holds paramount significance as it significantly impacts the selection of an appropriate model and strategy. This understanding should be established right from the project’s kick-off.
In this particular project, our objective is to construct a ranking system. Instead of predicting a specific target, our focus is on determining an arbitrary value that facilitates efficient comparison between pairs of samples.
Let’s begin by sketching a basic diagram to represent the concept, knowing that we will go deeper into the workings of the "Model" later on.

Visualizing the task this way is crucial as it demonstrates that the project’s objective goes beyond training a simple binary classifier based on the training data. Instead of simply predicting a value of 0 or 1 to identify the most toxic comment, the ranking system aims to assign arbitrary values that allow efficient comparison between comments.
Model Training & Margin Ranking Loss
Considering the "Model" remains a black box Neural Network, we need to establish a way to utilize this system and leverage our training data to update the model’s weights. To achieve this, we need a suitable loss function.
Given that our goal is to build a ranking system, the Margin Ranking Loss is a relevant choice. This loss function is inspired by the hinge loss, which is commonly used to optimize maximum margins between samples.
The Margin Ranking Loss operates on pairs of samples. For each pair, it compares the scores produced by the "Model" for the two samples and enforces a margin between them. The margin indicates the desired difference in scores between correctly ranked samples.
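For reference, here is the formula, as defined in PyTorch's nn.MarginRankingLoss (which we will rely on later in the article):

loss(x1, x2, y) = max(0, -y * (x1 - x2) + margin)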

In the formula above, x1 and x2 are the ranking scores of two samples, and y is a coefficient equal to 1 if x1 should be ranked higher than x2, and -1 otherwise. "margin" is a hyperparameter of the formula which sets a minimum margin to reach.
Let's have a look at how this loss function works (a quick numeric check in PyTorch follows the two cases below):
Assuming y=1, which means the sample associated with x1 should be ranked higher than the sample associated with x2:
- If (x1 – x2) > margin, the score of sample 1 is higher than the score of sample 2 by a sufficient margin, and the right term of the max() is negative. The loss returned is then equal to 0, and there is no penalty associated with those two ranks.
- If (x1 – x2) < margin, it means that the margin between x1 and x2 is not sufficient, or worse, that the score of x2 is higher than the score of x1. In that case, the loss grows as the score of sample 2 exceeds the score of sample 1, which penalizes the model.
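To make this concrete, here is a quick numeric check with PyTorch's built-in nn.MarginRankingLoss (the scores are made up purely for illustration):

import torch
import torch.nn as nn

loss_fn = nn.MarginRankingLoss(margin=0.5)
y = torch.tensor([1.0])  # x1 should be ranked higher than x2

# Case 1: (x1 - x2) > margin -> no penalty
x1, x2 = torch.tensor([2.0]), torch.tensor([1.0])
print(loss_fn(x1, x2, y))  # tensor(0.)

# Case 2: (x1 - x2) < margin -> positive loss
x1, x2 = torch.tensor([1.0]), torch.tensor([1.2])
print(loss_fn(x1, x2, y))  # tensor(0.7000) = max(0, -(1.0 - 1.2) + 0.5)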
With this in mind, we can now revise our training methodology as follows (a condensed sketch in PyTorch follows the list below):
For a sample (or a batch) in the Training set:
- Forward-pass the more_toxic message(s) to the Model, get Rank_Score1 (x1)
- Forward-pass the less_toxic message(s) to the Model, get Rank_Score2 (x2)
- Compute the MarginRankingLoss with y = 1
- Update the weight of the model based on the computed loss (backpropagation step)
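To make the loop concrete, here is a condensed sketch using PyTorch's built-in MarginRankingLoss. The names model, optimizer and train_loader, as well as the batch keys, are placeholders; the full implementation with the real data loader comes later in the article.

import torch
import torch.nn as nn

loss_fn = nn.MarginRankingLoss(margin=0.0)

for batch in train_loader:
    x1 = model(batch["more_toxic_ids"], batch["more_toxic_mask"])  # Rank_Score1
    x2 = model(batch["less_toxic_ids"], batch["less_toxic_mask"])  # Rank_Score2
    y = torch.ones_like(x1)    # y = 1: the more toxic message should rank higher
    loss = loss_fn(x1, x2, y)  # Margin Ranking Loss
    loss.backward()            # backpropagation
    optimizer.step()           # weight update
    optimizer.zero_grad()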

From Text to Features Representation: the Encoder block
Our training procedure is now set up. It's time to go deeper into the 'Model' component itself. In the world of NLP, you'll often come across three primary types of models: encoders, decoders, and encoder-decoder combinations. In this series of articles, we'll examine these types of models more closely.
For the purposes of this specific article, our requirement is a model that can transform a message into a feature vector. This vector serves as the input to generate the final ranking score. This feature vector will be directly derived from the Encoder of a transformer architecture.
I won’t dive into the theory here as others have explained it way better (I recommend the introduction class from Hugging Face which is very well written). Just keep in mind that the key part of this process is something called the attention mechanism. It helps transformers make sense of the text by looking at other related words, even if they’re far apart.
With this architecture in place, we will be able to adjust the weights that produce the best vector representation of our texts to identify the most important features for our task, and simply connect the final layer from the transformer to a final node (called the "head") that will produce the final rank score.
Let’s update our diagram accordingly:

The Tokenizer
As you can see from the graph above, another component appears inside the model that we have not mentioned yet: a preprocessing step.
This preprocessing step is here to convert the raw text into something that can be passed through a Neural Network (numbers), and this is the role of the Tokenizer.
The tokenizer does two main things: splitting (= cutting the text into pieces, which can be words, parts of words, or just letters) and indexing (= mapping each piece of text to a unique value, referenced in a dictionary so the operation can be reversed).
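As a quick illustration of those two steps, assuming the roberta-base tokenizer that we will load later in the article:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# Splitting: the text is cut into sub-word pieces
pieces = tokenizer.tokenize("This comment is unacceptable")

# Indexing: each piece is mapped to a unique id in the tokenizer's vocabulary
ids = tokenizer.convert_tokens_to_ids(pieces)

# The operation can be reversed thanks to the dictionary
print(tokenizer.convert_ids_to_tokens(ids))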
One really important thing to keep in mind is that there are multiple ways to tokenize a text, but if you use a pre-trained model, you need to use the same Tokenizer or the pre-trained weights will be meaningless (due to different splitting and indexing).
Another important thing to remember is that the encoder is nothing more than a neural network. As such, its input needs to be of a fixed size, which is not necessarily the case for your input text. The Tokenizer allows you to control the size of your token vector via two operations: padding and truncation. This is also an important parameter to consider because some pre-trained models use a smaller, or larger, input space.
In the figure below, we add the Tokenizer and show how the message is transformed from module to module.

And that's it, we have exposed here all the components we need to know in order to efficiently tackle our "Comment Toxicity Ranking" task. To summarize the graph above: each pair of messages (the less toxic and the more toxic) will be passed individually to the model pipeline. They will first pass through the Tokenizer, the Encoder, and the ranking layer to produce a pair of scores. This pair of scores will then be used to compute the Margin Ranking Loss, which will be used during the backpropagation step to update the weights of the encoder and the final ranking layer and optimize them for the task.
In the next part, we are going to get hands-on and build the above pipeline using the Hugging Face transformers module and PyTorch.
Build, train, and evaluate the model
We have covered the theory in the previous part; it is now time to get our hands dirty and work on our model.
While building and training a complex deep-learning model could have been complicated in the past, modern frameworks have made it much more accessible.
Hugging Face is all you need
Hugging Face is an amazing company that is working on democratizing complex deep-learning models.
They build abstractions that help you build, load, fine-tune, and share complex transformer models.
In the coming section, we are going to use their transformers package, which provides all the necessary tools to load pre-trained NLP models and use them for your own tasks. In the coming weeks, we are going to explore in more detail the different possibilities offered by the package.
The package is compatible with both TensorFlow and PyTorch libraries.
To start, let’s install the transformers package
pip install transformers
The models from Hugging Face are available on the Model Hub on their website. You can find all types of models, as well as descriptions to understand what each model does, how many parameters it has, what datasets it has been trained on, etc.
In this article, we are going to use the roberta-base architecture, which is a relatively light encoder trained on several English corpora.
The model description tells us a lot of very important information that is relevant to our task:
- The model has 125M parameters
- The model has been trained on several English corpora, which is important as our comment dataset is in English
- It has been trained with a Masked Language Modeling objective, which consists of predicting words masked in a text using both the text before and after them. This is not always the case: models like GPT only use the context before the word, as they don't have access to the rest of the sentence when they generate new text. A short fill-mask demo follows this list.
- The model is case sensitive, which means it will make a difference between "WORD" and "word". This is particularly important in the case of a toxicity detector as letter capitalization is an important clue of toxicity.
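As a quick demo of this Masked Language Modeling objective, you can ask roberta-base to fill in a masked word with the fill-mask pipeline from transformers (the example sentence is arbitrary):

from transformers import pipeline

# roberta-base uses <mask> as its mask token
unmasker = pipeline("fill-mask", model="roberta-base")

# Returns the most likely candidates for the masked position, with their scores
unmasker("This comment is really <mask>.")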
For each model, Hugging Face can provide the tokenizer used as well as the base neural network in different configurations (you might not want all the weights: sometimes you want to restrict yourself to the encoder part or the decoder part, stop at the hidden layer, etc.).
Models available from the Hugging Face Hub can be cloned locally (which will make them faster to load) or loaded directly in your code by using their repo id (for example roberta-base in our case).
Loading and testing the Tokenizer
To load the tokenizer, we can simply use the AutoTokenizer class from the transformers package and specify which tokenizer we want to use:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
In order to tokenize a text, we can simply call the "encode" or "encode_plus" methods. The "encode_plus" will not only provide you with the tokenized version of your text but also an attention mask, which will be used to ignore the part of the encoding which is purely padding.
text = "hello world"
tokenizer.encode_plus(
text,
truncation=True,
add_special_tokens=True,
max_length=10,
padding='max_length'
)
This will return a dictionary, where "input_ids" is the encoded sequence, and "attention_mask" is used to allow the transformer to ignore the padded tokens:
{
    'input_ids': [0, 42891, 232, 2, 1, 1, 1, 1, 1, 1],
    'attention_mask': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}
Among the parameters we use, we have:
- max_length: states the maximum length of the encoded sequence
- add_special_tokens: adds the special beginning-of-sequence and end-of-sequence tokens (<s> and </s> for RoBERTa) to the text
- truncation: slices the text if it does not fit the max_length
- padding: adds padding tokens up to the max_length
Loading a pre-trained model
To load a pre-trained model, Hugging Face provides multiple classes depending on your needs (are you working with TensorFlow or PyTorch? What type of task are you trying to achieve?).
In our case, we will work with AutoModel, which allows you to load a model architecture together with pre-trained weights directly. Note that if you work with TensorFlow, you can achieve the same by using the TFAutoModel class instead of the AutoModel class.
The AutoModel class will directly load the model architecture from RobertaModel and load the pre-trained weights associated with the "roberta-base" repo in Hugging Face.
As for the Tokenizer, we can directly load the model from the repo-id or from the path of a local repository, by using the from_pretrained method from AutoModel:
from transformers import AutoModel
robertaBase = AutoModel.from_pretrained("roberta-base")
Note that the encoder has not been trained on a particular task on its own, so we cannot simply use the model as it is. Instead, we will have to fine-tune it on our dataset.
We can double-check that robertaBase is an instance of pytorch.nn.Module, and can be integrated into a more complex PyTorch architecture:
import torch
import torch.nn as nn

isinstance(robertaBase, nn.Module)
>> True
You can also check its architecture by simply doing a print like you would do with a standard PyTorch module:
print(robertaBase)
>> RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): RobertaIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): RobertaOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): RobertaPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
Build a custom Neural Network
This last layer, the pooler, actually produces the vector representation of the whole text that we discussed in the first part of this article, and we just have to connect it to a final node used as a ranker to complete our NN architecture.
To do so, we will simply build our own custom module by encapsulating nn.Module, as we would do with a classic NN with PyTorch.
model_name = "roberta-base"
last_hidden_layer_size = 768
final_node_size = 1
class ToxicRankModel(nn.Module):
def __init__(self, model_name, last_hidden_layer_size):
super(ToxicRankModel, self).__init__()
self.robertaBase = AutoModel.from_pretrained(model_name)
self.dropout = nn.Dropout(p=0.1)
self.rank_head = nn.Linear(last_hidden_layer_size, 1)
def forward(self, ids, mask):
output = self.robertaBase(input_ids=ids,attention_mask=mask,
output_hidden_states=False)
output = self.dropout(output[1])
score= self.fc(output)
return score
#This line check if the GPU is available, else it goes with the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#After initiation, we send the model to the device
toxicRankModel = ToxicRankModel(model_name, last_hidden_layer_size)
toxicRankModel = toxicRankModel.to(device)
A few things to note here in the forward() method:
- We pass two main inputs to the robertaBase model, input_ids and attention_mask. They were both generated by the Tokenizer.
- The AutoModel has parameters (like output_hidden_states). Depending on the parameters you choose, you can make the model behave as an encoder or a decoder and customize it for different NLP tasks
- Did you notice that we pass output[1] to the dropout? This is because the base model returns two outputs (a quick shape check follows this list):
- First, the last hidden state, which contains a contextual representation (or contextual embedding) of each token that can be used for tasks like entity recognition
- Second, the output from the Pooler, which contains a vector representation of the whole text, is what we are looking for here.
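A quick shape check makes this explicit (a minimal sketch, assuming the tokenizer and robertaBase objects loaded earlier; indexing the model output with output[1] returns the same tensor as pooler_output):

encoded = tokenizer.encode_plus(
    "hello world",
    truncation=True,
    add_special_tokens=True,
    max_length=10,
    padding='max_length',
    return_tensors='pt'
)

with torch.no_grad():
    out = robertaBase(input_ids=encoded['input_ids'],
                      attention_mask=encoded['attention_mask'])

print(out.last_hidden_state.shape)  # torch.Size([1, 10, 768]): one vector per token
print(out.pooler_output.shape)      # torch.Size([1, 768]): one vector for the whole text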
Build a custom Dataset
With PyTorch, we also need to create our own Dataset class, which will be used to store the raw data, and a DataLoader, which will be used to feed the neural network in batches during training.
When building a custom dataset with Pytorch, you must implement two mandatory methods:
- __len__, which gives the size of the training data (important information for the data loader)
- __getitem__, which takes a raw input (from row "i") and preprocesses it so it can be handled by the neural network (as tensors)
If you recall the diagram from the previous part, we actually pass two inputs in parallel to the model before computing the loss: the less_toxic and the more_toxic messages.
The __getitem__ method will handle the tokenization of the messages and prepare the input for the transformer, converting the tokenized inputs into tensors.
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, train_df, tokenizer, max_length):
        # token list standard size
        self.length = max_length
        # Here the tokenizer will be an instance of the tokenizer
        # shown previously
        self.tokenizer = tokenizer
        # train_df is the training df shown at the beginning of the article
        self.more_toxic = train_df['more_toxic'].values
        self.less_toxic = train_df['less_toxic'].values

    def __len__(self):
        return len(self.more_toxic)

    def __getitem__(self, i):
        # get both messages at index i
        message_more_toxic = self.more_toxic[i]
        message_less_toxic = self.less_toxic[i]

        # tokenize the messages
        dic_more_toxic = self.tokenizer.encode_plus(
            message_more_toxic,
            truncation=True,
            add_special_tokens=True,
            max_length=self.length,
            padding='max_length'
        )
        dic_less_toxic = self.tokenizer.encode_plus(
            message_less_toxic,
            truncation=True,
            add_special_tokens=True,
            max_length=self.length,
            padding='max_length'
        )

        # extract tokens and masks
        tokens_more_toxic = dic_more_toxic['input_ids']
        mask_more_toxic = dic_more_toxic['attention_mask']
        tokens_less_toxic = dic_less_toxic['input_ids']
        mask_less_toxic = dic_less_toxic['attention_mask']

        # return a dictionary of tensors
        return {
            'tokens_more_toxic': torch.tensor(tokens_more_toxic, dtype=torch.long),
            'mask_more_toxic': torch.tensor(mask_more_toxic, dtype=torch.long),
            'tokens_less_toxic': torch.tensor(tokens_less_toxic, dtype=torch.long),
            'mask_less_toxic': torch.tensor(mask_less_toxic, dtype=torch.long),
        }
We can now generate the DataLoader which will be used for the batch training of the model.
def get_loader(df, tokenizer, max_length, batch_size):
    dataset = CustomDataset(
        df,
        tokenizer=tokenizer,
        max_length=max_length
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        drop_last=True)

max_length = 128
batch_size = 32
train_loader = get_loader(train_df, tokenizer, max_length, batch_size=batch_size)
- batch_size specifies the number of samples to be loaded for each forward pass/backpropagation
- shuffle=True means the dataset is shuffled between two epochs
- drop_last=True means that if the last batch does not have the full number of samples, it is dropped. This can be important as batch normalization does not work well with incomplete batches.
Training the model
We are almost there; it's time to prepare the training routine for one epoch.
Custom Loss
To start with, let's define a custom loss. PyTorch already provides the MarginRankingLoss; we are simply going to encapsulate it with y = 1 (as we will always pass more_toxic as x1 and less_toxic as x2).
from torch.nn import MarginRankingLoss

# Custom implementation of the MarginRankingLoss with y = 1
class CustomMarginRankingLoss(nn.Module):
    def __init__(self, margin=0):
        super(CustomMarginRankingLoss, self).__init__()
        self.margin = margin

    def forward(self, x1, x2):
        # with y = 1, the loss reduces to max(0, x2 - x1 + margin)
        loss = torch.relu(x2 - x1 + self.margin)
        return loss.mean()

def criterion(x1, x2):
    return CustomMarginRankingLoss()(x1, x2)
Optimizer
For this experiment, we will go with a classic AdamW, which is a strong default choice and fixes some of the problems of the original Adam implementation.
from torch.optim import AdamW

optimizer_lr = 1e-4
optimizer_weight_decay = 1e-6

optimizer = AdamW(toxicRankModel.parameters(),
                  lr=optimizer_lr,
                  weight_decay=optimizer_weight_decay)
Scheduler
The scheduler helps adapt the learning rate. At the start, we want a higher learning rate to converge faster to an optimum solution, and toward the end of the training, we want a much smaller learning rate to really fine-tune the weights.
from torch.optim import lr_scheduler

scheduler_T_max = 500
scheduler_eta_min = 1e-6

scheduler = lr_scheduler.CosineAnnealingLR(optimizer,
                                           T_max=scheduler_T_max,
                                           eta_min=scheduler_eta_min)
Training Routine
We are now ready to train our NLP model for toxic comment ranking.
The training of an epoch is quite straightforward with Pytorch:
- We iterate through our data loader, which shuffles and selects the pre-processed data from the dataset
- We retrieve the tokens and the masks from the data loader
- We calculate the rank of each message by making a forward pass to our model
- When both ranks are calculated, we can compute the MarginRankingLoss (to use for the backpropagation), as well as an accuracy score which tells the % of pairs that are correctly classified (for reference only)
- We update our system (backpropagation, optimizer, and scheduler)
- We iterate until all the data in the data loader has been used.
import gc
from tqdm import tqdm

def train_one_epoch(model, optimizer, scheduler, dataloader, device):
    # Set up train mode; this is important as some layers behave differently
    # during training and inference (like batch norm)
    model.train()

    # Initialization of the loss trackers
    dataset_size = 0
    running_loss = 0.0
    running_accuracy = 0.0

    progress_bar = tqdm(enumerate(dataloader), total=len(dataloader), desc="Training")
    for i, data in progress_bar:
        more_toxic_ids = data['tokens_more_toxic'].to(device, dtype=torch.long)
        more_toxic_mask = data['mask_more_toxic'].to(device, dtype=torch.long)
        less_toxic_ids = data['tokens_less_toxic'].to(device, dtype=torch.long)
        less_toxic_mask = data['mask_less_toxic'].to(device, dtype=torch.long)
        batch_size = more_toxic_ids.size(0)

        # Forward pass both inputs through the model
        x1 = model(more_toxic_ids, more_toxic_mask)
        x2 = model(less_toxic_ids, less_toxic_mask)

        # Compute margin ranking loss
        loss = criterion(x1, x2)
        accuracy_measure = (x1 > x2).float().mean().item()

        # Apply backpropagation, increment optimizer and scheduler
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        # Update cumulative loss and accuracy for monitoring
        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size
        epoch_loss = running_loss / dataset_size

        running_accuracy += (accuracy_measure * batch_size)
        epoch_accuracy = running_accuracy / dataset_size

        progress_bar.set_postfix({'loss': epoch_loss, 'accuracy': epoch_accuracy}, refresh=True)

    # Garbage collector
    gc.collect()
    return epoch_loss
I trained the model on a T4 GPU from Kaggle, which brought me to an honorable score of 70% of comment pairs correctly ranked. I could probably gain accuracy by playing more with the different parameters and training for more epochs, but it is good enough for the purpose of this article.
A final word on inference
The framework we put in place works well for training from a set of comments pre-formatted as in our training set.
But it won't work as-is in a "production" scenario, where you will receive a batch of messages for which you need to evaluate the toxicity score.

For inference, you will design another Dataset class and another DataLoader which will be slightly different from what we did before:
class CustomInferenceDataset(Dataset):
    def __init__(self, messages, tokenizer, max_length):
        # token list standard size
        self.length = max_length
        # Here the tokenizer will be an instance of the tokenizer
        # shown previously
        self.tokenizer = tokenizer
        # messages is a list of raw messages to score
        self.messages = messages

    def __len__(self):
        return len(self.messages)

    def __getitem__(self, i):
        # get a message at index i
        message = self.messages[i]

        # tokenize the message
        dic_messages = self.tokenizer.encode_plus(
            message,
            truncation=True,
            add_special_tokens=True,
            max_length=self.length,
            padding='max_length'
        )

        # extract tokens and masks
        tokens_message = dic_messages['input_ids']
        mask_message = dic_messages['attention_mask']

        # return a dictionary of tensors
        return {
            'tokens_message': torch.tensor(tokens_message, dtype=torch.long),
            'mask_message': torch.tensor(mask_message, dtype=torch.long),
        }

def get_loader_inference(messages, tokenizer, max_length, batch_size):
    dataset = CustomInferenceDataset(
        messages,
        tokenizer=tokenizer,
        max_length=max_length
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        drop_last=False)
What has changed:
- We are not loading pairs of messages anymore, but single messages
- The Loader does not shuffle the data (this is very important if you don't want bad surprises, with scores randomly reordered relative to your original list of messages)
- As there is no batch norm calculation and as we want all the data to be inferred, we set drop_last to False to get all batches, even incomplete ones
And finally, to produce the ranked score:
@torch.no_grad()
def get_scores(model, test_loader, device):
    model.eval()  # Set the model to evaluation mode
    ranks = []    # List to store the rank scores

    progress_bar = tqdm(enumerate(test_loader), total=len(test_loader), desc="Scoring")
    for i, data in progress_bar:
        tokens_message = data['tokens_message'].to(device, dtype=torch.long)
        mask_message = data['mask_message'].to(device, dtype=torch.long)

        # Forward pass to get the rank scores
        rank = model(tokens_message, mask_message)

        # Convert tensor to NumPy and add to the list
        ranks += list(rank.cpu().numpy().flatten())

    return ranks
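As a usage sketch tying the two pieces above together (the messages here are placeholders):

messages = [
    "Thanks for sharing, that is a fair point.",
    "Nobody cares about your stupid opinion."
]

test_loader = get_loader_inference(messages, tokenizer, max_length=128, batch_size=32)
scores = get_scores(toxicRankModel, test_loader, device)

# Higher score = more toxic: sort the messages by descending score
ranking = sorted(zip(messages, scores), key=lambda pair: pair[1], reverse=True)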
These are the top 5 ranked messages after inference. In order to stay politically correct, I had to apply a bit of censorship here…

Not very constructive… 🙂
Conclusion
In this article, we leveraged a Hugging Face pre-trained model and Pytorch to produce a model able to rank the level of toxicity of messages.
To do so, we took a "Roberta" transformer (a small one) and connected a final simple node at the end of its encoder with PyTorch. The rest was more classic and probably similar to other projects you might have done if you are already familiar with PyTorch.
This project is an initiation into the possibilities that NLP offers, and I wanted to start simple in order to introduce some basic concepts that are required to go further and play with more challenging tasks or much larger models.
I hope you enjoyed reading, if you want to play with the model you can download a Notebook from my GitHub.