In this article, I discuss how you can implement and fine-tune the recently released ModernBERT model. I then use the model on a classic text classification task and show you how to utilize synthetic data to improve the model’s performance.

Table of Contents
· Finding a dataset
· Implementing ModernBERT
· Detecting errors
· Synthesize data to improve model performance
· New results after augmentation
· My thoughts and future work
· Conclusion
Finding a dataset
First, we need to find a dataset to perform text classification on. To keep it simple, I found an open-source sentiment dataset on HuggingFace, where the task is to predict the sentiment of a given text as one of three classes:
- Negative (id 0)
- Neutral (id 1)
- Positive (id 2)
You can download the dataset from HuggingFace, but the easiest way to access it is with Pandas via a HuggingFace path:
import pandas as pd
splits = {'train': 'train_df.csv'} # we only use a subset of the dataset
df = pd.read_csv("hf://datasets/Sp1786/multiclass-sentiment-analysis-dataset/" + splits["train"])
df = df.sample(frac=0.05, random_state=42)
print(df.head())
I only keep 5% of the dataset to keep this tutorial as simple as possible and ensure most machines have the compute to fine-tune the model.
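It can also be worth glancing at the class balance of the sample before training. A one-line check using the dataset’s sentiment column:
# check how many samples each sentiment class has in the 5% sample
print(df["sentiment"].value_counts())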
Now, let’s prepare the dataset by splitting it into train and test sets and tokenizing it. I won’t go into technical details here, as I regard this as a prerequisite (though knowing how it works is not required to follow the rest of the article). To set up fine-tuning for ModernBERT, I am following this HuggingFace tutorial.
Let’s first install and import all required packages. You can install them with this requirements file:
# requirements.txt
pandas
fsspec
huggingface-hub
transformers # ensure version 4.48.0 or higher
datasets
torch
torchvision
torchaudio
scikit-learn
accelerate # ensure >=0.26.0
seaborn
requests
nlpaug
protobuf
sacremoses
nltk
sentencepiece
And import them with:
from datasets.arrow_dataset import Dataset
from datasets.dataset_dict import DatasetDict
from transformers import Trainer, TrainingArguments, pipeline, AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
import numpy as np
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
Then, let’s prepare the dataset by splitting it and encoding the labels (the sentiments of the texts) as integers:
# make a train test split of df and convert it to dataset dict
train_fraction = 0.5
random_mask = np.random.RandomState(42).rand(len(df)) < train_fraction
df["train_test_split"] = np.where(random_mask, "train", "test")
# get unique sentiments and create consistent mappings
unique_sentiments = df["sentiment"].unique()
label2id = {label: i for i, label in enumerate(sorted(unique_sentiments))} # Sort for consistency
id2label = {i: label for label, i in label2id.items()}
df['label'] = df['sentiment'].map(label2id)
num_labels = len(unique_sentiments)
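Assuming the sentiment column contains the three strings negative, neutral, and positive, the resulting mappings should match the class ids listed at the start of the article:
# inspect the mappings; expected output (under the assumption above):
# {'negative': 0, 'neutral': 1, 'positive': 2} and {0: 'negative', 1: 'neutral', 2: 'positive'}
print(label2id)
print(id2label)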
Now convert it to a dataset dict so it’s readable by the transformers trainer:
dataset = DatasetDict({
    'train': Dataset.from_pandas(df[df['train_test_split'] == 'train']),
    'test': Dataset.from_pandas(df[df['train_test_split'] == 'test'])
})
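Printing the dataset object is a quick way to confirm that both splits were created (the exact row counts depend on your sample):
# inspect the resulting DatasetDict and its train/test splits
print(dataset)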
Implementing ModernBERT
First, to implement ModernBERT, we need to load the tokenizer and tokenize our dataset:
# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Tokenize helper function
def tokenize(batch):
    return tokenizer(
        batch['text'],
        padding='max_length',
        truncation=True,
        max_length=256,
        return_tensors="pt"
    )
# Tokenize dataset
tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
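A quick sanity check confirms that the tokenizer has replaced the text column with token ids (this print is only for inspection):
# the tokenized dataset should now contain input_ids and attention_mask columns
print(tokenized_dataset["train"].column_names)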
Now, let’s load the model and run the training. We also use a simple F1 metric to track evaluation performance:
# Model id to load the model
model_id = "answerdotai/ModernBERT-base"
# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
model_id,
num_labels=num_labels,
label2id=label2id, # Use the original mapping directly
id2label=id2label,
)
# Define training args
training_args = TrainingArguments(
    output_dir="ModernBERT-domain-classifier",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=2,
    bf16=True,  # bfloat16 training
    optim="adamw_torch_fused",  # improved optimizer
    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    use_mps_device=True,  # assumes an Apple Silicon machine; remove this argument on other hardware
    metric_for_best_model="eval_loss",
)
# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
This will fine-tune the model. We can then evaluate it on the test dataset with the code below, which prints the accuracy, precision, recall, and F1 metrics and also shows us a confusion matrix:
# create a text-classification pipeline from the fine-tuned model (one way to build the classifier used below)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
def inference(classifier, text):
    # the pipeline returns the string label (e.g. "negative"), so map it back to its id
    return label2id[classifier(text)[0]["label"]]
# Evaluate on test set
predictions = []
labels = []
for row in tqdm(dataset["test"]):
    predictions.append(inference(classifier, row["text"]))
    labels.append(row["label"])
accuracy = sum([pred == label for pred, label in zip(predictions, labels)]) / len(labels)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision_score(labels, predictions, average='weighted'):.2f}")
print(f"Recall: {recall_score(labels, predictions, average='weighted'):.2f}")
print(f"F1: {f1_score(labels, predictions, average='weighted'):.2f}")
labels_string = [id2label[label] for label in labels]
predictions_string = [id2label[prediction] for prediction in predictions]
cm = confusion_matrix(labels_string, predictions_string, labels=["negative", "neutral", "positive"])
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["negative", "neutral", "positive"], yticklabels=["negative", "neutral", "positive"])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix Heatmap')
plt.show()
This gives an accuracy of 62%, and the confusion matrix you see below:

Great, you can now fine-tune ModernBERT on any text classification task. In the next section, I will show you how to recognize the errors the model is making and how to use synthetic data to improve its performance on the classes it struggles with.
Detecting errors
We will use the confusion matrix to recognize which classes the model struggles with. Since we only have three classes, we will find the worst-performing class and try to improve the model performance for that class.
Looking at the confusion matrix, we can determine that the model most often confuses the classes negative and neutral. We will, therefore, create synthetic data for these samples to hopefully see an improvement in model performance.
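If you prefer to pick out the biggest confusion programmatically instead of reading it off the heatmap, a small sketch reusing the cm array computed above could look like this:
# zero out the diagonal so only misclassifications remain, then find the largest cell
class_names = ["negative", "neutral", "positive"]
cm_offdiag = cm.copy()
np.fill_diagonal(cm_offdiag, 0)
true_idx, pred_idx = np.unravel_index(cm_offdiag.argmax(), cm_offdiag.shape)
print(f"Most frequent confusion: true={class_names[true_idx]}, predicted={class_names[pred_idx]}")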
Synthesize data to improve model performance
Now, we will synthesize some data for the worst-performing class to improve model performance. To keep the synthesis simple, we will use a library called nlpaug, which allows for easy augmentation of text samples. Make sure you have installed all the requirements listed earlier in this article so that the package runs successfully.
First, import the augmentor:
import nlpaug.augmenter.word as naw
Then, we will create some augmentation functions. I am using 3 different augmentation functions here:
- Add a contextual word (add a word that fits into the context)
- Substitute contextual word (substitute a word that fits into the context)
- Double translate (translate from English to French and back in order to create an augmented version of a text)
The contextual augmentations use BERT in order to determine fitting contextual words.
We can implement these augmentations in Python with:
# Create translation pipelines
translator_to_french = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
translator_to_english = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
def _translate_augment(text):
    """Translate text to French and back to English to augment."""
    # Translate to French
    translated_to_french = translator_to_french(text)[0]['translation_text']
    # Translate back to English
    translated_back_to_english = translator_to_english(translated_to_french)[0]['translation_text']
    return translated_back_to_english

def _add_contextual_word(text):
    aug = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased', action="insert")
    augmented_text = aug.augment(text)
    # newer nlpaug versions return a list; unwrap it so we always return a string
    return augmented_text[0] if isinstance(augmented_text, list) else augmented_text

def _substitute_contextual_word(text):
    aug = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased', action="substitute")
    augmented_text = aug.augment(text)
    return augmented_text[0] if isinstance(augmented_text, list) else augmented_text
def augment_text(text):
    """Augment text with a random chance of each augmentation."""
    # note: the branches are checked in turn, so the effective probabilities are
    # roughly 50% insert, 25% substitute, 12.5% translate, 12.5% unchanged
    if np.random.rand() < 0.5:
        return _add_contextual_word(text)
    elif np.random.rand() < 0.5:
        return _substitute_contextual_word(text)
    elif np.random.rand() < 0.5:
        return _translate_augment(text)
    else:
        return text
You can then use the augment_text function to create augmentations of a batch of texts.
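For example, a quick check on a single made-up sentence (the text here is purely illustrative, and the output varies between runs) might look like this:
# run the augmentation once on an example sentence
example = "The battery life on this phone is disappointing and support never replies."
print(augment_text(example))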
Now, we need to create the augmented texts. It’s critical that you only augment data in the train set and not in the test set, as you need to keep the test set the same in order to properly test the effect of the augmentation.
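The code below needs the training texts of the two confused classes (texts_negative and texts_neutral), as well as df_train and df_test, which we reuse later. One straightforward way to build them from the train_test_split column created earlier is:
# recover the train and test splits from the column created earlier
df_train = df[df["train_test_split"] == "train"].copy()
df_test = df[df["train_test_split"] == "test"].copy()
# collect only the training texts of the two classes we want to augment
texts_negative = df_train[df_train["sentiment"] == "negative"]["text"].tolist()
texts_neutral = df_train[df_train["sentiment"] == "neutral"]["text"].tolist()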
Now, create the augmented samples:
AUGMENT_PROBABILITY = 0.20 # only augment x% of the texts
augmented_texts_negative = [augment_text(text) for text in tqdm(texts_negative) if np.random.rand() < AUGMENT_PROBABILITY]
augmented_texts_neutral = [augment_text(text) for text in tqdm(texts_neutral) if np.random.rand() < AUGMENT_PROBABILITY]
An augmented sample might then look like:
Original sample:
"The Google Calendar integration is riddled with bugs and it has been this way for months. They don't fix it and they are pretty slow on their communication. TickTick is becoming very tempting."
Augmented sample:
"The integration of Google Calendar is riddled with bugs and it has been this way for months. They don't fix it and they are slow enough on their communication. TickTick becomes very tempting."
You can see how the translation has slightly modified the text without changing its meaning, which is essentially what we try to achieve when creating synthetic data.
After creating augmented samples, we add these rows to the dataset with their corresponding labels. Since we are only augmenting samples, we naturally know the label of the augmented sample (the label must be the same as the label of the sample we augmented).
# add these to the train set
new_rows = []
for text in augmented_texts_negative:
    new_rows.append({"text": text, "sentiment": "negative"})
for text in augmented_texts_neutral:
    new_rows.append({"text": text, "sentiment": "neutral"})
# note: the integer label column is re-created from the sentiment column when the pipeline is rerun
length_before = len(df_train)
df_train = pd.concat([df_train, pd.DataFrame(new_rows)], ignore_index=True)
length_after = len(df_train)
print(f"Added {length_after - length_before} rows to the train set")
df_train["train_test_split"] = "train"
df_test["train_test_split"] = "test"
df = pd.concat([df_train, df_test], ignore_index=True)
df.to_csv("./df_with_synth.csv", index=False)
We can now load this dataframe instead and train a new model to see how it performs when trained on the augmented data.
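Concretely, rerunning the pipeline on the augmented data just means swapping the initial download for the saved file and keeping the stored train_test_split column (so the augmented rows stay in the train set) instead of creating a new random split:
# load the augmented dataframe instead of downloading the dataset again
df = pd.read_csv("./df_with_synth.csv")
# reuse the saved train_test_split column rather than re-splitting randomly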
New results after augmentation
When I augment 20% of the negative and neutral samples, with a 50 percent chance of each augmentation, I get the following results:
Accuracy: 63%

This did not improve model performance, and it even made predictions on the negative class worse (as you can see from the top-left square in the confusion matrix above). Part of this is because the contextual augmentations are not working well. With this in mind, I drop the contextual augmentations and only apply the translation augmentation.
If I apply the translation augmentation to all the rows labeled negative or neutral, I get the following results:
Accuracy: 64%

This worked better, though the results look interesting. The negative predictions have gotten worse (now it more often confuses neutral and positive), but as you can see, the model has gotten a lot better at predicting neutral.
My thoughts and future work
A two-percentage-point increase is not a massive gain from adding this synthetic data. However, I still think it’s pretty interesting that such simple augmentations (for example, just translating a text to another language and back) can increase the performance of a model on certain classes. Perhaps you could achieve even better results with more advanced augmentations (for example, generating augmentations with a large language model) and by tuning the augmentation process a bit.
However, I think this shows how you can utilize synthetic data to improve your model. This is especially relevant if your model is performing poorly in specific areas (for example, my model struggled to separate negative and neutral sentiments). You can then add synthetic data for that specific area and potentially see an increase in performance.
Conclusion
In this article, I first found an open-source sentiment classification dataset online. We then implemented ModernBERT, a newly released member of the well-known BERT family of transformer models, and fine-tuned it to perform text classification. We interpreted the results by looking at the accuracy and the confusion matrix, and used the confusion matrix to determine which classes the model confused most often. With this information, we created synthetic data samples to help the model on the classes it struggled with, which led to a modest improvement in performance. Finally, I gave my thoughts on the improvement and listed some future work that could further enhance the model’s performance.