
HuggingFace serves as a home to many popular open-source NLP models. Many of these models are effective as-is, but often require some form of training or fine-tuning to improve performance for your specific use-case. As the LLM explosion continues, this article takes a step back to revisit some of the core building blocks HuggingFace provides that simplify training NLP models.
Traditionally, NLP models can be trained with vanilla PyTorch, TensorFlow/Keras, and other popular ML frameworks. While you can go this route, it requires a deeper understanding of the framework you are utilizing, as well as more code to write the training loop yourself. HuggingFace’s Trainer class provides a simpler way to train and fine-tune the NLP Transformers models that you want to utilize.
Trainer is a class specifically optimized for Transformers models, and it provides tight integration with other HuggingFace libraries such as Datasets and Evaluate. At a more advanced level, Trainer also supports distributed training libraries and can be easily integrated with infrastructure platforms such as Amazon SageMaker.
In this example we’ll take a look at using the Trainer class locally to fine-tune the popular BERT model on the IMDb Large Movie Review Dataset for a Text Classification use-case.
NOTE: This article assumes basic knowledge of Python and the domain of NLP. We will not get into any specific Machine Learning theory around model building or selection; this article is dedicated to understanding how we can fine-tune the existing pre-trained models available in the HuggingFace Model Hub.
Table of Contents
- Setup
- Fine-Tuning BERT
- Additional Resources & Conclusion
1. Setup
For this example, we’ll be working in SageMaker Studio and utilize a conda_python3 kernel on a ml.g4dn.12xlarge instance. Note that you can use a smaller instance type, but this might impact the training speed depending on the number of CPUs/workers that are available.
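Before starting, it can be helpful to confirm that the kernel can actually see the instance’s GPUs. A minimal sanity check with PyTorch (assuming torch is installed in the conda_python3 kernel) looks like this:
import torch
# confirm the kernel can see the instance's GPUs
print(torch.cuda.is_available())  # True if a GPU is visible
print(torch.cuda.device_count())  # ml.g4dn.12xlarge exposes 4 GPUs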
To download the dataset we will utilize the HuggingFace Datasets library.
import datasets
from datasets import load_dataset
We specify a training and an evaluation dataset which we will utilize for the training loop.
train_dataset = load_dataset("imdb", split="train")
test_dataset = load_dataset("imdb", split="test")
test_subset = test_dataset.select(range(100)) # we will take a subset of the data for evaluation
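If you want to sanity check the raw data, each IMDb record is a dictionary with a text field and a label field (0 for negative, 1 for positive). For example, you can inspect the first training example:
# inspect a single raw example: a dict with "text" and "label" keys
sample = train_dataset[0]
print(sample["label"])
print(sample["text"][:200])  # first 200 characters of the review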
For any text data, you must specify a tokenizer to preprocess the data into a format that your model can understand. In this case we specify the HuggingFace Hub Model ID for the BERT model we are utilizing.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# tokenize text data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
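To get a feel for what the tokenizer produces, you can run it on a single sentence. For BERT the output is a dictionary containing input_ids, token_type_ids, and attention_mask (the sentence below is just an illustrative example):
# run the tokenizer on one sample sentence and inspect the output
sample_encoding = tokenizer("This movie was fantastic")
print(sample_encoding.keys())        # input_ids, token_type_ids, attention_mask
print(sample_encoding["input_ids"])  # token IDs, including [CLS] and [SEP]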
We then utilize the built-in map function to process our training and evaluation datasets.
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_subset.map(tokenize_function, batched=True)
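After the map call, the tokenized datasets keep the original columns and gain the new tokenizer columns that the model consumes. A quick way to confirm this is to check the column names:
# verify the columns added by the tokenizer
print(tokenized_train.column_names)
# expected: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']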

2. Fine-Tuning BERT
Now that our data has been prepared, we pull down our BERT model with the same Model ID that we specified earlier. Notice that we also specify the number of labels for our Text Classification use-case. Here we set num_labels to two, since the labels are 0 and 1, representing negative and positive reviews.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased",
                                                            num_labels=2)
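By default the model config refers to the two classes generically as LABEL_0 and LABEL_1. Optionally (this is not required for training), you can attach human-readable names by passing id2label and label2id when loading the model, which makes downstream predictions easier to read:
# optional: give the two classes readable names in the model config
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1})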
Next, for our training loop, we specify a TrainingArguments object. In this object we can specify different parameters for training, such as the number of epochs, the distributed training strategy, and more.
In this case we just specify the output directory for the trained model artifacts, the number of epochs, and evaluation of the model after each epoch. For simplicity’s sake we limit the epoch count to just one in this example.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",
                                  num_train_epochs=1)
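If you want more control over the run, TrainingArguments exposes many more knobs. As an illustrative (not tuned) example, you can also set the per-device batch sizes, learning rate, and logging/saving behavior:
# illustrative only: commonly tuned TrainingArguments (values are not tuned)
training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  learning_rate=5e-5,
                                  logging_steps=50,
                                  save_strategy="epoch")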
For evaluation we use the accuracy metric from the Evaluate library.
import numpy as np
import evaluate
metric = evaluate.load("accuracy")
# eval function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
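As a quick sanity check, you can call compute_metrics on a small batch of made-up logits and labels to confirm the wiring before training (the values below are dummy data, not model output):
# sanity check compute_metrics with dummy logits and labels
dummy_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
dummy_labels = np.array([1, 0, 0])
print(compute_metrics((dummy_logits, dummy_labels)))  # two of three correct, accuracy ~0.67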
We then pass the TrainingArguments, tokenized datasets, and evaluation metric function to the Trainer object. We can kick off a training run with the train method, which takes about 10–15 minutes on this hardware.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,  # using the test subset as the eval set
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)
trainer.train()

For inference, we can directly use the fine-tuned trainer object and predict on the tokenized test dataset we used for evaluation:
trainer.predict(tokenized_test)
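predict returns a named tuple containing the raw logits, the label ids, and the computed metrics. A common pattern is to argmax the logits to get class predictions, for example:
# predict returns predictions (logits), label_ids, and metrics
pred_output = trainer.predict(tokenized_test)
predicted_labels = np.argmax(pred_output.predictions, axis=-1)
print(pred_output.metrics)    # includes the accuracy from compute_metrics
print(predicted_labels[:10])  # first ten predicted class indices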

In a more realistic use-case, you can take the trainer object and save the model artifacts to a local directory.
trainer.save_model("./custom_model")
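Since we passed the tokenizer to Trainer, save_model typically writes the tokenizer files alongside the model weights, so both can later be reloaded from the same directory:
# the saved directory can also be used to reload the tokenizer
loaded_tokenizer = AutoTokenizer.from_pretrained("./custom_model")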

You can then load these model artifacts, specifying the type of model we trained, and run inference on a single data point.
loaded_model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="custom_model/")
# sample inference
encoding = tokenizer("I am super delighted", return_tensors="pt")
res = loaded_model(**encoding)
predicted_label_classes = res.logits.argmax(-1)
predicted_label_classes
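The predicted class index can then be mapped back to a readable sentiment label. Assuming the IMDb convention of 0 for negative and 1 for positive, a simple lookup works:
# map the predicted class index back to a readable label (IMDb convention)
label_names = {0: "negative", 1: "positive"}
print(label_names[predicted_label_classes.item()])  # expected: "positive"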

In a realistic use-case you can take the trained model artifacts and deploy them on a serving stack such as Amazon SageMaker.
3. Additional Resources & Conclusion
GitHub – RamVegiraju/huggingface-finetune-trainer: Utilize HuggingFace Trainer to Fine-Tune a…
You can find the code for the entire example at the link above. I hope this article was a useful introduction to working with the HuggingFace Trainer class to fine-tune Transformers models. To scale up your training workloads, refer here to see how you can fine-tune a BERT model utilizing SageMaker Training Jobs. In coming articles we’ll explore how to expand upon this Trainer class to fine-tune LLMs with techniques such as PEFT.
As always thank you for reading and feel free to leave any feedback.
If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter.