Hugging Face is a platform that offers tools and pre-trained models for various Natural Language Processing (NLP) and Natural Language Understanding (NLU) tasks. In our previous article, A Warm Embrace: Exploring Hugging Face, we dove into the basics of this platform and its open-source library that features implementations of many state-of-the-art transformer architectures. This post enhances the Hugging Face documentation by providing emerging data scientists with a single, connected view of various Hugging Face tools for a specific task. Specifically, this article explains how to piece together multiple Hugging Face capabilities to fine-tune an existing language model for named entity recognition ("NER").
Relevant Background
In this section, we briefly look at two foundational concepts essential for building our model. As a reminder, we covered Hugging Face basics in A Warm Embrace: Exploring Hugging Face.
- Named Entity Recognition
- Model Fine-tuning
In the sections below, it’s assumed you have some knowledge of model development and the associated concepts – however, if anything is unclear feel free to reach out!
Named Entity Recognition
Named Entity Recognition ("NER") is a common natural language processing task of identifying and categorizing relevant information, or entities, into one of many predefined (named) groups. NER models can be trained on a variety of entities. Some of the most common ones are:
- Names
- Organizations
- Dates
- Places & Locations
In the image below, I manually tagged a few named entities in a sample sentence. In the context of Machine Learning and NLP, NER refers to automating this categorization with models.
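As a plain-text stand-in for that image, the tagged sentence can be written as a list of (text, entity) pairs; the tagging below is done by hand purely for illustration.

# Hand-tagged example sentence (illustration only). "O" marks spans that are
# not part of any named entity.
tagged_sentence = [
    ("Zachary Raicik", "PERSON"),
    ("works for", "O"),
    ("Corvus", "ORGANIZATION"),
    ("and lives in", "O"),
    ("San Diego", "LOCATION"),
]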
NER models can enable a variety of tasks including, but not limited to, information retrieval, content summarization, content recommendation, and machine translation.
Model Fine-Tuning
At the highest level, fine-tuning a model means taking a model that has already been trained and adjusting its weights on a new dataset, typically after replacing some or all of the model’s layers and retraining them. Depending on your task and dataset, you might retrain just the last layer, some layers of the model, or the entire model.
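As a rough sketch of those options, here is a toy PyTorch example; the model below is a made-up stand-in for a pre-trained network, not BERT.

import torch.nn as nn

# Toy stand-in for a pre-trained model: a small "backbone" plus a task "head".
backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 4)
model = nn.Sequential(backbone, head)

# Retrain only the last layer: freeze everything, then unfreeze the head.
for param in model.parameters():
    param.requires_grad = False
for param in head.parameters():
    param.requires_grad = True

# Retraining "some layers" or "the entire model" just means unfreezing more.
for param in backbone[-1].parameters():
    param.requires_grad = True   # also tune the top backbone layer
# for param in model.parameters():
#     param.requires_grad = True  # full retraining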
One might wonder—why not just train your own model from scratch?
- Creating a new model often demands substantial computational resources and time. Utilizing a pre-trained model allows us to harness the capabilities of a model trained on extensive data without the hefty computational and time investments.
- Since pre-trained models are often trained on large and comprehensive datasets, fine-tuning a model allows you to achieve strong performance on a smaller dataset, thereby minimizing the risk of overfitting and improving generalization, among other benefits.
Developing our NER Model
The dataset we will be using is the Broad Twitter Corpus (cleared for commercial use under the Creative Commons Attribution 4.0 International license). The dataset is a large collection of tweets, each annotated with named entity tags. More importantly, according to the accompanying white paper, these annotations were done manually. Building an NER model on top of this dataset will not only let us annotate entities automatically in the future, but will also enable some of the downstream use cases described earlier.
A rough outline of the process we will follow is below.
- Set up our environment
- Load Broad Twitter dataset
- Load pre-trained BERT model
- Re-tokenize Broad Twitter tokens
- Fine tune pre-trained BERT model
Environment Setup
For simplicity, I did this work in a Google Colab notebook. Colab provides free GPU access: go to Runtime -> Change runtime type and select T4 GPU. As an aside, this code can be run in many different environments, not just Colab.
The first thing we need to do is install the required Hugging Face packages. I copied a brief description of each from the Hugging Face documentation, which is linked to each package name.
- Datasets: "Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks"
- Transformers: "Transformers provides APIs and tools to easily download and train state-of-the-art pre-trained models"
- Evaluate: "A library for easily evaluating machine learning models and datasets"
These packages can be installed using PIP.
%%capture
# %%capture is a cell magic that discards all stdout/stderr from this cell,
# which keeps the pip install output from cluttering the notebook.
!pip install datasets
!pip install transformers
!pip install evaluate
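The token-classification pipelines later in this post pass a device_id argument, so it is worth defining it up front. Here is a minimal setup that uses the Colab GPU when available; it assumes PyTorch, which ships with Colab and is installed alongside transformers.

import torch

# Hugging Face pipelines accept an integer device index: 0 selects the first
# GPU (e.g. Colab's T4) and -1 falls back to the CPU.
device_id = 0 if torch.cuda.is_available() else -1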
Loading Broad Twitter dataset
Hugging Face’s datasets library makes it extremely easy to load datasets with two lines of Python code.
from datasets import load_dataset
twitter = load_dataset("GateNLP/broad_twitter_corpus")
When you load a dataset, it loads all relevant data splits contained within the dataset. Printing the dataset will show you the available splits, the number of rows for each split, and the features each row has. The cell below shows the results of print(twitter).
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5342
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2002
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 2002
    })
})
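To get a feel for the data, you can index into a split and look at a single annotated tweet; the values shown in the comments are illustrative, not actual rows from the corpus.

# Each row pairs a list of tokens with a parallel list of integer NER tags
# (the tag-to-label mapping is explained in the next section).
example = twitter["train"][0]
print(example["tokens"])    # e.g. ['Gotta', 'see', 'Kasabian', 'in', 'Manchester'] (illustrative)
print(example["ner_tags"])  # e.g. [0, 0, 3, 0, 5] (illustrative)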
Loading BERT
When loading BERT for token classification, we need to specify the number of labels. This dataset is annotated with IOB2 labels, and according to the Hugging Face dataset page, the mapping between numeric and string tags is as follows.
labels = {
    0: 'O',
    1: 'B-PER',
    2: 'I-PER',
    3: 'B-ORG',
    4: 'I-ORG',
    5: 'B-LOC',
    6: 'I-LOC',
}
If you are unfamiliar with IOB2 labels, you might wonder: what is the difference between B-PER and I-PER? What about B-ORG versus I-ORG, or B-LOC versus I-LOC? The answer lies in tokenization. Some entities span multiple tokens, and these prefixes tell the model where an entity begins and how far it continues. The B prefix indicates that a token is the beginning of a chunk of that type, the I prefix indicates that a token is inside a chunk, and the O label means that a token is outside of any chunk.
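As a concrete illustration, here is how our sample sentence from earlier would be tagged under this scheme, with the tags assigned by hand.

# Hand-tagged illustration of IOB2: multi-token entities start with a B- tag
# and continue with I- tags; everything else is O.
tokens = ["Zachary", "Raicik", "works", "for", "Corvus", "and", "lives", "in", "San", "Diego"]
tags   = ["B-PER",   "I-PER",  "O",     "O",   "B-ORG",  "O",   "O",     "O",  "B-LOC", "I-LOC"]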
Now that we know we have 7 labels, we can load BERT using the following lines of code.
from transformers import AutoModelForTokenClassification
bert_ner = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(labels))
Lastly, we need to tell the model how to map between ID and label.
bert_ner.config.id2label = labels
bert_ner.config.label2id = {v: k for k, v in labels.items()}
We have BERT loaded, but do we even need to fine-tune it for our NER task? When Google introduced BERT, it was considered groundbreaking; can’t we use it as-is for NER without fine-tuning? We can test its ability to classify tokens without any fine-tuning below, using our example sentence from earlier, which, like a tweet, is short.
from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# device_id was set during environment setup (0 for GPU, -1 for CPU)
pipe = pipeline('token-classification', model='bert-base-cased', tokenizer=tokenizer, device=device_id)
pipe(["Zachary Raicik works for Corvus and lives in San Diego"], aggregation_strategy="average")
The results may surprise you. It turns out that without any fine-tuning, BERT isn’t very good at our task. In the output below, you can see that BERT labeled "Zachary Raicik works for Corvus and lives in San" as one entity and "Diego" as another.
[[{'entity_group': 'LABEL_0',
'score': 0.66933894,
'word': 'Zachary Raicik works for Corvus and lives in San',
'start': 0,
'end': 48},
{'entity_group': 'LABEL_1',
'score': 0.5502231,
'word': 'Diego',
'start': 49,
'end': 54}]]
Re-tokenize Broad Twitter tokens
When we downloaded the Broad Twitter dataset, it came with a set of predefined tokens. However, there is no guarantee that these tokens match the ones produced by the BERT tokenizer: BERT may split some of them into sub-word tokens. Consequently, we need a function that redistributes the provided tags across the sub-word tokens BERT generates.
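To see why, compare a dataset token with what BERT's WordPiece tokenizer produces for it; the sub-word pieces shown in the comments are illustrative and may differ from what you actually get.

# BERT's WordPiece tokenizer can split one dataset token into several
# sub-word pieces, so a single NER tag has to be copied across all of them.
print(tokenizer.tokenize("Raicik"))   # e.g. ['Ra', '##ici', '##k'] (illustrative)
print(tokenizer.tokenize("London"))   # common words usually stay whole: ['London']

The function below performs this redistribution for an entire row of the dataset.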
def tokenize_and_tag(row):
    tokens, ner_tags = row["tokens"], row["ner_tags"]

    # Split each dataset token into BERT sub-word tokens and repeat the
    # original tag for every sub-word piece it produces.
    sub_tokens, labels = [], []
    for token, tag in zip(tokens, ner_tags):
        token_sub_tokens = tokenizer.tokenize(token)
        sub_tokens.extend(token_sub_tokens)
        labels.extend([tag] * len(token_sub_tokens))

    # Add BERT's special tokens; -100 tells the loss function to ignore them.
    sub_tokens = ['[CLS]'] + sub_tokens + ['[SEP]']
    labels = [-100] + labels + [-100]

    # Pad everything out to the model's maximum input length (512 for BERT).
    padding_length = tokenizer.model_max_length - len(sub_tokens)
    sub_tokens = sub_tokens + ['[PAD]'] * padding_length
    labels = labels + [-100] * padding_length

    input_ids = tokenizer.convert_tokens_to_ids(sub_tokens)
    attention_mask = [1 if token != '[PAD]' else 0 for token in sub_tokens]
    token_type_ids = [0] * tokenizer.model_max_length

    row["bert_tokens"] = sub_tokens
    row["input_ids"] = input_ids
    row["attention_mask"] = attention_mask
    row["token_type_ids"] = token_type_ids
    row["labels"] = labels
    return row
Hugging Face expects the fields input_ids, attention_mask, token_type_ids, and labels for training. See the Transformers documentation on preprocessing for more information about preparing your data for the library.
We can use this function to re-tokenize our datasets.
train_twitter = twitter['train'].map(tokenize_and_tag)
test_twitter = twitter['test'].map(tokenize_and_tag)
validation_twitter = twitter['validation'].map(tokenize_and_tag)
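As an optional sanity check, you can confirm that each re-tokenized row now carries the fields listed above and that they all share the same padded length.

# Every field should be padded to the tokenizer's maximum input length
# (512 for bert-base-cased).
row = train_twitter[0]
for field in ["input_ids", "attention_mask", "token_type_ids", "labels"]:
    print(field, len(row[field]))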
Fine tune pre-trained BERT model
At this point, our datasets are ready and we can begin training.
import numpy as np #Used to mask predictions and labels
from transformers import TrainingArguments, Trainer #Training
import evaluate #Used to load required performance metrics during training
When fine-tuning a pre-trained model, you can choose to retrain as many layers as you want. Since this is just an illustrative exercise, we will tune only BERT's last encoder layer along with the token-classification head (which is newly initialized and therefore has to be trained).
# Freeze ALL model parameters
for param in bert_ner.parameters():
    param.requires_grad = False

# Unfreeze the last encoder layer and the token-classification head
# (the head is randomly initialized, so it must remain trainable)
for param in bert_ner.bert.encoder.layer[-1:].parameters():
    param.requires_grad = True
for param in bert_ner.classifier.parameters():
    param.requires_grad = True
We use TrainingArguments to specify things like the learning rate, number of epochs, etc.
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    output_dir="twitter-training",
    learning_rate=5e-05,
    num_train_epochs=5.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
)
Loss functions are designed for optimizing model weights during training, not for interpretation. For that reason, we will track additional metrics to understand how good our model is.
In NER tasks, class imbalance is relatively common. For that reason, metrics like accuracy may not be the most appropriate. In this case, the class imbalance isn’t awful (it’s shown in the dataset description). However, for completeness, we will use the weighted F1 score to evaluate our model during training. The weighted F1 score takes into account the number of true instances for each label when calculating the F1 score for that label. This means that each class contributes to the average proportionally to its size.
metric = evaluate.load("f1") # Use evaluate.combine if you want multiple metrics

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Ignore the special and padding positions labelled -100
    valid_mask = np.array(labels) != -100
    valid_labels = labels[valid_mask]
    valid_predictions = predictions[valid_mask]

    return metric.compute(predictions=valid_predictions, references=valid_labels, average='weighted')
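To make the weighting concrete, here is a toy example with made-up predictions: class 0 has four true instances and class 1 has two, so class 0's per-class F1 counts twice as much in the average.

# Toy illustration of the weighted F1 average (numbers are made up).
toy_f1 = evaluate.load("f1")
result = toy_f1.compute(
    predictions=[0, 0, 0, 1, 1, 0],
    references=[0, 0, 0, 0, 1, 1],
    average="weighted",
)
# Per-class F1 is 0.75 for class 0 and 0.50 for class 1, so the weighted
# average is (4 * 0.75 + 2 * 0.50) / 6 ≈ 0.67.
print(result)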
We have all the pieces required to train.
trainer = Trainer(
    model=bert_ner,
    args=training_args,
    train_dataset=train_twitter,
    eval_dataset=validation_twitter,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("twitter-training-mdl")
Once the training process completes, you should get a view that looks something like this.
In a realistic setting, you might want to increase the number of epochs and invest more time in choosing the right parameters for the task. Some examples include (an illustrative configuration follows this list):
- Learning rate optimization: The learning rate controls how much the model’s weights are updated. We use 5e-05 in our example, but it’s possible a different learning rate is more appropriate for this task.
- Weight decay: This is a regularization technique that discourages large weights. In general, it leads to a simpler model and helps to prevent overfitting.
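For illustration, both knobs are set directly on TrainingArguments; the values below are arbitrary placeholders rather than tuned choices.

# Illustrative only: the same arguments as above, with a lower learning rate
# and weight decay enabled. These values are placeholders, not tuned.
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    output_dir="twitter-training",
    learning_rate=2e-05,
    weight_decay=0.01,
    num_train_epochs=5.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
)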
Putting it All Together
Let’s revisit our sentence from earlier.
from transformers import pipeline

# Load the fine-tuned model we just saved and run it on the same sentence
pipe = pipeline('token-classification', model='twitter-training-mdl', tokenizer=tokenizer, device=device_id)
pipe(["Zachary Raicik works for Corvus and lives in San Diego"], aggregation_strategy="average")
The results indicate that our fine-tuned model is much better than BERT without any fine-tuning.
[[{'entity_group': 'PER',
'score': 0.8900693,
'word': 'Zachary Raicik',
'start': 0,
'end': 14},
{'entity_group': 'ORG',
'score': 0.534402,
'word': 'Corvus',
'start': 25,
'end': 31},
{'entity_group': 'LOC',
'score': 0.7905616,
'word': 'San Diego',
'start': 45,
'end': 54}]]
In this article we covered the basics of fine tuning an existing model for a specific task. In our case, we used BERT to build a Named Entity Recognition model. This process can be applied to any number of tasks using different datasets or models. Although this post provided a strong introduction to some of what Hugging Face can do, we barely scratched the surface. For example, we didn’t even pass our model to the robust evaluation library maintained by Hugging Face. In a future post, we will cover how to use some of these additional tools for both fine tuning a model for NER and additional use cases.