
When it comes to dealing with NLP problems, BERT often comes up as a machine learning model that we can count on in terms of performance. The fact that it has been pre-trained on more than 2,500M words, together with its bidirectional approach to learning information from a sequence of words, makes it a powerful model to use.
I wrote about how we can leverage BERT for text classification before, and in this article, we’re going to focus more on how to use BERT for named entity recognition (NER) tasks.
What is NER?
NER is an NLP task that identifies and extracts meaningful information (or, as we call them, entities) from a sentence or text. An entity can be a single word or even a group of words that refer to the same category.
As an example, let’s say we have the following sentence and we want to extract information about a person’s name from it.

The first step of a NER task is to detect an entity. This can be a word or a group of words that refer to the same category. As an example:
- ‘Bond‘ ➡️ an entity that consists of a single word
- ‘James Bond‘ ➡️ an entity that consists of two words, but they are referring to the same category.
To make sure that our BERT model knows that an entity can be a single word or a group of words, we need to provide information about the beginning and the end of each entity in our training data via so-called Inside-Outside-Beginning (IOB) tagging. We will see more of this in our dataset later in this article.
After detecting an entity, the next step in a NER task is to categorize the detected entity. The categories of an entity can be anything depending on our use case. Below is an example of categories of entities:
- Person: Bond, James Bond, Sam, Anna, Frank, Leonardo DiCaprio
- Location: New York, Vienna, Munich, London
- Organization: Google, Apple, Stanford University, Deutsche Bank
- Landmark: Central Park, Brandenburger Tor, Times Square
These entities are basically the labels of our data during the training process of our BERT model, which we will look at in detail in the following sections.
BERT for NER
As previously mentioned, BERT is a Transformer-based machine learning model that comes in pretty handy when we want to solve NLP-related tasks.
If you’re not yet familiar with BERT, I recommend you to read my previous article about text classification with BERT before reading this article. There you’ll find information about what BERT actually is, what kind of input data the model expects, and the output that you’ll get from the model.
What differentiates BERT for text classification from BERT for NER is how we use the output of the model. For a text classification problem, we only use the embedding vector output from the special [CLS] token, as you can see in the visualization below:

Meanwhile, if we want to use BERT for NER tasks, we need to use the embedding vector output from all of the tokens, as you can see in the visualization below:

By using the embedding vector output from all of the tokens, we can classify text at the token level. This is exactly what we want, since we want our BERT model to predict the entity of each token. Now, without further ado, let’s go to the implementation.
About the Dataset
The dataset that we’re going to use in this article is the CoNLL-2003 dataset, which is a dataset built specifically for NER tasks. You can download the data on Kaggle via the link below.
This dataset is distributed under the Open Database License v1.0, so we are free to share and use it for our own purposes. Now let’s take a look at what the dataset looks like.
As we can see above, we have a dataframe that consists of the text and the label. The label corresponds to the entity category of each word in a text.
In total, there are 9 entity categories, which are:
- `geo` for geographical entity
- `org` for organization entity
- `per` for person entity
- `gpe` for geopolitical entity
- `tim` for time indicator entity
- `art` for artifact entity
- `eve` for event entity
- `nat` for natural phenomenon entity
- `O` is assigned if a word doesn’t belong to any entity
Let’s take a look at the unique labels available in our dataset:
As you might notice, each entity category is preceded by the letter `I` or `B`. This corresponds to the IOB tagging mentioned earlier: `I` means Inside (the continuation of an entity) and `B` means Beginning. Let’s take a look at the following sentence to understand the concept of IOB tagging a little bit more.

- ‘Kevin’ has a `B-per` label since it’s the beginning of a person entity
- ‘Durant’ has an `I-per` label because it’s the continuation of a person entity
- ‘Brooklyn’ has a `B-org` label since it’s the beginning of an organization entity
- ‘Nets’ has an `I-org` label since it’s the continuation of an organization entity
- Other words are assigned the `O` label as they don’t belong to any entity (a small sketch of this tagging format follows below)
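To make the format a bit more concrete, here is a minimal sketch of how such a sentence could be represented as (word, tag) pairs; the non-entity filler words are just assumptions for illustration:

```python
# Hypothetical (word, IOB tag) pairs for a sentence mentioning
# "Kevin Durant" and the "Brooklyn Nets"; the non-entity words are assumed.
tagged_sentence = [
    ('Kevin', 'B-per'), ('Durant', 'I-per'),
    ('plays', 'O'), ('for', 'O'), ('the', 'O'),
    ('Brooklyn', 'B-org'), ('Nets', 'I-org'),
]
```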
Data Preprocessing
Before we are able to use a BERT model to classify the entity of a token, we of course need to do data preprocessing first, which includes two parts: tokenization and adjusting the labels to match the tokenization. Let’s start with tokenization.
Tokenization
Tokenization can be easily implemented with BERT, as we can use the `BertTokenizerFast` class of a pretrained BERT base model from HuggingFace.
To give you an example of how the BERT tokenizer works, let’s take a look at one of the texts from our dataset:
Tokenizing the text above with `BertTokenizerFast` is very straightforward:
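Below is a minimal sketch of that tokenization step. It assumes the 'bert-base-cased' checkpoint and that the dataframe column holding the texts is called 'text':

```python
from transformers import BertTokenizerFast

# Load the fast tokenizer of a pretrained BERT base model
# ('bert-base-cased' is an assumption; use the checkpoint you prefer).
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# One example sentence from our dataframe (the column name is an assumption).
text = df['text'].iloc[0]

text_tokenized = tokenizer(text, padding='max_length', max_length=512,
                           truncation=True, return_tensors='pt')
print(text_tokenized)
```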
We provide several arguments when calling the tokenizer above:
- `padding`: pads the sequence with the special [PAD] token up to the maximum length that we specify. The maximum length of a sequence for a BERT model is 512.
- `max_length`: the maximum length of a sequence.
- `truncation`: a Boolean value. If we set it to True, tokens that exceed the maximum length will not be used.
- `return_tensors`: the tensor type that is returned, which depends on the machine learning framework that we use. Since we’re using PyTorch, we use `pt`.
And below is the output of the tokenization process:
As you can see, the output that we get from the tokenization process is a dictionary that contains three variables:
- `input_ids`: the id representation of the tokens in a sequence. In BERT, the id 101 is reserved for the special [CLS] token, the id 102 is reserved for the special [SEP] token, and the id 0 is reserved for the [PAD] token.
- `token_type_ids`: identifies which sequence a token belongs to. Since we only have one sequence per text, all the values of `token_type_ids` will be 0.
- `attention_mask`: identifies whether a token is a real token or padding. The value is 1 if it’s a real token and 0 if it’s a [PAD] token.
From the `input_ids` above, we can decode the ids back into the original sequence with the `decode` method as follows:
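Here is a small sketch of that decoding step, reusing the `text_tokenized` output from the snippet above:

```python
# Turn the ids of the first (and only) sequence back into a readable string.
print(tokenizer.decode(text_tokenized['input_ids'][0]))
```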
We get our original sequence back after calling the `decode` method, with the addition of special tokens from BERT, such as the [CLS] token at the beginning of the sequence, the [SEP] token at the end of the sequence, and a bunch of [PAD] tokens to fill the sequence up to the required maximum length of 512.
After this tokenization process, we need to proceed to the next step, which is adjusting the label of each token.
Adjusting Label After Tokenization
This is a very important step that we need to do after the tokenization process, because the length of the tokenized sequence no longer matches the length of the original labels.
The BERT tokenizer uses the so-called WordPiece tokenizer under the hood, which is a sub-word tokenizer. This means that the BERT tokenizer will likely split a word into one or more meaningful sub-words.
As an example, let’s say we have the following sequence:

The sequence above has 13 tokens in total and thus also has 13 labels. However, after BERT tokenization, we get the following result:
There are two problems that we need to address after the tokenization process:
- the addition of special tokens from BERT, such as [CLS], [SEP], and [PAD]
- the fact that some tokens are split into sub-words.
As a sub-word tokenizer, the WordPiece tokenizer splits uncommon words into sub-words, such as ‘Geir’ and ‘Haarde’ in the example above. This sub-word tokenization helps the BERT model to learn the semantic meaning of related words.
The consequence of this WordPiece tokenization and the addition of special tokens from BERT is that the sequence length after tokenization no longer matches the length of the initial labels.
From the example above, there are now 512 tokens in the sequence after tokenization, while the length of the labels is still the same as before. Also, the first token in the sequence is no longer the word ‘Prime’, but the newly added [CLS] token, so we need to shift our labels as well.
To solve this problem, we need to adjust the labels such that they have the same length as the sequence after tokenization. To do this, we can utilize the `word_ids` method from the tokenization result as follows:
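Here is a small sketch of that lookup, again reusing the tokenization result from before:

```python
# word_ids() maps every token position back to the index of the original word;
# special tokens such as [CLS], [SEP], and [PAD] are mapped to None.
word_ids = text_tokenized.word_ids()
print(word_ids)
# e.g. [None, 0, 1, 1, 2, 3, ..., None, None]
```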
As you can see from the code snippet above, the sub-words that come from the same split token share the same `word_ids` value, while special tokens from BERT such as [CLS], [SEP], and [PAD] don’t have a specific `word_ids` value (they are mapped to None).
These `word_ids` will be very useful for adjusting the length of the labels by applying either of these two methods:
- We only provide a label to the first sub-word of each split token. The continuation sub-words then simply get ‘-100’ as their label. All tokens that don’t have `word_ids` will also be labeled with ‘-100’.
- We provide the same label to all of the sub-words that belong to the same token. All tokens that don’t have `word_ids` will still be labeled with ‘-100’.
The function in the code snippet below does exactly the steps defined above. If you want to apply the first method, set `label_all_tokens` to False; if you want to apply the second method, set `label_all_tokens` to True, as you can see in the following code snippet:
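Below is a minimal sketch of such an alignment function. It reuses the `tokenizer` defined earlier and assumes a `labels_to_ids` dictionary that maps each label string (e.g. ‘B-per’) to an integer id:

```python
def align_label(text, labels, label_all_tokens=False):
    """Align word-level labels with the sub-word tokens produced by BERT."""
    tokenized_inputs = tokenizer(text, padding='max_length', max_length=512,
                                 truncation=True)
    word_ids = tokenized_inputs.word_ids()

    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens ([CLS], [SEP], [PAD]) are ignored by the loss.
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
            # The first sub-word of each word keeps the word's label.
            label_ids.append(labels_to_ids[labels[word_idx]])
        else:
            # Continuation sub-words: keep the label or mask them with -100.
            label_ids.append(labels_to_ids[labels[word_idx]]
                             if label_all_tokens else -100)
        previous_word_idx = word_idx

    return label_ids
```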
In the rest of this article, we’re going to implement the first method, in which we only provide a label to the first sub-word of each token, so we set `label_all_tokens` to False.
Dataset Class
Before we train our BERT model for the NER task, we need to create a dataset class to generate and fetch data in batches.
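Below is a minimal sketch of such a dataset class. It assumes the dataframe has a ‘text’ column with whitespace-separated words and a ‘labels’ column with the matching IOB tags, and it reuses the `tokenizer` and `align_label` defined earlier:

```python
import torch
from torch.utils.data import Dataset

class DataSequence(Dataset):
    def __init__(self, df):
        texts = df['text'].values.tolist()
        labels = [lb.split() for lb in df['labels'].values.tolist()]

        # Tokenize every text and align its labels once, up front.
        self.texts = [tokenizer(t, padding='max_length', max_length=512,
                                truncation=True, return_tensors='pt')
                      for t in texts]
        self.labels = [align_label(t, lb) for t, lb in zip(texts, labels)]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.texts[idx], torch.LongTensor(self.labels[idx])
```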
In the code snippet above, we call the `BertTokenizerFast` class via the `tokenizer` variable in the `__init__` function to tokenize our input texts, and the `align_label` function to adjust our labels after the tokenization process.
Next, let’s split our data randomly into training, validation, and test sets. Mind you that the dataset contains 47,959 texts in total; hence, for demonstration purposes and to speed up the training process, I’m going to take only 1,000 of them. You can, of course, take all of the data for model training.
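One possible way to do that split, assuming the dataframe is called `df` and using 80/10/10 proportions (the sample size and proportions are just for demonstration):

```python
# Sample 1,000 rows for demonstration, then split 80/10/10 into
# train / validation / test.
df_sample = df.sample(1000, random_state=42).reset_index(drop=True)
train_end, val_end = int(0.8 * len(df_sample)), int(0.9 * len(df_sample))

df_train = df_sample.iloc[:train_end]
df_val = df_sample.iloc[train_end:val_end]
df_test = df_sample.iloc[val_end:]
```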
Model Building
In this article, we’re going to use a pretrained BERT base model from HuggingFace. Since we’re going to classify text at the token level, we need to use the `BertForTokenClassification` class.
The `BertForTokenClassification` class wraps the BERT model and adds a linear layer on top of it that acts as a token-level classifier.
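Below is a minimal sketch of such a wrapper; the class name and the ‘bert-base-cased’ checkpoint are assumptions:

```python
import torch
from transformers import BertForTokenClassification

class BertModel(torch.nn.Module):
    def __init__(self, num_labels=17):
        super().__init__()
        # Token-level classification head on top of BERT, one output per label.
        self.bert = BertForTokenClassification.from_pretrained(
            'bert-base-cased', num_labels=num_labels)

    def forward(self, input_id, mask, label):
        # When labels are provided, the output also contains the loss.
        return self.bert(input_ids=input_id, attention_mask=mask,
                         labels=label, return_dict=False)

model = BertModel()
```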
In the code snippet above, we first instantiate the model and set the output dimension of the token classifier equal to the number of unique labels in our dataset, which in our case is 17.
Next, we will define a function for the training loop.
Training Loop
The training loop for our BERT model is the standard PyTorch training loop with a few additions, as you can see below:
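The sketch below shows what such a loop could look like with the setup described in this article (5 epochs, SGD, the `DataSequence` class, and the ‘-100’ masking); the hyperparameter values are assumptions:

```python
import torch
from torch.optim import SGD
from torch.utils.data import DataLoader

def train_loop(model, df_train, df_val, learning_rate=0.01, epochs=5, batch_size=2):
    train_loader = DataLoader(DataSequence(df_train), batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(DataSequence(df_val), batch_size=batch_size)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    optimizer = SGD(model.parameters(), lr=learning_rate)

    for epoch in range(epochs):
        model.train()
        train_loss, train_acc = 0.0, 0.0
        for batch_data, batch_labels in train_loader:
            labels = batch_labels.to(device)
            mask = batch_data['attention_mask'].squeeze(1).to(device)
            input_ids = batch_data['input_ids'].squeeze(1).to(device)

            optimizer.zero_grad()
            loss, logits = model(input_ids, mask, labels)

            # Ignore tokens labeled -100 (special tokens and sub-word
            # continuations) when measuring accuracy; the loss already does.
            for i in range(logits.shape[0]):
                active = labels[i] != -100
                preds = logits[i][active].argmax(dim=1)
                train_acc += (preds == labels[i][active]).float().mean().item()
            train_loss += loss.item()

            loss.backward()
            optimizer.step()

        # Validation pass with the same -100 masking.
        model.eval()
        val_loss, val_acc = 0.0, 0.0
        with torch.no_grad():
            for batch_data, batch_labels in val_loader:
                labels = batch_labels.to(device)
                mask = batch_data['attention_mask'].squeeze(1).to(device)
                input_ids = batch_data['input_ids'].squeeze(1).to(device)

                loss, logits = model(input_ids, mask, labels)
                for i in range(logits.shape[0]):
                    active = labels[i] != -100
                    preds = logits[i][active].argmax(dim=1)
                    val_acc += (preds == labels[i][active]).float().mean().item()
                val_loss += loss.item()

        print(f'Epoch {epoch + 1} | '
              f'train loss: {train_loss / len(df_train):.3f} | '
              f'train acc: {train_acc / len(df_train):.3f} | '
              f'val loss: {val_loss / len(df_val):.3f} | '
              f'val acc: {val_acc / len(df_val):.3f}')

train_loop(model, df_train, df_val)
```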
In the training loop above, I train the model for only 5 epochs and use SGD as the optimizer. The loss computation in each batch is already taken care of by the `BertForTokenClassification` class.
In each epoch of the training loop, there is also an important step that we need to do: after the model prediction, we need to ignore all of the tokens that have ‘-100’ as the label when computing the metrics, as you can see in the training loop above.
Below is an example of the training output after we train our BERT model for 5 epochs:

Of course, the output that you’ll see may vary when you train your own BERT model as there is stochasticity in the training process.
There are a lot of things that you can do to improve the performance of the model. If you notice, we have a data imbalance problem, as there are a lot of tokens with the ‘O’ label. We can improve our model, for example, by applying class weights during the training process.
Also, you can try different optimizers such as the Adam optimizer with weight decay regularization.
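For example, here is a small sketch of swapping in Adam with decoupled weight decay; the learning rate and weight decay values are only illustrative:

```python
from torch.optim import AdamW

# Adam with decoupled weight decay regularization as an alternative to SGD.
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
```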
Evaluate Model on Test Data
Now that we have trained our model, we can evaluate its performance on unseen test data with the following code snippet.
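Below is a minimal sketch of such an evaluation, reusing the `DataSequence` class and the same ‘-100’ masking as in the training loop:

```python
import torch
from torch.utils.data import DataLoader

def evaluate(model, df_test, batch_size=2):
    test_loader = DataLoader(DataSequence(df_test), batch_size=batch_size)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device).eval()

    total_acc, n_samples = 0.0, 0
    with torch.no_grad():
        for batch_data, batch_labels in test_loader:
            labels = batch_labels.to(device)
            mask = batch_data['attention_mask'].squeeze(1).to(device)
            input_ids = batch_data['input_ids'].squeeze(1).to(device)

            _, logits = model(input_ids, mask, labels)
            for i in range(logits.shape[0]):
                # Skip special tokens and sub-word continuations (-100 labels).
                active = labels[i] != -100
                preds = logits[i][active].argmax(dim=1)
                total_acc += (preds == labels[i][active]).float().mean().item()
                n_samples += 1

    print(f'Test accuracy: {total_acc / n_samples:.4f}')

evaluate(model, df_test)
```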
In my case, the trained model achieved an average accuracy of 92.22% on the test set. You can, of course, change the metric to F1 score, precision, or recall.
Alternatively, we can use the trained model to predict the entity of each word of a text or a sentence with the following code:
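The sketch below shows one possible way to do that. It assumes an `ids_to_labels` dictionary (the inverse of `labels_to_ids`) and keeps only the prediction of the first sub-word of each word, matching the labeling strategy we used during training:

```python
import torch

def predict_sentence(model, sentence):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device).eval()

    inputs = tokenizer(sentence, padding='max_length', max_length=512,
                       truncation=True, return_tensors='pt')
    input_ids = inputs['input_ids'].to(device)
    mask = inputs['attention_mask'].to(device)

    with torch.no_grad():
        # Call the underlying BertForTokenClassification without labels,
        # so only the logits are returned.
        logits = model.bert(input_ids=input_ids, attention_mask=mask,
                            return_dict=False)[0]

    predictions = logits[0].argmax(dim=1).tolist()
    word_ids = inputs.word_ids()

    # Keep one prediction per original word: the first sub-word of each word.
    # This simple word lookup assumes no punctuation attached to the words.
    results = []
    previous_word_idx = None
    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx != previous_word_idx:
            results.append((sentence.split()[word_idx],
                            ids_to_labels[predictions[token_idx]]))
        previous_word_idx = word_idx
    return results

# An arbitrary example sentence.
print(predict_sentence(model, 'Kevin Durant plays for the Brooklyn Nets'))
```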
If everything works as expected, our model will be able to predict the entity of each word of an unseen sentence reasonably well, as you can see above.
Conclusion
In this article, we implemented BERT for a Named Entity Recognition (NER) task. This means that we trained a BERT model to predict the IOB tags of a custom text or sentence at the token level.
I hope that this article helps you get started with BERT for NER tasks. You can find all of the code implemented in this article in this notebook.