Photo by Alexandra on Unsplash

Custom Named Entity Recognition with BERT

How to use PyTorch and Hugging Face to classify named entities in a text

Marcello Politi
Towards Data Science
6 min read · Jul 7, 2022

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Image By Author

Background:

In this article we will use some concepts that I have already introduced in my previous article.

BERT is a language model based heavily on the Transformer encoder. If you are unfamiliar with Transformers I recommend reading this amazing article.

BERT in a nutshell:

  • It takes as input the embedding tokens of one or more sentences.
  • The first token is always a special token called [CLS].
  • The sentences are separated by another special token called [SEP].
  • For each token, BERT outputs an embedding called a hidden state.
  • BERT was trained on the masked language model and next sentence prediction tasks.

In the masked language model (MLM), an input word (or token) is masked and BERT has to try to figure out what the masked word is. For the next sentence prediction (NSP) task, two sentences are given as input to BERT, and it has to figure out whether the second sentence follows semantically from the first one.

If you think about it, solving the named entity recognition task means classifying each token with a label (person, location, ...). So the most intuitive way to approach this task is to take the corresponding hidden state of each token and feed it through a classification layer. The final classifiers share the weights, so in practice we have a single classifier, but for demonstration purposes, I think it's easier to visualize it as if there were multiple classifiers.

Image By Author

From the image above you can see that we will be using a lighter version of BERT called DistilBERT. This distilled model is 40% smaller than the original but still maintains about 97% performance on the various NLP tasks.
Another thing you can notice is that BERT's input is not the original words but tokens. BERT comes with an associated tokenizer that preprocesses the text into the form the model expects. The tokenizer often splits words into subwords, and in addition special tokens are added: [CLS] to indicate the beginning of the sentence, [SEP] to separate multiple sentences, and [PAD] to make each sentence have the same number of tokens.

In addition, each token embedding is summed with a sentence (segment) embedding, a vector that encodes whether the token belongs to the first or the second sentence given as input to BERT.
Since computation in Transformer models is parallel, unlike in recurrent neural networks, we lose the temporal dimension, i.e., the ability to tell which is the first word of a sentence, which is the second, and so on.
Therefore, each token embedding is also summed with a positional embedding that takes into account the position of the token in the sequence.
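To make this concrete, here is a minimal sketch of what the tokenizer produces, using the standard Hugging Face AutoTokenizer for bert-base-uncased (the example sentences are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode a pair of sentences: the tokenizer splits rare words into subwords
# and adds the special tokens [CLS], [SEP] and [PAD] for us.
encoding = tokenizer("Marcello lives in Rome", "He writes about NLP",
                     padding="max_length", max_length=16)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# something like: ['[CLS]', ..., '[SEP]', ..., '[SEP]', '[PAD]', '[PAD]', ...]

# token_type_ids carries the sentence (segment) information: 0 for tokens of
# the first sentence, 1 for tokens of the second one.
print(encoding["token_type_ids"])

# attention_mask is 1 for real tokens and 0 for [PAD] tokens.
print(encoding["attention_mask"])
```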

If you want to learn more about BERT or its WordPiece tokenizer, check out these resources:

Let’s code!

Dataset

The dataset we are going to use is CoNLL-2003, which you can find on Kaggle (with an open license).

Download the Kaggle dataset directly from Colab (remember to upload your personal Kaggle key).
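Something along these lines works in a Colab cell (the dataset slug below is a placeholder; replace it with the CoNLL-2003 upload you actually pick on Kaggle):

```python
# Upload your personal kaggle.json key, then configure the Kaggle CLI.
from google.colab import files
files.upload()  # select kaggle.json when prompted

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Placeholder slug: replace <owner>/<dataset> with the CoNLL-2003 dataset page you chose.
!kaggle datasets download -d <owner>/<dataset> --unzip
```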

Imports

imports
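The imports are embedded as a gist in the original post; a plausible set for the code that follows (pandas and scikit-learn for the data handling, PyTorch and transformers for the model) looks like this:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import SGD  # the optimizer choice here is an assumption

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

from transformers import (
    DistilBertTokenizerFast,
    DistilBertForTokenClassification,
)
```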

Split Dataset

Let’s load the first N rows of our dataframe and change the column names. Then split the dataframe into train, dev and test sets.

split data
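A minimal sketch of this step; the file name, the number of rows N and the column names are assumptions to adapt to the CSV you downloaded from Kaggle:

```python
# Load the first N rows and rename the columns (placeholders, adapt as needed).
N = 1000
df = pd.read_csv("conll2003.csv", nrows=N)
df = df.rename(columns={"sentence": "text", "labels": "tags"})

# 80/10/10 split into train, dev and test sets.
df_train, df_rest = train_test_split(df, test_size=0.2, random_state=7)
df_dev, df_test = train_test_split(df_rest, test_size=0.5, random_state=7)

print(len(df_train), len(df_dev), len(df_test))
```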

Custom Classes

I am now going to define three classes that will be needed for this project. The first class defines our DistilBERT model. There's no need to build the classifier manually on top of the pretrained language model because Hugging Face already provides a built-in model that contains the classifier as the last layer. This model is called DistilBertForTokenClassification.

DistilbertNer Class

The init method takes as input the dimension of the classification layer, i.e., the number of tags we can predict, and instantiates the pretrained model.

The forward computation simply takes the input_ids (the tokens) and the attention_mask (an array of 0/1 values that tells us whether a token is a PAD or not) and returns a dictionary: {loss, logits}

Model class
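The gist with the model class is embedded above; a sketch of what such a wrapper can look like, following the description in this section (the argument names are mine), is:

```python
import torch.nn as nn
from transformers import DistilBertForTokenClassification

class DistilbertNER(nn.Module):
    """Wrapper around DistilBertForTokenClassification.

    tokens_dim is the size of the classification layer, i.e. the number of
    distinct tags the model can predict.
    """

    def __init__(self, tokens_dim):
        super().__init__()
        self.pretrained = DistilBertForTokenClassification.from_pretrained(
            "distilbert-base-uncased", num_labels=tokens_dim
        )

    def forward(self, input_ids, attention_mask, labels=None):
        # When labels are passed, the Hugging Face model also computes the loss;
        # the output behaves like a dictionary with .loss and .logits.
        return self.pretrained(
            input_ids=input_ids, attention_mask=attention_mask, labels=labels
        )
```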

NerDataset Class

The second class is an extension of torch.utils.data.Dataset, which enables us to create our custom dataset.

Given a dataframe as input, this class will tokenize the text and match the additionally generated tokens with the correct tag (as described in the previous article). You can then index the dataset, which will return a batch of texts and labels.

Custom dataset
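Here is a self-contained sketch of such a dataset. It aligns tags with subword tokens via the fast tokenizer's word_ids(), which is one way to implement the matching described above; special tokens get the ignore index -100 here (assigning them the label of the O tag is an equally valid choice).

```python
import torch

class NerDataset(torch.utils.data.Dataset):
    """Tokenizes the sentences of a dataframe and matches each generated
    token with the tag of the word it comes from."""

    def __init__(self, df, tokenizer, tag2idx, max_len=128):
        words = [t.split() for t in df["text"]]
        tags = [t.split() for t in df["tags"]]

        self.encodings = tokenizer(
            words,
            is_split_into_words=True,
            padding="max_length",
            truncation=True,
            max_length=max_len,
            return_tensors="pt",
        )

        labels = []
        for i, sentence_tags in enumerate(tags):
            word_ids = self.encodings.word_ids(batch_index=i)
            # Special tokens ([CLS], [SEP], [PAD]) get -100 (ignored by the
            # loss); subword tokens inherit the tag of their original word.
            labels.append(
                [-100 if w is None else tag2idx[sentence_tags[w]] for w in word_ids]
            )
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item
```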

MetricsTracking Class

Computing all the metrics inside the training loop can get messy, so this class takes care of it for us. You only need to instantiate a new object and call its update method every time you make predictions over a batch.
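A sketch of what this tracker can look like (only the update method is named in the article; the rest of the interface is my own):

```python
from sklearn.metrics import accuracy_score, f1_score

class MetricsTracking:
    """Accumulates predictions and labels across batches so that metrics
    can be computed outside of the training loop."""

    def __init__(self):
        self.all_preds = []
        self.all_labels = []

    def update(self, logits, labels, ignore_index=-100):
        preds = logits.argmax(dim=-1).flatten()
        labels = labels.flatten()
        mask = labels != ignore_index          # drop special and PAD tokens
        self.all_preds.extend(preds[mask].tolist())
        self.all_labels.extend(labels[mask].tolist())

    def return_metrics(self):
        return {
            "accuracy": accuracy_score(self.all_labels, self.all_preds),
            "f1": f1_score(self.all_labels, self.all_preds, average="macro"),
        }
```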

Custom Methods

  • tags_2_labels : a method that takes a list of tags and a dictionary mapping tags to labels, and returns the list of labels associated with the original tags.
Tags to labels
  • tags_mapping : takes as input the tags column of a dataframe and returns: (1) a dictionary mapping tags to indexes (labels), (2) a dictionary mapping indexes to tags, (3) the label corresponding to the tag O, and (4) the set of unique tags encountered in the training data, which defines the classifier dimension.
Tags mapping
  • match_tokens_labels : from the tokenized text and the original tags (which are associated with words, not tokens), it outputs an array with a tag for every single token, associating each token with the tag of its original word.
  • freeze_model : freezes all but the last num_layers of the model to mitigate catastrophic forgetting (see the sketch after this list).
  • train_loop : the usual training loop.
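A sketch of a few of these helpers, under the assumptions already introduced (the tag-to-token alignment is handled inside the NerDataset sketch above, and the training loop is sketched in the Main section below):

```python
def tags_mapping(tags_series):
    """Builds the mappings from the 'tags' column of the training dataframe,
    where each row is a space-separated string of tags."""
    unique_tags = sorted({tag for row in tags_series for tag in row.split()})
    tag2idx = {tag: idx for idx, tag in enumerate(unique_tags)}
    idx2tag = {idx: tag for tag, idx in tag2idx.items()}
    return tag2idx, idx2tag, tag2idx["O"], set(unique_tags)


def tags_2_labels(tags, tag2idx):
    """Maps a space-separated string of tags to the corresponding labels."""
    return [tag2idx[tag] for tag in tags.split()]


def freeze_model(model, num_layers=1):
    """Freezes every parameter except those in the last num_layers top-level
    blocks of the pretrained model (assumes the DistilbertNER wrapper above),
    so fine-tuning cannot overwrite most of the pretrained weights."""
    blocks = list(model.pretrained.named_children())  # distilbert, dropout, classifier
    for _, block in blocks[:-num_layers]:
        for param in block.parameters():
            param.requires_grad = False
    return model
```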

Main

Now let’s use everything in the main scope.

Create mapping tags-labels
Tokenize text
Train model
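Putting the pieces together, the main scope can look roughly like this (it reuses the classes and helpers sketched above; batch size, learning rate and number of epochs are illustrative):

```python
# Build the tag/label mappings from the training data.
tag2idx, idx2tag, o_label, unique_tags = tags_mapping(df_train["tags"])

# Tokenize the text through the custom datasets.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_set = NerDataset(df_train, tokenizer, tag2idx)
dev_set = NerDataset(df_dev, tokenizer, tag2idx)

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
dev_loader = DataLoader(dev_set, batch_size=16)

# Instantiate the model and (optionally) freeze all but the classifier.
model = DistilbertNER(tokens_dim=len(unique_tags))
model = freeze_model(model, num_layers=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = SGD(model.parameters(), lr=0.01)

# Train the model.
for epoch in range(3):
    model.train()
    metrics = MetricsTracking()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        out = model(batch["input_ids"], batch["attention_mask"], labels=batch["labels"])
        out.loss.backward()
        optimizer.step()
        metrics.update(out.logits, batch["labels"])
    print(f"epoch {epoch}: {metrics.return_metrics()}")
```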

Now you should get results similar to these:

Results

Congratulations, you built your named entity recognition model based on a pre-trained language model!

Final Thoughts

In this hands-on article we saw how we can take advantage of a pretrained language model to build a simple named entity recognition model in very little time. Keep in mind that when you train the whole model on the new data, you are likely to change the original DistilBERT weights too much and thus deteriorate the performance of the model. To avoid this, you can decide to freeze all but the last layer, the classification layer, and prevent the problem known as catastrophic forgetting. In the model selection phase, you can try different variants by changing how many of the final layers you leave unfrozen.

The End

Marcello Politi

Linkedin, Twitter, CV
