
A Beginner’s Guide to Using BERT for the First Time

From predicting a single sentence to fine-tuning on a custom dataset to finding the best hyperparameter configuration.

Photo by Jamie Street on Unsplash

BERT has become a new standard for Natural Language Processing (NLP). It achieved a whole new state of the art on eleven NLP tasks, including text classification, sequence labeling, question answering, and many more. Even better, it can give incredible results using only a small amount of data. BERT was first released in 2018 by Google along with its paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Now we can easily apply BERT to our own tasks using the Huggingface (🤗) Transformers library. The library also provides complete documentation for the other transformer models; you can check it here. In this post, I will try to summarize some important points that we will likely use frequently. We will take a look at how to use and train BERT models with 🤗 Transformers. Later, you can also switch to other transformer models (such as XLM, RoBERTa, XLM-RoBERTa (my favorite!), BART, and many others) by simply changing a single line of code.

Text classification seems to be a pretty good place to start getting to know BERT. There are many kinds of text classification tasks, but we will choose sentiment analysis in this case. Here are the five main points that will be covered in this post:

  1. Installation
  2. Pipeline
  3. Fine-tune
  4. Using custom dataset
  5. Hyperparameter search

Installation

As stated on their website, to run 🤗 Transformers you will need the following requirements:

  • Python 3.6+
  • PyTorch 1.1.0+ or TensorFlow 2.0+

They also encourage us to install the library inside a virtual environment, so don’t forget to activate it first.

The installation is quite easy. Once TensorFlow or PyTorch has been installed, you just need to type:

pip install transformers

In this post, we are going to use PyTorch. But it should be easy to translate the code into TensorFlow: just add ‘TF’ at the beginning of each model class name, as in the example below.
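For instance, the PyTorch sequence-classification class and its TensorFlow counterpart differ only by that prefix (a minimal illustration, assuming the corresponding backend is installed):

# PyTorch model class
from transformers import AutoModelForSequenceClassification

# TensorFlow counterpart: same class name with a 'TF' prefix
from transformers import TFAutoModelForSequenceClassification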

Pipeline

When you just want to test the model or simply use it to predict a few sentences, you can use pipeline(). Besides text classification, it already covers many other tasks such as text generation, question answering, summarization, and so on. To run a sentiment analysis task, simply type:

from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('We are very happy to show you the 🤗 Transformers library.')

It uses a model named "distilbert-base-uncased-finetuned-sst-2-english" by default. We can also switch to any other model available on the model hub. For example, if we want to use nlptown/bert-base-multilingual-uncased-sentiment, simply do the following:

classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

Fine-tune

First things first, we need a dataset. At this point, we are going to use a dataset provided by 🤗 Datasets. It offers a wide range of tasks, from text classification and token classification to language modeling and many more. To install it, simply execute the following line:

pip install datasets

Load data

We are going to use the sst2 dataset from the GLUE benchmark and the bert-base-uncased pretrained model. By running load_dataset and load_metric, we download the dataset as well as the metric; load_metric automatically loads a metric associated with the chosen task.
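A minimal sketch of this step, using the 🤗 Datasets API available at the time of writing:

from datasets import load_dataset, load_metric

# Download the SST-2 subset of GLUE together with its associated metric
dataset = load_dataset("glue", "sst2")
metric = load_metric("glue", "sst2")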

Preprocessing

To preprocess, we need to instantiate our tokenizer using AutoTokenizer (or another tokenizer class associated with the model, e.g. BertTokenizer). By calling from_pretrained(), we download the vocab used during the pretraining of the given model (in this case, bert-base-uncased). Downloading the vocab ensures that the tokenization results correspond to the vocabulary the model was pretrained with.
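A sketch of the preprocessing step (the "sentence" column name matches the SST-2 dataset; the truncation setting is an assumption):

from transformers import AutoTokenizer

# Download the vocab used to pretrain bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(examples):
    # SST-2 stores the input text in the "sentence" column
    return tokenizer(examples["sentence"], truncation=True)

encoded_dataset = dataset.map(preprocess, batched=True)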

Fine-tuning

Fortunately, they also provide a simple interface called [Trainer()](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer), which makes the training and evaluation process much easier without losing the flexibility to modify a wide range of training options.

First, instantiate and download the model with [from_pretrained()](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained). Since our task is sequence classification, we can use AutoModelForSequenceClassification (or another model class associated with the pretrained model, e.g. BertForSequenceClassification).
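For example (num_labels=2 because SST-2 is a binary task):

from transformers import AutoModelForSequenceClassification

# Download the pretrained weights and add a fresh classification head on top
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)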

We need to define our own compute_metrics function if we want metrics other than just the loss. This function can then be passed to the Trainer, as in the sketch below.
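A sketch of how the pieces could be wired together, continuing from the snippets above (the TrainingArguments values are placeholders, not the article’s original settings):

import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple at evaluation time
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    num_train_epochs=2,              # placeholder values
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    compute_metrics=compute_metrics,
)

trainer.train()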

Using Custom Dataset

Now we just need to convert our dataset into the right format so that the model can work properly. We will use a small subset of the Amazon review dataset in the fashion category; you can find the dataset here. The labels are still in the form of ratings, so we need to convert them into positive or negative. Reviews with 3 or more stars will be classified as positive, and the rest as negative. This is just an example; feel free to change it however you like.
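For illustration, a minimal sketch of that conversion, assuming the reviews are loaded into a pandas DataFrame with "overall" (star rating) and "reviewText" columns, as in the common Amazon review dumps (the file name is hypothetical):

import pandas as pd

# Hypothetical file name; column names are assumptions about the raw dump
df = pd.read_json("AMAZON_FASHION.json", lines=True)

# 3 stars or more -> positive (1), the rest -> negative (0)
df["label"] = (df["overall"] >= 3).astype(int)
texts = df["reviewText"].astype(str).tolist()
labels = df["label"].tolist()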

After that, we split the data into train, validation, and test sets and tokenize them using AutoTokenizer. We also need to convert our data into a dataset object by subclassing torch.utils.data.Dataset and implementing __len__ and __getitem__. Take a look at the AmazonDataset class below. For training, just repeat the steps from the previous section, but this time we use DistilBERT instead of BERT. It is a smaller version of BERT: faster and lighter!
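A sketch of such a class, following the common 🤗/PyTorch pattern (train_texts and train_labels are assumed to come from the split described above):

import torch

class AmazonDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings   # output of the tokenizer
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Wrap each tokenizer field (input_ids, attention_mask, ...) in a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_dataset = AmazonDataset(train_encodings, train_labels)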

As you can see, the evaluation looks quite good (almost 100% accuracy!). Apparently, that is because there is a lot of repetitive data: some reviews appear more than three times in the dataset. So make sure your data is clean and good enough to represent the real world.

Hyperparameter Search

Even better, they also support hyperparameter search using Optuna or Ray Tune (you can choose either). The search runs the training process several times, so the model needs to be defined via a function (so it can be re-initialized at each new run). See the model_init function below.
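A sketch of what that could look like, reusing the objects from the earlier snippets (n_trials is a placeholder; hyperparameter_search picks Optuna or Ray Tune depending on what is installed, and val_dataset is assumed to be the validation split created above):

def model_init():
    # Re-create the model from scratch at the start of every trial
    return AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2
    )

trainer = Trainer(
    model_init=model_init,           # note: model_init instead of model
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(direction="maximize", n_trials=10)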

Besides that, it will also take a very long time to run. To save time and resources, you can alternatively run the hyperparameter search on only a portion of the training data, then rerun the training on the full data with the best configuration found. Just do something like this:

train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)

This process returns a BestRun object containing information about the hyperparameters used for the best run. To use this configuration, just set those hyperparameters in the TrainingArguments.
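One way to apply them, following the pattern used in the 🤗 example notebooks (best_run is the object returned by the search above):

# Copy the winning hyperparameters into the existing TrainingArguments
for name, value in best_run.hyperparameters.items():
    setattr(trainer.args, name, value)

# Retrain on the full training set with the best configuration
trainer.train()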


That’s it! If you want to try another task or another pretrained model, or even use your own dataset, you can easily customize it to your needs by modifying just a couple of lines, and BOOM! You already have your own transformers-powered NLP model!

References

[1] Huggingface Transformers

[2] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)

