
BERT has become a new standard for Natural Language Processing (NLP). It achieved a whole new state-of-the-art on eleven NLP tasks, including text classification, sequence labeling, question answering, and many more. Even better, it can also give incredible results using only a small amount of data. BERT was first released in 2018 by Google along with its paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Now we can easily apply BERT to our model by using the Hugging Face (🤗) Transformers library. The library also provides complete documentation for other transformer models; you can check it here. In this post, I will try to summarize some important points which we will likely use frequently. We will take a look at how to use and train models using BERT from 🤗 Transformers. Later, you can also utilize other transformer models (such as XLM, RoBERTa, XLM-RoBERTa (my favorite!), BART, and many others) by simply changing a single line of code.
Text classification seems to be a pretty good start to get to know BERT. There are many kinds of text classification tasks, but we will choose sentiment analysis in this case. Here are the 5 main points that will be covered in this post:
- Installation
- Pipeline
- Fine-tune
- Using custom dataset
- Hyperparameter search
Installation
As stated on their website, to run 🤗 Transformers you will need the following requirements:
- Python 3.6+
- PyTorch 1.10+ or TensorFlow 2.0+
They also encourage us to use a virtual environment to install them, so don’t forget to activate it first.
The installation is quite easy. Once TensorFlow or PyTorch has been installed, you just need to type:
pip install transformers
In this post, we are going to use PyTorch, but it should be easy to translate the code into TensorFlow: just add ‘TF’ at the beginning of each model class name.
Pipeline
When you just want to test the library or simply use it to predict some sentences, you can use pipeline(). Besides text classification, it already provides many different tasks such as text generation, question answering, summarization, and so on. To run a sentiment analysis task, simply type:
from transformers import pipeline
# The sentiment-analysis pipeline downloads a default model and tokenizer on first use.
classifier = pipeline('sentiment-analysis')
result = classifier('We are very happy to show you the 🤗 Transformers library.')
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
It uses a model named "distilbert-base-uncased-finetuned-sst-2-english" by default. We can also switch to any other model from the model hub. For example, if we want to use nlptown/bert-base-multilingual-uncased-sentiment, then simply do the following:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
Fine-tune
First things first, we need a dataset. At this point, we are going to use the dataset provided by 🤗 Datasets. It provides a wide range of tasks, ranging from text classification and token classification to language modeling, and many more. To install it, simply execute the following line:
pip install datasets
Load data
We are going to use the sst2 dataset from the GLUE benchmark and the bert-base-uncased pretrained model. By running load_dataset and load_metric, we download the dataset as well as its metric. load_metric automatically loads a metric associated with the chosen task.
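Putting that into code, a minimal sketch looks like this:
from datasets import load_dataset, load_metric
# Download the SST-2 dataset from GLUE and the metric associated with it.
dataset = load_dataset("glue", "sst2")
metric = load_metric("glue", "sst2")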
Preprocessing
To preprocess, we need to instantiate our tokenizer using AutoTokenizer (or another tokenizer class associated with the model, e.g. BertTokenizer). By calling from_pretrained(), we download the vocabulary used during pretraining of the given model (in this case, bert-base-uncased), so that the tokenization results correspond to the model’s vocabulary.
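Here is a minimal sketch of the preprocessing step, assuming sentence is the text column (as it is in sst2):
from transformers import AutoTokenizer
# Download the vocabulary used to pretrain bert-base-uncased.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def preprocess(examples):
    # Truncate so every input fits within the model's maximum sequence length.
    return tokenizer(examples["sentence"], truncation=True)
encoded_dataset = dataset.map(preprocess, batched=True)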
Fine-tuning
Fortunately, they also provide a simple interface called [Trainer()](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer), which makes the training and evaluation process much easier without losing the flexibility to modify a wide range of training options.
First, instantiate and download the model with [from_pretrained()](https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained). Since our task is sequence classification, we can use AutoModelForSequenceClassification (or another model class associated with the pretrained model, e.g. BertForSequenceClassification).
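For example (num_labels=2 because SST-2 is a binary task):
from transformers import AutoModelForSequenceClassification
# SST-2 has two labels (positive/negative), so num_labels=2.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)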
We also need to define our own compute_metrics function if we want metrics other than the loss. This function can then be passed to the Trainer.
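A minimal sketch of how these pieces fit together is shown below; the exact hyperparameter values (output directory, learning rate, batch size, epochs) are just illustrative, so tune them to your needs:
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; take the argmax to get predicted classes.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="bert-sst2",          # where checkpoints are written
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()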
Using Custom Dataset
Now we just need to convert our dataset into the right format so that the model can work properly. We will use a small subset of the Amazon review dataset in the fashion category. You can find the dataset here. The labels are still in the form of ratings, so we need to convert them into positive or negative. Reviews with 3 or more stars will be classified as positive, and the rest as negative. This is just an example; feel free to change it the way you like.
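As a rough sketch of that conversion, assuming the reviews are loaded into a pandas DataFrame (the file name and the overall / reviewText column names follow the raw Amazon review dumps and may differ in your copy):
import pandas as pd
# Hypothetical file name; adjust to wherever you stored the fashion reviews.
df = pd.read_json("AMAZON_FASHION.json", lines=True)
# 3 or more stars -> positive (1), otherwise negative (0).
df["label"] = (df["overall"] >= 3).astype(int)
texts, labels = df["reviewText"].tolist(), df["label"].tolist()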
After that, we split them into train, validation, and test sets and tokenize them using AutoTokenizer. We also need to convert our data into a dataset object by subclassing torch.utils.data.Dataset and implementing __len__ and __getitem__. Take a look at the AmazonDataset class below. For training, just repeat the steps in the previous section, but this time we use DistilBERT instead of BERT. It is a smaller version of BERT: faster and lighter!
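Here is a minimal sketch of such a class, assuming the texts have already been tokenized into encodings and the labels converted to 0/1:
import torch

class AmazonDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one example as a dict of tensors, with its label under "labels".
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Hypothetical usage with the splits from above:
# train_encodings = tokenizer(train_texts, truncation=True, padding=True)
# train_dataset = AmazonDataset(train_encodings, train_labels)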
As you can see, the evaluation result is quite good (almost 100% accuracy!). Apparently, this is because there is a lot of repetitive data; some reviews appear more than three times in the dataset. So, make sure that your data is clean and representative enough of the real world.
Hyperparameter Search
Even better, they also support hyperparameter search using Optuna or Ray Tune (you can choose either one). It runs the training process several times, so the model needs to be defined via a function (so that it can be reinitialized at each new run). See the model_init function below.
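A minimal sketch of this setup, assuming the distilbert-base-uncased checkpoint and the Optuna backend (n_trials is just an illustrative number):
def model_init():
    # Re-instantiate the model from scratch so every trial starts from the same pretrained weights.
    return AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model_init=model_init,           # note: model_init instead of model
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize", backend="optuna")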
Besides that, it will also take a very long time to run. Alternatively, you can do the hyperparameter search using only a portion of the training data to save time and resources. After getting the best configuration, we can rerun the training on the full data with that configuration. Just do something like this:
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)  # keep only 1/10 of the training data for the search
This process returns a BestRun object containing information about the hyperparameters used for the best run. To use this configuration, just set those hyperparameters in the TrainingArguments.
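A small sketch of applying the result and retraining on the full data:
# Copy the winning hyperparameters back into the training arguments,
# then retrain on the full training set.
for param, value in best_run.hyperparameters.items():
    setattr(trainer.args, param, value)
trainer.train_dataset = encoded_dataset["train"]
trainer.train()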
That’s it! If you want to try another task, another pretrained model, or even your own dataset, you can easily customize it to your needs by modifying a couple of lines, and BOOM! You have your own transformers-powered NLP model!
References
[2] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)