
First time using and fine-tuning the BERT framework for classification



Photo by Dmitry Ratushny

Beginning of journey in language land

My interest in NLP started when I decided to take part in an ongoing competition to identify whether a given tweet is about a disaster or not. I had no experience in language processing, and after a few internet searches I learned about text preprocessing steps such as tokenization and lemmatization. I used TfidfTransformer and TfidfVectorizer for feature extraction and then a simple Naive Bayes classifier (score = 0.77). In the meantime I took a deep learning specialization course, learned about RNNs, and decided to use an LSTM model for this task, which gave better results (score = 0.79987, top 40%). The course also mentioned transfer learning and how powerful it can be for any task. I thought, why not try it on the dataset I have right now?


Discovery of BERT

I searched for different frameworks in NLP and came across BERT. Developed by Google, it is said to be one of the most powerful and influential models in NLP: it is pre-trained on a large unlabelled dataset and achieved state-of-the-art results on 11 individual NLP tasks. It can be fine-tuned to your needs, which makes it even more powerful. I decided to use this framework and fine-tune it on my dataset. While looking for a way to use it, I came across Hugging Face Transformers, which provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with 32+ pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.


I was able to fine-tune this model after a lot of experimentation and reading the documentation, and I hope this experience can help you in some way.


Let’s get started

First, let’s see the data given

import pandas as pd
data = pd.read_csv('train.csv')   # load the training data
data.head()
Output of the code snippet
data.describe(include = 'all')
Output of the code snippet

Similarly, there is a ‘test.csv’ whose tweets we have to classify. We can combine both datasets so that the same operations are applied to each. We can drop the keyword and location columns so that predictions are made from the tweet text only.

df1 = pd.read_csv('content/drive/My Drive/disaster tweets/train.csv')
df2 = pd.read_csv('content/drive/My Drive/disaster tweets/test.csv')
combined = pd.concat([df1,df2], axis=0)
combined = combined.drop(['keyword','location'], axis=1)

I didn’t pre-process or clean the data (e.g. removing punctuation or HTML tags) as I just wanted to see how to work with the framework. I expect that cleaning the data would improve the results further.
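
For readers who do want to try cleaning, here is a minimal sketch of what such a step might look like. It was not part of my pipeline: clean_tweet is a hypothetical helper, the regular expressions are only one possible choice, and whether this kind of cleaning actually helps BERT is something you would have to test.

```python
import html
import re

def clean_tweet(text):
    """Hypothetical cleaning helper: strip HTML tags, URLs, and punctuation."""
    text = html.unescape(text)                 # "&amp;" -> "&"
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^\w\s#@]", " ", text)     # drop punctuation, keep # and @
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

cleaned = clean_tweet("Fire near <b>La Ronge</b>!! http://t.co/abc &amp; more")
print(cleaned)  # Fire near La Ronge more
```

If it were used, it could be applied to the text column before tokenization, for example with combined['text'].apply(clean_tweet).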

from transformers import BertForSequenceClassification    # BERT with a classification head on top
from torch.optim import AdamW    # PyTorch's AdamW (the AdamW in transformers is deprecated in favor of this one)
import numpy as np
import pandas as pd
import torch
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')   # load the pre-trained BERT weights (2 labels by default)
model.train()     # put the model in training mode so that layers such as dropout behave accordingly

Tokenisation and encoding

To use the BERT model, we first have to tokenize and encode our text. The BERT tokenizer is provided in Hugging Face Transformers.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer(combined.text.values.tolist(), padding=True, truncation=True, return_tensors='pt')
Output of the code snippet

The tokenizer returns a dictionary of three tensors (‘input_ids’, ‘attention_mask’, ‘token_type_ids’), each with one row per tweet, which the model uses as input.
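
To make the roles of padding and ‘attention_mask’ concrete, here is a toy sketch in plain Python. The token ids are made up for illustration (they are not real BERT vocabulary ids) and pad_batch is a hypothetical helper, not part of the transformers library. The real tokenizer does this for you when padding=True.

```python
# Toy illustration of padding and attention masks (not the real BERT tokenizer).
# The ids below are invented and pad_batch is a hypothetical helper.

def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = pad_batch([[11, 42, 7, 12], [11, 5, 12]])
print(toy_batch["input_ids"])       # [[11, 42, 7, 12], [11, 5, 12, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask tells the model which positions hold real tokens, so the padded positions are ignored by attention.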


Now, let’s separate the tensors into different variables (we only need ‘input_ids’ and ‘attention_mask’) and split the combined data back into its train and test parts.

input_id = encoded['input_ids']
attention_mask = encoded['attention_mask']
train_id = input_id[:len(df1)]
train_am = attention_mask[:len(df1)]
test_id = input_id[len(df1):]
test_am = attention_mask[len(df1):]
train = combined.iloc[:len(df1)]
test = combined.iloc[len(df1):]

To train and evaluate the model, let’s split the train data into two parts: the first 6,800 rows for training and the remaining rows for testing.

Xtrain = train.iloc[:6800]
Xtest =  train.iloc[6800:]
Xtrain_id = train_id[:6800]
Xtrain_am = train_am[:6800]
Xtest_id = train_id[6800:]
Xtest_am = train_am[6800:]
labels = torch.tensor(Xtrain.target.values.tolist())
labels = labels.type(torch.LongTensor)
labels.shape

Fine-tuning the model

Now, let’s focus on the model. We will use PyTorch to train it (TensorFlow could also be used). First, we configure our optimizer (AdamW), and then we train the model in batches so that our machine (CPU or GPU) doesn’t run out of memory.

optimizer = AdamW(model.parameters(), lr=1e-5)
n_epochs = 1 
batch_size = 32 

for epoch in range(n_epochs):

    permutation = torch.randperm(Xtrain_id.size()[0])

    for i in range(0,Xtrain_id.size()[0], batch_size):
        optimizer.zero_grad()

        indices = permutation[i:i+batch_size]
        batch_x, batch_y,batch_am = Xtrain_id[indices],   labels[indices], Xtrain_am[indices]

        outputs = model(batch_x, attention_mask=batch_am, labels=batch_y)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
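
The shuffle-then-slice pattern used for batching in the loop above can be seen in isolation with a toy sketch. It uses only the standard library, and shuffled_batches is a hypothetical helper name introduced for illustration; the loop above does the same thing with torch.randperm.

```python
import random

def shuffled_batches(n_examples, batch_size, seed=None):
    """Yield lists of indices that cover 0..n_examples-1 exactly once, in random order."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)              # new random order (fixed if seed is given)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]       # the last batch may be smaller

batches = list(shuffled_batches(n_examples=10, batch_size=4, seed=0))
print([len(b) for b in batches])  # [4, 4, 2]
```

Every example is visited once per epoch, and only one batch of tensors has to go through the model at a time.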

Here, outputs is a tuple containing the cross-entropy loss and the final activations of the model (the logits, one score per class). For example, here are the two tensors in the output:

Output given by the model

We can use these activations to classify the disaster tweets with the help of the softmax activation function.
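
As a rough sketch of what softmax does to these activations, here is a small NumPy version. The logits are made up for illustration and the softmax function is written by hand here; in the actual code further down, PyTorch’s built-in softmax is used instead.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two tweets (columns: label 0, label 1).
toy_logits = np.array([[2.0, -1.0],
                       [-0.5, 1.5]])
toy_probs = softmax(toy_logits)
toy_labels = toy_probs.argmax(axis=1)   # pick the more probable class per row
print(toy_probs)    # each row sums to 1
print(toy_labels)   # [0 1]
```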


Now let’s test the model on the remaining data in the training set. First, we put the model in evaluation mode and then take outputs from the model.

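model.eval()     # put the model in evaluation mode (disables dropout)
batch_size = 32
test_labels = torch.tensor(Xtest.target.values.tolist()).type(torch.LongTensor)   # labels of the held-out rows

with torch.no_grad():     # no gradients are needed for evaluation
  for i in range(0, Xtest_id.size()[0], batch_size):

    # take the next slice of held-out tweets together with their own labels
    batch_x, batch_y, batch_am = Xtest_id[i:i+batch_size], test_labels[i:i+batch_size], Xtest_am[i:i+batch_size]

    outputs = model(batch_x, attention_mask=batch_am, labels=batch_y)
    loss = outputs[0]
    print('Loss:', loss.item())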

You can also compute an accuracy metric by comparing the predicted labels with the true labels: (correct predictions) / (total tweets) * 100.
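
A minimal sketch of that accuracy calculation, using NumPy and made-up values, could look like this. accuracy_percent is a hypothetical helper name, and the logits and labels below are invented for illustration, not results from the model.

```python
import numpy as np

def accuracy_percent(logits, labels):
    """Percentage of rows whose highest-scoring class matches the true label."""
    predicted = np.argmax(logits, axis=1)
    return float((predicted == labels).mean() * 100)

# Made-up logits and labels for four tweets, for illustration only.
toy_logits = np.array([[1.2, -0.3], [0.1, 0.9], [2.0, 0.5], [-1.0, 0.4]])
toy_true = np.array([0, 1, 1, 1])
toy_accuracy = accuracy_percent(toy_logits, toy_true)
print(toy_accuracy)  # 75.0
```

In the evaluation loop, the logits are the second element of outputs, so something like outputs[1].detach().numpy() and batch_y.numpy() could be passed in for each batch and the results averaged.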

Now, let’s predict the labels for the actual test data.

import torch.nn.functional as F  # for the softmax function
batch_size = 32
prediction = np.empty((0,2)) # empty array to which each batch of probabilities is appended
with torch.no_grad():  # inference only, so no gradients are needed
  for i in range(0, test_id.size()[0], batch_size):
    # test_id and test_am are the test tensors split off from the combined data earlier
    batch_x1, batch_am1 = test_id[i:i+batch_size], test_am[i:i+batch_size]
    pred = model(batch_x1, attention_mask=batch_am1) # without labels, only the logits are returned
    pt_predictions = F.softmax(pred[0], dim=-1)  # turn the logits into class probabilities
    prediction = np.append(prediction, pt_predictions.detach().numpy(), axis=0) # append this batch's predictions
Shape of prediction

As we can see, prediction has two columns: prediction[:,0] gives the probability of label 0 and prediction[:,1] gives the probability of label 1. We can use the argmax function to pick the more probable label.

sub = np.argmax(prediction, axis=1)

Then by arranging these labels with the proper id we can get our predictions.

submission = pd.DataFrame({'id': test.id.values, 'target':sub})
Submission dataset

Using this model, I got a score of 0.83695 and placed in the top 12% without even cleaning or processing the data. This shows how powerful this framework is and how it can be used for various purposes. You can also see the code here.


I hope my experience helps you in some way. Also, let me know what more could be done to improve the performance (as I am also a newbie in NLP :P).

