First time fine-tuning the BERT framework

Beginning of journey in language land
My interest in the field of NLP started when I decided to participate in an ongoing competition for identifying whether a given tweet is about a real disaster or not. I had no experience in language processing, and after a few internet searches I learned about some text preprocessing steps like tokenization and lemmatization, used TfidfTransformer and TfidfVectorizer for feature extraction, and then simply used Naive Bayes for classification (score = 0.77). I took a deep learning specialization course in the meantime, learned about RNNs, and decided to use an LSTM model for this task, which gave better results (score = 0.79987, top 40%). That course also mentioned transfer learning and how powerful a tool it can be for any task, so I thought: why not try it on the dataset I have right now?
Discovery of BERT
I searched for different frameworks in NLP and came across BERT. It is said to be one of the most powerful and influential models in NLP, released by Google and pre-trained on a large unlabelled corpus, achieving state-of-the-art results on 11 individual NLP tasks. It can also be fine-tuned for your own needs, which makes it even more powerful. I decided to use this framework and fine-tune it on my dataset. While searching for how to use it, I came across Hugging Face Transformers, which provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with over 32 pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
I was able to fine-tune this model after a lot of experimentation and reading the documentation, and I hope this experience can help you in some way too.
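If you want to follow along, the Transformers library (plus PyTorch) can be installed with pip; in a notebook environment that would be a cell like the one below (exact versions are up to you):
!pip install transformers torch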
Let’s get started
First, let’s take a look at the given data.
import pandas as pd

data = pd.read_csv('train.csv')
data.head()

data.describe(include = 'all')

Similarly, there is a ‘test.csv’ whose tweets we have to predict. We can combine both datasets and perform the necessary operations on them together. We can also drop the keyword and location columns, so that we make predictions from the tweet text only.
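Since the file paths below point to Google Drive, I’m assuming a Colab notebook with the drive mounted; a setup cell along these lines makes the paths resolvable:
from google.colab import drive
drive.mount('/content/drive')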
df1 = pd.read_csv('/content/drive/My Drive/disaster tweets/train.csv')
df2 = pd.read_csv('/content/drive/My Drive/disaster tweets/test.csv')
combined = pd.concat([df1, df2], axis=0)                   # stack train and test so both get tokenized the same way
combined = combined.drop(['keyword', 'location'], axis=1)  # keep only the id, text (and target for the training rows)
I didn’t pre-process or clean the data (e.g. removing punctuation or HTML tags), as I just wanted to see how to work with the framework. I am sure that cleaning and working with the data would give even better results.
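For reference, if you do want to try cleaning the tweets before tokenizing, a minimal pass could look something like the sketch below; clean_tweet is just an illustrative helper, not something I actually used here:
import re

def clean_tweet(text):
    text = re.sub(r'https?://\S+', '', text)     # drop URLs
    text = re.sub(r'<.*?>', '', text)            # drop HTML tags
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)   # drop punctuation
    return text.strip()

# combined['text'] = combined['text'].apply(clean_tweet)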
from transformers import BertForSequenceClassification, AdamW  # importing the appropriate class for classification
import numpy as np
import pandas as pd
import torch

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')  # loading the pre-trained BERT model
model.train()  # tell the model it's in training mode so that some layers (dropout, batchnorm) behave accordingly
Tokenisation and encoding
To use the BERT model, we first have to tokenize and encode our text, and the BERT tokenizer is provided in Hugging Face Transformers.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer(combined.text.values.tolist(), padding=True, truncation=True, return_tensors='pt')

The encoder returns three tensors for each tweet in the form of a dictionary (‘input_ids’, ‘attention_mask’, ‘token_type_ids’), which are used by the model.
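A quick sanity check on the encoded output (assuming the encoding step above) shows the keys and shapes involved:
print(encoded.keys())                    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(encoded['input_ids'].shape)        # (number of tweets, padded sequence length)
print(encoded['attention_mask'].shape)   # same shape; 1 for real tokens, 0 for padding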
Now, let’s separate the tensors into different variables (we only need ‘input_ids’ and ‘attention_mask’) and break the combined data back into the train and test parts.
input_id = encoded['input_ids']
attention_mask = encoded['attention_mask']
train_id = input_id[:len(df1)]
train_am = attention_mask[:len(df1)]
test_id = input_id[len(df1):]
test_am = attention_mask[len(df1):]
train = combined.iloc[:len(df1)]
test = combined.iloc[len(df1):]
To be able to evaluate the model, let’s split the training data into two parts: one for training and one for testing the model.
Xtrain = train.iloc[:6800]
Xtest = train.iloc[6800:]
Xtrain_id = train_id[:6800]
Xtrain_am = train_am[:6800]
Xtest_id = train_id[6800:]
Xtest_am = train_am[6800:]
labels = torch.tensor(Xtrain.target.values.tolist())
labels = labels.type(torch.LongTensor)
labels.shape
Fine-tuning the model
Now, let’s focus on the model. We will use PyTorch for training the model (TensorFlow could also be used). First, we configure our optimizer (AdamW), and then we train the model in batches so that our machine (CPU, GPU) doesn’t run out of memory.
optimizer = AdamW(model.parameters(), lr=1e-5)
n_epochs = 1
batch_size = 32

for epoch in range(n_epochs):
    permutation = torch.randperm(Xtrain_id.size()[0])  # shuffle the training examples each epoch
    for i in range(0, Xtrain_id.size()[0], batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y, batch_am = Xtrain_id[indices], labels[indices], Xtrain_am[indices]
        outputs = model(batch_x, attention_mask=batch_am, labels=batch_y)
        loss = outputs[0]  # when labels are passed, the first output is the loss
        loss.backward()
        optimizer.step()
Here outputs is a tuple containing the cross-entropy loss and the final activations (logits) of the model. For example, here is what those two tensors look like:

We can use these activations to classify the disaster tweets with the help of the softmax activation function.
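As a minimal sketch of that step, applied to the logits from the last training batch (all names as defined above):
import torch.nn.functional as F

logits = outputs[1]                            # shape (batch_size, 2), the activations from the tuple above
probs = F.softmax(logits, dim=-1)              # each row now sums to 1 and reads as class probabilities
predicted_labels = torch.argmax(probs, dim=-1) # 0 = not a disaster, 1 = disaster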
Now let’s test the model with the help of the remaining data in the training set. First, we have to put the model in evaluation mode and then take outputs from the model.
model.eval()  # put the model in evaluation mode

test_labels = torch.tensor(Xtest.target.values.tolist()).type(torch.LongTensor)  # labels for the held-out part of the training data

batch_size = 32
permutation = torch.randperm(Xtest_id.size()[0])
for i in range(0, Xtest_id.size()[0], batch_size):
    indices = permutation[i:i+batch_size]
    batch_x, batch_y, batch_am = Xtest_id[indices], test_labels[indices], Xtest_am[indices]
    outputs = model(batch_x, attention_mask=batch_am, labels=batch_y)
    loss = outputs[0]
    print('Loss:', loss)
You can also get an accuracy metric by comparing the predictions with the labels and calculating (correct predictions) / (total tweets) * 100.
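As a rough sketch, reusing the evaluation loop above, that accuracy could be accumulated across batches like this (just an illustration, not tuned code):
correct, total = 0, 0
for i in range(0, Xtest_id.size()[0], batch_size):
    indices = permutation[i:i+batch_size]
    batch_x, batch_y, batch_am = Xtest_id[indices], test_labels[indices], Xtest_am[indices]
    outputs = model(batch_x, attention_mask=batch_am, labels=batch_y)
    preds = torch.argmax(outputs[1], dim=-1)   # outputs[1] holds the logits
    correct += (preds == batch_y).sum().item()
    total += batch_y.size(0)
print('Accuracy:', correct / total * 100)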
Now, let’s make predictions on the actual test data, whose outputs we have to submit.
import torch.nn.functional as F  # for the softmax function

batch_size = 32
prediction = np.empty((0, 2))  # empty numpy array for collecting our outputs
ids = torch.tensor(range(test_id.size()[0]))
for i in range(0, test_id.size()[0], batch_size):
    indices = ids[i:i+batch_size]
    batch_x1, batch_am1 = test_id[indices], test_am[indices]
    pred = model(batch_x1, attention_mask=batch_am1)  # without labels, only the activations are returned
    pt_predictions = F.softmax(pred[0], dim=-1)  # applying the softmax activation function
    prediction = np.append(prediction, pt_predictions.detach().numpy(), axis=0)  # appending the predictions

As we can see, prediction has two columns: prediction[:,0] gives the probability of label 0 and prediction[:,1] gives the probability of label 1. We can use the argmax function to find the final label.
sub = np.argmax(prediction, axis=1)
Then, by pairing these labels with the proper id, we can build our submission.
submission = pd.DataFrame({'id': test.id.values, 'target':sub})
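To get a file ready for submission, the DataFrame can then be written out (assuming the usual Kaggle id/target format):
submission.to_csv('submission.csv', index=False)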

Using this model I got a score of 0.83695 and placed in the top 12%, without even cleaning or processing the data. So we can see how powerful this framework is and how it can be used for various purposes. You can also see the code here.
I hope my experience can help you in some way. Also, let me know what more could be done to improve the performance (as I am also a newbie in NLP :P).