
How To Train A Deep Learning Sentiment Analysis Model

Train your own high-performing sentiment analysis model

Sentiment Analysis with Deep Learning

Photo by Pietro Jeng on Unsplash

Objective

Sentiment analysis is a technique in natural language processing used to identify the emotions associated with text. Common use cases of sentiment analysis include monitoring customer feedback on social media and tracking brand and campaign sentiment.

In this article, we examine how you can train your own sentiment analysis model on a custom dataset by leveraging a pre-trained HuggingFace model. We will also examine how to efficiently perform single and batch prediction with the fine-tuned model in both CPU and GPU environments. If you are looking for an out-of-the-box sentiment analysis model, check out my previous article on how to perform sentiment analysis in Python with just 3 lines of code.

Installation

pip install transformers
pip install fast_ml==3.68
pip install datasets

Import Packages

import numpy as np
import pandas as pd
from fast_ml.model_development import train_valid_test_split
from transformers import Trainer, TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch import nn
from torch.nn.functional import softmax
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import datasets

Enable GPU accelerator if it is available.

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (f'Device Available: {DEVICE}')

Data Preparation

We will be using an e-commerce dataset that contains text reviews and ratings for women's clothes.

df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df.head()

We are only interested in the Review Text and Rating columns. The Review Text column serves as the input variable to the model, and the Rating column is our target variable; it has values ranging from 1 (least favourable) to 5 (most favourable).

For clarity, let’s append "Star" or "Stars" behind each integer rating.

df_reviews = df.loc[:, ['Review Text', 'Rating']].dropna()
df_reviews['Rating'] = df_reviews['Rating'].apply(lambda x: f'{x} Stars' if x != 1 else f'{x} Star')

This is what the data looks like now, where 1, 2, 3, 4 and 5 Stars are our class labels.
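As a quick sanity check (not part of the original notebook), we can count the rows per class label:

print (df_reviews['Rating'].value_counts())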

Let’s encode the ratings using Sklearn’s LabelEncoder.

le = LabelEncoder()
df_reviews['Rating'] = le.fit_transform(df_reviews['Rating'])
df_reviews.head()

Notice that the Rating column has been transformed from a text column to an integer column.

The numbers in the Rating column range from 0 to 4. These are the class ids for the class labels, which will be used to train the model. Each class id corresponds to a rating.

print (le.classes_)
>> ['1 Star' '2 Stars' '3 Stars' '4 Stars' '5 Stars']

The position index of the list is the class id (0 to 4) and the value at that position is the original rating. For example, at position 3 the class id is 3 and it corresponds to the class label "4 Stars".
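To make the mapping explicit, here is a small illustrative check (the id_to_label name is ours, not from the original notebook):

id_to_label = dict(enumerate(le.classes_))
print (id_to_label[3])
>> 4 Stars
print (le.inverse_transform([3])[0])
>> 4 Stars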

Let’s split the data into train, validation and test in the ratio of 80%, 10% and 10% respectively.

(train_texts, train_labels,
 val_texts, val_labels,
 test_texts, test_labels) = train_valid_test_split(df_reviews, target = 'Rating', train_size=0.8, valid_size=0.1, test_size=0.1)

Convert the review text from a pandas series to a list of sentences.

train_texts = train_texts['Review Text'].to_list()
train_labels = train_labels.to_list()
val_texts = val_texts['Review Text'].to_list()
val_labels = val_labels.to_list()
test_texts = test_texts['Review Text'].to_list()
test_labels = test_labels.to_list()

Create a DataLoader class for processing and loading the data during the training and inference phases.

class DataLoader(torch.utils.data.Dataset):
    def __init__(self, sentences=None, labels=None):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

        # Tokenize upfront only when sentences are supplied; an empty
        # DataLoader() can still be used for its encode method
        if bool(sentences):
            self.encodings = self.tokenizer(self.sentences,
                                            truncation = True,
                                            padding = True)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

        if self.labels is None:
            item['labels'] = None
        else:
            item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.sentences)

    def encode(self, x):
        # Tokenize a single raw string and move the tensors to the active device
        return self.tokenizer(x, return_tensors = 'pt').to(DEVICE)

Let’s take a look at the DataLoader in action.

train_dataset = DataLoader(train_texts, train_labels)
val_dataset = DataLoader(val_texts, val_labels)
test_dataset = DataLoader(test_texts, test_labels)

The DataLoader initializes a pretrained tokenizer and encodes the input sentences. We can get a single record from the DataLoader using the __getitem__ method. Below is the result after an input sentence is tokenized.

print (train_dataset.__getitem__(0))

The output is a dictionary consisting of 3 key-value pairs:

  • input_ids: this contains a tensor of integers where each integer represents a token from the original sentence. The tokenizer step has transformed the individual words into tokens represented by these integers. The first token, 101, is the start-of-sentence token and the 102 token is the end-of-sentence token. Notice that there are many trailing zeros; this is due to the padding applied to the sentences at the tokenizer step (see the decoding check after this list).
  • attention_mask: this is an array of binary values. Each position of the attention_mask corresponds to a token in the same position in the input_ids. 1 indicates that the token at the given position should be attended to and 0 indicates that the token at the given position is a padded value.
  • labels: this is the target label
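As a quick sanity check (illustrative, not part of the original notebook), we can decode the first record's input_ids back into tokens to see the special tokens described above:

sample = train_dataset[0]
tokens = train_dataset.tokenizer.convert_ids_to_tokens(sample['input_ids'].tolist())
print (tokens[:3], tokens[-3:])   # 101 decodes to [CLS], 102 to [SEP], 0 to [PAD]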

Define Evaluation Metrics

We would like the model performance to be evaluated at intervals during the training phase. For that we require a metrics computation function that accepts a tuple of (predictions, labels) as its argument and returns a dictionary of metrics: {'metric1': value1, 'metric2': value2}.

f1 = datasets.load_metric('f1')
accuracy = datasets.load_metric('accuracy')
precision = datasets.load_metric('precision')
recall = datasets.load_metric('recall')
def compute_metrics(eval_pred):
    metrics_dict = {}
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    metrics_dict.update(f1.compute(predictions = predictions, references = labels, average = 'macro'))
    metrics_dict.update(accuracy.compute(predictions = predictions, references = labels))
    metrics_dict.update(precision.compute(predictions = predictions, references = labels, average = 'macro'))
    metrics_dict.update(recall.compute(predictions = predictions, references = labels, average = 'macro'))
    return metrics_dict
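To see the expected output format, here is a toy invocation with made-up logits (illustrative only; the values are arbitrary):

toy_logits = np.eye(5)[[0, 3, 4]]   # one-hot rows, so argmax "predicts" classes 0, 3, 4
toy_labels = np.array([0, 3, 2])
print (compute_metrics((toy_logits, toy_labels)))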

Training

Next, we configure and instantiate a distilbert-base-uncased model from the pretrained checkpoint's configuration.

id2label = {idx:label for idx, label in enumerate(le.classes_)}
label2id = {label:idx for idx, label in enumerate(le.classes_)}
config = AutoConfig.from_pretrained('distilbert-base-uncased',
                                    num_labels = 5,
                                    id2label = id2label,
                                    label2id = label2id)
model = AutoModelForSequenceClassification.from_config(config)
  • num_labels: number of classes
  • id2label: dictionary that maps the class ids to class labels {0: '1 Star', 1: '2 Stars', 2: '3 Stars', 3: '4 Stars', 4: '5 Stars'}
  • label2id: dictionary that maps the class labels to class ids {'1 Star': 0, '2 Stars': 1, '3 Stars': 2, '4 Stars': 3, '5 Stars': 4}
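One caveat worth flagging: AutoModelForSequenceClassification.from_config builds the architecture from the configuration but initializes all weights randomly. If the goal is to fine-tune from the pretrained checkpoint rather than train from scratch, load the weights explicitly instead; a minimal sketch:

model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', config = config)
# the DistilBERT body loads pretrained weights; only the new
# 5-way classification head is randomly initialized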

Let’s examine the model configuration. The id2label and label2id dictionaries have been incorporated into the configuration. We can retrieve these dictionaries from the model’s configuration during inference to find out the class labels corresponding to the predicted class ids.

print (config)
>> DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "1 Star",
    "1": "2 Stars",
    "2": "3 Stars",
    "3": "4 Stars",
    "4": "5 Stars"
  },
  "initializer_range": 0.02,
  "label2id": {
    "1 Star": 0,
    "2 Stars": 1,
    "3 Stars": 2,
    "4 Stars": 3,
    "5 Stars": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.6.1",
  "vocab_size": 30522
}

We can also examine the model architecture using

print (model)

Set up the training arguments.

training_args = TrainingArguments(
    output_dir='/kaggle/working/results',
    num_train_epochs=10,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.05,
    report_to='none',
    evaluation_strategy='steps',
    logging_dir='/kaggle/working/logs',
    logging_steps=50)
  • report_to enables logging of training artifacts and results to platforms such as mlflow, tensorboard, azure_ml etc.
  • per_device_train_batch_size is the batch size per TPU/GPU/CPU during training. Lower this if you face out-of-memory issues on your device
  • per_device_eval_batch_size is the batch size per TPU/GPU/CPU during evaluation. Lower this if you face out-of-memory issues on your device
  • logging_steps determines how frequently the metrics evaluation is done during training

Instantiate the Trainer. Under the hood, the Trainer runs the training and evaluation loop based on the given training arguments, model, datasets and metrics.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics)

Let’s start training!

trainer.train()

Evaluation is performed every 50 steps; with evaluation_strategy='steps' and no explicit eval_steps, the evaluation interval defaults to logging_steps. In addition to the default training and validation loss metrics, we also get the additional metrics we defined in the compute_metrics function earlier.
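For example, to evaluate less frequently than you log, you could set eval_steps explicitly (a hypothetical tweak, not used in this notebook):

training_args = TrainingArguments(
    output_dir='/kaggle/working/results',
    evaluation_strategy='steps',
    eval_steps=200,      # evaluate every 200 steps
    logging_steps=50)    # but still log every 50 steps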

Evaluation

Let’s evaluate the trained model on the test set.

test_results = trainer.predict(test_dataset)

The Trainer's predict function returns 3 items:

  1. An array of raw prediction scores

print (test_results.predictions)

  2. The ground truth label ids

print (test_results.label_ids)

>> [1 1 4 ... 4 3 1]

  3. Metrics

print (test_results.metrics)

>> {'test_loss': 0.9638910293579102,
    'test_f1': 0.28503729426950286,
    'test_accuracy': 0.5982339955849889,
    'test_precision': 0.2740061405117546,
    'test_recall': 0.30397183356136337,
    'test_runtime': 5.7367,
    'test_samples_per_second': 394.826,
    'test_mem_cpu_alloc_delta': 0,
    'test_mem_gpu_alloc_delta': 0,
    'test_mem_cpu_peaked_delta': 0,
    'test_mem_gpu_peaked_delta': 348141568}

The model prediction function outputs unnormalized scores (logits). To find the class probabilities, we take a softmax over the logits. The class with the highest probability is taken to be the predicted class; we can find it by taking the argmax of the class probabilities. The id2label attribute which we stored in the model’s configuration earlier can be used to map the class ids (0-4) to the class labels (1 Star, 2 Stars, ...).

id2label_mapper = model.config.id2label
proba = softmax(torch.from_numpy(test_results.predictions), dim = -1)
pred = [id2label_mapper[i] for i in torch.argmax(proba, dim = -1).numpy()]
actual = [id2label_mapper[i] for i in test_results.label_ids]

We use Sklearn’s classification_report to obtain the precision, recall, f1 and accuracy scores.

class_report = classification_report(actual, pred, output_dict = True)
pd.DataFrame(class_report)

Save Model

trainer.save_model('/kaggle/working/sentiment_model')
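Note that trainer.save_model writes the model weights and configuration but not the tokenizer. The inference DataLoader below re-downloads the distilbert-base-uncased tokenizer, so this works here; still, saving the tokenizer alongside the model keeps the artifact self-contained (an optional extra, not in the original notebook):

train_dataset.tokenizer.save_pretrained('/kaggle/working/sentiment_model')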

Inference

In this section, we look at how to load and perform predictions on the trained model. Let’s test out inference in a separate notebook.

Setup

import pandas as pd
import numpy as np
from transformers import Trainer, TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch import nn
from torch.nn.functional import softmax

Inference works in both GPU and CPU environments. Enable the GPU in your environment if it is available.

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (f'Device Available: {DEVICE}')

This is the same DataLoader class we used in the training phase.

class DataLoader(torch.utils.data.Dataset):
    def __init__(self, sentences=None, labels=None):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

        # Tokenize upfront only when sentences are supplied; an empty
        # DataLoader() can still be used for its encode method
        if bool(sentences):
            self.encodings = self.tokenizer(self.sentences,
                                            truncation = True,
                                            padding = True)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

        if self.labels is None:
            item['labels'] = None
        else:
            item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.sentences)

    def encode(self, x):
        # Tokenize a single raw string and move the tensors to the active device
        return self.tokenizer(x, return_tensors = 'pt').to(DEVICE)

Create a Model Class

The SentimentModel class initializes the model and contains the predict_proba and batch_predict_proba methods for single and batch prediction respectively. The batch_predict_proba method uses HuggingFace’s Trainer to perform batch scoring.

class SentimentModel():

    def __init__(self, model_path):
        # Load the fine-tuned model and wrap it in a Trainer for batch scoring
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path).to(DEVICE)
        args = TrainingArguments(output_dir='/kaggle/working/results', per_device_eval_batch_size=64)
        self.batch_model = Trainer(model = self.model, args = args)
        self.single_dataloader = DataLoader()

    def batch_predict_proba(self, x):
        # Trainer.predict returns the logits as a numpy array (already on CPU)
        predictions = self.batch_model.predict(DataLoader(x))
        logits = torch.from_numpy(predictions.predictions)

        # .cpu() is a no-op for tensors already on the CPU, so the same line
        # is safe in both CPU and GPU environments; this also avoids the
        # fragile comparison of a torch.device against the string 'cpu'
        proba = softmax(logits, dim = 1).detach().cpu().numpy()
        return proba

    def predict_proba(self, x):
        # encode already moves the tensors to DEVICE
        x = self.single_dataloader.encode(x)
        predictions = self.model(**x)
        logits = predictions.logits

        proba = softmax(logits, dim = 1).detach().cpu().numpy()
        return proba

Data Preparation

Let’s load some sample data.

df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df_reviews = df.loc[:, ['Review Text', 'Rating']].dropna()
df_reviews['Rating'] = df_reviews['Rating'].apply(lambda x: f'{x} Stars' if x != 1 else f'{x} Star')
df_reviews.head()

We will create two sets of data: one for batch scoring and the other for single scoring.

batch_sentences = df_reviews.sample(n = 10000, random_state = 1)['Review Text'].to_list()
single_sentence = df_reviews.sample(n = 1, random_state = 1)['Review Text'].to_list()[0]

Predict

Instantiate the model.

sentiment_model = SentimentModel('../input/fine-tune-huggingface-sentiment-analysis/sentiment_model')

Predict on a single sentence using the predict_proba method.

single_sentence_probas = sentiment_model.predict_proba(single_sentence)
id2label = sentiment_model.model.config.id2label
predicted_class_label = id2label[np.argmax(single_sentence_probas)]
print (predicted_class_label)
>> 5 Stars

Predict on a batch of sentences using the batch_predict_proba method.

batch_sentence_probas = sentiment_model.batch_predict_proba(batch_sentences)
predicted_class_labels = [id2label[i] for i in np.argmax(batch_sentence_probas, axis = -1)]
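To eyeball a few of the batch predictions (an optional check of ours, not in the original notebook):

pd.DataFrame({'review': batch_sentences[:3], 'predicted': predicted_class_labels[:3]})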

Speed of Inference

Let’s compare the inference speed of the predict_proba and batch_predict_proba methods on the 10k samples in both CPU and GPU environments. For predict_proba we iterate through the samples, making a single prediction at a time, while batch_predict_proba scores all 10k samples in one call.

%%time
for sentence in batch_sentences:
    single_sentence_probas = sentiment_model.predict_proba(sentence)

%%time
batch_sentence_probas = sentiment_model.batch_predict_proba(batch_sentences)

GPU environment

Iterating with predict_proba takes ~2 minutes while batch_predict_proba takes ~30 seconds for the 10k samples. Batch prediction is almost 4 times faster than single prediction in a GPU environment.

CPU Environment

In a CPU environment, predict_proba took ~14 minutes while batch_predict_proba took ~40 minutes, almost 3 times longer.

Therefore, for large datasets, use batch_predict_proba if you have a GPU. If you do not have access to a GPU, you are better off iterating through the dataset with predict_proba.

Conclusion

In this article we examined:

  • How to train your own deep learning sentiment analysis model by leveraging a pretrained HuggingFace model
  • How to create single and batch prediction methods for scoring
  • Inference speed for single and batch scoring in both CPU and GPU environments

The notebooks for this article can be found here:

