Sentiment Analysis with Deep Learning

Objective
Sentiment analysis is a technique in natural language processing used to identify the emotions expressed in a piece of text. Common use cases include monitoring customer feedback on social media and tracking brands and campaigns.
In this article, we examine how to train your own sentiment analysis model on a custom dataset by fine-tuning a pre-trained HuggingFace model. We will also examine how to efficiently perform single and batch prediction with the fine-tuned model in both CPU and GPU environments. If you are looking for an out-of-the-box sentiment analysis model, check out my previous article on how to perform sentiment analysis in Python with just 3 lines of code.
Installation
pip install transformers
pip install fast_ml==3.68
pip install datasets
Import Packages
import numpy as np
import pandas as pd
from fast_ml.model_development import train_valid_test_split
from transformers import Trainer, TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch import nn
from torch.nn.functional import softmax
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import datasets
Enable GPU accelerator if it is available.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (f'Device available: {DEVICE}')
Data Preparation
We will be using an ecommerce dataset that contains text reviews and ratings for women’s clothes.
df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df.head()

We are only interested in the Review Text and Rating columns. The Review Text column serves as the input variable to the model and the Rating column is our target variable. It has values ranging from 1 (least favourable) to 5 (most favourable).
For clarity, let’s append "Star" or "Stars" after each integer rating.
df_reviews = df.loc[:, ['Review Text', 'Rating']].dropna()
df_reviews['Rating'] = df_reviews['Rating'].apply(lambda x: f'{x} Stars' if x != 1 else f'{x} Star')
This is what the data looks like now, where 1, 2, 3, 4 and 5 stars are our class labels.

Let’s encode the ratings using Sklearn’s LabelEncoder.
le = LabelEncoder()
df_reviews['Rating'] = le.fit_transform(df_reviews['Rating'])
df_reviews.head()
Notice that the Rating column has been transformed from a text column to an integer column.

The numbers in the Rating column range from 0 to 4. These are the class ids for the class labels, which will be used to train the model. Each class id corresponds to a rating.
print (le.classes_)
>> ['1 Star' '2 Stars' '3 Stars' '4 Stars' '5 Stars']
The position index of the list is the class id (0 to 4) and the value at the position is the original rating. For example at position number 3, the class id is "3" and it corresponds to the class label of "4 stars".
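To make the mapping concrete, here is a minimal, dependency-free sketch of how the ids end up assigned. It assumes, as LabelEncoder does, that class ids follow the sorted order of the distinct labels; the sample `ratings` list is made up for illustration.

```python
# Hypothetical sample of ratings, in the same "N Star(s)" format as df_reviews['Rating']
ratings = ["5 Stars", "1 Star", "4 Stars", "3 Stars", "2 Stars", "5 Stars"]

# Distinct labels in sorted order; the position of each label is its class id
classes = sorted(set(ratings))
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}

encoded = [label2id[r] for r in ratings]   # labels -> class ids
decoded = [id2label[i] for i in encoded]   # class ids -> labels

print(classes)
print(encoded)
print(id2label[3])
```

With the fitted encoder itself, le.inverse_transform performs the id-to-label direction.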
Let’s split the data into train, validation and test in the ratio of 80%, 10% and 10% respectively.
(train_texts, train_labels,
val_texts, val_labels,
test_texts, test_labels) = train_valid_test_split(df_reviews, target = 'Rating', train_size=0.8, valid_size=0.1, test_size=0.1)
Convert the review texts and labels from pandas objects to Python lists.
train_texts = train_texts['Review Text'].to_list()
train_labels = train_labels.to_list()
val_texts = val_texts['Review Text'].to_list()
val_labels = val_labels.to_list()
test_texts = test_texts['Review Text'].to_list()
test_labels = test_labels.to_list()
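For readers who want to see what an 80/10/10 split amounts to, the sketch below shuffles a list of rows and slices it into three parts. It is an illustration only, not fast_ml's implementation; the function name `split_80_10_10`, the seed and the toy `rows` are assumptions.

```python
import random

def split_80_10_10(rows, seed=0):
    """Shuffle rows and cut them into train/validation/test lists (80/10/10)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating point rounding in the cut points
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains, roughly 10%
    return train, valid, test

# Toy (text, label) pairs standing in for the review data
rows = [(f"review {i}", i % 5) for i in range(100)]
train, valid, test = split_80_10_10(rows)
print(len(train), len(valid), len(test))
```

A plain shuffle like this does not preserve class proportions across the splits; with imbalanced ratings, a stratified split may be worth considering.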

Create a DataLoader class for processing and loading the data during the training and inference phases.
class DataLoader(torch.utils.data.Dataset):
    def __init__(self, sentences=None, labels=None):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
        # Tokenize up front only when sentences are supplied
        if bool(sentences):
            self.encodings = self.tokenizer(self.sentences,
                                            truncation = True,
                                            padding = True)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is None:
            item['labels'] = None
        else:
            item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.sentences)
    def encode(self, x):
        # Tokenize a single sentence and move the tensors to the active device
        return self.tokenizer(x, return_tensors = 'pt').to(DEVICE)
Let’s take a look at the DataLoader in action.
train_dataset = DataLoader(train_texts, train_labels)
val_dataset = DataLoader(val_texts, val_labels)
test_dataset = DataLoader(test_texts, test_labels)
The DataLoader initializes a pretrained tokenizer and encodes the input sentences. We can get a single record from the DataLoader by using the __getitem__ function. Below is the result after an input sentence is tokenized.
print (train_dataset.__getitem__(0))

The output data is a dictionary consisting of 3 key-value pairs:
- input_ids: this contains a tensor of integers where each integer represents a token from the original sentence. The tokenizer step has transformed the individual words into tokens represented by integers. The first token, 101, is the start-of-sentence token and the 102 token is the end-of-sentence token. Notice that there are many trailing zeros; this is due to the padding that was applied to the sentences at the tokenizer step.
- attention_mask: this is an array of binary values. Each position of the attention_mask corresponds to the token in the same position in input_ids. 1 indicates that the token at the given position should be attended to and 0 indicates that the token at the given position is a padded value.
- labels: this is the target label
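The relationship between padding and the attention mask can be sketched without the tokenizer. The helper below is hypothetical (it is not part of the transformers library) and assumes a pad id of 0, matching the trailing zeros described above; apart from 101 and 102, the token ids are made up.

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 mark the start and end of each sentence; the ids in between are invented
batch = pad_batch([[101, 2293, 2023, 4377, 102], [101, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```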
Define Evaluation Metrics
We would like the model performance to be evaluated at intervals during the training phase. For that, we require a metrics computation function that accepts a tuple of (predictions, labels) as its argument and returns a dictionary of metrics: {'metric1': value1, 'metric2': value2}.
f1 = datasets.load_metric('f1')
accuracy = datasets.load_metric('accuracy')
precision = datasets.load_metric('precision')
recall = datasets.load_metric('recall')
def compute_metrics(eval_pred):
    metrics_dict = {}
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    metrics_dict.update(f1.compute(predictions = predictions, references = labels, average = 'macro'))
    metrics_dict.update(accuracy.compute(predictions = predictions, references = labels))
    metrics_dict.update(precision.compute(predictions = predictions, references = labels, average = 'macro'))
    metrics_dict.update(recall.compute(predictions = predictions, references = labels, average = 'macro'))
    return metrics_dict
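The 'macro' average gives every class the same weight regardless of how many examples it has, which is why macro F1 can sit well below accuracy when some classes are rarely predicted correctly. The sketch below is a simplified, dependency-free illustration of that averaging, not the library implementation; the function name and the toy predictions are assumptions.

```python
def macro_f1(predictions, labels, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(predictions, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(predictions, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(predictions, labels) if p != c and y == c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes 0 in this sketch
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy example: class 2 is never predicted, which drags the macro average down
labels      = [0, 0, 1, 1, 2]
predictions = [0, 0, 1, 0, 0]
accuracy_score = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
macro = macro_f1(predictions, labels, num_classes=3)
print(accuracy_score, macro)
```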
Training
Next, we instantiate a distilbert-base-uncased model from the pretrained checkpoint.
id2label = {idx:label for idx, label in enumerate(le.classes_)}
label2id = {label:idx for idx, label in enumerate(le.classes_)}
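config = AutoConfig.from_pretrained('distilbert-base-uncased',
                                    num_labels = 5,
                                    id2label = id2label,
                                    label2id = label2id)
# from_pretrained loads the checkpoint weights; from_config would only build the architecture with randomly initialised weights
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', config = config)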
- num_labels: number of classes
- id2label: dictionary that maps the class ids to class labels, {0: '1 Star', 1: '2 Stars', 2: '3 Stars', 3: '4 Stars', 4: '5 Stars'}
- label2id: dictionary that maps the class labels to class ids, {'1 Star': 0, '2 Stars': 1, '3 Stars': 2, '4 Stars': 3, '5 Stars': 4}
Let’s examine the model configuration. The id2label and label2id dictionaries have been incorporated into the configuration. We can retrieve these dictionaries from the model’s configuration during inference to find the corresponding class labels for the predicted class ids.
print (config)
>> DistilBertConfig {
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"id2label": {
"0": "1 Star",
"1": "2 Stars",
"2": "3 Stars",
"3": "4 Stars",
"4": "5 Stars"
},
"initializer_range": 0.02,
"label2id": {
"1 Star": 0,
"2 Stars": 1,
"3 Stars": 2,
"4 Stars": 3,
"5 Stars": 4
},
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"transformers_version": "4.6.1",
"vocab_size": 30522
}
We can also examine the model architecture using
print (model)
Set up the training arguments.
training_args = TrainingArguments(
    output_dir='/kaggle/working/results',
    num_train_epochs=10,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.05,
    report_to='none',
    evaluation_strategy='steps',
    logging_dir='/kaggle/working/logs',
    logging_steps=50)
- report_to enables logging of training artifacts and results to platforms such as mlflow, tensorboard, azure_ml etc. Here it is set to 'none', which turns this reporting off
- per_device_train_batch_size is the batch size per TPU/GPU/CPU during training. Lower this if you face out-of-memory issues on your device
- per_device_eval_batch_size is the batch size per TPU/GPU/CPU during evaluation. Lower this if you face out-of-memory issues on your device
- logging_steps determines how frequently the metrics are logged during training. With evaluation_strategy='steps' and no eval_steps set, evaluation runs at the same interval
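As a rough guide to what these arguments imply, the sketch below estimates how many optimisation steps and evaluations a run will perform. It assumes a single device and no gradient accumulation; the function name and the training set size of 18,000 are hypothetical, so substitute len(train_texts) for a real figure.

```python
import math

def training_schedule(n_train, batch_size, epochs, logging_steps, warmup_steps):
    """Estimate step counts for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be partial
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "n_evaluations": total_steps // logging_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }

schedule = training_schedule(n_train=18000, batch_size=64, epochs=10,
                             logging_steps=50, warmup_steps=500)
print(schedule)
```

If the warm-up fraction turns out to be a large share of the run, warmup_steps is probably too high for the dataset size.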
Instantiate the Trainer. Under the hood, the Trainer runs the training and evaluation loop based on the given training arguments, model, datasets and metrics.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics)
Let’s start training!
trainer.train()
Evaluation is performed every 50 steps. We can change the evaluation interval by changing the logging_steps argument in TrainingArguments. In addition to the default training and validation loss metrics, we also get the additional metrics which we defined in the compute_metrics function earlier.

Evaluation
Let’s evaluate our training on the test set.
test_results = trainer.predict(test_dataset)
The Trainer’s predict function returns 3 items:
- An array of raw prediction scores
print (test_results.predictions)

- The ground truth label ids
print (test_results.label_ids)
>> [1 1 4 ... 4 3 1]
- Metrics
print (test_results.metrics)
>> {'test_loss': 0.9638910293579102,
'test_f1': 0.28503729426950286,
'test_accuracy': 0.5982339955849889,
'test_precision': 0.2740061405117546,
'test_recall': 0.30397183356136337,
'test_runtime': 5.7367,
'test_samples_per_second': 394.826,
'test_mem_cpu_alloc_delta': 0,
'test_mem_gpu_alloc_delta': 0,
'test_mem_cpu_peaked_delta': 0,
'test_mem_gpu_peaked_delta': 348141568}
The model prediction function outputs unnormalized scores (logits). To find the class probabilities, we take a softmax across the unnormalized scores. The class with the highest probability is taken to be the predicted class; we can find it by taking the argmax of the class probabilities. The id2label attribute, which we stored in the model’s configuration earlier, can be used to map the class ids (0-4) to the class labels (1 Star, 2 Stars, ...).
id2label_mapper = model.config.id2label
# Softmax over the class dimension turns the logits into probabilities
proba = softmax(torch.from_numpy(test_results.predictions), dim = -1)
pred = [id2label_mapper[int(i)] for i in torch.argmax(proba, dim = -1).numpy()]
actual = [id2label_mapper[int(i)] for i in test_results.label_ids]
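The softmax-then-argmax step described above can be illustrated on a single row of scores with plain Python. This is a sketch for intuition only; the logits are invented and `softmax_row` is a hypothetical helper rather than a PyTorch function.

```python
import math

def softmax_row(logits):
    """Turn one row of unnormalized scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: '1 Star', 1: '2 Stars', 2: '3 Stars', 3: '4 Stars', 4: '5 Stars'}

logits = [-1.2, -0.3, 0.4, 1.1, 2.5]         # made-up scores for one review
proba = softmax_row(logits)
pred_id = max(range(len(proba)), key=proba.__getitem__)   # argmax
print(id2label[pred_id], round(proba[pred_id], 3))
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly yields the same predicted class; the softmax is only needed when the probabilities themselves are of interest.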
We use Sklearn’s classification_report to obtain the precision, recall, f1 and accuracy scores.
class_report = classification_report(actual, pred, output_dict = True)
pd.DataFrame(class_report)
Save Model
trainer.save_model('/kaggle/working/sentiment_model')
Inference
In this section, we look at how to load and perform predictions on the trained model. Let’s test out inference in a separate notebook.
Setup
import pandas as pd
import numpy as np
from transformers import Trainer, TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch import nn
from torch.nn.functional import softmax
Inference works in both GPU and CPU environments. Enable the GPU in your environment if it is available.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (f'Device available: {DEVICE}')
This is the same DataLoader that we used in the training phase.
class DataLoader(torch.utils.data.Dataset):
    def __init__(self, sentences=None, labels=None):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
        # Tokenize up front only when sentences are supplied
        if bool(sentences):
            self.encodings = self.tokenizer(self.sentences,
                                            truncation = True,
                                            padding = True)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is None:
            item['labels'] = None
        else:
            item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.sentences)
    def encode(self, x):
        # Tokenize a single sentence and move the tensors to the active device
        return self.tokenizer(x, return_tensors = 'pt').to(DEVICE)
Create a Model Class
The SentimentModel class helps to initialize the model and contains the predict_proba and batch_predict_proba methods for single and batch prediction respectively. The batch_predict_proba method uses HuggingFace’s Trainer to perform batch scoring.
class SentimentModel():
    def __init__(self, model_path):
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path).to(DEVICE)
        args = TrainingArguments(output_dir='/kaggle/working/results', per_device_eval_batch_size=64)
        self.batch_model = Trainer(model = self.model, args= args)
        self.single_dataloader = DataLoader()
    def batch_predict_proba(self, x):
        predictions = self.batch_model.predict(DataLoader(x))
        # Trainer.predict returns numpy arrays, so the logits are already on the CPU
        logits = torch.from_numpy(predictions.predictions)
        proba = torch.nn.functional.softmax(logits, dim = 1).numpy()
        return proba
    def predict_proba(self, x):
        # encode() already moves the tensors to DEVICE
        x = self.single_dataloader.encode(x)
        # Gradients are not needed at inference time
        with torch.no_grad():
            predictions = self.model(**x)
        logits = predictions.logits
        # .cpu() is a no-op when the tensor is already on the CPU
        proba = torch.nn.functional.softmax(logits, dim = 1).cpu().numpy()
        return proba
Data Preparation
Let’s load some sample data
df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df_reviews = df.loc[:, ['Review Text', 'Rating']].dropna()
df_reviews['Rating'] = df_reviews['Rating'].apply(lambda x: f'{x} Stars' if x != 1 else f'{x} Star')
df_reviews.head()
We will create two sets of data. One for batch scoring and the other for single scoring.
batch_sentences = df_reviews.sample(n = 10000, random_state = 1)['Review Text'].to_list()
single_sentence = df_reviews.sample(n = 1, random_state = 1)['Review Text'].to_list()[0]
Predict
Instantiate the model
sentiment_model = SentimentModel('../input/fine-tune-huggingface-sentiment-analysis/sentiment_model')
Predict on a single sentence using the predict_proba method.
single_sentence_probas = sentiment_model.predict_proba(single_sentence)
id2label = sentiment_model.model.config.id2label
predicted_class_label = id2label[np.argmax(single_sentence_probas)]
print (predicted_class_label)
>> 5 Stars
Predict on a batch of sentences using the batch_predict_proba method.
batch_sentence_probas = sentiment_model.batch_predict_proba(batch_sentences)
predicted_class_labels = [id2label[i] for i in np.argmax(batch_sentence_probas, axis = -1)]
Speed of Inference
Let’s compare the inference speed of the predict_proba and batch_predict_proba methods on 10k samples in CPU and GPU environments. With predict_proba, we iterate through the 10k samples and make a single prediction at a time, while with batch_predict_proba we score all 10k samples without iteration.
%%time
for sentence in batch_sentences:
    single_sentence_probas = sentiment_model.predict_proba(sentence)
%%time
batch_sentence_probas = sentiment_model.batch_predict_proba(batch_sentences)
GPU Environment
Iterating through predict_proba takes ~2 minutes while batch_predict_proba takes ~30 seconds for 10k samples. Batch prediction is almost 4 times faster than single prediction in a GPU environment.

CPU Environment
In a CPU environment, predict_proba took ~14 minutes while batch_predict_proba took ~40 minutes, almost 3 times longer.

Therefore, for a large dataset, use batch_predict_proba if you have a GPU. If you do not have access to a GPU, you are better off iterating through the dataset using predict_proba.
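One plausible contributor to the CPU result, though not something measured in this article, is padding. The DataLoader tokenizes all sentences in one call with padding = True, so every sentence is padded to the longest one in the set, whereas predict_proba encodes each sentence at its own length. The sketch below counts how many token positions each approach has to process; the function name and the sentence lengths are hypothetical.

```python
def padded_token_count(lengths, batch_size=None):
    """Token positions processed when each batch is padded to its own longest sentence.

    batch_size=None pads the whole dataset to a single global maximum.
    """
    if batch_size is None:
        return len(lengths) * max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        total += len(chunk) * max(chunk)
    return total

lengths = [12, 30, 8, 120]            # made-up token counts for four reviews

unpadded = sum(lengths)               # one sentence at a time, no padding
global_padded = padded_token_count(lengths)
per_batch = padded_token_count(lengths, batch_size=2)
print(unpadded, per_batch, global_padded)
```

If padding does turn out to matter on your data, padding each mini-batch to its own longest sentence is one option to explore for narrowing the gap.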
Conclusion
In this article we examined:
- How to train your own Deep Learning sentiment analysis model by fine-tuning a pretrained HuggingFace model
- How to create single and batch prediction methods for scoring
- Inference speed for single and batch scoring in both CPU and GPU environments
The notebooks for this article can be found here: