
How To Do Sentiment Analysis With Augmented Data Using Transformers & Transfer Learning – Part 2

Last part of the two-part series on how to perform NLP sentiment analysis with little to no labeled data.

Photo by Elaine Casap on Unsplash

It’s undisputed: we live in a world of data. Since the internet became widespread, data has grown exponentially. According to Forbes, 90% of the world’s data was generated between 2016 and 2018, meaning that all the data created before then, from the dawn of human history up to 2016, made up only the remaining 10%. Today that earlier share is even smaller.

I believe data is the fuel for knowledge. And today data is everywhere.

Yet the challenge is making sense of it. Almost 5 billion YouTube videos are watched every day, and 500 million tweets are published on Twitter daily. With the rise of social media platforms and online news sites, accessing facts, ideas, opinions and misinformation is easier than ever. There is, however, a setback: the data out there is messy, unstructured and sometimes outright false. How do we harness the power of this information in a meaningful way? In this article we are going to look at one way to do just that.

In the previous part we performed data augmentation on a small set of financial articles. This resulted in a new dataset that was over 50 times larger than the original source and big enough to move on to our next endeavor: sentiment analysis.

Sentiment analysis is a form of Natural Language Processing (NLP) that identifies and quantifies the emotional states and subjective information expressed in text about the topics, persons and entities within it. It has traditionally been performed with algorithms such as Naïve Bayes or Support Vector Machines, and it has historically been a supervised task: models are trained on large datasets that are carefully selected, categorized and labeled manually by humans.

Today the consensus is clear: deep learning methods achieve better accuracy on most NLP tasks than other approaches. And right now the undisputed kings of NLP are large pre-trained models called Transformers. Their architecture is designed for sequence-to-sequence tasks and handles long-range dependencies between inputs and outputs using attention rather than recurrence.

What makes Transformers so special is not only that most have been trained on so much data, and optimized so heavily, that they are useful even off the shelf, but also that they can be easily fine-tuned to perform really well even when very little labeled data is available.

Fine-tuning in deep learning means reusing the weights of a previously trained model as the starting point for another, similar task in order to achieve the desired output or to enhance performance on the target task. For this exercise I chose Bidirectional Encoder Representations from Transformers, or simply BERT.

BERT is a "bidirectional", semi-supervised model, meaning that it learns information from both the left and the right side of a token’s context (tokens are small units of text) during training. It has been pre-trained on unlabeled data extracted from the BooksCorpus (800M words) and English Wikipedia (2,500M words). BERT learns by predicting masked words, specifically from the words that lead and follow a particular word or phrase.

The first thing I like to do is set my hyperparameters as constants. This makes my job a lot easier, since all the changes I need to make to my code are in one place.

MODEL_NAME = 'bert-base-cased'
BATCH_SIZE = 125
LR = 1e-4
MAX_LEN = 100
EPOCHS = 10
LABEL_NUM = 2
device = 'cuda'  # use 'cuda' if a graphics card is available, otherwise 'cpu'

The hyperparameters here are defined as:

  • ‘MODEL_NAME’ – This is the parameter that Transformers uses to download the model we’ll be using
  • ‘BATCH_SIZE’ – Is the number of samples that will be propagated through the network
  • ‘LR’ – Controls how quickly the model is learning
  • ‘MAX_LEN’ – Is the maximum length of the sequences we’ll feed the model.
  • ‘EPOCHS’ – Is the number of cycles the model will go through the dataset
  • ‘LABEL_NUM’ – The number of classes that we are looking for
  • ‘device’ – Use ‘cuda’ if you have access to a graphics card, otherwise use ‘cpu’

Before we can train the model we need to do some housekeeping. To start we’ll encode the labeled classes in our data, then split the data into train, validation and test sets. Train and validation will be used to train and monitor our model, and test will be used to measure the model’s predictive performance once training is done.
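My preprocessing code isn’t reproduced here, but a minimal sketch could look like the following, assuming the augmented articles live in a pandas DataFrame with hypothetical ‘text’ and ‘sentiment’ columns (the file name is also illustrative):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Load the augmented dataset produced in part one (file name is illustrative)
df = pd.read_csv('augmented_financial_news.csv')

# Encode the string sentiment labels as integers (e.g. negative -> 0, positive -> 1)
encoder = LabelEncoder()
df['label'] = encoder.fit_transform(df['sentiment'])

# Split into train (80%), validation (10%) and test (10%) sets
train_df, temp_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['label'], random_state=42)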

Most tokenizers are unique to a model, and when fine-tuning BERT we have to use its own tokenizer, aptly named BertTokenizer.
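As a brief illustration, loading the tokenizer from the Transformers library and encoding a single (made-up) sentence could look like this:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

# Encode one sample sentence, padding/truncating it to MAX_LEN tokens
sample = tokenizer('Company earnings beat expectations this quarter.',
                   padding='max_length', truncation=True, max_length=MAX_LEN)
print(sample['input_ids'][:10])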

The tokenized data is then packed into what are called DataLoaders. A DataLoader is an iterable container that is used to feed data to the model in batches. Both the train and validation data need to be wrapped in a DataLoader for the model to learn and validate itself efficiently. To construct them I built a helper function that takes a data input and returns a DataLoader.
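The exact helper isn’t shown here, but a minimal version of it, reusing the splits and tokenizer from the sketches above, could look like this:

import torch
from torch.utils.data import TensorDataset, DataLoader

def build_dataloader(texts, labels, batch_size=BATCH_SIZE):
    # Tokenize the whole split at once, padding/truncating every sequence to MAX_LEN
    enc = tokenizer(list(texts), padding='max_length', truncation=True,
                    max_length=MAX_LEN, return_tensors='pt')
    dataset = TensorDataset(enc['input_ids'], enc['attention_mask'],
                            torch.tensor(list(labels)))
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

train_loader = build_dataloader(train_df['text'], train_df['label'])
val_loader = build_dataloader(val_df['text'], val_df['label'])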

Once the DataLoaders are constructed we can move on to training the model. For our optimizer we are choosing Adam, and the total number of training steps is the number of batches in the DataLoader times the number of epochs.
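A rough sketch of that setup, assuming the Hugging Face BertForSequenceClassification head and the constants defined earlier, might be:

from transformers import BertForSequenceClassification

# Load the pre-trained BERT body with a fresh 2-class classification head
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=LABEL_NUM)
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=LR)
total_steps = len(train_loader) * EPOCHS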

We’ll add a learning-rate scheduler that is stepped every time a batch is fed to the model. We are also guarding against exploding gradients (gradients that grow uncontrollably large as they are propagated back through the network) by clipping them with PyTorch’s nn.utils.clip_grad_norm_.
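One common way to set up such a scheduler, and the one assumed in the sketches here, is the linear schedule that ships with the Transformers library:

from transformers import get_linear_schedule_with_warmup

# Linearly decay the learning rate over the full training run, with no warmup steps
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)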

With each iteration the model will learn from our data and compute its loss score, which, if everything is set up correctly, will keep decreasing until training is complete.
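Putting the optimizer, scheduler and gradient clipping together, a simplified version of the training loop could look like this:

from torch import nn

model.train()
for epoch in range(EPOCHS):
    epoch_loss = 0.0
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=labels.to(device))
        loss = outputs.loss                                  # cross-entropy loss computed by the model
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # clip exploding gradients
        optimizer.step()
        scheduler.step()                                     # advance the learning-rate schedule each batch
        epoch_loss += loss.item()
    print(f'Epoch {epoch + 1} - average training loss: {epoch_loss / len(train_loader):.4f}')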

Subsequently we’ll write the validation portion of the code. Once it’s done, it will return the average loss per DataLoader batch, the predictions and the true values of the data.
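A minimal sketch of such an evaluation function, reusing the model and device from above (the name ‘evaluate’ is just illustrative), might look like this:

def evaluate(dataloader):
    # Returns the average loss per batch, the predicted labels and the true labels
    model.eval()
    total_loss, preds, trues = 0.0, [], []
    with torch.no_grad():
        for input_ids, attention_mask, labels in dataloader:
            outputs = model(input_ids=input_ids.to(device),
                            attention_mask=attention_mask.to(device),
                            labels=labels.to(device))
            total_loss += outputs.loss.item()
            preds.extend(outputs.logits.argmax(dim=1).cpu().tolist())
            trues.extend(labels.tolist())
    return total_loss / len(dataloader), preds, trues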

We’ll use the predictions and the actual values to calculate the F1 score. F1 combines precision and recall into a single measure of the model’s performance on a given dataset.
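One way to compute both metrics, using scikit-learn and the evaluation sketch above, could be:

from sklearn.metrics import accuracy_score, f1_score

val_loss, preds, trues = evaluate(val_loader)
print(f'Validation loss: {val_loss:.4f}')
print(f'Accuracy:        {accuracy_score(trues, preds):.4f}')
print(f'F1 score:        {f1_score(trues, preds):.4f}')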

Next we’ll train two models: one using the original data and another using the augmented dataset created in the first part of this series. We’ll calculate their respective accuracy and F1 scores, and finally test the two models against unseen data to see how well each performs.
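In terms of the sketches above, scoring a trained model on the held-out articles amounts to running the same evaluation function over the test split, for example:

# Build a loader for the unseen articles and score the model on it
test_loader = build_dataloader(test_df['text'], test_df['label'])
test_loss, test_preds, test_trues = evaluate(test_loader)
print(f'Test accuracy: {accuracy_score(test_trues, test_preds):.4f}')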

The results for the model trained on the small/original data source are as follows:

Output from the model trained on the original data

By the third epoch neither our training nor our validation loss made it below 0.6. This poor score tells us that the model didn’t have enough data to learn and generalize. The F1 score doesn’t reach 0.5, and that poor accuracy is reflected in the model’s failure to correctly predict almost half of the new samples.

However the results for the model trained on the augmented dataset paint a much better picture:

Output from the model trained on the augmented data

By the last epoch our training and validation loss were both low and very close to each other. The F1 score reached 0.99, as did the accuracy score. The model was then used to predict the values of unseen articles and correctly predicted the actual sentiment over 99% of the time.

As you can see, the model trained on the augmented dataset handles our test data well, so we can conclude that it has been fine-tuned enough to perform sentiment analysis on financial data effectively.

Conclusion:

In this two-part series we covered how data augmentation and transfer learning with Transformers can help us create state-of-the-art classification models despite lacking large amounts of labeled data.

In case you missed it and you are curious to know how to do data augmentation from scratch, click here to jump to the first part of this series: Augment Your Small Dataset Using Transformers & Synonym Replacement for Sentiment Analysis
