
Mining Opinions to Understand Customer Trends: Part 1 of 2

Fine-tuning pre-trained Transformers with PyTorch

Photo by Şahin Sezer Dinçer on Unsplash

With the wealth of information available on the internet, mining opinions to understand how customers feel about your brand, products, and services has become an important success metric. Companies are investing heavily in AI to extract worthy insights from the minds of their customers.

"Your most unhappy customers are your greatest source of learning" ~ Bill Gates

To mine customer opinions, companies often leverage advances in Natural Language Processing (NLP). In this blog, I want to explore some key concepts of NLP that can be used to solve the problem at hand.

Objective

In this 2-part series, I use Transformers, a model that changed the face of NLP in 2017, to analyze customer sentiment across different product domains using Amazon reviews. Along with Transformers, I explore the concepts of parallel computing, transfer learning, and interactive dashboards.

Part 1 of the series focuses on fine-tuning pre-trained Transformers, and Part 2 details my experiment with creating interactive dashboards in Jupyter Notebooks.


1. Before We Code…

1a. Transformer-ing NLP

Photo by Samule Sun on Unsplash

The Transformer is an NLP encoder-decoder architecture that uses the multi-head self-attention mechanism to process input sequences in parallel.

Let’s break down the above sentence and develop an intuition for each part:

  • "NLP": Natural Language Processing or NLP is a field in machine learning which helps machines derive meaning from human language. Applications of NLP range from understanding language (like summarising text, social media monitoring) to generating language (like creating captions for pictures) and sometimes doing both simultaneously (like language translation, chatbots).
  • "encoder-decoder architecture": The computer/machine cannot understand words, hence we feed it language in the form of numbers. Encoding means converting data into a coded message (in our case a numeric vector a.k.a. hidden state) and decoding means converting a coded message into understandable language. The decoder output would depend on the objective of the model, for example, while translating English to Hindi, the decoder would convert the coded message to the Hindi language.
  • "self-attention mechanism": The drawback of a simple encoder-decoder architecture is that it cannot skill-fully manage long sentence sequences due to the vanishing gradient problem. To solve for this, Attention Mechanism was introduced. The purpose was to allow the NLP model to focus more on relevant parts of the input sequence while decoding each time step. There are different types of attention, the one used by transformers is called Self-Attention. It captures the contextual relationship between words in a sentence by utilizing context vectors.

Attention is one of the most important concepts to understand about Transformers and NLP in general, read more about it here!

  • "multi-head attention": The self-attention process is repeated multiple times in parallel and each of these is called a head, hence the name ‘multi-head’ attention. Multi-head Attention allows the embeddings to learn different aspects of the meanings of each word. A way to think about it is that a building can be described by its height, width, color, location, etc. and having different descriptions makes the final picture richer.
  • "parallel process input sequences": The predecessors of Transformers encoded the input in sequence (one word after the other, left-to-right or right-to-left) disallowing the models to leverage the magic of GPU’s parallel computing. Given the Transformer architecture creates context vectors for each time step, the words are independent of each other and can be processed parallelly. This property also allows for bidirectionality in the model – meaning every word prediction considers context on both sides of the current word. (fig-1)
Fig-1: Transformers consider context in both directions during encoding (Image by author)
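To make the self-attention idea concrete, here is a minimal PyTorch sketch of scaled dot-product self-attention. It is a simplification for intuition only: real Transformers apply learned linear projections to produce separate queries, keys, and values, whereas here all three are just the raw input.

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    # x: (seq_len, d_model) - one embedded sentence.
    # Queries, keys, and values all come from the same input, hence "self"-attention.
    q, k, v = x, x, x
    d_k = x.size(-1)
    # Similarity of every word with every other word, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # each row sums to 1: how much to "attend" to each word
    return weights @ v                    # context-aware representation of every word

x = torch.randn(5, 16)        # 5 tokens, 16-dimensional embeddings
out = self_attention(x)
print(out.shape)              # torch.Size([5, 16])
```

Multi-head attention simply runs several copies of this computation in parallel (each with its own projections) and concatenates the results.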

Transformers are a powerful force in NLP and there is a corpus of articles explaining the original paper. The short description above doesn’t do justice to the intricate beauty of the architecture – I’d encourage you to read more about it. Here is one such useful article.

Fig-2: Transformer Architecture (source)

1b. Transfer Learning to the Rescue

Human language, which we understand rather effortlessly, seems infinite in its complexity once we start teaching it to a Machine Learning model. Finding datasets that capture all the nuances needed to build an advanced NLP model is a difficult task, and once we find the necessary amount of data, the actual training of the model becomes computationally expensive. To address this, we use Transfer Learning.

In Transfer Learning, we reuse a pre-trained model as a starting point for another model. The underlying idea is that the learned features in the first task are general enough to be repurposed for the second task.

Fig-3: Illustration of Transfer Learning (Image by Author)

In 2018, Google revolutionized the NLP landscape (again!) by releasing Bidirectional Encoder Representations from Transformers (BERT). BERT is essentially the "Encoder" part of the Transformer, pre-trained on the Wikipedia corpus (meaning lots and lots of data!).

Transfer Learning from BERT produced state-of-the-art results on 11 different downstream tasks in 2018.

The purpose of BERT was to make a robust encoder that can embed words while taking into account the context of the sentence from both directions (words before and after the current word) – and indeed, it does exactly that! Using the pre-trained representations from BERT significantly increases the accuracy of NLP model predictions and, even more significantly, reduces the computational expense that would otherwise have been required to reach that accuracy.

Pre-trained language models have become the real deal in NLP research-verse with a lot of breakthroughs in the past 3 years. GPT-2, GPT-3, T5, RoBERTa and XLNet are some other popular language models.

1c. The Power of Parallel Computing

With the expansion in data availability and the complexity of architectures, adding new layers to your neural net can be a daunting task for your CPU. A simple way to make the ‘job’ easier is to utilize multiple CPU cores. This helps the CPU to parallelize the process and reduce the turnaround time.

An even better way to accelerate your processing is to use GPUs. A CPU can run only as many threads in parallel as it has cores, whereas a GPU has thousands of smaller cores and can run many computations simultaneously. GPUs have proven to be extremely valuable for ML models, especially for Deep Learning.

GPUs might not be pocket friendly for everyone but worry not, the ML community has your back! There are multiple options available online to use free GPUs! I’ll be leveraging Google Colab’s free NVIDIA GPU for this blog.

For NVIDIA GPUs, we can use CUDA to enable parallel processing in our model with just a few lines of code.
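A minimal sketch of the usual PyTorch pattern: pick the GPU when CUDA is available, fall back to the CPU otherwise, and move both the model and each batch to that device (the `nn.Linear` below is just a stand-in for our transformer model).

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 2)          # stand-in for the real model
model = model.to(device)                # move model parameters to the device

batch = torch.randn(4, 10).to(device)   # every batch must live on the same device
logits = model(batch)
print(logits.shape)                     # torch.Size([4, 2])
```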


2. Now, Let’s Code

The concepts discussed above can be used to create an ML model for mining opinions and understanding current customer sentiment. I have broken down the process into 4 steps: data preparation, transfer learning, fine-tuning and training, and prediction.

I’ll be using the ‘unprocessed’ tar file from Multi-Domain Sentiment Dataset (version 2.0). This dataset contains Amazon reviews for products in multiple categories across the last few decades.

2a. Data Preparation

The data is in XML format and will first need to be converted into a pandas DataFrame. We can do so using Beautiful Soup.

"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It helps in navigating, searching, and modifying the parse tree"

We will now convert the text into tokens using the BERT vocabulary. The PyTorch-Pretrained-BERT library provides tokenizers for all BERT models. Read more about them here.

Each model has a maximum sequence length, and we’ll need to truncate our tokens to BERT’s limit of 512. We’ll also add special tokens: ‘[CLS]’ to mark the start of each sequence and ‘[PAD]’ tokens to ensure all sequences have the same length.
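In the actual code this is handled by the BERT tokenizer, but the truncation and padding logic it applies can be sketched with a toy vocabulary (the words, IDs, and `MAX_LEN` below are made up purely for illustration; real BERT also maps unknown words to an ‘[UNK]’ token rather than to padding):

```python
MAX_LEN = 8          # BERT's real limit is 512; a small value keeps the demo readable
PAD, CLS = "[PAD]", "[CLS]"
vocab = {PAD: 0, CLS: 1, "the": 2, "battery": 3, "life": 4, "is": 5, "great": 6, "terrible": 7}

def encode(sentence):
    # Prepend the classification token, then truncate to the model's limit
    tokens = [CLS] + sentence.lower().split()
    tokens = tokens[:MAX_LEN]
    # Pad short sequences so every example ends up the same length
    tokens += [PAD] * (MAX_LEN - len(tokens))
    return [vocab.get(t, 0) for t in tokens]

ids = encode("the battery life is great")
print(ids)  # [1, 2, 3, 4, 5, 6, 0, 0]
```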

We’ll now create a DataLoader for iterating through the data. The DataLoader enables automatic batching and is useful when we want to parallelize data operations. While pulling the text into the DataLoader, we tokenize each row of the dataset and load it with the corresponding label.
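A minimal sketch of that step, using random tensors as stand-ins for the tokenized reviews and their labels:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy stand-ins for the tokenized reviews and their sentiment labels
input_ids = torch.randint(0, 1000, (100, 512))   # 100 reviews, 512 token IDs each
labels = torch.randint(0, 2, (100,))             # 0 = negative, 1 = positive

dataset = TensorDataset(input_ids, labels)
# shuffle for training; num_workers > 0 would parallelize loading across CPU cores
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_ids, batch_labels in loader:
    print(batch_ids.shape, batch_labels.shape)   # torch.Size([16, 512]) torch.Size([16])
    break
```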

2b. Transfer Learning

For Transfer Learning, I have adopted the pre-trained model from NAACL’s 2019 Transfer Learning tutorial.

On top of the model, we’ll add a classifier head for the business problem. In our case, we want to predict ‘positive’ or ‘negative’ customer reaction – hence, our model will have 2 output classes.
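A sketch of what that wrapper might look like, assuming the pre-trained encoder returns hidden states of shape `(batch, seq_len, hidden)` and using the first ([CLS]) position as the sentence representation; the lambda encoder and the 768 hidden size are stand-ins so the example runs end to end.

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Pre-trained encoder + a small 2-class head (positive / negative)."""
    def __init__(self, encoder, hidden_size=768, num_classes=2):
        super().__init__()
        self.encoder = encoder               # the fine-tuned transformer body
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)     # assumed shape: (batch, seq_len, hidden)
        cls_vec = hidden[:, 0, :]            # representation of the first ([CLS]) token
        return self.head(self.dropout(cls_vec))

# Stand-in encoder so the sketch runs without downloading weights
dummy_encoder = lambda ids: torch.randn(ids.size(0), ids.size(1), 768)
model = SentimentClassifier(dummy_encoder)
logits = model(torch.randint(0, 1000, (4, 128)))
print(logits.shape)  # torch.Size([4, 2])
```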

2c. Model Fine-Tuning and Training

We’ll now need to create a training function for the train data and an evaluation function to score the validation data at each epoch. We’ll additionally need to define a configuration for model optimization. Keeping the fine-tuning settings in a separate dictionary makes it easy to iterate on the hyperparameters.
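In plain PyTorch, those two functions and the config dictionary might look like the sketch below (the hyperparameter values, gradient clipping, and the toy model used in the smoke test are illustrative assumptions, not the exact setup from the tutorial):

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

config = {            # hyperparameters in one place for easy iteration
    "lr": 2e-5,
    "epochs": 3,
    "clip_grad": 1.0,
}

def train_epoch(model, loader, optimizer, loss_fn, device="cpu"):
    model.train()
    total = 0.0
    for ids, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(ids.to(device)), labels.to(device))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), config["clip_grad"])
        optimizer.step()
        total += loss.item()
    return total / len(loader)             # mean training loss for the epoch

def evaluate(model, loader, device="cpu"):
    model.eval()
    correct = 0
    with torch.no_grad():
        for ids, labels in loader:
            preds = model(ids.to(device)).argmax(dim=-1)
            correct += (preds == labels.to(device)).sum().item()
    return correct / len(loader.dataset)   # accuracy

# Smoke test with a toy model and random data
toy_model = nn.Sequential(nn.Embedding(100, 16), nn.Flatten(), nn.Linear(16 * 10, 2))
data = TensorDataset(torch.randint(0, 100, (32, 10)), torch.randint(0, 2, (32,)))
loader = DataLoader(data, batch_size=8)
opt = torch.optim.Adam(toy_model.parameters(), lr=config["lr"])
print(train_epoch(toy_model, loader, opt, nn.CrossEntropyLoss()))
print(evaluate(toy_model, loader))
```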

PyTorch’s Ignite library makes training and evaluation very convenient, with only a few lines of code.

The accuracy of the model can be checked using the defined evaluator. The pre-trained model enabled us to reach 92% accuracy with only a couple of thousand examples.

Fig-4: Model Accuracy on the test set

2d. Predict!

To predict samples, we’ll first tokenize the data similar to Code-Block 2 and then input it into the model.
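A sketch of such a prediction helper, where `encode_fn` stands for whatever tokenization pipeline was used during data preparation; the toy model and hash-based encoder below are stand-ins so the function can be demonstrated without the trained weights.

```python
import torch
import torch.nn.functional as F

def predict_sentiment(model, encode_fn, sentence, device="cpu"):
    """Tokenize one sentence the same way as the training data and classify it."""
    model.eval()
    ids = torch.tensor([encode_fn(sentence)], device=device)   # batch of one
    with torch.no_grad():
        probs = F.softmax(model(ids), dim=-1).squeeze(0)       # class probabilities
    label = "positive" if probs[1] > probs[0] else "negative"
    return label, probs.tolist()

# Stand-ins for the trained model and the real tokenizer
toy_model = torch.nn.Sequential(
    torch.nn.Embedding(50, 8), torch.nn.Flatten(), torch.nn.Linear(8 * 4, 2))
toy_encode = lambda s: ([hash(w) % 50 for w in s.split()] + [0] * 4)[:4]
print(predict_sentiment(toy_model, toy_encode, "great battery life"))
```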

The above function can directly be used to output the positive and negative classes for a single sentence. In Part 2 of this blog, we’ll use this function for batch prediction.

Fig-5: Model Prediction for single input

Conclusion

To recap, we learnt how to read XML files using Beautiful Soup, used PyTorch DataLoaders to create iterators, tokenized text using the BERT vocabulary, and used transfer learning on a pre-trained model to create a base for our sentiment analysis. A classifier head was also added to the base model to enable it to predict positive and negative sentiments.

Check out the "Attention is all you need" paper to read more about Transformers. Please refer to the source code for more details and feel free to reach out to me for any questions/suggestions that you might have!

In the second part of this blog, I’ll create an interactive dashboard which will plot real-time trends based on the predictions of the transformer model!

