Multiclass Text Classification using LSTM in Pytorch

Predicting item ratings based on customer reviews

Aakanksha NS
Towards Data Science


Human language is filled with ambiguity: the same phrase can often have multiple interpretations depending on the context, and can even appear confusing to humans. Such challenges make natural language processing an interesting but hard problem to solve. However, we’ve seen a lot of advancement in NLP in the past couple of years, and it’s quite fascinating to explore the various techniques being used. This article aims to cover one such technique in deep learning using Pytorch: Long Short Term Memory (LSTM) models.

Here’s a link to the notebook consisting of all the code I’ve used for this article: https://jovian.ml/aakanksha-ns/lstm-multiclass-text-classification

If you’re new to NLP or need an in-depth read on preprocessing and word embeddings, you can check out the following article:

Gentle Intro to RNNs and LSTMs:

What sets language models apart from conventional neural networks is their dependency on context. Conventional feed-forward networks assume inputs to be independent of one another. For NLP, we need a mechanism to be able to use sequential information from previous inputs to determine the current output. Recurrent Neural Networks (RNNs) tackle this problem by having loops, allowing information to persist through the network.

An unrolled Recurrent Neural Network (Image by author)

However, conventional RNNs have the issue of exploding and vanishing gradients and are not good at processing long sequences because they suffer from short-term memory.

Long Short Term Memory networks (LSTM) are a special kind of RNN, which are capable of learning long-term dependencies. They do so by maintaining an internal memory state called the “cell state” and have regulators called “gates” to control the flow of information inside each LSTM unit. Here’s an excellent source explaining the specifics of LSTMs:

Structure of an LSTM cell (source: Varsamopoulos, Savvas & Bertels, Koen & Almudever, Carmen (2018). Designing neural network based decoders for surface codes).

Basic LSTM in Pytorch

Before we jump into the main problem, let’s take a look at the basic structure of an LSTM in Pytorch, using a random input. This is a useful step to perform before getting into complex inputs because it helps us learn how to debug the model better, check if dimensions add up and ensure that our model is working as expected.

Even though we’re going to be dealing with text, since our model can only work with numbers, we convert the input into a sequence of numbers where each number represents a particular word (more on this in the next section).

We first pass the input (a 3x8 tensor: a batch of 3 sequences, each 8 tokens long) through an embedding layer, because word embeddings are better at capturing context and are spatially more efficient than one-hot vector representations.

In Pytorch, we can use the nn.Embedding module to create this layer, which takes the vocabulary size and desired word-vector length as input. You can optionally provide a padding index, to indicate the index of the padding element in the embedding matrix.

In the following example, our vocabulary consists of 100 words, so the token indices fed to the embedding layer can only range from 0 to 99. The layer holds a 100x7 embedding matrix that maps each index to a 7-dimensional word vector, with index 0 representing our padding element.

We pass the embedding layer’s output into an LSTM layer (created using nn.LSTM), which takes as input the word-vector length, the length of the hidden state vector, and the number of layers. Additionally, if the first dimension of our input is the batch size, we specify batch_first = True.

The LSTM layer outputs three things:

  • The consolidated output of all hidden states in the sequence
  • The hidden state of the last LSTM unit, which serves as the final output
  • The cell state

We can verify that after passing through all layers, our output has the expected dimensions:

3x8 -> embedding -> 3x8x7 -> LSTM (hidden size = 3) -> consolidated output of shape 3x8x3, with final hidden and cell states of shape 1x3x3
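Putting this together, here is a minimal sketch of the toy example described above (the input is random and the variable names are my own):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# a batch of 3 sequences, each 8 tokens long, with token ids in [1, 99]
x = torch.randint(low=1, high=100, size=(3, 8))

# vocabulary of 100 words, 7-dimensional word vectors, index 0 reserved for padding
embedding = nn.Embedding(num_embeddings=100, embedding_dim=7, padding_idx=0)

# LSTM taking 7-dimensional word vectors and producing a 3-dimensional hidden state
lstm = nn.LSTM(input_size=7, hidden_size=3, num_layers=1, batch_first=True)

emb = embedding(x)         # (3, 8, 7)
out, (ht, ct) = lstm(emb)  # out: (3, 8, 3), ht: (1, 3, 3), ct: (1, 3, 3)

print(x.shape, emb.shape, out.shape, ht.shape, ct.shape)
```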

Multiclass Text Classification — Predicting ratings from review comments

Let’s now look at an application of LSTMs.

Problem Statement: Given an item’s review comment, predict the rating (an integer from 1 to 5, where 1 is the worst and 5 is the best).

Dataset: I’ve used the following dataset from Kaggle:

Metric

We usually take accuracy as our metric for most classification problems; however, ratings are ordered. If the actual value is 5 but the model predicts a 4, that is not as bad as predicting a 1. Hence, instead of accuracy, we choose RMSE (root mean squared error) as our North Star metric. Also, rating prediction is a pretty hard problem, even for humans, so a prediction that is off by 1 point or less is considered pretty good.
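For reference, here is a small sketch of how RMSE can be computed from predicted and true ratings (the tensors shown are made up for illustration):

```python
import torch

def rmse(preds: torch.Tensor, targets: torch.Tensor) -> float:
    # root mean squared error between predicted and true ratings
    return torch.sqrt(torch.mean((preds.float() - targets.float()) ** 2)).item()

# predicting 4 for a true rating of 5 contributes far less error than predicting 1
print(rmse(torch.tensor([4, 3, 5]), torch.tensor([5, 3, 5])))  # ~0.58
```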

Preprocessing

As mentioned earlier, we need to convert our text into a numerical form that can be fed to our model as input. I’ve used spacy for tokenization after removing punctuation and special characters and lowercasing the text:
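Here is a sketch of that step, assuming spaCy is installed (a blank English pipeline is enough for tokenization, and the cleaning regex is my own simplification):

```python
import re
import spacy

nlp = spacy.blank("en")  # tokenizer only, no model download needed

def tokenize(text: str) -> list:
    # lowercase the text and replace punctuation / special characters with spaces
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t.text for t in nlp.tokenizer(cleaned) if not t.is_space]

print(tokenize("Love this dress!! Fits perfectly :)"))
```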

We count the number of occurrences of each token in our corpus and get rid of the ones that occur only rarely:
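A sketch of the counting and filtering, with a toy stand-in for the tokenized reviews and a frequency threshold of my own choosing:

```python
from collections import Counter

# in practice this is one token list per review, produced by tokenize() above
reviews = [
    ["love", "this", "dress", "fits", "perfectly"],
    ["dress", "runs", "small", "love", "the", "fabric"],
]

counts = Counter(token for review in reviews for token in review)

# keep only tokens that occur at least twice in the corpus
counts = {tok: c for tok, c in counts.items() if c >= 2}
print(counts)  # {'love': 2, 'dress': 2}
```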

We lost about 6000 words! This is expected: our corpus is quite small (less than 25k reviews), so the chance of any given word being repeated often is quite small.

We then create a vocabulary-to-index mapping and encode our review text using this mapping. I’ve chosen 70 words as the maximum length of any review because the average review length was around 60.
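A sketch of the encoding step, continuing from the filtered counts above; reserving index 0 for padding and index 1 for unknown words is my own convention:

```python
import numpy as np

counts = {"love": 2, "dress": 2}  # the filtered token -> count dict from the previous sketch

# build the vocabulary-to-index mapping, reserving 0 for padding and 1 for unknown tokens
vocab2index = {"": 0, "UNK": 1}
words = ["", "UNK"]
for token in counts:
    vocab2index[token] = len(words)
    words.append(token)

def encode_sentence(tokens, vocab2index, max_len=70):
    # pad or truncate every review to a fixed length of 70 tokens
    encoded = np.zeros(max_len, dtype=np.int64)
    ids = [vocab2index.get(t, vocab2index["UNK"]) for t in tokens]
    length = min(max_len, len(ids))
    encoded[:length] = ids[:length]
    return encoded, length

print(encode_sentence(["love", "this", "dress"], vocab2index))
```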

Pytorch Dataset

The dataset is quite straightforward because we’ve already stored our encodings in the input dataframe. We also output the length of each input sequence, because we will later build LSTMs that accept variable-length sequences.
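A sketch of such a Dataset, assuming X holds (encoded array, length) pairs and y holds the 1 to 5 ratings; shifting the ratings to 0 to 4 class labels for cross-entropy is my choice here:

```python
import torch
from torch.utils.data import Dataset

class ReviewsDataset(Dataset):
    """X: list of (encoded_review, length) pairs; y: list of ratings in 1..5."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        encoded, length = self.X[idx]
        # shift ratings 1..5 to class labels 0..4 for cross-entropy
        return torch.from_numpy(encoded), self.y[idx] - 1, length
```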

Pytorch training loop

The training loop is pretty standard. I’ve used Adam optimizer and cross-entropy loss.
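A sketch of such a loop, assuming a model that takes the encoded batch and a DataLoader that yields (reviews, labels, lengths) batches as in the Dataset above:

```python
import torch
import torch.nn.functional as F

def train_model(model, train_dl, epochs=10, lr=0.001):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        total_loss, total = 0.0, 0
        for x, y, lengths in train_dl:
            y_pred = model(x)                  # (batch, 5) logits
            loss = F.cross_entropy(y_pred, y)  # y holds class labels 0..4
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * y.size(0)
            total += y.size(0)
        print(f"epoch {epoch}: train loss {total_loss / total:.3f}")
```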

LSTM Model

I’ve used three variations for the model:

  1. LSTM with fixed input size:

This pretty much has the same structure as the basic LSTM we saw earlier, with the addition of a dropout layer to prevent overfitting. Since we have a classification problem, the final linear layer has 5 outputs, one per rating. This implementation actually works best among the classification LSTMs, with an accuracy of about 64% and a root mean squared error of only 0.817.
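A sketch of this model; the dropout probability and layer sizes are placeholders of my own:

```python
import torch
import torch.nn as nn

class LSTMFixedLen(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes=5):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.linear = nn.Linear(hidden_dim, num_classes)  # 5 outputs, one per rating

    def forward(self, x):
        x = self.dropout(self.embeddings(x))
        _, (ht, _) = self.lstm(x)
        # the hidden state of the last LSTM unit summarises the whole review
        return self.linear(ht[-1])
```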

  2. LSTM with variable input size:

We can modify our model a bit to make it accept variable-length inputs. This ends up increasing the training time, though, because of the pack_padded_sequence call, which packs the padded batch of variable-length sequences before feeding it to the LSTM.
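A sketch of the variable-length variant; passing enforce_sorted=False saves us from sorting each batch by length:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class LSTMVariableLen(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes=5):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, lengths):
        # lengths: the true (unpadded) length of each review, kept on the CPU
        x = self.dropout(self.embeddings(x))
        packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        _, (ht, _) = self.lstm(packed)
        return self.linear(ht[-1])
```

The training loop only needs to pass the extra argument, e.g. y_pred = model(x, lengths).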

  3. LSTM with fixed input size and fixed pre-trained GloVe word vectors:

Instead of training our own word embeddings, we can use pre-trained GloVe word vectors that have been trained on a massive corpus and probably capture context better. For our problem, however, this doesn’t seem to help much.
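A sketch of how the pre-trained vectors can be plugged in, assuming the downloaded glove.6B.300d.txt file and the words list built during encoding; the helper names are my own:

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove_vectors(glove_file="glove.6B.300d.txt"):
    """Read a downloaded GloVe file into a word -> vector dict."""
    word_vectors = {}
    with open(glove_file, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            word_vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return word_vectors

def get_emb_matrix(word_vectors, words, emb_dim=300):
    """Build a (vocab_size, emb_dim) matrix aligned with our vocabulary order."""
    matrix = np.random.normal(scale=0.6, size=(len(words), emb_dim)).astype(np.float32)
    matrix[0] = np.zeros(emb_dim)  # padding row
    for i, word in enumerate(words):
        if word in word_vectors:
            matrix[i] = word_vectors[word]
    return matrix

# in the model, replace the trainable embedding layer with the frozen GloVe weights:
# weights = get_emb_matrix(load_glove_vectors(), words)
# self.embeddings = nn.Embedding.from_pretrained(
#     torch.from_numpy(weights), freeze=True, padding_idx=0)
```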

Predicting ratings using regression instead of classification

Since ratings have an order, and a prediction of 3.6 might be better than rounding it off to 4 in many cases, it is helpful to explore this as a regression problem. Not surprisingly, this approach gives us the lowest error, just 0.799, because the predictions are no longer restricted to integers.

The only change to our model is that the final layer has a single output instead of 5. The training loop changes a bit too: we use MSE loss, and we no longer need to take the argmax to get the final prediction.
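A sketch of the changed pieces, assuming the same model as before but with a single output unit:

```python
import torch
import torch.nn.functional as F

# in the model: self.linear = nn.Linear(hidden_dim, 1) instead of nn.Linear(hidden_dim, 5)

def regression_step(model, x, y, optimizer):
    y_pred = model(x).squeeze(1)          # (batch,) predicted ratings
    loss = F.mse_loss(y_pred, y.float())  # targets treated as continuous values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# at evaluation time the raw output is already the predicted rating; no argmax needed
```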

Conclusion:

LSTMs appear to be theoretically involved, but their Pytorch implementation is pretty straightforward. Also, for any problem, it is very important to choose the right metric: in our case, if we’d gone with accuracy, the model would seem to be doing a very bad job, but the RMSE shows that it is off by less than 1 rating point, which is comparable to human performance!
