Application of RNN for customer review sentiment analysis

Shukhrat Khodjaev
Towards Data Science
6 min read · Sep 26, 2018


In my previous blog post I wrote about using BeautifulSoup to scrape over two thousand Flixbus customer reviews and identify the company’s strengths and weaknesses through NLP analysis.

Building on that story, I used the collected text data to train a Recurrent Neural Network (RNN) model to predict customer sentiment, which proved highly effective, scoring 95.93% accuracy on the test set.

What is sentiment analysis? Wikipedia provides a nice explanation:

“… sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event.” - Source

Without further ado, let’s jump into the implementation.

Loading and preparing the data

As a starting point, I loaded a CSV file containing 1,780 customer reviews in English with the corresponding rating on a scale from 1 to 5, where 1 is the lowest (negative) and 5 is the highest (positive) rating. Here is a quick glance at the data frame:

Data frame with customer reviews and rating

Great! Now we have the data to work with. However, since our goal is to predict sentiment (whether a review is positive or negative), we have to select the appropriate data for this task.

Using the Counter function, I noticed that the distribution of reviews per rating is quite unbalanced:

# Count of reviews per rating
Counter({5: 728, 4: 416, 1: 507, 3: 86, 2: 43})
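For reference, the count above can be reproduced with Python’s built-in Counter (a small sketch, applied to the Rating column of the data frame):

from collections import Counter
# Count how many reviews fall into each star rating
Counter(data.Rating)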

To balance it out and to ensure a good representation of both sentiment classes, I decided to keep 5-star reviews as “positive” and 1- and 2-star reviews as “negative” sentiment. As a result, I ended up with a sample of 1,278 reviews in total. Not much, but let’s see what we can get out of it.
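The filtering step itself is not shown in the post; a minimal sketch, assuming the reviews sit in the data frame data used in the next snippet:

# Keep only the ratings used for the two sentiment classes (1, 2 and 5 stars)
data = data[data.Rating.isin([1, 2, 5])].reset_index(drop=True)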

Prior to processing the reviews, the sentiment should be binary encoded, with 1 for positive and 0 for negative sentiment, using a list comprehension.

data['Sentiment'] = [1 if x > 4 else 0 for x in data.Rating]

Now we have a basic setup and it is time to continue with data preprocessing.

Data preprocessing

The RNN expects array inputs, so we convert the “Review” column into the X array and “Sentiment” into the y array accordingly.

X, y = (data['Review'].values, data['Sentiment'].values)

Text data has to be integer encoded before feeding it into the RNN model. This can be easily achieved by using basic tools from the Keras library with only a few lines of code:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tk = Tokenizer(lower = True)
tk.fit_on_texts(X)
X_seq = tk.texts_to_sequences(X)
X_pad = pad_sequences(X_seq, maxlen=100, padding='post')

First, the text is tokenized by fitting the Tokenizer on the data set. As you can see, I use the “lower = True” argument to convert the text to lowercase and ensure consistency of the data. Afterwards, the texts_to_sequences method maps each review to a list of integers, one unique integer per unique word.

Dictionary
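
The dictionary shown above corresponds to the tokenizer’s word-to-integer mapping, which can be inspected via tk.word_index (a quick check, using the tk object fitted above):

# First ten entries of the learned word-to-integer mapping
list(tk.word_index.items())[:10]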

As an example, below you can see how the original reviews turn into sequences of integers after applying preprocessing.

Original reviews vs after tokenization and sequencing

Next, we use the pad_sequences function on the lists of integers to ensure that all reviews have the same length, an important step when preparing data for an RNN model. It truncates reviews longer than 100 integers and pads shorter ones with 0’s at the end.

Reviews after padding

Now, we split the data set into training and testing sets using sklearn’s train_test_split, keeping 25% of the original data as a hold-out set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size = 0.25, random_state = 1)

Furthermore, the training set can be split into a training and a validation set:

batch_size = 64
# Set aside the first batch of the training data as a validation set
X_train1 = X_train[batch_size:]
y_train1 = y_train[batch_size:]
X_valid = X_train[:batch_size]
y_valid = y_train[:batch_size]

It is time to build the model and fit it on the training data using Keras:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
vocabulary_size = len(tk.word_counts.keys())+1
max_words = 100
embedding_size = 32
model = Sequential()
# Embedding layer turns integer-encoded words into dense vectors
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
# Single LSTM layer with 200 memory cells
model.add(LSTM(200))
# Sigmoid output gives the probability of a positive review
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The Embedding layer requires the size of the vocabulary and the length of the input sequences, so we set vocabulary_size to the number of words in the Tokenizer dictionary plus 1, and input_length to 100 (max_words); the latter must match the maxlen used for padding(!). The embedding_size parameter specifies how many dimensions are used to represent each word. Common choices are 50, 100 and 300, but while tuning the model a value of 32 delivered the best result in this case.

Next, we add one hidden LSTM layer with 200 memory cells. Adding more layers and cells could potentially lead to better results.

Finally, we add the output layer with a sigmoid activation function to predict the probability of a review being positive.
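
The training and evaluation calls are not shown in the post; a minimal sketch, assuming the splits defined above, 10 epochs and a batch size of 64:

# Train on the reduced training set, validating on the held-out batch
model.fit(X_train1, y_train1, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=10)
# Accuracy on the hold-out test set
scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scores[1])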

After training the model for 10 epochs, we achieve an accuracy of 98.44% on the validation set and 95.93% on the test (hold-out) set.

Training and testing accuracy results
Confusion matrix of hold out set prediction
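
The confusion matrix above can be reproduced along these lines (a sketch, using scikit-learn and the predict_classes call that appears later in the post):

from sklearn.metrics import confusion_matrix
# Predicted classes (0 = negative, 1 = positive) for the hold-out set
y_pred = model.predict_classes(X_test, verbose=0).flatten()
confusion_matrix(y_test, y_pred)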

Isn’t that awesome? Let’s double check to be on a safe side!

Final validation

To validate the accuracy of the model further, I additionally scraped the 100 latest Flixbus customer reviews from Trustpilot, which of course were not included in the original data set. The newly scraped reviews carry 1-, 4- and 5-star ratings with the following counts:

# Count of reviews per rating
Counter({1: 83, 4: 13, 5: 4})

To prepare the reviews for prediction, the same preprocessing steps have to be applied to the text before passing it into the trained model.

import pandas as pd
# Prepare reviews for check: df holds the newly scraped Trustpilot reviews
Check_set = df.Review.values
Check_seq = tk.texts_to_sequences(Check_set)
Check_pad = pad_sequences(Check_seq, maxlen = 100, padding = 'post')
# Predict sentiment (0 = negative, 1 = positive)
check_predict = model.predict_classes(Check_pad, verbose = 0)
# Combine reviews, ratings and predictions into a data frame
check_df = pd.DataFrame(list(zip(df.Review.values, df.Rating.values, check_predict)), columns = ['Review','Rating','Sentiment'])
check_df.Sentiment = ['Pos' if x == [1] else 'Neg' for x in check_df.Sentiment]
check_df

Finally, we get the following result:

Final check results
Confusion matrix of validation prediction

From the screenshot above, you can immediately spot some misclassifications. Out of 100 predicted cases, only sixteen reviews with an actual 1-star rating are incorrectly classified as “Pos” (having positive sentiment). However, if we dig deeper, we can see that the problem is not as big as it seems:

Misclassifications

Out of those 16 cases, there are only 3 unique reviews! One review repeats 14 times (not fair 💩). Meanwhile, all the rest, with ratings of 4 and 5, were correctly classified as positive.
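
To see this, one can count how often each misclassified text repeats (a quick sketch using the check_df built above):

# Misclassified reviews: actual 1-star rating but predicted positive
mis = check_df[(check_df.Rating == 1) & (check_df.Sentiment == 'Pos')]
# How often each review text appears among the misclassifications
mis.Review.value_counts()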

That is it!

Woohoo! We have successfully trained and validated an RNN for sentiment prediction. Overall, it is a relatively simple task that delivers impressive results. I hope you enjoyed this read and will try your own implementation!
