Sentiment Classification for Restaurant Reviews using CNN in PyTorch

Implementing Convolutional Neural Network (CNN) with Word2Vec embeddings as input to classify Yelp Restaurant Reviews in PyTorch

Dipika Baad
Towards Data Science


Sentiment Classification using CNN in PyTorch by Dipika Baad

In this article, I will explain how a CNN can be used for text classification and how to design the network to accept pre-trained Word2Vec embeddings as input. You will learn how to build a custom CNN in PyTorch for a sentiment classification problem.

In my previous post, I introduced the basics of PyTorch and how to implement Logistic Regression for sentiment classification; you can refer to that if you are new to PyTorch. Refer to the Feed Forward Neural Network article if you want to get comfortable with defining networks in PyTorch in general. In earlier posts I have also covered other methods for sentiment classification using BOW, TF-IDF, Word2Vec and Doc2Vec vectors with a Decision Tree classifier; those results will be compared at the end as well. Let’s start with loading the data now!

Restaurant Reviews by Sentiment Example by Dipika Baad

Load the data

The Yelp restaurant review dataset can be downloaded from the Yelp site, and the data is provided in JSON format. The file is not valid JSON that Python can read directly: each row is a dictionary, but for the file to be valid JSON it needs a square bracket at the start and end, with a comma added at the end of each row. Define INPUT_FOLDER as the folder path in your local directory where the Yelp review.json file is present. Declare OUTPUT_FOLDER as the path where you want to write the output of the following function. Loading the JSON data and writing the top 100,000 rows is done in the following function:
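The original snippet is not reproduced here; below is a minimal sketch of such a function. The file name review.json, the output file name top_reviews.json and the function name write_top_reviews are assumptions for illustration.

```python
import os
import json

INPUT_FOLDER = "./input"    # folder containing the downloaded Yelp review.json
OUTPUT_FOLDER = "./output"  # folder where the trimmed file will be written

def write_top_reviews(input_file, output_file, num_rows=100000):
    """Read the line-delimited Yelp review file and write the first num_rows
    reviews out as a single valid JSON array."""
    reviews = []
    with open(input_file, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= num_rows:
                break
            reviews.append(json.loads(line))  # each line is one review dictionary
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(reviews, f)  # json.dump adds the surrounding brackets and commas

write_top_reviews(os.path.join(INPUT_FOLDER, "review.json"),
                  os.path.join(OUTPUT_FOLDER, "top_reviews.json"))
```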

Once the above function has been run, you are ready to load the data into a pandas dataframe for the next steps. Only a small amount of data is taken for the experiment so that it runs faster and the results can be seen quickly.
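For example, assuming the trimmed file written above, it can be read into a dataframe like this (top_data_df is an assumed variable name used in the rest of the sketches):

```python
import os
import pandas as pd

# Load the JSON array written in the previous step into a pandas dataframe
top_data_df = pd.read_json(os.path.join(OUTPUT_FOLDER, "top_reviews.json"))
print("Columns in the dataset:", list(top_data_df.columns))
```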

Exploring data

After the data is loaded, a new column indicating the sentiment is created. The original dataset does not always contain a column with the label you want to predict; in most cases it has to be derived. Here, the stars column in the data is used to derive the sentiment.
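A minimal sketch of the mapping, assuming reviews with more than 3 stars are positive, exactly 3 stars are neutral and fewer than 3 stars are negative:

```python
def map_sentiment(stars):
    # 1-2 stars -> negative (-1), 3 stars -> neutral (0), 4-5 stars -> positive (1)
    if stars <= 2:
        return -1
    elif stars == 3:
        return 0
    else:
        return 1

# Derive the sentiment column from the stars column
top_data_df["sentiment"] = [map_sentiment(s) for s in top_data_df["stars"]]
print(top_data_df[["stars", "sentiment"]].head(10))
```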

Output:

After the mapping from stars to sentiment is done, the distribution of each sentiment is plotted.
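One way to plot the distribution, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

# Count and plot how many reviews fall into each sentiment class
sentiment_counts = top_data_df["sentiment"].value_counts()
sentiment_counts.plot(kind="bar", title="Number of reviews per sentiment")
plt.xlabel("Sentiment")
plt.ylabel("Number of reviews")
plt.show()
```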

Output:

Once that is done, the number of rows for each sentiment is checked. The sentiment classes are as follows:

  1. Positive : 1
  2. Negative: -1
  3. Neutral: 0

The number of rows is not equally distributed across the three sentiments. The problem of imbalanced classes won’t be dealt with in this post, so a simple function is written to retrieve the top few records for each sentiment. In this example, top_n is 10000, which means a total of 30,000 records will be taken.
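A sketch of such a helper (the name get_top_data and the resulting dataframe name top_data_df_small are assumptions):

```python
import pandas as pd

def get_top_data(df, top_n=10000):
    # Take the first top_n rows for each sentiment class and stack them together
    top_positive = df[df["sentiment"] == 1].head(top_n)
    top_negative = df[df["sentiment"] == -1].head(top_n)
    top_neutral = df[df["sentiment"] == 0].head(top_n)
    return pd.concat([top_positive, top_negative, top_neutral])

top_data_df_small = get_top_data(top_data_df, top_n=10000)
print("Rows after selecting the top records per sentiment:", len(top_data_df_small))
```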

Output:

How to preprocess text data?

Preprocessing involves many steps such as tokenization, removing stop words, stemming/lemmatization, etc. These commonly used techniques were explained in detail in my previous post on BOW. Here, only the necessary steps are explained.

Why do you need to preprocess this text? — Not all of the information is useful for making predictions or doing classification. Reducing the number of words reduces the input dimension of your model. Written language carries a lot of grammar-specific information, so when converting to a numeric format, word-specific characteristics like capitalisation, punctuation, suffixes/prefixes etc. are redundant. Cleaning the data so that similar words map to a single word, and removing the grammar-specific information from the text, can tremendously reduce the vocabulary. Which methods to apply and which ones to skip depends on the problem at hand.

1. Removal of Stop Words

Stop words are commonly used words that are removed from the sentence as a pre-processing step in many Natural Language Processing (NLP) tasks. Examples of stop words are: ‘a’, ‘an’, ‘the’, ‘this’, ‘not’ etc. Every tool uses a slightly different list of stop words, but this technique is avoided in cases where phrase structure matters, as in Sentiment Analysis.

Example of removing stop words:
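As a quick illustration, here is a sketch using gensim's remove_stopwords (the exact stop word list differs between tools, so the output is only indicative):

```python
from gensim.parsing.preprocessing import remove_stopwords

sample_review = "The food was not good and I would not recommend this place"
print("Original:", sample_review)
print("Without stop words:", remove_stopwords(sample_review))
# "not" is removed, which can make a negative review sound positive
```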

Output:

As can be seen from the output, removing stop words drops words that are necessary to capture the sentiment, and sometimes it can completely change the meaning of the sentence. In the examples printed by the above piece of code, it is clear that it can turn a negative statement into a positive-sounding one. Thus, this step is skipped for sentiment classification.

2. Tokenization

Tokenization is the process in which the sentence/text is split into an array of words called tokens. This helps to apply transformations to each word separately, and it is also required in order to transform words into numbers. There are different ways of performing tokenization. I have explained them in my previous post under the Tokenization section, so check it out if you are interested.

Gensim’s simple_preprocess converts text to lower case and removes punctuation. It also has min and max length parameters, which filter out tokens whose length falls outside that range.

Here, simple_preprocess is used to get the tokens for the dataframe, since it does most of the preprocessing for us already:
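A sketch of this step, assuming the review text lives in the text column and the result is stored in a new tokens column:

```python
from gensim.utils import simple_preprocess

# Lower-case and tokenize each review; tokens outside the default
# 2-15 character length range are dropped
top_data_df_small["tokens"] = [
    simple_preprocess(review, deacc=True)
    for review in top_data_df_small["text"]
]
print(top_data_df_small[["text", "tokens"]].head(5))
```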

Output:

3. Stemming

Stemming reduces a word to its root form. Unlike lemmatization, which uses grammar rules and a dictionary to map words to their root form, stemming simply removes suffixes/prefixes. Stemming is widely used in SEO, web search and information retrieval, since as long as the root matches somewhere in the text, it helps to retrieve all the related documents in a search.

There are different algorithms for stemming: PorterStemmer (1979), LancasterStemmer (1990), and SnowballStemmer (which can take custom rules). The NLTK or Gensim packages can be used to apply these algorithms. Lancaster is a bit slower than Porter, so choose according to the corpus size and the response time required. The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. It is not obvious in advance which one will produce the most accurate results, so one has to experiment with different methods and choose the one that gives better results. In this example, the Porter stemmer is used, which is simple and fast. The following code shows how to apply stemming on the dataframe and create a new column, stemmed_tokens:
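A sketch using gensim's PorterStemmer (NLTK's PorterStemmer would work equally well):

```python
from gensim.parsing.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Stem every token and store the result in a new column
top_data_df_small["stemmed_tokens"] = [
    [porter_stemmer.stem(word) for word in tokens]
    for tokens in top_data_df_small["tokens"]
]
print(top_data_df_small[["tokens", "stemmed_tokens"]].head(5))
```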

Output:

Splitting into Train and Test Sets:

Train data is used to train the model; test data is the data on which the model predicts classes, and those predictions are compared with the original labels to compute accuracy or other test metrics.

  • Train data ( Subset of data for training ML Model) ~70%
  • Test data (Subset of data for testing ML Model trained from the train data) ~30%

Try to balance the number of classes in both sets so that the results are not biased and model training does not suffer. This is a crucial part of building a machine learning model. Real-world problems often have imbalanced classes, which require techniques like oversampling the minority class or undersampling the majority class (the resample function from the scikit-learn package), or generating synthetic samples using SMOTE from the imbalanced-learn (imblearn) package.

For this case, the data is split into two parts, train and test, with 70% in train and 30% in test. While splitting, it is better to have an equal distribution of classes in both train and test data. Here, the train_test_split function from the scikit-learn package is used.
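A sketch of the split, using the stratify argument to keep the class proportions equal in both sets:

```python
from sklearn.model_selection import train_test_split

# 70/30 split, stratified on the sentiment column
X_train, X_test, Y_train, Y_test = train_test_split(
    top_data_df_small["stemmed_tokens"],
    top_data_df_small["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=top_data_df_small["sentiment"],
)
print("Train size:", len(X_train), "Test size:", len(X_test))
print(Y_train.value_counts())
print(Y_test.value_counts())
```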

Output:

As can be seen from the above output, the data is distributed proportionally across the classes. The number of rows for each sentiment in train and test is printed.

Convolutional Neural Network for Text Classification

Now, we are ready to dive into how we will use a CNN for text classification and how the input will be constructed. A CNN involves two operations, which can be thought of as feature extractors: convolution and pooling. The output of these operations is finally connected to a multi-layer perceptron to get the final output.

A neural network works only on numerical data, so the first task is to find a suitable transformation from words to numeric format. Here we will use Word2Vec vectors of size 500 as input, which were explained in my previous post.

  • Convolutional layers — These layers find patterns by sliding a small kernel window over the input. Instead of sliding filters over small regions of an image, the kernel slides over the embedding vectors of a few consecutive words, as specified by the window size. Because the window has to look at multiple word embeddings in sequence, the kernels are rectangular with size window_size * embedding_size. For example, in our case, if the window size is 3 then the kernel will be 3*500. This essentially represents n-grams in the model. The kernel weights (filter) are multiplied element-wise with the word embeddings and summed up to get the output values. As the network is trained, these kernel weights are learned as well. The following shows an example of how computations are done in the convolution layer (padding is used, which is explained later on).
CNN Calculation for text classification by Dipika Baad
  • Input and output channels for convolution — Here, nn.Conv2d is used to create the convolution layer. For images, each colour channel is fed separately, so the number of input channels is 3 for RGB or 1 for greyscale. In our case we feed only one feature, the word embedding, so the first parameter of Conv2d is 1, and the number of output channels is the total number of features, which will be NUM_FILTERS. We can have multiple filters for each window size, and hence there will be that many outputs in total.
  • Padding — Sometimes the kernel will not fit perfectly while sliding over the input. To keep the output height the same, padding is used; here we have used window_size - 1. The following animation shows how it works with window_size - 1 padding.
Sliding window with Padding for CNN Classification with text by Dipika Baad
  • Maxpooling — Once the feature vector has captured the significant features, it is enough to know that a feature exists in the sentence, like the positive phrase “great food”; it does not matter where it appears in the sentence. Maxpooling keeps just that information and discards the rest. For example, after applying maxpooling to the feature vector from the above animation, the maximum value is chosen. In this case the maximum occurs when “very” and “delicious” are in the phrase, which makes sense.
Maxpooling result on feature vector by Dipika Baad

Once maxpooling has been applied to all the feature vectors, further layers such as a feed-forward neural network can be added. Let’s get started with the implementation! The following code shows the basic libraries you need to import before building the network. I am using a GPU on Google Colab, so the device indicates cuda.
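A minimal version of the setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Use the GPU if one is available (on Google Colab this prints "cuda"), otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```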

Output:

Generating input and output tensor

The input will be Word2Vec vectors trained with embedding size 500. As we want to keep the length of all sentences the same, a padding token is used to fill the remaining positions when a sentence is shorter than the longest sentence in the corpus. Let’s train the Word2Vec model using the following function. If you are interested in learning more about Word2Vec, refer to my previous article. The model is trained on the whole corpus.
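A sketch of the training function, assuming gensim 4.x (use size= instead of vector_size= on older versions); a dedicated pad token is included in the training sentences so that it gets an embedding and can be used for padding later. The file name word2vec_model.bin is an assumption.

```python
from gensim.models import Word2Vec

EMBEDDING_SIZE = 500

def make_word2vec_model(df, size=EMBEDDING_SIZE, window=5, min_count=1, workers=4):
    # Prepend a "pad" token to every sentence so the padding token gets an embedding
    sentences = [["pad"] + row for row in df["stemmed_tokens"]]
    return Word2Vec(sentences, vector_size=size, window=window,
                    min_count=min_count, workers=workers)

w2v_model = make_word2vec_model(top_data_df_small)
w2v_model.save("word2vec_model.bin")
```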

Once the model is ready, we can create a function to generate the input tensor.
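A sketch of such a function; it pads each sentence with the pad token up to the longest sentence length and returns a tensor of vocabulary indices (the embedding lookup itself happens inside the network):

```python
import torch

# Longest sentence in the corpus; every input tensor is padded to this length
max_sen_len = top_data_df_small["stemmed_tokens"].map(len).max()
padding_idx = w2v_model.wv.key_to_index["pad"]

def make_word2vec_vector_cnn(tokens):
    # Pad the sentence and map each token to its index in the Word2Vec vocabulary
    padded = list(tokens) + ["pad"] * (max_sen_len - len(tokens))
    indices = [w2v_model.wv.key_to_index.get(tok, padding_idx) for tok in padded]
    return torch.tensor(indices, dtype=torch.long, device=device).view(1, -1)
```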

The maximum sentence length found was 927, so each input tensor will be of that size. For creating the output tensor, the labels have to be mapped to non-negative values. Currently the negative class is labelled -1, which cannot be used as a class index in the neural network: three neurons in the output layer will give the probabilities for each label, so we just need a mapping to non-negative integers. The function is as follows:
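A sketch of the mapping, assuming -1/0/1 are mapped to the class indices 0/1/2:

```python
def make_target(label):
    # Map sentiment labels to non-negative class indices: -1 -> 0, 0 -> 1, 1 -> 2
    if label == -1:
        return torch.tensor([0], dtype=torch.long, device=device)
    elif label == 0:
        return torch.tensor([1], dtype=torch.long, device=device)
    else:
        return torch.tensor([2], dtype=torch.long, device=device)
```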

Defining CNN

The following code shows how to define the CNN. Parameters like the window sizes and the number of filters can be tweaked to get different results. Here, I have loaded the model generated above from its saved file; when you train Word2Vec and run the CNN at different times, it is better to load the model from the saved file.
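A sketch of the network under the assumptions above (the class name CnnTextClassifier, NUM_FILTERS = 10 and window sizes (1, 2, 3, 5) are illustrative choices, not necessarily the original settings): it loads the saved Word2Vec model as a frozen embedding layer, applies one Conv2d per window size with window_size - 1 padding, maxpools each feature map over the whole sequence, and feeds the concatenated features to a linear layer.

```python
NUM_CLASSES = 3
NUM_FILTERS = 10

class CnnTextClassifier(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES, window_sizes=(1, 2, 3, 5)):
        super().__init__()
        # Load the saved Word2Vec model and use its vectors as a (frozen) embedding layer
        w2v = Word2Vec.load("word2vec_model.bin")
        weights = torch.FloatTensor(w2v.wv.vectors)
        self.embedding = nn.Embedding.from_pretrained(
            weights, padding_idx=w2v.wv.key_to_index["pad"])

        # One convolution per window size; each kernel spans window_size words
        # and the full embedding width
        self.convs = nn.ModuleList([
            nn.Conv2d(1, NUM_FILTERS, (window_size, EMBEDDING_SIZE),
                      padding=(window_size - 1, 0))
            for window_size in window_sizes
        ])
        self.fc = nn.Linear(NUM_FILTERS * len(window_sizes), num_classes)

    def forward(self, x):
        x = self.embedding(x)       # [batch, seq_len, embedding_size]
        x = torch.unsqueeze(x, 1)   # [batch, 1, seq_len, embedding_size]: one input channel
        pooled = []
        for conv in self.convs:
            x2 = torch.tanh(conv(x))           # [batch, NUM_FILTERS, seq_len + ws - 1, 1]
            x2 = torch.squeeze(x2, -1)         # drop the width dimension
            x2 = F.max_pool1d(x2, x2.size(2))  # max over the whole sequence
            pooled.append(x2)
        x = torch.cat(pooled, dim=2)           # concatenate features from all window sizes
        x = x.view(x.size(0), -1)
        return F.log_softmax(self.fc(x), dim=1)  # log-probabilities for NLLLoss
```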

Train CNN Model

Once the network is defined, we can start by initializing the necessary objects for training, such as the model object, the loss and the optimizer. The following code shows how to train the CNN model. I ran this for 30 epochs. The loss is recorded at each step for the training data.
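A sketch of the training loop (one review per step; in this sketch the loss is averaged per epoch for the plot, and the learning rate 0.001 with Adam is an assumed choice):

```python
cnn_model = CnnTextClassifier().to(device)
loss_function = nn.NLLLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)

NUM_EPOCHS = 30
epoch_losses = []

cnn_model.train()
for epoch in range(NUM_EPOCHS):
    total_loss = 0.0
    for tokens, label in zip(X_train, Y_train):
        optimizer.zero_grad()
        input_vec = make_word2vec_vector_cnn(tokens)  # [1, max_sen_len] index tensor
        target = make_target(label)                   # [1] class index tensor
        log_probs = cnn_model(input_vec)
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    epoch_losses.append(total_loss / len(X_train))
    print(f"Epoch {epoch + 1}: average training loss {epoch_losses[-1]:.4f}")
```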

Testing the model

The code for testing the model is shown below. The loss graph is also plotted, along with the code for saving the plot. This is useful when you are running multiple experiments and want to compare results across different combinations of hyper-parameters.
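A sketch of the evaluation and of saving the loss plot (the file name cnn_loss_plot.png is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

cnn_model.eval()
predictions, originals = [], []
with torch.no_grad():
    for tokens, label in zip(X_test, Y_test):
        log_probs = cnn_model(make_word2vec_vector_cnn(tokens))
        predictions.append(torch.argmax(log_probs, dim=1).item())
        originals.append(make_target(label).item())

print(classification_report(originals, predictions))

# Plot and save the training loss curve so experiments can be compared later
plt.plot(range(1, len(epoch_losses) + 1), epoch_losses)
plt.xlabel("Epoch")
plt.ylabel("Average training loss")
plt.savefig("cnn_loss_plot.png")
plt.show()
```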

Output:

From the loss graph it is clear that the loss decreases steadily without large fluctuations, which indicates that the learning rate is not too high. An accuracy of 0.72 is very good compared to the previous methods, where a decision tree classifier was used with BOW, TF-IDF, Word2Vec and Doc2Vec, and also compared to logistic regression in PyTorch. This accuracy is close to what we got with a simple feed-forward neural network. Hence, using a CNN is not always necessary; depending on the problem complexity and the computational resources available, the appropriate method should be chosen.

So now you can easily experiment with this method on your own dataset! I hope this helped you understand how to use PyTorch to build a CNN model for sentiment analysis on restaurant review data. Feel free to extend this code! It is applicable to any other text classification problem with multiple classes. To improve this model, I would try different learning rates, numbers of epochs, window sizes, embedding sizes, numbers of filters and other optimization algorithms like SGD, RMSProp, etc. The preprocessing could also be changed to use lemmatization or other stemming algorithms to see how the results change. There is a lot of room for experimenting in your project.

As always — Happy experimenting and learning :)
