Notes from Industry

For more than a decade, businesses have leveraged the reach of the internet to advertise their content. Without this medium of advertising, it would be difficult for these companies to reach their desired customer base. But with the rapid growth of social networking sites, the internet has also evolved into a forum where consumers evaluate products and services based on the feedback posted online by others. The reviews written about a product decide its reputation in the market. Studies of online shopping behavior have shown that a potential customer reads four or five reviews on average before trusting a product. That is why customer reviews are crucial for the functioning of a business. In this case study, we will be exploring aspect extraction, a technique that is critical for analyzing online reviews. We are dividing it into the following sections –
- Problem description
- Model overview
- Data description
- Data preprocessing
- Model architecture
- Baseline model
- Conclusion
- Deployment
- Future work
- Links
- References
Problem description
Sentiment analysis is a natural language processing technique used to determine whether a given text is positive, negative, or neutral. It works well when you wish to deduce the overall sentiment of a chunk of text. Its drawback is that you would have to sift through each review manually to understand which aspect of the product a customer found unsatisfactory, and this form of manual labor is very time-consuming. In such situations, aspect-based sentiment analysis is a better option: it lets us analyze reviews by associating sentiments with the specific aspects they mention.
In the 2017 research paper titled 'An Unsupervised Neural Attention Model for Aspect Extraction' by Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier, the researchers designed an unsupervised deep neural network that can classify a set of sentences by their aspect. The model limits itself to identifying only one aspect per input and does not associate any sentiment with that aspect; any model properly trained to perform sentiment analysis can later be applied to the reviews after they have been segregated. In this case study, we shall be designing the model described in the 2017 paper from scratch, using TensorFlow 2.x as the backend framework in Python for model building and training.
Model overview
The traditional machine learning models previously used for this task assume that the words occurring in each sentence are independent and that context is irrelevant. Such assumptions degrade the performance of these models. Word embeddings were introduced in the 2013 paper titled 'Distributed Representations of Words and Phrases and their Compositionality' by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean, where they presented a model named Word2Vec. This model learns vector representations of words from the contexts in which they appear: similar words get mapped to vectors of similar orientation, and co-occurring words are located close to each other in the embedding space.
In the 2014 research paper titled 'Neural Machine Translation by Jointly Learning to Align and Translate' by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, the attention mechanism was introduced for language translation tasks. This technique mimics the cognitive process of attention that allows a living being to select and concentrate on relevant stimuli: certain parts of the text are given more weight than others. In other words, attention enables the model to emphasize important words and de-emphasize irrelevant ones during training, which improves its ability to discover coherent aspects. He et al. combined word embeddings with attention and named their model Attention-based Aspect Extraction (ABAE).
Data description
We shall be using the CitySearch restaurant review corpus as the dataset for training and testing; it can be freely downloaded from this site, and the same dataset was used in the research paper for evaluating the model. There are 52,574 reviews available in the dataset, out of which only 3,400 are labeled. Six aspect categories are considered in the paper: Food, Staff, Ambience, Price, Anecdotes, and Miscellaneous. The researchers used the unlabeled data as the train set and the labeled data as the test set; both sets can be found on the researchers' Google Drive link.
Data preprocessing
After loading the test and train sets, we need to preprocess them. For this purpose, we are defining the preprocess() and complete_preprocess() functions.
Using the preprocess() function, we tokenize all the words in a review after converting them to lower case. The next step is to lemmatize these tokens: lemmatization reduces the size of the vocabulary (and hence the sparsity of the data matrix) by converting every word into its corresponding lemma, i.e., its canonical or dictionary form. Also note that we remove stop words, which don't add any value to the review.
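Below is a minimal sketch of what these two helpers might look like, assuming NLTK is used for tokenization, lemmatization, and stop-word removal, and that complete_preprocess() simply applies preprocess() to every review; the actual implementation may differ slightly.

```python
# A possible implementation sketch (assumes NLTK; not the exact original code).
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(review):
    # lower-case, tokenize, lemmatize, drop stop words, and re-join into a string
    tokens = word_tokenize(review.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemma for lemma in lemmas if lemma not in stop_words)

def complete_preprocess(reviews):
    # apply preprocess() to every review in the corpus
    return [preprocess(review) for review in reviews]
```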
For training the Word2Vec model, we need to provide the input as a list of lists: each review is converted into a sub-list of tokens, and the entire set becomes a list containing these sub-lists. This is done with the help of the split_list function.
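A hedged sketch of this step is given below, assuming each preprocessed review is a whitespace-separated string and that the embeddings are trained with gensim 4.x; the variable train_reviews and the Word2Vec hyperparameters are illustrative.

```python
# Splitting reviews into token lists and training Word2Vec on them (gensim 4.x).
from gensim.models import Word2Vec

def split_list(reviews):
    # one sub-list of tokens per review, the input format expected by Word2Vec
    return [review.split() for review in reviews]

train_tokens = split_list(train_reviews)   # train_reviews: output of complete_preprocess()
w2v_model = Word2Vec(sentences=train_tokens, vector_size=200, window=5,
                     min_count=10, workers=4)
```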
All reviews are available as string representations. Strings cannot be understood by deep neural network models, so we are required to encode them in a numerical format: every token present in the reviews is mapped to a unique integer. Ordinarily, every string, including numerals such as 8 and 9, would get its own mapping when present in a review, but as per the research paper all numbers should be represented by a single token, denoted by <num>, before mapping them to the corresponding representation. When converting the test data into its numerical representation, we may encounter words that aren't available in the train set; we represent such words by the token <unk> before mapping. To maintain uniformity in the input data, we also need to convert all these lists of tokens to a uniform length. This is done with the help of padding: the lists are brought to the same size by prepending or appending the <pad> token before mapping, and in our case we prepend each list with <pad> tokens. The width of each padded set is equal to the length of the longest sentence in the set. The mapping for the reserved tokens is {'<pad>': 0, '<unk>': 1, '<num>': 2}, from which we can infer that all the other words in the set get a value of 3 or more. Given below is a dummy example of the tokenized and padded representation (with a maximum length of 4) showing how the input should look before being fed to the model –
"I ate 2 donuts" -> ["I", "ate", "2", "donuts"] -> [3, 4, 2, 5]
"I dislike donuts" -> ["<pad>", "I", "dislike", "donuts"] ->
[0, 3, 6, 5]
"I drank" -> ["<pad>", "<pad>", "I", "drank"] -> [0, 0, 3, 7]
Model architecture
In the aforementioned paper where Word2Vec was introduced, the researchers also described a technique known as negative sampling that helps train word embeddings faster. In layman's terms, negative samples are reviews picked randomly from the collection of all reviews available in the train set. This means that for every review fed as input (called the positive or target sample), we pick a set of P reviews to be fed alongside it as negative samples. Here, P can be a small number such as 5 or 6, or a larger number such as 21 or 22. Please note that we aren't feeding the inputs to the model in the sequence in which they appear in the train set; this random sampling helps counter overfitting of the train data. To summarize: when the value of P is 5 and the batch size is 1, during each step of an epoch we present the model with 1 review randomly sampled from the train set as the positive input and 5 reviews randomly sampled with replacement from the same set as the negative samples. If the batch size were set to 32, we would feed 32×1 = 32 positive samples alongside 32×5 = 160 negative samples during any one step of an epoch. Here is the code for generating positive and negative samples –
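What follows is a hedged sketch of such a generator; the function name and batching details are assumptions rather than the author's original code.

```python
# Randomly drawing positive samples and P negative samples per positive sample.
import numpy as np

def sample_generator(data, batch_size=1024, neg_size=20):
    n = data.shape[0]
    while True:
        pos_idx = np.random.choice(n, size=batch_size)
        positives = data[pos_idx]                              # (batch_size, max_len)
        neg_idx = np.random.choice(n, size=(batch_size, neg_size))
        negatives = data[neg_idx]                              # (batch_size, neg_size, max_len)
        yield positives, negatives
```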
It is now time to understand ABAE in detail. Given below is the architecture of this model. An important point to note here is that this diagram only describes the processing of the positive samples.

We stack all the word embeddings generated by the Word2Vec model into a single matrix, which is regarded as the embedding matrix. The next step is to run the k-means clustering algorithm on this matrix and identify the cluster centers obtained once the algorithm converges. The number of cluster centers formed is equal to the value of k, a hyperparameter we set manually before training; k should be the number of different aspects that we wish to identify using the ABAE model. The weight matrix (denoted as T in the above diagram) of the rₛ layer is initialized with the normalized form of these cluster centers. This T matrix is also called the aspect matrix.
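A sketch of this initialization step is given below, assuming scikit-learn's KMeans is used for clustering; the value of n_aspects and the re-ordering of the embedding rows to match the token ids are details that depend on the full implementation.

```python
# Initializing the aspect matrix T with L2-normalized k-means cluster centers.
import numpy as np
from sklearn.cluster import KMeans

n_aspects = 14                              # k: the number of aspects we want to extract
embedding_matrix = w2v_model.wv.vectors     # all Word2Vec embeddings stacked row-wise

kmeans = KMeans(n_clusters=n_aspects, random_state=0).fit(embedding_matrix)
centers = kmeans.cluster_centers_
aspect_matrix = centers / np.linalg.norm(centers, axis=-1, keepdims=True)
```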
The biggest difference between the positive and negative samples is that the positive samples are fed to the attention layer, while the negative samples are not. However, all samples must be converted into word embeddings by passing them through a non-trainable Embedding layer before further processing; this layer is initialized with the embedding matrix created in one of the previous steps. Using TensorFlow's subclassing API, we design multiple custom layers for processing the samples.
The first custom layer is named Average. It calculates the average of all the word embeddings in a review, denoted by the term yₛ.
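A minimal sketch of this layer is shown below (masking of the <pad> positions is omitted for brevity).

```python
import tensorflow as tf

class Average(tf.keras.layers.Layer):
    def call(self, word_embeddings):
        # word_embeddings: (batch, sequence_length, embedding_dim)
        # y_s: (batch, embedding_dim), the mean of the word vectors in each review
        return tf.reduce_mean(word_embeddings, axis=1)
```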

Then, the positive sample embeddings and the yₛ values are fed to the attention layer. The term dᵢ denotes an intermediate product between the transpose of each word embedding, the matrix M, and yₛ. Here, M is trainable and is initialized using the Glorot uniform initializer, which draws samples from a uniform distribution within a limit whose value depends on the shape of the matrix. By applying softmax on dᵢ, we calculate the attention weights aᵢ.
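A sketch of such an attention layer is given below; it computes dᵢ = eᵢᵀ · M · yₛ for every word and applies softmax to obtain the weights aᵢ.

```python
import tensorflow as tf

class Attention(tf.keras.layers.Layer):
    def build(self, input_shape):
        embedding_dim = input_shape[0][-1]
        # M is trainable and Glorot-uniform initialized, as described above
        self.M = self.add_weight(name='M', shape=(embedding_dim, embedding_dim),
                                 initializer='glorot_uniform', trainable=True)

    def call(self, inputs):
        word_embeddings, y_s = inputs            # (batch, seq, dim) and (batch, dim)
        My = tf.matmul(y_s, self.M)              # (batch, dim)
        d = tf.einsum('bsd,bd->bs', word_embeddings, My)   # d_i = e_i^T M y_s
        return tf.nn.softmax(d, axis=-1)         # attention weights a_i
```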

The next custom layer is named WeightedSum. Using this layer, we compute the sum of the word embeddings weighted by their attention weights. This weighted sum is denoted by the term zₛ.
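A sketch of the WeightedSum layer:

```python
import tensorflow as tf

class WeightedSum(tf.keras.layers.Layer):
    def call(self, inputs):
        word_embeddings, attention_weights = inputs
        # z_s = sum_i a_i * e_i  ->  (batch, embedding_dim)
        return tf.einsum('bsd,bs->bd', word_embeddings, attention_weights)
```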

Using TensorFlow's native Dense layer, we multiply zₛ with randomly initialized weights, add the bias term, and then apply softmax. This layer is denoted as pₜ in the paper.
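In code, this is just a Dense layer with a softmax activation over the aspect dimension (the layer name is illustrative; n_aspects comes from the earlier sketch).

```python
import tensorflow as tf

# p_t: aspect weights of shape (batch, n_aspects)
p_t_layer = tf.keras.layers.Dense(n_aspects, activation='softmax', name='p_t')
```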

The next step is to pass the negative samples through the custom Average layer. The output is a list containing the average embedding of each negative sample, denoted by zₙ. The last layer is rₛ, which we mentioned earlier. The relationship between rₛ and pₜ is rₛ = Tᵀ · pₜ, i.e., the reconstructed sentence embedding is a linear combination of the aspect embeddings weighted by pₜ.
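One way to express rₛ = Tᵀ · pₜ is a bias-free Dense layer whose kernel plays the role of the aspect matrix T and is initialized with the normalized k-means centers computed earlier; this is a sketch, not necessarily the author's exact construction.

```python
import tensorflow as tf

# r_s = T^T · p_t: maps the aspect weights back into the embedding space.
# embedding_matrix and aspect_matrix come from the k-means initialization sketch.
r_s_layer = tf.keras.layers.Dense(
    embedding_matrix.shape[1],                                   # embedding dimension
    use_bias=False, name='r_s',
    kernel_initializer=tf.constant_initializer(aspect_matrix))   # kernel acts as T
```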

Autoencoders are unsupervised neural network architectures used to compress data. Both the inputs and outputs are of the same size. There is a bottleneck layer in between that is used to compress data into the required size. The section from the input layer to the bottleneck layer is termed encoder and the section from the bottleneck layer to the output layer is termed decoder. This network aims to reconstruct the input compressed by the bottleneck layer at the output. By training the model, we are trying to minimize the reconstruction error.

The entire ABAE model is a customized autoencoder: the section from the input layer to the pₜ layer is the encoder, and the section from the pₜ layer to the rₛ layer is the decoder. We aim to minimize the reconstruction error at the rₛ layer.
In the 2016 paper titled 'Feuding Families and Former Friends: Unsupervised Learning for Dynamic Fictional Relationships' by Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III, the researchers describe the concept of a contrastive max-margin loss. This loss function heavily penalizes the model if the negative sample embeddings are similar to the reconstructed embedding. In other words, it tries to maximize the inner product between rₛ and zₛ while minimizing the inner product between rₛ and nᵢ (where nᵢ is an element of the list zₙ). We use a custom layer named HingeLoss to perform the necessary calculations in this step. The formula for the loss function is given as –
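In the ABAE paper this loss is J(θ) = Σₛ Σᵢ max(0, 1 − rₛzₛ + rₛnᵢ), summed over the negative samples nᵢ of each sentence. A hedged sketch of the HingeLoss layer is given below (the additional normalization of zₛ, rₛ, and nᵢ is omitted for brevity).

```python
import tensorflow as tf

class HingeLoss(tf.keras.layers.Layer):
    def call(self, inputs):
        z_s, r_s, z_n = inputs                    # (b, d), (b, d), (b, m, d)
        pos = tf.reduce_sum(z_s * r_s, axis=-1, keepdims=True)   # r_s · z_s, shape (b, 1)
        neg = tf.einsum('bmd,bd->bm', z_n, r_s)                  # r_s · n_i, shape (b, m)
        loss = tf.reduce_sum(tf.maximum(0.0, 1.0 - pos + neg), axis=-1)
        return tf.reduce_mean(loss)
```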

To prevent overfitting, we also need to regularize the model. For this purpose, the paper describes a customized regularization term computed from the normalized form of the aspect matrix, which encourages the learned aspect embeddings to stay distinct. This regularization is applied to the Dense rₛ layer using a custom function.
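A sketch of this regularizer, U(θ) = ‖Tₙ·Tₙᵀ − I‖ where Tₙ is the row-normalized aspect matrix, is shown below; the weighting factor is an assumption.

```python
import tensorflow as tf

def ortho_regularizer(weight_matrix, factor=0.1):
    # penalize non-orthogonality between the (row-normalized) aspect embeddings
    t_n = tf.math.l2_normalize(weight_matrix, axis=-1)
    gram = tf.matmul(t_n, t_n, transpose_b=True)
    identity = tf.eye(tf.shape(gram)[0])
    return factor * tf.norm(gram - identity)

# attached to the r_s layer via its kernel_regularizer argument, e.g.
# tf.keras.layers.Dense(..., kernel_regularizer=ortho_regularizer)
```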

We will be using RMSProp as the optimizer for training this model, with a learning rate of 1e-02 and an epsilon value of 1e-06. It is also necessary to set the clipnorm value while training this model; in our case, we set it to 10, which means the L2 norm of the gradients is capped at that value. We are training the model for 15 epochs with 182 batches in each epoch. The batch size is set to 1024 and the negative sampling rate is 20. Whenever the loss calculated in an epoch is smaller than the previously recorded smallest loss, we display the top 50 words derived for each aspect along with their similarity scores. Given below is the code for training –
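Below is a hedged sketch of this training configuration and loop; abae_model stands for the assembled Keras model whose output is the hinge loss, and is an assumption about how the pieces above are wired together.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-2, epsilon=1e-6, clipnorm=10)

epochs, steps_per_epoch = 15, 182
generator = sample_generator(train_ids, batch_size=1024, neg_size=20)

best_loss = float('inf')
for epoch in range(epochs):
    epoch_loss = 0.0
    for step in range(steps_per_epoch):
        positives, negatives = next(generator)
        with tf.GradientTape() as tape:
            loss = abae_model([positives, negatives], training=True)
            if abae_model.losses:                      # add the orthogonality penalty
                loss += tf.add_n(abae_model.losses)
        grads = tape.gradient(loss, abae_model.trainable_variables)
        optimizer.apply_gradients(zip(grads, abae_model.trainable_variables))
        epoch_loss += float(loss)
    epoch_loss /= steps_per_epoch
    if epoch_loss < best_loss:
        best_loss = epoch_loss
        # here the top 50 words per aspect and their similarity scores are printed
```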
From the printed words, we must manually infer the aspects and create a cluster map based on this information. Now it's time to plot the loss curve using TensorBoard. As you can see, the model converges at a loss value of 4.7.

It is now time to perform prediction on the test set. We will create a new model that reuses the layers of the previously trained ABAE model: its input is the input layer designed for the positive samples, and its output is the pₜ layer.
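A sketch of this inference model is given below, assuming the ABAE model was built with the functional API so its inputs and layers can be reused, and that the pₜ Dense layer was named 'p_t'.

```python
import numpy as np
import tensorflow as tf

# reuse the positive-sample input (assumed to be the first model input) and the p_t layer
aspect_model = tf.keras.Model(inputs=abae_model.inputs[0],
                              outputs=abae_model.get_layer('p_t').output)

aspect_probs = aspect_model.predict(test_ids)           # (n_test, n_aspects)
predicted_clusters = np.argmax(aspect_probs, axis=-1)   # dominant aspect per review
```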
The researchers filtered the test set based on the aspects, keeping only those reviews that belong to the Food, Staff, or Ambience aspect. Since this model cannot identify more than one aspect in a single review, reviews labeled with multiple aspects were also removed. After this step, the final task is to perform prediction and generate the classification report. As you can see, the performance of the ABAE model seems to be fair!

Baseline model
Topic modeling is an unsupervised method for classifying documents by finding natural groups of items (called topics in this context). This technique has long been used because it can automatically organize, search, and summarize data efficiently. The most popular topic modeling algorithm is Latent Dirichlet Allocation (LDA), and it serves as the baseline model for the aspect-based sentiment analysis task. We will be comparing the results of our ABAE model with the results obtained using the LDA model.
A document can belong to multiple topics in varying proportions. Each document (a review, in our case) is a list of words, and what we really want to figure out is the probability of each word belonging to each of the topics. Each row in the table represents a different topic and each column a different word present in the dataset; each cell contains the probability that the word (column) belongs to the topic (row).

For our task, these topics are simply the aspects, so the number of topics derived is equal to the number of aspects required for our analysis. An important assumption made about the words is that their ordering and grammatical structure are unimportant; treating words as independent in this way leads to degraded performance. Nevertheless, we shall be training the LDA model on our train set with the batch size set to 1024. Given below is the code for displaying the top 50 words belonging to each of the aspects.
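What follows is a hedged sketch of the LDA baseline using gensim; chunksize plays the role of the batch size, and the topic count is set to the number of aspects.

```python
from gensim import corpora
from gensim.models import LdaModel

# train_tokens: token lists from the preprocessing step; n_aspects from earlier
dictionary = corpora.Dictionary(train_tokens)
corpus = [dictionary.doc2bow(tokens) for tokens in train_tokens]

lda_model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=n_aspects, chunksize=1024, passes=1)

# top 50 words per topic/aspect along with their weights
for topic_id in range(n_aspects):
    print(topic_id, lda_model.show_topic(topic_id, topn=50))
```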
On printing the classification report, we can observe that the ABAE model performs better on the aspect extraction task.

Conclusion
As you can see, neural network-based models work well for aspect-based sentiment analysis tasks, and the attention mechanism deserves much of the credit for this performance.
Deployment
A screenshot of the deployment is shown below. A video recording of the same can be viewed here.

Future work
Our model can only handle reviews that contain no more than one aspect. In addition, we need to use other models when we wish to evaluate the sentiment associated with any given aspect. Future research in this field should focus on these two issues.
Links
Github repository: https://github.com/Ashcom-git/case-study-2
LinkedIn: https://www.linkedin.com/in/ashwin-michael-10b617142/
References
Applied Roots. [online] Available at: https://www.appliedaicourse.com/.
Ruidan He, Wee Sun Lee, Hwee Tou Ng and Daniel Dahlmeier (2017). An Unsupervised Neural Attention Model for Aspect Extraction. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013).
Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations.
Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber and Hal Daumé III (2016). Feuding Families and Former Friends: Unsupervised Learning for Dynamic Fictional Relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Yanlin Chen (2019). How to build a LDA Topic Model using from text. [online] Medium. Available at: https://medium.com/@yanlinc/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6.
Susan Li (2018). Topic Modeling and Latent Dirichlet Allocation (LDA) in Python. [online] Medium. Available at: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24.
Ria Kulshrestha (2020). Latent Dirichlet Allocation(LDA). [online] Medium. Available at: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2.
Sanne de Roever (2020). Aspects, the better topics? Applying unsupervised aspect extraction on Amazon cosmetics reviews. [online] Medium. Available at: https://medium.com/@sanne.de.roever/aspects-the-better-topics-applying-unsupervised-aspect-extraction-on-amazon-cosmetics-reviews-9d523747f8e5.