
Where should I eat after the pandemic? (Part 1/2)

Decision Making with Aspect-Based Sentiment Analysis using Transformers

We’ve all been in this situation before: looking up reviews for some new product we are interested in or some restaurant we’ve wanted to try. We glean information from one review to the next, each one rendering a more precise picture of the product. We keep scrolling and scrolling when suddenly an hour has passed, and we are no closer to an answer than we were before. We realize this is a futile endeavor and that we’ll be searching endlessly for the decisive factor in how we spend our hard-earned money. Well, in this tutorial, I’m going to put this Sisyphean task to rest using Aspect-Based Sentiment Analysis (ABSA). Spend less time sifting through reviews and more time enjoying the product or service. Let’s get started!

Photo by natsuki on Unsplash

Introduction

Due to the pandemic, I have an insatiable craving for travel and something good to eat. I wanted to figure out where I should go once the situation allows for it, without spending my time scouring the Internet. By automating the process above, I analyzed over 1 million reviews to select a restaurant best suited to my individual preferences. Before we get into the specifics of that, let’s start with a more straightforward example.

Photo by Omid Armin on Unsplash

Example

I’m looking to buy wireless earphones, and I’m trying to decide between AirPod Pros and Powerbeats Pros. The following are some features to consider before buying: sound quality, comfort, noise-canceling quality, and battery life. I assign an importance weight to each feature and comb through reviews to gauge how each product performs on it, producing a table like this:

Earbud Comparison

Conveniently, these features and the corresponding feature ratings are on Amazon, so I didn’t have to do the grunt work of coming up with these values myself. By multiplying each feature rating by its importance weight and summing across all features, I get each product’s final score. For example, the evaluation of the AirPod Pros looks like the following:
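The arithmetic is a simple weighted sum. Here is a minimal sketch with made-up weights and ratings; the actual values from my comparison table will differ:

```python
# Hypothetical importance weights and feature ratings (out of 5) for illustration only.
weights = {"sound quality": 0.4, "comfort": 0.3, "noise canceling": 0.2, "battery life": 0.1}
airpod_ratings = {"sound quality": 4.5, "comfort": 4.7, "noise canceling": 4.6, "battery life": 4.2}

# Final score = sum over features of (importance weight * feature rating)
final_score = sum(weights[f] * airpod_ratings[f] for f in weights)
print(f"AirPod Pros final score: {final_score:.2f}")  # 4.55 with these made-up numbers
```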

Comparing the final scores, we can see the AirPod Pros beat out (ha!) the Powerbeats Pros. Having a systematic way to arrive at my decisions is phenomenal, but there are some practical issues. For example, this analysis took me less than 10 minutes, but it was also between products that already had aspect-based ratings on Amazon. Since these are not readily available for most other products, I would instead have to sift through an excessive number of reviews to produce aspect-based ratings for each product. Another thing to consider is that there were only two products in this case; comparing more products in this manner would turn my 10-minute analysis into a lengthy 2–3 hour endeavor.

So how can we reconcile methodical decision making with efficiency? Automation. By automating this process, we can make our decision much faster. The framework is the same as in the earbud example: choose the aspects that matter, weight them by importance, use ABSA to turn raw reviews into per-aspect ratings, and combine those ratings into a weighted score for each option.

Since this process can work for any product or service, I decided to try it out with restaurants from the Yelp dataset. I’ve always had trouble deciding where I want to eat, often spending so much time on Yelp that it’s past dinner time. Dining out isn’t much of an option right now – given the pandemic – so I wanted to find the best place to have an incredible meal after everything is over. Without further delay, let’s jump in!

Datasets

First things first, let’s import all of the libraries we will be using.
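The original import cell isn’t reproduced here; based on the tools used throughout this tutorial (PyTorch, the transformers library, scikit-learn for metrics, and plotting), a plausible set of imports looks like this:

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef
import matplotlib.pyplot as plt
import seaborn as sns
```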

SemEval-2014

We will only need the SemEval-2014 dataset for this part of the tutorial. In the SemEval-2014 restaurant data, a review’s aspects are food, service, price, ambience, and anecdotes. Each aspect mentioned in a review carries exactly one label: positive, negative, conflict, or neutral; aspects a review does not mention are labeled none when we construct the sentence pairs. You can retrieve this dataset from my GitHub; however, this is already taken care of in the preprocessing functions we’re going to build.

With all of our data ready to go, let’s move on to training!

Model

A few different models can be chosen for this task, ranging from rule-based models to deep learning. After looking through the literature, I came across a paper utilizing Transformers for the ABSA task [1]. The authors converted the ABSA task to a sentence-pair classification task, enabling them to use BERT for their model. The following table shows the results of the paper:

Source: Reprinted from "Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence," by C. Sun, L. Huang, and X. Qiu, 2019, arXiv [1].

The authors performed the fine-tuning on the bert-base-uncased model from the transformers library, reporting 85.1% accuracy with the BERT-pair-NLI-M model and 85.9% accuracy with the BERT-pair-QA-B model. Since the only difference between the two is the method of preprocessing, I’m choosing BERT-pair-NLI-M. Although it isn’t the highest performing, it is faster than BERT-pair-QA-B. The BERT-pair-QA-B model requires constructing 25 auxiliary sentences to perform a full aspect classification of a review. In contrast, BERT-pair-NLI-M only requires five, each of which is just the single aspect word, to complete the same task. For my purposes, the difference in accuracy between the two models is negligible. In other words, the 0.8% increase in accuracy is not substantial enough to justify a roughly 5x slowdown.

Preprocessing

For the sentence-pair classification task, the model takes as input two sentences and outputs a single label. We treat the aspect we are interested in as the second input sentence. Therefore, preprocessing is as follows:

Preprocessing Example
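To make the construction concrete, here is a small sketch of how a single review expands into five sentence pairs, one per aspect (the review text is made up for illustration):

```python
aspects = ["food", "service", "price", "ambience", "anecdotes"]
review = "The waiter we had was horrible, but the food was amazing."

# One (review, aspect) pair per aspect; the model predicts one label per pair.
sentence_pairs = [(review, aspect) for aspect in aspects]
# Expected labels for this review would be roughly:
#   food -> positive, service -> negative, price/ambience/anecdotes -> none
```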

To prepare the dataset for training and evaluation, I use the following code to load the SemEval-2014 dataset from my GitHub repository and prepare it for PyTorch:
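The original loading code isn’t reproduced here; as a stand-in, the following is a minimal sketch of the same idea, assuming the data has been flattened into CSV files with sentence, aspect, and label columns. The file names and column names are placeholders, not the repository’s actual layout:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer

LABELS = ["positive", "negative", "neutral", "conflict", "none"]
label2id = {label: i for i, label in enumerate(LABELS)}

class SemEvalDataset(Dataset):
    """Wraps (sentence, aspect, label) rows as tokenized sentence pairs."""

    def __init__(self, df, tokenizer, max_length=128):
        self.encodings = tokenizer(
            df["sentence"].tolist(),  # first sentence: the review text
            df["aspect"].tolist(),    # second sentence: the aspect word
            truncation=True,
            padding="max_length",
            max_length=max_length,
            return_tensors="pt",
        )
        self.labels = torch.tensor([label2id[label] for label in df["label"]])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

# Hypothetical file names -- adjust to match the repository layout.
train_df = pd.read_csv("semeval2014_restaurants_train.csv")
test_df = pd.read_csv("semeval2014_restaurants_test.csv")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_dataset = SemEvalDataset(train_df, tokenizer)
test_dataset = SemEvalDataset(test_df, tokenizer)
```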

Train

Now that we’ve narrowed the scope of models to sentence-pair classification using transformers in conjunction with the preprocessing method defined above, we still have several options for the specific model we want to fine-tune. Given the recent developments in approximating BERT with fewer parameters, I want to see if any of those models will perform better than BERT. If the same or greater accuracy is achievable with a smaller model, I’ll be able to process the Yelp dataset much faster without sacrificing performance. The following is a list of models to consider from the transformers library with the corresponding hyperparameters to be used when training:

Model Comparisons

The number of epochs for each model is the number needed for its validation loss to plateau. I’ll use the following code to train each model:
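The original training code isn’t reproduced here; as a stand-in, here is a minimal fine-tuning sketch using the transformers Trainer, building on the dataset objects from the preprocessing step. The hyperparameter values shown are placeholders rather than the exact ones from the table above:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def train_model(checkpoint, train_dataset, eval_dataset, epochs, learning_rate=2e-5, batch_size=32):
    # num_labels matches the five sentiment classes defined during preprocessing.
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=len(LABELS))
    args = TrainingArguments(
        output_dir=f"./absa-{checkpoint.split('/')[-1]}",
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        learning_rate=learning_rate,
        evaluation_strategy="epoch",  # check validation loss each epoch to see when it plateaus
        logging_steps=50,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return trainer

# Example: fine-tune distilbert; the other candidate models are trained the same way.
trainer = train_model("distilbert-base-uncased", train_dataset, test_dataset, epochs=3)
```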

Evaluate

After training all the models, I evaluate each based on accuracy and the Matthews correlation coefficient (MCC). One could include other metrics such as the macro F1-score, but MCC is more informative than most single-score metrics because it accounts for every cell of the confusion matrix, even when classes are imbalanced. For more details on MCC, you can visit its Wikipedia page. In the multiclass case, the MCC is computed directly from the confusion matrix.
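For reference, scikit-learn handles the multiclass MCC directly; a quick illustration with toy labels:

```python
from sklearn.metrics import matthews_corrcoef

y_true = ["positive", "none", "negative", "none", "neutral", "positive"]
y_pred = ["positive", "none", "negative", "positive", "neutral", "negative"]

# 1.0 is perfect agreement, 0.0 is no better than chance, -1.0 is total disagreement.
print(matthews_corrcoef(y_true, y_pred))
```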

In addition to the quality metrics, I’ll also time how long each model takes to evaluate the test set. The following is the code I’ll be using to evaluate each model:
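The original evaluation code isn’t reproduced here; a minimal sketch of the same idea, reusing the test dataset and trainer from the sketches above, might look like this:

```python
import time
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

def evaluate(model, dataset, batch_size=64):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).eval()
    loader = DataLoader(dataset, batch_size=batch_size)

    preds, labels = [], []
    start = time.time()
    with torch.no_grad():
        for batch in loader:
            targets = batch.pop("labels")
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits
            preds.extend(logits.argmax(dim=-1).cpu().tolist())
            labels.extend(targets.tolist())
    elapsed = time.time() - start

    return {
        "accuracy": accuracy_score(labels, preds),
        "mcc": matthews_corrcoef(labels, preds),
        "seconds": elapsed,
        "confusion_matrix": confusion_matrix(labels, preds),
    }

results = evaluate(trainer.model, test_dataset)
```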

Model Comparison

From the table, we can see that bert-tiny, while being quite fast, does not have a high enough accuracy or MCC to match the other models. Of the rest, bert-small does quite well at 88.8% accuracy and is also ~7x faster than bert and ~3x faster than distilbert. Interestingly, distilbert performs slightly better than bert across all displayed metrics except MCC, where it trails by only a negligible amount. Essentially, what we’ve found is that distilbert is overall better than bert on this task and is ~2x faster! So the choice really comes down to bert-small or distilbert. As a subjective decision, I’m going to choose distilbert because I value its 2.4% increase in accuracy over bert-small more than the loss in speed. Here is a quick look at the confusion matrix for distilbert:

Confusion Matrix Heatmap
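For reference, a heatmap like this can be produced with seaborn from the confusion matrix returned by the evaluation sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Integer label ids follow the order of LABELS, so the rows/columns line up
# (assuming every class appears in the test set).
cm = results["confusion_matrix"]
sns.heatmap(cm, annot=True, fmt="d", xticklabels=LABELS, yticklabels=LABELS, cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("distilbert confusion matrix on the SemEval-2014 test set")
plt.show()
```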

We can see that the model predicts the none and positive labels well while performing sub-par on the negative and neutral labels. The conflict label, however, seems to give the model the most trouble. This is hard to overcome: a conflicting sentence contains both positive and negative sentiment toward the same aspect, and without a sufficient number of examples illustrating when a sentence is conflicting, the model will not learn to distinguish conflict from plain positive or negative. Nevertheless, we will move forward using distilbert.

Examples

To get an idea of how our classifier makes decisions, let’s look at a few examples and see how it performs, starting with the following review that I made up, followed by the model’s classification.

We had a good time at the restaurant.
Example 1
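For reference, here is a minimal sketch of how a classification like this can be generated, reusing the fine-tuned trainer, tokenizer, and LABELS list from the sketches above:

```python
import torch

ASPECTS = ["food", "service", "price", "ambience", "anecdotes"]

def classify_review(review, model, tokenizer):
    """Pair the review with every aspect and return the predicted label per aspect."""
    model.eval()
    inputs = tokenizer([review] * len(ASPECTS), ASPECTS, padding=True, truncation=True, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return {aspect: LABELS[i] for aspect, i in zip(ASPECTS, logits.argmax(dim=-1).tolist())}

print(classify_review("We had a good time at the restaurant.", trainer.model, tokenizer))
```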

The model is correct in this example; there is nothing in the sentence specific about food, service, price, or ambience. There is, however, positive sentiment toward a specific anecdotal experience about going there. This case was simple, so I’m going to add another sentence discussing the food.

We had a good time at the restaurant. The food was delicious.
Example 2

In this case, the model correctly classifies that the food was good. However, it missed the positive anecdotal portion of the review. I tried running it with the sentences reversed, but the classifier made the same mistake.

Now I’m going to add some information about the service.

The waiter we had was horrible, but the food was amazing.
Example 3

This was a good example to test because it contained positive and negative sentiments about two different categories in the same sentence. The model correctly identified the sentiment for both of the categories.

Lastly, I’ll try a longer review and add another category.

My girlfriend is still not even speaking to me after I took her on a date here last week. The atmosphere was nice, but the food here was disgusting and the service was even worse. I will never in my life go back to this pizzeria.
Example 4

The model was correct for all categories except for the anecdotes, which should have been negative. The model did, however, do a nice job considering the review spans multiple sentences. Despite its good performance here, the model does poorly when reviews grow longer than a few sentences. There are ways to handle this: when I ran the model over every review in the dataset, I split each review into sentences using spaCy, ran the model over each sentence, and aggregated the outputs back into one prediction by taking the mean. If you feel up for it, try to take that next step as well; for the sake of time, we won't run it in this article, but the rough idea looks like the sketch below.
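This sketch reuses the model, tokenizer, ASPECTS, and LABELS from the earlier examples. Note that it averages the logits across sentences, which is just one reasonable reading of "taking the mean"; averaging probabilities would be another:

```python
import spacy
import torch

# Requires the small English pipeline: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def classify_long_review(review, model, tokenizer):
    """Score each sentence separately, then average the logits across sentences per aspect."""
    sentences = [sent.text for sent in nlp(review).sents]
    per_sentence_logits = []
    for sentence in sentences:
        inputs = tokenizer([sentence] * len(ASPECTS), ASPECTS, padding=True, truncation=True, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            per_sentence_logits.append(model(**inputs).logits)
    mean_logits = torch.stack(per_sentence_logits).mean(dim=0)  # mean over sentences
    return {aspect: LABELS[i] for aspect, i in zip(ASPECTS, mean_logits.argmax(dim=-1).tolist())}
```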

Conclusion

Now that we understand more about the model’s behavior, it’s time to explore the Yelp dataset! In the next article, I’ll delve into using the Yelp dataset to decide where to eat after the pandemic.

Stay tuned!

References

[1] Chi Sun, Luyao Huang, and Xipeng Qiu, Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence (2019), arXiv preprint arXiv:1903.09588

