
When the Parts May be Worth More than the Whole: Feature- vs. Review-Level Sentiment Analysis

By Aadit Barua and Josh Barua, Westlake High School, Austin, TX.



Photo by Markus Winkler on Unsplash


The main idea

We develop and test a simple method for extracting sentiments regarding individual features of a product or service from user reviews. The motivation is that users may mention multiple features in their reviews, and may hold different or even opposing opinions regarding such features. For example, a customer may like the sound quality of a pair of headphones but dislike the noise cancellation feature. Yet, performing sentiment analysis on the whole review may yield misleading or incorrect results regarding the sentiment toward the individual features. Thus, instead of passing an entire review through a sentiment analyzer, we first pre-process the text to extract windows of words around a feature under the assumption that users express their emotions close to a feature word. Using reviews of running shoes, we establish the ground truth for each feature sentiment score and show that the error associated with our approach is significantly smaller than that obtained by using whole reviews.

Why bother with feature-level sentiments?

Sentiment analysis plays a critical role in many NLP applications, including recommender systems and understanding consumer opinions toward brands and products. Reviews often refer to multiple features or attributes. Some examples are provided below in Table 1.

Table 1: Frequently mentioned features of products

It is common to perform sentiment analysis on full reviews. However, the sentiment of an overall review may not correctly reflect the sentiment toward individual features. Consider some reviews shown in Table 2. The feature words are shown in bold.

Table 2: Feature-level versus overall sentiments

In these examples, the users express opposite opinions about the two features, so the overall sentiment does not correctly reflect the sentiment toward either feature.

When is feature-level sentiment analysis useful?

Feature-level sentiment analysis is useful in many applications. Recommender systems that take consumer preferences of product features as inputs can make more relevant suggestions using sentiment scores for individual features. We can accurately compare products based on how consumers feel about specific features (e.g., camera and battery quality of two smartphones). Offering detailed feedback to a company to help improve its product or service depends on our ability to obtain feature-level sentiments. For example, from the sentiment analysis of a hotel review, we can tell the hotel management that the consumer liked its property. But performing analysis at the feature level may provide the insight that while the guest was thrilled with the service, the experience with the amenities was underwhelming. The extremely positive sentiment toward the first feature (service) may mask the slightly negative opinion of the second (amenities).

A simple unsupervised approach

There are sophisticated supervised methods for extracting feature-level sentiments, but here we outline a simple unsupervised approach. We assume that sentiment-bearing words, such as adjectives, are likely to be located close to a feature word rather than far away from it. Table 3 shows three reviews, the features contained therein, the sentiment-bearing words, and the distance (the number of words) between each feature and sentiment-bearing word. We ignore stopwords in our count; the Stanford NLP page provides a list of common stopwords as well as words that behave much like stopwords.

Table 3: Distance between feature and sentiment bearing words
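As a concrete illustration, the word distances in Table 3 can be computed by dropping stopwords before counting. Here is a minimal sketch; the short stopword list below is illustrative only, not the full Stanford list:

```python
# Count the number of words separating a feature word from a
# sentiment-bearing word, skipping stopwords, as in Table 3.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "and", "to"}

def word_distance(review, feature, sentiment_word):
    # Simple whitespace tokenization for illustration; strip punctuation
    tokens = [w.strip(".,!?").lower() for w in review.split()]
    # Drop stopwords before measuring distance
    tokens = [w for w in tokens if w not in STOPWORDS]
    # Number of words between the two terms after stopword removal
    return abs(tokens.index(feature) - tokens.index(sentiment_word)) - 1

print(word_distance("The cushion is great", "cushion", "great"))  # 0
```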

To extract sentiment-bearing words for a feature word, we propose extracting a window of words around the latter as shown in Figure 1.

Figure 1: A window of 3 words around a feature word

For instance, consider a real review in our data (we removed the brand and product names in this rather negative review):

"After more than 10 pairs of (brand name) with hundreds of miles on them, I tried these as a replacement. Unfortunately, they are a big step back in comfort and quality. Overall, I found (product 1) more narrow than (Product 2), but 6 mm drop felt about the same. For a neutral shoe, the arch support feels really big because of how narrow they fit me. I like the added mesh on the upper shoe, however, I found the (product 1) to be lacking in the cushion on the front half of the shoe."

If we are interested in the user’s sentiment regarding cushion, we extract the following with a window size of 3 (after removing stopwords):

"found product lacking cushion front half shoe"

The sentiment is captured by the single word "lacking". We use this phrase as the input to VADER for unsupervised sentiment analysis. It is also possible that there will be multiple mentions of a feature in a review. In that case, we extract multiple windows for the same feature from a given review. The code to extract windows of words is shown below:

from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

limit = 3  # number of words on either side of the feature word in the window
attribute = "cushion"  # desired feature
reviews_list = ["I love the cushion, but hate the support.",
                "Even though the shoe was durable, it had very little cushion and made my feet sore after runs."]  # list of product reviews

attribute_position_list = []  # positions where the feature word was found
review_with_attribute_list = []  # tokenized reviews containing the feature word
for review in reviews_list:
    word_tokens = word_tokenize(review)  # tokenize each review
    for position, word in enumerate(word_tokens):
        if word.find(attribute) > -1:  # substring match also catches e.g. "cushioned"
            attribute_position_list.append(position)
            review_with_attribute_list.append(word_tokens)

for index, review in enumerate(review_with_attribute_list):
    # keep only the tokens within `limit` positions of the feature word
    limited_sentence_list = [word for item, word in enumerate(review)
                             if abs(item - attribute_position_list[index]) <= limit]
    parsed_review = ' '.join(limited_sentence_list)
    print(parsed_review)
# I love the cushion , but hate
# had very little cushion and made my

The VADER code is shared below:

from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

def get_sentiment(parsed_review):  # takes the parsed window of words
    sid = SentimentIntensityAnalyzer()
    score = sid.polarity_scores(parsed_review)
    compound = score.get('compound')
    return compound  # compound sentiment score, in [-1, 1]
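When a feature is mentioned several times in one review, we obtain one compound score per extracted window. How to combine them is not prescribed above; one simple choice (our assumption, for illustration) is to average the scores:

```python
def feature_sentiment(window_scores):
    """Average the VADER compound scores of all windows extracted
    for one feature in one review (a simple aggregation choice)."""
    if not window_scores:
        return None  # feature not mentioned in the review
    return sum(window_scores) / len(window_scores)

# e.g. two mentions of "cushion" with different sentiments
print(feature_sentiment([1.0, 0.0]))  # 0.5
```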

Testing our approach

We use a set of reviews of running shoes to test our approach. First, we use VADER to obtain sentiment scores for full reviews. Then we use the parser above to extract windows of words around a feature word. We use two features: cushion and support. The extracted words for each feature are used as inputs to VADER. The correlation between the two feature sentiments is 0.44 (Table 4), which is not particularly high. The correlations between the individual feature sentiments and the overall sentiment are even lower, as shown in Table 4, which suggests that the overall sentiment will not give us a good indication of how a user feels about a particular feature.

Table 4: Correlation between sentiments for the whole review, feature 1 and feature 2
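Correlations like those in Table 4 are standard Pearson coefficients between the score series. A small self-contained sketch; the scores below are made up for illustration, not our data:

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length score lists
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical sentiment scores per review (illustrative only)
overall = [0.9, -0.3, 0.5, 0.1, -0.6]
cushion = [0.8, -0.7, 0.6, -0.2, -0.5]
support = [0.4, 0.6, -0.1, 0.3, -0.8]
print(pearson(cushion, support))
print(pearson(overall, cushion))
```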

To establish the ground truth, we manually read each review and score the sentiment for each of the two features. We calculate two root mean square error (RMSE) values for each feature: one uses the ground truth scores and the sentiment scores for the whole review, and the other uses the ground truth scores and the feature-level sentiment scores. The formula for RMSE in our setting is RMSE = √((1/n) Σᵢ (sᵢ − tᵢ)²), where n is the number of reviews, tᵢ is the ground truth sentiment score for a feature in review i, and sᵢ is the corresponding estimated score (whole-review or feature-level).
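As a sketch, the RMSE between the ground-truth feature scores and either set of predicted scores can be computed as follows (the scores shown are illustrative, not our data):

```python
from math import sqrt

def rmse(predicted, truth):
    # Root mean square error between predicted and ground-truth sentiment scores
    return sqrt(sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth))

# Illustrative scores for one feature across four reviews
truth = [0.8, -0.5, 0.3, -0.9]
whole_review_scores = [0.2, 0.4, 0.3, -0.1]  # whole-review sentiment
window_scores = [0.7, -0.4, 0.2, -0.8]       # feature-window sentiment
print(rmse(whole_review_scores, truth))
print(rmse(window_scores, truth))
```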

Table 5: RMSE values for sentiment analysis using whole reviews and windows of words

Conclusion

Sentiment analysis is a critical element of NLP, and its usefulness depends on the insights it can provide into customer preferences. The ability to offer detailed guidance and advice to a company regarding its products or services depends on our ability to conduct sentiment analysis at the feature or attribute level. We demonstrated a simple method in this article. We hope you will try out more sophisticated approaches. Perhaps the idea of extracting part-of-speech bigrams and analyzing semantic orientation used by Peter Turney (2002) can be adapted to our setting. Maybe the BIOES labeling scheme can be applied in a modified form. But whatever you may do, have fun!

References

Turney, Peter D. "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews." arXiv preprint cs/0212032 (2002).

