Benchmarking Off-The-Shelf Sentiment Analysis

ZhongTr0n
Towards Data Science
9 min read · Mar 8, 2020


Source: https://pixabay.com/

Introduction

Power BI is steadily adding and facilitating AI capabilities. One of those capabilities, in the area of text analysis, is sentiment scoring. For those not familiar, the idea is very simple: you feed some text to a pretrained model and it spits out a score ranging from 0 to 1, with 0 being highly negative and 1 being highly positive. In other words, a text like “What a great amazing product” should return a high value (close to 1), while a text like “Terrible, bad experience” should return a value approaching 0.

Whereas you previously needed some coding skills on top of a basic familiarity with NLP, a growing array of platforms now offer “off the shelf” solutions where all you have to do is click a few buttons.

Power BI is one of those solutions. Using this tutorial, you can try it for yourself, provided you have Power BI Premium capacity.

Source: https://docs.microsoft.com/en-us/power-bi/desktop-ai-insights

The Experiment

Setup

As I was curious about the performance of Power BI’s sentiment analysis, I came up with a very basic experiment to test it: compare the sentiment score generated from a comment with the actual rating its author gave. I went looking for a dataset of product reviews containing both a review as a string and a numerical rating.

The idea was to have Power BI generate a rating from the review the author wrote and compare it with the actual score given by the user. My hypothesis was that the user’s score reflects the actual sentiment of the comment he or she wrote.

Methodology

There are many ways to evaluate the performance of AI models, and they vary greatly depending on the type of data and the interpretation you would like to obtain. For this test, I opted for the root mean squared error, or simply RMSE.

The RMSE is often used to assess the performance of quantitative predictions. It measures the error for each observation; by squaring the differences, larger errors receive a larger penalty. For interpretation, a lower RMSE indicates a better-performing model.
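
As a quick illustration, the RMSE can be computed in a couple of lines of Python. The helper below is a minimal sketch, not the exact code used in the notebook:

import numpy as np

def rmse(actual, predicted):
    # Square the per-observation errors, average them, then take the root
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

print(rmse([5, 1, 4, 2], [4.5, 2, 4, 1]))  # 0.75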

Extra Angle

In order to get a more complete picture, I decided to add another benchmark using Python. The Python VADER library is, just like the Power BI feature, an off-the-shelf sentiment analyzer; by off-the-shelf, I mean it does not require any configuration or training. VADER is very simple to use and only requires a couple of lines:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def print_sentiment_scores(sentence):
    # Return only the compound polarity score for the given string
    snt = analyser.polarity_scores(sentence)
    return snt['compound']
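
Calling the helper on a short string returns VADER’s compound score, which ranges from -1 (most negative) to +1 (most positive), unlike Power BI’s 0 to 1 scale. For example:

print(print_sentiment_scores("What a great amazing product"))  # positive, close to +1
print(print_sentiment_scores("Terrible, bad experience"))      # negative, close to -1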

The library takes a string as input and returns four numeric values:

  • Negative score
  • Neutral score
  • Positive score
  • Compound score

For this test, I only used the compound score, as it is the most similar to the Power BI score.

Aside from the Python model, I also added an extra column to the data with a random score: a random number generator simply picks a value between 0 and 5 for each review. Having such a random baseline makes metrics like RMSE easier to interpret.
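
A minimal sketch of such a baseline, assuming the reviews sit in a pandas DataFrame called df (the column name is illustrative, and whether the original draw was continuous or integer-valued is not specified):

import numpy as np

# Draw one random value between 0 and 5 for every review
df["random_score"] = np.random.uniform(0, 5, size=len(df))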

The Data

Even though there are great search engines like Google Dataset Search, I still mostly use Kaggle to find datasets that fit my project needs. For this project I needed a dataset containing both a written review (in English) and a numerical score. A simple search for the keyword “product review” gave me a dataset of Amazon reviews for headphones.

The data looks like this:

Shape: (14337, 4)

I imported the file into Power BI and ran the sentiment analysis. If you want to run the sentiment analysis yourself, you can find a guide here. After having Power BI add the sentiment score column, I used a Jupyter Notebook to do the performance test.

However, before performing the test, I ran the following steps to clean/prepare the data (a pandas sketch follows the list):

  • Removed unwanted columns
  • Renamed columns
  • Removed rows without a review score
  • Removed rows without a review text
  • Normalized all the scores to a 1–5 scale
  • Added a column with review length
  • Added a column with a random score (random number generator)
  • Added a “correct” column for each source, returning True if the predicted score matches the actual score.
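
A rough pandas sketch of these steps; the column and file names are hypothetical, since the exact schema of the Kaggle file is not shown here:

import numpy as np
import pandas as pd

df = pd.read_csv("headphone_reviews.csv")  # hypothetical file name

df = df.rename(columns={"ReviewBody": "review_text", "ReviewStar": "user_score"})
df = df.dropna(subset=["review_text", "user_score"])

# Bring the model outputs onto the same 1-5 scale as the user score
# (Power BI returns 0-1, VADER's compound returns -1 to +1)
df["powerbi_score"] = df["powerbi_raw"] * 4 + 1
df["vader_score"] = (df["vader_compound"] + 1) / 2 * 4 + 1

df["review_length"] = df["review_text"].str.split().str.len()
df["random_score"] = np.random.uniform(0, 5, size=len(df))

# "Correct" flag per source, after rounding to whole stars
for src in ["powerbi_score", "vader_score", "random_score"]:
    df[src + "_correct"] = df[src].round() == df["user_score"]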

After applying the Power BI sentiment analysis and Python script, the data now looks like this:

Shape: (13394, 10)

The Results

So now we have the data in a clean format and sentiment scores from our three different sources:

  • Actual user score
  • Power BI generated score
  • Python (VADER) score

It is time to rate their performance. As explained in the methodology, I will be using the RMSE, or root mean squared error.

Model performance

Before we take a look at the accuracy of the results, let’s start by comparing the distribution of ratings from the different sources. Using a kernel density plot, we can get an overview of all three sources.
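
Such a plot can be produced with seaborn in a few lines; this sketch reuses the hypothetical column names from the preparation step:

import matplotlib.pyplot as plt
import seaborn as sns

# One density curve per score column
for col in ["user_score", "powerbi_score", "vader_score"]:
    sns.kdeplot(df[col], label=col)
plt.xlabel("Score (1-5)")
plt.legend()
plt.show()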

If we were dealing with perfect models, all lines would overlap. As you can see, in reality there is quite some difference between the actual user score and the models. For reference, the green dotted line shows the random scores. From a first exploration, it seems the models perform better with extreme scores and struggle in the middle.

Let’s dive a bit deeper and take a look at the following values (a sketch of how to compute the first two follows the list):

  • RMSE
  • Percentage of correct scoring
  • Distribution of model scoring per actual user score
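
The first two values can be computed per source with the rmse() helper from earlier (column names are again the hypothetical ones):

for src in ["powerbi_score", "vader_score", "random_score"]:
    src_rmse = rmse(df["user_score"], df[src])
    pct_correct = (df[src].round() == df["user_score"]).mean() * 100
    print(f"{src}: RMSE = {src_rmse:.2f}, correct = {pct_correct:.1f}%")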

Each chart represents one model. On the x-axis you can see the actual user score, while the y-axis represents the model score. Each vertical line of dots shows the distribution of model scores for the corresponding user score. In an ideal scenario, there would only be dots on the red guide line, meaning the model only produces values that match the user score. In a good model, the largest dots (which represent a larger share of the predictions) should be as close as possible to the red line.

Both the RMSE and percentage of correct scoring indicate that the Power BI model is the best. Although the model does not seem to be a perfect predictor, we can certainly say it can serve as an indicator of the comment’s sentiment. Compared to the random scoring, we can clearly see a pattern. Furthermore, just like in the kernel density plot, it is clear that both models perform better with extreme scores.

Length as a Factor

Although NLP has taken huge leaps in recent years, there is still a long way to go. Our robot friends still struggle with all the unstructured language we produce as humans. We often see that it is easier for a sentiment analysis to rate a shorter string than a longer one that might contain conflicting information. And in a sense, this applies to our human understanding too. It is pretty straightforward to grasp the positive sentiment in the review “Great product”, while a longer review like “I like the product, but the battery is bad and not as good as the competitor which also has better sound” begs a deeper analysis.

In light of this theory, it seemed wise to compare model performance across different review lengths. Before I start dicing the data into review-length buckets, let’s have a look at the distribution of review length using a simple boxplot.

Since the data contains a lot of outliers, I rescaled the x-axis from 0 to 100 words:
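
A sketch of that boxplot, clipping the axis at 100 words and using the review_length column added earlier:

import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df["review_length"])
plt.xlim(0, 100)  # clip the axis because of long-review outliers
plt.xlabel("Review length (words)")
plt.show()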

As most of the reviews sit around 10 to 30 words, I will use the following cutoffs to dice the data (a binning sketch follows the list):

  • Short reviews: 0 to 15 words (n=6226)
  • Medium reviews: 16 to 40 words (n=5021)
  • Long reviews: 40+ words (n=2147)
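
A sketch of how the data could be diced with pandas, using the cutoffs above:

import pandas as pd

bins = [0, 15, 40, float("inf")]
labels = ["short", "medium", "long"]
df["length_bucket"] = pd.cut(df["review_length"], bins=bins, labels=labels)

# RMSE per length bucket, e.g. for the Power BI scores
for bucket, group in df.groupby("length_bucket"):
    print(bucket, rmse(group["user_score"], group["powerbi_score"]))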

Now let’s take a look at the new results:

The results are quite interesting: it is clear the Power BI model performed significantly better when fed short reviews (RMSE 1.2), while the Python model seems to do the opposite. However, it is important to add that the portion of the data with long reviews is much smaller than that with short reviews. For both models, the scores on medium-length reviews still seem to outperform those on the full dataset.

Outliers

Lastly, as a responsible data professional, one should always start and end by looking at and understanding the data. Below you can find a sample of reviews that were completely off; in other words, reviews where the user score is 1 and the model score is 5, or vice versa. Let’s examine what might have gone wrong.

Sample of reviews with bad model score.

As you can see from this random sample, it is easy to understand why the model gave the wrong score. The reviews contain a mixture of:

  • typos: “Sounds quality bedCan’t connect all mobile”
  • conflicting information: “Fine, but not good”
  • inconsistent scoring: “Not bad” (user score: 1)

Conclusion

Off-the-shelf Model Performance

With an RMSE of 1.2, we can say that the off-the-shelf models can certainly be used as an indicator of sentiment. With further cleaning and NLP modelling, the score would likely improve further, given that a lot of the bad scoring is caused by typos and inconsistent user ratings.

Even though a self-trained model will most likely outperform these off-the-shelf solutions, it is important to understand the simplicity/performance trade-off. The Power BI sentiment analysis is extremely easy to apply: in a couple of clicks you can have a rating for thousands of reviews in mere seconds.

Secondly, we are still at a very early stage of out-of-the-box AI, and even though I personally prefer getting my hands dirty in the code, I must admit that the future of these solutions looks promising.

Lastly, it is important to understand that this is just a personal experiment and not a peer-reviewed paper. As with any research project, there are some considerations, such as how I normalized and rounded the scores, which might have an effect on the results.

Further research

Now that we have drawn some conclusions, we can open the door to further experimentation. I think the following approaches might be interesting:

  • Rerun the experiment with more and different datasets
  • Use more metrics for evaluation
  • Use another benchmark different from user score (which is often biased)
  • Compare more tools than the Power BI and the Python library
  • Rerun the experiment over time, as the pre-trained models will probably improve

In case you do some research along the lines of the steps mentioned above, I would be very interested in reading it, so please tag me or send it my way.

About me: My name is Bruno and I work as a data scientist, currently specializing in Power BI. You can connect with me via my website: https://www.zhongtron.me
