SENTIMENT ANALYSIS USING VADER
Interpretation and classification of emotions
Sentiment analysis is a text analysis method that detects polarity (e.g. a positive or negative opinion) within the text, whether a whole document, paragraph, sentence, or clause.
Sentiment analysis aims to measure the attitudes, evaluations, and emotions of a speaker or writer based on the computational treatment of subjectivity in a text.
Why is Sentiment Analysis difficult to perform?
Though it may seem easy on paper, sentiment analysis is a tricky subject in practice. A text may contain multiple sentiments all at once. For instance,
“The acting was good, but the movie could have been better”
The above sentence contains two opposing polarities!
VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data.
VADER sentiment analysis relies on a dictionary that maps lexical features to emotion intensities, known as sentiment scores. The sentiment score of a text is obtained by summing up the intensity of each word in the text.
For example, words like ‘love’, ‘enjoy’, ‘happy’, and ‘like’ all convey a positive sentiment. VADER is also intelligent enough to understand the basic context of these words, interpreting “did not love” as a negative statement. It likewise understands the emphasis of capitalization and punctuation, such as “ENJOY!”.
Polarity classification
We won’t try to determine if a sentence is objective or subjective, fact or opinion. Rather, we care only if the text expresses a positive, negative or neutral opinion.
Document-level scope
We’ll also try to aggregate all of the sentences in a document or paragraph, to arrive at an overall opinion.
Coarse analysis
We won’t try to perform a fine-grained analysis that would determine the degree of positivity/negativity. That is, we’re not trying to guess how many stars a reviewer awarded, just whether the review was positive or negative.
Broad Steps:
- First, consider the text being analyzed. A model trained on paragraph-long reviews might not be effective. Make sure to use an appropriate model for the task at hand.
- Next, decide the type of analysis to perform. Some rudimentary sentiment analysis models consider single words in isolation; others go one step further and consider two-word combinations, or bigrams. We will work on complete sentences, and for this we’re going to import a trained NLTK lexicon called VADER.
DATASETS TO USE
For this model you can use a variety of datasets, such as Amazon reviews, movie reviews, or any other product reviews.
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
VADER’s SentimentIntensityAnalyzer()
takes in a string and returns a dictionary of scores in each of four categories:
- negative
- neutral
- positive
- compound (computed by normalizing the scores above)
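The compound score comes from summing the word valences and squashing the sum into the range [-1, 1]. A minimal sketch of that normalization, following the formula used in VADER's source (where alpha defaults to 15):

```python
import math

def normalize(score, alpha=15):
    # Squash a summed valence into (-1, 1); this is the normalization
    # VADER applies to produce the compound score.
    return score / math.sqrt(score * score + alpha)

print(normalize(0.0))   # a neutral text stays at 0
print(normalize(2.1))   # a strong positive sum maps into (0, 1)
print(normalize(-2.1))  # symmetric for negative sums
```

The larger the summed valence, the closer the compound score approaches +1 or -1, without ever reaching it.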
Let us analyze some random statements through our sentiment analyzer.
a = 'This was a good movie.'
sid.polarity_scores(a)
OUTPUT- {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
a = 'This was the best, most awesome movie EVER MADE!!!'
sid.polarity_scores(a)
OUTPUT- {'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}
Use VADER to analyze Reviews
import numpy as np
import pandas as pd
df = pd.read_csv('../TextFiles/reviews.tsv', sep='\t')
df.head()
df['label'].value_counts()
OUTPUT-
neg    5097
pos    4903
Name: label, dtype: int64
Clean the data (optional)
This step removes any NaN values and whitespace-only reviews from the data.
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)
blanks = [] # start with an empty list
for i, lb, rv in df.itertuples():
    if type(rv) == str:
        if rv.isspace():
            blanks.append(i)
df.drop(blanks, inplace=True)
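The same cleanup can also be done without an explicit loop. A vectorized pandas sketch, shown here on a small hypothetical frame standing in for the reviews file:

```python
import pandas as pd

# Small stand-in for the reviews DataFrame (hypothetical data)
df = pd.DataFrame({'label':  ['pos', 'neg', 'pos'],
                   'review': ['Great film', '   ', None]})

df.dropna(inplace=True)                         # remove NaN reviews
df = df[df['review'].str.strip().astype(bool)]  # remove whitespace-only reviews
print(df)
```

Stripping each review and casting to bool keeps only rows whose review contains non-whitespace characters.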
Adding Scores and Labels to the DataFrame
Now we’ll add columns to the original DataFrame to store the polarity_scores dictionaries, the extracted compound scores, and new “pos/neg” labels derived from the compound score. We’ll use this last column to perform an accuracy test against the original labels. Each polarity_scores dictionary breaks a review down into negative, neutral, and positive ratios.
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df.head()
Now we’ll pull out compound as a separate column: all values greater than or equal to zero will be considered positive reviews, and all values below zero will be considered negative reviews.
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df.head()
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
df.head()
So now we have a complete analysis of every review as either positive or negative.
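The accuracy test mentioned above can be done by comparing the original label column with comp_score. A sketch on hypothetical values, since the exact numbers depend on your dataset:

```python
import pandas as pd

# Hypothetical stand-in for the labeled reviews frame
df = pd.DataFrame({'label':      ['pos', 'neg', 'pos', 'neg'],
                   'comp_score': ['pos', 'neg', 'neg', 'neg']})

# Fraction of reviews where VADER's label matches the ground truth
accuracy = (df['label'] == df['comp_score']).mean()
print(f'Accuracy: {accuracy:.2f}')  # 3 of 4 agree -> 0.75
```

Comparing the two columns yields a boolean Series, and its mean is exactly the fraction of correct predictions.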
Now let us pass some new reviews to test how our model performs!
# Write a review as one continuous string (multiple sentences are ok)
review = 'The shoes I bought were amazing.'
# Obtain the sid scores for your review
sid.polarity_scores(review)
OUTPUT- {'neg': 0.0, 'neu': 0.513, 'pos': 0.487, 'compound': 0.5859}
review = 'The mobile phone I bought was the WORST and very BAD'
# Obtain the sid scores for your review
sid.polarity_scores(review)
OUTPUT- {'neg': 0.539, 'neu': 0.461, 'pos': 0.0, 'compound': -0.8849}
Conclusion
The results of VADER analysis are not only remarkable but also very encouraging. They show the advantages that can be attained by using VADER on websites where the text data is a complex mixture of many kinds of text.
ADDITIONAL RESOURCES
Two of my other articles published in Towards Data Science cover topics related to this blog. Do give them a read for a better understanding of Natural Language Processing:
Stemming vs Lemmatization — https://link.medium.com/JWpURpQjt6
Word vectors and Semantics — https://link.medium.com/tuVCswhYu6