
Introduction
Building a model can be a lot easier than you think. Not every classification task needs a machine learning model; even very simple approaches can give you good performance. This article covers VADER, a lexicon and rule-based model for sentiment analysis. We will first understand what VADER is, then see how to use it, and finally evaluate its performance on our classification task.
What is VADER?
Valence Aware Dictionary and sEntiment Reasoner, or VADER for short, is a lexicon and simple rule-based model for sentiment analysis.
It can efficiently handle the vocabulary, abbreviations, capitalization, repeated punctuation, emoticons (😢 , 😃 , 😭 , etc.), and other conventions commonly used on social media to express sentiment, which makes it a great fit for social media sentiment analysis.
VADER has the advantage of assessing the sentiment of any given text without any prior training, unlike machine learning models.
The result generated by VADER is a dictionary with 4 keys: neg, neu, pos, and compound.
neg, neu, and pos stand for negative, neutral, and positive respectively. Each one is the proportion of the text that falls into that category, so their sum should be equal to 1, or very close to it because of floating-point rounding.
compound is the sum of the valence scores of the words in the text, adjusted by VADER's rules and then normalized to lie between -1 (most extreme negative sentiment) and +1 (most extreme positive sentiment). Unlike the three previous scores, it summarizes the overall degree of sentiment in a single number. The compound score alone can be enough to determine the underlying sentiment of a text, using the following thresholds:
- a positive sentiment, compound ≥ 0.05
- a negative sentiment, compound ≤ -0.05
- a neutral sentiment, -0.05 < compound < 0.05
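To see how a sum of word valences can end up inside [-1, +1], here is a minimal sketch of the normalization step. It assumes the formula found in the VADER reference implementation, x / sqrt(x² + alpha) with alpha = 15; the function name and the sample sums are illustrative only and are not real lexicon values.

```python
import math


def normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the interval [-1, 1].

    Assumption: mirrors the x / sqrt(x^2 + alpha) formula of the VADER
    reference implementation, where alpha defaults to 15.
    """
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Guard against floating-point overshoot at the extremes.
    return max(-1.0, min(1.0, score))


# Illustrative sums (made up): larger magnitudes move closer to the
# bounds but never cross them.
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"raw sum {raw:+5.1f} -> compound {normalize(raw):+.4f}")
```

Because the curve flattens quickly, a handful of strongly charged words is already enough to push the compound score near the ends of the range.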
Now that we understand the main concepts, let’s dive into the implementation.
How to use VADER?
The goal of this section is to provide you with all the prerequisites such as the dependencies, the dataset, and the actual implementation of VADER.
Prerequisites and basics
As mentioned in the title, we will be using the VADER library. To do so, we need to install [nltk](https://www.nltk.org/_modules/nltk/sentiment/vader.html), then download and import the lexicon with the following instructions.
The SentimentIntensityAnalyzer.polarity_scores() function provides the polarity of the text in the dictionary format explained previously. To be able to perform the predictions, we need to create an instance of SentimentIntensityAnalyzer beforehand.
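The setup described above could look like the following sketch. It assumes NLTK was installed with `pip install nltk`. `load_vader` and `score` are hypothetical helper names introduced for illustration, while `nltk.download`, `SentimentIntensityAnalyzer`, and `polarity_scores` come from NLTK itself.

```python
# Install NLTK first, for example with: pip install nltk


def load_vader():
    """Download the VADER lexicon (if missing) and return an analyzer."""
    # Imported here so the download and the import live in one place.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def score(analyzer, text):
    """Return the neg / neu / pos / compound dictionary for one text."""
    return analyzer.polarity_scores(text)
```

A typical session would call `analyzer = load_vader()` once, then `score(analyzer, some_text)` for every text to analyze.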
Let’s get warmed up by predicting the underlying sentiment of the following examples.
# Output of example1
{'neg': 0.0, 'neu': 0.585, 'pos': 0.415, 'compound': 0.75}
Observation example 1: The previous result shows that the sentence does not have any negative information (neg=0). It has some neutral and positive tones (neu=0.585 and pos=0.415). Overall, the sentiment is positive, because compound > 0.05.
# Output of example2
{'neg': 0.0, 'neu': 0.373, 'pos': 0.627, 'compound': 0.8284}
Observation example 2: As you can see from this example, the compound jumped to about 0.83, which makes this sentence more positive than the first one.
# Output of example3
{'neg': 0.619, 'neu': 0.381, 'pos': 0.0, 'compound': -0.8449}
Observation example 3: This last sentence does not have any positive information (pos=0). It has some neutral and negative tones (neu=0.381 and neg=0.619). Overall, it carries a strongly negative sentiment, since the compound score is close to -1. My guess here is that removing the exclamation marks will make the sentiment less negative. Why don’t you give it a try 🙂
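One way to try the experiment suggested above is sketched below. `compare_without_exclamations` is a hypothetical helper, and it expects an analyzer that has already been created. The original example sentence is not reproduced here, so pass in any sentence of your own.

```python
def compare_without_exclamations(analyzer, sentence):
    """Score a sentence with and without its exclamation marks.

    `analyzer` is any object exposing polarity_scores(text), such as
    NLTK's SentimentIntensityAnalyzer.
    """
    stripped = sentence.replace("!", "")
    return {
        "original": analyzer.polarity_scores(sentence)["compound"],
        "no_exclamations": analyzer.polarity_scores(stripped)["compound"],
    }
```

Comparing the two compound values shows directly how much of the score comes from punctuation emphasis rather than from the words themselves.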
Performance on a large dataset
Now that we understand the basics, let’s evaluate the performance of VADER on a larger dataset. Before that, we need to perform a few preprocessing steps.
Load data for preprocessing
To find out how well VADER does, we are going to use the license-free tweets dataset available on the Sentiment140 website.

We are only interested in two main columns.
- ‘4’, corresponding to the polarity of the tweet (0: negative, 2: neutral, 4: positive).
- ‘@stellargi..right’, corresponding to the actual tweet.
The following function renames those columns to something more readable, maps the polarity digits to their string labels, and finally returns the formatted data.
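A minimal sketch of such a function is shown below, assuming the data has been loaded into a pandas DataFrame. The output column names (`polarity`, `text`) and the `neg` / `neu` / `pos` labels are assumptions made for this sketch. The two raw column names are passed in as arguments instead of being hard-coded.

```python
def format_data(df, polarity_col, text_col):
    """Return a new DataFrame with two readable columns: polarity and text.

    `df` is expected to be a pandas DataFrame. The output column names and
    the 'neg' / 'neu' / 'pos' labels are assumptions made for this sketch.
    """
    digit_to_label = {0: "neg", 2: "neu", 4: "pos"}

    # Keep only the two columns of interest and give them clearer names.
    formatted = df[[polarity_col, text_col]].rename(
        columns={polarity_col: "polarity", text_col: "text"}
    )
    # Replace the numeric polarity with its string label.
    formatted["polarity"] = formatted["polarity"].map(digit_to_label)
    return formatted
```

The function leaves the input DataFrame untouched, which makes it safe to rerun while experimenting.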
The following image corresponds to the first 3 rows after applying the format_data() function to the original data set.

How good is VADER on the data?
First, we define helper functions that directly return the polarity (pos, neg, or neu) instead of the dictionary output.
Next, we create a new column, vader_prediction, containing VADER’s predictions. Then we show 5 random rows of the data.
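The helpers and the new column could be written along these lines. This is a sketch under assumptions: the function names are hypothetical, the tweets are assumed to live in a `text` column of a pandas DataFrame, and the labels follow the ±0.05 compound thresholds given earlier.

```python
def compound_to_label(compound):
    """Map a compound score to 'pos', 'neg', or 'neu' (thresholds at ±0.05)."""
    if compound >= 0.05:
        return "pos"
    if compound <= -0.05:
        return "neg"
    return "neu"


def predict_sentiment(analyzer, text):
    """Return the polarity label of `text` instead of the score dictionary."""
    return compound_to_label(analyzer.polarity_scores(text)["compound"])


def add_vader_predictions(df, analyzer, text_col="text"):
    """Return a copy of `df` with an extra `vader_prediction` column."""
    result = df.copy()
    result["vader_prediction"] = result[text_col].apply(
        lambda text: predict_sentiment(analyzer, text)
    )
    return result
```

Calling `add_vader_predictions(data, analyzer).sample(n=5)` would then display five random rows with both the original polarity and VADER's prediction.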

From the original polarity column and VADER’s predictions, we can finally compute the performance metrics (precision, recall, and F1 score) by running these few instructions.
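In practice, scikit-learn's `classification_report` is a common way to get these numbers. The sketch below computes them by hand instead, so that it has no dependencies and the definitions stay visible. It is not necessarily the code behind the results discussed in this article, and the sample labels are made up purely to show the call.

```python
def classification_metrics(y_true, y_pred):
    """Return (per-label metrics, accuracy) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return report, accuracy


# Tiny made-up example, only to show the call; not results from the article.
truth = ["pos", "pos", "neg", "neu", "neg", "pos"]
preds = ["pos", "neg", "neg", "neu", "pos", "pos"]
report, accuracy = classification_metrics(truth, preds)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
print("accuracy", round(accuracy, 2))
```

With real data, `y_true` would be the original polarity column and `y_pred` the column of VADER predictions.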

The model seems to be doing a good job because it is much better than a random guess (accuracy = 0.5)! The same observation can be made from the f1-scores of each polarity.
Conclusion
Congratulations! 🎉 🍾 You have just learned how to use VADER for social media sentiment classification. VADER can be a good starting point and serve as your baseline model for such a task before you move on to building machine learning models. I hope you have enjoyed reading this article, and that it gave you the skills needed to perform your analysis.