A tutorial on one of the most elegant and beautiful algorithms in Natural Language Processing.

Classification lies at the heart of both Machine Learning and human intelligence. Recognizing voices, faces, or the images we see every day all come under the umbrella of classification. As for Naive Bayes: for anyone starting out in Natural Language Processing, this is the first step one takes towards that goal, which is why it is of paramount importance to understand how to implement it and what is going on under the bonnet. In this article we are going to learn how to implement Naive Bayes from scratch and use it for sentiment analysis.
INDEX:
- SENTIMENT ANALYSIS
- TERMINOLOGY
- NAIVE-BAYES THEOREM
- DERIVATION
- TRAINING THE MODEL
- SOLVED EXAMPLE
- CONCLUSION
SENTIMENT ANALYSIS
Sentiment analysis, as the name suggests, is the task of classifying the sentiment of a statement, or more simply, the emotion that a particular statement is trying to convey: whether it is positive or negative, sad or sarcastic, insulting or wholesome and kind. Let's elaborate on this point with a few more examples.
Let's say there is a new restaurant in town and you and your friend decide to have dinner there and experience the place. You really enjoy yourself, and when your friend asks for your opinion, you say something along the lines of:

1." This place is great . The food is delicious and the ambience is quite enjoyable"
Now your friend has contrary beliefs . This new place was not to his liking . So when you in turn ask for his opinion he remarks:
2."This place is pathetic. The food is horrible and the ambience is quite overpowering for me and makes me really uncomfortable"
I think we can all agree that statement 1 expresses a positive sentiment while statement 2 expresses a negative one. What is also important for us to notice is what EXACTLY gave us the clues that allowed us to classify the sentences as positive or negative.
The Answer? The cues…
If you look closely, there are certain words in the sentences that dictate the emotion. Positive words like great, delicious and enjoyable make the sentence positive, while words like pathetic, horrible and uncomfortable make the sentence negative. In fact, if we just replace these special cues in a sentence, it completely changes its meaning. Let me show you:
- "This place is great" becomes "This place is pathetic"
- "The food is delicious" becomes "The food is horrible"
In each of these examples, if you notice, changing one pivotal word changes the entire flavor of the sentence while the rest of the sentence remains exactly the same. This is the essence of Sentiment Analysis. Also, we are going to limit our discussion to binary classification, classifying sentences as either positive or negative.
TERMINOLOGY
Before digging into the mathematics, let's first talk about terminology. What we are going to do here falls under Supervised Machine Learning, which means that we will be provided with training inputs, and each of these inputs will be associated with its correct output. It is then the job of your model to make sense of this data, observe and analyze the relationship between the given inputs and outputs, and finally predict the output for a new input with reasonable accuracy.
Going further, we usually denote the input as x and the output as y, where y ∈ {y1, y2, y3, ….., yn} is one of a set of classes, and it is your model's task to predict which output class a particular input x belongs to. In our case we will be dealing with words and sentences, so there is a slight change in terminology: our input will be 'd' for document (basically a sentence) consisting of a number of features, and our output will be 'c' for class. Here the class will be either positive (for positive sentiment) or negative (for negative sentiment).
So, finally, we will get an input d and our model has to learn to predict which class, c, it belongs to.
Now that we are done with the nitty-gritty and the finer details, let's start with the theorem.
NAIVE-BAYES THEOREM
Let’s first look into Bayes Theorem:
P(A | B) = P(B | A) * P(A) / P(B) ->(1)
Let’s Look into the terms:
- P(A | B) = Probability of event A happening given that event B happens
- P(B | A) = Probability of event B happening given that event A happens
- P(A) = Probability of event A happening
- P(B) = Probability of event B happening
Bayes' Theorem thus gives us a way to find a conditional probability, and it lies at the heart of the Naive Bayes classifier.
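To make the formula concrete, here is a tiny numeric sanity check in Python. The numbers are purely hypothetical assumptions for illustration: half of all reviews are positive, the word "great" shows up in 20% of positive reviews and in 5% of negative ones.

```python
# Hypothetical numbers, just to exercise Bayes' theorem (eqt 1).
p_positive = 0.5                      # P(A): prior probability that a review is positive
p_great_given_positive = 0.20         # P(B | A)
p_great_given_negative = 0.05         # P(B | not A)

# P(B): total probability of seeing the word "great" in a review
p_great = (p_great_given_positive * p_positive
           + p_great_given_negative * (1 - p_positive))

# P(A | B) by Bayes' theorem
p_positive_given_great = p_great_given_positive * p_positive / p_great
print(p_positive_given_great)         # 0.8
```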
Now we are in a position to describe the Multinomial Naive Bayes classifier. As the name suggests, it is a Bayesian classifier with a simplifying assumption about how the features interact.
One of the most important assumptions in Naive Bayes is called bag-of-words. It implies that all the algorithm really cares about is each word and its frequency, i.e. how many times the word has appeared in our data. The position of a word in the sentence (document) does not matter at all; we only have to keep track of the number of times a particular word appears in our document. That's it.
Let me explain with an example:
" Tea makes me happy. Black Tea, Green Tea, Milk Tea, it does not matter what kind it is ; as long as it’s tea I am satisfied. I have grown up drinking tea and every sip reminds me of the good old days"
Let's analyze it further. If we simply count word frequencies, the word "tea" dominates: it appears six times in this little passage, no matter where it occurs, while words like "happy" and "satisfied" appear only once each.
This is basically what the bag-of-words concept boils down to. It does not matter where the words tea, I, happy, satisfied, etc. have been used in the sentences; all that matters is their frequency.
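As a quick illustration, here is a minimal Python sketch of the bag-of-words idea applied to the tea passage above. The crude regex tokenizer is just an assumption for illustration:

```python
import re
from collections import Counter

text = ("Tea makes me happy. Black Tea, Green Tea, Milk Tea, it does not matter "
        "what kind it is; as long as it's tea I am satisfied. I have grown up "
        "drinking tea and every sip reminds me of the good old days")

# Lowercase and split into word tokens; word positions are thrown away entirely.
tokens = re.findall(r"[a-z']+", text.lower())

# The "bag": a mapping from each word to the number of times it occurs.
bag = Counter(tokens)

print(bag["tea"])            # 6 -- frequency is all that survives
print(bag.most_common(3))    # the most frequent words in the passage
```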

DERIVATION
Now let’s try to formulate this mathematically.
If you recall, our main goal was to find the class (positive or negative sentiment) of a given sentence (document). We can approach this problem as follows:
- Suppose we have a set of possible classes C.
- We find the probability of the document being in a particular class; essentially, the conditional probability of a class given a document.
- We iterate over all the classes and find which class has the maximum conditional probability, giving us our answer.
Combining all the steps together we get:
ĉ = argmax_{c ∈ C} P(c | d) ->(2)
Here the term ĉ denotes the class with the maximum probability.
We were already introduced to Bayes' Theorem, so now we can plug the formula for conditional probability into eqt(2):
ĉ = argmax_{c ∈ C} P(d | c) * P(c) / P(d) ->(3)
We can simplify this equation even more. While iterating over the classes, the document of course does not change, only the class changes; so we can safely remove P(d) from the denominator without causing any major problems. Our modified equation thus becomes:
ĉ = argmax_{c ∈ C} P(d | c) * P(c) ->(4)
The term P(d | c) is called the likelihood probability.
The second term P(c) is called the prior probability.
We can simplify it even further by dividing each document into a collection of features f1, f2, f3, ….., fn:
ĉ = argmax_{c ∈ C} P(f1, f2, f3, ….., fn | c) * P(c) ->(5)
At this point of our derivation we will make a very important assumption: we will assume that the probability of each feature f given its class is independent of the other features. This is a crucial step, and it reduces the time complexity of our problem by a huge margin. Let's understand that a bit more.
If two events X and Y are independent of each other then the probability of the events occurring together (P(X and Y)) becomes:
P( X ∩ Y) = P(X) * P(Y)
which means:
P(f1, f2 | c) = P(f1 | c) * P(f2 | c)
We can simplify eqt(5) even further! Since we assume the features are independent of each other given the class, we do not have to take into account how each feature is related to another, or the probability of one feature occurring given another. This saves us a lot of computing power.
Thus our final equation becomes:
P(f1, f2, ….., fn | c) = P(f1 | c) * P(f2 | c) * ….. * P(fn | c) ->(6)
or
ĉ = argmax_{c ∈ C} P(c) * Π_i P(fi | c) ->(7)
Now, of course, the features in a sentence are its words, so if we replace the features in our equation with wi, the word at the i-th position, we can re-frame our equation as follows:
ĉ = argmax_{c ∈ C} P(c) * Π_i P(wi | c) ->(8)
Phew! We are finally done with the derivation. Now let's move on to how we can apply this concept to a practical problem.
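Before we get there, here is a minimal sketch of the decision rule in eqt(8). The priors and likelihoods below are made-up placeholder numbers; the next section shows how to estimate them from data.

```python
priors = {"positive": 0.5, "negative": 0.5}          # P(c), hypothetical values
likelihoods = {                                       # P(w | c), hypothetical values
    "positive": {"great": 0.07, "movie": 0.11},
    "negative": {"great": 0.03, "movie": 0.03},
}

def classify(words):
    scores = {}
    for c in priors:
        score = priors[c]                             # start from the prior P(c)
        for w in words:
            score *= likelihoods[c].get(w, 1.0)       # multiply in each P(w | c)
        scores[c] = score
    return max(scores, key=scores.get)                # argmax over classes, i.e. ĉ

print(classify(["great", "movie"]))                   # -> 'positive'
```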
TRAINING THE MODEL
- Calculate the Prior Probability. We will first find the number of documents belonging to each class. Finding the fraction of documents in each class gives us the required prior probability.
Let’s assume the number of documents in class c is Nc.
Total number of documents is assumed to be Ntotal.
So , P(c) = Nc / Ntotal ->(9)
- Calculate the Likelihood Probability. This is where it gets slightly tricky. Our main goal is to find the fraction of times the word wi appears among all words in all documents of class c. We first concatenate all documents with category c into one big "category c" text, and then use the frequency of wi in this concatenated document to give the likelihood probability:
P(wi | c) = count(wi, c) / Σ_{w ∈ V} count(w, c) ->(10)
Here V is the vocabulary, the collection of all unique words appearing in all documents, irrespective of class.
We will, however, face a very particular problem at this point. Suppose the document we have as input is
d = "I loved that movie"
The word "loved" is only present in the positive class and no examples of "loved" are present in the negative class inputs. Now, from eqt(8), we have to find the probability for each class by multiplying the likelihood probabilities. If we calculate our likelihood probability for the word "loved" for the class "negative" we get:
P("loved" | "negative") = 0
Now if we plug this value into eqt(8), the entire probability of our class "negative" becomes zero, no matter what the other values are.
To combat this problem we introduce a small additive constant, the Laplace smoothing coefficient, into both the numerator and the denominator. Our equation is modified as follows:
P(wi | c) = (count(wi, c) + a) / (Σ_{w ∈ V} count(w, c) + a * |V|) ->(11)
Here a is the Laplace smoothing coefficient (we usually consider its value to be 1) and |V| is the size of the vocabulary.
- Plug in the prior and the likelihood probability in eqt(8).
Now that we have calculated our prior and likelihood probabilities we can simply go ahead and plug them in.
There are a few ways we can optimize this process, though:
a. Using Logarithms: multiplying many small probabilities quickly runs into floating-point underflow, so in practice we apply log to eqt(8), which also converts it into a linear function of the features and increases efficiency quite a lot.
This is the original equation we had:
ĉ = argmax_{c ∈ C} P(c) * Π_i P(wi | c) ->(8)
Now if we apply the logarithm on both sides we get a linear function:
ĉ = argmax_{c ∈ C} [ log P(c) + Σ_i log P(wi | c) ] ->(12)
b. Stop Words: words like the, a, an, was, when, etc. do not usually contribute to the sentiment of the statement. We can remove them entirely to streamline our model training.
c. Unknown Words: whenever you face a word which is present in the test dataset but absent from the vocabulary created from the training data, it is advisable to drop the word entirely and not consider it in the probability calculations.
d. Binary Multinomial Naive Bayes: this is a slightly modified version of multinomial Naive Bayes. Here we place more importance on whether a word is present or not than on its frequency. As we have already seen, a single word can bring about a massive change in the sentiment of a sentence, so it is logical to disregard how many times that particular word appears in a document and concentrate only on whether it is present or not. A short sketch tying these steps together follows this list.
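As promised above, here is a minimal sketch that ties the training steps together: priors as in eqt(9), Laplace-smoothed likelihoods as in eqt(11), log-space scoring, and dropping of unknown words. The function and variable names are my own choices, not taken from the article's repository.

```python
import math
from collections import Counter, defaultdict

def train(documents, a=1):
    """documents: list of (list_of_words, class_label) pairs; a: smoothing coefficient."""
    doc_counts = Counter()                 # N_c: number of documents per class
    word_counts = defaultdict(Counter)     # count(w, c)
    vocab = set()
    for words, c in documents:
        doc_counts[c] += 1
        word_counts[c].update(words)       # use set(words) here for the binary variant
        vocab.update(words)
    n_total = sum(doc_counts.values())
    log_prior = {c: math.log(doc_counts[c] / n_total) for c in doc_counts}      # eqt(9)
    log_likelihood = {}
    for c in doc_counts:
        total = sum(word_counts[c].values())            # total words in class c
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + a) / (total + a * len(vocab)))     # eqt(11)
            for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict(words, log_prior, log_likelihood, vocab):
    scores = {}
    for c in log_prior:
        score = log_prior[c]               # log P(c)
        for w in words:
            if w in vocab:                 # unknown test words are simply dropped
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)     # class with the highest log probability
```

Switching to the binary multinomial variant only requires counting each word at most once per document, as noted in the comment inside train.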
SOLVED EXAMPLE
Finally, now that we are familiar with the Naive Bayes classifier, we can apply this knowledge to an example.
Training Dataset:

Test Dataset:

This is a fictitious dataset on movie reviews. The movie reviews have been divided into positive and negative classes respectively.
Let’s Solve the problem: (we will consider smoothing coefficient, a, as 1)
For the first test case:
Prior probability:
P(c = ‘positive’) = 3/6 = 1/2 , P(c = ‘negative’) = 3/6 = 1/2
Likelihood probability:
P(‘Great’ | c = ‘Positive’) = (1 + 1) / (9 + 19) = 0.0714
P(‘movie’ | c = ‘Positive’) = (2 + 1) / (9 + 19) = 0.1071
P(‘Great’ | c = ‘Negative’) = (0 + 1) / (10 + 19) = 0.0344
P(‘movie’ | c = ‘Negative’) = (0 + 1) / (10 + 19) = 0.0344
Finally we apply our findings in eqt(8) and return the class with the maximum probability:
P(c = ‘positive’) * P(‘Great’ | c = ‘Positive’) * P(‘movie’ | c = ‘Positive’) = 0.5 * 0.0714 * 0.1071 = 0.00382
P(c = ‘negative’) * P(‘Great’ | c = ‘Negative’) * P(‘movie’ | c = ‘Negative’) = 0.5 * 0.0344 * 0.0344 = 0.000591
Thus we can say the test case belongs to the positive class.
For the second test case:
Prior probability:
P(c = ‘positive’) = 3/6 = 1/2 , P(c = ‘negative’) = 3/6 = 1/2
In this example we encounter the unknown words "this" and "is". We will not consider them in our calculations.
Likelihood probability:
P(‘boring’ | c = ‘Positive’) = (0 + 1) / (9 + 19) = 0.03571
P(‘pathetic’ | c = ‘Positive’) = (0 + 1) / (9 + 19) = 0.03571
P(‘and’ | c = ‘Positive’) = (0 + 1) / (9 + 19) = 0.03571
P(‘boring’ | c = ‘Negative’) = (1 + 1) / (10 + 19) = 0.0689
P(‘pathetic’ | c = ‘Negative’) = (1 + 1) / (10 + 19) = 0.0689
P(‘and’ | c = ‘Negative’) = (1 + 1) / (10 + 19) = 0.0689
Finally we apply our findings in eqt(8) and return the class with the maximum probability:
P(c = ‘positive’) * P(‘boring’ | c = ‘Positive’) * P(‘pathetic’ | c = ‘Positive’) * P(‘and’ | c = ‘Positive’) = 0.5 * 0.03571 * 0.03571 * 0.03571 = 0.00002277
P(c = ‘negative’) * P(‘boring’ | c = ‘Negative’) * P(‘pathetic’ | c = ‘Negative’) * P(‘and’ | c = ‘Negative’) = 0.5 * 0.0689 * 0.0689 * 0.0689 = 0.00016354
Thus we can say the test case belongs to the negative class.
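For completeness, here is a tiny sketch reproducing the arithmetic of both test cases from the counts quoted above (9 words in the positive class, 10 in the negative class, a vocabulary of 19 words, and a = 1). The helper function is my own, purely for checking the numbers.

```python
def smoothed(count, words_in_class, vocab_size=19, a=1):
    # Laplace-smoothed likelihood, eqt(11)
    return (count + a) / (words_in_class + a * vocab_size)

# Test case 1: "Great movie"
pos_1 = 0.5 * smoothed(1, 9) * smoothed(2, 9)       # ~0.0038  -> positive wins
neg_1 = 0.5 * smoothed(0, 10) * smoothed(0, 10)     # ~0.0006

# Test case 2: after dropping the unknown words, "boring", "pathetic", "and" remain
pos_2 = 0.5 * smoothed(0, 9) ** 3                   # ~0.000023
neg_2 = 0.5 * smoothed(1, 10) ** 3                  # ~0.00016 -> negative wins

print(pos_1, neg_1, pos_2, neg_2)
```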
CONCLUSION
Sentiment analysis is widely used in social media monitoring, market research, and many other areas, and it is one of the most important applications of Natural Language Processing. The Naive Bayes classifier is a great first step towards Natural Language Processing. Hopefully you enjoyed reading this article.
The Python code implementing the Binary Multinomial Naive Bayes classifier can be found in my GitHub repo. The dataset has also been included in the repo.