
Naive Bayes Explained with Course Ratings Prediction Example in Python

Introduction to Naive Bayes

Image created by Americana using Canva

Bayes’ Theorem

Image screenshot by Americana

Bayes’ Theorem is a mathematical formula that allows us to calculate ‘reversed’ conditional probabilities: P(A|B) = P(B|A) × P(A) / P(B). It is often used when we have some prior belief about the probability of an event happening and want to incorporate that information to calculate the conditional probability of an outcome. We can define the terms as follows:

  • P(A|B) is referred to as the posterior probability
  • P(A) is the prior probability
  • P(B|A) is the likelihood
  • P(B) is the evidence.
  • We can then restate the formula as:

Posterior = (Likelihood × Prior) / Evidence

Naïve Bayes

  • Given a set of data {(x1, y1), …, (xn, yn)}
  • y takes a limited set of discrete values (categories) from 1 to k

Naïve Bayes applies Bayes’ Theorem to compute P(yi=1|xi), P(yi=2|xi), …, P(yi=k|xi) from the values of P(xi|yi=1), P(xi|yi=2), …, P(xi|yi=k) together with P(xi) and the priors P(y=1), …, P(y=k).

Our prediction is simply the value of y that gives the maximum among P(yi=1|xi), P(yi=2|xi), …, P(yi=k|xi).

All these notations can get confusing, so putting them into a classification example will make the computation clearer.

Imagine we have the classic spam email classification problem, and we receive an email saying "buy free food". Comparing it against a list of words that could occur, we vectorize x into a row vector with a 1 at the positions corresponding to "buy", "free" and "food" and 0 everywhere else.

Image created by Americana Chen using Canva
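Echoing the figure above, here is a minimal sketch of this encoding. The vocabulary below is hypothetical; a real model would build it from the training corpus.

```python
# Hypothetical vocabulary, invented for illustration only.
vocabulary = ["buy", "cheap", "food", "free", "hello", "meeting", "report"]

def vectorize(email):
    """Return a binary row vector: 1 if the vocabulary word appears in the email, else 0."""
    words = set(email.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

print(vectorize("buy free food"))  # [1, 0, 1, 1, 0, 0, 0]
```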

Using Bayes’ Theorem, we can estimate P(Spam), P(Buy) and P(Buy|Spam) directly from our sample:

  • P(Spam)= number of spam emails in the sample / total number of sample emails
  • P(Buy)= total number of emails in the sample where "buy" appeared / total number of sample emails
  • P(Buy|Spam) = P(Buy and Spam) / P(Spam) = number of spam emails that contain the word "buy" / total number of spam emails in the sample

Finally, we take the product of the individual likelihoods P(Buy|Spam), P(Free|Spam), P(Food|Spam) together with the prior P(Spam), and divide by the evidence; this gives us the posterior probability that a particular email is spam given its content. Note that this is where the model is "Naïve": it assumes that the words occur independently of each other given the class. In this case, we’ve assumed that knowing "buy" appeared in the email doesn’t give us any additional information about whether "free" would also appear in the email.
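To make the arithmetic concrete, here is a small sketch that estimates these quantities from a toy sample of labelled emails and combines them under the naive independence assumption. The sample emails are invented purely for illustration, and real implementations add smoothing so that unseen words don’t zero out the whole product.

```python
# Toy sample of (email, label) pairs, invented for illustration only.
sample = [
    ("buy free food now", "spam"),
    ("free discount sale", "spam"),
    ("buy cheap food", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch food tomorrow", "ham"),
]

def prob_word_given_label(word, label):
    """P(word | label): fraction of emails with that label containing the word."""
    emails = [text for text, y in sample if y == label]
    return sum(word in text.split() for text in emails) / len(emails)

def naive_bayes_score(words, label):
    """Unnormalised posterior: P(label) * product of P(word | label)."""
    prior = sum(y == label for _, y in sample) / len(sample)
    score = prior
    for word in words:
        score *= prob_word_given_label(word, label)
    return score

words = "buy free food".split()
scores = {label: naive_bayes_score(words, label) for label in ("spam", "ham")}
print(scores)                       # {'spam': 0.177..., 'ham': 0.0}
print(max(scores, key=scores.get))  # 'spam'
```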

Thinking about it intuitively, the assumption might not hold true. In reality, if we know the email contains "buy" and "food", we might instinctively associate it with words such as "sale", "free" and "discount". So we need to be aware of the imperfect nature of this model’s assumptions when we use it.


Example of Course Rating Prediction Using Python

The dataset we’re going to use in this example is the "Course review on Coursera" dataset from Kaggle, which can be accessed from the link below:

Course Reviews on Coursera

It’s always helpful to look at our data and get an idea of its general structure before we decide how to approach the problem. The dataset description provided by the uploader says that it consists of about 1.45 million course reviews posted by students and participants on Coursera (one of the largest online course providers). Given such a large dataset, dropping the rows with missing entries is unlikely to produce an extremely biased outcome. After dropping missing values, our dataset ends up with 1,454,571 rows and 5 columns.
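A minimal sketch of this loading and cleaning step, assuming the Kaggle CSV has been downloaded locally (the file name below is an assumption; adjust it to the actual download):

```python
import pandas as pd

# File name is an assumption; point this at the downloaded Kaggle CSV.
df = pd.read_csv("reviews_by_course.csv")

print(df.shape)      # shape before cleaning
df = df.dropna()     # drop rows with any missing entries
print(df.shape)      # roughly (1454571, 5) as described above
```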

In order to predict user ratings for courses, we will use the ‘reviews’ column as the predictor and the ‘ratings’ column as the target, but first let’s explore how many categories we are trying to fit our predictions to. The code below simply tells us that the dataset covers 604 courses with 5 possible ratings.
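A sketch of that check; the column names (‘CourseId’, ‘reviews’, ‘ratings’) are assumptions about the CSV headers and may need adjusting:

```python
# Column names are assumptions; adjust to match the actual CSV headers.
print(df["CourseId"].nunique())        # number of distinct courses (604)
print(df["ratings"].nunique())         # number of distinct rating values (5)
print(sorted(df["ratings"].unique()))  # e.g. [1, 2, 3, 4, 5]
```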

Sklearn is very beginner-friendly for handling these kinds of Machine Learning tasks. With just a few lines of code we can construct a model, train it on our training data and make predictions on the test data.

First we need to use the count vectorizer to transform each review into a sparse row vector that takes the value 0 for entries corresponding to words that don’t appear in that review and a positive value for words that do. In this case, x turns out to be a 1×108508 sparse matrix with 30 stored elements in Compressed Sparse Row format, which saves us some memory.
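A sketch of this step with scikit-learn’s CountVectorizer; by default it stores word counts, and passing binary=True would give the 0/1 indicators described above:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()               # or CountVectorizer(binary=True) for 0/1 indicators
X = vectorizer.fit_transform(df["reviews"])  # one sparse row per review, CSR format
y = df["ratings"]

print(repr(X[0]))   # a single review is a 1-by-vocabulary-size sparse row
```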

We then split the Xs (the sparse matrix recording the occurrence of words in each comment) and Ys (the ratings) into a training set and a test set.
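A sketch of the split; test_size and random_state are arbitrary choices for illustration:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```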

Finally, we use MultinomialNB to fit our training data and compute the score on our test data. The accuracy of the model is 81%, which may not seem very impressive at first glance, but considering that we’re essentially doing sentiment analysis on short comments, and that there is large variation in how people express their positive or negative attitudes toward a course, 81% from this simple algorithm is quite respectable.
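A minimal sketch of the fit-and-score step, reusing the split from above:

```python
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # mean accuracy on the test set, ~0.81 per the article
```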

Looking at the distribution of the predicted classes, we can see that most ratings are predicted to be 4 or 5, which is reasonable since people tend to give more positive ratings than negative ones.
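One way to look at that distribution, sketched below:

```python
import pandas as pd

y_pred = model.predict(X_test)
print(pd.Series(y_pred).value_counts().sort_index())   # how often each rating is predicted
print(y_test.value_counts().sort_index())              # actual rating distribution for comparison
```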

You could also write some example comments and test whether the algorithm predicts them correctly! The first one below is an obvious 5-star rating, but if we put in more ambiguous words such as ‘confusing’ or ‘dry’, the result could be different.
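A sketch with made-up comments; the fitted vectorizer from above must be reused so the new text is encoded with the same vocabulary:

```python
# Example comments are invented for illustration.
examples = [
    "Amazing course, very clear explanations and great exercises!",
    "The lectures were a bit confusing and dry.",
]
print(model.predict(vectorizer.transform(examples)))
# The first should come out as 5; the second, more ambiguous one may be rated lower.
```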

We could try to improve our model by using stop_words=‘english’, which removes commonly used words such as ‘the’, ‘a’ and ‘an’, allowing us to focus on words that actually convey meaning. However, the resulting accuracy drops to 80.6%. One possible explanation is that, because most ratings are 4 or 5, frequently encountering these stop words in test-set comments nudges the model toward those majority classes, so blindly guessing 4 or 5 whenever similar stop words appear can yield a higher accuracy. The fact that removing stop words reduced the accuracy reveals this weakness in our original model.
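The stop-word variant, sketched with the same split parameters as before (reuses df, y and the imports from the earlier snippets):

```python
vectorizer_sw = CountVectorizer(stop_words="english")
X_sw = vectorizer_sw.fit_transform(df["reviews"])
X_train, X_test, y_train, y_test = train_test_split(X_sw, y, test_size=0.2, random_state=42)
print(MultinomialNB().fit(X_train, y_train).score(X_test, y_test))   # ~0.806 per the article
```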

Another parameter we could specify to try to improve the model’s performance is min_df; with min_df=30, a word is only considered if it appears in at least 30 comments. This might make the model more generalizable and less affected by typos, slang, or special cases.
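The min_df variant, sketched the same way (again reusing df, y and the earlier imports):

```python
# Only keep words that appear in at least 30 reviews.
vectorizer_df = CountVectorizer(min_df=30)
X_df = vectorizer_df.fit_transform(df["reviews"])
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)
print(MultinomialNB().fit(X_train, y_train).score(X_test, y_test))
```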

It turns out that this also reduced the accuracy of our model!

In this case, the best way to improve the model is simply to look at where it misclassifies and how far off the predictions are from the correct answers:
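One way to inspect those misclassifications is sketched below: splitting the raw review text alongside the sparse features keeps everything aligned so misclassified comments can be read afterwards (variable names are illustrative):

```python
# Split the raw text together with the features so they stay aligned.
X_train, X_test, y_train, y_test, rev_train, rev_test = train_test_split(
    X, y, df["reviews"], test_size=0.2, random_state=42
)
model = MultinomialNB().fit(X_train, y_train)

# Look at the first 100 test samples and print the ones we got wrong.
y_pred = model.predict(X_test[:100])
for pred, actual, text in zip(y_pred, y_test[:100].to_numpy(), rev_test[:100]):
    if pred != actual:
        print("comment :", text)
        print("prediction :", [pred])
        print("actual :", actual)
```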

Taking a closer look at the specific cases of misclassification among the first 100 samples, we can see that there are cases where the user’s comment is indeed very vague and it could be difficult even for a human to predict an accurate rating, for example:

comment : A few grammatical mistakes on test made me do a double take but all in all not bad.

prediction : [3]

actual : 4

The model might have predicted a lower rating because it recognized the negative sentimental connotation of the words "mistakes" and "bad".

comment : Excellent course and the training provided was very detailed and easy to follow.

prediction : [5]

actual : 4

This comment conveys a strong positive sentiment, and I believe a human would also interpret it as a five-star rating.

On the other hand, there are also cases where the model’s prediction is far off from the actual rating:

comment : The ProctorU.com system took 2 times the amount of time spent on this course over 3 days to complete. It is the worse production user system I have used in 20+ years of my IT career. You should switch to another vendor.

prediction : [1]

actual : 5

It would be quite obvious to a human that this is a negative comment, and the model sensibly predicted 1 star, yet the user actually rated the course 5 stars!


To summarize, the learning process is not a difficult task; a few lines of code will do the job for us. The more challenging part, and the real job of data scientists, is to evaluate the model, look at specific cases of misclassification, find out which parts of a sentence could be misleading, and tune the model carefully to account for that uncertainty.


Follow me / connect on:

LinkedIn: Americana Chen

Instagram: @africcccana

Facebook: Americana Chen

