Adding Error Bars to 5-Star Reviews: A Bayesian Approach

Siavash Yasini
Towards Data Science
10 min read · Apr 26, 2020


Picking the right color socks 🧦 using PyMC3

When you’re shopping for a new pair of socks online, you naturally look for the ones with the highest review scores. And very often you see items with 5-star ⭐⭐⭐⭐⭐ reviews show up in your search results. But how willing are you to buy a pair of socks with a 5-star review from only two users? Would you pick that over a pair with an average of 4.5 stars from 100 reviews? Perhaps not, but why? Although you know that 5 is definitely greater than 4.5, there is something about having only two reviews that makes you hesitate.

When the number of reviews is small, you have every right to be doubtful about the validity of the rating. These ratings are just a number (an average) without any statistical uncertainty. If I tell you that the average height of my two cousins is 3’ 5” (~100 cm) that doesn’t really tell you much. Are they both preschoolers with the same height, or is one of them a basketball champion and the other one a toddler? Averages of a small number of observations are usually not that informative because they can carry huge uncertainties.

The situation is somewhat similar with the two 5-star reviews, but there is a way to remedy that. Aside from the average, 5, we also know the distribution of the reviews (think of their histogram). The goal of this article is to show how, using Bayesian statistics, we can use the number of reviews, or more precisely their distribution, to infer the uncertainty about the true rating. Do you need to buy a new pair of socks and don’t know how much you can trust the item with two 5-star reviews? Here we’ll learn how to attach error bars to those reviews and use them to rank the various pairs in our search results.

Note: This analysis is limited in scope and should be considered a purely theoretical exercise in Bayesian inference. One of its main assumptions, that all reviews are honest, is clearly not valid in most real-life cases.

All the code used to produce the results can be found in this notebook.

Model

I am going to model each review as a series of coin tosses — this is an article on statistics after all!

For clarity, I will use the word review to refer to an individual score for the product, and the word rating for the average of all user reviews. An individual review score is an integer between 1 and 5 (e.g. a 3 or a 4), but the rating can be any real number in that range (e.g. 3.8). According to the law of large numbers, the observed rating should converge to the true rating of the product. Obviously, I am making a few implicit assumptions here, namely (1) that the item has a true, objective rating (based on quality, fairness of price, etc.) and (2) that all the user reviews are honest; the latter assumption clearly does not apply to most real-life settings, which can contain paid or fake reviews.

Back to the model: let’s assume that each customer decides the review score for the socks using four tosses of a biased coin. Each time the coin lands heads, one star is added; if it lands tails, no star is added. The better the product, the higher the bias of the coin p (the probability of heads).

Why four tosses and not five? The minimum rating for an item is one star (we don’t have 0-star reviews), so that leaves us with only four coin flips. The number of observed heads plus 1 (the guaranteed first star) gives us the final review score for the socks. For example, the sequence H, T, H, H is equivalent to a 4-star review, and T, H, T, T results in 2 stars. In this model each review is essentially the result of a binomial process, where the probability of the outcome can be expressed with the binomial distribution as

P(k + 1 \mid p) = \binom{n}{k} \, p^{k} (1 - p)^{n - k} \qquad \text{(Eq. 1)}

where p is the bias of the coin, n is the total number of flips (here 4), and k is the number of observed heads. The 1 in k + 1 is added to shift to the rating scale, which starts at 1 instead of 0. The key assumption in this model is that the bias of the coin p is related to the true rating of the product such that p = 0 always results in the coin landing tails, or equivalently a 1-star rating (terrible product), and p = 1 always results in heads, which means a 5-star rating (awesome product). Meanwhile, p = 0.5 results in an average of two heads out of the four tosses, corresponding to a 3-star rating (average product). To comfort yourself mathematically, you can use the mean of the binomial distribution, np, which for p = [0, 0.5, 1] respectively yields ratings of 1, 3, and 5 (after adding the extra 1). Let’s confirm this using some Python code:
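The notebook’s snippet isn’t reproduced here; a minimal sketch of the same sanity check, using scipy.stats.binom, could look like this:

```python
import numpy as np
from scipy.stats import binom

n = 4  # number of coin tosses per review
for p in [0.0, 0.5, 1.0]:
    # the mean of Binomial(n, p) is n*p; adding 1 maps it back to the 1-5 rating scale
    mean_rating = binom.mean(n, p) + 1
    print(f"p = {p:.1f} -> mean rating = {mean_rating:.1f}")

# p = 0.0 -> mean rating = 1.0
# p = 0.5 -> mean rating = 3.0
# p = 1.0 -> mean rating = 5.0
```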

Great! Now we can use this model to generate some mock data to play with. Before we move on, I am going to define a scaler object that converts ratings to probabilities and vice versa. Our observable is the rating (the average of the reviews), which ranges from 1 to 5, but we will ultimately perform inference on the probability p, which ranges from 0 to 1. So it’ll be useful to have an easy way to translate back and forth between the two.
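A minimal version of such a scaler might look like the following; the method names r2p and p2r match how the object is used later in the article, while the implementation is an assumption based on the linear mapping between the two ranges:

```python
class Scaler:
    """Linearly map ratings in [1, 5] to coin biases in [0, 1] and back."""

    def r2p(self, rating):
        # rating of 1 -> p = 0, rating of 5 -> p = 1
        return (rating - 1) / 4

    def p2r(self, p):
        # p = 0 -> rating of 1, p = 1 -> rating of 5
        return 1 + 4 * p


scaler = Scaler()
```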

Data Generation: Mock Sock Reviews

Now that we have a generative model for the ratings, we can use it to generate mock data. To simulate the reviews for each product, all we need is the number of reviews and the true rating of the item, which determines the true bias of the coin p. Each review is one plus the number of observed heads in four (biased) coin tosses, so all we need to do is draw random numbers from a binomial distribution with parameters n=4 and p=scaler.r2p(true_rating).

Let’s follow these steps to generate some mock sock reviews! To make things easy, I’m going to create a Sock data class with three main attributes: n_reviews, true_rating, and color.
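A hedged sketch of what such a data class could look like is below; the attribute names follow the article, while the review-generation details (and the seed handling mentioned later) are assumptions, relying on the Scaler defined above:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Sock:
    n_reviews: int
    true_rating: float
    color: str
    reviews: np.ndarray = field(init=False)

    def __post_init__(self):
        # convert the true rating to the coin bias, then draw one binomial
        # review (plus the guaranteed first star) per customer
        p = scaler.r2p(self.true_rating)
        self.reviews = np.random.binomial(n=4, p=p, size=self.n_reviews) + 1
```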

Cool! Now, let’s say we have three different types of socks on the menu: blue, orange, and red.

I have assigned the true ratings of 3.2, 4.0, and 4.5 to these items, i.e. the blue socks are OK, the orange ones are good, and the red pair is great! Our goal as a sock shopper is to buy the best pair of socks we can find out there, but the problem is that the true ratings of these items are not observable. The only thing we can observe is a statistical estimate of the true rating, a.k.a. the average customer rating. The average rating is a good estimate of the true rating if we have a lot of reviews, but what if we don’t?
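For reference, here is how the three pairs could be instantiated with the review counts used later in the article (20, 2, and 100), again as a sketch rather than the notebook’s exact code:

```python
socks = [
    Sock(n_reviews=20,  true_rating=3.2, color="blue"),
    Sock(n_reviews=2,   true_rating=4.0, color="orange"),
    Sock(n_reviews=100, true_rating=4.5, color="red"),
]
```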

In this example, we have assumed that the orange socks (4.0 stars) are objectively inferior to the red ones (4.5 stars), but imagine that the only two customers who bought the orange pair were, for some reason, super excited about their purchase and decided to give it 5 stars (meaning all their coins randomly landed heads). In this case, the average rating of 5 (which we see) would be larger than the true rating of 4.0 (which we don’t see). So by only comparing the average ratings, we would draw the incorrect conclusion that the orange pair is superior to the red pair.

To simulate this, I’m going to cheat and choose the seed of the random number generator so that both reviews for the orange socks end up as 5 stars (this is already implemented in the Sock data class). Let’s have a quick look at the normalized histogram of all the reviews.
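A simple matplotlib sketch of that plot, assuming the socks list and simulated reviews from above, could be:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for sock in socks:
    # normalized histogram of the simulated reviews for each pair
    ax.hist(sock.reviews, bins=np.arange(0.5, 6), density=True,
            alpha=0.4, color=sock.color, label=sock.color)
    ax.axvline(sock.true_rating, color=sock.color, linestyle="--")    # true rating
    ax.axvline(sock.reviews.mean(), color=sock.color, linestyle="-")  # average rating
ax.set_xlabel("review score")
ax.set_ylabel("normalized count")
ax.legend()
plt.show()
```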

The dashed lines are the underlying true ratings and the solid lines are the averages of all reviews. Remember that the true ratings, which are not subject to randomness, are unobservable. The goal here is to pick the pair of socks with the highest true rating. As you can see from the plot, if we go by the highest average rating and pick the orange pair, we would be making a mistake. But how can we adjust our metric to pick the correct color?

Bayesian Inference

Perhaps not surprisingly, I’m going to use Bayes’ theorem to infer the true rating of the socks, using the observational data as evidence. We want to find the true rating of each product, given all the user reviews:

P(\text{true rating} \mid \text{reviews}) = \frac{P(\text{true rating}) \, P(\text{reviews} \mid \text{true rating})}{P(\text{reviews})} \qquad \text{(Eq. 2)}

The first term in the numerator on the right-hand side is the prior (our initial belief about the distribution of the true rating) and the second term is the likelihood. To be conservative, we can use a flat prior for the rating, that is to say, we assume that before seeing any of the reviews it’s equally likely for the socks to have any rating between 1 and 5 stars. The likelihood of each individual review is the binomial distribution shown in Eq. 1 above, so to construct the full likelihood we only need to replace each k with the observed review score and multiply everything together. Remember that there is a one-to-one mapping between the bias of the coin p (see Eq. 1) and the true rating (Eq. 2), so I am using them interchangeably.

In order to find the probability distribution of the true rating of each item, I’m going to use PyMC3 to sample the posteriors. The following function takes in an instance of the Sock object and returns the MCMC samples from the posterior in Eq. 2. Going over the details of how PyMC3 works is beyond the scope of this article, but feel free to ask questions in the comment section or post issues on the GitHub repo if anything is confusing.
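The original function isn’t reproduced here; the sketch below shows one way it might be written with PyMC3 (the function name, number of draws, and tuning steps are assumptions):

```python
import pymc3 as pm

def sample_posterior(sock, draws=2000, tune=1000):
    """Return MCMC samples of the coin bias p for one Sock instance."""
    # each review corresponds to k = review - 1 heads out of n = 4 tosses
    observed_heads = sock.reviews - 1
    with pm.Model():
        # flat prior on the coin bias, written as Beta(1, 1)
        p = pm.Beta("p", alpha=1, beta=1)
        # binomial likelihood of Eq. 1
        pm.Binomial("k", n=4, p=p, observed=observed_heads)
        trace = pm.sample(draws=draws, tune=tune)
    return trace
```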

Note: In the prior section of the above function I’ve used a Beta distribution for p to describe the flat prior, because the Beta distribution is the conjugate prior of the binomial. This is technically unnecessary since we’re not solving the problem analytically, but I did it anyway so I could write this little note about it!

Now let’s sample the posterior for all the socks and find the mean and standard deviation of the probability distributions.
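One way to do that, reusing the hypothetical sample_posterior function and the scaler from above:

```python
# sample the posterior of the coin bias for each pair of socks
traces = {sock.color: sample_posterior(sock) for sock in socks}

for color, trace in traces.items():
    # convert posterior samples of the coin bias back to the 1-5 rating scale
    rating_samples = scaler.p2r(trace["p"])
    print(f"{color}: {rating_samples.mean():.2f} ± {rating_samples.std():.2f}")
```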

Now, in addition to an average, we have a standard deviation too! We can add this to the original histogram as error bars (I’m being a bit careless here, simply drawing the standard deviation around the mean, which is not strictly correct for non-normal data). Here is what it looks like:

As expected, the socks with a larger number of reviews have smaller error bars. Very interesting, but maybe not as informative as we hoped. The inferred rating for the orange socks with only 2 reviews is still larger than the inferred rating of the red ones with 100 reviews. So how do we pick the correct pair?

Let’s check out the KDE of the posterior to see the actual shape of the distribution. This plot shows the posterior probability of the true rating for each pair: our initial belief about what the rating might be (a uniform probability over all ratings in this case), updated by the likelihood of the observed user reviews (the binomial likelihood of Eq. 1).

Let’s make a few observations here. Just as a reminder, the (blue, orange, red) socks with the true ratings of (3.2, 4.0, 4.5) had (20, 2, 100) reviews respectively. First, we see that the width of each posterior PDF shrinks as the number of reviews grows: the more reviews we have, the less uncertain we are about the final estimate of the true rating. Second, the peaks of the blue and red PDFs are pretty close to the true ratings, but for the orange pair not so much. This makes sense because for the orange socks we only had 2 observations, which are not very constraining. The long tail of the orange posterior reflects how uncertain it is about the final result. But it still gives us the information we were looking for in this analysis.

What if, instead of using the mean of the posteriors to rank the socks, as we did in the second plot, we use the 5th percentile? This way we are 95% sure that the true rating lies above this number. With this ranking strategy, items with lower ratings or with fewer reviews (which means a wider tail in the posterior) sink lower in the list. Using the 5th percentile of our posterior PDFs results in the following ordering:
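A sketch of how that ranking could be computed from the posterior samples (again reusing the hypothetical traces dictionary and scaler from above):

```python
# 5th percentile of each posterior, converted back to the rating scale
lower_bounds = {
    color: scaler.p2r(np.percentile(trace["p"], 5))
    for color, trace in traces.items()
}

# rank from highest to lowest 5th-percentile rating
for color, score in sorted(lower_bounds.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{color}: {score:.2f}")
```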

Voila!

And that’s it… Now that we’ve added error bars to our sock reviews, we can confidently buy the best pair. And that’s certainly what Reverend Thomas Bayes would have done!

Important Note: Do not attempt to implement this analysis in real life. Socks or other types of undergarments purchased solely based on the results of this analysis might not match with the rest of your outfit.
