How to pick products on Amazon (using Bayesian statistics to help you decide)

A Case Study of Using Bayesian Inference for Decision Making

Pedro Martins de Bastos
Towards Data Science


Image by the author.

One question many of us come across while shopping on Amazon is: how do I compare two ratings? Amazon ratings give us two important pieces of information: the distribution of ratings (how many were five stars, four stars, three stars, two stars or one star) and how many reviews each product received. Comparing distributions can be pretty straightforward: we can compare the average ratings between two products and then decide. How do we incorporate the number of reviews, however? And does it matter?

Let us say that I'm in the market for a computer monitor and searching for options on Amazon. I found two monitors I like. Monitor 1 has an average rating of 4.5 from 3130 reviewers.

Image by the author.

Monitor 2 has an average rating of 4.7, but only 172 reviewers.

Image by the author.

How do we pick between the two? Intuitively, we know that products with more reviews are more popular (another signal in favour of the product's quality). Some of us might also realize that more reviews (3130 vs 172) mean that the average rating you're seeing is more 'reliable'. In other words, more ratings mean that, through the crowdsourcing of information, we are closer to the 'true' average for that product. But how do we quantify the effect that the number of ratings has on the reliability of the average?

One way to answer this question would be to use Bayesian inference to try to infer our product's 'true' ratings distribution. How does this help? Bayesian inference will allow us to create an initial model for the rating distribution and update that model with the data we have. Once we 'train' our model based on all the available data, we can see how the model's variance differs based on how much data was used to train it. If you'd like a full breakdown of how to do Bayesian inference in Python, please look at another article I wrote here.

Coming Up With a Model

Decision-Making Strategy

What is an adequate model for our problem? This is where the data scientist meets a subjective, almost artistic requirement: they must recognize the pattern from their own experience and intuition.

The first thing we want to come back to is our goal: to be able to make better decisions. At this stage, we might also question what 'better decisions' means. In game theory, there are a few different mutually exclusive strategies to maximize a 'payoff', which in this case is the rating we would give to a product (which should correlate with our level of satisfaction). We have, for example, the conservative 'minimax' strategy. 'Minimax' means minimizing the maximum harm. In other words, we ask ourselves: what is the worst possible scenario, and which strategy will lead us to the best worst-case scenario? If we knew we were going to fail, which option would lead to the least harm? In the context of choosing Amazon products, we can create a strategy inspired by minimax where we instead avoid the option with the highest risk of us giving out a rating of one.

Conversely, the optimist’s strategy would be the ‘Maximax.’ For this strategy, we seek the option that opens us to the best possible scenario. One optimistic strategy inspired by Maximax would be to look at both products and pick the one with the highest probability that we will rate it a five.

Finally, we might adopt a balanced strategy that many of us may already be familiar with: the highest expected value. The ‘highest expected value’ strategy is equivalent to simply choosing the strategy with the highest mean rating. If we have a product with a mean of 4.5, and another with a mean of 4.7, then we simply choose the 4.7 if we take the past scores to be predictive of how we might rate the product. Daniel Kahneman, the Nobel-prize-winning father of behavioural economics, advocates using the highest expected value strategy. He argues that humans tend to frame decisions narrowly, which leads us to be either pessimistic or optimistic based on how we are feeling. If we instead make an effort to frame our decisions broadly and trust the statistics, we will realize that eventually our luck and misfortune cancel out, and the outcomes regress toward a mean. If we buy several products throughout our lifetime, being too optimistic or too pessimistic will lead us to miss out: the expected value is our best guess. Therefore, the expected value is, through this lens, the best long-term strategy.

Nevertheless, we always must consider the nuances of a particular decision before adopting any strategy. I would argue that in the case of Amazon purchases, what you really want is to find the one product that best suits your needs, especially because you can return a product and exchange it for another one. This means that, in this case, we might heavily discount the risk of getting a product we really dislike as we can just give it back. Thus, we focus on finding the product that will give us the highest chance of complete satisfaction.

As such, let us say that our strategy is to find a way to maximize the probability that we will rate a product a 5/5. To do this, we will take our data and create a model that describes the distribution of ratings. With this model in hand, we can then predict the probability that a new rating will receive each of the five possible levels. More importantly, with a Bayesian inference model, we will also have a better idea of the uncertainty behind these predictions by knowing the distribution of our model’s parameters.

Let's say that our model is a simple vector of five elements, corresponding to the probabilities that a new rating will fall into each of the five possible categories. For product one, that would be [0.73, 0.14, 0.06, 0.03, 0.04]. To go a step further, and to be able to incorporate the idea of uncertainty and how it is related to the number of ratings, we want to know not only what the parameters seem to be right now but also how they seem to be distributed. We want something like a bell curve for each of these parameters (each of the five probabilities) so that we know how far we can expect them to stray from their means.

Thankfully, there are two distributions that can help us achieve this: the multinomial distribution and the Dirichlet distribution. The multinomial distribution takes in a 'simplex' vector, i.e., a vector whose elements are between 0 and 1 and add up to 1, and a number of events. It outputs a vector of the same size whose integer counts sum to the number of events and are distributed according to the input probabilities. So, if we do:

α~Multinomial(n = 50 ratings, probabilities = [0.73, 0.14, 0.06, 0.03, 0.04])

We could get something like:

[32, 10, 3, 1, 4]. It is as if we are simulating 50 people giving out ratings, with probabilities given by our simplex input vector.

In Python, a minimal sketch using numpy's multinomial sampler:
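```python
import numpy as np

# Probabilities of a rating being 5, 4, 3, 2 or 1 stars, respectively
p = [0.73, 0.14, 0.06, 0.03, 0.04]

# Simulate 50 people rating the product
counts = np.random.multinomial(n=50, pvals=p)
print(counts)  # e.g. [36  7  3  2  2]
```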

This closely mimics our use case for Amazon product ratings. If we know what the vector p looks like, we can simulate how 100 people, for example, might rate that product. Note that what we want is the opposite, however: we want to take the counts of how many people voted for each category and come up with a distribution over the five percentages, i.e. the probability vector. This is where the Dirichlet distribution comes in.

The Dirichlet distribution takes in a vector of counts α and outputs a sample of the probability vector, i.e. a simplex vector with the same number of elements as α. Therefore, if we input the vector [32, 10, 3, 1, 4] into the Dirichlet distribution, we might sample something like the probability vector [0.73, 0.14, 0.06, 0.03, 0.04]. I.e.:

p~Dirichlet([32, 10, 3, 1, 4])

outputs something like [0.73, 0.14, 0.06, 0.03, 0.04].

In Python, a minimal sketch using numpy's Dirichlet sampler:
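```python
import numpy as np

# Concentration parameters: the observed counts of 5, 4, 3, 2 and 1-star ratings
alpha = [32, 10, 3, 1, 4]

# Draw one sample of the probability vector (a simplex vector of 5 elements)
p = np.random.dirichlet(alpha)
print(p.round(2))  # e.g. [0.66 0.2  0.05 0.02 0.07]
```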

We can see how the multinomial and Dirichlet distributions fit nicely with each other. In fact, this pairing has a special name in statistics: the Dirichlet is the 'conjugate prior' of the multinomial. When we use the Dirichlet distribution as a prior and the multinomial as a likelihood function for Bayesian inference, the posterior will also be a Dirichlet distribution. Furthermore, the posterior can be computed analytically just based on our data. Wikipedia has a very useful list of existing conjugate priors, and how we might compute a posterior for each conjugate pair.

Bayesian Inference

The first step for our Bayesian inference is to define distributions for our prior, likelihood, and posterior. As a quick recap, Bayesian inference follows Bayes' theorem:

P(B|A) = P(A|B) × P(B) / P(A)

Where:

P(A|B) = the likelihood

P(B) = the prior

P(A) = the evidence

P(B|A) = the posterior

For this model, we want to find out the posterior distribution of our probability vector p given the data we have access to.

Our likelihood will follow the multinomial distribution: note how the likelihood distribution always mirrors the distribution of the actual data, but that they are not the same. The data is distributed with some parameter p according to the multinomial distribution. The likelihood function, however, takes in an observed datum and outputs the probability of observing it given a proposed set of parameters p. Thus, with the likelihood function, we evaluate how plausible each set of parameters p is, rather than the probability of observing a particular datum. Please refer to my older article for a deeper dive.

Our prior, in turn, will follow the Dirichlet distribution. Based on our existing knowledge, we create a prior that makes sense to us (or, perhaps, we create an uninformative prior by keeping it uniform, using the parameters α = [1, 1, 1, 1, 1], for instance). If we sample from this uninformative prior, we will not get good results for what our model's parameters look like (we might get something like [0.12, 0.2, 0.35, 0.18, 0.15], which we know is not close to reality). That is why we need to perform the Bayesian inference: to modify this prior, based on the likelihood function and on the data we witness, and thereby create a posterior that is a closer match to reality.

The Conjugate Prior Trick

As I mentioned, the two big advantages of using a conjugate prior (and, indeed, of being lucky enough that our situation naturally involves one) are that (a) the posterior will follow the same distribution as the prior and (b) the posterior is easy to compute analytically. To compute the posterior of a multinomial/Dirichlet conjugate pair, all we need to do is add the data we observe to the α vector's elements in our Dirichlet distribution.

Let us say that our Dirichlet prior, which we create based on our intuition, is:

p~Dirichlet(α = [1, 1, 1, 1, 1])

Our new α, for the posterior, is computed by:

New α = α+Xn, where Xn is a vector with the data we observe (i.e., the number of votes in each category for our product).

So if we observe Xn = [5000, 300, 100, 50, 3], where 5000 is the number of 5-star ratings and so forth, our new α will simply be [5001, 301, 101, 51, 4]. Our posterior, therefore, would be:

p~Dirichlet( α = [5001, 301, 101, 51, 4])

Using Our Trick

Now we can apply our trick to our problem. First, we have to choose our prior. We might have an inclination that, for Amazon products of decent quality, i.e. those that we might care to consider buying, most ratings are 5s, with progressively fewer in each lower category. We might also note that angry customers usually give out 1s more frequently than 2s, so there is often a second bump at the 1-star end. These patterns help guide us in shaping our prior (remember, this is our inclination about the product's rating distribution WITHOUT looking at the data).

Let's say that we come up with the following prior: [50, 40, 30, 20, 25]. Great! We can even sample from this prior if we want, using statistical software (I like to use the scipy.stats library in Python). We want to do better than our prior, however. Because we are using the conjugate prior trick, we can come up with the posterior very easily: we simply add the data to our prior! Let's look at monitor 1's rating data from above: [2285, 438, 188, 94, 125] (I multiplied the percentages by the number of ratings). Our posterior, following the formula I showed above, would then be the simple addition of the two vectors, as in this minimal numpy sketch:
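```python
import numpy as np

# Our informative prior
prior = np.array([50, 40, 30, 20, 25])

# Monitor 1's observed rating counts (percentages × 3130 ratings)
data = np.array([2285, 438, 188, 94, 125])

# Conjugate-prior update: posterior concentration = prior + observed counts
posterior_alpha = prior + data
print(posterior_alpha)  # [2335  478  218  114  150]
```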

So our posterior is the vector [2335, 478, 218, 114, 150]. Our full posterior, in turn, is:

p~Dirichlet([2335, 478, 218, 114, 150])

Sampling from the Posterior

Now that we have the posterior, what do we do with it? Remember, the posterior is a distribution over the parameters of our model. In this case, the posterior describes the ratios of how many product reviews will fall under each of the five categories. If we draw one sample using our model (a minimal numpy sketch below), this is what it looks like:
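```python
import numpy as np

# Posterior concentration for monitor 1 (prior + observed counts)
alpha_posterior = [2335, 478, 218, 114, 150]

# One draw of the probability vector from the posterior
sample = np.random.dirichlet(alpha_posterior)
print(sample.round(2))  # e.g. [0.72 0.15 0.06 0.04 0.04]
```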

We can take this output to mean that 72% of ratings are expected to be 5s, 15% to be 4s, 6% to be 3s, 4% to be 2s and 4% to be 1s. However, the Bayesian approach’s usefulness is in sampling multiple possibilities for the parameters and seeing the distribution of these parameters.

Let's focus only on the 5-rating category. Let's say that we want to draw 1000 samples from the posterior and observe what the distribution looks like (a minimal sketch of such a histogram with numpy and matplotlib follows):
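```python
import numpy as np
import matplotlib.pyplot as plt

alpha_posterior = [2335, 478, 218, 114, 150]

# 1000 posterior samples; column 0 holds the probability of a 5-star rating
samples = np.random.dirichlet(alpha_posterior, size=1000)
p_five = samples[:, 0]

plt.hist(p_five, bins=30)
plt.xlabel("Sampled probability of a 5-star rating")
plt.ylabel("Count")
plt.show()
```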

Image by the author.

This is when we start to see the benefit of using Bayesian inference. Not only do we know the mean, i.e. our best guess for the probability of a five, but we also know how confident we are about that assertion. Notice the 95% confidence interval: it marks the boundaries within which 95% of our sampled values for this parameter fall. The width of this interval is ultimately determined by how much data we have: we will see that, for the Dirichlet distribution (and for most distributions), the more data we have, the tighter the confidence interval.

Before we look at how the confidence interval changes, however, we can take a look at the distributions for the remaining categories. I wrote some code to sample from the posterior and then create what I call a 'column histogram'. Here is a minimal sketch of one way to draw such a plot with matplotlib (the jitter and styling details are illustrative):
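```python
import numpy as np
import matplotlib.pyplot as plt

# Posterior for monitor 1 under an unbiased prior: alpha = 1 + observed counts
alpha = 1 + np.array([2285, 438, 188, 94, 125])
samples = np.random.dirichlet(alpha, size=100)

for i in range(5):
    col = samples[:, i]
    x = np.random.normal(i, 0.04, size=len(col))              # horizontal jitter
    plt.scatter(x, col, s=12, alpha=0.6)                      # one dot per sample
    plt.hlines(col.mean(), i - 0.2, i + 0.2, colors="black")  # column mean
    lo, hi = np.percentile(col, [0.5, 99.5])                  # 99% interval bounds
    plt.hlines([lo, hi], i - 0.2, i + 0.2, colors="green")

plt.xticks(range(5), ["5", "4", "3", "2", "1"])
plt.xlabel("Rating category")
plt.ylabel("Sampled probability")
plt.show()
```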

Column histograms are useful when we want to compare the distributions of a few different categories at once. Another good option is a box plot. In a column histogram, instead of using bars whose heights signal density, we use dots and the density of dots. Each dot represents one sampled datum. Have a look at a column histogram made from drawing 100 samples from our posterior:

Image by the author.

Note that I used an unbiased prior ([1, 1, 1, 1, 1]) for this inference. The purple column for rating 5 shows how most samples are close to the mean (the horizontal black line on that column). As we get closer to the 99% confidence interval boundary (the horizontal green lines), the points become more sparse. This is less visible, but still true, for the remaining columns.

This graph, above, is for monitor 1, the monitor with 3130 Amazon ratings (or, in other words, 3130 data points that we used to feed our Bayesian model). What if we do the same process, and plot posterior samples, for a product with far fewer ratings? What if we run this pipeline for monitor 2? This is the graph we would see:

Image by the author.

See how the confidence intervals are a bit wider than before? Why is this so different? The key is that we have only 172 ratings, and this is what our Dirichlet looks like with that much data (assuming an uninformative prior of [1, 1, 1, 1, 1]). If we bias the prior a little bit, say by setting it to [10, 5, 4, 2, 3], we change the apparent means:

Image by the author.

In this case, the prior didn’t do much to help decrease the variance in each category (and the variances were already fairly small), so we might wish to use an unbiased prior.

What if, hypothetically, we had not 172 but 20 ratings for this monitor (following the same ratios that we see in the 172 data)?

In that case, we would get an even wider confidence interval:

Image by the author.

Now we can start to see that the means are much more uncertain and that the confidence intervals are large. A bit of a philosophical aside here: is the purpose of our model to look like our data, or is the purpose of our model to learn about reality? It might be tempting to wish our model looks exactly like our data, and in so doing create a ‘good’ model. But when our data is paltry, we have to distrust it. For monitor 2, in the hypothetical case of 20 ratings, we might be tempted to be upset that our model cannot capture the means for the categories, instead creating these large intervals for our possibilities. However, we must also be suspicious of the data itself: there simply isn’t enough of it. Any means that we compute using that data could be biased based on randomness alone. We should trust that gut instinct we get about being suspicious of a 5-star rating with only 20 reviewers: the product simply hasn’t been tested by enough people, and the percentages we see for each category are likely to change a lot as more reviews come in.

The next question I pose, then, is this: how many people are enough people? At what point can we say that we can trust our model?

Studying the Rate of Convergence

Convergence towards the ‘true’ parameters of a model is the goal of every machine learning model. Assuming that the parameters stay relatively constant in nature, a good model would, with the right amount of data, be able to approach those values. Analogously, we can think of how it takes quite a few votes/ratings for the ‘true’ means of a product to be approximated.

One way we can investigate this convergence is by simulating our model with different amounts of data (numbers of reviews). Put simply, we can trace how the standard deviation (or variance) of our model changes as the number of reviews increases. This is exactly what I did next; take a look at this graph:

Image by the author.

Again, the standard deviations I computed are the standard deviations of the posterior samples for our five parameters (the five probabilities, one for each rating category). This is using the data from monitor 1. As we increase the number of ratings from 0 to 30, we see a steep decrease in the standard deviations. From that point on, the slope tapers off and the standard deviation decreases more slowly. This shows that after 30 or so reviews, our model has a pretty good idea of what the distribution of ratings looks like, and there should be little variation as more reviews are added. This also means that we don't need a very high number of reviews to have a good grasp on which of the two products is the better buy.
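For reference, here is a minimal sketch of how such a convergence simulation might be run, scaling monitor 1's rating proportions to different review counts (the exact counts tested here are illustrative):

```python
import numpy as np

# Monitor 1's rating proportions (5, 4, 3, 2, 1 stars)
p_true = np.array([0.73, 0.14, 0.06, 0.03, 0.04])

for n_reviews in [5, 10, 30, 100, 300, 1000, 3000]:
    counts = np.round(p_true * n_reviews)           # scale proportions to n reviews
    alpha = 1 + counts                              # posterior under a flat prior
    samples = np.random.dirichlet(alpha, size=1000)
    stds = samples.std(axis=0)                      # spread of each of the 5 parameters
    print(n_reviews, stds.round(3))
```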

Making the Purchase Decision

Now that we have all of the statistical tools at our disposal, which of the original two monitors should we choose? Let's look at both models' posterior distributions for the probability of a 5 rating, in the same graph:

Image by the author.

First, some notes about this graph: I plotted two 'column histograms', one for the distribution of the posterior for the percentage of 5s for monitor 1 on the left, and the same thing for monitor 2 on the right. The first thing you might notice is that the variance for monitor 2 is much higher. The same is true for the mean: monitor 2's mean sampled probability of a 5 is around 0.87, as expected, versus the 0.73 of monitor 1.

Let's circle back to our decision-making strategy: I argued in favour of adopting a maximax-inspired strategy. This strategy means maximizing the chances of our best-case scenario: buying a product and concluding that it is a 5-star product. As such, we can focus only on the '5' rating column. If we compare our model's results for these two monitors, one is definitely the better choice: monitor 2, despite its overall higher variance and uncertainty, seems to have a consistently higher probability of a 5 rating. We can also quantify this by asking: for every pair of samples we draw from the two posteriors, how frequently is monitor 2's sample larger than monitor 1's? In the case above, the answer is 100%. For larger samples, we might witness outliers that bring this number down to 99% or less, but the conclusion is fairly straightforward: monitor 2 is almost always the better buy. Here is a minimal sketch of this comparison (monitor 2's per-category counts are not listed above, so the vector used for it is an illustrative one consistent with a 4.7 average over 172 ratings):
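```python
import numpy as np

# Posterior concentrations: 1 (flat prior) + observed counts per category
alpha_1 = 1 + np.array([2285, 438, 188, 94, 125])  # monitor 1, 3130 ratings
alpha_2 = 1 + np.array([150, 10, 5, 3, 4])         # monitor 2, 172 ratings (illustrative counts)

# Sample the probability of a 5-star rating from each posterior
s1 = np.random.dirichlet(alpha_1, size=10_000)[:, 0]
s2 = np.random.dirichlet(alpha_2, size=10_000)[:, 0]

# Fraction of draws in which monitor 2's sampled probability beats monitor 1's
print((s2 > s1).mean())
```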

Another simple way to verify that monitor 2 is better is simply to note that the lower bound for monitor 2’s 99% confidence interval (around 0.8) is higher than the upper bound for monitor 1’s 99% confidence interval (around 0.75).

Biases and Obstacles

More often than not, after you’ve put a model into production you eventually realize that there are biases affecting your results and that over time your model’s accuracy seems to decrease. This may be counter-intuitive: with more time, and more data, you would expect the model to become better. However, the increase in accuracy should be expected only if our population, the data we are analyzing, remains the same. In the case of Amazon ratings, we may have a ‘biased’ population:

  • The first bias I’d point out is that there might be a sampling bias for the early adopters of a product. People who buy a product with little to no ratings might be consistently more forgiving than late-game adopters, people who are skeptical and wait for others to try things first.
  • The second bias is the reviewer’s bias: who is it that posts reviews on Amazon? Can we expect the population of reviewers to be an accurate representation of the population of people who bought a product? I would say that this is likely not the case. There might be a bias towards people who have strong feelings about a product: people who love a product make sure to go praise it, and people who hate it make sure to make their feelings known. People who are lukewarm about a product might not take the trouble to review it.

For purchase decisions, the second bias I presented above might not be an obstacle (the bias should be roughly uniform across all products on Amazon), but the first bias might be. The first bias gives us another reason why a low number of reviews can be an issue: early adopters might have a bias towards being forgiving (or maybe they are extra skeptical). A deeper dive into this problem would include an analysis of these types of biases and tools/strategies to counter them.

Conclusion

My purpose with this article was to show how we can factor in the number of samples, and the confidence that number implies, into our decision-making. I showed how we can use Bayesian inference to build a model for Amazon’s rating system, and how we can plot how our confidence changes with an increased number of ratings. With all of that information at hand, we can then make more informed decisions.

Thank you for reading, and please do leave a note or contact me directly if you have thoughts, critiques or questions about the article!
