Evaluation Metrics for Recommendation Systems — An Overview

Understanding the purpose and functionality of common metrics in ML packages

Pratik Aher
Towards Data Science


Recently, while experimenting with a recommendation system project, I found myself using a variety of evaluation metrics. So I compiled a list of metrics that I found helpful and some other things to consider while evaluating recommendation systems. These metrics are commonly found in ML packages, yet understanding their purpose and functionality is essential.

Recall @K

Recall@K measures how many of a user's relevant items appear in the top K recommendations, out of all of that user's relevant items, where K is the number of recommendations generated for the user. For example, suppose we are building a movie recommender system that recommends 10 movies for every user. If a user has seen 5 movies and our recommendation list contains 3 of them (out of the 10 recommendations), then Recall@10 for that user is 3/5 = 0.6. Usually, the average is taken across all users for evaluation.

It is a simple yet significant metric from a business point of view, as it shows how much real value the system brings in terms of predicting user behavior.

Range : 0–1
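
As a rough sketch (not from any particular library; the function and movie names below are illustrative), Recall@K can be computed like this in Python:

```python
def recall_at_k(recommended, relevant, k=10):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0  # no relevant items for this user; 0 by convention
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

# The movie example above: the user watched 5 movies and 3 of them
# show up among the 10 recommendations, so Recall@10 = 3/5 = 0.6
recommended = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
watched = ["m2", "m5", "m9", "m11", "m12"]
print(recall_at_k(recommended, watched, k=10))  # 0.6
```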

Precision @K

Precision@K measures how many of the K items recommended to a user are relevant, where K is the number of recommendations generated for the user.

Take a recommendation system that recommends 10 movies for every user. If a user has watched 5 movies and 3 of them are present in our recommendation list, then Precision@10 is 3/10.

It is a very important metric from a scale and ranking point of view because, in the real world, there is a limit to how many recommendations you can serve to a user. This limit can come from attention span (users want to be able to see relevant recommendations at first glance, so having relevant items at the top is crucial) or from memory requirements (if you can only store 100 recommendations per user, you want to be precise about which ones you choose).

Range : 0–1
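
A similar sketch for Precision@K, again with illustrative names, and assuming we always divide by K:

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that are relevant.
    Assumes we always divide by k, even if fewer items were recommended."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k

# Same example: 3 of the 10 recommended movies were watched by the user,
# so Precision@10 = 3/10 = 0.3
recommended = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
watched = ["m2", "m5", "m9", "m11", "m12"]
print(precision_at_k(recommended, watched, k=10))  # 0.3
```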

F1 @K

F1 Score combines Precision and Recall using the harmonic mean. It is the same as the regular F1 score and does not differ in the context of recommendation systems. The harmonic mean ensures that if only one of Precision or Recall has a really high value, it does not dominate the score: the F1 Score is high only when both Precision and Recall are close to 1.

Range : 0–1
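
A minimal sketch of F1@K, computing Precision@K and Recall@K inline (illustrative names, same assumptions as the sketches above):

```python
def f1_at_k(recommended, relevant, k=10):
    """Harmonic mean of Precision@K and Recall@K."""
    hits = len(set(recommended[:k]) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# With Precision@10 = 0.3 and Recall@10 = 0.6:
# F1@10 = 2 * 0.3 * 0.6 / (0.3 + 0.6) ≈ 0.4
recommended = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
watched = ["m2", "m5", "m9", "m11", "m12"]
print(f1_at_k(recommended, watched, k=10))  # ≈ 0.4
```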

Ranking-related metrics:

As we discussed above when talking about precision, it is crucial to have relevant recommendations at the top. There are various ways to measure whether relevant recommendations are indeed at the top. These measures are used not only for evaluation but also as loss functions for ranking models.

Mean Average Precision @K

One way of measuring how good a recommendation list is at placing relevant items near the top is Mean Average Precision.

Let's first understand what Average Precision is. If we recommend K items, Q of which are relevant, Average Precision sums Precision@k at every position k that holds a relevant item and normalizes by K (this is the convention used in the worked example below; some definitions normalize by Q instead):

AP@K = (1/K) × Σ Precision@k × rel(k), where rel(k) is 1 if the item at position k is relevant and 0 otherwise.

If all the relevant items are at the top of the list, the Average Precision score for that user is high.

Example :

List of Recommendations : [“Top Gun”, “Arrival”, “Gladiator”]

Ground truth : [“Arrival”, “Gladiator”]

Precision@k values for k = 1, 2, 3 : [0, 1/2, 2/3]

Average Precision (AP) = (1/3) × [(1/2) + (2/3)] ≈ 0.39

The mean in MAP is just the Average Precision (AP) averaged across all users.

Range : 0–1
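
Here is a sketch of AP@K and MAP@K that follows the normalization used in the worked example above (dividing by K); some definitions normalize by the number of relevant items instead, so treat this as one convention rather than the definitive formula. Function names are illustrative:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@K: sum of Precision@i at each position i holding a relevant item,
    normalized by k (matching the worked example above)."""
    relevant = set(relevant)
    score, hits = 0.0, 0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # Precision@i at this relevant position
    return score / k

def mean_average_precision_at_k(all_recommended, all_relevant, k):
    """MAP@K: AP@K averaged over all users."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

recs = ["Top Gun", "Arrival", "Gladiator"]
truth = ["Arrival", "Gladiator"]
print(average_precision_at_k(recs, truth, k=3))  # (1/3) * (1/2 + 2/3) ≈ 0.39
```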

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank is based on the position of the first relevant item in a recommendation list. Reciprocal Rank (RR) is used when we only care about the position of the highest-ranked relevant result. Here, rank is the position of an item in the list of recommendations.

Taking the reciprocal ensures that items found further down the list (e.g., at rank 20) contribute a much smaller score, since the reciprocal of a large rank is a small value. The metric therefore rewards placing the most relevant items at the top of the list.

Reciprocal Rank only cares about the first relevant item. For example:

List of Recommendations : [“Top Gun”, “Arrival”, “Gladiator”]

Ground truth : “Arrival”

Then, Reciprocal Rank (RR) = (1/2) = 0.5

In the context of recommendation systems, we can also use MRR when a list contains multiple relevant items: we take the reciprocal rank of each relevant item and average them.

List of Recommendations : [“Top Gun”, “Arrival”, “Gladiator”]

Ground truth : [“Arrival”, “Gladiator”]

Then, Mean Reciprocal Rank (MRR) = (1/2) × ((1/2) + (1/3)) ≈ 0.42

Range : 0–1
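
A sketch of Reciprocal Rank and of the averaged variant used in the example above (classic MRR averages only the first relevant item's reciprocal rank across users or queries, so the second function is this article's multi-item variant):

```python
def reciprocal_rank(recommended, relevant_item):
    """1 / rank of the relevant item, or 0 if it is not in the list."""
    for rank, item in enumerate(recommended, start=1):
        if item == relevant_item:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(recommended, relevant_items):
    """Average of the reciprocal ranks of all relevant items in one list,
    as in the example above."""
    rrs = [reciprocal_rank(recommended, item) for item in relevant_items]
    return sum(rrs) / len(rrs)

recs = ["Top Gun", "Arrival", "Gladiator"]
print(reciprocal_rank(recs, "Arrival"))                      # 1/2 = 0.5
print(mean_reciprocal_rank(recs, ["Arrival", "Gladiator"]))  # (1/2 + 1/3) / 2 ≈ 0.42
```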

Normalized Discounted Cumulative Gain (NDCG)

Normalized Discounted Cumulative Gain (NDCG) measures how good a ranked list is. The idea is that the NDCG score is maximized when items are ordered from most relevant to least relevant, i.e., when the most relevant items are recommended at the top of the list.

Let’s break this down using an example :

Trying to stick to the previous example: if we identify a user as an action-movie watcher, let's assume the following relevance scores:

“Top Gun”, “Gladiator”: 2 (most relevant)

“Toy Story”: 1

“The Whale” : 0 (least relevant)

List of Recommendations :

[“Top Gun”, “Toy Story”, “The Whale”, “Gladiator”] ⇒ [2, 1, 0, 2]

Cumulative Gain (CG) at a position p is the sum of the relevance scores of all items up to that position. For the entire list it is: 2 + 1 + 0 + 2 = 5.

Cumulative gain does not take the position of items into account. So if the most relevant item is at the end of the list (like “Gladiator”), this is not reflected in the CG score.

To deal with that, we introduce Discounted Cumulative Gain (DCG), where each position is assigned a discount by which the relevance score at that position is penalized:

DCG = Σ rel(p) / log2(p + 1), summed over positions p = 1 … n, where n is the size of the list.

So if a relevant item like “Gladiator” is recommended at the end of the list, its relevance score is multiplied by 1/log2(p + 1), which for a position deep in the list is well below 1, so its contribution to the score is small. The first item, by contrast, is effectively not discounted, since its factor is 1/log2(2) = 1. DCG scores are highest when all the relevant items are at the top.

For Set A: [2, 1, 0, 2], DCG = 2/log2(2) + 1/log2(3) + 0/log2(4) + 2/log2(5) ≈ 3.49.

Let's compare this to Set B: [2, 2, 1, 0], where all the relevant items are at the top: DCG = 2/log2(2) + 2/log2(3) + 1/log2(4) + 0/log2(5) ≈ 3.76.

Clearly, the DCG of Set B is higher than the DCG of Set A. Set B is also what we call the Ideal Discounted Cumulative Gain (IDCG): it gives us the DCG of the ideal list, where items are perfectly sorted according to their relevance scores.

What if we need to compare the DCG scores of two lists of different sizes? That is where IDCG comes into the picture: we divide the DCG score by the IDCG score and get a value between 0 and 1. This score is called Normalized Discounted Cumulative Gain (nDCG). For Set A above, nDCG = 3.49 / 3.76 ≈ 0.93.

Now we can compare the nDCG scores of two lists of different sizes.

nDCG Range : 0–1
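
A sketch of DCG, IDCG, and nDCG using the log2(p + 1) discount described above, reproducing the Set A / Set B numbers (function names are illustrative):

```python
import math

def dcg(relevance_scores):
    """Discounted Cumulative Gain: sum of rel(p) / log2(p + 1) over positions p."""
    return sum(rel / math.log2(p + 1)
               for p, rel in enumerate(relevance_scores, start=1))

def ndcg(relevance_scores):
    """DCG normalized by the DCG of the ideally sorted list (IDCG)."""
    idcg = dcg(sorted(relevance_scores, reverse=True))
    return dcg(relevance_scores) / idcg if idcg > 0 else 0.0

set_a = [2, 1, 0, 2]  # "Top Gun", "Toy Story", "The Whale", "Gladiator"
set_b = [2, 2, 1, 0]  # relevant items at the top (the ideal ordering of Set A)
print(round(dcg(set_a), 2))   # 3.49
print(round(dcg(set_b), 2))   # 3.76 (this is also the IDCG for Set A)
print(round(ndcg(set_a), 2))  # 0.93
```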

This is an overview of some of the metrics that are widely used to evaluate recommender systems.

Things to consider while evaluating recommender systems:

I plan to put together articles on these topics soon and will link them here.

1. Popularity Bias

There can be a great amount of popularity bias in recommender systems, and it is difficult to detect and fix because popular items are genuinely relevant and score high on most of the standard metrics. There are various ways to measure and mitigate popularity bias, and I plan to put together an article that covers them.


2. Position Bias

Position bias occurs when items placed higher on a list are more likely to be viewed or bought, irrespective of their actual relevance. As a result, items with lower rankings receive less engagement. This might also affect other metrics, and there are methods to mitigate this.

3. Degenerate Feedback Loop

When users can only interact with suggested items, and the algorithm is trained on user feedback from those suggestions, a degenerate feedback loop can emerge. The loop reinforces showing items that have already been shown, potentially resulting in a worse user experience over time because a significant portion of relevant items is never discovered by the user.
