Various metrics are used to evaluate the effectiveness of a recommender. We will focus mostly on ranking-related metrics: HR (Hit Ratio), MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), and NDCG (Normalized Discounted Cumulative Gain).
Regardless of the modelling choices, a recommender system eventually outputs a ranked list of items. It is therefore important to evaluate ranking quality directly instead of relying on proxy metrics such as mean squared error.
HR (Hit Ratio)
In recommender settings, the hit ratio is simply the fraction of users for which the correct answer is included in the recommendation list of length L.
$$\mathrm{HR@L} = \frac{\#\{\text{users with the relevant item in their top-}L\text{ list}\}}{\#\{\text{users}\}}$$
As one can see, the larger L is, the higher the hit ratio becomes, because there is a higher chance that the correct answer is included in the recommendation list. Therefore, it is important to choose a reasonable value for L.
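To make this concrete, here is a minimal sketch in Python, assuming (hypothetically) that we have a dict of per-user ranked recommendations and a single held-out relevant item per user:

```python
def hit_ratio(recommendations, held_out, L):
    """Fraction of users whose held-out item appears in their top-L list.

    recommendations: dict mapping user -> ranked list of item ids
    held_out: dict mapping user -> the single held-out (relevant) item id
    """
    hits = sum(1 for u, items in recommendations.items() if held_out[u] in items[:L])
    return hits / len(recommendations)

# Toy example: 2 of the 3 users get a hit in their top-2 list.
recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"], "u3": ["g", "h", "i"]}
truth = {"u1": "b", "u2": "f", "u3": "h"}
print(hit_ratio(recs, truth, L=2))  # 2/3 ≈ 0.667
```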
MRR (Mean Reciprocal Rank)
MRR is short for mean reciprocal rank. It is also known as average reciprocal hit ratio (ARHR).
$$\mathrm{MRR} = \frac{1}{|U|}\sum_{u \in U}\mathrm{RR}(u), \qquad \mathrm{RR}(u) = \sum_{i=1}^{L}\frac{\mathrm{rel}_u(i)}{i}$$

where $\mathrm{rel}_u(i)$ is the relevance of the item at position $i$ in user $u$'s recommendation list.
Note that there are different variations or simplifications for calculating RR(u). For implicit feedback datasets, the relevance score is either 0 or 1, depending on whether an item was bought (or clicked, etc.) or not.
Another simplification is to only look at the top-ranked relevant item in the recommendation list instead of summing over all of them. For implicit datasets there is no ordering of relevance per se, so it is sufficient to consider only the position of the first relevant item in the list.
One could argue that the hit ratio is actually a special case of MRR where RR(u) is made binary: it becomes 1 if there is a relevant item in the list, and 0 otherwise.
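Here is a minimal sketch of the simplified variant (reciprocal rank of the first relevant item), again assuming hypothetical dicts of ranked recommendations and relevant item sets:

```python
def reciprocal_rank(ranked_items, relevant):
    """1 / rank of the first relevant item in the list, 0 if there is none."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(recommendations, relevant_items):
    """Average RR(u) over all users (implicit feedback: relevance is 0 or 1)."""
    return sum(reciprocal_rank(recommendations[u], relevant_items[u])
               for u in recommendations) / len(recommendations)

recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
truth = {"u1": {"b"}, "u2": {"d", "f"}}
print(mean_reciprocal_rank(recs, truth))  # (1/2 + 1/1) / 2 = 0.75
```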
MAP (Mean Average Precision)
Let’s first refresh our memory on precision and recall, especially in the Information Retrieval area.
What are precision and recall?
In short, precision is the fraction of retrieved items that are relevant. It answers the question: how many of the recommended items are correct?
Recall is the fraction of all relevant items that are retrieved. It answers the coverage question: of all the items considered relevant, how many are captured in the recommendations?
$$\mathrm{Precision} = \frac{|\{\text{relevant items}\}\cap\{\text{retrieved items}\}|}{|\{\text{retrieved items}\}|}$$
$$\mathrm{Recall} = \frac{|\{\text{relevant items}\}\cap\{\text{retrieved items}\}|}{|\{\text{relevant items}\}|}$$
What is precision@k?
Building upon this, we can also define precision@k and recall@k. Precision@k is the fraction of relevant items among the top k recommendations, and recall@k is the fraction of all relevant items captured in the top k.
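As a quick sketch (assuming a hypothetical ranked list and a set of relevant items):

```python
def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked_items[:k] if item in relevant) / k

def recall_at_k(ranked_items, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    return sum(1 for item in ranked_items[:k] if item in relevant) / len(relevant)

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "x"}
print(precision_at_k(ranked, relevant, k=2))  # 1/2 = 0.5
print(recall_at_k(ranked, relevant, k=2))     # 1/3 ≈ 0.333
```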
What is Mean Average Precision?
Now back to MAP.
MAP is the mean of Average Precision (AP). Once we have the AP for each user, it is trivial to average it over all users to calculate the MAP.
By computing precision and recall at every position in the ranked sequence of documents, one can draw a precision-recall curve, with precision p(r) plotted as a function of recall r. Average precision computes the average value of p(r) over the interval from 0 to 1.
This is essentially the area under the precision-recall curve. In the discrete case, it can be calculated as follows:
$$\mathrm{AP}(u) = \frac{1}{|\mathcal{R}_u|}\sum_{k=1}^{N}P(k)\,\mathrm{rel}(k)$$

where $\mathcal{R}_u$ is the set of relevant items for user $u$, $N$ is the length of the recommendation list, $P(k)$ is the precision at cutoff $k$, and $\mathrm{rel}(k)$ is 1 if the item at position $k$ is relevant and 0 otherwise.
We can finally calculate the MAP, which is simply the mean of AP over all users:
$$\mathrm{MAP} = \frac{1}{|U|}\sum_{u \in U}\mathrm{AP}(u)$$
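A minimal sketch of the discrete AP above and the resulting MAP, assuming hypothetical dicts of ranked recommendations and relevant item sets:

```python
def average_precision(ranked_items, relevant):
    """Discrete AP: average precision@i over the positions i holding a relevant item."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            total += hits / i  # precision at this cutoff
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(recommendations, relevant_items):
    """MAP: mean of AP over all users."""
    return sum(average_precision(recommendations[u], relevant_items[u])
               for u in recommendations) / len(recommendations)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2 ≈ 0.833
```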
NDCG (Normalized Discounted Cumulative Gain)
NDCG stands for normalized discounted cumulative gain. We will build up this concept backwards by answering the following questions:
- What is gain?
- What is cumulative gain?
- How to discount?
- How to normalize?
The gain of an item is essentially its relevance score. It can be a graded rating, like Google search results rated on a scale from 1 to 5, or binary in the case of implicit data, where we only know whether a user has consumed a certain item or not.
Naturally, Cumulative Gain (CG) is defined as the sum of gains up to position k in the recommendation list:
$$\mathrm{CG@k} = \sum_{i=1}^{k}\mathrm{rel}_i$$
One obvious drawback of CG is that it does not take ordering into account: swapping the relative order of any two items leaves the CG unchanged. This is problematic when ranking order is important. For example, in Google Search results, you would obviously not want the most relevant web page placed at the bottom.
To penalize highly relevant items being placed at the bottom, we introduce the DCG (Discounted Cumulative Gain):
$$\mathrm{DCG@k} = \sum_{i=1}^{k}\frac{\mathrm{rel}_i}{\log_2(i+1)}$$
By discounting each gain by the logarithm of its rank, we effectively push the algorithm to place highly relevant items at the top in order to achieve the best DCG score.
DCG still has a drawback: the score grows with the length of the recommendation list. Therefore, we cannot consistently compare the DCG of a system recommending the top 5 items with one recommending the top 10, because the latter scores higher purely because of its length, not its recommendation quality.
We tackle this issue by introducing the IDCG (ideal DCG). The IDCG is the DCG score of the ideal ranking, i.e. the items ordered top down according to their relevance, up to position k:
$$\mathrm{IDCG@k} = \sum_{i=1}^{|\mathrm{REL}_k|}\frac{\mathrm{rel}_i}{\log_2(i+1)}$$

where $\mathrm{REL}_k$ is the list of relevant items sorted by decreasing relevance, up to position $k$.
NDCG simply normalizes the DCG score by the IDCG, so that its value is always between 0 and 1 regardless of the list length:
$$\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}$$
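A minimal sketch, assuming we already have the graded relevance of each recommended item in ranked order, and computing the IDCG from those same relevances sorted in decreasing order (a common simplification):

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with a logarithmic discount on the rank."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG@k normalized by the ideal DCG@k (relevances sorted in decreasing order)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevances of the recommended items, in ranked order.
rels = [3, 2, 3, 0, 1]
print(ndcg_at_k(rels, k=5))  # ≈ 0.97, close to 1 since the ordering is nearly ideal
```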
Notes
- Wikipedia on NDCG is pretty good: https://en.wikipedia.org/wiki/Discounted_cumulative_gain
- Wikipedia has a very nice list of evaluation metrics used in IR: https://en.wikipedia.org/w/index.php?title=Information_retrieval&oldid=793358396#Average_precision
- This article explains MAP very well both in IR and object detection: https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52