Balancing Act: Addressing Popularity Bias in Recommendation Systems

Pratik Aher
Towards Data Science
7 min read · Aug 18, 2023


Photo by Melanie Pongratz on Unsplash

You woke up one morning and decided to treat yourself by buying a new pair of shoes. You went on your favorite sneaker website and browsed the recommendations given to you. One pair in particular caught your eye — you loved the style and design. You bought them without hesitation, excited to wear your new kicks.

When the shoes arrived, you couldn’t wait to show them off. You decided to break them in at an upcoming concert you were going to. However, when you got to the venue you noticed at least 10 other people wearing the exact same shoes! What were the odds?

Suddenly you felt disappointed. Even though you initially loved the shoes, seeing so many others with the same pair made you feel like your purchase wasn’t so special after all. The shoes you thought would make you stand out ended up making you blend in.

In that moment you vowed to never buy from that sneaker website again. Even though their recommendation algorithm suggested an item you liked, it ultimately didn’t bring you the satisfaction and uniqueness you desired. So while you initially appreciated the recommended item, the overall experience left you unhappy.

This highlights a limitation of recommendation systems: suggesting a “good” product doesn’t guarantee it will lead to a positive and fulfilling experience for the customer. So was it a good recommendation after all?

Why is it crucial to measure popularity bias in recommendation systems?

Popularity bias occurs when recommendation systems suggest many items that are globally popular rather than personalized picks. This happens because the algorithms are often trained to maximize engagement by recommending content that is liked by many users.

While popular items can still be relevant, relying too heavily on popularity leads to a lack of personalization. The recommendations become generic and fail to account for individual interests. Many recommendation algorithms are optimized using metrics that reward overall popularity. This systematic bias towards what is already well-liked can be problematic over time. It leads to excessive promotion of items that are trending or viral rather than unique suggestions. On the business side, popularity bias can also lead to a situation where a company has a huge inventory of niche, lesser-known items that go undiscovered by users, making them difficult to sell.

Personalized recommendations that take a specific user’s preferences into account can bring tremendous value, especially for niche interests that differ from the mainstream. They help users discover new and unexpected items tailored just for them.

Ideally, a balance should be struck between popularity and personalization in recommendation systems. The goal should be to surface hidden gems that resonate with each user while also sprinkling in universally appealing content now and then.

How to measure popularity bias?

Average Recommendation Popularity

Average Recommendation Popularity (ARP) is a metric used to evaluate the popularity of recommended items in a list. It calculates the average popularity of the items based on the number of ratings they have received in the training set. Mathematically, ARP is calculated as follows:

ARP = (1 / |U_t|) Σ_{u ∈ U_t} ( Σ_{i ∈ L_u} ϕ(i) ) / |L_u|

Where:

  • |U_t| is the number of users in the test set.
  • |L_u| is the number of items in the recommended list L_u for user u.
  • ϕ(i) is the number of times item i has been rated in the training set.

In simple terms, ARP measures the average popularity of items in the recommended lists: for each user, sum up the popularity (number of ratings) of the items in their list and divide by the list length, then average this value across all users in the test set.

Example: Let’s say we have a test set with 100 users (|U_t| = 100), and each user receives the same two-item recommendation list (|L_u| = 2) containing items A and B. If item A has been rated 500 times in the training set (ϕ(A) = 500), and item B has been rated 300 times (ϕ(B) = 300), the ARP for these recommendations can be calculated as:

ARP = (1/100) · 100 · (500 + 300) / 2 = 400

In this example, the ARP value is 400, indicating that, on average, each recommended item was rated 400 times in the training set. The higher this value is relative to typical item popularity in the catalog, the more the recommendations lean on popular items.
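
To make this concrete, here is a minimal sketch of ARP in Python. The data layout (`rec_lists` mapping users to recommended items, `rating_counts` holding ϕ(i)) is an illustrative assumption, not a standard API:

```python
# Minimal ARP sketch: average, over users, of the mean training-set
# rating count of the items in each user's recommended list.

def arp(rec_lists, rating_counts):
    per_user = [
        sum(rating_counts.get(i, 0) for i in items) / len(items)
        for items in rec_lists.values()
    ]
    return sum(per_user) / len(per_user)

# Reproduces the worked example: every user gets items A and B.
rec_lists = {f"user{n}": ["A", "B"] for n in range(100)}
rating_counts = {"A": 500, "B": 300}
print(arp(rec_lists, rating_counts))  # 400.0
```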

The Average Percentage of Long Tail Items (APLT)

The Average Percentage of Long Tail Items (APLT) metric calculates the average proportion of long tail items present in recommended lists. It’s expressed as:

APLT = (1 / |Ut|) Σ_{u ∈ Ut} |{i ∈ Lu : i ∈ Γ}| / |Lu|

Here:

  • |Ut| represents the total number of users.
  • u ∈ Ut signifies each user.
  • Lu represents the recommended list for user u.
  • Γ represents the set of long tail items.

In simpler terms, APLT quantifies the average percentage of less popular or niche items in the recommendations provided to users. A higher APLT indicates that recommendations contain a larger portion of such long tail items.

Example: Let’s say there are 100 users (|Ut| = 100). For each user’s recommendation list, on average, 20 out of 50 items (|Lu| = 50) belong to the long tail set (Γ). Using the formula, the APLT would be:

APLT = (1/100) · Σ (20 / 50) = (1/100) · 100 · 0.4 = 0.4

So, the APLT in this scenario is 0.4 or 40%, implying that, on average, 40% of items in the recommended lists are from the long tail set.
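
A minimal sketch of APLT under the same assumed data layout, with `long_tail` standing in for the set Γ:

```python
# Minimal APLT sketch: average, over users, of the fraction of each
# recommended list that falls in the long-tail set.

def aplt(rec_lists, long_tail):
    fractions = [
        sum(item in long_tail for item in items) / len(items)
        for items in rec_lists.values()
    ]
    return sum(fractions) / len(fractions)

# 20 of the 50 items in every list are long-tail -> APLT = 0.4
rec_lists = {
    f"user{n}": [f"t{i}" for i in range(20)] + [f"h{i}" for i in range(30)]
    for n in range(100)
}
long_tail = {f"t{i}" for i in range(20)}
print(aplt(rec_lists, long_tail))  # 0.4
```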

The Average Coverage of Long Tail items (ACLT)

The Average Coverage of Long Tail items (ACLT) metric evaluates how much exposure long-tail items get in the overall recommendations. Unlike APLT, which looks at the proportion of long-tail items within each user’s list, ACLT counts how many long-tail items actually appear, assessing whether these items are effectively represented in the recommendations. It’s defined as:

ACLT = (1 / |Ut|) Σ_{u ∈ Ut} Σ_{i ∈ Lu} 1(i ∈ Γ)

Here:

  • |Ut| represents the total number of users.
  • u ∈ Ut signifies each user.
  • Lu represents the recommended list for user u.
  • Γ represents the set of long-tail items.
  • 1(i ∈ Γ) is an indicator function equal to 1 if item i is in the long tail set Γ, and 0 otherwise.

In simpler terms, ACLT is the average number of long-tail items appearing in each user’s recommendations, which captures how much total exposure the long tail receives.

Example: Let’s say there are 100 users (|Ut| = 100). Across all users’ recommendation lists, long-tail items are recommended 150 times in total (Σ Σ 1(i ∈ Γ) = 150). Using the formula, the ACLT would be:

ACLT = 150 / 100 = 1.5

So, the ACLT in this scenario is 1.5, indicating that, on average, each user’s recommendation list contains 1.5 long-tail items. If each list holds 30 items (3,000 recommended items in total), that is only 5% of recommendation slots going to the long tail. This metric helps assess how much exposure niche items get in the recommender system.
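
And a matching sketch of ACLT, which counts long-tail items per list instead of taking a within-list fraction:

```python
# Minimal ACLT sketch: average number of long-tail items appearing
# in each user's recommended list.

def aclt(rec_lists, long_tail):
    counts = [sum(item in long_tail for item in items)
              for items in rec_lists.values()]
    return sum(counts) / len(counts)

# 150 long-tail appearances across 100 users -> ACLT = 1.5
rec_lists = {
    f"user{n}": (["t1", "t2", "t3"] if n < 50 else []) + ["h1", "h2"]
    for n in range(100)
}
long_tail = {"t1", "t2", "t3"}
print(aclt(rec_lists, long_tail))  # 1.5
```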

How to reduce popularity bias in a recommendation system

Popularity Aware Learning

This idea takes inspiration from Position Aware Learning (PAL), where the approach is to ask your ML model to optimize for both ranking relevance and position impact at the same time. We can use the same approach with a popularity score, which can be any of the scores mentioned above, such as Average Recommendation Popularity.

  • At training time, use item popularity as one of the input features.
  • At prediction time, replace it with a constant value.
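
Here is a minimal sketch of this idea using scikit-learn. The features, labels, and the choice of the training-set mean as the prediction-time constant are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: user/item features, plus the item's
# popularity (e.g. its training-set rating count) as the last column.
features = rng.normal(size=(1000, 8))
popularity = rng.poisson(lam=50, size=(1000, 1)).astype(float)
X_train = np.hstack([features, popularity])
y_train = rng.binomial(1, 0.3, size=1000)  # click / relevance labels

model = GradientBoostingRegressor().fit(X_train, y_train)

# At prediction time, neutralize the popularity signal by replacing
# the feature with a constant (here, the training-set mean).
X_new = rng.normal(size=(5, 8))
const_pop = np.full((5, 1), popularity.mean())
print(model.predict(np.hstack([X_new, const_pop])))
```

This way the model can learn how much of the training signal was driven by popularity, while scoring at serving time is not swayed by it.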

xQUAD Framework

One interesting method to fix popularity bias is to use the xQuAD framework. It takes a long list of recommendations (R), along with probability/likelihood scores from your current model, and builds a new list (S) that is much more diverse, where |S| < |R|. The diversity of this new list is controlled by a hyper-parameter λ.

I have tried to capture the logic of the framework here. At each step, xQuAD scores every remaining candidate v by mixing its relevance with a diversity term; in the popularity-bias formulation, the score takes roughly the form

score(v) = (1 − λ) · P(v|u) + λ · Σ_c P(c|u) · P(v|c) · Π_{i ∈ S} (1 − P(i|c, S))

where the categories c are the popular head and the long tail, P(v|u) is the relevance of item v to user u, and the product term shrinks as category c becomes covered by items already in S.

We calculate this score for all documents in set R, take the document with the maximum score, add it to set S, and at the same time remove it from set R.


To select the next item to add to S, we compute the score for each item in R\S (R excluding S). Every time a popular item is added to S, the diversity gain for the popular category shrinks, so the chance of a long-tail item being picked next goes up, as the sketch below shows.
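
Below is a minimal sketch of this greedy loop with a binary head/long-tail split. The function name, the on/off "covered" simplification of xQuAD's product term, and the toy scores are illustrative assumptions rather than the exact xQuAD computation:

```python
# Greedy xQuAD-style re-ranking: trade off relevance P(v|u) against
# a diversity gain for categories (head vs. long tail) not yet in S.

def xquad_rerank(candidates, relevance, long_tail, p_tail=0.5, lam=0.3, k=10):
    R = list(candidates)
    S = []
    covered = {"tail": False, "head": False}
    while len(S) < k and R:
        def score(v):
            cat = "tail" if v in long_tail else "head"
            p_cat = p_tail if cat == "tail" else 1.0 - p_tail
            # Diversity gain vanishes once the category is covered:
            # a binary simplification of xQuAD's product term.
            gain = 0.0 if covered[cat] else p_cat
            return (1 - lam) * relevance[v] + lam * gain
        best = max(R, key=score)  # highest combined score in R \ S
        S.append(best)
        R.remove(best)
        covered["tail" if best in long_tail else "head"] = True
    return S

# Toy usage: by pure relevance the top 3 would be p1, p2, p3, but the
# re-ranker promotes the long-tail item t1 into the final slot.
rel = {"p1": 0.9, "p2": 0.8, "p3": 0.6, "t1": 0.5, "t2": 0.4}
print(xquad_rerank(rel, rel, long_tail={"t1", "t2"}, k=3))
# ['p1', 'p2', 't1']
```

Raising λ pushes the list further toward the long tail, while λ = 0 reproduces the base model’s ranking.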

If you liked this content, find me on LinkedIn :).
