
A Real World, Novel Approach to Enhance Diversity in Recommender Systems

Addressing the long tail problem and enhancing the recommendation experience for users on the Headspace App

Recommender Systems

DALLE Generated Image by Author

The Choice Between Two Recommendation Systems

Following weeks of diligent work, you finally have a well-deserved evening free to indulge in a nice restaurant experience. Two friends have offered restaurant recommendations. One friend tends to stick to the tried and true, rarely venturing into uncharted territory. Once a joke lands, they will repeat it incessantly with slight variations. While their taste is dependable, their recommendations have never left you thoroughly impressed. This first friend mirrors a recommender system afflicted by the long tail problem; they recommend popular restaurants that are dependable and palatable to the majority but not necessarily tailored.

Conversely, there’s the friend who constantly embraces risk, blurting out ideas in a haphazard manner. You’re always left wondering about their taste preferences. While they’ve suggested some remarkable dining spots, they’ve also thrown in a fair share of dreadful ones, all delivered with the same level of enthusiasm. The second friend is like a recommender system that throws out random suggestions, occasionally leading to pleasant surprises but also disappointments. Whose advice would you follow for the evening?

In my article, "Beyond Accuracy: Embracing Serendipity and Novelty in Recommendations for Long Term User Retention," I discussed the importance of going beyond mere accuracy metrics to address the long tail problem and enhance the recommendation experience for users. In this article, I will discuss a real-world implementation when I worked as a machine learning engineer at Headspace Health. While there is a paper published under the Creative Commons Attribution 4.0 International License that goes in-depth with the implementation, I’ll explain the rationale behind each decision made and share the unexpected findings we encountered during the process.

Photo by Daniel Korpai on Unsplash

The Problem

Headspace Health has an awesome content creation team that has produced a large repertoire of diverse content. However, our recommender system suffered heavily from the long-tail problem, meaning a substantial portion of high-quality content remained obscured or undiscovered. This posed a barrier to user exploration and engagement, particularly in areas where discovering new content is pivotal for sustained interest and growth. It became evident that addressing this challenge was essential for enhancing the user experience and ensuring the effectiveness of our platform in supporting individual wellness goals.

The Initial Model – Why TiSASRec?

Our recommender system employed the TiSASRec model, which stands for Time Interval Aware Self-Attention for Sequential Recommendation.

Sequential recommendation models are a type of recommendation system that takes the order of user interactions into account when suggesting items. This is in contrast to traditional models that might only consider the items themselves or the user’s overall preferences, without factoring in the sequence of their actions. However, sequential models often don’t explicitly consider the time intervals between those interactions.

TiSASRec was designed to address this gap: it explicitly models the time intervals between interactions and modifies the self-attention mechanism to prioritize information relevant to the current time. While TiSASRec initially outperformed previous models, it gradually began to exhibit a bias towards shorter, more popular content over longer, niche content, and its recommendations became repetitive and predictable over time.
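To make the time-interval idea concrete, here is a minimal sketch of how TiSASRec-style models derive personalized relative intervals from a user's interaction timestamps: pairwise gaps are scaled by the user's smallest gap and clipped to a maximum, so each relative interval can be embedded and fed to the attention mechanism. The function name and default clip value are illustrative, and the full model also learns separate key/value interval embeddings.

```python
import numpy as np

def clipped_time_intervals(timestamps, max_interval=256):
    """Pairwise time gaps between a user's interactions, scaled and
    clipped in the spirit of TiSASRec's personalized intervals.

    Simplified sketch: gaps |t_i - t_j| are divided by the user's
    minimum nonzero gap (a per-user timescale), then clipped so the
    model only needs a bounded set of interval embeddings.
    """
    t = np.asarray(timestamps, dtype=np.int64)
    gaps = np.abs(t[:, None] - t[None, :])        # pairwise |t_i - t_j|
    nonzero = gaps[gaps > 0]
    min_gap = nonzero.min() if nonzero.size else 1  # user-specific scale
    scaled = gaps // min_gap                        # personalized intervals
    return np.clip(scaled, 0, max_interval)

# e.g. timestamps in seconds: 0s, 1min, 3min, 30min
intervals = clipped_time_intervals([0, 60, 180, 1800])
```

The resulting symmetric matrix of relative intervals is what the modified attention layer consumes alongside the usual positional information.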

The Solution

We aimed to enhance the diversity of our recommendations without introducing excessive randomness or inconsistency. Given our confidence in TiSASRec, we pursued an approach that leveraged the contextual insights it captured while enhancing diversity. Our solution involved implementing a wrapper around the model outputs, which we termed "semantic sampling."

Photo by Dawit on Unsplash

Semantic sampling involved extracting language embeddings from the titles and teasers of the model output recommendations. Subsequently, we computed cosine similarities between the output and all other content pieces in the library. Content selection was based on a combination of factors including semantic similarity to the original recommendation, popularity, and various diversity-based metrics.

This methodology allowed us to broaden the coverage of our content while ensuring relevance to the user.
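A minimal sketch of the semantic-sampling idea follows. The function name, the particular score blend, and the weights are hypothetical; the production system combined semantic similarity with popularity and several diversity-based metrics, as detailed in the paper.

```python
import numpy as np

def semantic_sample(rec_embedding, library_embeddings, popularity,
                    already_recommended, k=5, sim_weight=0.7, pop_weight=0.3):
    """Swap a model recommendation for semantically similar but
    less-exposed library content.

    Hypothetical scoring: a weighted blend of cosine similarity to the
    original recommendation and inverse popularity, so niche but
    thematically related items rise to the top.
    """
    # cosine similarity between the recommended item and every library item
    a = rec_embedding / np.linalg.norm(rec_embedding)
    b = library_embeddings / np.linalg.norm(library_embeddings,
                                            axis=1, keepdims=True)
    sims = b @ a

    # favor similar but under-exposed content
    rarity = 1.0 - popularity / popularity.max()
    scores = sim_weight * sims + pop_weight * rarity
    scores[list(already_recommended)] = -np.inf  # don't re-surface items
    return np.argsort(scores)[::-1][:k]          # top-k indices

# toy usage: 10 library items with 8-dim embeddings
rng = np.random.default_rng(0)
library = rng.normal(size=(10, 8))
popularity = np.arange(1.0, 11.0)
picks = semantic_sample(library[0], library, popularity,
                        already_recommended={0}, k=3)
```

In production, these candidate indices would feed a final ranking step rather than being served directly.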

Figure 2. Creative Commons Attribution 4.0 International License

Choosing the Right Embedding Model

We conducted tests on different embedding models by examining the distribution of cosine similarity scores. Given the wellness-focused nature of the content, we anticipated a narrow range of similarity scores. However, we aimed for an embedding model that would yield a broader range of scores while still exhibiting a left-skewed distribution, aligning with our expectations.

Ultimately, we opted for LaBSE due to its distribution meeting our criteria. This decision was further supported by LaBSE’s strong performance in paraphrasing similarity tasks, making it an appropriate choice for our recommendation system.
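The distribution check we ran on candidate embedding models can be sketched as follows: embed the library, compute all pairwise cosine similarities, and summarize the spread and skewness of the scores. The summary statistics and the toy random embeddings here are illustrative; in practice the embeddings would come from each candidate model (e.g. LaBSE) applied to titles and teasers.

```python
import numpy as np

def similarity_profile(embeddings):
    """Summarize the pairwise cosine-similarity distribution of a
    content library under a candidate embedding model.

    We looked for a broad score range combined with a left-skewed
    (negatively skewed) distribution: most pairs moderately similar,
    with a tail of clearly dissimilar pairs.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    iu = np.triu_indices_from(sims, k=1)   # unique pairs only
    s = sims[iu]
    mean, std = s.mean(), s.std()
    skew = ((s - mean) ** 3).mean() / std ** 3  # Fisher skewness
    return {"min": float(s.min()), "max": float(s.max()),
            "range": float(s.max() - s.min()), "skew": float(skew)}

# toy usage with random vectors standing in for real title/teaser embeddings
rng = np.random.default_rng(1)
profile = similarity_profile(rng.normal(size=(20, 16)))
```

Comparing these profiles across models is a cheap way to reject embeddings that collapse a wellness-focused library into a near-constant similarity band.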

The Evaluation

The consensus was that our ultimate goal remained centered on engagement. We anticipated a potential initial dip in engagement but hypothesized that engagement would improve long term as niche content became more visible to relevant groups. However, maintaining engagement while enhancing content visibility and diversity metrics would still be a big win. In the best-case scenario, users would encounter new content and experience pleasant surprises, leading to a sense of "serendipity." Conversely, the worst-case scenario involved users noticing a decline in the personalized content they relied on, leading to a loss of interest and trust. Hence, alongside engagement metrics, we tracked "beyond accuracy" measures, including coverage, entropy, rarity, and intra-list diversity (ILD), as detailed in the previous article.
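For reference, common formulations of three of these "beyond accuracy" measures can be sketched as below. These are standard definitions from the literature rather than necessarily the exact variants we computed; rarity is omitted since its precise definition is given in the paper.

```python
import numpy as np

def coverage(recommended_items, catalog_size):
    """Fraction of the catalog surfaced at least once."""
    return len(set(recommended_items)) / catalog_size

def entropy(recommended_items):
    """Shannon entropy (bits) of the recommendation distribution;
    higher means exposure is spread more evenly across items."""
    _, counts = np.unique(recommended_items, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def intra_list_diversity(item_embeddings):
    """Mean pairwise cosine distance within one recommendation list."""
    e = item_embeddings / np.linalg.norm(item_embeddings,
                                         axis=1, keepdims=True)
    sims = e @ e.T
    iu = np.triu_indices_from(sims, k=1)   # unique pairs only
    return float((1.0 - sims[iu]).mean())
```

Tracking these alongside engagement makes the trade-off explicit: a diversity wrapper should move coverage, entropy, and ILD up without dragging the engagement metric down.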

To evaluate performance, we conducted an A/B test involving approximately 140,000 control users and 140,000 treatment users over a period of 42 days. The primary metric utilized was the content start rate, defined as the number of times users initiated interaction with the content divided by the number of times the content was displayed to the users.
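A standard way to test whether a difference in content start rate between the two arms is statistically significant is a two-proportion z-test, sketched below with illustrative counts. This is a textbook check, not necessarily the exact test used in the experiment.

```python
import math

def two_proportion_z(starts_a, impressions_a, starts_b, impressions_b):
    """Two-proportion z-test on content start rate between a control
    arm (a) and a treatment arm (b).

    Returns the z statistic under the pooled-proportion null;
    |z| > 1.96 corresponds to significance at the 5% level.
    """
    p_a = starts_a / impressions_a
    p_b = starts_b / impressions_b
    p = (starts_a + starts_b) / (impressions_a + impressions_b)  # pooled
    se = math.sqrt(p * (1 - p) * (1 / impressions_a + 1 / impressions_b))
    return (p_b - p_a) / se
```

With sample sizes in the hundreds of thousands, even a few-percent relative lift in start rate clears the significance threshold comfortably.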

The Results

The implementation of an approach aimed at enhancing the diversity of recommendations yielded remarkably positive results in the real world.

Figure 3. Creative Commons Attribution 4.0 International License
Table I. Creative Commons Attribution 4.0 International License

High Level Performance Summary

  • Coverage improved by 13.5–26.7%
  • Entropy improved by 19.2–22.1%
  • Rarity improved by 562.8–765.9%
  • ILD improved by 2.8–8.8%
  • Content starts improved by 2.26%
Table II. Creative Commons Attribution 4.0 International License

Across all metrics, the performance enhancements were evident with the introduction of semantic sampling. Notably, approximately 20% of our previously overlooked content library was now being surfaced to users. Additionally, the statistically significant increase in content starts, despite our initial expectation of a potential decrease, was a pleasant surprise. This positive outcome emphasized the effectiveness of the implemented approach, demonstrating its capacity to not only improve recommendation diversity but also drive user engagement over the duration of the experiment.

Limitations

However, despite its effectiveness, one notable constraint is the potential oversight of highly niche content or themes not represented in the current recommendations. Content pieces with low semantic similarity to the rest of the items in the library may still remain undiscovered. There’s an opportunity here to explore alternative avenues for addressing this gap and further enhance the inclusivity and comprehensiveness of our recommendations.

DALLE Generated Image by Author

Conclusion

This work highlights the importance of understanding user behavior and the dynamic interplay between personalized and serendipitous experiences. The semantic similarity-based approach we employed offered a more nuanced solution compared to relying solely on diversity-based metrics for resampling. By ensuring that the sampled content shared thematic similarities with the user-preferred items, we maintained relevance and consistency in the recommendations. This safeguarded against the introduction of content that might seem incongruous or irrelevant to users' interests. The balance between personalization and serendipity is a fascinating area for researchers and practitioners in the field of recommender systems to explore. It is my hope that this work will inspire further innovation in refining recommendation algorithms to better cater to individual preferences while still introducing novel and unexpected content.


Related Articles