How Spotify Implemented Personalized Audiobook Recommendations

Introduction

Spotify is the most popular music-streaming app in the world. In addition to songs and albums, Spotify has a large collection of podcasts and talk shows, and it has recently introduced audiobooks in its app. As with its other offerings, Spotify wanted its audiobook recommendations to cater to users’ preferences. To that end, it developed a Graph Neural Network-based recommendation algorithm to personalize audiobook recommendations.

This article discusses the challenges Spotify faced in delivering personalized audiobook recommendations and the exploratory data analyses conducted to address them. It explores Spotify’s innovative solution: a two-tower graph neural network model designed to enhance audiobook personalization.

Challenges

Because audiobooks were a recent addition to Spotify’s content library, the team faced two challenges:

There was a data scarcity issue as the content type was newly introduced. There were fewer user interactions for audiobooks compared to other content types. Many users were unaware of audiobooks on Spotify.
Audiobooks are currently available to Premium users but were initially launched under a direct-sales model, meaning users had to pay explicitly for each audiobook they wanted to listen to. As a result, the explicit signals that Spotify could use to build a recommendation system were even scarcer.

This article will explore the exploratory data analyses they performed, the model architecture, the model deployment, and the model evaluation.

Exploratory Data Analyses

Spotify analyzed users’ historical preferences for music and podcasts, along with content similarities between podcasts and audiobooks. The initial analysis revealed a strong correlation between audiobook and podcast consumption, which suggests that podcast interactions can help in understanding audiobook preferences. For instance, a biography audiobook about an entrepreneur resembles a podcast episode featuring an entrepreneur as a guest. Over 70% of audiobook users had previously interacted with podcasts. However, audiobook streaming was highly concentrated: 25% of the users contributed 75% of streaming hours, and 20% of the audiobooks contributed 80% of streaming hours, so interaction data for most users and titles was sparse.

Spotify analyzed more than 800M streams on its platform over a 90-day period. The data for this analysis was limited to podcast and audiobook streams. They studied co-listening patterns among the users and performed embedding analysis. They used cosine similarity as a distance metric and plotted the cosine similarity distribution.

Exploratory Data Analyses. Pic Source ([1])

Observation 1 – Similarity in Audiobook and Podcast preference

Spotify sampled 10000 user pairs who had streamed at least one common audiobook (in other words, co-listened) and sampled 10000 user pairs randomly. They fetched the user embeddings from their production podcast recommendation model to study similarities between podcasts and audiobooks.

Users who co-listened to at least one audiobook tend to have higher podcast embedding similarity scores than users chosen at random (see Figure 2B). This implies that users with similar audiobook tastes are more similar in their podcast preferences than users chosen at random.
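The comparison in Observation 1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the embeddings below are synthetic stand-ins (not Spotify data), co-listening pairs are simulated as a shared taste vector plus noise, and the function name cosine_similarity is hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)
    return np.sum(a * b, axis=1) / (a_norm * b_norm)

rng = np.random.default_rng(0)
n_pairs, dim = 10_000, 32  # 10000 pairs as in the article; dim is arbitrary

# Synthetic "co-listening" pairs: both users share a taste vector plus noise.
shared = rng.normal(size=(n_pairs, dim))
co_left = shared + 0.5 * rng.normal(size=(n_pairs, dim))
co_right = shared + 0.5 * rng.normal(size=(n_pairs, dim))

# Synthetic random pairs: two independent embeddings.
rand_left = rng.normal(size=(n_pairs, dim))
rand_right = rng.normal(size=(n_pairs, dim))

co_sims = cosine_similarity(co_left, co_right)
rand_sims = cosine_similarity(rand_left, rand_right)

print(f"co-listening pairs: mean cosine = {co_sims.mean():.3f}")
print(f"random pairs:       mean cosine = {rand_sims.mean():.3f}")
```

With real data, the two arrays of similarities would be plotted as distributions, which is how the comparison is presented in the figure.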

Observation 2 – Audiobook content is important

Spotify used Sentence-BERT to generate content embeddings for all audiobooks and podcasts from content metadata such as title and description. Spotify then sampled 10000 audiobook pairs co-listened by at least one user and 10000 audiobook pairs chosen at random.

The co-listened audiobook pairs have higher cosine similarities between their content embeddings than random audiobook pairs (see Figure 2C).

Observation 3 – Podcast interactions will help Audiobook preference understanding

Spotify constructed a podcast-audiobook interaction graph. Podcasts and audiobooks represent the nodes. These nodes are connected if at least one user co-listened to them. They sampled 10000 audiobook pairs connected by at least one podcast and randomly sampled 10000 audiobook pairs. The cosine similarity of the Sentence-BERT content embeddings was used for this analysis.

Two audiobooks co-listened with the same podcast had higher cosine similarities than two audiobooks chosen randomly.

Model Architecture

Spotify introduced a 2T-HGNN model, consisting of a heterogeneous graph neural network (HGNN) and a two-tower (2T) model. This model was scalable (for real-time serving) and modular, meaning HGNN and 2T could be used independently and for various other business use cases.

Heterogeneous Graph Neural Network Model

Spotify constructed a co-listening heterogeneous graph consisting of two types of nodes: podcasts and audiobooks. The edges between the nodes are connected if at least one user has listened to both. Thus, this graph has information about audiobook-audiobook, audiobook-podcast, and podcast-podcast relationships. These nodes are represented by Sentence-BERT content embeddings, generated from content metadata such as title and description.
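As a rough sketch of how such a co-listening graph could be assembled from raw stream logs, consider the snippet below. The function names, the (user_id, item_id) input format, and the ID convention ("A..." for audiobooks, "P..." for podcasts) are all assumptions made for illustration; the source does not describe Spotify's actual pipeline.

```python
from collections import defaultdict
from itertools import combinations

def node_type(item_id):
    # Hypothetical ID convention: "A..." is an audiobook, anything else a podcast.
    return "audiobook" if item_id.startswith("A") else "podcast"

def build_colistening_graph(streams):
    """Build undirected co-listening edges from (user_id, item_id) pairs.

    Returns a dict mapping each edge (item_a, item_b) to a relation label.
    An edge exists if at least one user streamed both items.
    """
    items_by_user = defaultdict(set)
    for user_id, item_id in streams:
        items_by_user[user_id].add(item_id)

    edges = {}
    for items in items_by_user.values():
        # Sorting gives each undirected edge one canonical orientation.
        for a, b in combinations(sorted(items), 2):
            edges[(a, b)] = f"{node_type(a)}-{node_type(b)}"
    return edges

# Toy stream log (made up for illustration).
streams = [
    ("u1", "A1"), ("u1", "P1"),
    ("u2", "P1"), ("u2", "A4"),
    ("u3", "P1"), ("u3", "P2"),
]
edges = build_colistening_graph(streams)
for edge, relation in sorted(edges.items()):
    print(edge, relation)
```

In this toy log no user streamed both A1 and A4, so they share no direct edge; they are linked only through podcast P1.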

The HGNN model is trained on all three relationships, as it contains more information about content and user preference than only audiobook interactions. This solves the data scarcity issue.
It is a GraphSAGE model with 2-hop message passing. For instance, as shown in the figure above, if Audiobook A1 is connected to Podcast P1 (1-hop relation), and Podcast P1 is connected to Audiobook A4, then the implication is that Audiobook A1 and Audiobook A4 are somehow related (2-hop relation).
GraphSAGE updates the node embeddings by sampling and aggregating embeddings from each node’s local neighborhood. For every node, it samples a fixed number of neighbors, aggregates their embeddings across 2 hops, and combines these with the node’s embedding. This allows GraphSAGE to generalize embeddings to new nodes, thus solving the cold-start problem.
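The sample-and-aggregate idea can be illustrated with a small NumPy sketch. It assumes a mean aggregator, a ReLU non-linearity, and random (untrained) weights; the function name sage_layer and the toy graph are hypothetical, and the aggregator Spotify actually uses is not stated in the article. Stacking two layers gives each node information from its 2-hop neighborhood.

```python
import numpy as np

def sage_layer(features, neighbors, weight, num_samples, rng):
    """One GraphSAGE-style layer with a mean aggregator.

    features:  (n, d_in) array of node embeddings
    neighbors: list of neighbor-index lists, one per node
    weight:    (2 * d_in, d_out) projection applied to [self || neighbor mean]
    """
    out = np.zeros((features.shape[0], weight.shape[1]))
    for node, nbrs in enumerate(neighbors):
        if nbrs:
            # Sample a fixed number of neighbors (or all, if fewer exist).
            k = min(num_samples, len(nbrs))
            sampled = rng.choice(nbrs, size=k, replace=False)
            nbr_mean = features[sampled].mean(axis=0)
        else:
            nbr_mean = np.zeros(features.shape[1])
        combined = np.concatenate([features[node], nbr_mean])
        out[node] = np.maximum(combined @ weight, 0.0)  # ReLU
    # L2-normalize each row, guarding against all-zero rows.
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
# Toy graph: 0 = A1, 1 = P1, 2 = A4, 3 = P2; edges A1-P1, P1-A4, P1-P2.
neighbors = [[1], [0, 2, 3], [1], [1]]
features = rng.normal(size=(4, 8))  # stand-in for Sentence-BERT embeddings

w1 = rng.normal(size=(16, 8))       # untrained weights, for illustration only
w2 = rng.normal(size=(16, 4))
h1 = sage_layer(features, neighbors, w1, num_samples=2, rng=rng)  # 1-hop
h2 = sage_layer(h1, neighbors, w2, num_samples=2, rng=rng)        # 2-hop
print(h2.shape)
```

Because the layer only needs a node's content features and its sampled neighbors, the same weights can produce an embedding for a node that was not present during training.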

The HGNN model is optimized by a contrastive loss function. The loss function aims to increase the cosine similarity between connected nodes in the graph (positive pair samples) and decrease the cosine similarity between disconnected nodes (negative pair samples). All the edges of the graph are traversed to train the model. They kept one positive pair and randomly sampled negative pairs for each step of gradient descent optimization.
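The article does not give the exact form of the loss, so the following is one plausible sketch rather than Spotify's formulation: a softmax-based (InfoNCE-style) contrastive loss over cosine similarities, with one positive pair and a handful of sampled negatives per step. The function name and the temperature parameter are assumptions.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and several negatives.

    The loss is low when the anchor is more similar (by cosine) to the
    positive than to any negative, and high otherwise.
    """
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    pos_sim = cos(anchor, positive)
    neg_sims = np.array([cos(anchor, n) for n in negatives])
    logits = np.concatenate([[pos_sim], neg_sims]) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

anchor = np.array([1.0, 0.0])

# Connected node is close to the anchor, disconnected nodes are far: low loss.
good = contrastive_loss(
    anchor,
    positive=np.array([0.9, 0.1]),
    negatives=[np.array([-1.0, 0.0]), np.array([0.0, 1.0])],
)

# Connected node points the opposite way: high loss.
bad = contrastive_loss(
    anchor,
    positive=np.array([-1.0, 0.0]),
    negatives=[np.array([0.9, 0.1]), np.array([0.0, 1.0])],
)
print(f"good pair loss = {good:.4f}, bad pair loss = {bad:.4f}")
```

Minimizing this quantity over all edges pushes connected nodes together and sampled disconnected nodes apart, which matches the objective described above.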

The co-listening graph is imbalanced. There were fewer audiobook-audiobook interactions than podcast-podcast interactions. Due to the scarcity of audiobook-audiobook interactions, they undersampled the podcast-podcast interactions to mitigate imbalance, prioritize the main objective (learn audiobook preference), and better train the models.
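A minimal sketch of such undersampling is shown below. The edge format, the function name undersample_edges, and the keep_fraction value are assumptions chosen for illustration; the article does not state the sampling ratio Spotify used.

```python
import random

def undersample_edges(edges, relation, keep_fraction, seed=0):
    """Randomly drop edges of one relation type, keeping all other edges.

    edges: list of (src, dst, relation_label) tuples
    """
    rng = random.Random(seed)
    kept = []
    for edge in edges:
        # Edges of other relation types are always kept.
        if edge[2] != relation or rng.random() < keep_fraction:
            kept.append(edge)
    return kept

# Toy imbalanced edge list (made up): many podcast-podcast, few audiobook-audiobook.
all_edges = [(f"P{i}", f"P{i + 1}", "podcast-podcast") for i in range(1000)]
all_edges += [(f"A{i}", f"A{i + 1}", "audiobook-audiobook") for i in range(10)]

balanced = undersample_edges(all_edges, "podcast-podcast", keep_fraction=0.1)
n_pp = sum(1 for e in balanced if e[2] == "podcast-podcast")
n_aa = sum(1 for e in balanced if e[2] == "audiobook-audiobook")
print(f"podcast-podcast kept: {n_pp}, audiobook-audiobook kept: {n_aa}")
```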

Two-Tower Model

The two-tower (2T) architecture is widely used in the recommendation-systems community. The HGNN component of 2T-HGNN learns audiobook and podcast embeddings from user interactions, and the 2T component introduces user personalization. The 2T model consists of two deep neural networks, called towers: one for user representation and the other for enhanced audiobook representation.

The user tower is fed inputs such as user demographic information, the user’s music preference embedding, and the user’s audiobook and podcast preference embedding. The music embedding is obtained from one of Spotify’s in-house music recommendation algorithms. The audiobook and podcast preference embedding is obtained by taking the mean of the HGNN embeddings of the audiobooks and podcasts the user interacted with in the last 90 days.
The audiobook tower is fed inputs such as audiobook metadata (genre, language), the Sentence-BERT content embedding of its title and description, and the HGNN embedding.
The 2T model produces two output embeddings (user embedding and audiobook embedding) from each tower.
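The data flow through the two towers can be sketched as follows. Everything here is an assumption for illustration: the feature dimensions, the single hidden layer, the random (untrained) weights, and the names mean_pool and tower. The point is only to show how the inputs listed above are concatenated and mapped into a shared embedding space.

```python
import numpy as np

def mean_pool(embeddings):
    """Average a (n_items, d) array of HGNN embeddings into one (d,) vector."""
    return np.mean(embeddings, axis=0)

def tower(x, w1, w2):
    """A tiny feed-forward tower: linear, ReLU, linear, then L2-normalize."""
    hidden = np.maximum(x @ w1, 0.0)
    out = hidden @ w2
    return out / max(np.linalg.norm(out), 1e-12)

rng = np.random.default_rng(0)
hgnn_dim, music_dim, demo_dim, sbert_dim, meta_dim, out_dim = 8, 8, 4, 8, 3, 16

# User tower input: demographics + music embedding + mean of recent HGNN embeddings.
history = rng.normal(size=(5, hgnn_dim))  # items from the last 90 days (synthetic)
user_input = np.concatenate([
    rng.normal(size=demo_dim),
    rng.normal(size=music_dim),
    mean_pool(history),
])

# Audiobook tower input: metadata + Sentence-BERT embedding + HGNN embedding.
book_input = np.concatenate([
    rng.normal(size=meta_dim),
    rng.normal(size=sbert_dim),
    rng.normal(size=hgnn_dim),
])

user_vec = tower(user_input,
                 rng.normal(size=(user_input.size, 32)),
                 rng.normal(size=(32, out_dim)))
book_vec = tower(book_input,
                 rng.normal(size=(book_input.size, 32)),
                 rng.normal(size=(32, out_dim)))

score = float(user_vec @ book_vec)  # cosine similarity, since both are unit-norm
print(f"user-audiobook score: {score:.3f}")
```

Keeping the towers separate is what allows audiobook embeddings to be precomputed and indexed, while user embeddings are computed at request time.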

The 2T model is trained using a contrastive loss function, which pulls a user embedding closer to the embeddings of audiobooks the user interacted with and pushes it away from the embeddings of audiobooks with no interaction. Initially, the interactions were primarily strong signals like "stream". Later, Spotify analyzed various weak signals like "intent to pay", "follow", and "preview" and added them as user interactions for 2T model training.

Model Deployment

2T-HGNN is trained daily. First, the HGNN model is trained, and the resulting audiobook and podcast embeddings are passed to the 2T model for its training. The 2T model generates enhanced audiobook embeddings, which are stored in a vector database for approximate nearest neighbor search. During inference, the user features and embeddings are passed through the user tower of the 2T model to obtain an enhanced user embedding. A vector similarity search between this user embedding and the audiobook index then fetches the top k audiobooks for the user.
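The retrieval step can be illustrated with an exact (brute-force) cosine search. This is a stand-in for the approximate nearest neighbor lookup that a vector database would perform, not a description of Spotify's serving stack; the function name top_k_audiobooks and the toy index are hypothetical.

```python
import numpy as np

def top_k_audiobooks(user_vec, audiobook_index, k):
    """Return indices and scores of the k audiobooks most similar to the user.

    Uses exact cosine similarity; a production system would typically use
    an approximate nearest neighbor index instead.
    """
    index_norm = audiobook_index / np.linalg.norm(
        audiobook_index, axis=1, keepdims=True
    )
    user_norm = user_vec / np.linalg.norm(user_vec)
    scores = index_norm @ user_norm
    top = np.argsort(-scores)[:k]  # highest scores first
    return top, scores[top]

# Toy index of four audiobook embeddings (made up for illustration).
audiobook_index = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
user_vec = np.array([1.0, 0.5, 0.0])

top_ids, top_scores = top_k_audiobooks(user_vec, audiobook_index, k=2)
print(top_ids, np.round(top_scores, 3))
```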

The modular structure of 2T-HGNN enables training the HGNN on a different schedule from the 2T model. For instance, the HGNN could be trained weekly to reduce costs, while the 2T model is updated daily to maintain fresh user representations.

Model Evaluation

Offline Evaluation

The model was first evaluated offline using standard ranking metrics like Hit-Rate@K, Mean Reciprocal Rank, and coverage.

Hit-Rate@K: This measures the proportion of users for whom at least one relevant item appears within the top K recommended items.
Mean Reciprocal Rank (MRR): This metric evaluates the ranking position of the first relevant item in the list of recommendations. It calculates the reciprocal rank of this item (e.g., 1 for first position, 0.5 for second, etc.), and averages this score across all users to reflect the overall ranking quality.
Coverage: This measures the diversity of items recommended across all users.
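The three metrics can be computed as in the sketch below. The function names and the toy data are illustrative, and coverage is implemented here under one common definition (the fraction of the catalog that appears in at least one user's top K list), which is an assumption since the article does not spell out the formula.

```python
def hit_rate_at_k(recommendations, relevant, k):
    """Fraction of users with at least one relevant item in their top k."""
    hits = sum(
        1 for user, recs in recommendations.items()
        if set(recs[:k]) & relevant.get(user, set())
    )
    return hits / len(recommendations)

def mean_reciprocal_rank(recommendations, relevant):
    """Average of 1 / rank of the first relevant item (0 if none is found)."""
    total = 0.0
    for user, recs in recommendations.items():
        for rank, item in enumerate(recs, start=1):
            if item in relevant.get(user, set()):
                total += 1.0 / rank
                break
    return total / len(recommendations)

def coverage(recommendations, catalog_size, k):
    """Fraction of the catalog recommended to at least one user in the top k."""
    recommended = set()
    for recs in recommendations.values():
        recommended.update(recs[:k])
    return len(recommended) / catalog_size

# Toy ranked lists and held-out relevant items (made up for illustration).
recommendations = {
    "u1": ["A1", "A2", "A3"],
    "u2": ["A2", "A1", "A4"],
    "u3": ["A1", "A2", "A3"],
}
relevant = {"u1": {"A1"}, "u2": {"A4"}, "u3": {"A9"}}

print("Hit-Rate@2:", hit_rate_at_k(recommendations, relevant, k=2))
print("MRR:", mean_reciprocal_rank(recommendations, relevant))
print("Coverage@3:", coverage(recommendations, catalog_size=10, k=3))
```

In the toy data, the lists are dominated by A1 and A2, so hit rate can be reasonable while coverage stays low, which is the pattern associated with popularity bias.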

Offline Model Evaluation on metrics like Hit Rate, Mean Reciprocal Rank and Coverage. Pic Source ([1])

The 2T-HGNN model’s performance was compared with models such as the popularity model (ranking based on popularity), HGNN-w-users (a tripartite GNN with users as nodes), LLM-KNN (content-based embedding similarity search), and 2T (a two-tower model without HGNN embeddings). The 2T-HGNN outperformed all the models on Hit-rate@10 and MRR metrics. It performed poorly in coverage, meaning that 2T-HGNN had a popularity bias.

Online Evaluation

An A/B experiment was conducted using 2T-HGNN as a candidate generator to assess its online performance for the "Audiobook for You" section on Spotify’s homepage. This experiment involved 11.5 million users divided into three groups: one using the current production model, one with recommendations from a 2T model, and one with recommendations from the 2T-HGNN model. The following business metrics were used for online evaluation:

Stream rate – This metric tracks the volume of audiobook streams generated by the recommendations. "Rate" (the number of streams listened to by the users divided by the number of streams shown to the user) is used to normalize the numbers for fair comparison.
New audiobook start rate – This metric tracks the number of new audiobooks users started listening to. "Rate" (the number of new streams started by the users divided by the number of new streams shown to the user) is used to normalize the numbers for fair comparison.
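As a small worked example of the normalization, the snippet below computes a rate and the relative lift of a treatment group over a control group. The counts are invented purely for illustration and are not results from the experiment; the function names are hypothetical.

```python
def rate(engaged, shown):
    """Engagements divided by impressions; 0.0 when nothing was shown."""
    return engaged / shown if shown else 0.0

def relative_lift(treatment_rate, control_rate):
    """Relative change of the treatment rate over the control rate."""
    return (treatment_rate - control_rate) / control_rate

# Made-up counts, NOT figures from Spotify's A/B test.
control = {"shown": 10_000, "streamed": 400}
treatment = {"shown": 10_000, "streamed": 500}

control_rate = rate(control["streamed"], control["shown"])
treatment_rate = rate(treatment["streamed"], treatment["shown"])
lift = relative_lift(treatment_rate, control_rate)

print(f"control stream rate:   {control_rate:.2%}")
print(f"treatment stream rate: {treatment_rate:.2%}")
print(f"relative lift:         {lift:.1%}")
```

Dividing by the number of recommendations shown keeps the comparison fair when groups differ in size or in how often the section is displayed.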

A/B Test Results Using metrics like Stream Rate and New audiobook start rate. Pic Source ([1])

Results showed that 2T-HGNN significantly increased the rate of new audiobook starts and led to higher audiobook stream rates, whereas the 2T model showed a smaller increase in start rate and no significant impact on stream rate.

References –

I hope you find the article insightful. Thank you for reading!