
Do you ever wonder how the "Recommended (based on what’s in your playlist)" section on your Spotify works? This article solves part of the mystery and builds a recommendation system in the process. Let’s break down the "recommendation algorithm" behind streaming services.
Context
This is an article in a four-article series about using Spotify playlist datasets to build an automatic playlist continuation (APC) algorithm. See the previous and following articles, written by my teammates, here:
- Part I: Extracting song data from Spotify’s API in Python
- Part II: EDA and Clustering
- Part III: Building a Song Recommendation System with Spotify
- Part IV: Deploying a Spotify Recommendation Model with Flask
Now that we’ve learned how to import the data and build a clustering model on it, we can take a look at using a recommendation system to predict the next songs in a playlist.
Introduction
You might have heard the term "Recommendation System (RS)" when YouTubers discuss the latest tactics to get more views, or when you and your friends compare your "Recommended for you" lists on Netflix. In a nutshell, recommendation systems _recommend things that people might like_ based on your own watch history, or on your and your friends’ watch history as a collective. From a non-Data Science practitioner’s perspective, that’s pretty much all you need to know about RS… Oh, but you ARE a Data Science practitioner. If that’s the case, we need to dig a little bit deeper into the ideas and code behind a recommendation system.
Before we deep dive into the RS and the code, let’s take a step back and clarify some questions:
- _What is a recommendation system?_ As I mentioned before, the goal of a recommendation algorithm is to recommend or predict items a user might like, based either on that user’s own data or on the entire user database. We will talk about precisely how to do this and the different types of RS later on, but for now, here’s a conceptual pipeline showing the process of recommending a song.

- _Why should we use a recommendation system for song prediction tasks?_ The other question is why recommendation systems are used in a song prediction task. As you read in Part II, it is possible to use a cluster-based algorithm to predict the songs; however, it lacks the flexibility to add other features to the system, such as a classification predictor. In other words, a cluster-based algorithm is one type of recommendation system, but compared to the two other types of RS introduced below, it lacks flexibility. In fact, both content-based filtering and collaborative filtering can incorporate the clustering outcome into their models, creating a hybrid RS.
Now, let me briefly introduce the two most common recommendation systems in the industry.
Recommendation System

Based on the object to which the filtering technique is applied, we can classify RS into two types: content-based filtering and collaborative filtering. Woah, lots of big words here. Let me explain. Filtering refers to "selection" – we are selecting features that are similar rather than dissimilar to one another. For example, take the following words:
Your bag of words: cat, cow, bird, bridge
Which word is the odd one out? Intuitively, the word "bridge" pops out. How come? You might think that all the others are a type of animal. You could also think that all the others are alive, all the others have eyes, etc. The point is: there is a set of distinct features that separates bridge from all the other words in the bag.
Now, we have a new word, dog, and we want to know which word in the bag is closest to dog. Naturally, we would say, "Dog is more similar to cat, cow, and bird, compared to bridge." If someone asked you why, you would answer that a dog is a type of animal, is alive, has eyes, etc.
Wait. This seems oddly familiar. Yes, you are using the same process you used to differentiate bridge from the other words in the bag, or more precisely, the same set of features. By finding the similarities in features among the items in the bag, we essentially "filtered" bridge out of the bag.
Voilà. You now know how to do content-based filtering! Content-based filtering uses the features of each item to find similar items. By assigning a score to how similar each item is, we can recommend an item based on how similar it is to all other items in the dataset.
In the context of Spotify playlists, we use the features (loudness, tempo, etc.) of each song in a playlist to find the average score of the whole playlist. Then, we recommend a song that has a score similar to the playlist but is not in the playlist.
Content-based Filtering: Recommend songs that are similar to the other songs in the dataset.
"But, wait. You said another type of filtering. How is there a different way to select and filter the items in a bag other than their features?"
Good question! Simply, there isn’t! The other type of filtering does not filter the items in the bag, but the bag itself! Confused? Let me explain with the same analogy:
Now, your friend comes in with another bag of words with the following words:
Your friend’s bag of words: road, tunnel, steel, goat.
Using our previous approach, we can see that the new word "dog" is more similar to your bag than to your friend’s bag based on the item features, since most of the words in your bag are a type of animal, have eyes, etc., while your friend’s bag contains mostly inanimate objects.
However, there is another way to approach this problem. What if you have a group of friends with their own bag of words as the following:

If the word is in their bag, the value would be "YES" or else it would be "NO". Now, I colored all values with either red for YES or blue for NO and separated the words into two parts based on the black line.

We see that the words above the black line are animals, while the words below are inanimate objects.
Now, I am not sure whether the word dog should be put in your bag or your friend’s bag, since neither of you has the word in your bag at the moment. To decide, I conducted the following investigation:
- I looked at your friends A, B, and C and see what words are in their bag.
- Surprisingly, I found that you have 3 mismatches with Friend A and your friend has only 1 mismatch with Friend C.
- I checked whether they have the word dog in their bags or not.
- Friend A has dog, and Friend C does not.
- Therefore, I put dog in your bag rather than your friend’s.
We see that we "filtered" the data based on what you and your friends have in common or, in other words, collaboratively. You have just performed collaborative filtering (CF)!
Since you and your friends are the owners of the bag and we look for the similarities among the items in your bags, this type of collaborative filtering is called user-user CF, where the emphasis is put on the items held among users.
In the context of song prediction, we would look at the similarities between the songs in each playlist, and recommend a song from one playlist to another when the song overlap between the two playlists is high and the song is not already in the other playlist.
Collaborative Filtering (CF): Recommend songs based on the overlap of songs in playlists in the dataset.
There are alternative ways to do collaborative filtering, including item-item CF and the broader umbrella of model-based CF, but we will not go into detail on them in this article.
Please refer to this article by Abhijit Roy for more information.
Implementation

In this section, I will mainly be implementing content-based filtering due to the constraints of this project.
Looking at the annotated recommendation system pipeline above, we will first look at the features of the Spotify data based on the data cleaning from Part I. Then, we will conduct feature engineering to process the data into a form that can be fed into the content-based filtering algorithm, and introduce the similarity measure used. Lastly, we will briefly discuss some thresholds and metrics associated with the model.
Preprocessing
Before extracting any features, two preprocessing steps are needed after importing the raw data extracted in Part I.
- Data Selection
The data selection procedure involves two tasks. The first is dropping duplicate songs. Since the imported data was originally Spotify playlist data, it is crucial to delete duplicate songs that appear in multiple playlists. The process compares both the artist name and the track title so that we do not accidentally delete songs that have the same name but are by different artists.
This is conducted simply via pandas dataframe manipulation:
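A minimal sketch of what this could look like, assuming the data from Part I was saved to a CSV with (hypothetical) artist_name and track_name columns:

```python
import pandas as pd

# Load the raw playlist data exported in Part I (file name is an assumption).
songs_df = pd.read_csv("spotify_songs.csv")

# Drop rows only when BOTH the artist name and the track title match,
# so songs that share a title but come from different artists are kept.
songs_df = songs_df.drop_duplicates(subset=["artist_name", "track_name"])
```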
Due to the abundance of data from the original source, we also want to select only the features that are relevant to the later feature engineering and recommendation steps (more information in the full notebook). This was also done using pandas:
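Again a sketch; the exact columns kept here are illustrative and may differ from the full notebook:

```python
# Keep only the columns used for later feature engineering and recommendation.
feature_cols = [
    "id", "track_name", "artist_name", "genres", "track_pop", "artist_pop",
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
]
songs_df = songs_df[feature_cols]
```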
- List Concatenation
After selecting the useful data, we need to convert the genres column back into a list, since lists are flattened into plain strings when the dataframe is imported.

This is done using the split() function:
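A sketch, assuming the genres were saved as a single space-separated string (the real separator depends on how the data was exported in Part I):

```python
# After the round trip through CSV, the genres column is a plain string,
# e.g. "dance_pop pop r&b". Convert it back into a Python list.
songs_df["genres_list"] = songs_df["genres"].apply(lambda g: g.split(" "))
```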
Features
The next step is to engineer features from the various data. We can classify the variables into three types based on their source: metadata, audio data, and text data.
- Metadata
Metadata refers to the attributes related to the song but not the song itself (e.g., popularity and genres). In this project, I treated the metadata in two ways.
For genre data, I used one-hot encoding, a common technique for transforming categorical data into machine-readable features. It works by converting each category into its own column, so that each category can be represented as either True or False.

This can be executed using the pandas package:
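For example, one-hot encoding a categorical column with pandas (shown here on the track key as an illustration; any categorical column works the same way):

```python
import pandas as pd

# One-hot encode a categorical column: each category becomes its own
# True/False (0/1) column.
key_ohe = pd.get_dummies(songs_df["key"], prefix="key")
songs_df = pd.concat([songs_df, key_ohe], axis=1)
```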
However, genres on Spotify are not evenly distributed: some genres are far more prevalent, while others are more obscure. In addition, one artist or track can be associated with multiple genres. Hence, we need to weigh the importance of each genre to avoid overweighting specific genres while underestimating others.
Therefore, TF-IDF measures are introduced and applied to the genre data. TF-IDF, also known as Term Frequency-Inverse Document Frequency, is a tool to quantify words in a set of documents. The goal of TF-IDF is to show how important a word is within a document and across the corpus. The general formula for calculating TF-IDF is:
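In standard notation (assuming the common log-scaled variant), with t a term (here, a genre), d a document (here, a song), N the total number of documents, and df(t) the number of documents containing t:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\left(\frac{N}{\mathrm{df}(t)}\right)
```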

- Term Frequency (TF): The number of times a term appears in a document divided by the total word count of that document.
- Inverse Document Frequency (IDF): The log of the total number of documents divided by the document frequency, where document frequency is the number of documents in which the term appears.
The motivation is to find words that are not only important within each document but also informative across the entire corpus. The log is taken to dampen the impact of a large N, which would otherwise lead to a very large IDF compared to TF. TF focuses on the importance of a word within a document, while IDF focuses on the importance of a word across documents.
In this project, the documents are analogous to songs. Therefore, we calculate how prominent each genre is within a song and how prevalent it is across songs to determine that genre’s weight. This is much better than simple one-hot encoding, which has no weights to capture how important and widespread each genre is and therefore overweights uncommon genres.

To implement this, we used the TfidfVectorizer() function from scikit learn.
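A sketch of how this could look, building on the genres_list column from the preprocessing step (older scikit-learn versions use get_feature_names() instead of get_feature_names_out()):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat each song's genre list as one "document" and compute per-genre
# TF-IDF weights across the whole catalogue of songs.
genre_docs = songs_df["genres_list"].apply(lambda genres: " ".join(genres))
tfidf = TfidfVectorizer()
genre_matrix = tfidf.fit_transform(genre_docs)

# Turn the sparse matrix into a dataframe with readable column names.
genre_df = pd.DataFrame(
    genre_matrix.toarray(),
    columns=["genre_" + g for g in tfidf.get_feature_names_out()],
)
```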
For popularity data, I treated it as a continuous variable and only normalized it to a range between 0 and 1. The idea is that popular songs are likely to be heard by people who like popular songs, while less popular songs are likely to be heard by people who share that same, more niche taste.
This was done via the MinMaxScaler() function in scikit learn:
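A sketch with assumed column names (the same kind of scaling is also applied to the audio features in the next section):

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale the popularity columns into the [0, 1] range based on their
# minimum and maximum values.
scaler = MinMaxScaler()
songs_df[["artist_pop", "track_pop"]] = scaler.fit_transform(
    songs_df[["artist_pop", "track_pop"]]
)
```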
- Audio
Audio data refers to the audio features of the song extracted using the Spotify API: for example, loudness, tempo, danceability, energy, speechiness, acousticness, instrumentalness, liveness, valence, and duration. In this project, the only manipulation I performed on these data was normalization based on the maximum and minimum values of each variable.
In addition, one-hot encoding was applied to several other audio features, such as the key of the track. The constraint of this method is the same as that of one-hot encoding in general: we do not know whether listeners actually treat every key as equally important. Assuming every key is equally weighted is unlikely to give the best mathematical representation of the data, so some hyperparameter tuning could be required to improve the prediction.
- Text
The only text feature I ended up using was the track name. I conducted sentiment analysis to find the polarity and subjectivity of each track name.
- Subjectivity (0 to 1): The degree to which the text expresses personal opinion rather than factual information.
- Polarity (-1 to 1): How negative or positive the sentiment of the text is, accounting for negation.
The goal of the sentiment analysis is to extract additional features from the tracks. By doing so, we can capture sentiment information that the audio features alone do not provide, using textual information. For example, if the general mood of the song titles in a playlist is positive, this can be used to recommend positive songs. However, because titles are short, these two metrics cannot produce optimal results, so they are given a low weight.
To conduct sentiment analysis, TextBlob.sentiment was used:
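A sketch of the sentiment step, assuming the track_name column from before:

```python
from textblob import TextBlob

# Score each track name on subjectivity (0 to 1) and polarity (-1 to 1).
songs_df["subjectivity"] = songs_df["track_name"].apply(
    lambda name: TextBlob(name).sentiment.subjectivity
)
songs_df["polarity"] = songs_df["track_name"].apply(
    lambda name: TextBlob(name).sentiment.polarity
)
```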
Now, we combine every feature engineering method into one function and output the data into a large feature dataframe:
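The sketch below shows the general shape of that function; the blocks included, the column names, and the 0.3 weight on the sentiment features are all assumptions rather than the exact values from the notebook:

```python
import pandas as pd

def build_feature_frame(songs_df, genre_df, audio_cols):
    """Combine every engineered feature block into one dataframe keyed by track id."""
    features = pd.concat(
        [
            genre_df.reset_index(drop=True),                      # TF-IDF genre weights
            songs_df[audio_cols].reset_index(drop=True),          # normalized audio features
            songs_df[["artist_pop", "track_pop"]].reset_index(drop=True),
            songs_df[["subjectivity", "polarity"]].reset_index(drop=True) * 0.3,
        ],
        axis=1,
    )
    features["id"] = songs_df["id"].values  # keep the track id for lookups later
    return features

feature_df = build_feature_frame(
    songs_df, genre_df,
    audio_cols=["danceability", "energy", "loudness", "speechiness",
                "acousticness", "instrumentalness", "liveness", "valence", "tempo"],
)
```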
Process
After extracting all the data for each song, we now discuss the process of performing content-based filtering. Two steps were implemented in the algorithm, and both are needed every time someone enters a new playlist query:
- Playlist Summarization

In this step, we want to summarize all songs in a playlist into one vector that can be compared to all other songs in the dataset to find their similarities.
First, we need to import a playlist dataframe. The only thing required in the dataframe is the track id. With the track ids, we can separate the songs that are in the playlist from those that are not. It is important to exclude the songs already in the playlist, since we do not want to recommend existing songs.
Then, we look up the features of those songs in the dataset we generated in the last section. Hence, it is important that our dataset includes as many songs as possible, to reduce the chance that songs in the playlist have no match during this step.
Finally, we add the feature vectors of all the songs in the playlist together to form a summarization vector.
In other words, this vector describes the whole playlist as if it is one song.
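A sketch of those steps, assuming the feature_df built above (with an id column) and a playlist dataframe that also carries track ids:

```python
def summarize_playlist(playlist_df, feature_df):
    """Split the catalogue into playlist / non-playlist songs and collapse
    the playlist songs into a single summary vector."""
    in_playlist = feature_df["id"].isin(playlist_df["id"])

    playlist_features = feature_df[in_playlist].drop(columns="id")
    nonplaylist_features = feature_df[~in_playlist]

    # Sum (rather than average) the rows; cosine similarity only looks at
    # direction, so the overall scale of the vector does not matter.
    playlist_vector = playlist_features.sum(axis=0)
    return playlist_vector, nonplaylist_features
```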
- Similarity and Recommendation
After retrieving the summarized playlist vector and the non-playlist songs, we can find the similarity between each individual song in the database and the playlist. The similarity metric chosen is cosine similarity.
Cosine similarity is a mathematical value that measures the similarity between vectors. Imagining our song vectors as only two-dimensional, the visual representation would look similar to the figure below.

If two vectors point in generally the same direction, they are similar. This is also why I did not take the mean of the songs but simply added them up: cosine similarity depends only on direction, not on magnitude. In our situation, the song vectors are high-dimensional, so we cannot illustrate them nicely in a graph, but the mathematical intuition is the same.
Formally, the mathematical formula can be expressed as:
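This is the standard cosine-similarity definition, with A as the summarized playlist vector and B as a candidate song vector:

```latex
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
             = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```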

In our code, we used the cosine_similarity() function from scikit learn to measure the similarity between each song and the summarized playlist vector.
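A sketch of the final scoring step, reusing the outputs of the summarize_playlist() function above:

```python
from sklearn.metrics.pairwise import cosine_similarity

def recommend(playlist_vector, nonplaylist_features, top_n=10):
    """Score every non-playlist song against the summarized playlist vector
    and return the ids of the most similar tracks."""
    candidates = nonplaylist_features.drop(columns="id")

    # One matrix-vector style computation: every row (song) vs. the playlist vector.
    scores = cosine_similarity(
        candidates.values, playlist_vector.values.reshape(1, -1)
    )[:, 0]

    results = nonplaylist_features[["id"]].copy()
    results["similarity"] = scores
    return results.sort_values("similarity", ascending=False).head(top_n)
```

The returned track ids can then be mapped back to track and artist names through the original songs dataframe.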
One big advantage of this approach is that the time complexity of the whole algorithm reduces to a single matrix multiplication, since we compute the cosine similarity between each row vector (a song) and the column vector of the summarized playlist features.
Result
One of the biggest problems with this model is that there are almost no metrics to evaluate whether a recommendation is good or bad. Since a clustering-based approach was built in Part II of this project, we decided to compare the two sets of recommendations to see whether there is any consensus.
Comparison


To compare the results of the different recommendation systems, we found a playlist called "Mom’s Playlist" and fed its data into the different engines (the upper-left figure shows the top 20 songs in Mom’s Playlist).
From the figure above, we see that the results were very different. Here, the two methods are the content-based filtering approach and the clustering method using K-nearest neighbors; for more information on the latter, please read Part II of the series. Across the lists, the only similarity is that Beyoncé appears in two of them, and even then the songs don’t match. The discrepancy in the results could indicate two possibilities. The most probable one is that both models are performing poorly, whether because of the dataset size, lack of hyperparameter tuning, or model constraints. However, it is also possible that one of the models is performing well while the other is not keeping up. Either way, this shows the problem of not having a proper metric to evaluate the model: without a way to measure success, it is hard to improve the model.
Lastly, this also shows the advantage big tech companies have in the field of recommendation systems. In an open-source environment, it is hard to measure the success of your system without deploying it and receiving feedback from users. In terms of song recommendation, that feedback could be the number of users adding recommended songs to their playlists. By looking at such metrics, we could perform A/B testing to see which model or parameters perform best and update the model accordingly. Nevertheless, understanding recommendation systems gives you a solid start in this field. Hopefully, after reading this article, you have gained more clarity into the black box behind the infamous but essential "recommendation algorithm".
Summary
In this article, we first learned the basics of recommendation systems, covering the two general approaches of content-based filtering and collaborative filtering. Then, we built a content-based filtering recommendation system from scratch using Spotify playlist data, which involved feature preprocessing, feature generation, and recommendation modeling. We also discussed the constraints of each step to show how this method could be further optimized for better results. Lastly, we compared our recommendations with those of another song recommendation system, specifically the clustering methodology, and saw the difficulty of measuring success for recommendation systems in an open-source environment.
After building a model, you cannot do anything with it if you do not know how to deploy it! If you want to learn the deployment process of this model, please read my teammate’s part on model deployment next in the series.
References
- Full Notebook On Github
- Full Project Github Repository
- Google Developers, Content-based Filtering (2021)
- A. Roy, Introduction To Recommender Systems- 1: Content-Based Filtering And Collaborative Filtering (2020), on Towards Data Science
- M. Thaker, Spotify Recommendation System (2020), on Github
- W. Scott, TF-IDF from scratch in python on a real-world dataset. (2019), on Towards Data Science
- P. Shah, Sentiment Analysis using TextBlob (2020), on Towards Data Science
- Spotipy, spotipy documentation (n.d.)