
Table of Contents
∘ Introduction ∘ Problem Statement ∘ Why Causal ML? ∘ Data Collection ∘ Exploratory Data Analysis (EDA) ∘ Data Modeling ∘ Model Explanation ∘ Building a Web App ∘ Limitations ∘ Conclusion
Introduction
What makes a song tick? It’s easy to justify your love for a song when the artist hits a high note or recites a thought-provoking verse. It’s also easy to like a song solely because it was performed by one of your favorite artists. However, that alone does not account for the current music landscape. In this saturated market, where countless tracks have similar voices, genres, and styles, some tracks just happen to outperform others.
This begs the question: are there more hidden/latent audio factors that influence our inclination toward certain tracks? This project attempts to answer this question by leveraging causal ML to build a tool that can help identify potential drivers of Spotify song popularity.
Note: All source code and the web app itself can be accessed in the repository provided at the end of this article.
Problem Statement
The goal of the project is to build a Machine Learning model that predicts the popularity of Spotify tracks based on user-defined features. The model will be deployed in a web app that can be accessed by other users.
Fortunately, Spotify quantifies many of its tracks’ otherwise qualitative audio data, which makes this project possible to carry out. For instance, the Spotify API offers the danceability feature, which provides a numeric value denoting how suitable a song is for dancing. For access to all of the provided audio features as well as their descriptions, feel free to visit the Spotify API documentation.
Spotify tracks also contain the popularity variable, which is the target label for this machine learning project. According to the documentation:
The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.
Why Causal ML?
It is standard to establish causation by conducting experiments (e.g., A/B tests) that isolate a predictive variable and measure its influence on a target variable. Unfortunately, it is infeasible to run experiments that compare songs differing in only one variable. After all, songs comprise many elements that are difficult to control with precision.
On the other hand, causal ML grants users the opportunity to create endless simulations of songs and derive their predicted popularity scores. Although the model will not directly indicate which variables cause high popularity (correlation is not causation), it will give impetus and direction for any subsequent research and analysis.
Data Collection
The data used for the project was pulled using the Spotify API. Specifically, Spotipy, the Python library for the Spotify Web API, was used to procure data on tracks and artists. The data collection entailed pulling tracks released in 2022 from the most successful artists on the platform in order to get exposure to high-popularity tracks.
Procuring the information required multiple steps as the features of interest needed to be pulled from different API endpoints.
The procedure for procuring the training data is as follows:
1. Collect data on all top artists
Since song popularity depends on the artists themselves, it is important to collect basic information like their follower count and music genres. The following function is used to collect data for 1000 artists.

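Since the original code is embedded in the repository, here is a minimal sketch of how this step could look with Spotipy. The client credentials are assumed to be set as environment variables, the search query is a stand-in (the author’s exact selection criteria aren’t shown here), and get_top_artists is a hypothetical helper name.

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Assumes SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are set in the environment
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def get_top_artists(n_artists=1000):
    """Collect basic artist data (followers, genres) via the search endpoint."""
    artists = []
    for offset in range(0, n_artists, 50):  # search returns at most 50 items per call
        results = sp.search(q="genre:pop", type="artist", limit=50, offset=offset)
        for item in results["artists"]["items"]:
            artists.append({
                "artist_id": item["id"],
                "artist_name": item["name"],
                "followers": item["followers"]["total"],
                "genres": item["genres"],
            })
    return artists
```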
2. Collect all tracks released by the top artists
After collecting data for all artists, Spotipy was used to collect all Spotify tracks for each artist. The following function collects data on all tracks released by a given artist. Note that the maximum number of records in a specific query is 1000.

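A sketch of that step, assuming the 2022 filter is applied to each album’s release date (get_artist_tracks is a hypothetical name):

```python
def get_artist_tracks(artist_id, year="2022"):
    """Collect ids and names of an artist's tracks released in the given year."""
    tracks = []
    offset = 0
    while True:
        albums = sp.artist_albums(artist_id, album_type="album,single",
                                  limit=50, offset=offset)
        for album in albums["items"]:
            if not album["release_date"].startswith(year):
                continue  # keep only releases from the target year
            for track in sp.album_tracks(album["id"])["items"]:
                tracks.append({"artist_id": artist_id,
                               "track_id": track["id"],
                               "track_name": track["name"]})
        if albums["next"] is None:  # no more pages of albums
            break
        offset += 50
    return tracks
```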
3. Collect all audio feature data for all tracks of top artists
Next, Spotipy is used to collect all audio data for the tracks collected in the previous step. The following function collects audio data for the given track (each track is identified by a track id).

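A batched sketch of this step; the audio-features endpoint accepts up to 100 track ids per call:

```python
def get_audio_features(track_ids):
    """Pull danceability, energy, tempo, etc. for the collected tracks."""
    features = []
    for i in range(0, len(track_ids), 100):  # at most 100 ids per request
        batch = sp.audio_features(track_ids[i:i + 100])
        features.extend(f for f in batch if f is not None)  # skip tracks without data
    return features
```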
4. Merge all datasets
By joining the datasets and storing each feature in individual columns, the resulting dataset contains artist data and track data, which can be used to train ML models to predict song popularity.

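A sketch of the merge with pandas, built on the hypothetical helpers above; note that the popularity target itself comes from the tracks endpoint:

```python
import pandas as pd

artists_df = pd.DataFrame(get_top_artists())
tracks_df = pd.DataFrame([t for a in artists_df["artist_id"]
                          for t in get_artist_tracks(a)])
features_df = pd.DataFrame(get_audio_features(tracks_df["track_id"].tolist()))

# The popularity target comes from the tracks endpoint (up to 50 ids per call)
popularity = []
for i in range(0, len(tracks_df), 50):
    batch = sp.tracks(tracks_df["track_id"].iloc[i:i + 50].tolist())["tracks"]
    popularity.extend(t["popularity"] for t in batch)
tracks_df["popularity"] = popularity

# Join artist data and track data into one modeling dataset
df = (tracks_df
      .merge(artists_df, on="artist_id", how="left")
      .merge(features_df, left_on="track_id", right_on="id", how="left"))
```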
Overall, the raw dataset comprises 101,700 tracks and 22 columns.

Reminder: the definitions of the audio features (columns 7–18) are provided in the Spotify API documentation.
Exploratory Data Analysis (EDA)
Before the dataset can be used for training machine learning models, a thorough analysis should be conducted to determine what elements should be added, discarded, and changed.
Performing EDA will uncover more information on the data, which will provide more insight into what processes and transformations should be executed prior to training any machine learning models.
1. Handling missing data
Songs that are missing audio features are removed from the dataset.
2. Handling duplicate data
All records with duplicate track ids are omitted.
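Both cleaning steps reduce to one-liners in pandas; the audio-feature column names below are a representative subset:

```python
# Remove tracks with missing audio features, then deduplicate on track id
df = df.dropna(subset=["danceability", "energy", "tempo"])
df = df.drop_duplicates(subset="track_id", keep="first")
```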
3. Examining the popularity variable
Next, the distribution of values in the popularity feature, the target variable, is visualized.
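A minimal sketch of how such a distribution plot can be produced:

```python
import matplotlib.pyplot as plt

# Histogram of the target variable
df["popularity"].plot(kind="hist", bins=50)
plt.xlabel("popularity")
plt.title("Distribution of track popularity")
plt.show()
```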

There is a lack of high-popularity tracks, which poses a challenge since the model needs to be able to correctly identify tracks with high popularity scores. One way to address this issue is to use an evaluation metric that heavily penalizes bigger errors, such as mean squared error (MSE) or root mean squared error (RMSE).
4. Examining the predictor variables
Similarly, the distribution of predictor variables was visualized.

Since the predictive features have different value ranges, they will have to be scaled to keep features with larger ranges from dominating the model. Moreover, the scaling method will need to handle the outliers exhibited by some of the features (e.g., liveness, acousticness).
5. Dropping features
Features such as track ids and artist ids that have no bearing on song popularity are dropped from the dataset.
6. Computing correlation between predictors
Multicollinearity, which arises when independent variables are strongly correlated, will hamper the performance of the trained model. To check for it, the variance inflation factor (VIF) of each variable was computed with the following function.

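A sketch of such a function using statsmodels, where X is the data frame of numeric predictors:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(X: pd.DataFrame) -> pd.DataFrame:
    """Return the variance inflation factor of every predictor in X."""
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })
```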
Since the predictive features yield VIF values less than 5, there is no evidence of multicollinearity in the dataset.
7. Handling categorical features
There are two categorical features in this dataset that need to be addressed prior to training the model: key and genre.
The key feature represents the musical key the track is in. Currently, the key is denoted by an integer, which falsely implies that some keys are greater than others. Thus, this feature will be one-hot encoded.
Unfortunately, one-hot encoding isn’t a feasible method for the genre feature since it has 609 unique values! One-hot encoding a column with this many unique values would only yield a high-dimensional dataset. Instead, the sub-genres will be consolidated into one binary variable named is_pop_or_rap, which is 1 if the song is pop or rap and 0 otherwise.
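A sketch of the consolidation, assuming the genre column holds one genre string per track:

```python
# Flag tracks whose genre mentions pop or rap, then drop the raw genre column
df["is_pop_or_rap"] = df["genre"].str.contains("pop|rap", case=False).astype(int)
df = df.drop(columns=["genre"])
```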
Data Modeling
The EDA has shed some light on how the modeling phase should be carried out. It indicates that the models will yield better performance if tuned based on the MSE or RMSE metric.
It also shows that the data will need to be subject to one-hot encoding and standardization prior to being used to train the models.
1. Preparing the training and testing data
The data is first split into training and testing sets.
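A standard split with scikit-learn (the 80/20 ratio is an assumption, as the exact split isn’t stated here):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["popularity"])
y = df["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```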
2. Creating a baseline
A baseline will help contextualize the performance of the actual models. For this study, the baseline model was a linear regression model with default parameters.
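Fitting and scoring the baseline takes just a few lines; this sketch assumes X_train and X_test contain only numeric predictors at this point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

baseline = LinearRegression().fit(X_train, y_train)  # default parameters
preds = baseline.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, preds)))
print("MAE:", mean_absolute_error(y_test, preds))
```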
3. Creating a feature engineering pipeline
Prior to being used to train the machine learning algorithms, the predictors will need to undergo feature engineering. First, the categorical feature key will be one-hot encoded. After that, all features will be scaled with standardization.
The transformations and the model were stored in a pipeline object using the following snippet.
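Since the original snippet lives in the repository, here is a minimal sketch of such a pipeline with scikit-learn (sparse_output requires scikit-learn ≥ 1.2):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_pipeline(model):
    """One-hot encode the key feature, standardize the rest, then fit the model."""
    preprocessor = ColumnTransformer(
        transformers=[("key", OneHotEncoder(handle_unknown="ignore",
                                            sparse_output=False), ["key"])],
        remainder=StandardScaler(),  # scale every remaining numeric feature
    )
    return Pipeline([("preprocess", preprocessor), ("model", model)])
```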
4. Training the models
Multiple machine learning regression algorithms were trained with the data after feature engineering. The regressor models used were the following:
- Linear regression (baseline)
- Lasso regression
- Random forest regressor
- LightGBM regressor
- XGBoost regressor
For each regression model, hyperparameter tuning was applied based on the mean squared error metric to ascertain the best hyperparameter combinations. The best hyperparameters for each model were chosen using the following function.
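A sketch of such a function built on GridSearchCV (the author’s exact search strategy and parameter grids aren’t shown here, so the grid below is illustrative):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def tune_hyperparameters(pipeline, param_grid, X_train, y_train):
    """Cross-validated grid search that minimizes mean squared error."""
    search = GridSearchCV(pipeline, param_grid,
                          scoring="neg_mean_squared_error", cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_

# Example: tuning the random forest inside the pipeline from the previous step
best_rf, best_params = tune_hyperparameters(
    make_pipeline(RandomForestRegressor(random_state=42)),
    {"model__n_estimators": [100, 300], "model__max_depth": [None, 20]},
    X_train, y_train,
)
```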
After the hyperparameters for the models were determined, the models were evaluated against the testing set based on the RMSE and MAE metrics. The following table summarizes the performances of all of the models.

The baseline model yielded an RMSE and MAE of 17.65 and 13.03 against the testing set, respectively. While all of the other models outperformed the baseline, the random forest regressor was the stand-out performer, with an RMSE and MAE of 13.10 and 8.04, respectively.
Given that the random forest model yielded the best performance against the testing set, it will be deployed in the web app.
Model Explanation
SHapley Additive exPlanations (SHAP) helps explain the random forest regressor’s behavior by showing how much each feature contributed to the predictions. This sheds some light on how predictions are being made (i.e., which features influence the target) and can even reveal deficiencies in the model or training data.
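A sketch of how the summary plot below can be generated, reusing the tuned pipeline (best_rf) from the modeling step:

```python
import shap

# Explain the fitted random forest on the preprocessed test set
rf = best_rf.named_steps["model"]
X_test_tr = best_rf.named_steps["preprocess"].transform(X_test)
feature_names = best_rf.named_steps["preprocess"].get_feature_names_out()

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test_tr)
shap.summary_plot(shap_values, X_test_tr, feature_names=feature_names)
```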

According to the plot, the regressor’s predictions are most heavily impacted by the followers, duration_s, and speechiness features. The model deems that songs by artists with a high following, medium duration, and high speechiness (i.e., verbosity) will yield high popularity. Intuitively, it makes sense for these factors to influence popularity, but it is surprising that the model doesn’t rank features like energy or is_pop_or_rap highly.
Overall, the model’s evaluation metric scores suggest that there is room for improvement. The summary plot makes it clear that some features are not favored by the model despite their ties to popularity being backed by domain knowledge. It also suggests that the limited features in the data itself prevent the model from accurately gauging popularity.
Building a Web App
Deploying the model through a web app is an effective way of using it to generate predictions for multiple songs.
The random forest regressor is deployed in a Streamlit app, which predicts the popularity of tracks with features chosen by users.
The Streamlit app can be run by entering the following command into the terminal:
streamlit run app.py

Users can select their track’s features in the sidebar and click the "predict" button to view the model’s prediction, as in the sketch below. With this web app, users can leverage the ML model to predict song popularity for endless combinations of song parameters.
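A stripped-down sketch of the app’s core logic; "model.pkl" and the two sidebar widgets shown are placeholders, as the real app exposes one widget per predictor:

```python
import joblib
import pandas as pd
import streamlit as st

model = joblib.load("model.pkl")  # the serialized random forest pipeline

st.title("Spotify Song Popularity Predictor")
danceability = st.sidebar.slider("danceability", 0.0, 1.0, 0.5)
energy = st.sidebar.slider("energy", 0.0, 1.0, 0.5)
# ...one sidebar widget per remaining feature...

if st.sidebar.button("predict"):
    features = pd.DataFrame([{"danceability": danceability, "energy": energy}])
    st.write(f"Predicted popularity: {model.predict(features)[0]:.1f}")
```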
Limitations
Although causal ML can provide some insight into which audio features influence song popularity, this approach has limitations that need to be acknowledged. Specifically, there are several subjects that the study does not sufficiently address.
1. Numerical representation of audio features
While Spotify does provide users with the means to quantify certain features in a song (e.g., energy), the assigned numeric values may not be adequate for representing the otherwise qualitative features.
2. Artist/Album/Song Marketing
There’s no doubt that a song’s popularity has a lot to do with how the artist markets their songs as well as themselves. Unfortunately, there is little representation of this factor in the collected data. For future reference, it’s worth considering the influence of entities like record labels and artists’ engagement on popular social media platforms (e.g., Twitter, Instagram).
3. Lyrics
While you can’t magically make someone love a song by saying a certain word, phrase, or sentence, lyrics no doubt play a role in how people enjoy tracks. Perhaps a topic modeling method like Latent Dirichlet Allocation (LDA) would provide some insight into the type of lyrics that garner more attention for different genres.
4. Customer Demographic
Listeners of different demographics are likely to have different standards for tracks. Instead of lumping the entire audience into one group, it might be preferable to segment the audience by age/gender/race and look into what types of songs appeal to these groups.
5. Gradual Change in User Preference
Finally, a significant shortcoming of the model is that it only accounts for songs released in 2022. Even if it adequately captures the elements that yield high popularity in tracks, the standards in music will inevitably change over time. As a result, this model will have to be consistently trained with new data to remain usable.
Conclusion

Overall, I took a shot at answering a question that is on the minds of many artists and record labels. By leveraging machine learning, we are able to run simulations to make conjectures on how songs with certain features will be received by listeners.
If you are interested in running the streamlit app on your device or if you just wish to examine the source code, you can access the project’s Github repository here:
GitHub – anair123/Identifying_Drivers_Of_Song_Popularity_With_Causal_ML
I hope you had as much fun learning about this project as I did making it.
Thank you for reading!