
Predicting Ticket-Selling Artists Using Spotify Audio Features And Pollstar Data

Will your artists sell tickets?

Image by the author

Why Do We Care?

As a talent buyer and concert promoter myself, predicting ticket sales is an everyday task for me. Box office records, streaming counts, and social media follower counts are the common metrics the music industry uses to decide whether artists and shows are worth investing in. The question is: what if an artist has no box office history, no active social media account, or no streaming history? Is there any other metric we can take into account? Can we predict an artist’s selling potential from their audio features instead of streaming counts?

The Project

The goal of this project is to predict whether an artist has the potential to be a top ticket-selling artist using Spotify audio features. This project includes four parts: (1) Data Analysis of Spotify Data, (2) Feature Comparison Between Top Ticket-Selling Artists and Other Artists, (3) Music Genre Analysis, and (4) Machine Learning Models.

Data Source

The datasets used in this project are from two sources: Spotify and Pollstar. Pollstar is a trade publication for the concert industry. The dataset I obtained from Pollstar includes 834 artists’ weekly box office records, including average ticket sales gross, average tickets sold, and average ticket price, from 2017 to early 2020.

Pollstar Dataset

The two Spotify datasets obtained from Kaggle contain (1) audio features of 160K+ songs released between 1921 and 2020 and (2) the music genres of each artist.

Data Analysis Of Spotify Data

Understanding the trend of music could help investors’ evaluation and direct them toward better investments in the concert industry. Therefore, I decided to conduct a data analysis solely on the Spotify dataset first to explore the relationships among the audio features. For the definition of each audio feature, please check here.

Audio Features Correlation Matrix

The correlation chart shows that year vs popularity, energy vs loudness, and energy vs acousticness are highly correlated.
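A correlation matrix like the one above takes one line with pandas. The sketch below uses toy numbers in place of the real dataset; the column names follow the Kaggle Spotify data.

```python
import pandas as pd

# Toy stand-in for the Kaggle Spotify dataset (same column names).
tracks = pd.DataFrame({
    "year":         [1950, 1970, 1990, 2010, 2020],
    "popularity":   [10, 25, 40, 60, 75],
    "energy":       [0.30, 0.45, 0.60, 0.75, 0.85],
    "loudness":     [-20.0, -15.0, -10.0, -6.0, -5.0],
    "acousticness": [0.90, 0.70, 0.40, 0.20, 0.10],
})

# Pairwise Pearson correlations between the numeric columns.
corr = tracks.corr()

# The same pairs that stand out in the chart: year vs popularity and
# energy vs loudness (positive), energy vs acousticness (negative).
year_pop = corr.loc["year", "popularity"]
energy_loud = corr.loc["energy", "loudness"]
energy_acoustic = corr.loc["energy", "acousticness"]
```

Passing `corr` to a heatmap function then reproduces the chart above.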

I am interested in seeing how each audio feature changes over time, so I created some line charts to examine if there is any trend in specific eras.
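Each of those trend lines boils down to averaging a feature per release year. A minimal sketch with toy data (real column names follow the Kaggle dataset):

```python
import pandas as pd

# Toy stand-in for the track-level Spotify data.
tracks = pd.DataFrame({
    "year":         [1930, 1930, 1950, 1950, 2010, 2010],
    "danceability": [0.60, 0.55, 0.40, 0.45, 0.70, 0.72],
})

# Average the feature per release year; plotting this Series
# (e.g. yearly.plot()) produces the trend line for that feature.
yearly = tracks.groupby("year")["danceability"].mean()
```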

Based on the graph, it seems that the newer a song is, the higher its popularity; however, please note that Spotify calculates popularity based not only on total streaming count but also on how recently tracks were played. Some songs might have a high streaming count but have not been played much recently, in which case their popularity might decrease. Therefore, I would not jump to the conclusion that newer songs are more popular than old songs.

Danceability decreased significantly from 1935 to 1950 and has increased over time after 1950.

Artists are making fewer and fewer acoustic and instrumental tracks.

Music is louder and more energetic now than before.

Tracks after 1950 feature far fewer spoken words than tracks before 1950; however, after 2000, spoken words in tracks started increasing. The liveness of tracks fluctuates, but in general it has been decreasing over the years.

The positive vibe of music (valence) decreased significantly after 1940, started increasing around 1950, and began decreasing again after 2000. After the 1960s, the duration of songs increased significantly, but it started decreasing after 2010.

The tempo of tracks has been getting faster and faster since 1950. Explicit lyrics increased significantly after 2000, and it is very interesting how quickly and recently that increase happened. Curious about the relationship between explicitness and other audio features, I found that although explicit songs account for only 9% of all tracks, their average popularity is 55% higher than that of non-explicit tracks. Explicit tracks also have higher danceability and energy than non-explicit songs.
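The explicit-versus-clean comparison is a grouped aggregation over the dataset's `explicit` flag. A sketch with made-up numbers (the real figures come from the full Kaggle data):

```python
import pandas as pd

# Toy tracks: explicit flag (1/0) plus two audio features.
tracks = pd.DataFrame({
    "explicit":     [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "popularity":   [70, 40, 35, 50, 65, 30, 45, 38, 42, 36],
    "danceability": [0.80, 0.50, 0.55, 0.60, 0.75, 0.45, 0.50, 0.52, 0.58, 0.48],
})

# Share of explicit tracks in the catalogue.
explicit_share = tracks["explicit"].mean()

# Mean popularity / danceability per group (0 = clean, 1 = explicit).
by_explicit = tracks.groupby("explicit")[["popularity", "danceability"]].mean()

# Relative popularity lift of explicit tracks over clean ones.
lift = by_explicit.loc[1, "popularity"] / by_explicit.loc[0, "popularity"] - 1
```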

After exploring the relationship between years and audio features, it’s possible to see that some turning points shown in the charts fall within the same time period. I was wondering if these turning points might be associated with the social environment and context. Here are my thoughts on this:

  • Pre-War Music vs Post-War Music

Some of the audio features changed significantly between 1940 and 1950, such as the decreases in danceability, tempo, loudness, and energy. Was this change caused by the wars or other social events of that time?

  • The Rise of Rap Music

This table shows the number of explicit songs by genre in different years in descending order; most of the explicit songs after 2000 are hip hop and rap. Was the increase in explicit lyrics associated with the rise of hip hop music?

  • Streaming Era

The decrease in song duration is probably associated with the streaming era. Streaming services have had a huge impact on how artists write music and get paid, pushing artists to create works that hook listeners’ attention in a short time. I am already wondering if TikTok will make music even shorter.

Feature Comparison Between Top Ticket-Selling Artists and Other Artists

Pre-Processing

After gaining some insights on music trends from the Spotify data, I would like to explore the differences in each audio feature between top ticket-selling artists and other artists. After some data cleaning, I merged the Spotify dataset and the Pollstar dataset on the artist column. Artists originally included in the Pollstar dataset are assigned "YES" in the top_artists column mentioned below; the rest of the artists are "NO".
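A sketch of the merge-and-label step, using hypothetical minimal versions of both datasets (pandas' `indicator` flag records which rows matched the Pollstar table):

```python
import pandas as pd

# Hypothetical minimal versions of the two datasets.
spotify = pd.DataFrame({
    "artists":    ["Artist A", "Artist B", "Artist C"],
    "popularity": [80, 35, 60],
})
pollstar = pd.DataFrame({"artists": ["Artist A", "Artist C"]})

# Left-merge keeps every Spotify artist; the indicator column tells us
# whether the artist also appears in the Pollstar box-office data.
merged = spotify.merge(pollstar, on="artists", how="left", indicator=True)
merged["top_artists"] = (merged["_merge"] == "both").map({True: "YES", False: "NO"})
merged = merged.drop(columns="_merge")
```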

Feature Engineering

I added the following columns for exploratory data analysis.

  • top_artists: YES means an artist is in the Pollstar dataset; otherwise, NO.
  • active_years: the number of years over which an artist has released tracks.
  • number_of_releases: how many tracks an artist has on Spotify.
  • duration_minutes: the track duration converted from milliseconds to minutes (duration_ms / 60,000).

As mentioned, my goal is to predict artists’ ticket-selling potential by their audio features, so I aggregated the dataset based on each artist to obtain the average number and standard deviation of each feature.
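The engineered columns and the artist-level aggregation can be sketched as below. The toy data and the exact definition of active_years (span between first and last release year) are my assumptions for illustration.

```python
import pandas as pd

# Toy track-level data; real column names follow the Kaggle dataset.
tracks = pd.DataFrame({
    "artists":     ["Artist A", "Artist A", "Artist A", "Artist B"],
    "year":        [2005, 2012, 2019, 2018],
    "duration_ms": [180_000, 240_000, 210_000, 200_000],
    "energy":      [0.60, 0.70, 0.80, 0.55],
})

# duration_ms -> minutes (1 minute = 60,000 ms).
tracks["duration_minutes"] = tracks["duration_ms"] / 60_000

# One row per artist: mean and standard deviation of each audio feature,
# plus the engineered counts.
per_artist = tracks.groupby("artists").agg(
    active_years=("year", lambda y: y.max() - y.min() + 1),
    number_of_releases=("year", "size"),
    energy_mean=("energy", "mean"),
    energy_std=("energy", "std"),
    duration_mean=("duration_minutes", "mean"),
)
```

Note that a single-track artist (Artist B here) gets NaN in the std columns, which is exactly the gap the imputation step later fills.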

Below are the feature comparisons between top ticket-selling artists and other artists:

Tracks of top ticket-selling artists have higher popularity, higher energy, and are less acoustic.

Tracks of top ticket-selling artists are slightly faster, louder, and less instrumental.

Tracks of top ticket-selling artists are slightly longer and less positive.

On average, top ticket-selling artists have been active in the market longer: 14 years, compared to 6 years for other artists. They also release 34 tracks on average, compared to 7 tracks for other artists.

Based on this graph, again, top ticket-selling artists have higher popularity on Spotify on average. However, regardless of whether artists sell tickets, artists who release explicit content are more popular on Spotify than artists who release only non-explicit content.

Music Genre Analysis

In general, more and more music genres emerge over the years, but I am more interested in the top music genres that top ticket-selling artists play. Please note that I did not use Spotify’s "popularity" score to define which genres are more popular; instead, I use the number of tracks in each genre to measure a genre’s popularity in different years. Spotify’s popularity feature scores individual tracks, and a highly popular track may belong to a genre that is not widely recognized. That is why I would rather define a genre’s popularity by how many tracks it produced than by the Spotify feature.
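Counting tracks per genre per year is a simple grouped size. A sketch on invented rows:

```python
import pandas as pd

# Toy genre-labelled tracks.
tracks = pd.DataFrame({
    "year":  [1975, 1975, 1985, 2015, 2015, 2015],
    "genre": ["rock", "rock", "rock", "pop", "hip hop", "pop"],
})

# Popularity of a genre in a given year = number of tracks released
# in that genre that year (not Spotify's track-level popularity score).
genre_counts = (
    tracks.groupby(["year", "genre"]).size()
          .rename("n_tracks").reset_index()
          .sort_values(["year", "n_tracks"], ascending=[True, False])
)
```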

Visit here for the interactive dashboard!

The interactive plot above shows the music genres that top artists play in different years. You can also see how the popularity of music genres varies over the years. Rock music dominated the industry in the ’70s and ’80s. After the ’90s, though, there was less and less rock music as pop and hip hop artists started dominating the market. K-pop and Latin music have also begun gaining recognition recently.

I did not include music genre in the dataset I used for machine learning. There are around 1,000 distinct music genres in this dataset, so re-categorizing them would take a great deal of time and work; it could even be a project of its own. I look forward to working on it in the future and seeing how adding genres to my machine learning dataset changes the outcome.

Machine Learning Models

The dataset I used for training the models is imbalanced. It includes the 834 top ticket-selling artists known from the Pollstar dataset and 18,851 other, non ticket-selling artists.

Data Imputation

There are some null values in the standard deviation columns. They simply mean that an artist has only one track on Spotify, so the standard deviation of each audio feature cannot be computed. Therefore, I replaced the null values with 0. I also dummified the target variable (1: top_artist, 0: non top_artist).
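Both steps are one-liners in pandas; a sketch on a toy artist-level table (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy artist-level table; NaN std means the artist has a single track.
per_artist = pd.DataFrame({
    "energy_std":  [0.12, np.nan, 0.05],
    "top_artists": ["YES", "NO", "YES"],
})

# A single-track artist has no spread, so 0 is a natural fill value.
std_cols = [c for c in per_artist.columns if c.endswith("_std")]
per_artist[std_cols] = per_artist[std_cols].fillna(0)

# Dummify the target: 1 = top ticket-selling artist, 0 = not.
per_artist["top_artists"] = (per_artist["top_artists"] == "YES").astype(int)
```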

Models

There are two models used in this project: (1) Random Forest Classifier (2) Logistic Regression.

Random Forest Classifier

The ROC AUC score is one metric for evaluating model performance. However, since this is an imbalanced dataset, the ROC AUC score alone might not measure performance well, so I also took precision and recall into consideration. Precision measures what proportion of positive predictions is correct, and recall measures how well the target class is identified. I ran GridSearch several times on the random forest classifier and narrowed the results down to three models. The scores among these three models are not all that different, but their false positive and false negative rates differ.
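The search can be sketched as below, on synthetic imbalanced data standing in for the artist features; the grid values here are illustrative, not the ones actually swept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (~5% positives) standing in for the real table.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Small illustrative grid; the real search would sweep more hyperparameters.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X_tr, y_tr)

# Held-out ROC AUC of the best model found by the search.
test_auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
```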

From left to right: Model 1, Model 2, Model 3

Model 1:

  • cross-validation score (ROC AUC): 0.92
  • train score (ROC AUC): 0.98
  • test score (ROC AUC): 0.91
  • false positive rate: 0.07
  • false negative rate: 0.33

Model 2:

  • cross-validation score (ROC AUC): 0.92
  • train score (ROC AUC): 0.99
  • test score (ROC AUC): 0.90
  • false positive rate: 0.04
  • false negative rate: 0.45

Model 3:

  • cross-validation score (ROC AUC): 0.91
  • train score (ROC AUC): 0.99
  • test score (ROC AUC): 0.90
  • false positive rate: 0.03
  • false negative rate: 0.55
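The false positive and false negative rates above come straight from the confusion matrix. A sketch with made-up prediction counts for an imbalanced test set:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels/predictions: 90 non-sellers, 10 top sellers.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 87 + [1] * 3 + [1] * 5 + [0] * 5)

# sklearn returns the matrix as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fpr = fp / (fp + tn)        # share of non-sellers flagged as sellers
fnr = fn / (fn + tp)        # share of true sellers the model misses
precision = tp / (tp + fp)  # how trustworthy a "YES" prediction is
```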

Feature Importance

The feature importances from these three models are similar too. Number of releases, average popularity, and active years are the top three indicators when making a prediction, and their scores are much higher than those of the other features.
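Feature importances can be read straight off a fitted random forest; a sketch on synthetic data, with illustrative column names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic features standing in for the artist-level table.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
cols = ["number_of_releases", "avg_popularity", "active_years",
        "energy_mean", "valence_mean"]  # illustrative names

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, one score per feature, summing to 1.
importances = (pd.Series(rf.feature_importances_, index=cols)
                 .sort_values(ascending=False))
```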

Logistic Regression

The performance of the logistic regression is not as good as that of the random forest classifier, so we will not consider this model.

I would suggest choosing the model with the highest precision as our final model: in this case, Model 3 of the random forest classifier. Since the cost of predicting that a non ticket-selling artist will be a top ticket-selling artist is higher than the cost of the reverse mistake, we should focus on how precisely our model detects ticket-selling artists. For example, you would expect to spend more money on venues, equipment, staff, hospitality, and so on for concerts or tours of top ticket-selling artists; if the model made the wrong prediction, you would very likely lose the money you invested, as ticket sales would fall short of expectations.

Conclusion

Music is sensitive to the social environment and context. Although we have advanced technology to help us predict the performance of artists, evaluating if your artist fits the current social environment or context would be as vital as, or even more important than, a model score.

Number of releases, average popularity, and active years are the most important features in our model. Based on our exploratory data analysis, these three features also show significant differences between ticket-selling artists and other artists. However, explicitness stood out in my exploratory analysis: explicit lyrics have been increasing significantly since 2000 and continue to grow, yet the feature does not appear in the feature importance graph. One possible explanation is that explicit content is popular among both groups of artists, so the feature is not discriminative enough to predict an artist’s box office performance. Nevertheless, since it is such a distinct feature in the exploratory data analysis, I would still take it into consideration.

In this project, the cost of a type I error (false positive) is higher than that of a type II error (false negative), so choosing the model with the highest precision will help us avoid type I errors.

Music genres could be an indicator for investors to know the current trend in the market. Based on our analysis, rock music is not as popular as it once was; instead, pop and hip hop have started dominating the music industry. Investing in a hip hop artist might be less risky than investing in a rock artist.

Music business is a delicate business that is closely connected to our daily life and society. Choosing a person to invest in, in my own perspective, is way more complicated than we imagine. This project is a good start for my own music business journey, and I hope it also helps anyone who is interested in this industry.

Source

  1. Yamac Eren Ay, Spotify Dataset 1921–2020, 160k+ Tracks, (2020), Kaggle.
  2. Concert Pulse 2017–2020, Pollstar.
