
Forecasting the Sound of Mainstream Music with Deep Learning

Utilizing LSTM Recurrent Neural Networks to Forecast Annual Trends of Audio Features

Photo by Fixelgraphy on Unsplash

Spotify, undoubtedly one of the most popular music streaming platforms today, is highly driven by data. Its data scientists use machine learning recommender systems to build personally curated playlists, one of the most popular and eerily accurate features of the platform. More thoughtful, and even novel, features are constantly sought out to keep users interacting with the platform.

So I was thinking of how to get creative with Spotify’s developer data.

Have you ever heard the term "ahead of their time" used to describe an artist? What if we could predict, today, who will be ahead of their time tomorrow? Of course, musicians who are "ahead of their time" are largely famous, but more importantly, they are often responsible for setting off a change in the landscape of their genres. A change in the course of sonic history.

Enter Data Science… and Spotify’s impressive algorithmic audio analysis, which profiles the features of every single song.

Spotify uses algorithms to give each song a numerical profile, based on features such as "acousticness", "danceability", and "tempo". We can aggregate these features over time to answer our question.
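For context, this per-track feature profile is exactly what Spotify’s Web API returns. Here is a minimal sketch using the spotipy client (client credentials are read from environment variables, and the track ID below is just a placeholder):

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Authenticate with client credentials (SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET
# are read from environment variables).
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

# Fetch the audio-feature profile for one or more track IDs (placeholder ID below).
features = sp.audio_features(["3n3Ppam7vgaVa1iaRUc9Lp"])
print(features[0]["acousticness"], features[0]["danceability"], features[0]["tempo"])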

Goals:

The main goal of this project is to analyze and forecast the annual average (mainstream) trends of the features of a dataset of over 160,000 songs on Spotify. I will attempt to forecast what the values of each feature will likely be in the future, using Long Short-Term Memory (LSTM) Recurrent Neural Networks. This will essentially be a data-driven, numerical representation of what the average song might sound like in 5 years. A playlist of the songs that most closely match those future feature values could then be created, titled "Tracks Ahead of Their Time", and serve as a popular attraction for app users.

The Data:

There are over 160,000 tracks, released from 1920 to 2020, each with 14 feature values:

(A preview of the dataframe is in the full EDA/preprocessing notebook on Jovian.ai: https://jovian.ai/oac0de/1-preprocessing-eda)
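If you want to follow along, the Kaggle export of this dataset loads straight into pandas. A quick sketch, assuming the file keeps its standard Kaggle name of data.csv:

import pandas as pd

# Load the 160k+ track Spotify dataset (file name assumed from the Kaggle export).
tracks = pd.read_csv("data.csv")

print(tracks.shape)                                 # roughly 160,000+ rows
print(tracks["year"].min(), tracks["year"].max())   # 1920, 2020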

Some of the feature definitions from the Spotify developer page:

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

Instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Exploring the Data

How has music evolved, numerically, over the years? Here are a couple examples of what I saw during EDA:

‘acousticness’ feature boxplot over time

There is a sharp distinction in acousticness: music from the early 20th century is highly acoustic. Technological advancements in instruments, amplifiers, and production equipment are the likely reason that more recent music features fewer acoustic instruments and a less acoustic overall sound.

This data’s distribution tells me that the mean values for acousticness per year might not be a great representation of the data, as they are skewed by outliers. We might need to create a new yearly data frame with median values instead; the medians will more accurately represent the majority of the data.

A violin plot for ‘key’ over time

Here we get an idea of how the different musical keys, a categorical feature, have been used over time. Keys 0 and 7 are used the most in most years, as indicated by the thickness of their violins.

Cleaning

At first glance, this dataset looked extremely clean to me. It had no missing values, no odd characters, etc. But upon closer inspection and visualization during EDA, some details started to emerge which needed to be dealt with.

For one example, the ‘speechiness’ definition tells us: "Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words."

We now know to filter out any tracks with speechiness values over 0.66, as they are not actually music. In other words, these are recorded speeches and perhaps stand-up comedy recordings which aren’t going to play a part in music trends.
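In pandas, that filter is a one-liner (assuming the dataframe from the loading sketch above is called tracks):

# Drop spoken-word recordings: keep only tracks with speechiness <= 0.66.
tracks = tracks[tracks["speechiness"] <= 0.66].copy()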

Feature Engineering

I noticed during EDA that I would need to create a new data frame that groups each feature’s values by the yearly median rather than the mean. This is because the lopsided distribution of outliers was rendering the mean a not-so-accurate measure of central tendency. Have a look at this comparison of median ‘acousticness’ vs. mean ‘acousticness’ over time:

The easiest way I found to create this new data frame was by assigning the grouped-by medians to separate columns and then merging them, because many of the original columns, such as ‘explicit’ or ‘name’, are not relevant for us. Take a look at my code here:

(See the aggregation code in the full EDA/preprocessing notebook on Jovian.ai: https://jovian.ai/oac0de/1-preprocessing-eda)
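For readers who just want the gist, a simpler equivalent of that aggregation is a per-year median over the feature columns of interest (the column list here is abbreviated; see the notebook for the exact code):

feature_cols = ["acousticness", "danceability", "energy", "instrumentalness",
                "liveness", "loudness", "speechiness", "tempo", "valence",
                "popularity", "duration_ms", "key"]

# Per-year medians of the relevant features, dropping columns like 'explicit' or 'name'.
yearly_medians = tracks.groupby("year")[feature_cols].median().reset_index()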

Modeling

I performed ARIMA and LSTM modeling and tuning, initially using the ‘acousticness’ feature, whose plotted time series looks like this:

The real goal here was to practice using LSTM models, but I did baseline modeling with ARIMA first. After numerous ARIMA and LSTM models, using logged values, differenced data, LSTM layer stacking, Bidirectional LSTM layers, and more, the model which gave me the lowest Root Mean Squared Error (RMSE) was an LSTM trained on 1-lag differenced data.
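As an illustration of that winning setup, here is a minimal sketch of an LSTM trained on 1-lag differenced data in Keras (the layer size and epoch count are placeholders, not the tuned values):

import numpy as np
from tensorflow import keras

# 1-lag difference the yearly series to make it (approximately) stationary.
series = yearly_medians["acousticness"].values
diffed = np.diff(series)

# Frame as supervised learning: previous difference -> next difference.
X = diffed[:-1].reshape(-1, 1, 1)   # (samples, timesteps=1, features=1)
y = diffed[1:]

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(1, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=200, verbose=0)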

Validation loss curve for the LSTM that used differenced data

So what exactly is an LSTM model? LSTM stands for "Long Short-Term Memory", and it is a type of Recurrent Neural Network (RNN). An LSTM cell selectively forgets part of its longer-term sequence (through the recurrent forget gate) using a series of sigmoid and tanh activations. The values that remain are used to predict the next output, and the updated cell state is passed into the next cell as the memory input for the next step in the sequence.
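For reference, the standard LSTM cell updates, written in the usual notation, are:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     (candidate memory)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(c_t)                        (hidden output)

where \sigma is the sigmoid function, \odot is element-wise multiplication, c_t is the long-term cell state, and h_t is the short-term hidden output passed to the next step.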

If you would like a more in-depth and technical understanding of how to create an LSTM model, check out the post I wrote where I break down the coding process (link coming soon!).

Forecasting Future Values with Best LSTM Model

Since values (years, in our case) become increasingly hard to predict the further out they are, we will cap our forecast at 5 years. Here is a chart representing the forecasts to 2025 for all of the features which were normalized (and could fit on the same y-axis):
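Mechanically, each of these multi-year forecasts can be generated recursively from a model trained on differenced data: predict next year’s difference, add it back onto the last known level, and feed the prediction in as the next input. A minimal sketch continuing the earlier snippet:

# Roll the model forward five years, feeding each prediction back in as input.
last_level = series[-1]
last_diff = diffed[-1]
forecast = []
for _ in range(5):
    next_diff = model.predict(np.array(last_diff).reshape(1, 1, 1), verbose=0)[0, 0]
    last_level = last_level + next_diff   # undo the differencing
    forecast.append(float(last_level))
    last_diff = next_diff

print(forecast)   # forecasted yearly medians for 2021-2025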

Filtering For Artists "Ahead of Their Time"

I decided to drop some features from the modeling which had the same median values throughout history, such as ‘explicit’ and ‘mode’.

Unfortunately, our dataset did not contain any song that exactly matched all of the features’ predicted (median) values. I resorted to filtering the data for the songs closest to the 2025 forecast, using an interval of each model’s RMSE around its predicted value.
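One way to implement that matching: a track "hits" a feature if its value falls within one model RMSE of the 2025 forecast, and tracks are ranked by how many features they hit. A sketch with only three features shown (the forecast values are the ones reported in the Conclusions below; the RMSEs are placeholders):

# Example 2025 forecasts (from the Conclusions) and per-feature RMSEs (placeholders).
forecast_2025 = {"danceability": 0.697, "valence": 0.455, "tempo": 122.2}
rmse = {"danceability": 0.02, "valence": 0.03, "tempo": 2.5}

def count_hits(row):
    """Count the features whose value lies within +/- RMSE of the forecast."""
    return sum(abs(row[f] - forecast_2025[f]) <= rmse[f] for f in forecast_2025)

tracks["hits"] = tracks.apply(count_hits, axis=1)
ahead_of_their_time = tracks.sort_values("hits", ascending=False).head(10)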

Most "Ahead of Their Time" Artist: Twenty One Pilots – Tear In My Heart

Second Most "Ahead of Their Time" Artist: INXS – Beautiful Girl

Third and Fourth Most "Ahead of Their Time" Artists: Luh Kel, Lil Tjay – Wrong (feat. Lil Tjay) Remix

Clean Bandit, Zara Larsson – Symphony (feat. Zara Larsson)

There were no songs in our dataset that fell within every forecasted feature’s range. The song with the most in common with the 2025 forecast, matching 7 out of 10 feature ranges, was Twenty One Pilots – ‘Tear In My Heart’. It fell within the error ranges around the forecast values for ‘danceability’, ‘duration_ms’, ‘energy’, ‘tempo’, ‘valence’, ‘popularity’, and ‘key’. A song matching 6 out of 10 features was INXS – ‘Beautiful Girl’. Two songs shared 5 out of 10 values, and 12 songs shared 4 out of 10.

Conclusions

We have successfully forecasted each of our features. So what do they tell us? Our model predicts that by 2025, the average (median) song will possess the following characteristics:

Tempo will have increased back to around 122.2 BPM.
Loudness will have continued to rebound, reaching a median of nearly -6 dB (quite loud!).
Instrumentalness may have made a comeback, increasing from 0.
Speechiness (vocals/spoken words) will have plateaued.
Acousticness will be much lower again, down to 0.08.
Average popularity will continue to grow.
Valence (positivity/happiness) will have dipped and plateaued at 0.455.
Energy will continue to rebound and hit 0.649.
Danceability will more or less plateau as well, rising by only a small fraction to 0.697. In other words, peak danceability is here to stay for the next 5 years.

Subsequent research suggests that simpler time series, with fewer data points, generally do better with ARIMA modeling, while LSTM models can perform better, and more efficiently, on very complex time series. Even so, we were able to obtain a lower RMSE with LSTM modeling, using stationary (differenced) data, for our final forecast of each feature!

Further Recommendations

If possible, we could access the Spotify API directly and query the entire Spotify library, over 50 million songs in total, for the artists most ahead of their time across all features. That would give us a much better chance of finding songs that completely match the forecasted future medians of every modeled feature.

It would be interesting to extend this forecast to individual genres, to see how each genre has evolved over time and which artists within those genres were making music with those feature values much earlier.

Lastly, allocating more time to improving the model architecture and/or hyperparameters would help the accuracy of our predictions.


Feel free to check out the whole project on GitHub here and also send me a message if you have any questions or tips about my process!

