Do Hit Songs Have Anything in Common?

Analyzing Spotify’s data using Python

Published in Towards Data Science · 7 min read · May 6, 2019

When you log in to Spotify.me, you get a personalised summary of how Spotify understands you through the music you listen to. It is pretty cool!

As someone who listens to music a lot and who likes to play around with data, this inspired me to see if I could analyse my own music.

I’ve also been very curious to know if there are certain ingredients that make up hit songs. What makes them cool? Why do we like hit songs, and do hit songs have a certain “DNA”?

Objective

This led me to try to answer two questions using data from Spotify in this post:

What does my music playlist look like?
Are there particular audio attributes that are common among hit songs?

Tools

Luckily, there are really simple tools out there to help us connect to Spotify, retrieve data and then visualise it.

We will be using Python 3 as the programming language and Spotipy, a Python library that lets you connect to the Spotify Web API. For data visualisation we will use plot.ly and Seaborn.

Dataset

At the end of each year, Spotify compiles a 100-track playlist of the songs streamed most often over the course of that year. The dataset I used was already available on Kaggle: Top Spotify Tracks of 2018. The top 100 songs on Spotify seem like a reasonable dataset to treat as our hit songs, don’t you think?

Let’s Get Started!

To get started, you will need to create an account at developer.spotify.com. After that, you can access the Spotify Web API Console directly and start exploring the different API endpoints.

Note: The link to the code I used for the entire project is at the end of the blog post.

After connecting to the Spotify Web API, we will create a Spotify object using the Spotipy Python library, which we will then use to query the Spotify endpoints (a bit of a mouthful, I know 😃).

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy import util

# Client ID and secret come from your app on developer.spotify.com
cid = "Your-client-ID"
secret = "Your-Secret"
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
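With the client in place, the next step is pulling the tracks out of a playlist. The snippet below is a minimal sketch of how that could look, not necessarily the code used for this project. It assumes a Spotipy version that provides playlist_items, and a public playlist (the client credentials flow cannot read private ones). The helper name fetch_playlist_tracks and the choice of fields to keep are my own.

```python
def fetch_playlist_tracks(sp, playlist_id):
    """Return one dict per track in a playlist, following pagination.

    `sp` is a spotipy.Spotify client; `playlist_id` is the playlist's ID or URI.
    The helper name and the selected fields are illustrative assumptions.
    """
    rows = []
    page = sp.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track:  # removed or unavailable tracks can come back empty
                continue
            rows.append({
                "id": track["id"],
                "name": track["name"],
                "artists": [artist["name"] for artist in track["artists"]],
                "popularity": track.get("popularity"),
            })
        # sp.next() requests the following page; stop when there is none left
        page = sp.next(page) if page.get("next") else None
    return rows
```

You would call it as fetch_playlist_tracks(sp, "your-playlist-id"), where the ID is a placeholder for one of your own playlists.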

Exploratory Data Analysis of My Playlist

Exploratory data analysis is one of the most important steps in data science. Our goal here is to understand the type of music in my playlist, pick out any interesting observations and compare them with the audio features of the top 100 songs of 2018.

Plot Artist Frequency

By observing this histogram, we can see how often artists appear in one of my selected playlists.
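For anyone who wants to reproduce a chart like this, here is a rough sketch of the counting and plotting. It assumes each track is a dict with an "artists" list of names; that shape, and the function names artist_counts and plot_artist_frequency, are my own choices for illustration. The plot uses Seaborn, but any bar chart would do.

```python
from collections import Counter


def artist_counts(tracks):
    """Count how many tracks each artist appears on."""
    counts = Counter()
    for track in tracks:
        counts.update(track["artists"])  # a track can credit several artists
    return counts


def plot_artist_frequency(counts, top_n=15):
    """Horizontal bar chart of the most frequent artists."""
    # Imported here so the counting helper works without plotting libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

    top = counts.most_common(top_n)
    names = [name for name, _ in top]
    values = [count for _, count in top]
    ax = sns.barplot(x=values, y=names, color="steelblue")
    ax.set_xlabel("Number of tracks")
    ax.set_ylabel("Artist")
    plt.tight_layout()
    plt.show()
```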

Audio Features

Now let’s look at the audio features of the songs in this playlist. Spotify has collated a list of audio features for every song on Spotify! Here is a summary of the features we will be using below:

Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content.

Energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

Speechiness: “Speechiness detects the presence of spoken words in a track”. A song with speechiness above 0.66 is probably made up almost entirely of spoken words. A score between 0.33 and 0.66 suggests a song that may contain both music and speech, and a score below 0.33 most likely means music with little or no speech.

Danceability: “Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.”

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
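Spotipy exposes these values through its audio_features call, which takes a list of track IDs. The sketch below batches the IDs and keeps only the seven features described above. It is an illustration under a few assumptions: the helper name fetch_audio_features is mine, I am assuming a limit of 100 IDs per request, and whether the endpoint is available depends on the access granted to your Spotify app.

```python
FEATURES = [
    "danceability", "energy", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence",
]


def fetch_audio_features(sp, track_ids, batch_size=100):
    """Return one dict of audio features per track ID.

    `sp` is a spotipy.Spotify client. IDs are sent in batches because the
    endpoint is assumed to accept at most 100 IDs per request.
    """
    rows = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        for feats in sp.audio_features(batch) or []:
            if feats is None:  # no analysis available for this track
                continue
            row = {"id": feats["id"]}
            row.update({name: feats[name] for name in FEATURES})
            rows.append(row)
    return rows
```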

Distribution of music styles in my playlist

From the observations:

The songs in my playlist are spread widely across danceability, and there are not too many ‘happy’ songs, as shown by the high frequency of songs below 0.5 in valence. So you can say I like danceable songs (which is true!).
There is a steep downward slope for speechiness, instrumentalness and, to a lesser extent, liveness. This tells us that the music in my playlist is generally not very speechy, includes almost no instrumentals, and has few songs recorded with a live audience.
Acousticness appears approximately evenly distributed between 0 and 1, indicating that I had no preference for this attribute in my music selection. (I generally like acoustic songs but I wouldn’t go out looking for every acoustic cover of a song.)
Finally, energy appears roughly normally distributed, with the tails at both ends showing that very low-energy and very high-energy songs are less likely to be added to my playlist. So basically, I like songs with average energy.
My songs are generally not that popular -__-
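The observations above come from reading one histogram per audio feature. A minimal sketch of how such plots, and a quick numeric check like "how many songs are below 0.5 in valence", could be produced is shown below. It assumes the features are already in a list of dicts keyed by feature name; the names share_below and plot_feature_histograms are hypothetical, and matplotlib stands in for whichever plotting library you prefer.

```python
FEATURES = [
    "danceability", "energy", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence",
]


def share_below(rows, feature, threshold=0.5):
    """Fraction of tracks whose value for `feature` is below `threshold`."""
    values = [row[feature] for row in rows]
    if not values:
        return 0.0
    return sum(value < threshold for value in values) / len(values)


def plot_feature_histograms(rows, features=FEATURES, bins=10):
    """Draw one histogram per feature, all on the same 0 to 1 scale."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plots

    fig, axes = plt.subplots(
        1, len(features), figsize=(3 * len(features), 3),
        sharey=True, squeeze=False,
    )
    for ax, feature in zip(axes[0], features):
        ax.hist([row[feature] for row in rows], bins=bins, range=(0.0, 1.0))
        ax.set_title(feature)
    fig.tight_layout()
    plt.show()
```

For example, share_below(rows, "valence") gives the fraction of songs on the "less happy" half of the scale.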

Exploratory Data Analysis of Top 100 Songs of 2018

After downloading and importing the dataset from Kaggle into our application, I started by analyzing the most popular artists in the dataset by the number of times they appeared on the list.
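Here is a small sketch of that loading and counting step using only the standard library. The column names ("name", "artists" and the seven audio features) are what I would expect in the Kaggle file, so treat them as assumptions and check them against your download. The function names are mine as well.

```python
import csv
from collections import Counter

FEATURES = [
    "danceability", "energy", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence",
]


def load_top_tracks(lines):
    """Parse CSV lines (an open file works) into one dict per track.

    Assumes columns named "name", "artists" and one per audio feature.
    """
    rows = []
    for record in csv.DictReader(lines):
        row = {"name": record["name"], "artists": record["artists"]}
        for feature in FEATURES:
            row[feature] = float(record[feature])  # CSV values arrive as text
        rows.append(row)
    return rows


def top_artists(rows, n=5):
    """Most frequent values of the "artists" column with their track counts."""
    return Counter(row["artists"] for row in rows).most_common(n)


# Usage, assuming the download was saved as "top2018.csv" (file name is a guess):
# with open("top2018.csv", newline="", encoding="utf-8") as f:
#     rows = load_top_tracks(f)
# print(top_artists(rows))
```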

Artist in Top 100 Songs of 2018 by frequency

Artists With Highest Appearance in Top 100 Songs 2018

Post Malone (src:latimes.com) and XXXTENTACION (src:thesource.com)

Now let’s explore the audio features of the top 100 songs in our dataset to see what they look like! We will plot the same histogram as we did for my playlist so we can compare them later.

Distribution of music styles in the top 100 songs of 2018

By observing the histogram we can see that tracks in the top 100 charts are:

Very high in danceability and energy but low in liveness, speechiness and acousticness (we can already see some signs that my playlist isn’t as cool as the top 100 😞).

For example, Drake’s “In My Feelings”, which is in our dataset, is highly danceable and also has a relatively high energy value.

Finally, I decided to plot a radar chart of the top 100 songs and superimpose the audio features of my playlist for easy comparison.

The Top 100 songs from Spotify are in blue while my top songs are in orange.
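A radar chart like this can be built by averaging each audio feature per group and plotting the two sets of averages on the same polar axes. The sketch below shows one way to do it with plot.ly. Using the mean as the summary, and the names feature_means and radar_figure, are my assumptions rather than a record of the exact code behind the chart.

```python
FEATURES = [
    "danceability", "energy", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence",
]


def feature_means(rows, features=FEATURES):
    """Average value of each audio feature across a list of track dicts."""
    if not rows:
        raise ValueError("need at least one track to compute averages")
    return {f: sum(row[f] for row in rows) / len(rows) for f in features}


def radar_figure(top100_means, my_means, features=FEATURES):
    """Overlay two feature profiles on one radar chart."""
    import plotly.graph_objects as go  # imported lazily; only needed for plots

    theta = list(features) + [features[0]]  # repeat first axis to close the shape
    fig = go.Figure()
    for label, means, colour in (
        ("Top 100 of 2018", top100_means, "blue"),
        ("My playlist", my_means, "orange"),
    ):
        r = [means[f] for f in features] + [means[features[0]]]
        fig.add_trace(go.Scatterpolar(
            r=r, theta=theta, fill="toself", name=label,
            line=dict(color=colour),
        ))
    fig.update_layout(polar=dict(radialaxis=dict(visible=True, range=[0, 1])))
    return fig
```

Calling radar_figure(feature_means(top100_rows), feature_means(my_rows)).show() would then display the chart, with top100_rows and my_rows being the two lists of track features.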

Conclusion

So I think I got answers to both of the questions I asked at the start of this post. I got to see what my music looks like, and there does seem to be a DNA for hit songs. The audio features of my playlist are broadly similar to those of the top 100 songs, but I have more acoustic songs and a few more live ones.

Want to make a hit song? Make sure it is danceable, has a lot of energy and a bit of valence (positivity and a feel-good vibe).

I am quite happy with the results but I want to build on this in another post.

Here are my recommended next steps:

See how your playlist can be used to infer your personality and recommend adverts that you might like.
Use a machine learning clustering algorithm such as K-Means to find songs that are similar to yours, and use that to discover new songs you might like.
Use machine learning to predict the “Popularity” of songs based on their audio features.

You can get the code to the entire project on GitHub.

Thanks to Alvin Chung, Ashrith and John Koh for their helpful articles on this subject. Spotify and Spotipy, thanks for the awesome API and library!