
Spotify Case Study: Is there a secret to producing hit songs?

Can I launch myself to stardom with my garage song?

Photo by Yomex Owo on Unsplash

After several months enrolled in a Data Science program and preparing myself for a career transition, I discovered that to become truly successful in this field, one should apply acquired knowledge to projects. Ideally, each project should answer a question by analyzing data.

However, one may encounter several new problems: Does this question match my current skillset or is it too ambitious? Where should one start? Is this question even relevant?

Here, I’m proposing a more beginner-friendly approach: answer a question that has already been answered, but add one’s own personal touch to it.

I will walk you through one of my very first Exploratory Data Analysis (EDA) projects, based on a Spotify music dataset, to give this approach some context.

You can find this dataset here (tracks.csv).

Part 1: Data Manipulation and Cleaning

Firstly, we must import the required libraries for this project:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import datetime
import calendar

Next, let’s read our DataFrame and take a look at a first sample:

df = pd.read_csv('../Spotify Tracks.csv')
df.sample(10)

Now that we’ve seen the columns for the different kinds of variables in our DataFrame, let’s take a look at its data types to see if some wrangling is required:

df.info()

Interesting: it seems the release_date column is stored as a string. We'd better convert it to a date to ensure a smoother analysis. Let's also create new columns for the track's release year and month.

For this, we will use the pd.to_datetime() function to convert this column to a date, and the .dt accessor to extract its year and month. We'll also use a lambda function to assign every month its corresponding name with the calendar library:

df['release_date'] = pd.to_datetime(df['release_date'])
df['release_year'] = df['release_date'].dt.year
df['release_month'] = df['release_date'].dt.month
df['release_month'] = df['release_month'].apply(lambda x : calendar.month_name[x])
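
As an aside, pandas also has a built-in way to get month names. The sketch below runs on a small made-up DataFrame (the three dates are placeholders, not rows from tracks.csv) and uses Series.dt.month_name(), which should give the same names as the lambda with the calendar library:

```python
import pandas as pd

# Toy data for illustration only; these rows are not from tracks.csv
toy = pd.DataFrame({'release_date': ['2020-03-20', '1994-04-08', '2021-01-08']})

# Convert the strings to datetimes, then pull out the year and the month name
toy['release_date'] = pd.to_datetime(toy['release_date'])
toy['release_year'] = toy['release_date'].dt.year
toy['release_month'] = toy['release_date'].dt.month_name()

print(toy)
```

Either approach works; this one just skips the intermediate month-number column.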

Let’s take another look:

df.info()

This looks right! However, the id and id_artists columns won't be that useful for our analysis, so we'll take them out using the .drop method:

columns_to_drop = ['id','id_artists']
df.drop(columns = columns_to_drop, inplace = True)
df.sample(5)

Alright, now that we have the right columns, let’s do one last thing before we start with the fun stuff: check duplicates.

df.duplicated().value_counts()

Getting rid of the duplicates:

df.drop_duplicates(inplace = True)

df.duplicated().value_counts()

Worked like a charm. Now let’s get to the good stuff.

Part 2: Data Exploration and Analysis

Alright, so now we should formulate the main question that will drive our analysis: "Can anyone release a hit song based on track attributes data?"

We’ll set the style and color palettes for our charts (it has to be green if we are dealing with Spotify data, right?)

sns.set_palette('BuGn_r')
sns.set_style('darkgrid')

Let's start by looking at the distribution of tracks by popularity:

plt.figure(figsize = (14,8))
sns.histplot(df['popularity'], kde = True)
plt.title('Distribution of Tracks by Popularity')
plt.show()

Interesting: more than 45k songs are in a popularity graveyard, and most of the songs fall between approximately 1 and 40 points of popularity.

It seems that the music market can be quite competitive right now, huh? But has it always been like this? Let’s take a look:

plt.figure(figsize = (20, 8))
sns.countplot(data = df, x = 'release_year', color = '#1DB954')
plt.title('Distribution of Tracks Produced Over Time')
plt.xticks(rotation = 90)
plt.show()

Looks like it’s getting more and more competitive year after year, with nearly 140k songs produced in 2020, a year where many people had a lot of free time at home. This dataset contains data for tracks until April 2021, so we’ll have to wait and see how many songs will be produced this year.

Now let's take a look at some overall metrics in our dataset. We'll use the .describe() method for that, passing a list of percentiles so the summary shows how each variable is distributed across them:

df.describe([.01,.1,.2,.3,.4,.5,.6,.7,.8,.9,.99])

There are several findings in this table, but what really caught my attention is the big leap (19 points) between the 90th and 99th percentiles of the popularity variable, compared to the gaps between the previous percentiles. So it seems that a few great hits are close to scoring 100 in popularity.
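
To put a number on this kind of gap, the difference between two percentiles can be computed directly with quantile(). The sketch below uses made-up popularity scores (not the real column), so whatever it prints is illustrative only:

```python
import pandas as pd

# Made-up, skewed popularity scores for illustration only
popularity = pd.Series([0, 0, 1, 5, 12, 20, 27, 33, 41, 48, 95])

# 90th and 99th percentiles, and the jump between them
p90 = popularity.quantile(0.90)
p99 = popularity.quantile(0.99)
print(f'p90={p90:.1f}, p99={p99:.1f}, gap={p99 - p90:.1f}')
```

On the real data, the same calculation on df['popularity'] should reproduce the 19-point leap seen in the table.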

This means that there is a select group of tracks that are quite popular on Spotify. Is it possible to get there by putting the right chords and rhythm into our song? Let's plot a correlation chart to find out:

#First we make a list with the track attributes we want to compare
track_attributes = ["popularity","acousticness",
                    "danceability", 
                    "energy", 
                    "duration_ms", 
                    "instrumentalness", 
                    "valence", 
                    "tempo", 
                    "liveness", 
                    "loudness", 
                    "speechiness"]
#Then we plot
plt.figure(figsize = (10,8), dpi = 80)
sns.heatmap(df[track_attributes].corr(),vmin=-1, vmax=1, annot=True, cmap = 'BuGn_r' )
plt.xticks(rotation = 45)
plt.show()

We can see in this heatmap that there are no strong correlations between popularity and the track's attributes. Still, it is worth diving deeper into the three attributes that showed a positive correlation: danceability, energy and loudness.
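
If the heatmap is hard to read at a glance, the popularity column of the correlation matrix can also be sorted. This is a minimal sketch on synthetic data: the toy DataFrame and its random values are assumptions for illustration, and the fake popularity is deliberately built to lean on danceability.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tracks DataFrame; values are random, not real
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'danceability': rng.random(200),
    'energy': rng.random(200),
    'acousticness': rng.random(200),
})
# Build a fake popularity score that depends partly on danceability
toy['popularity'] = 40 * toy['danceability'] + rng.normal(0, 10, 200)

# Correlation of every attribute with popularity, strongest first
pop_corr = (
    toy.corr()['popularity']
       .drop('popularity')
       .sort_values(ascending=False)
)
print(pop_corr)
```

With the real data, replacing toy with df[track_attributes] would list the attributes from most to least positively correlated with popularity.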

Let’s give it a shot:

corr_vars = ['danceability', 'energy', 'loudness']

list(enumerate(corr_vars))
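
For readers new to enumerate, here is a short sketch of what it returns: each item paired with its index. The start=1 argument in the loop is an optional variation, handy because subplot positions are 1-based.

```python
corr_vars = ['danceability', 'energy', 'loudness']

# enumerate() yields (index, value) pairs; list() collects them
pairs = list(enumerate(corr_vars))
print(pairs)

# Unpacking each pair gives a 1-based position and a column name
for position, column in enumerate(corr_vars, start=1):
    print(position, column)
```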

So after listing the variables we want to inspect and pairing each one with its index using enumerate, we plot those three attributes, taking a random sample of 500 occurrences for each chart:

plt.figure(figsize = (20, 15))

for i, var in enumerate(corr_vars):
    plt.subplot(1, 3, i + 1)
    sns.regplot(data = df.sample(500), y = 'popularity', x = var)
plt.show()

This seems to confirm what we first saw in our correlation heatmap, and it also reveals something quite interesting for our analysis: most of the high-popularity outliers are found within the highest ranges of the three attributes, especially loudness. This might be a huge stepping stone towards answering our main question.

It would be best to subset our data to get the most popular songs, so we can see how present these attributes are:

popular_songs = df[df['popularity'] >= 80]

print(popular_songs.shape)

popular_songs.sample(10)

Here we have 954 songs. Let's see what they have in common by plotting the mean of each attribute:

#First we list the attributes we want to see reflected in the plot
labels = [ "valence", "danceability", "energy", "acousticness","instrumentalness", "liveness","speechiness"]
#Then we plot those attributes by mean
fig = px.line_polar(popular_songs, theta = labels, r = popular_songs[labels].mean(), line_close = True)

fig.show()

This is great news, as we found one more attribute that may help us produce our hit song: valence.

On another note, loudness is quite a strong contender: its values range from -60 to 0, so a mean of -6 indicates that the most popular songs tend to be quite loud.
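
Loudness sits on a different scale from the 0-to-1 attributes in the radar chart, so one way to compare it with them is to rescale it. The sketch below assumes the -60 to 0 range mentioned above and uses plain min-max scaling; the function name scale_loudness is made up for this example.

```python
def scale_loudness(loudness, low=-60.0, high=0.0):
    """Min-max scale a loudness value from [low, high] to [0, 1]."""
    return (loudness - low) / (high - low)

# Rescale the mean loudness of the popular songs (-6) to the 0-to-1 axis
scaled_mean = scale_loudness(-6.0)
print(round(scaled_mean, 2))
```

On that rescaled axis, a mean of -6 sits at 90% of the way to the maximum, which is another way of seeing how loud the popular songs are.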

Now, let's take a look at how these attributes have featured in the most popular songs over time:

#First we make a list of our attributes of interest
audio_attributes = ["danceability","energy","valence"]
#Now we plot our charts
plt.figure(figsize = (20, 8))
sns.set_palette('Accent')
#Top chart: mean of each audio attribute per release year
plt.subplot(2,1,1)
for attribute in audio_attributes:
    x = popular_songs.groupby("release_year")[attribute].mean()
    sns.lineplot(x=x.index,y=x,label=attribute)
plt.title('Audio Attributes Over Time for Popular Songs')
#Bottom chart: mean loudness per release year
plt.subplot(2,1,2)
loudness_by_year = popular_songs.groupby('release_year')['loudness'].mean()
sns.lineplot(x = loudness_by_year.index, y = loudness_by_year)
plt.ylabel('Range')
plt.xlabel('Year')
plt.show()

Seems like grunge music really made valence go down during the 90s, huh?

Popular songs seem to have been quite energetic and loud, as well as happy, throughout the last 50 years. This looks like a clear indicator of people's music taste.

So, anyone who produces a song with these attributes will surely get launched to stardom, right?

Let's take a look at the top 25 artists with songs that are currently popular and compare them with a random sample:

popular_songs['artists'].value_counts().nlargest(25)
popular_songs['artists'].value_counts().sample(30)

Seems like Justin Bieber and Billie Eilish are quite popular right now, with 11 current hit tracks! But this is not a surprise, as all of the artists in the top 25 list are very well known, as are many of the ones in the sample list.

Part 3: Conclusion

Producing a hit track won't necessarily depend on how happy, energetic or loud your song is; more likely, it is related to your current popularity as an artist.

Still, to increase the odds of launching a popular song, it might be good to give it popular attributes, just like the ones we saw in our analysis. It could also be good advice for new musicians to really invest time and effort into marketing their content.

Thanks for reading!

Find me on LinkedIn

