
Music, Psychographics, and other interesting things

Let's dig into Spotify and understand ourselves a little better with the help of Python

Photo by RUBENSTEIN REBELLO on Unsplash

Rekindling the spirit of the old post, I continue down the rabbit hole of identifying the characteristics of the songs I prefer.

Essentially, I listen to a few songs while working, and one of them was streamed 221 times last year, as my Spotify 2020 Wrapped told me.

After an initial round of audio profiling and spectrogram analysis, I could gauge that I prefer songs with lower frequencies, tempos in the 115 to 130 BPM range, and a bright timbre. But that was all I could tell, and I wanted more qualitative, interpretable information.

So, I decided to use Spotify's developer API and registered as a developer at https://developer.spotify.com/dashboard/applications. You can create a dummy app, give it whatever name you want, and Spotify will spit out a client ID and a client secret key for you.

Of course! I will hide the client ID and my secret key. (Snapshot from my developer dashboard of Spotify)

You can use the client ID and the secret key to access the data for songs, playlists, and so on.

Once you are equipped with this arsenal, you need a target to attack, and here I will attack my own playlist. I merely need its URI, and Spotify provides it meekly.

Click the three horizontal dots next to a playlist's name, go to Share, and from there you can copy the Spotify URI, which will be of the format ‘spotify:playlist:r34c3tgghrfhwforo45’.
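As an aside, spotipy is usually happy to take the full URI as-is, but if you ever need the bare id, a tiny helper like the following does the trick (my own addition; the URI below is the same dummy value):

# A small helper (my own addition) to pull the bare playlist id out of
# either a URI or a shared URL
def extract_playlist_id(uri_or_url):
    # 'spotify:playlist:<id>' -> take the last colon-separated part
    if uri_or_url.startswith('spotify:'):
        return uri_or_url.split(':')[-1]
    # 'https://open.spotify.com/playlist/<id>?si=...' -> strip path and query
    return uri_or_url.rstrip('/').split('/')[-1].split('?')[0]

print(extract_playlist_id('spotify:playlist:r34c3tgghrfhwforo45'))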


Let's go!

1. Data Extraction and Preprocessing

Load the relevant libraries first.

#Imports
import pandas as pd
import numpy as np
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from math import pi
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
%matplotlib inline

Provide your playlist URI, client ID, and client secret; the Spotify Web API credentials manager will create an authenticated spotipy object from them.

pl_id = 'your-playlist-id'              # your own playlist id
client_id = 'your-client-id'            # your own client id
client_secret = 'your-client-secret'    # your own client secret
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
#It will fetch the contents of the playlist as a JSON object
results = sp.playlist_items(pl_id)
#Prints the entire JSON object
print(json.dumps(results, indent = 4))
Image by author

At the bottom of the JSON object, you can see the total number of songs in your playlist.
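You can also read that count programmatically: the paging object returned by playlist_items carries it in a 'total' field.

# The paging object exposes the playlist size directly
print(results['total'])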

Let's extract the track IDs and the respective names from the playlist. Spotipy has an ample number of examples in its GitHub repo; please check them out.

Spotify's Web API doesn't allow reading more than 100 items at a time, so one has to keep updating an offset variable.

# Offset is for the JSON object, i.e. from where the seek should be and from where it should start reading the records.
offset = 0
track_ids = []
track_names = []
# I chose the number 1100 because the total number of tracks is 1009
while offset < 1100:
    response_n = sp.playlist_items(pl_id, fields = 'items.track.name', offset = offset)
    response_i = sp.playlist_items(pl_id, fields = 'items.track.id', offset = offset)
    offset = offset + 100
    for item in response_n['items']:
        track = item['track']
        # Unavailable tracks come back as None; append the name (or None) so the
        # two lists stay aligned, and drop incomplete rows in the dataframe later
        track_names.append(track['name'] if track else None)
    for item in response_i['items']:
        track = item['track']
        track_ids.append(track['id'] if track else None)
# Length of both track_ids and track_names should be equal.
len(track_ids)
len(track_names)
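If you would rather not hard-code the playlist length, a more general sketch is to follow the API's own cursor. I am assuming the fields filter syntax 'items(track(id,name)),next' here, which requests the 'next' URL alongside the items so that spotipy's sp.next() can page through the results:

# Follow the API's cursor instead of hard-coding the number of tracks
results = sp.playlist_items(pl_id, fields = 'items(track(id,name)),next')
track_ids, track_names = [], []
while results:
    for item in results['items']:
        track = item['track']
        track_ids.append(track['id'] if track else None)
        track_names.append(track['name'] if track else None)
    # sp.next() fetches the page pointed to by results['next'], if any
    results = sp.next(results) if results['next'] else None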

Cool! So, I have the track names and track IDs here. Let me push them into a dictionary and create a pandas dataframe. There will also be a few null values in this dataframe: I have travelled from one place to another, and many songs that I added to my playlist in 2012 might not be available in the region I am in now. (I really have a bone to pick with Spotify on this matter. Why can't I see my own content in different geographies even as a premium user, and why am I plastered with local channels and playlists in a language I don't even know or enjoy?)

#Create a dataframe
track_recs = {'Track': track_ids, 'Name': track_names}
df_interim = pd.DataFrame(track_recs, columns = ['Track', 'Name'])
df_interim.shape
#Remove rows with missing ids or names (unavailable tracks), and reset the
#index so later column assignments line up row by row
df_interim = df_interim.dropna().reset_index(drop = True)
df_interim.shape

Let’s see how the dataframe looks.

Image by author

2. Feature Extraction

Now comes the meat of the process. Spotify exposes many quantitative audio features for each track, such as energy, danceability, speechiness, and tempo. (Popularity lives on the track object rather than in the audio features, which is why it is left out below.)

#Extract the quantitative audio features for the tracks
feature_keys = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                'liveness', 'loudness', 'speechiness', 'tempo', 'valence']
tracks = {key: [] for key in feature_keys}
tracks['uri'] = []
tracks['duration'] = []
for track in df_interim['Track']:
    features = sp.audio_features(track)
    for key in feature_keys:
        tracks[key].append(features[0][key])
    #popularity is not part of audio_features, so it is skipped here
    tracks['uri'].append(features[0]['uri'])
    tracks['duration'].append(features[0]['duration_ms'])
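As an aside, audio_features also accepts a list of up to 100 track ids per call, so the whole playlist can be processed in roughly a dozen requests instead of one per track; a sketch:

# Batched variant: up to 100 ids per audio_features call
ids = df_interim['Track'].tolist()
all_features = []
for start in range(0, len(ids), 100):
    all_features.extend(sp.audio_features(ids[start:start + 100]))
# Each element is a dict of features, so a dataframe is one call away
features_df = pd.DataFrame(all_features)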

It's easy to create a dataframe from the dictionary above. I used a lame name for my dataframe 🙂 Mea culpa! Add the track names as well, from one of the previous steps.

dataframe = pd.DataFrame.from_dict(tracks)
dataframe['name'] = df_interim['Name']
dataframe
Image by author

Hmm…there is a lot of quantitative info here, interesting!!

3. TSNE representation

I want to see how these songs look in a reduced-dimensional space; essentially, I want to create an embedding of these features. Hmm… but I only know of word embeddings such as GloVe and word2vec.

Wait! Isn't a TSNE representation a kind of embedding too? I believe so. OK, let's try this out.

First I need to drop the non-numerical columns, i.e. URI and name. Another requirement is to scale the data; I will use the standard scaler for ease of use. Standard scaling subtracts each feature's mean and divides by its standard deviation, giving every feature zero mean and unit variance; this keeps large-scale features such as tempo and loudness from dominating the distance computations, although it does not actually make the data Gaussian.

df = dataframe.drop(['uri', 'name'], axis = 1)
df_s = StandardScaler().fit_transform(df)
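A quick sanity check confirms the scaling worked:

# Every scaled column should now have roughly zero mean and unit std
print(df_s.mean(axis = 0).round(2))
print(df_s.std(axis = 0).round(2))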

Now, I am ready to compute the TSNE. As I want to plot a 2D graph, I will use the number of components as 2.

tsne_results = TSNE(n_components = 2, verbose = 1, perplexity = 50, n_iter = 10000).fit_transform(df_s)
tsne_results = pd.DataFrame(tsne_results, columns = ['tsne1', 'tsne2'])
# Carry over the metadata so the plots can be coloured and labelled
tsne_results['name'] = dataframe['name']
tsne_results['energy'] = dataframe['energy']
tsne_results['acousticness'] = dataframe['acousticness']
tsne_results['valence'] = dataframe['valence']
tsne_results['tempo'] = dataframe['tempo']
tsne_results['uri'] = dataframe['uri']

Computing the TSNE takes a while and produces logs like the following:

Image by author

Let's look at the ‘tsne_results’ dataframe that we will use for plotting.

Image by author

To plot TSNE, I can use Matplotlib, Seaborn, Bokeh, or Plotly.

Let’s see what Matplotlib yields.

def plot_tsne():
    plt.scatter(tsne_results['tsne1'], tsne_results['tsne2'])
    plt.show()
plot_tsne()
Why do you do this? Always this blue colour! (Image by author)

There could be some insights hidden here, so I need to add a few colours.

def plt_tsne():
    sns.FacetGrid(tsne_results, hue = 'tempo', height = 8).map(plt.scatter, 'tsne1', 'tsne2' )
    plt.show()
plt_tsne()
(Image by author)

So, there is a pattern after all. A Plotly-based TSNE plot is more revealing: the slow-paced songs sit in the bottom-right corner, while the others are spread throughout. It could be that my playlist is biased towards one type of song. That isn't so bad, to be honest; playlists are biased by nature.

def plot_tsne():
    fig = px.scatter(tsne_results, x = "tsne1", y = "tsne2", color = "energy", hover_data = ["name"])
    fig.show()
plot_tsne()
I wish I could upload the HTML file so one could move the cursor around and see what each song is. (Image by author)

4. What do I like?

Now, the time has come to reveal the truth. I think that I prefer classic rock bands, thanks to overexposure to Linkin Park, System of a Down, Metallica, Creedence Clearwater Revival, etc., but let the data talk.

# This is quick and dirty data processing.
radar_n = ['acousticness', 'danceability', 'energy', 'valence', 'instrumentalness', 'liveness']
df_radar = df[radar_n]
df_radar.mean()
# Fill the list below with the mean values printed above; they are the mean
# values across all the songs (df_radar.mean().tolist() would also do).
r = [0.231858, 0.562153, 0.666474, 0.415243, 0.172336, 0.183510]
df_x = pd.DataFrame(dict(
    r=r,
    theta=radar_n))

Once that's done, I want to rescale the values to the range 0 to 1, essentially normalising them.

x = df_x[['r']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()
r_scaled = min_max_scaler.fit_transform(x)
lst = []
for val in r_scaled:
    lst.append(val[0])
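Since r_scaled is just an (n, 1) NumPy array, the loop above can also be collapsed into a one-liner:

# Equivalent to the loop above
lst = r_scaled.flatten().tolist()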

Now we have all the data to build a radar plot.

N = len(radar_n)
# One angle per feature, evenly spaced around the circle
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += angles[:1]
values = lst
values += values[:1]   # close the polygon by repeating the first value
def pol_plot():
    fig = plt.figure(figsize = (10, 10))
    ax = plt.subplot(polar = True)
    plt.polar(angles, values)
    plt.fill(angles, values, alpha = 0.3)
    plt.xticks(angles[:-1], radar_n)
    ax.set_rlabel_position(0)
    plt.yticks([0, 0.4, 0.6, 0.8], color = 'grey', size = 7)
    plt.show()
pol_plot()
The radar plot can reveal music preferences and personality traits (Image by author)
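As an aside, the same radar can be drawn with Plotly Express by reusing the df_x frame built earlier (with the raw means; swap in the scaled values if you prefer); a minimal sketch:

# Plotly Express alternative; line_close=True joins the last point to the first
fig = px.line_polar(df_x, r = 'r', theta = 'theta', line_close = True)
fig.update_traces(fill = 'toself')
fig.show()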

The big reveal is that my song choices cluster in the high-energy area, which indicates that EDM, poptimism, and modern rock, genres that are more beat-based and thus help with focus, dominate the playlist.

5. Let’s find similar songs

Calculating the cosine similarity between the TSNE coordinates helps in finding which songs are closest to each other.

#Let's add an index column first
tsne_results['index'] = np.arange(len(tsne_results))
cols = ['tsne1', 'tsne2']
tsne_results_cos = tsne_results[cols]
similarity = cosine_similarity(tsne_results_cos)
similarity.shape
# Change the name to the song you are searching for; it has to match the
# name in the playlist exactly.
idx = tsne_results[tsne_results['name'] == "Sansa"]
idx = idx['index'].values
# Note: these are cosine similarities, so higher values mean more similar
df_temp = pd.DataFrame(list(zip(tsne_results['name'], similarity[idx[0]].tolist())), columns = ['name', 'distance'])
df_temp['uri'] = tsne_results['uri']
df_temp.sort_values('distance', ascending = False)[0:15]
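To make the lookup reusable, here is a small wrapper of my own around the steps above, so any song in the playlist can be queried by its exact name:

# A small helper (my own sketch) around the similarity lookup above
def most_similar(name, top_k = 15):
    row = tsne_results[tsne_results['name'] == name]
    if row.empty:
        raise ValueError(f"'{name}' is not in the playlist")
    i = row['index'].values[0]
    out = pd.DataFrame({'name': tsne_results['name'],
                        'similarity': similarity[i],
                        'uri': tsne_results['uri']})
    return out.sort_values('similarity', ascending = False).head(top_k)

most_similar("Sansa")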

The df_temp dataframe contains the songs that are closest to the song I am interested in.

Here we go!! (Image by author)

Afterthoughts

This is interesting, because almost all these songs repeat an undertone that I fixate on.

Based on one's music preferences, one can also decipher the underlying psychographics.

To test this approach further, I should ideally have a larger set that includes songs I haven't listened to before; then I could assert whether this TSNE-based cosine similarity works. Another option is to develop a classifier that predicts the songs in a test dataset as like/dislike; a sketch follows below.
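For what it's worth, a minimal sketch of that classifier idea could look like the following, assuming a hypothetical 0/1 'liked' label existed for each track (random placeholder labels are used here just so the code runs):

# Minimal like/dislike classifier sketch with placeholder labels
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df_s  # the standard-scaled audio features from earlier
# Placeholder: real labels would have to be collected, e.g. saved vs skipped
y = np.random.randint(0, 2, size = len(X))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))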

The possibilities are endless, dive in!

PS: The code repo is on my GitHub; or holler at me on LinkedIn.

