Using Unsupervised Learning To Generate Artist Recommendations

With scikit-learn’s NearestNeighbors learner in Python

Aditya Mehta
Towards Data Science


In this article, I will explore the use of unsupervised machine learning to generate artist recommendations using data from Spotify. While many algorithms could be used for this purpose, the one considered here is the NearestNeighbors learner from scikit-learn in Python.

Let’s dive right into it!

1. The Data

We will use data from the Spotify Web API (which can be found here). The dataset spans the period from 1921 to 2020 and covers more than 27,000 artists!

For each artist, there are several fields that describe their music: loudness, acousticness, danceability, popularity, etc. (details on what these mean can be found here). We will use only these characteristics (and no other information, such as the artist's genre) to train our model, and then ask it to recommend artists similar to the one we specify.

2. The Algorithm

While many unsupervised learning algorithms can be used for this problem, we will consider one of the simplest: the NearestNeighbors learner. This algorithm essentially searches the n-dimensional space (n being the number of “features” or characteristics we give the model) for the points that are closest to the one we specify. For instance, if we gave the model only Loudness and Popularity, and asked it to recommend 5 artists similar to Eminem, it would look through all the artists and find the 5 with the most similar values for Loudness and Popularity.
A useful way to visualise this is to imagine a scatter plot with Loudness on the x-axis and Popularity on the y-axis. The x-y co-ordinates of each artist can be thought of as their “address”, and our problem boils down to finding the “nearest neighbours” (i.e. the artists that are the smallest “distance” away from Eminem in this 2-dimensional space). Here we will be working in an 11-dimensional space, but the same principle applies.
Fortunately, we don’t need to write the code that implements this algorithm ourselves. It has already been done for us, and we can use the scikit-learn package in Python directly.

Finding the “Nearest Neighbours” in a 2-dimensional plane - Image by author
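To make the 2-dimensional picture concrete, here is a minimal sketch in plain Python. The artist names and their (loudness, popularity) values are made up for illustration, and `nearest_neighbours` is a hypothetical helper, not part of scikit-learn. It simply measures the straight-line (Euclidean) distance from one artist to every other artist and keeps the closest ones.

```python
import math

# Hypothetical (loudness, popularity) values, made up for illustration only.
artists = {
    "Artist A": (-5.0, 80.0),
    "Artist B": (-6.0, 78.0),
    "Artist C": (-20.0, 30.0),
    "Artist D": (-5.5, 82.0),
    "Artist E": (-12.0, 55.0),
}


def nearest_neighbours(name, points, k):
    """Return the k names whose points are closest (Euclidean) to `name`."""
    target = points[name]
    # Distance from the target to every other artist, paired with the name.
    others = [
        (math.dist(target, xy), other)
        for other, xy in points.items()
        if other != name
    ]
    others.sort()  # smallest distance first
    return [other for _, other in others[:k]]


print(nearest_neighbours("Artist A", artists, 2))  # ['Artist D', 'Artist B']
```

This brute-force search compares the chosen artist against every other one, which is fine for a handful of points. scikit-learn's learner does the same job, but can use faster search structures on larger datasets.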

3. The Process

i. First, we load the data into Python. It can be downloaded as a CSV file from Kaggle.

ii. We will then decide which “features” or characteristics we should pass to the model.

iii. Some of these features may have skewed distributions. For example, most artists have a score near 0 for “instrumentalness”, while a few have a much larger score in comparison (see graph below). The features are also measured on very different scales. Distance-based algorithms such as nearest neighbours are sensitive to this, because a feature with a large range or extreme values can dominate the distance calculation. To work around this problem, we will standardise our data using the StandardScaler class from scikit-learn, which rescales each feature to zero mean and unit variance so that every feature contributes on a comparable scale. Note that standardising does not remove the skew itself. More details on the scaler can be found here.

Instrumentalness shows a skewed distribution: we need to fix this- Image by author

iv. After our data is scaled, we are ready to train our model. We pass the data to the Nearest Neighbours learner (documentation). While we could tune many of the technical parameters that this learner accepts, for this problem we will leave each one at its default value. The only thing we will tell our model is how many recommendations we want.

v. That’s it! We will now ask our model to scan through all the 27,000+ artists and recommend those similar to the one we specify.
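The scaling in step iii can be sketched on toy data as follows. The numbers are made up: the first column imitates a skewed feature such as instrumentalness, and the second a feature on a much larger scale. The sketch checks that `StandardScaler` applies (x - mean) / std to each column independently.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values, for illustration only).
# Column 0: mostly near 0 with one large value (skewed).
# Column 1: values on a much larger scale.
X = np.array([
    [0.00, 120.0],
    [0.01, 90.0],
    [0.02, 140.0],
    [0.00, 100.0],
    [0.90, 110.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same transformation written out by hand, column by column.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))  # the two results agree
print(X_scaled.mean(axis=0).round(6))  # each column now has mean ~0
print(X_scaled.std(axis=0).round(6))   # ...and standard deviation ~1
```

Without this step, the second column would dominate any Euclidean distance simply because its numbers are larger. After it, both columns carry comparable weight.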

Let’s get started:

import pandas as pd
import numpy as np
data_artist = pd.read_csv('./Spotify/data_by_artist.csv')
# This is a dataframe with 27621 rows (each row represents one artist) and 15 columns (each column holds a different field)
# Let's see the 10 most popular artists according to Spotify's algorithm
data_artist.sort_values(by='popularity', ascending=False).head(10)['artists']
Ten most popular artists on Spotify

We will now write a function that prepares the data, scales it, trains the model, and generates recommendations:

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

features = list(data_artist.columns[0:12])  # The artist name plus the features we want to pass to the model
df = data_artist[features]

def getArtist_recommendations(data, artist, numArtists):
    X = data.iloc[:, 1:]  # This contains all our features
    Y = data.iloc[:, 0]   # This contains the names of the artists

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)  # Each feature now has zero mean and unit variance

    # Ask for one extra neighbour, because each artist is its own nearest neighbour
    recommender = NearestNeighbors(n_neighbors=numArtists + 1).fit(X_scaled)  # Training the model took a single line of code :)
    distances, indices = recommender.kneighbors(X_scaled)
    output = pd.DataFrame(indices)
    output['Artist'] = Y

    # Pick the row of neighbour indices that belongs to the requested artist
    recommendations = output.loc[output.Artist == artist, output.columns != 'Artist']
    # Drop the first entry (the artist itself) and return the neighbours' names
    return Y[list(recommendations.values[0])][1:]
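The function above computes the neighbours of every artist and then looks up one row. A possible variation, sketched here on a tiny made-up table, queries the fitted model for the requested artist only and raises a clear error when the name is missing. `recommend_for` and the toy values are hypothetical and not part of the original code. The `kneighbors` method of `NearestNeighbors` accepts any array of query points, so passing a single row is enough.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Tiny made-up table standing in for data_by_artist.csv (illustrative values only).
toy = pd.DataFrame({
    "artists": ["A", "B", "C", "D", "E", "F"],
    "loudness": [-5.0, -5.5, -20.0, -6.0, -12.0, -19.0],
    "popularity": [80.0, 82.0, 30.0, 78.0, 55.0, 28.0],
})


def recommend_for(data, artist, num_artists):
    """Return the names of the num_artists nearest neighbours of `artist`."""
    names = data.iloc[:, 0].reset_index(drop=True)
    X_scaled = StandardScaler().fit_transform(data.iloc[:, 1:])

    matches = names[names == artist].index
    if len(matches) == 0:
        raise ValueError(f"Artist not found: {artist!r}")
    row = matches[0]

    # One extra neighbour, since the artist is normally returned as its own neighbour.
    model = NearestNeighbors(n_neighbors=num_artists + 1).fit(X_scaled)
    # Query with a single row (shape 1 x n_features) instead of the whole table.
    distances, indices = model.kneighbors(X_scaled[[row]])

    # Remove the artist itself by index rather than by position.
    neighbours = [i for i in indices[0] if i != row][:num_artists]
    return names.iloc[neighbours].tolist()


print(recommend_for(toy, "A", 2))
```

Filtering by index rather than dropping the first result is a small safeguard: if two artists have identical feature values, the requested artist is not guaranteed to come first in the neighbour list.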

That’s it! Let’s test how the model performs. We will ask it to recommend 10 artists that are similar to Eminem, Kendrick Lamar and Akon.

getArtist_recommendations(df,'Eminem',10)
getArtist_recommendations(df,'Kendrick Lamar',10)
getArtist_recommendations(df,'Akon',10)
Recommended Artists Similar to Eminem- Image by author
Recommended Artists Similar to Kendrick Lamar- Image by author
Recommended Artists Similar to Akon- Image by author

Happy listening :)
