The world’s leading publication for data science, AI, and ML professionals.

ML Step-by-Step: Using KNN Algorithm to Classify Spotify Songs into Playlists

In this article I implement the KNN algorithm step by step to create a model that can sort songs into playlists based on their attributes.

Photo by Travis Ye well on Unsplash

As a song enthusiast, I thought the best way to get my hands into Machine Learning and Data Science was to meddle in my Spotify music data. The broad concept behind this project is to create a model that can take any song and, depending on its musical attributes, classify it into the playlist whose songs share the most similar qualities. Instead of sorting it myself, the algorithm will do it for me!

The 2 playlists I used for this model were my "happy playlist", filled with upbeat, high-energy songs, and my "euphonious playlist", which contains mostly slow, beautiful, and peaceful songs.

For this project, I used Jupyter Notebook with the ML library scikit-learn. I also used my personal Spotify playlists, but you could also use Spotify-curated playlists (they probably have more diversified songs, so I tried choosing playlists of mine that are diversified in artists as well).

You can find my code here on GitHub.

An overview of what I am covering:

  1. Brief Overview of KNN
  2. Accessing Spotify Credentials/Scraping Data
  3. Libraries/Reading in Datasets/Graphing
  4. Refining Datasets
  5. Creating the model
  • standardizing data
  • splitting data into training and testing sets
  • training the model
  • making predictions and evaluating the predictions
  • adjusting the K-value for a stronger model
  6. Conclusion

1. Brief Overview of KNN

KNN (a.k.a. K Nearest Neighbors) is a type of Machine Learning algorithm that classifies a new data point based on how its neighboring points are classified. In the diagram below, A is the new data point. K = the number of nearest points (neighbors) around A. In this case, K=7. The model identifies the 7 points closest to A (smallest distance). Based on the classification of the 7 surrounding points, data point A is classified as Class Z because more of its 7 nearest neighbors belong to Class Z than to either of the other 2 classes. In other words, A is more similar to the Class Z data points. KNN, in simple terms, classifies a new point by its similarity to its surrounding points.
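To make the voting idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept, not how scikit-learn implements KNN internally; the function name knn_classify and the sample points are made up for this example.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k):
    """Label new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D points for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
labels = ["X", "X", "X", "Z", "Z", "Z", "Z"]
predicted = knn_classify(points, labels, new_point=(7, 7), k=3)
print(predicted)
```

The new point (7, 7) sits next to the Class Z cluster, so its 3 nearest neighbors are all labeled Z and that is the label it receives.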

Image by Author

Time to Start the Project

2. Accessing Spotify Credentials

A more detailed overview of how I accessed my Spotify Credentials can be found in my last article post where I go through the code step by step. You can also find the code to it here.

Essentially, I go to Spotify for Developers and use my Spotify log in to sign in to My Dashboard. Hit "Create an App" and once you name it you should see your Client ID and Client Secret on the left of the screen. From there you can paste the Client information in the appropriate places in the code.

To access the Spotify URI for a playlist, go to the wanted playlist, click the 3 dots at the top and, while hovering over the "Share" option, press the alt key to see the option "Copy Spotify URI" and click it. You can also do this without using the alt key by choosing "Copy link to playlist", but you will need to cut out everything but the code at the end of the URL.
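If you would rather not trim the link by hand, a small helper can pull the code out for you. This is a rough sketch that assumes the two shapes I have seen: a URI like spotify:playlist:<id> and a link like https://open.spotify.com/playlist/<id>?si=... The function name playlist_id and the placeholder ID are my own, not part of any Spotify library.

```python
from urllib.parse import urlparse

def playlist_id(link_or_uri):
    """Return the playlist ID from a Spotify URI or a copied playlist link."""
    if link_or_uri.startswith("spotify:"):
        # URI form: the ID is the last colon-separated piece.
        return link_or_uri.split(":")[-1]
    # Link form: urlparse separates the path from the "?si=..." query string,
    # so only the last path segment (the ID) is kept.
    path = urlparse(link_or_uri).path
    return path.rstrip("/").split("/")[-1]

# "PLAYLIST_ID_HERE" is a placeholder, not a real playlist ID.
from_uri = playlist_id("spotify:playlist:PLAYLIST_ID_HERE")
from_link = playlist_id("https://open.spotify.com/playlist/PLAYLIST_ID_HERE?si=abc123")
print(from_uri, from_link)
```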

Once you have inserted your information into the right places in the code, you should be able to download a csv file containing the music attributes of each song in your chosen playlist.

There are multiple methods of accessing Spotify credentials and data but this is how I did it with help from this article.

3. Libraries/Reading in Datasets/Graphing

The goal for this step is to import the needed libraries and visualize the differences in musical attributes between the playlists to see which attributes we should keep or eliminate.

from sklearn.neighbors import NearestNeighbors
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

Libraries: pandas and NumPy are for data exploration, and matplotlib and seaborn are for data visualization. Sklearn is for carrying out the KNN algorithm.

Reading in Datasets: I first uploaded the 2 csv files to Jupyter Notebook (1 file for each playlist I was using) and used pandas to read in the datasets. "happy" and "euphon" are the names of my playlist objects.

happy = pd.read_csv("name of file.csv")
euphon = pd.read_csv("name of file.csv")

Graphing: I then made very simple graphs for each Spotify music attribute with matplotlib so I could distinguish the differences between the playlists. I kept the variables that showed distinct, considerable gaps between the 2 lines in each graph as well as a consistent pattern for each line. For my playlists, I noticed consistent patterns and gaps in energy, acousticness, and valence.

Red = euphonious playlist

Blue = happy playlist

All 3 Images by Author

Visually, you can see patterns where the euphonious playlist typically has low energy, high acousticness, and low valence, whereas the happy playlist has high energy, low acousticness, and higher valence values. The other attributes had very mixed graphs and no consistent pattern that's unique to each playlist. Removing the music attributes that don't have distinct patterns results in a stronger model because there's less noise and it's easier for the model to distinguish between the 2 playlists.

4. Refining Datasets

Now that I narrowed down my dataset to these 3 definite attributes, I started to create and refine my final dataset. For this, I found that the most efficient option for me was to use Google Sheets.

I first imported the .csv file for each playlist into a different tab of the sheet. I then took the 3 attributes from each dataset (energy, acousticness, and valence) and merged them into 1 dataset, removing all other attributes (song names, artists, danceability, etc.). The values were originally small decimals, so I multiplied each column by 100 to work with whole numbers. (NumPy and scikit-learn also accept decimal values, so this step is optional.)

Finally, I added a "Target" column which identifies what playlist each data point belongs to. 1 = euphonious playlist; 2 = happy playlist. The target column serves as the output (playlist) while the attributes of energy, acousticness, and valence serve as the inputs of the model.
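I did this step in Google Sheets, but the same merge can be sketched in a few lines of plain Python. The rows below are made up and stand in for what you would get from reading each playlist csv; the function name label_rows is hypothetical, and this sketch leaves the values as decimals rather than multiplying them by 100.

```python
FEATURES = ["energy", "acousticness", "valence"]

def label_rows(rows, target):
    """Keep only the chosen attributes and attach the playlist label."""
    labeled = []
    for row in rows:
        kept = {name: float(row[name]) for name in FEATURES}
        kept["target"] = target
        labeled.append(kept)
    return labeled

# Made-up rows standing in for the two playlist csv files.
euphon_rows = [
    {"name": "song a", "energy": "0.21", "acousticness": "0.90",
     "valence": "0.15", "danceability": "0.40"},
]
happy_rows = [
    {"name": "song b", "energy": "0.85", "acousticness": "0.05",
     "valence": "0.80", "danceability": "0.70"},
]

# 1 = euphonious playlist, 2 = happy playlist, matching the Target column.
merged_rows = label_rows(euphon_rows, 1) + label_rows(happy_rows, 2)
print(merged_rows)
```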

Image by Author

5. Now it’s time to create and train the model!

Essentially, all we are doing is splitting the entire dataset into 2 subsets (training and testing) and storing the training set in a model that will be used when classifying new data points. Training the model means fitting the class labels (outputs) to their attributes (inputs).

After merging all my data together, I downloaded the sheet as a csv file and uploaded it into Jupyter Notebook.

Normalize (Standardize) the data: Standardizing the data puts all numeric values in each column on a common scale so the differences between values are not distorted. This matters for KNN because it measures distances, so a column with a wider range would otherwise dominate. Fundamentally, we're creating z-scores: each column is rescaled to have a mean of 0 and a standard deviation of 1.

Here are some drawn out diagrams that explain it conceptually:

Image by Author

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(happy_eupho.drop('target', axis=1))
scaled_features = scaler.transform(happy_eupho.drop('target',axis=1))
happy_eupho_feat = pd.DataFrame(scaled_features, columns = happy_eupho.columns[:-1])

"happy_eupho" is the name of the DataFrame that holds my merged csv

Above, I imported StandardScaler from sklearn and created a standard scaler object, which is going to standardize the dataset. I then fit the object to all the data except the "target" column, which I dropped because it is the output. I standardized (.transform) those same columns and created a pandas data frame (optional) for the standardized values.
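For anyone curious what the scaler does to a single column, here is a small sketch of the z-score formula in plain Python. The function name standardize and the energy values are made up for illustration; StandardScaler itself does this for every column at once.

```python
import statistics

def standardize(column):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.fmean(column)
    # Population standard deviation, which is what StandardScaler uses.
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Made-up energy values on a 0-100 scale.
energy = [20, 35, 50, 65, 80]
scaled = standardize(energy)
print([round(z, 2) for z in scaled])
```

After scaling, the column has a mean of 0 and a standard deviation of 1, so values below the mean come out negative and the largest values can exceed 1.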

Splitting data into training and testing sets: The KNN workflow uses 2 sets: the training set and the testing set. Both come from the dataset itself: a share of the dataset is used to train the model and the rest is held back to test the model's accuracy.

from sklearn.model_selection import train_test_split
X = happy_eupho_feat
y = happy_eupho['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30, shuffle=True)

To actually split your dataset, you need to import train_test_split from sklearn and create the variables that represent your input (X) and output (y). Since we want to know what playlist is the best fit for certain songs, our playlist ("target" column) will be the output (y) and the input (X) will be the music attributes/features (in my case: energy, acousticness, and valence). The random state is the seed for the random number generator: it can be any value, and fixing it makes the split come out the same on every run. Shuffle makes sure the data is randomized before it is split so each set is representative of the entire dataset. The test size is the share of the dataset that I am putting towards the testing set, which I chose as 0.3, meaning 0.7 (70%) of my dataset is used for training.
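Here is a simplified sketch of what shuffling with a fixed seed and holding out a share of the rows looks like. It only mirrors the idea behind train_test_split; the function name split_rows is made up, and it will not reproduce the exact rows sklearn picks for the same seed.

```python
import random

def split_rows(rows, test_size, seed):
    """Shuffle the rows reproducibly, then hold out a share for testing."""
    shuffled = list(rows)
    # A fixed seed gives the same shuffle, and so the same split, on every run.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-ins for 10 songs
train, test = split_rows(rows, test_size=0.3, seed=30)
print(len(train), len(test))
```

With 10 rows and a test size of 0.3, 7 rows go to training and 3 to testing, and calling the function again with the same seed returns the same split.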

Training the model: Now that 0.7 of the dataset is set aside for training, it's time to finally use KNeighborsClassifier from sklearn to train the model.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7,p=2,metric='euclidean')
knn.fit(X_train, y_train)

A common rule of thumb for choosing the K value is to take the square root of n (the number of data points in the entire dataset). In my case, K = 7 (7 nearest neighbors). We then fit the knn object to X_train and y_train (the inputs and outputs that make up the training set).
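The square-root starting point is quick to compute. In the sketch below, the function name rule_of_thumb_k is made up and 50 is a hypothetical dataset size, not my actual song count.

```python
import math

def rule_of_thumb_k(n_points):
    """Starting value for K: the square root of the dataset size, rounded."""
    # K must be at least 1, even for a tiny dataset.
    return max(1, round(math.sqrt(n_points)))

# 50 is a hypothetical dataset size used only for illustration.
starting_k = rule_of_thumb_k(50)
print(starting_k)
```

This is only a first guess, and other K values are worth trying afterwards.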

Testing set and evaluating the predictions: The model is now trained; it’s time to put it to test and check for accuracy by using the Classification Report.

prediction = knn.predict(X_test)
prediction
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))

The prediction object contains the playlist the trained model predicts for each song in the testing set, based on that song's inputs (music attributes). To see how accurate the model is, I'm going to use the Classification Report, which gives us the Precision, Recall, and F1 values (in depth explanation here) by comparing the model's predictions with the true output values of the testing set. Right now, I will just focus on the average accuracy value, which is the percent of predictions that were correct.
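To show where those numbers come from, here is a rough sketch that computes accuracy, precision, and recall by hand. The labels are made up (they are not my results), and the function names are my own; the sketch also assumes the chosen label appears at least once in both lists, otherwise the divisions would fail.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class label."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == label for p in y_pred)
    actual_pos = sum(t == label for t in y_true)
    # Precision: of the songs predicted as this label, how many were right.
    # Recall: of the songs that truly have this label, how many were found.
    return true_pos / predicted_pos, true_pos / actual_pos

# Made-up labels: 1 = euphonious, 2 = happy.
y_true = [1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [1, 1, 2, 2, 2, 2, 1, 1]
acc = accuracy(y_true, y_pred)
prec, rec = precision_recall(y_true, y_pred, label=1)
print(acc, prec, rec)
```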

Image by Author

On the first try, the average accuracy was a 0.82 which is alright, but could be better. This means that 82 percent of the model’s predictions were correctly identified to their respective playlists.

Since I wasn’t satisfied with this accuracy rate, I decided to change the k-values around to see which value returns better accuracy. With help from this article, I learned how to make an error chart which graphs the error rate for every k-value from 1 to 39 using outputs from a for-loop.

error_rate = []
# Try every K from 1 to 39 and record the share of wrong test predictions.
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    prediction_i = knn.predict(X_test)
    error_rate.append(np.mean(prediction_i != y_test))

Image by Author

Using this chart, I tried values near 5 and 10 and settled on K=10, which returned the lowest error rate and the highest average accuracy of 0.87.
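Instead of reading the best value off the chart, you could also look it up from the list of error rates. This is a small sketch with made-up numbers; the function name best_k is hypothetical, and it assumes the first entry in the list belongs to K=1, as it does in the error_rate loop.

```python
def best_k(error_rate, first_k=1):
    """Return the K with the lowest error rate (ties go to the smallest K)."""
    lowest = min(error_rate)
    # list.index returns the first position of the lowest value.
    return error_rate.index(lowest) + first_k

# Made-up error rates for K = 1..6, for illustration only.
example_errors = [0.30, 0.25, 0.22, 0.18, 0.20, 0.18]
chosen_k = best_k(example_errors)
print(chosen_k)
```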

Image by Author

Conclusion

The aim of this project was to create a reliable model that can take a song and sort it into either a happy, upbeat, energetic playlist or a slow, soothing, calm playlist [happy or euphonious]. This can work for any playlist on Spotify. I would have preferred to have my precision and accuracy rate in the 90s range, but my best accuracy was 87%, which indicates that there are some songs in both playlists that share the same musical attributes. To get a higher accuracy rate, I could have used Spotify-curated playlists, which tend to have more songs, more diversity of artists, and more consistent patterns in the music. Choosing genre playlists instead of "mood" playlists could also return a higher accuracy because there could be a clearer difference in tone (e.g. hip-hop and jazz). One challenge this model might have faced is telling apart songs in the two playlists that share the same genre.

This is the first time I put this algorithm in play, and this article is mostly for people trying to learn how to implement KNN or starting their path in data science. I would love to hear some feedback or questions!

