
Spotify offers no shortage of playlists. On my home page right now, I see Rap Caviar, Hot Country, Pump Pop, and many others spanning all sorts of musical textures.
While many users enjoy browsing songs and building playlists around their own tastes, I wanted to do something different: use an unsupervised learning technique to find closely related music and generate playlists automatically.
The algorithm doesn’t need to classify every song nor does every playlist need to be perfect. Instead, it only needs to produce suggestions I can vet and creatively name, saving me the time of researching songs across different genres.
The Data Set
Spotify’s Web API grants developers access to its vast music library. From it, data on almost 170,000 songs from 1921 to 2020 was collected and made available on Kaggle. The data spans virtually every genre and features both obscure and popular tracks.
Every song in the data set is described by several key musical indicators. Spotify itself defines each measure, but briefly:
- Acousticness: how acoustic a song is. Songs with soft pianos and violins score higher while songs with distorted guitars and screaming score lower.
- Danceability: how appropriate a song is for the dance floor, based on tempo, rhythm stability, beat strength, and overall regularity. Infectious pop songs score higher while stiff classical music scores lower.
- Energy: how intense and active a song feels. Hard rock and punk scores higher while piano ballads score lower.
- Instrumentalness: how instrumental a track is. Purely instrumental songs score higher while spoken word and rap songs score lower.
- Liveness: how likely it is that the song was recorded in front of a live audience.
- Loudness: how loud a track is.
- Speechiness: how vocal a track is. Tracks with no music and only spoken words score higher while instrumental songs score lower.
- Valence: how positive or happy a song sounds. Cheerful songs score higher while sad or angry songs score lower.
- Tempo: how fast a song is, in beats per minute (bpm).
Cleaning the Data
While it’s tempting to dive straight into analysis, some data cleaning is required first. All work is done in Python.
import pandas as pd
import numpy as np
# read the data
df = pd.read_csv("Spotify_Data.csv")
Before doing anything, the Pandas and NumPy libraries are imported. The final line reads the previously saved CSV file into a DataFrame.
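Before going further, it helps to glance at the shape and a few rows of the DataFrame. A minimal sanity check (the column names are assumed to match the Kaggle data set):
# Quick look at the raw data before cleaning
print(df.shape)
print(df[["name", "artists", "year", "popularity"]].head())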
from re import search
# Function to tell if a string is degraded beyond recognition
def is_data_corrupt(string):
    # Search for any letter or number; search() returns None if none are found
    found = search("[0-9A-Za-z]", string)
    # Return 1 if corrupt, 0 if not
    return 1 if found is None else 0
At some point during the data collection process, a lot of text became corrupt. As a result, some artists and song names are listed as incomprehensible strings of text, such as "咨璇" and "蒬花". As much as I want to know 咨璇’s greatest hits, I can’t search for and compile those values.
To filter corruption, a function called is_data_corrupt() is written that uses regular expressions to check if a string contains numbers, uppercase letters, or lowercase letters. It will mark any string that contains only punctuation or special characters as corrupt, which should find the problematic entries while preserving legitimate song and artist names.
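A quick demonstration on a couple of made-up strings shows the intended behavior:
# Example strings for illustration only
print(is_data_corrupt("Frédéric Chopin"))  # 0: contains letters, kept
print(is_data_corrupt("!?*"))              # 1: no letters or numbers, flagged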
# Create a helper column for artist corruption
df["artists_corrupt"] = df["artists"].apply(is_data_corrupt)
# Create a helper column for name corruption
df["name_corrupt"] = df["name"].apply(is_data_corrupt)
# Filter out corrupt artist names
df = df[df["artists_corrupt"] == 0]
# Filter out corrupt song names
df = df[df["name_corrupt"] == 0]
After applying is_data_corrupt() to both the artists and name columns to create two new helper columns, any row marked as corrupt is filtered out.
It’s worth noting that this only catches text degraded beyond recognition. Some text still contained partial degradation. For example, famous composer Frédéric Chopin appears as "FrÃ©dÃ©ric Chopin". More extensive data cleaning could remedy these entries, but those methods are out of scope for this article.
# Gets rid of rows with unspecified artists
df = df[df["artists"] != "['Unspecified']"]
A large share of artists aren’t listed and are instead given the placeholder value "Unspecified". Since unlisted artists make songs needlessly hard to find (song titles aren’t unique, after all), these rows are also filtered.
# Filter out speeches, comedy routines, poems, etc.
df = df[df["speechiness"] < 0.66]
Finally, purely vocal tracks, such as speeches, comedy specials, and poem readings, present an issue. Because the clusters are based on audio characteristics, these tracks will end up grouped together.
Unfortunately, a playlist composed of President Roosevelt’s Fireside Chats and a Larry the Cable Guy routine makes for a very poor, if amusing, listening experience. Vocal tracks need to be categorized by content, which goes beyond the scope of this data.
Consequently, I simply filtered out any track with a "speechiness" value over 0.66. While this may have removed a few actual songs, the payoff of removing these tracks is worth it.
Data Exploration
Even before running an unsupervised model, looking for interesting patterns in the data provides insight into how to proceed. All visualizations are made with seaborn.
import seaborn as sns
# Correlation heat map of the numerical columns
sns.heatmap(df.corr(), cmap = "coolwarm")

The above heat map shows how strongly the different numerical features correlate: deep red shows strong positive relationships and deep blue shows strong negative ones. Some interesting correlations deserve a few comments.
Unsurprisingly, a strong relationship between popularity and year exists, implying that more recent music is more popular. Between a younger audience that prefers streaming and Spotify actively promoting newer music, this insight confirms domain knowledge.
In another unsurprising trend, loudness correlates with energy. Intuitively, high energy songs radiate intensity, which often comes with loudness.
Acousticness shares a few interesting negative relationships. Its inverse relationship with energy and loudness captures how most people imagine ballads. Acoustic songs, however, tend to be less popular, and their share has decreased over time.
While surprising given that piano and acoustic guitar tracks stubbornly remain in the public consciousness, this insight says more about the rise of distorted guitars and synthesizers in music production. Before their advent, virtually every song was acoustic; acoustic tracks have simply lost market share since.
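To put numbers on the relationships discussed above, the relevant cells of the correlation matrix can be printed directly (column names assumed to match the Kaggle data set):
# Look up the specific correlations discussed above
corr = df.corr()  # on newer Pandas, use df.corr(numeric_only = True)
print(corr.loc["popularity", "year"])
print(corr.loc["energy", "loudness"])
print(corr.loc["acousticness", ["energy", "loudness", "popularity", "year"]])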
OPTICS Clustering
To create playlists, I used Scikit-Learn’s implementation of OPTICS clustering, which essentially scans the data for areas of high density and assigns them to clusters. Observations in low-density areas aren’t assigned, so not every song will find itself on a playlist.
df_features = df.filter([
    "acousticness",
    "danceability",
    "energy",
    "instrumentalness",
    "loudness",
    "mode",
    "tempo",
    "popularity",
    "valence"
])
Before running the algorithm, I pulled the columns I wanted to analyze. Most of the features are purely musical, except popularity, which is kept to help group like artists together.
Since OPTICS determines density with Euclidean distance, I kept the number of columns small: in high-dimensional data, distance-based metrics lose their meaning.
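This curse of dimensionality is easy to see in a toy experiment, unrelated to the Spotify data itself: as the number of dimensions grows, pairwise distances between random points bunch up around their mean, making dense and sparse regions harder to tell apart.
import numpy as np
from scipy.spatial.distance import pdist

# Spread of pairwise distances, relative to their mean, by dimension
rng = np.random.default_rng(0)
for dims in (2, 10, 100, 1000):
    points = rng.random((300, dims))
    dists = pdist(points)  # all pairwise Euclidean distances
    print(dims, round(dists.std() / dists.mean(), 3))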
from sklearn.preprocessing import StandardScaler
# Initialize scaler
scaler = StandardScaler()
# Fit the scaler, then scale the features
scaler.fit(df_features)
df_scaled_features = scaler.transform(df_features)
Next, the data is put on a standard scale. While most of the features are already normalized between 0 and 1, tempo and loudness use different scales, which skews the results when calculating distances. StandardScaler brings everything onto the same scale, with zero mean and unit variance.
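A quick check confirms the transformation worked: every column should now have a mean near 0 and a standard deviation near 1.
# Each column should now have mean ~0 and standard deviation ~1
print(df_scaled_features.mean(axis = 0).round(2))
print(df_scaled_features.std(axis = 0).round(2))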
from sklearn.cluster import OPTICS
# Initialize and run OPTICS
ops_cluster = OPTICS(min_samples = 10)
ops_cluster.fit(df_scaled_features)
# Add cluster labels back to the dataframe
df["clusters"] = ops_cluster.labels_
Finally, OPTICS is run on the data set. Note the parameter min_samples, which sets the number of neighboring observations needed for a point to count as part of a dense region. In effect, it dictates the minimum number of songs required to make a playlist.
Setting min_samples too small will create many playlists with only a few songs, but setting it too high will create a few playlists with a lot of songs. Ten was selected to strike a reasonable balance.
Also note that the data set remains fairly large, so running the algorithm takes time. In my case, my computer worked for a few hours before returning the results.
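Once the fit finishes, it’s worth summarizing the output before vetting anything: how many playlists were found, how many songs were left unassigned (OPTICS labels those -1), and how big the largest playlists are.
# Summarize the clustering output
labels = ops_cluster.labels_
print("playlists found:", labels.max() + 1)
print("share of songs unclustered:", round((labels == -1).mean(), 3))
print(pd.Series(labels[labels >= 0]).value_counts().head())  # largest playlists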
Outcomes
As originally stated, I wanted a system to suggest playlists I could manually vet, saving myself the time of researching artists across different genres. Not every song needed to be classified, nor did the playlists need to be perfect.
Although there’s room for substantial improvement, OPTICS met my expectations. It grouped songs into clusters that followed a general theme, giving me a set of interesting playlists spanning genres I knew nothing about.
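Vetting a suggestion is then just a matter of pulling the songs for a given cluster label (cluster 0 here is an arbitrary example):
# Peek at one suggested playlist for manual vetting
playlist = df[df["clusters"] == 0][["name", "artists"]]
print(playlist.head(10))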
Unfortunately, most songs weren’t clustered, which means the algorithm lost a large amount of musical diversity. I tried experimenting with different clustering methods (DBSCAN and K-Means), but I received similar results. Quite simply, the data isn’t very dense, so a density-based approach was flawed from the beginning.
The playlists themselves, however, generally made fun suggestions. While occasionally suggesting bizarre combinations (for example, popular electronic dance music artist Daft Punk found themselves among classical composers), they remained fairly on topic. Consequently, I discovered new artists and learned a lot about music by running this project.
That’s the magic of Unsupervised Learning.
The Playlists
While I could write about different metrics to evaluate the validity of the playlists, I decided there’s no better judge than the crucible of the internet. I lightly edited some of my favorites and encourage anyone to listen and make their own judgement. Note that some songs may contain explicit lyrics.
- Thumping Beats: Tight lyrics and popping basslines in hip hop
- Synth Candy: A bowl of colorful pop songs
- Fast n’ Hard: When punk and hard rock hit you in the face
- String Section: An evening with violins and cellos
Conclusions
While there’s room for improvement, OPTICS clustering met the requirements and created a diverse set of interesting playlists. Given the issues with density-based approaches on sparse data, I would revisit this problem with hierarchical clustering, cosine similarity, or another method that better handles the sparsity.
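As a starting point for that revisit, here is a minimal sketch of one such alternative: hierarchical clustering with a cosine metric via Scikit-Learn’s AgglomerativeClustering. The sample size and n_clusters below are illustrative guesses, not tuned values, and the metric parameter requires a recent version of Scikit-Learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Agglomerative clustering builds a full pairwise distance matrix,
# so work on a random sample of the scaled features
rng = np.random.default_rng(42)
idx = rng.choice(len(df_scaled_features), size = 10000, replace = False)
sample = df_scaled_features[idx]

# Hierarchical clustering with cosine distance; every sampled song gets a label
agg = AgglomerativeClustering(n_clusters = 50, metric = "cosine", linkage = "average")
labels = agg.fit_predict(sample)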
More broadly, however, this project showcases the power of unsupervised learning in finding patterns that even humans have difficulty quantifying. While not completely automated, since the results required vetting, it demonstrated how humans and machines can work together to create a more enjoyable end product.