When Spotify introduced #2020Wrapped, it allowed listeners to share their year-in-review on Instagram Stories. The year-in-review includes the top 5 songs, top 5 genres, and top 5 artists that a listener has been listening to this year.
To avoid FOMO, I also took a deep dive into what I had been listening to on the app throughout this difficult year.

Based on my #2020Wrapped, I am interested in finding out whether there is something in common among my Top 5 Spotify Artists – Lauv, The Chainsmokers, Gryffin, Kygo, and Martin Garrix.
In this exercise, Python is used for data scraping, while R is used for data analysis, data modelling, and data visualisation.
Import the dataset and the packages
Spotify allows any listener or data enthusiast to retrieve data from its Spotify Developer Platform.
From there, one can access the Spotify Web API Console and scrape the data using Python and its package Spotipy. You can find the data-scraping code on my GitHub repository.
From the data available, I have only selected those variables related to the audio features. The audio features are as follows:
- Acousticness – A confidence measure from 0.0 to 1.0 of whether the track is acoustic, whereby 1.0 represents high confidence the track is acoustic.
- Danceability – How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- Energy – A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- Instrumentalness – Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- Liveness – Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
- Loudness – The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- Speechiness – Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- Valence – A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- Tempo – The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, the tempo is the speed or pace of a given piece and derives directly from the average beat duration.
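As a quick illustration, the speechiness bands described above translate directly into a small classifier. This is a hedged Python sketch; the band labels (`music`, `mixed`, `spoken word`) are my own shorthand, not Spotify's:

```python
def speechiness_band(value):
    """Map a Spotify speechiness score (0.0-1.0) to a rough category.

    Follows the documented thresholds: above 0.66 is probably all spoken
    word, 0.33-0.66 may mix music and speech (e.g. rap), and below 0.33
    is most likely music or other non-speech-like audio.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("speechiness must be between 0.0 and 1.0")
    if value > 0.66:
        return "spoken word"
    if value >= 0.33:
        return "mixed"
    return "music"
```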
After retrieving the dataset from the Spotify API, it's time to switch from Python to R. Before importing the dataset, let's load the packages that will be used throughout this exercise.
library(tidyverse) # set of R packages (including ggplot)
library(readr) # read rectangular data
library(dplyr) # data manipulation
library(data.table) # data manipulation
library(fmsb) # radar chart
library(corrplot) # correlation plot
library(FactoMineR) # for Principal Component Analysis
library(factoextra) # for K-Means Clustering
library(qdap) # for text cleaning
library(qdapDictionaries) # for text cleaning
library(tm) # for text cleaning
library(wordcloud2) # for lyrics word cloud
library(syuzhet) # for sentiment analysis
The datasets below were scraped with Python's Spotipy by querying the Spotify API for each of the Top 5 Spotify Artists from my #2020Wrapped.
lauv <- read_csv("Lauv.csv")
the_chainsmokers <- read_csv("The_Chainsmokers.csv")
gryffin <- read_csv("Gryffin.csv")
kygo <- read_csv("Kygo.csv")
martin_garrix <- read_csv("Martin_Garrix.csv")
Tidy & Transform
To merge the 5 datasets of the different artists, rbind is used. This function binds several vectors, matrices, or data frames by rows.
top_5_artists_df <- do.call("rbind", list(lauv, the_chainsmokers, gryffin, kygo, martin_garrix))
Thereafter, the compiled dataset needs cleaning, which includes dropping columns that will not be used in this exercise.
top_5_artists_df <- top_5_artists_df %>%
select(-X1, -track_number, -uri)
Before moving on to the next stage, Exploratory Data Analysis, I need to check whether there are any null values in the dataset.
map(top_5_artists_df, ~sum(is.na(.)))
Fortunately, as the results below show, there are no null values in the dataset; otherwise, the rows with null values would need to be dropped.
> map(top_5_artists_df, ~sum(is.na(.)))
$album
[1] 0
$id
[1] 0
$name
[1] 0
$acousticness
[1] 0
$danceability
[1] 0
$energy
[1] 0
$instrumentalness
[1] 0
$liveness
[1] 0
$loudness
[1] 0
$speechiness
[1] 0
$tempo
[1] 0
$valence
[1] 0
$popularity
[1] 0
$artist
[1] 0
A new dataset retaining only the numerical variables is created for a later stage of this exercise – Model Building.
top_5_artists_df_num_only <- top_5_artists_df %>%
select(-album, -id, -name, -artist)
Lastly, a function for generating radar charts is created, in which the audio features are normalised to values from 0 to 1 to make the chart readable.
radar_chart <- function(arg){
songs_data_to_be_used <- top_5_artists_df %>% filter(artist==arg)
radar_data_v1 <- songs_data_to_be_used %>%
select(acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence)
radar_data_v2 <- apply(radar_data_v1,2,function(x){(x-min(x)) / diff(range(x))})
radar_data_v3 <- apply(radar_data_v2,2,mean)
radar_data_v4 <- rbind(rep(1, 9), rep(0, 9), radar_data_v3) # max/min rows must match the 9 features
return(radarchart(as.data.frame(radar_data_v4),title=arg))
}
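The normalisation inside radar_chart is ordinary min-max scaling followed by a per-column mean. A minimal stdlib Python sketch of those two steps (the sample loudness values are made up for illustration):

```python
def min_max_scale(column):
    """Rescale a list of numbers to the 0-1 range (min-max normalisation)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("column has no spread; cannot rescale")
    return [(x - lo) / (hi - lo) for x in column]

def column_mean(column):
    """Average of a numeric column, used as the radar chart's summary value."""
    return sum(column) / len(column)

# e.g. an artist's loudness values in dB, scaled and then averaged
loudness = [-7.0, -5.0, -3.0]
scaled = min_max_scale(loudness)     # [0.0, 0.5, 1.0]
profile_value = column_mean(scaled)  # 0.5
```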
Exploratory Data Analysis
Audio Features across the Top 5 Spotify Artists
A radar chart is useful to compare the musical vibes of the 5 artists in a more visual way.
par(mfrow = c(2, 3))
radar_chart_lauv <- radar_chart("Lauv")
radar_chart_the_chainsmokers <- radar_chart("The Chainsmokers")
radar_chart_gryffin <- radar_chart("Gryffin")
radar_chart_kygo <- radar_chart("Kygo")
radar_chart_martin_garrix <- radar_chart("Martin Garrix")

From the visualisation, we can infer the patterns for each artist:
- All the artists have a similar tempo except Martin Garrix, who has the highest tempo among them
- Both Gryffin and The Chainsmokers have similar danceability and energy but Gryffin has higher loudness than The Chainsmokers’
- Lauv, Kygo, and Martin Garrix have similar danceability, but Kygo has higher energy than both Lauv and Martin Garrix, who have similar energy to each other
- Both Lauv and The Chainsmokers have similar loudness
- Lauv, Gryffin, and Kygo have similar acousticness
- Both Lauv and Kygo have a similar valence, both The Chainsmokers and Martin Garrix have a similar valence, and Gryffin’s valence lies between both pairs’
- Surprisingly, Martin Garrix, who’s an EDM DJ, has the lowest loudness among the rest of the artists
- Out of all the artists, Kygo has the highest for both loudness and energy
Correlation within the Audio Features
After getting a glimpse of the audio features among the artists, we would like to understand the correlation between the popularity of the songs and the audio features of the songs by using a correlation matrix.
corr_plot_song_features_data <- top_5_artists_df %>%
select(popularity, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence)
corrplot(cor(corr_plot_song_features_data),
method = "color",
type = "upper",
order = "hclust")
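cor() fills the matrix with pairwise Pearson coefficients; for intuition, here is how one such coefficient is computed by hand. A hedged stdlib Python sketch with toy energy/loudness values (not real track data):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy values only: louder tracks paired with more energetic ones
energy = [0.2, 0.5, 0.9]
loudness = [-12.0, -8.0, -4.0]
print(pearson(energy, loudness))  # close to 1, i.e. a strong positive correlation
```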

From the correlation matrix, we can identify that:
- There is a high positive correlation between energy and loudness
- There is a positive correlation between danceability and valence
- There is a strong negative correlation between energy and acousticness
- There is a high negative correlation between loudness and acousticness
- Surprisingly, there is a negative correlation between danceability and tempo and a weak positive correlation between energy and tempo
Characteristics of the Top 5 Spotify Artists’ Songs based on Popularity
In this sub-section, we would like to understand which audio features made the songs of the top 5 artists popular.

From the above plots, we can observe that most of the songs by the 5 artists are:
- Low in acousticness
- Highly danceable
- Highly energetic
- Loud
These seem to be characteristics of songs that I am interested in listening to.
Variations of Acousticness, Danceability, Energy, and Loudness between the Top 5 Spotify Artists
In this sub-section, we would like to understand the variations of the above features between the artists.
Artist by Acousticness
top_5_artists_df %>%
ggplot(aes(x = reorder(artist, acousticness, FUN = median), y = acousticness)) +
geom_boxplot(stat = "boxplot", aes(fill = artist), outlier.size = 2) +
stat_summary(fun = mean, geom="point", shape=20, size=3, color="red", fill="red") +
labs(x = "Artist", y = "Acousticness",
title = "Variation of Acousticness between the Top 5 Artists") +
ggthemes::theme_economist() +
theme(legend.position="none")
From the boxplot, Lauv’s songs have the highest acousticness, while Martin Garrix’s and Gryffin’s songs have a similar interquartile range of acousticness. It is expected that Lauv’s songs rank highest here, as most of his songs are performed without electronic amplification. The rest of the artists’ songs are EDM and do use electronic amplification, but Kygo’s catalogue also includes acoustic versions of his original EDM songs (such as Kids in Love (feat. The Night Game) – Acoustic Version); hence, Kygo’s songs have higher acousticness than Martin Garrix’s, The Chainsmokers’, and Gryffin’s.
Artist by Danceability
top_5_artists_df %>%
ggplot(aes(x = reorder(artist, danceability, FUN = median), y = danceability)) +
geom_boxplot(stat = "boxplot", aes(fill = artist), outlier.size = 2) +
stat_summary(fun = mean, geom="point", shape=20, size=3, color="red", fill="red") +
labs(x = "Artist", y = "Danceability",
title = "Variation of Danceability between the Top 5 Artists") +
ggthemes::theme_economist() +
theme(legend.position="none")
Surprisingly, Lauv’s songs have the highest danceability, given that the rest of the artists’ songs are mainly EDM. The songs from The Chainsmokers and Gryffin have a similar median and first quartile (25th percentile) of danceability, and that median is in turn similar to the median danceability of Martin Garrix’s songs.
Artist by Energy
top_5_artists_df %>%
ggplot(aes(x = reorder(artist, energy, FUN = median), y = energy)) +
geom_boxplot(stat = "boxplot", aes(fill = artist), outlier.size = 2) +
stat_summary(fun = mean, geom="point", shape=20, size=3, color="red", fill="red") +
labs(x = "Artist", y = "Energy",
title = "Variation of Energy between the Top 5 Artists") +
ggthemes::theme_economist() +
theme(legend.position="none")
Martin Garrix’s songs have the highest energy compared to the rest of the artists. Furthermore, the songs from Kygo, The Chainsmokers, Martin Garrix, and Gryffin have higher energy than Lauv’s. This is expected, as EDM songs are high in energy.
Artist by Loudness
top_5_artists_df %>%
ggplot(aes(x = reorder(artist, loudness, FUN = median), y = loudness)) +
geom_boxplot(stat = "boxplot", aes(fill = artist), outlier.size = 2) +
stat_summary(fun = mean, geom="point", shape=20, size=3, color="red", fill="red") +
labs(x = "Artist", y = "Loudness",
title = "Variation of Loudness between the Top 5 Artists") +
ggthemes::theme_economist() +
theme(legend.position="none")
Gryffin’s songs have the highest loudness compared to the rest of the artists. As with energy, the songs from Kygo, The Chainsmokers, Martin Garrix, and Gryffin are louder than Lauv’s. This is expected as well, because EDM songs are loud.
Artist by Tempo
top_5_artists_df %>%
ggplot(aes(x = reorder(artist, tempo, FUN = median), y = tempo)) +
geom_boxplot(stat = "boxplot", aes(fill = artist), outlier.size = 2) +
stat_summary(fun = mean, geom="point", shape=20, size=3, color="red", fill="red") +
labs(x = "Artist", y = "Tempo",
title = "Variation of Tempo between the Top 5 Artists") +
ggthemes::theme_economist() +
theme(legend.position="none")
Tempo reflects the speed of a song and the mood it evokes: the higher the tempo, the more energising and joyful a song tends to be, while a lower tempo tends to suggest sadness. The songs from The Chainsmokers and Gryffin have a similar third quartile (75th percentile) of tempo. Martin Garrix’s songs have the lowest tempo.
Artist by Valence
top_5_artists_df %>%
ggplot(aes(x = reorder(artist, valence, FUN = median), y = valence)) +
geom_boxplot(stat = "boxplot", aes(fill = artist), outlier.size = 2) +
stat_summary(fun = mean, geom="point", shape=20, size=3, color="red", fill="red") +
labs(x = "Artist", y = "Valence",
title = "Variation of Valence between the Top 5 Artists") +
ggthemes::theme_economist() +
theme(legend.position="none")
Lauv’s songs have the highest valence among the artists, while Martin Garrix’s have the lowest. This suggests that Lauv’s songs carry more positive vibes, while Martin Garrix’s carry more negative ones. The songs from Lauv and The Chainsmokers have a similar median valence, and the songs from The Chainsmokers and Gryffin have a similar third quartile of valence.
Model Building and Interpretation of the Results
K-Means Clustering
Clustering is conducted in this exercise to find out which songs of the artists are similar and discover new songs that I might like.
Firstly, we need to conduct the Principal Component Analysis (PCA) to detect outliers before conducting clustering.
PCA
top_5_artists_df_scale <- scale(top_5_artists_df_num_only)
pca_top_5_artists <- PCA(top_5_artists_df_scale,
scale.unit = FALSE,
graph = F,
ncp = 10)
plot.PCA(pca_top_5_artists,
choix = c("ind"),
habillage = 1,
select = "contrib5",
invisible = "quali")

The outliers identified by the PCA are the 9th, 20th, 139th, 192nd, and 222nd rows of the dataset.
plot.PCA(pca_top_5_artists, choix = c("var"))

pca_dimdesc <- dimdesc(pca_top_5_artists)
pca_dimdesc$Dim.1

Based on the correlation between variables within Dimension 1 (25.99%), we find 5 variables that contribute the most: energy, loudness, liveness, valence, and danceability.
Clustering Process
Before initiating the clustering, it is necessary to remove the outliers. Then, to obtain a new PCA, the data needs to be re-scaled.
top_5_artists_without_outliers <- top_5_artists_df_scale[-c(9, 20, 139, 192, 222),]
top_5_artists_df_scale1 <- scale(top_5_artists_without_outliers)
After completing the above task, we have to pick the optimal number of clusters (K) using the Elbow Method. The following is the function that builds the elbow method chart, along with its output for top_5_artists_df_scale1.
wss <- function(data, maxCluster = 10) {
ssw <- vector()
ssw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var)) # total sum of squares for K = 1
for (i in 2:maxCluster) {
ssw[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:maxCluster, ssw, type = "o", xlab = "Number of Clusters", ylab = "Within Groups' Sum of Squares", pch = 19)
}
wss(top_5_artists_df_scale1)
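The quantity the elbow chart tracks, the within-groups sum of squares, is just the summed squared distance from each point to its cluster centroid. A stdlib Python sketch for fixed cluster assignments (toy 2-D points, not the Spotify data):

```python
def within_ss(points, labels):
    """Total within-cluster sum of squares for given cluster assignments.

    points: list of equal-length coordinate tuples; labels: parallel list
    of cluster ids. Each cluster's centroid is the per-dimension mean.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members
        )
    return total

# Two tight clusters give a small WSS; forcing one cluster inflates it
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(within_ss(pts, [1, 1, 2, 2]))  # 1.0
print(within_ss(pts, [1, 1, 1, 1]))  # 201.0
```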

From the chart, we can infer that the optimal K is 8 since within groups’ sum of squares is not changing significantly after K = 8.
Having picked the optimal number of clusters (8), we can start building the clusters.
RNGkind(sample.kind = "Rounding") # use pre-R 3.6 sampling so set.seed results stay reproducible
set.seed(33) # can be any number
top_5_artists_kmeans <- kmeans(top_5_artists_df_scale1, centers = 8)
fviz_cluster(top_5_artists_kmeans, data=top_5_artists_df_scale1)

After building the clusters, it's time to profile them.
top_5_artists_without_outliers <- as_tibble(top_5_artists_without_outliers) # convert from matrix to tibble format
top_5_artists_without_outliers$cluster <- top_5_artists_kmeans$cluster
top_5_artists_without_outliers %>%
group_by(cluster) %>%
summarise_all(mean)
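group_by followed by summarise_all(mean) simply averages every numeric column within each cluster. The same profiling step as a stdlib Python sketch; the rows and field names are hypothetical:

```python
from collections import defaultdict

def profile_clusters(rows, cluster_key="cluster"):
    """Average every numeric field per cluster (mirrors group_by + summarise_all)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_key]].append(row)
    profiles = {}
    for cid, members in groups.items():
        fields = [k for k in members[0] if k != cluster_key]
        profiles[cid] = {
            f: sum(m[f] for m in members) / len(members) for f in fields
        }
    return profiles

rows = [
    {"cluster": 1, "energy": 0.8, "valence": 0.6},
    {"cluster": 1, "energy": 0.6, "valence": 0.4},
    {"cluster": 2, "energy": 0.2, "valence": 0.9},
]
print(profile_clusters(rows))  # cluster 1 averages to energy 0.7, valence 0.5
```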

From the table, the characteristic summary of the songs of the Top 5 Spotify Artists in the same cluster are as follows:
- Cluster 1: Highest energy, highest liveness, loudest, and least popular
- Cluster 2: Highest instrumentalness, lowest liveness, least loud, lowest speechiness, and lowest valence
- Cluster 3: Highest speechiness, most popular, and lowest instrumentalness
- Cluster 4: High energy and low speechiness
- Cluster 5: Highest danceability and highest valence
- Cluster 6: Highest acousticness, lowest danceability, and lowest energy
- Cluster 7: Highest tempo
- Cluster 8: Lowest tempo, and most popular
Song Recommendation
After finishing the clustering and profiling of the songs, a song recommendation system is built to enable listeners to get effective suggestions regarding the next best songs according to their taste.
In this exercise, I would like to check which cluster my favourite song, Lauv’s Mean It, belongs to.

top_5_artists_df_without_outliers <- top_5_artists_df[-c(9, 20, 139, 192, 222),]
top_5_artists_df_without_outliers$cluster <- top_5_artists_kmeans$cluster
top_5_artists_df_without_outliers %>%
filter(name == "Mean It", artist == "Lauv")

From the output table, my favourite song is in cluster 8 (Lowest tempo, and most popular).
Now, let’s try a new artist, say, Gryffin. We can check the best songs from Gryffin that I should try listening to, given that my favourite song is Mean It by Lauv.
top_5_artists_df_without_outliers %>%
filter(cluster == 8, artist == "Gryffin")

These are the songs from Gryffin that I should try listening to based on the music taste of my favourite song, Lauv’s Mean It.
Further Exploration: Lyrics Analysis
Further analysis is carried out to understand which common words pop up in the lyrics of the Top 5 Spotify Artists’ songs, by generating a word cloud and by conducting a sentiment analysis of the emotions across all the lyrics. Special thanks to Ekene A. for providing clear and concise instructions on how to scrape lyrics from Genius.com using Python.
In this exercise, the lyrics of each artist’s top 3 songs are scraped.
Before generating the word cloud and running the emotion sentiment analysis, the text needs cleaning to remove punctuation, numbers, and irrelevant words.
lyrics <- readLines("lyrics_5.txt")
lyrics <- paste(lyrics, collapse = " ")
lyrics <- tolower(lyrics)
stop_words <- Top200Words # collecting 200 stop words
lyrics <- removeWords(lyrics, stop_words)
lyrics <- removeWords(lyrics, stopwords("en"))
lyrics <- removeWords(lyrics, c("endoftext", "re", "ll", "ain", "ahh", "ooh", "yeah", "oh", "got", "cause"))
lyrics <- gsub(pattern = "\\W", replacement = " ", lyrics) # remove non-word characters, i.e. punctuation
lyrics <- gsub(pattern = "\\d", replacement = " ", lyrics) # remove numbers
lyrics <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", lyrics) # remove stray single-letter tokens
lyrics <- stripWhitespace(lyrics)
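For comparison, the same cleaning passes can be written with Python's re module; note that the patterns rely on escaped character classes (\W for punctuation, \d for digits, \b[a-z]\b for stray single letters):

```python
import re

def clean_lyrics(text):
    """Lowercase, then strip punctuation, digits, single letters, extra spaces."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)         # non-word characters (punctuation)
    text = re.sub(r"\d", " ", text)         # digits
    text = re.sub(r"\b[a-z]\b", " ", text)  # leftover single-letter tokens
    return re.sub(r"\s+", " ", text).strip()

print(clean_lyrics("I mean it, 100%!"))  # "mean it"
```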
Word Cloud
In this exercise, the word cloud is generated based on the top 200 common words among the top 3 songs of each artist.
frequently_appeared_words <- freq_terms(lyrics, top = 200)
frequently_appeared_words %>%
wordcloud2(backgroundColor = "black",
color = "random-light")

In a word cloud, larger words indicate more appearances in the lyrics.
Out of the 200 most frequent words in the lyrics of each artist’s top 3 songs, Love appears the most. This shows that love is definitely a common theme across my Top 5 Spotify Artists. Other words that follow the theme of Love include Paris, Heart, and Baby. It seems that I prefer songs with romantic vibes.
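Under the hood, freq_terms is essentially a token-frequency count; an equivalent top-N step in stdlib Python (the sample string is invented for illustration):

```python
from collections import Counter

def top_terms(text, n=200):
    """Return the n most frequent whitespace-separated tokens with their counts."""
    return Counter(text.split()).most_common(n)

sample = "love love paris heart love baby paris"
print(top_terms(sample, 3))  # [('love', 3), ('paris', 2), ...]
```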
Emotion Sentiment Analysis
lyrics_tibble <- as_tibble(lyrics) # convert from value to tibble
sentiment <- get_nrc_sentiment(lyrics_tibble$value) # get sentiment values of the lyrics
emotion_sentiment <- sentiment %>%
select(-positive, -negative) # remove non-emotion variables
new_emotion_sentiment <- data.table(Emotions = names(emotion_sentiment), transpose(emotion_sentiment)) # transpose the table
# Barplot
new_emotion_sentiment %>% ggplot(aes(x=Emotions, y=V1, fill=Emotions)) +
geom_bar(stat="identity", color = rainbow(8)) +
labs (x = "Emotions",
y = "Values of the Sentiments",
title ="Emotion Sentiments of the Top 5 Artists' Songs",
subtitle = "Based on the occurrences of the top 200 words of the lyrics of their top 3 songs respectively") +
ggthemes::theme_economist() +
theme(legend.position="none")
From the visualisation, we can observe that the lyrics of each artist’s top 3 songs are most strongly associated with the negative emotion of sadness. Furthermore, the lyrics score equally on the negative emotion of fear and the positive emotion of joy. Lastly, the score for the negative emotion of disgust is below 15.
This could mean that I prefer songs with low valence and low tempo, since the majority of the lyrics from each artist’s top 3 songs are associated with sadness.
Summary, Limitations, and Future Work
Aside from briefly touching on data scraping with Python, this exercise has demonstrated data cleaning & transformation, data analysis, data visualisation, and data modelling using R.
It has allowed us to understand the relationships between the songs’ audio features, as well as the audio features that my Top 5 Spotify Artists of 2020 have in common.
Furthermore, clustering and profiling were performed to group the songs by similarity and understand their characteristics, which in turn enabled a song recommendation step to surface other songs worth trying based on the taste reflected in my favourite song.
Additional analysis was done to draw insights from the lyrics of the artists’ songs. However, there is a limitation in scraping the lyrics: one can request the lyrics of any number of songs, but the output may come back empty or only partially scraped. Hence, to make sure all lyrics were fully captured, this exercise scraped only the lyrics of the top 3 songs of each of the Top 5 Spotify Artists.
This exercise has also illustrated plotting a word cloud, identifying the theme that emerges from it, and implementing emotion classification with the NRC sentiment lexicon to interpret the emotions found in the lyrics.
If a similar exercise comes up in the future, I will explore a Multiple Linear Regression or Logistic Regression model to predict the popularity of songs based on their audio features.
All in all, this exercise is fulfilling and allows me to understand my music taste more!