We used data science to explore Ed Sheeran’s songs, here’s what we found

Aayush Chadha
Towards Data Science
6 min read · Mar 4, 2017


It was Friday afternoon and Ed Sheeran had released his latest album that morning. As I sat listening to it for the first time, a thought whizzed through my head: “This album doesn’t really sound fresh. Either I have been listening to too many of his songs, or most of his songs are so similar that I can now more or less guess what is coming next in the lyrics.” While I contemplated my plans for Friday night, I texted Rifad and asked him, like any reasonable first-year Computer Science student would, “Mate, would you be willing to forsake your Friday night plans to have a go at Ed Sheeran’s songs?” To my surprise, he got back immediately, saying he was up for putting himself through the agony of scraping music sites for lyrics and creating visualisations in d3.js.

Most popular words in Ed Sheeran’s songs

And so the hacking began.

The idea was that we would use known natural language processing techniques, like term frequencies, sentence similarities and sentiment analysis, to find out how varied Ed Sheeran’s songs are. First, we had to get the data from somewhere, and Rifad quickly put together a scraper that pulled the lyrics off one of the numerous lyrics sites on the internet. Only later did we realise we could have used the Genius API to find these lyrics.

Nonetheless, once the lyrics were saved, my work began. Recently, I have been experimenting a lot with the word2vec model, which converts words into vectors; those vectors can then be used to measure similarity between sentences, phrases and even entire documents using existing techniques like cosine similarity or Euclidean distance. My approach to this particular dataset was to first train word embeddings specifically on Ed Sheeran’s lyrics, then calculate normalised vectors for all his songs, and finally apply a dimensionality reduction algorithm to reduce the vectors from 50 dimensions to just 2, making them easy to visualise. In the process, I hoped to find the popular words in his songs and also see how varied his vocabulary was.
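For anyone curious, cosine similarity comes down to a few lines of NumPy. A minimal sketch (the two vectors here are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: ~1.0 for the same
    direction, 0.0 for orthogonal vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])  # same direction as v1, so similarity ≈ 1.0
print(cosine_similarity(v1, v2))
```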

Firing up a Jupyter notebook, I got to work. I used the word2vec model from gensim to generate the word embeddings. A word2vec model takes a large corpus of text as input and generates a vector space of several dimensions from it, placing words used in similar contexts closer to each other in that space. The point to note here is that it takes a large corpus of text. My corpus comprised 40-odd songs with a vocabulary of roughly 2,500 words — not exactly the best input source — but since we were just doing this out of curiosity, I decided to go ahead. I figured I could evaluate how well the model did by asking some ‘fans’ what they think about his music. Since I was using a measly 4 GB MacBook Air, I used only 50 dimensions (the best models use close to 300). The context window was 7 words, and frequent words were subsampled. At the end of training, I had a 1.5 MB model in which every word ever used in his songs was now a 50-dimensional vector.

The next step was to construct a fairly ad hoc representation of his songs as vectors. To achieve that, I calculated the vector sum of the words in each song and then normalised the result. The reason for normalising these vectors is that it helps in finding similarity between pairs of vectors. In hindsight, one might argue that a fairer representation would have weighted each word by a metric like its tf-idf score.
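The sum-and-normalise step is straightforward. A minimal sketch, using tiny hand-made 3-dimensional embeddings in place of the trained model:

```python
import numpy as np

def song_vector(tokens, word_vectors):
    """Represent a song as the L2-normalised sum of its word vectors,
    skipping any token missing from the vocabulary."""
    total = np.sum([word_vectors[t] for t in tokens if t in word_vectors], axis=0)
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

# Toy embeddings standing in for the trained 50-dimensional model.
vecs = {"love": np.array([1.0, 0.0, 0.0]), "know": np.array([0.0, 1.0, 0.0])}
v = song_vector(["love", "know", "love"], vecs)
print(np.linalg.norm(v))  # unit length after normalisation
```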

With the vectors in place, I took a quick detour to construct a word cloud. It was no surprise that the most used word was love, followed closely by know and come. (Side note: if you look hard enough at the word cloud above, you can see the silhouette of Ed Sheeran ;) )
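Under the hood, a word cloud is just a term-frequency tally rendered graphically (a package like `wordcloud` can do the drawing). The counting step is a one-liner with the standard library — the tokens below are a toy stand-in, not the real corpus:

```python
from collections import Counter

# Toy token list standing in for the full set of scraped lyrics.
tokens = "love know come love love know".split()

# The three most frequent words, with their counts.
print(Counter(tokens).most_common(3))  # [('love', 3), ('know', 2), ('come', 1)]
```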

Now came the even more fun part: projecting my 50-dimensional vector space into 2 dimensions. The literature suggests two ways of doing this. If the dimensionality is very high, principal component analysis works well to reduce it to something more manageable, after which one can make another pass with the t-distributed stochastic neighbour embedding (t-SNE) algorithm to reduce it to 2–3 dimensions, which can then be visualised. The other way is to use t-SNE directly, and that’s what I did, since it handles 50 dimensions fairly easily over 500 iterations.
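A projection along those lines can be sketched with scikit-learn’s `TSNE` — the song vectors below are random stand-ins for the real normalised song vectors, and the iteration count is left at its default here (the post’s 500 iterations would be `max_iter=500` in recent scikit-learn versions):

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for 40 songs, each a 50-dimensional vector.
rng = np.random.default_rng(0)
song_vectors = rng.normal(size=(40, 50))

# Project straight from 50 dimensions down to 2 for plotting.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(song_vectors)
print(coords.shape)  # (40, 2)
```

Each row of `coords` is then one song’s (x, y) position on the scatter plot.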

Once the dust settled, this is what we got.

Songs like The A Team (far left) and Shape of You (far right, green dot) are very dissimilar from his other songs. One explanation is that most of his songs are nostalgic in some respect, hence the cluster in the middle; some of the songs in that cluster are Castle on the Hill and Photograph. This makes intuitive sense, since the word2vec model places words used in similar contexts closer to each other. This view was also confirmed by teenage fans of Ed Sheeran who played around with the visualisation.

With Friday well behind me, I called it a day and handed the data over to Rifad so that he could compute sentiment scores for the songs. He wanted to use the Google Sentiment Analysis API and the Spotify API to find the saddest and happiest songs. The Google sentiment API returns two highly relevant scores: a sentiment score, and a magnitude representing the strength of sentiment across the length of the text (longer texts therefore have a higher magnitude). The Spotify API, on the other hand, returned a bunch of interesting measures, like danceability, speechiness and valence. We came up with a very rough, back-of-the-envelope formula to calculate the final song sentiment. Since valence and the sentiment score were similar values, we added them together and multiplied the sum by the magnitude. We then divided the result by speechiness, thinking that this would normalise for songs with more words in them. Finally, we squared the result and took its logarithm. This formula certainly isn’t the best representation of the sentiment in a song; however, it had some semblance of adherence to human judgement of the songs, which might or might not be right.
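Written out, the back-of-the-envelope formula looks like this (the numbers below are invented, purely to show the shape of the calculation):

```python
import math

def song_score(valence, sentiment, magnitude, speechiness):
    """Rough song-sentiment score: (valence + sentiment) * magnitude,
    divided by speechiness, then squared and log-transformed.
    Assumes speechiness > 0 and a non-zero combined score."""
    combined = (valence + sentiment) * magnitude / speechiness
    return math.log(combined ** 2)

# Hypothetical inputs: valence and speechiness as Spotify-style 0-1
# values, sentiment and magnitude as Google-style scores.
print(round(song_score(valence=0.6, sentiment=0.3, magnitude=4.0, speechiness=0.1), 2))  # 7.17
```

Squaring before the log means a strongly negative combined score ends up just as far from zero as a strongly positive one, which is one reason the formula is only a loose proxy for mood.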

Songs with higher scores were more positive. Unsurprisingly, Don’t was the gloomiest, or at least the one using the most words with negative connotations. What Do I Know? really made us wonder what we knew: it had an unnaturally high score. I suppose we can attribute some of that to the ambiguity of his lyrics.

All in all, it was a Friday-Saturday well spent writing fairly amateurish text mining scripts. More importantly, we learned a lot about the process, had some really good fun and came across the occasional surprising result.

We put together a quick website where all our interactive visualisations are — https://r1fad.github.io/edSheeran/

All our code can be found here — https://github.com/r1fad/edSheeran
