49 Years of Lyrics: A Python-based study of the change in language in popular music from 1970 to 2018.

Have lyrics become more aggressive and profane over the past 49 years? We use SpaCy and Keras to investigate.

15 min read · Dec 17, 2018


Background

This article started out as an argument about whether popular music is better or worse today than it was several years ago. There are several theories on why certain eras of music resonate with us, which can definitely affect our impartiality when it comes to something as subjective as music and the arts in general. There’s an excellent Slate article on Neural Nostalgia here that covers it in detail.

For me though, as a data-driven person, I thought some level of quantitative analysis could be brought to bear. If I could go through music from 1970 to 2018 and investigate the lyrics from a Natural Language Processing (NLP) perspective, what could I discover? I mean, I know late-90s music was the best music of all time (see the Neural Nostalgia article above), but how could I prove or disprove that? How could I measure something so subjective?

I also wanted to provide other researchers/data scientists/hobbyists with some examples of how open source web-page data could be collected, structured, and then used to feed API calls. Further, I wanted to show how one could use SpaCy to tokenize the lyrics so they could be fed through a trained ANN. I used Requests, BeautifulSoup, and SpaCy for collection and data preparation, matplotlib and seaborn for the visualizations, and Keras with TensorFlow (GPU) to train the ANN and then predict with it.

A Note about the Accompanying Code

You’ll also see from a lot of the code in the GitHub repo that I focused on linearity and readability so that others can pick and choose the parts of code that suit their purpose. It generally isn’t optimized for performance; I really focused on investigation. You can find all the source code on GitHub here.

The Hypothesis

After much more arguing, we came up with the following measurements to test against the lyrics to see how they’ve changed over a 49-year period:

  • The total number of words per song as a measure of complexity.
  • The change in most frequent nouns per year.
  • Use of adverbs per year.
  • The number of profane/controversial words (which in itself is subjective) appearing in songs per year.
  • The level of aggression/hostility in the songs (we’ll build a Keras sequential model for this task).

The hypothesis is as follows:

  • Word quantity and language complexity have increased from 1970 to 2018.
  • Several common nouns appear throughout the entire spectrum of lyrics, but the most common ones change over time.
  • Adverbs have become more aggressive/forward over time.
  • Profanity in lyrics has increased significantly in the past 20 years (1998 to 2018).
  • Songs are more aggressive now than they were in the 1970s.

Ancillary experiments (Coming Soon…)

I will also iterate through the lyrical data to see when new terms such as “Internet,” “Blackberry,” “iPhone,” “Terrorism,” and “Recession” first appear in lyrics (coming at a future date).

Let’s get started!

Data Collection

There are three datasets we’re using to run this experiment:

  1. A dataset we’ll collect ourselves, containing lyrics for over 3400 songs from 1970 to 2018.
  2. A list of prohibited/restricted words from www.freewebheaders.com that we’ll use to assess the perceived levels of profanity in lyrics.
  3. A training dataset from Kaggle (originally used for the detection of cyber trolls) that we’ll use to train a Keras Sequential Neural Network. We’ll then pivot the trained NN to predict whether or not a song is considered aggressive.

Initial Collection (Web Scraping)

I couldn’t find any ready-made lyrics datasets online for this experiment, so I took a look at billboard.com’s year-end top 100 songs. While they do have records going back before the 70s, there are a lot of gaps in their datasets, including no top 100 lists from 1991 to 2006. Luckily there’s another website (bobborst.com) curated by a genuine music lover, and all the pre-2017 content can be found there.

So the majority of the seed data will be collected from http://www.bobborst.com/ and the remainder from Billboard.

I used Python’s Requests library to pull in the pages and then BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse them. It’s an interesting task because one website is organized by HTML tables and the other by divs, so I needed two different transforms. The collected data was then stored in a pandas dataframe called “all_songs.”

See the following function on GitHub for the full snippet.

def collect_songs_from_billboard(start_year,end_year):
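As a minimal sketch of the idea (the selectors and column layout below are assumptions; the repo has the real transforms for both the table-based and div-based sites):

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_year_end_chart(year, url):
    # Pull one year-end chart page and harvest (rank, artist, title) rows.
    # The <tr>/<td> layout here is an assumption for the table-based site;
    # the div-based site needs a second transform along the same lines.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3 and cells[0].isdigit():  # skip header rows
            rows.append({"Year": year, "Rank": int(cells[0]),
                         "Artist": cells[1], "Song Title": cells[2]})
    return pd.DataFrame(rows)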

With the initial data collection complete, I now had the Artist, Rank, Song Title, and Year for 4900 songs. As I was really focused on the lyrics, though, I didn’t yet have what I needed for the experiment. This is where https://genius.com/ comes into play.

all_songs.head()

Lyric and Metadata Collection (via genius.com)

A quick Google search lands us on a library called LyricsGenius, which provides a nice wrapper around the genius.com API. We can then loop through all_songs to collect the lyrics for each song. This API also gives us the album, the release date, any associated URLs, the song writer(s), and any featured artists on the song. A snippet of how this works is below (see GitHub for the full code).

import re
import lyricsgenius as genius  # pip install lyricsgenius

api = genius.Genius("YOUR API KEY GOES HERE", verbose=False)
try:
    # search_song returns None on a miss, so the attribute access below
    # raises and drops us into the except branch
    song = api.search_song(song_title, artist=artist_name)
    song_album = song.album
    song_album_url = song.album_url
    featured_artists = song.featured_artists
    song_lyrics = re.sub("\n", " ", song.lyrics)  # flatten lyrics to one line
    song_media = song.media
    song_url = song.url
    song_writer_artists = song.writer_artists
    song_year = song.year
except Exception:
    # no usable match on genius.com -- record nulls and move on
    song_album = "null"
    song_album_url = "null"
    featured_artists = "null"
    song_lyrics = "null"
    song_media = "null"
    song_url = "null"
    song_writer_artists = "null"
    song_year = "null"

We need the try/except here because there are often discrepancies between how Billboard/Bob Borst store artists/songs and how genius.com stores them (e.g. “and” vs. “&”, prefacing Beatles with “The,” etc.). I handled a few of these after inspecting some of the misses, but overall decided to see how much I got from the original intake of 4900 songs. The API calls aren’t very fast, so iterating through the entire set took around two and a half hours to complete.
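For illustration, the fix-ups look something like this; the exact mappings here are hypothetical, and the repo has the actual list:

def clean_artist_name(name):
    # Hypothetical examples of the Billboard -> genius.com name fix-ups.
    name = name.replace(" & ", " and ")
    if name == "Beatles":        # genius.com stores "The Beatles"
        name = "The Beatles"
    if name == "Jackson 5":
        name = "The Jackson 5"
    return name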

Investigating the Data (Phase 1)

General Data Characteristics

Of the 4900 songs I threw at the genius API, I got 3473 back. I used pandas and seaborn to visualize the distribution of songs from year to year to see how many misses there were and whether they could have an outsize effect on the rest of the experiments.

I ran the API collection twice, once without any substitutions and once substituting Beatles, Jackson 5, and &. The results are below:

  • Without substitution: 3378 Records (68.9% of total records).
  • With substitution: 3473 Records (70.9% of total records). A gain of 95 records.

Some further manual inspection shows that several song titles just don’t match up across the two datasets. We could spend more time going through the exceptions, but we’ll proceed with the knowledge that we don’t have 100% of the dataset.

Songs with Lyrics

You can see from the above that we have the most data coming in for 1990 and the least for 2010. We’ll keep this in mind as we proceed. A sketch of how this per-year chart can be drawn is below.
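This assumes the collected dataframe is called all_songs, that misses have their lyrics set to "null", and that the column names match (all assumptions):

import matplotlib.pyplot as plt
import seaborn as sns

# Songs that actually came back with lyrics, counted per year.
hits = all_songs[all_songs["Lyrics"] != "null"].groupby("Year").size()

plt.figure(figsize=(14, 4))
sns.barplot(x=hits.index, y=hits.values, color="steelblue")
plt.xticks(rotation=90)
plt.ylabel("Songs with Lyrics")
plt.tight_layout()
plt.show()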

Data Preparation

For data preparation there are three things we want to produce: two serve characterization purposes (they support our lightweight assessment of language complexity), and the third, more crucial one is the extraction of nouns, verbs, adverbs, stop words, and special characters from the lyrics so we can perform the core analytics.

Tokenization with SpaCy

SpaCy is an industrial-strength NLP library that really fast-tracks data preparation and can be used for all kinds of other text analytics based on its pre-trained models. I highly suggest reading the primer here.

For this experiment, I’ve written a function that grabs the verb, adverb, noun, and stop word Part of Speech (POS) tokens and pushes them into a new dataset. We then extract them and return an enriched dataset that lets us investigate the data further and is ready to pass through our profanity checks and our aggression ANN. Check out the function called:

def add_spacy_data(dataset, feature_column):

for the full details.

I also use split and set to count the number of words and the number of unique words in each song; a condensed sketch of the whole enrichment step is below. Let’s then take a look at the newly enriched data.
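The column names here are assumptions, and the repo version returns more detail:

import spacy

nlp = spacy.load("en_core_web_sm")  # pre-trained English model

def add_spacy_data(dataset, feature_column):
    # Tokenize each lyric and split the tokens out by part of speech.
    nouns, verbs, adverbs, corpora = [], [], [], []
    for text in dataset[feature_column]:
        doc = nlp(text)
        nouns.append(" ".join(t.text for t in doc if t.pos_ == "NOUN"))
        verbs.append(" ".join(t.lemma_ for t in doc if t.pos_ == "VERB"))
        adverbs.append(" ".join(t.text for t in doc if t.pos_ == "ADV"))
        # "corpus" = lemmas with stop words and punctuation removed
        corpora.append(" ".join(t.lemma_ for t in doc
                                if not t.is_stop and not t.is_punct))
    dataset = dataset.assign(Nouns=nouns, Verbs=verbs,
                             Adverbs=adverbs, Corpus=corpora)
    # split/set word counts for the lightweight complexity measures
    dataset["Word Count"] = dataset[feature_column].str.split().str.len()
    dataset["Unique Word Count"] = dataset[feature_column].apply(
        lambda s: len(set(s.split())))
    return dataset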

Investigating the Data (Phase 2)

We can now see our enriched dataset with more detail:

Enriched Row Example

We see here that verbs, nouns, adverbs, the corpus, word counts, and unique word counts are now available to us. We remove the stop words because they typically don’t carry much meaning on their own, and we’d like to focus on the words that have impact. Let’s look at the word breakdowns in more detail.

Lyrics (Original Content)

[Verse 1] When you're weary Feeling small When tears are in your eyes I will dry them all I'm on your side When times get rough And friends just can't be found Like a bridge over troubled water I will lay me down Like a bridge over troubled water I will lay me down  [Verse 2] When you're down and out When you're on the street When evening falls so hard I will comfort you I'll take your part When darkness comes And pain is all around Like a bridge over troubled water I will lay me down Like a bridge over troubled water I will lay me down  [Verse 3] Sail on Silver Girl Sail on by Your time has come to shine All your dreams are on their way See how they shine If you need a friend I'm sailing right behind Like a bridge over troubled water I will ease your mind Like a bridge over troubled water I will ease your mind

Corpus (Removed Stop words, Punctuation and made Lowercase)

verse 1 when be weary feel small when tear eye I dry I be when time rough and friend not find like bridge troubled water I lay like bridge troubled water I lay verse 2 when be when be street when evening fall hard I comfort I will when darkness come and pain like bridge troubled water I lay like bridge troubled water I lay verse 3 sail silver girl sail Your time come shine all dream way see shine if need friend I be sail right like bridge troubled water I ease mind like bridge troubled water I ease mind

Adverbs

when when when just not when down out when when so hard when all around how right

Nouns

verse tear eye side time friend bridge water bridge water street evening part darkness pain bridge water bridge water time dream way friend bridge water mind bridge water mind

Verbs

be feel be will dry be get can be find will lay will lay be be fall will comfort will take come be will lay will lay sail sail have come shine be see shine need be sail will ease will ease

We’re going to map out word frequencies (total and unique), as well as the average frequency of words used across every year, to see if we can demonstrate our complexity increase and noun evolution over the 49-year spread.

Average Words and Unique Words Per Year

Songs Collected, Average Words, and Unique Words per Year

We can see from the chart above that the number of words in each song has been trending upward from 1970 to 2018, and that, generally speaking, unique words tick upward with the increase in the overall number of words. We can also see that the overall number of songs collected doesn’t seem to have a direct effect on either. We can look at this with a stacked bar chart as well to see if there are any more insights.

Songs Collected, Average Words, and Unique Words per Year (Stacked Bar)

This helps us determine that the lowest number of unique words happened in 1978, and it also supports the hypothesis that (by measure of uniqueness and word counts) lyrics have gotten more complex over time. We can also look at these with matplotlib’s subplot feature to overlay multiple dimensions. This will help us visualize whether there are any overt correlations.

Songs Collected, Average Words, and Unique Words per Year (Multi Axis)

From this view, we can indeed see that unique words and total words follow each other closely, and that the number of songs collected does not appear to have a clear bearing on those values. In fact, when some of the most complex lyrics appear, the collection count is actually relatively low. As we’re averaging both word count and unique word count, if there were an outsize problem caused by the data, we would see dips where we saw collection misses.
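For reference, a minimal sketch of this kind of multi-axis overlay using matplotlib’s twinx, assuming a per-year summary dataframe (the dataframe and column names are assumptions):

import matplotlib.pyplot as plt

# yearly: a dataframe indexed by year with "Songs", "Avg Words",
# and "Avg Unique Words" columns (assumed names).
fig, ax1 = plt.subplots(figsize=(14, 5))
ax1.bar(yearly.index, yearly["Songs"], color="lightgray", label="Songs collected")
ax1.set_ylabel("Songs collected")

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(yearly.index, yearly["Avg Words"], color="steelblue", label="Avg words")
ax2.plot(yearly.index, yearly["Avg Unique Words"], color="darkorange",
         label="Avg unique words")
ax2.set_ylabel("Words per song")
fig.legend(loc="upper left")
plt.show()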

It looks like our most lyrically complex years were 2004 and 2005. Let’s take a look at them below.

Most Words, 2004
Most Words, 2005

We can see here that in both cases the top 5 are Rap/Hip-Hop songs, which makes sense as both of those genres are word-heavy compared to some of the more Pop-oriented songs of the time. You can check the code for more ways to interact with the data, but suffice it to say the results for unique words are similar. I wasn’t able to collect genre information with the songs, but I suspect you’d find these genres were quite popular in this time frame, which would again support the increase in word counts.

Let’s look at a word cloud or two.

I wrote a function that wraps the wordcloud library in a format and font package I like and have pushed some years of data through it here; a sketch of the wrapper is below. I actually use word clouds a lot in day-to-day investigations to identify outliers and terms that could bias the models I build. They can also be quite pretty. PLEASE NOTE: as some of the lyrics contain profanity, it may show up in the word clouds.
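The styling options below (sizing, background, collocations) are assumptions standing in for my format and font package:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_word_cloud(text, title):
    # Render a single word cloud from one big string of tokens.
    wc = WordCloud(width=1200, height=600, background_color="white",
                   collocations=False).generate(text)
    plt.figure(figsize=(12, 6))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# e.g. all corpus text for one year joined into one string:
# plot_word_cloud(" ".join(df.loc[df.Year == 2004, "Corpus"]), "2004")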

We’ll take a look at the lowest complexity and highest complexity years to see what’s most common within each.

1972 Word Cloud
2004 Word Cloud

In the word clouds above, it looks like “Verse” shows up a lot. That’s because the verse markers are in the lyrics as place markers. We could go back and treat them as stop words, but as they appear consistently across the data, we can probably proceed. If we come back again we may want to clean them up. Word clouds are great for catching this.

Now for the most common terms across years.

Nouns over the Years

From the visualization above, it looks like “love” peaked in 1993 and was then replaced by “baby,” which was in turn succeeded by “what,” but that’s really a pronoun, so we can fall back to “time.” “Baby” had a good run in 2012. This supports our hypothesis that the topics of lyrics have changed over time, even if we limit ourselves to words seen in all years.
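For reference, a sketch of how the per-year noun frequencies behind this chart can be tallied (the dataframe and column names are assumptions):

from collections import Counter

def top_nouns_by_year(df, n=5):
    # Most frequent nouns per year, using the spaCy-extracted "Nouns" column.
    return {year: Counter(" ".join(group["Nouns"]).split()).most_common(n)
            for year, group in df.groupby("Year")}

# e.g. top_nouns_by_year(enriched_songs)[1993]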

Profanity Analysis

Now that we understand the data, we know word counts have climbed, topics have changed, and it looks like our collection doesn’t carry much bias from the variance in records per year. We can proceed to our analysis of the frequency of profanity in the lyrics.

A Note on Bias

The dictionary we’re using to detect profanity is based on present-day texts, conversations, and media, so it may have a bias toward more modern songs. We can intuit that songs organically contain more overt profanity today, but I didn’t have a list of older, more covert forms of profanity to use for this experiment. With that in mind, let’s continue.

I loaded a dictionary from www.freewebheaders.com that includes their list of no-no words for sites like Facebook. You can read more at the link, but only open the file if you’re not easily offended; it contains some pretty terrible language. I then iterated through the dataset to see when these words showed up, stored them alongside the lyrics, and counted the frequency of occurrences; a sketch of the counting step is below. The outcomes are visualized in the chart that follows.
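This assumes the list loads as one term per line and that the enriched dataframe has a Corpus column (the filename and column names are assumptions):

# Load the word list (one term per line; filename is an assumption).
with open("facebook-bad-words.txt", encoding="utf-8") as f:
    bad_words = {line.strip().lower() for line in f if line.strip()}

def count_profanity(corpus):
    # Count how many tokens in a lyric's corpus appear in the word list.
    return sum(1 for word in corpus.lower().split() if word in bad_words)

df["Profanity Count"] = df["Corpus"].apply(count_profanity)
profanity_per_year = df.groupby("Year")["Profanity Count"].mean()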

Bad Words per Year

This chart supports our hypothesis that there’s more profanity in recent years, but there are three interesting points here:

  • Profanity rises significantly from 1991 onwards. This may be due to formerly censored content being added to the charts.
  • There are significant jumps in profanity in the mid-90s and the mid-2000s, which is interesting because the pattern recurs across two decades.
  • 2018 is the most profane year on record. As profanity isn’t often linked to positivity in writing, this would seem to support our increased-aggression hypothesis.

Aggression Analysis

For the aggression analysis I found a dataset on Kaggle of short messages tagged as aggressive/not aggressive. I looked for one with covert/overt/non-aggressive labels, but didn’t have any luck.

The dataset has 20,001 messages in it, and after a brief SpaCy treatment (the same approach used for the lyrics) the data was ready to be passed into scikit-learn’s CountVectorizer and then, Bag of Words in hand, passed to a Keras sequential model. You can find a nice lightweight tutorial on getting started with Keras here.

I tried several different configurations for the model, but the most positive impact came from limiting the features to 250, which makes sense given the short nature of the source data and its lack of topical complexity. It may not classify as many songs as aggressive as we would like in a perfect world, but we’re looking for an upward tick in aggression, and the model will be applied across all data equally. A sketch of the vectorization step is below.
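The dataframe and column names here are assumptions:

from sklearn.feature_extraction.text import CountVectorizer

# Cap the vocabulary at 250 features -- the setting that worked best here.
vectorizer = CountVectorizer(max_features=250)
X = vectorizer.fit_transform(troll_df["Corpus"]).toarray()  # bag of words
y = troll_df["Aggressive"].values  # 1 = aggressive, 0 = not (column assumed)

# The same fitted vectorizer must be reused on the lyrics later:
# lyrics_X = vectorizer.transform(songs_df["Corpus"]).toarray()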

The Keras model is pretty deep, and I’ve added multiple dropout layers to help avoid overfitting. Adding more layers yielded slightly better accuracy, and the dataset is small enough that this was fairly easy to test.

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_49 (Dense) (None, 128) 32128
_________________________________________________________________
dropout_25 (Dropout) (None, 128) 0
_________________________________________________________________
dense_50 (Dense) (None, 512) 66048
_________________________________________________________________
dropout_26 (Dropout) (None, 512) 0
_________________________________________________________________
dense_51 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_27 (Dropout) (None, 512) 0
_________________________________________________________________
dense_52 (Dense) (None, 250) 128250
_________________________________________________________________
dropout_28 (Dropout) (None, 250) 0
_________________________________________________________________
dense_53 (Dense) (None, 250) 62750
_________________________________________________________________
dropout_29 (Dropout) (None, 250) 0
_________________________________________________________________
dense_54 (Dense) (None, 250) 62750
_________________________________________________________________
dropout_30 (Dropout) (None, 250) 0
_________________________________________________________________
dense_55 (Dense) (None, 128) 32128
_________________________________________________________________
dense_56 (Dense) (None, 128) 16512
_________________________________________________________________
dense_57 (Dense) (None, 128) 16512
_________________________________________________________________
dense_58 (Dense) (None, 1) 129
=================================================================
Total params: 679,863
Trainable params: 679,863
Non-trainable params: 0
_________________________________________________________________
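Reading the parameter counts back from that summary, the stack corresponds to roughly the following (a summary doesn’t record activations or dropout rates, so the relu/sigmoid choices and 0.5 rates here are assumptions):

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation="relu", input_dim=250),  # 250 bag-of-words features
    Dropout(0.5),
    Dense(512, activation="relu"),
    Dropout(0.5),
    Dense(512, activation="relu"),
    Dropout(0.5),
    Dense(250, activation="relu"),
    Dropout(0.5),
    Dense(250, activation="relu"),
    Dropout(0.5),
    Dense(250, activation="relu"),
    Dropout(0.5),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary output: aggressive / not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])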

There are two Jupyter notebooks in the git repo: one has the collection and analysis code and the other has the ANN training code. If you run this on your own, please make sure to train the ANN before you try to load it into the analysis code. There are examples of how to save, load, and pipeline your models in there; the gist of the handoff is below.
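The filename below is an assumption, and lyrics_X refers to the vectorized lyrics from the earlier sketch:

from keras.models import load_model

# In the training notebook:
model.save("aggression_model.h5")

# In the analysis notebook:
model = load_model("aggression_model.h5")
predictions = (model.predict(lyrics_X) > 0.5).astype(int)  # 1 = aggressive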

Let’s see what our ANN predicted.

Aggressive Songs Per Year

We can see above that our aggression prediction model thinks a lot of songs are aggressive, but the trend on its own looks like it goes down a bit, which is contrary to our hypothesis. We can look at the series overlaid by again using matplotlib’s subplot/multi-axis feature.

Here we can see that when both plots are scaled, the share of songs found aggressive relative to the number collected has been climbing, with the two series inverting around 2002. We can rightfully be skeptical of our model’s overall prediction accuracy, but this kind of lightweight approach on a distant-but-available dataset (cyberbullying messages) can still inform us. In this case I think there are enough indicators to make me want to look for richer datasets, and more sophisticated approaches, for building an aggression detection model.

Conclusions

So here we are. We’ve collected our own seed data, used it to pull more data from an API, prepped the data for text analysis, checked it against a dictionary of profane words, built an ANN to detect aggression, and run it against our data. Let’s revisit our hypotheses to see what we’ve learned.

  • Word quantity and language complexity have increased from 1970 to 2018.
  • Supported. We can see by measures of frequency and uniqueness that the lyrics have become more complex.
  • Several common nouns appear throughout the entire spectrum of lyrics, but the most common ones change over time.
  • Supported. We now know that Love lost value in 1996, but never truly went away, and Baby had its best years in 1993 and 2012.
  • Adverbs have become more aggressive/forward over time.
  • Not supported. I didn’t even graph this because the data was so inconclusive. Feel free to take a look at the git repo and explore.
  • Profanity in lyrics has increased significantly in the past 20 years (1998 to 2018).
  • Supported, with 2018 being the most profane year on record.
  • Songs are more aggressive now than they were in the 1970s.
  • Possibly supported. We’re suspicious of the accuracy of our ANN (82% on its own data) versus the lyrics dataset, but it does support the need for more research.

Thanks for reading, let me know what else you’d like to see!
