Melting Faces With NLP

Discovering themes in 90’s grunge with python

Kendon Darlington
Towards Data Science


Image by author, inspired by the photography of Jeff Kravitz.

What a time to be alive

The 90’s brought us some seriously killer stuff. This is the decade where both Python and R sprang to life! The 90’s also gave us Google, Windows 95, and the first text message. On the downside, Stack Overflow wasn’t founded until 2008, so I’m not sure how anyone actually used Python or R.

I guess you actually had to be smart to program back then, but I was just a teenager. A teenager who cared not for the adult trappings of having a career, developing skills, or writing articles about having a career and developing skills. However, I did care about one thing in the 90’s: rock & roll.

This is the decade that brought us Nirvana, The Smashing Pumpkins, Alice In Chains, Stone Temple Pilots, Soundgarden, Pearl Jam, and many many many more great bands. But Kurt Cobain warned us:

"He's the one who likes all our pretty songs
But he knows not what it means"

Challenge accepted.

The plan

Before we try to figure out the meaning of the 90’s grunge movement, we need to slow it down a bit. Analyzing the feeling behind words is referred to as sentiment analysis. There are a few steps along the way before we can start searching through words, phrases, sentences, songs, and albums for sentiment. I intend for this to be a multi-part series, each diving into a different aspect of NLP (Natural Language Processing).

Don’t despair! The early steps of NLP can also show us interesting themes in text that might be invisible to the naked eye (or, in the case of music, the naked ear). Our individual experiences reading text or listening to music are little snapshots in time, and as our attention focuses on the current moment, the moments that came before it quickly blur in our memories. NLP gives us the ability to view all of these blurry moments at once, bringing them into focus and putting them into perspective relative to each other.

To generate and capture these themes, we are going to use phrase frequency analysis. This is how it’s going to work:

  • Import the data. This will be the lyrics from 6 great 90’s grunge albums
  • Take these blocks of text and break them down into lists of words
  • Remove words we don’t care about
  • Take these words and get them into 3 word phrases
  • Count the frequency of each phrase’s occurrence
  • Visualize this analysis in some cool looking word clouds and see what themes rise to the top

PS: Stick around until the end and I will show you how I made that sick Kurt Cobain guitarcloud image at the top of the article!

Getting Started

Here is a link to the full project on GitHub. It includes the folder, code, and everything you need to follow this tutorial.

Here we go! Start out by creating a folder called “Phrase Analysis”. This is where everything is going to go. I prefer to use Python’s Anaconda distribution, with Spyder as my IDE, but you can use whatever you feel comfortable with.

Go to the GitHub for this project and download the data LINK. The data for this project is an Excel file with only 3 rows and 2 columns: band name and lyrics. The lyrics of two albums are crammed into each lyrics cell to make this easy. All of this data was downloaded from LyricsOnDemand.com, which has the lyrics for entire albums. These are the albums we will be analyzing:

  • Nirvana: Nevermind & In Utero
  • Smashing Pumpkins: Siamese Dream & Mellon Collie and the Infinite Sadness
  • Pearl Jam: Ten & Vs

This is what the data looks like:

Image by author

Before we can get started coding, we will need to install some packages. These are the ones you want.
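Judging from the libraries this project leans on, the install lines look something like this (the package list is my best guess; tweak it if anything errors out):

```shell
pip install pandas
pip install nltk
pip install stylecloud
pip install wordcloud
```

NLTK also needs its data files for tokenizing and stopwords; running `nltk.download('punkt')` and `nltk.download('stopwords')` once in Python fetches them.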

Many of these come with Anaconda. However, you may need to fool around with a few of them to get them to install. Just run each of these lines of code one by one. In Spyder you can highlight a line of code and hit F9 to run just the highlighted portion. Other IDEs might call this ‘run selection’ or ‘run current line’. Monitor the console for errors, and google anything red. We don’t have 1998 Google, we have this spiffy, newfangled 2021 Google, so the answers to your Python problems are always just a few searches away…

Next up, we have the one bit of code I can’t write for you. Remember that folder called “Phrase Analysis” we created? Let’s point our working directory to that folder. Simply replace ‘PathToTheFolderYouCreated’ with, you guessed it, the path on your computer.
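Something along these lines (as a runnable sketch this builds a “Phrase Analysis” folder under your home directory; swap in the path to the folder you actually created):

```python
import os

# Point Python's working directory at the "Phrase Analysis" folder.
# Replace this stand-in with the path to the folder you created.
folder = os.path.join(os.path.expanduser('~'), 'Phrase Analysis')
os.makedirs(folder, exist_ok=True)   # create the folder if it isn't there yet
os.chdir(folder)
print(os.getcwd())
```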

That might look like this if your name were Jessie and you created the folder in My Documents: ‘C:\\Users\\jessie\\Documents\\Phrase Analysis’

Why do we set a working directory? It lets us use relative paths for the rest of the code in this project, and cuts down on redundant file path typing, producing cleaner looking code. Now to access a file you just type ‘file.txt’ rather than having to hardcode ‘C:\\Users\\jessie\\Documents\\Phrase Analysis\\file.txt’.

Backstage Pass to Loopapaloza

Next up, we are going to pull in the data. Let’s store it in a Pandas dataframe called dfAlbums. Pretty soon we are going to loop through the rows of dfAlbums (there are only 3) and perform NLP on each block of text.

We will also create dfNgramFrequency to store the results of our loop. For now, this line of code defines an empty dataframe with the columns Band, Phrase, and Frequency. This will be the home of our frequency analysis. I define this home outside of the loop so that it won’t be overwritten inside it!
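Those two pieces might look like this as a sketch (the real project reads the Excel file from the GitHub repo; here a small stand-in dataframe is hard-coded so the snippet runs on its own):

```python
import pandas as pd

# In the real project this would read the spreadsheet, e.g.:
# dfAlbums = pd.read_excel('albumLyrics.xlsx')   # filename is an assumption
# Stand-in data so this sketch runs on its own:
dfAlbums = pd.DataFrame({
    'band': ['Nirvana', 'Smashing Pumpkins', 'Pearl Jam'],
    'lyrics': ['hey wait i have got a new complaint',
               'despite all my rage i am still just a rat in a cage',
               'oh i oh i am still alive'],
})

# An empty home for the frequency analysis, defined outside the loop
# so the loop appends to it instead of overwriting it.
dfNgramFrequency = pd.DataFrame(columns=['Band', 'Phrase', 'Frequency'])
```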

Now it’s time for a big block of code! Are you ready for it? Do big blocks of code crank your anxiety up to 11? It’s all good, because I’m about to explain what each and every line does.

for index, row in dfAlbums.iterrows():

This is the loop. It says: for each row in our dataframe of bands and their lyrics, do stuff. This means each loop iteration is a band! Nirvana goes first, then Smashing Pumpkins, then Pearl Jam. We will call this all-star lineup Loopapaloza, and it’s gonna sell out.

band = dfAlbums.loc[dfAlbums['band'] == row[0]]['band'].item()
lyrics = dfAlbums.loc[dfAlbums['band'] == row[0]]['lyrics'].item()

This takes the current loop’s band and lyrics and stores them in variables. It’s just easier to parse this out once with Pandas so we can work with variables going forward. Less code is better.

A few mildly interesting things here:

  • .loc: this helps us access the contents of the dfAlbums dataframe by selecting a column. In our case we are filtering on the band column.
  • .item(): allows us to store the data from an individual dataframe cell as a string variable. Without it, Pandas would store the information as a Series, which isn’t what we want.

lyrics = re.sub(r'[^\w\s]', '', lyrics)           

First, we take our lyrics and remove any punctuation. Punctuation really doesn’t do us any favors for phrase frequency analysis; it is just noise. I am using a tiny line of regex to ditch the punctuation. Don’t believe it’s that easy? Type this in the console and hit enter:

re.sub(r'[^\w\s]', '', "Hey! Wait! I've gotta backpropagate.")

Image by author

stopWords = set(stopwords.words('english'))    

Next we create a set of words called ‘stopWords’. In NLP, stopwords are literally words that you stop on, meaning: don’t use them. Generally these are words that carry little meaning on their own, or don’t contribute to the sentiment of a sentence. Type ‘stopWords’ into the console to crack it open and see the types of words in there:

Now, I am not doing this here, but it is very common to add your own stopwords to the list. You can do this to remove outliers from your data, or to add company-specific lingo that isn’t valuable for discovering sentiment. If we wanted to add the word ‘cheese’ to this list we would type:

stopWords.add("cheese")

The cheese is now gone.

wordTokens = word_tokenize(lyrics)

Next we will break our big ol’ chunk of lyric text down into a list of words. This process is called tokenizing.

sentenceNoStopwords = [w for w in wordTokens if not w.lower() in stopWords]

Finally, we remove the stopwords from the list of lyrics and store the result in ‘sentenceNoStopwords’. This is a simple list comprehension that effectively says: ‘for each word in wordTokens, keep it only if it’s not in stopWords’.

listOGrams = []
n = 3
gramGenerator = ngrams(sentenceNoStopwords, n)
for grams in gramGenerator:
    listOGrams.append(grams[0] + ' ' + grams[1] + ' ' + grams[2])

Now we get to the good stuff! First we define a list to store our phrases, then we loop through our list of words (sentenceNoStopwords) and store every three-word phrase in listOGrams. When you see ‘gram’ in NLP, think of it as a unit of text, usually a word. Our n is set to 3, so the n-grams are 3-word phrases, aka ‘trigrams’.

If we set n to 2 it would be two-word phrases, also known as ‘bigrams’. If we set it to 1 it would be single words, aka ‘unigrams’. Finally, if we set it to -1 it would rewind time, hurtling the entire universe and all of its matter back towards the singularity of the big bang. We would call this the apocalypseogram.

Actually, python would probably just crash.

df = pd.DataFrame(listOGrams, columns = ['Phrase'])

Finally, we take listOGrams and convert it into a Pandas dataframe. This dataframe has just one column called ‘Phrase’. Stay with me, we are almost there!!!

df = df.groupby(['Phrase']).size()
df = df.to_frame()
df['Phrase'] = df.index
df.reset_index(drop=True, inplace=True)

These lines do our aggregation with pandas. Actually, just the first line is aggregating, counting up how many times each phrase was found in the text. The rest of the lines are just stupid things you have to do with Pandas to end up with a clean dataframe.

When you aggregate in pandas, it likes to send your groupings (in our case, phrase) to the index. We just need to shuffle things around when we’re done counting: move the phrase from the index back to a column, then reset the index so it doesn’t look sloppy.

df = df.rename(columns = {0: "Frequency"})
df = df.sort_values(by = 'Frequency', ascending = False )

df['Band'] = band

We rename our aggregation column to ‘Frequency’, then sort on that value in descending order. We also create a column called ‘Band’ that is filled with the band variable we created a long time ago. We now have everything we need to populate our final table.

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
dfNgramFrequency = pd.concat([dfNgramFrequency, df], ignore_index = True)

This takes our aggregation and appends it to dfNgramFrequency, which was defined outside of our loop. See, that wasn’t so hard, was it?
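Zoomed out, the whole loop body boils down to something like this dependency-free sketch (a toy stopword set and str.split() stand in for NLTK’s stopwords and word_tokenize, and one hard-coded lyric stands in for the loop over dfAlbums):

```python
import re
from collections import Counter

# Toy stand-ins so this runs without NLTK downloads; the real project uses
# stopwords.words('english') and word_tokenize.
stopWords = {'she', 'he', 'i', 'a', 'the', 'am', 'is', 'im', 'hey'}
lyrics = "She said, she said, she said, how low? How low? How low?"

lyrics = re.sub(r'[^\w\s]', '', lyrics)                        # ditch punctuation
wordTokens = lyrics.split()                                    # tokenize
words = [w for w in wordTokens if w.lower() not in stopWords]  # drop stopwords

# Build three-word phrases (trigrams) and count how often each one occurs
n = 3
trigrams = [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]
frequency = Counter(trigrams)

print(frequency.most_common(3))
```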

Visualizing The Results

We can finally visualize the results of our phrase frequency analysis. But before we jump into that, let’s type the following line into the console to take a peek at our new data:

dfNgramFrequency

Image by author

Here we have a dataframe with 3 columns: band, phrase, and frequency of phrase occurrence. Phrase and frequency just happens to be one of the 2 formats accepted by the totally awesome stylecloud library.

Stylecloud is a package that takes data and produces some nice looking wordclouds. There are two ways to pass data into a stylecloud.

  1. Pass raw text into the library
  2. Pass a csv containing phrase and frequency into the library

Why wouldn’t I take the first option? Styleclouds have their own stopword functions, so we could have just used those. But remember that this is the first in a series of articles on Natural Language Processing. I like to have complete control over what happens to my data when performing NLP, and I feel like Python’s excellent NLTK library (which is what we used) is the right tool for the job.

The only bummer is we can’t pass a dataframe straight into stylecloud; it wants a CSV, despite my best attempt at hacking it (and by best attempt I mean I tried for about 30 seconds and gave up). As a workaround, I just write the data to CSVs in our working directory and have stylecloud read from them.

Another benefit of the stylecloud library is we can get the clouds into the shapes of font awesome icons. These are pretty looking icons commonly used in web and app development. If you happen to be one of those silicon valley data scientists with your FANG stock options stacked to the moon you can go ahead and spend $99 a year unlocking all of the icons.

For the rest of us doing data science on a budget, the free ones will suffice. Enough talk, time to get stylish!

Opening Act: Nirvana

Allow me to introduce our opening act at Loopapaloza: Nirvana:
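That looks something like this (the CSV and output filenames are my assumptions, and the small stand-in dataframe is only there so the snippet runs on its own; in the project, dfNgramFrequency comes from the loop above):

```python
import pandas as pd

# Stand-in results so this sketch runs on its own
dfNgramFrequency = pd.DataFrame({
    'Band': ['Nirvana', 'Nirvana', 'Pearl Jam'],
    'Phrase': ['said said said', 'forever debt priceless', 'glorified version pellet'],
    'Frequency': [28, 12, 9],
})

# Limit to Nirvana and write phrase + frequency to a CSV in the working
# directory, since stylecloud reads word/frequency pairs from a CSV file.
dfNirvana = dfNgramFrequency[dfNgramFrequency['Band'] == 'Nirvana']
dfNirvana[['Phrase', 'Frequency']].to_csv('nirvana.csv', index=False)

try:
    import stylecloud
    # Shape the cloud like Font Awesome's Python icon. The first run needs
    # an internet connection to fetch the icon font.
    stylecloud.gen_stylecloud(file_path='nirvana.csv',
                              icon_name='fab fa-python',
                              output_name='nirvanaCloud.png')
except Exception as err:   # stylecloud not installed, or offline
    print('stylecloud step skipped:', err)
```

Swapping the band filter and the icon_name (e.g. ‘fab fa-r-project’ for R) produces the clouds for the other two acts.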

Very briefly, let’s review what this code is doing. The first line takes our dfNgramFrequency dataframe and limits it to just the rows where the band is Nirvana. We then write these to a CSV in our Phrase Analysis directory. Stylecloud then reads from this file to generate the cloud. The “icon_name” parameter determines which Font Awesome icon to use; you can click on any icon on their website to copy this identifier. Finally, we display the image.

The results:

You are looking at the phrase frequency analysis of ‘Nevermind’ and ‘In Utero’ molded into the shape of the most face-melting programming language of all time: Python.

Looking at the cloud, you’ll notice that the trigram with the highest frequency must be “Said Said Said”. There is just one problem with this: Kurt Cobain never said this phrase…

This comes from the lyrics of the Nevermind song ‘Breed’, which repeats the two-word phrase ‘she said’ over and over. But why does our analysis say ‘said said said’ instead of ‘she said she’ or ‘said she said’? Stopwords!

Stopword lists consist largely of pronouns (words like ‘I’, ‘she’ & ‘he’) and other function words. I actually don’t think this is the end of the world. Removing stopwords helps us reduce noise in our dataset and identify themes. Themes don’t necessarily have to be words in the exact order they came out of the artist’s mouth; themes go beyond that. If you were to remove the stopword code from our project, the wordclouds would become much less interesting, and covered in phrases that don’t really have much meaning.

However, if you want to consider ‘said said said’ to be an outlier, simply add this word to our stopwords list and re-run the code.

stopWords = set(stopwords.words('english'))
stopWords.add("said")

Image by author

There, it’s gone. Another thing you might notice is that words from the choruses of songs are more prominent than words from the verses (e.g. “Forever Debt Priceless” from the chorus of ‘Heart-Shaped Box’). This is because in typical song structure, choruses are repeated many times, while verses tend to be unique. I’ll leave this as homework for the reader (that’s you, BTW): you could tediously remove all occurrences of each chorus after the first (getting it down to one chorus per song).

The results may be interesting, but I don’t think it would help us find meaning or themes any better. Just thinking through the problem, I assume the artists chose those words for the chorus because they are the most important part of the song, so they should pop more than the verses.

When I look at this stylecloud, I see themes around getting away, happiness, mentally cracking, guns, and complaining. What do you see?

Second Act: The Smashing Pumpkins!

Nirvana’s set is over! It’s time for the second act at Loopapaloza:

The results:

These are the lyrics to ‘Siamese Dream’ and ‘Mellon Collie and the Infinite Sadness’ compressed into the shape of the grungiest programming language of them all: R.

We can see clear themes around love, death, freedom, belonging, trust, and holding back. Billy Corgan gets deep man.

The Headliner: Pearl Jam

Finally, the headliner takes the stage at Loopapaloza:

The results:

Now I don’t know about you, but I can’t do data science without my glasses. Wearing glasses is probably the most rocking thing you could ever do. The lyrics from Pearl Jam’s ‘Ten’ and ‘Vs’ albums show themes around glorification, being alive, children, and, apparently, pellet guns.

Encore!

Art and data join forces.

Now you may be wondering how I made that cool image at the top of the article. I made it with a combination of Python, Photoshop, and just a sliver of artistic skill. Some may say that there is no place for Photoshop in data science, but I would disagree.

I believe it’s our job as data scientists to capture the attention and imagination of our audiences. When you blend art and data, your work can stand out in the sea of bar graphs we are assaulted with every day. While a Kurt Cobain wordcloud may not tell a story any better than a bar graph, it has novelty. The human mind is hard-wired to remember and focus on novel things it hasn’t seen before as a survival mechanism. Let’s tap into this glitch of human nature and use it to our advantage!

You can make this image in 4 easy steps!

  1. Buy hundreds of dollars in digital art equipment such as a drawing tablet and Photoshop subscriptions.
  2. Get good at digital art. Draw a bunch of stuff for the image.
  3. Use your art as clipping masks for wordclouds.
  4. Blend it all together in Photoshop.

See, easy!

Why did I choose Kurt Cobain? There is just some awesome imagery and photography from the In Utero era. There are some cool photos and videos of Kurt performing in front of a life-size mannequin version of the In Utero album art, making it appear as though he has angel wings. Jeff Kravitz took some great photos of these moments; you should check out his work here!

Here is how to do it:

Step 1: Draw a left handed Fender Mustang.

Image by Author.

Step 2: Draw some wings

Image by author.

Step 3: Draw a stylistic rendering of Kurt Cobain. Detail is not important here. We want our eyes to be drawn to the guitar and wings.

Image by author, inspired by the photography of Jeff Kravitz and Kurt Cobain.

Step 4: Python!

Next we need to fill the wings and the guitar with words.

Here I am using the wordcloud library instead of stylecloud, because wordcloud allows you to generate wordclouds inside of masks. What is a mask? You take a blacked-out image and the wordcloud fills its shape. In digital art, this is sometimes referred to as using ‘clipping masks’. This assumes you have the above images in the Phrase Analysis folder (they are in the GitHub project!).

I also wanted to set the words to the title of this article:

Melting Faces with NLP
Finding themes in 90s grunge with python
By: Kendon Darlington

You will notice in my code that I repeat these words a lot to make them bigger in the clouds. I also list a bunch of data science jargon once each; this makes those words really small and serves as a bed of data science underneath my article headline. Finally, I make sure to repeat my name the most so it’s the biggest word, because I am a self-absorbed Oregon Trail Millennial and thrive on attention.
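Putting that together, here’s a sketch of the mask idea (a circle drawn with numpy stands in for the blacked-out guitar drawing, and the filenames are my assumptions):

```python
import numpy as np

# Stand-in mask: a black filled circle on white, playing the role of the
# blacked-out guitar drawing. In the project you'd load the artwork, e.g.
#   mask = np.array(Image.open('guitar.png'))   # filename is an assumption
h = w = 400
y, x = np.ogrid[:h, :w]
mask = np.full((h, w), 255, dtype=np.uint8)   # white areas are masked out
mask[(x - w // 2) ** 2 + (y - h // 2) ** 2 < 150 ** 2] = 0   # words go here

# Repeat the headline words so they render large; list jargon once each so
# those words render small, as a bed underneath.
text = ('Melting Faces with NLP ' * 8 +
        'Kendon Darlington ' * 12 +
        'tokenize ngram stopword trigram pandas dataframe regex corpus ')

try:
    from wordcloud import WordCloud
    WordCloud(mask=mask, background_color='white').generate(text) \
        .to_file('guitarCloud.png')
except ImportError:
    print('wordcloud not installed; pip install wordcloud')
```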

The new guitar:

Image by author.

The new wings:

Image by author.

Now these images are cool enough on their own, but to really make it all pop, we need to tie it all together. You will need to layer all of the images in Photoshop, redraw the outlines of the guitar and wings to make them smoother, and add a few effects like pink sound waves coming from his mouth. I wish I could show all those steps here, but it would be a Photoshop tutorial all on its own!

The end result:

Image by author, inspired by the photography of Jeff Kravitz.

Download the full project on GitHub. Link.


I am a data scientist who loves to find creative ways to make data tell its story.