
Machine Learning and The Beatles

Using NLTK and GPT-2 to analyze and generate The Beatles lyrics

Image via Pixy.org

Intro 🌅

For my final project at Metis, I decided to combine my passions for music and data science into one project involving the best band of all time. The project involved Natural Language Processing (NLP) with the NLTK library, text generation by applying transfer learning to OpenAI’s GPT-2 model, and an application built with Streamlit. The data I used were the lyrics from The Beatles’ discography on Genius.com, plus supplementary data from the web about who sang and composed each track. In this blog post, I will walk through each of these pieces.

Background 🎵

At first I thought the idea was novel, using NLP to dive deeper into The Beatles’ catalogue, but a Google search showed me how many people have already explored this topic. To diversify my project, I produced an app that lets the user choose which Beatle to base the lyric generation on. Additionally, I made sure to include some insights about each Beatle and what made their songs unique.

Data Collection: lyricsgenius API 📝

First, I want to mention that Genius.com has its own API; however, I found it difficult to use for my purposes. It requires you to know a specific ID for whatever you are searching for (artist ID, album ID, song ID). As I started digging through Genius.com’s HTML for IDs corresponding to The Beatles’ works, I thought to myself, "there has to be an easier way." While looking for alternatives, I found the lyricsgenius package.

lyricsgenius is fairly simple to use: it is essentially a wrapper around the Genius.com API that allows for more flexibility when searching for lyrics without the ID information. You still have to obtain an API key from Genius and feed it into lyricsgenius for the package to work properly.

And to show you how simple it is to use, let’s take a quick look at my code:

import lyricsgenius
genius = lyricsgenius.Genius("[YOUR KEY HERE]")

The code above outlines the basic set-up: first, import lyricsgenius; second, pass your Genius API key to the Genius class and assign the resulting client to the genius variable. Now, we are ready to get some lyrics!

# Save the album titles to a list
album_titles = [
   'Please Please Me',
   'With the Beatles',
   "A Hard Day's Night (US)",
   'Beatles for Sale',
   'Help!',
   'Rubber Soul',
   'Revolver',
   "Sgt. Pepper's Lonely Hearts Club Band",
   'Magical Mystery Tour',
   'The Beatles (The White Album)',
   'Yellow Submarine',
   'Abbey Road',
   'Let It Be'
]
# Loop through the album titles and save each album's lyrics
for title in album_titles:
    album = genius.search_album(title, "The Beatles")
    album.save_lyrics()

As the loop runs, the album.save_lyrics() method automatically saves all of the metadata for each album, lyrics included, to a .json file in the current directory.

As a side note, the genius.search_album() method isn’t foolproof. For example, I had to manually find the album ID for ‘Sgt. Pepper’s Lonely Hearts Club Band’ and insert it explicitly to get the proper information.

# Used the Genius API to find the album ID for Sgt. Pepper's Lonely Hearts Club Band
album = genius.search_album(album_id=11039, artist="The Beatles")
album.save_lyrics()

The .json files containing the album information are very dense. For my purposes, I built a function to take each file, clean the lyrics, and output only the information I wanted into a DataFrame for easy use.
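My parsing function isn’t shown here, but a minimal sketch might look like the following, assuming the structure lyricsgenius writes out (a top-level tracks list where each track holds a song with its title and lyrics) and its default Lyrics_*.json file names:

import glob
import json
import re

import pandas as pd

def album_json_to_df(path):
    """Pull song titles and lightly cleaned lyrics out of one saved album .json."""
    with open(path) as f:
        album = json.load(f)
    rows = []
    for track in album["tracks"]:
        song = track["song"]
        lyrics = song.get("lyrics") or ""
        lyrics = re.sub(r"\[.*?\]", "", lyrics)           # drop [Verse 1]-style labels
        lyrics = re.sub(r"\n{2,}", "\n", lyrics).strip()  # collapse leftover blank lines
        rows.append({"album": album["name"], "title": song["title"], "lyrics": lyrics})
    return pd.DataFrame(rows)

# Combine every saved album into one DataFrame
songs_df = pd.concat(
    [album_json_to_df(f) for f in glob.glob("Lyrics_*.json")], ignore_index=True
)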

Obtaining the composer information wasn’t as interesting. Using pandas, I read a web page containing that information into a DataFrame, restructured it to match the DataFrame holding the more in-depth song information, then joined the two DataFrames on song title. As always, there was some necessary data cleaning.
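As a rough sketch, the pandas side of that might look like this; the URL and column names below are placeholders for whatever page you pull from:

import pandas as pd

# Placeholder URL: any page with a table of Beatles songs and their writers
composers = pd.read_html("https://example.com/beatles-songwriters")[0]
composers = composers.rename(columns={"Song": "title_key", "Writer(s)": "composer"})

# Normalize titles on both sides so the join keys line up
composers["title_key"] = composers["title_key"].str.strip().str.lower()
songs_df["title_key"] = songs_df["title"].str.strip().str.lower()

# Join the composer info onto the lyrics DataFrame
songs_df = songs_df.merge(
    composers[["title_key", "composer"]], on="title_key", how="left"
)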

NLP: A Pipeline and Topic Modeling 🥁

In my previous blog post, I touched on the basic elements of the NLP process (tokenizing, stemming, vectorizing). I won’t dive into it too much here, since I followed a very similar process for this project. First, I created an NLP pipeline that takes all of the documents and cleans, stems, tokenizes, and vectorizes them. A pipeline makes it easy to tweak individual steps of the NLP process and compare the results.
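As a rough sketch of what such a pipeline can look like, here is one possible version using NLTK for stopword removal and stemming and scikit-learn for vectorizing (the exact steps and parameters are yours to tweak):

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_and_stem(doc):
    """Lowercase, keep word characters, drop stopwords, stem each token."""
    tokens = re.findall(r"[a-z']+", doc.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

# The vectorizer runs the cleaning step on every document before tokenizing
vectorizer = TfidfVectorizer(preprocessor=clean_and_stem, min_df=2)
doc_term = vectorizer.fit_transform(songs_df["lyrics"])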

In this case, I wanted to uncover topics within The Beatles’ corpus. Much like my previous NLP project, I went through the trial-and-error process of creating topics, analyzing them, and adjusting the pipeline until the topics actually made sense. From The Beatles’ documents, I uncovered five general topics to group the songs into: desire, reality, relationships, home, and fun.
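For illustration, NMF on the TF-IDF matrix from the pipeline above is one common way to pull out topics like these:

from sklearn.decomposition import NMF

# Five components matched the topics I could actually interpret;
# treat the number as a knob to turn during trial and error
nmf = NMF(n_components=5, random_state=42)
doc_topic = nmf.fit_transform(doc_term)

# Print the top words per topic to help with naming them
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(nmf.components_):
    top = component.argsort()[-8:][::-1]
    print(f"Topic {i}:", ", ".join(terms[t] for t in top))

# Tag each song with its strongest topic
songs_df["topic"] = doc_topic.argmax(axis=1)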

Here are some examples:

  • Desire: "I Need You" – Help! If the title doesn't give it away, the song is all about the narrator needing the subject back in their life.
  • Reality: "Drive My Car" – Rubber Soul. "Drive my car" and "working for peanuts" work as metaphors, but the song is also about life being hard: working for a living while wanting to be famous.
  • Relationships: "Girl" – Rubber Soul. All about a girl, how she came into the narrator's life as a strong woman, and how the narrator falls for her anyway.
  • Home: "Golden Slumbers" – Abbey Road. A song McCartney wrote at home after his mother passed away, all about the concept of home and sleep.
  • Fun: "Your Mother Should Know" – Magical Mystery Tour. A song about hit songs across generations, all about singing and dancing.

NLP: Sentiment Analysis 🎸

For sentiment analysis, I used NLTK’s VADER SentimentIntensityAnalyzer. From this module, I utilized the polarity_scores() method, which returns four scores: compound, negative, neutral, and positive. The compound score combines the other three into a single metric on a scale from -1 (negative sentiment) to +1 (positive sentiment), with 0 being neutral. With the compound score, one can easily visualize sentiment.
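Using it takes only a few lines. Here is a minimal example; the per-song scoring at the end assumes the songs_df DataFrame from earlier:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# polarity_scores returns all four metrics as a dict
print(sia.polarity_scores("Here comes the sun, and I say it's all right"))
# {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}

# Score every song, keeping just the single compound number
songs_df["sentiment"] = songs_df["lyrics"].apply(
    lambda text: sia.polarity_scores(text)["compound"]
)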

Image by Author

The graphic above shows the average sentiment per album grouped by singer, with each dotted line marking an album’s release.
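A rough approximation of that kind of chart (not the exact one above) can be built with matplotlib, assuming songs_df also carries a singer column and that the album names match the album_titles list from the data-collection step:

import matplotlib.pyplot as plt

# Average compound sentiment per album, one line per singer
avg = (
    songs_df.groupby(["album", "singer"])["sentiment"]
    .mean()
    .unstack("singer")
    .reindex(album_titles)  # put the albums in release order
)
ax = avg.plot(marker="o", figsize=(12, 5))
ax.set_xlabel("Album (release order)")
ax.set_ylabel("Average compound sentiment")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()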

It is very easy to implement VADER sentiment analysis; however, my knowledge of the specifics behind it is fairly limited. For a deeper dive into VADER sentiment analysis, I would recommend this blog.

Text Generation: GPT-2 ⌨️

Now on to the text generation. First, don’t be intimidated by the implementation: the tooling around GPT-2 makes it remarkably easy to generate convincing text. GPT-2 itself can do many different things, but for text generation specifically, there are only a few simple steps to follow. I successfully ran my first iteration with this Google Colab notebook by Max Woolf, and I would suggest you do the same.

GPT-2 itself has a few different sized pre-trained models:

  • Small – 124M parameters / ~500MB
  • Medium – 355M parameters / ~1.5GB
  • Large – 774M parameters / ~3GB
  • XL – 1.5B parameters / >5GB

This is where a Google Colab notebook really comes in handy. Unless you have a GPU, your local machine is unlikely to match even the lowest tier of computing power in a Google Colab notebook. So, I would suggest running through the Colab notebook I linked in the previous paragraph.

Anyway, what is GPT-2? It is a pre-trained model (Generative Pre-trained Transformer 2 is the full name) that uses deep learning to handle many NLP tasks. Training a neural network from scratch requires a lot of data, time, and computing resources, so having access to GPT-2 is a phenomenal way to solve NLP tasks quickly and efficiently. Tailoring a pre-trained model to one’s needs like this is called transfer learning. Out of the box, GPT-2 is ready to generate text, but you can fine-tune it on a specific corpus so that the generated text more closely resembles that corpus.

To generate The Beatles’ lyrics, I simply fed in the corpus and let the model learn from The Beatles to create songs. GPT-2 is very perceptive, so it will pick up on nuances in the lyrics, such as section labels (‘chorus’, ‘verse’, etc.) or singer names (‘Harrison’, ‘Lennon’, etc.). With this in mind, it is important to tailor your input to get your desired output. As they say, "garbage in, garbage out".
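Max Woolf’s Colab notebook is built on his gpt-2-simple package, and a fine-tuning run looks roughly like this (the lyrics file name and step count below are placeholders):

import gpt_2_simple as gpt2

# Fetch the smallest pre-trained model (124M parameters)
gpt2.download_gpt2(model_name="124M")

# Fine-tune on a plain-text file of lyrics
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, dataset="beatles_lyrics.txt", model_name="124M", steps=1000)

# Generate a song continuing from a prompt
gpt2.generate(sess, prefix="In my life", length=200, temperature=0.8)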

App Building: Streamlit 🕸

I can’t tell you how much I love Streamlit. It is such an easy way to build a web app and even create dashboards. Streamlit is especially great for the beginner programmer, easily taking less than an hour to get something basic down. With that being said, the customization is limited, and I had to use HTML and CSS to modify the page to make it look how I wanted.

As for the code, I won’t dive into every aspect, but there are some things I’d like to point out.

Creating the sidebar on the left is as easy as the code below. There are a few more imports in my final code, but you only need Streamlit imported for basic app development. Note that the selectbox’s return value can be saved to a variable for later use. In this case, when you select a different singer, the layout of the page changes, and the corresponding GPT-2 model is loaded for that singer.

# Import streamlit
import streamlit as st

# Select who will generate the song
singer = st.sidebar.selectbox(
    "Who is singing the song?",
    (
        "The Beatles",
        "John Lennon",
        "Paul McCartney",
        "George Harrison",
        "Ringo Starr",
    ),
)

The only other code I want to highlight concerns formatting. Streamlit lets you customize the page fairly well, but I couldn’t find a built-in way to use my own images on the page, so I searched around and found that HTML in a markdown call would do the job. I expect Streamlit will eventually add a built-in solution, but for now, this is the way.

# Set background image
st.markdown(
  """
  <style>
  .reportview-container {
    background: url("[IMAGE URL]");
    background-position: right;
    background-size: cover;
    background-repeat: no-repeat;
    background-blend-mode: lighten;
  }
  </style>
  """,
  unsafe_allow_html=True,
)

In the Python file, I added a lot more to get this app working. Essentially, the user chooses the singer of the song, and the app loads the trained GPT-2 model corresponding to that singer (each model having been trained on the lyrics of songs that individual sang). Next, you enter a prompt and click generate, and GPT-2 does the rest.
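That wiring isn’t shown in full here, but a stripped-down sketch of the flow might look like this; the per-singer run names are an assumption about how the fine-tuned checkpoints were saved:

import streamlit as st
import gpt_2_simple as gpt2

# 'singer' comes from the sidebar selectbox shown earlier;
# the run name is a placeholder for that singer's fine-tuned checkpoint
run_name = singer.replace(" ", "_").lower()
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name=run_name)

prompt = st.text_input("Give the song a starting line:")
if st.button("Generate a song") and prompt:
    lyrics = gpt2.generate(
        sess,
        run_name=run_name,
        prefix=prompt,
        length=200,
        return_as_list=True,
    )[0]
    st.text(lyrics)

In a real app you would want to cache the loaded session rather than reload it on every interaction.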

Unfortunately, the app takes a bit of time to send the output, so there is definitely room for improvement, but I liked how it ended up.

Conclusion 📙

As an aspiring data scientist, I wanted this project to help solidify my knowledge of NLP and get some experience with neural nets and text generation. In a way, GPT-2 was a little too easy for me to implement, so I can’t confidently say I know how to work with neural nets. However, the process of working with GPT-2 led me to understand how powerful neural networks can be.

As for the NLP side, NLTK is such an easy library to work with, and once you have a basic pipeline created (feel free to steal the one in my repo [that Metis provided for me]), you can apply that same pipeline to any of your future NLP projects. Personally, I’d suggest learning more NLP libraries (I’m telling myself this as well); spaCy in particular seems to be popular within organizations.

Here are some basic take-aways:

  • Look for help if you are stuck (I always find a package that makes my project run smoother)
  • Build a pipeline (especially if you are testing by trial-and-error)
  • Streamlit makes it easy to create apps
  • Have fun!

Check out the GitHub repository for more information on this project. It isn’t my cleanest repo, but I will try to clean it up for better understanding. Also, reach out if you have any questions or comments.


As this was my final project with Metis, I couldn’t be more amazed by everything I’ve learned over the program’s 12 weeks. Obviously, it was a lot of information, so I am currently diving back into all of the topics to build a stronger understanding of the whole data science toolkit. I’ve also written about my other three projects on Medium, so check those out if you want to see my other work.

Reach out: LinkedIn | Twitter

