How to Collect Song Lyrics with Python

A hassle-free way to create a song lyrics dataset for training generative language models

emakpati
Towards Data Science


Photo by Brandon Erlinger-Ford on Unsplash

Introduction

Building datasets for language model training and fine-tuning can be very tedious. I learned this the hard way while trying to gather a conversational text dataset and a niche song lyrics dataset, both for training a single GPT-2 model. Only after several hours of semi-automated scraping and manual cleaning did I come across Genius and its API. It really was a godsend. Not only is the API easy to set up and use, but it also covers a plethora of artists that aren’t necessarily household names. This makes it a great option for creating datasets of both mainstream and niche song lyrics. The process becomes even more effortless when paired with lyricsgenius, a package created by John Miller that significantly simplifies the task of filtering data retrieved by the Genius API.

This writeup revolves around the use case of constructing a training dataset for a generative language model, like GPT. To be clear, it will not include steps to actually build a model. We’ll walk through setting up the API client and then writing a function that fetches the lyrics of k songs per artist and saves them to a .txt file.

Setting up the API Client

  1. Review the API documentation page.
  2. Review the API Terms of Service.
  3. From the documentation page, click API Client management page to navigate to the Sign-up/Log-in page.
  4. Complete the form using the sign-up method of your choice, or log in if you already have an account, and click Create Account. This should take you to your API Clients page or re-route you back to the home page. If you are sent back to the home page, scroll down to the page footer and click Developers.

5. Once on the API Clients page, click Create an API Client (on Safari) or New API Client (on Chrome) to create your app. In this context, the term “app” refers to usage of the Genius API. Only the App Name and App Website URL fields in the screenshot below are required to proceed. The app name is mainly there to help us identify this individual client (you can create several API clients). I typically default to my GitHub page for most website requirements, but any site should be fine for the URL field.

6. Clicking Save will take you to the credentials page which should look like the screenshot below. If you’ve created multiple API clients, they’ll all show up here. We’ll come back to this page later for the client access token. Congrats, you’ve completed the signup process!

Installing LyricsGenius

From https://github.com/johnwmillr/LyricsGenius

lyricsgenius can be installed from PyPI using pip install lyricsgenius or from GitHub using pip install git+https://github.com/johnwmillr/LyricsGenius.git.

Code (Finally)

We’ll start by importing lyricsgenius.

import lyricsgenius as lg

We also need to open the file we want to write the song lyrics to and keep the resulting file object in a variable. In this example my .txt file is named auto_.txt.

file = open("/Users/User/Desktop/auto_.txt", "w")

It’s finally time to use our API client credentials. Navigate back to your API Clients page and click Generate Access Token.

This token will be passed to lyricsgenius.Genius() along with the parameters we want to use to filter the text data. In terms of filtering, we’ll ignore lyrics that aren’t from official songs and disregard live performances and remixes. It’s also a good idea to set remove_section_headers to True assuming we want our dataset to focus solely on spoken song lyrics and to exclude song metadata.

genius = lg.Genius('Client_Access_Token_Goes_Here', skip_non_songs=True, excluded_terms=["(Remix)", "(Live)"], remove_section_headers=True)
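Hardcoding the token in a script makes it easy to leak when the file is shared. Here is a minimal sketch of one alternative, assuming the token has been stored in an environment variable beforehand. The variable name GENIUS_ACCESS_TOKEN and the helper load_token() are hypothetical names picked for this example.

```python
import os


def load_token(env=os.environ, var_name="GENIUS_ACCESS_TOKEN"):
    """Return the Genius client access token stored in an environment variable.

    Both the variable name and this helper are illustrative assumptions.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


# Usage (commented out so the sketch runs without lyricsgenius or a real token):
# import lyricsgenius as lg
# genius = lg.Genius(load_token(), skip_non_songs=True,
#                    excluded_terms=["(Remix)", "(Live)"],
#                    remove_section_headers=True)
```

The env parameter only exists so the helper can be tried out with a plain dictionary instead of the real environment.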

Lastly, let’s write a function called get_lyrics() that takes a list of artist names and k number of songs we want to grab for each artist as parameters. The function will print the name of each song collected and the total number of songs successfully grabbed for each artist and then write the lyrics to our .txt file.

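def get_lyrics(arr, k):
    c = 0  # number of artists whose lyrics were written to the file
    for name in arr:
        try:
            # Fetch up to k of the artist's songs, most popular first.
            songs = genius.search_artist(name, max_songs=k, sort='popularity').songs
            s = [song.lyrics for song in songs]
            # Write this artist's lyrics with a separator between songs.
            file.write("\n \n <|endoftext|> \n \n".join(s))
            c += 1
            print(f"Songs grabbed:{len(s)}")
        except Exception:
            print(f"some exception at {name}: {c}")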

Let’s break down the logic.

  1. Set a counter c to keep track of the number of artists whose lyrics have been written to the .txt file.
  2. Loop through the list of names arr.
  3. Create a try block.
  4. Pass name to lyricsgenius.Genius.search_artist() along with the number of songs we want k and sort the songs by popularity so that each artist’s most popular songs are grabbed first (up to our limit k of course). The returned artist’s .songs attribute gives us a list of song objects, which we store as songs.
  5. Use a list comprehension to loop through songs, adding each song’s lyrics song.lyrics to a new list s.
  6. Call file.write() and pass it the joined lyrics. Joining compresses the list of strings into a single string (which represents all lyrics grabbed for the artist), and that string is written to the .txt file. Instead of joining on a plain space with " ".join(s), we join on a more conspicuous separator, "\n \n <|endoftext|> \n \n", which makes the text file much easier to read through because consecutive sets of single-song lyrics are divided by that separator. On a side note, separators can be very useful when building datasets for some language models and can help your model understand where a single sample of text ends.
  7. Increase c by 1.
  8. Print generic success message along with the number of songs len(s) collected.
  9. Create except block.
  10. Print generic except message along with name of artist that threw the exception and c.
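The steps above lean on two global variables, genius and file, and never close the file. Below is a sketch of one possible variation that takes the client and the output path as parameters instead. The name collect_lyrics() and its signature are my own assumptions. It only relies on the client offering search_artist() with the same arguments used above, so it can be tried with a stand-in object before spending real API calls.

```python
SEPARATOR = "\n \n <|endoftext|> \n \n"


def collect_lyrics(client, artists, k, out_path):
    """Write up to k sets of lyrics per artist to out_path.

    `client` is assumed to behave like the `genius` object above: it has a
    search_artist(name, max_songs=..., sort=...) method whose result exposes
    a `.songs` list of objects with a `.lyrics` string.
    Returns the total number of songs written.
    """
    total = 0
    # The with-block closes the file even if an artist lookup fails.
    with open(out_path, "w", encoding="utf-8") as out:
        for name in artists:
            try:
                artist = client.search_artist(name, max_songs=k, sort="popularity")
                lyrics = [song.lyrics for song in artist.songs]
            except Exception as err:  # keep going if one artist fails
                print(f"Skipping {name}: {err}")
                continue
            for text in lyrics:
                # Put the separator before every song except the first one
                # so that consecutive artists are delimited as well.
                if total:
                    out.write(SEPARATOR)
                out.write(text)
                total += 1
            print(f"{name}: {len(lyrics)} songs grabbed")
    return total
```

With a real client this would be called as collect_lyrics(genius, ['Logic', 'Frank Sinatra', 'Rihanna'], 3, "auto_.txt"). Unlike the original function, it also writes a separator between the last song of one artist and the first song of the next.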

Output:

Print output when running get_lyrics(['Logic', 'Frank Sinatra', 'Rihanna'], 3).
Example of the .txt output when running get_lyrics(['Logic', 'Frank Sinatra', 'Rihanna'], 3).
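A quick way to sanity-check the .txt file is to split its contents back into individual samples on the separator and count them. The sketch below is an assumption on my part rather than part of the original workflow. The helper name split_samples() is hypothetical and the example string is a made-up stand-in for the file contents.

```python
SEPARATOR = "<|endoftext|>"


def split_samples(text, separator=SEPARATOR):
    """Split the contents of the lyrics file into one string per song."""
    # Strip the whitespace padding around each separator and drop empty pieces.
    return [chunk.strip() for chunk in text.split(separator) if chunk.strip()]


# Small made-up stand-in for the contents of the .txt file.
example = "first song\nline two\n \n <|endoftext|> \n \nsecond song"
samples = split_samples(example)
print(len(samples))  # 2
```

Keep in mind that get_lyrics() only places the separator between songs of the same artist, so the count may come out lower than the total number of songs grabbed.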

Process Improvement

A clear area for improvement is our process’s dependence on a hand-made collection of artist names being passed to lyricsgenius.Genius.search_artist(). Manually creating a list of artist names is definitely not scalable. We only used three artists in our example, but to build a large enough dataset to fine-tune a production-caliber model we’d ideally want dozens of artists and a much higher value of k.

One solution is to automate the task of creating the list of artists, for example by scraping the names from one or two sources using bs4. Wikipedia provides several lists of musicians organized by music genre and may be a great single source to grab these artist names from.
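To make the idea a bit more concrete, here is a rough sketch of pulling names out of an HTML list. It swaps bs4 for the standard library’s html.parser so that it runs without extra installs, but the same idea carries over to bs4. It assumes each artist appears as the first link inside a list item, which is only a guess at how such pages are laid out. The class and function names are hypothetical, the sample HTML is made up, and downloading the page is left out.

```python
from html.parser import HTMLParser


class ListLinkParser(HTMLParser):
    """Collect the text of the first link inside every <li> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_li = False
        self._in_link = False
        self._seen_link = False  # True once the current <li> has a link
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li, self._seen_link = True, False
        elif tag == "a" and self._in_li and not self._seen_link:
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link, self._seen_link = False, True
            name = "".join(self._buffer).strip()
            if name:
                self.names.append(name)
        elif tag == "li":
            self._in_li = False


def extract_artist_names(html):
    """Return the link text of the first link in each list item."""
    parser = ListLinkParser()
    parser.feed(html)
    return parser.names


# Made-up HTML imitating a list page; real pages will need extra filtering.
sample_html = """
<ul>
  <li><a href="/wiki/Artist_One">Artist One</a> (born 1970)</li>
  <li><a href="/wiki/Artist_Two">Artist Two</a>, <a href="/wiki/Band">Band</a></li>
</ul>
"""
print(extract_artist_names(sample_html))  # ['Artist One', 'Artist Two']
```

The resulting list can be passed straight to get_lyrics() as arr. Real list pages also contain navigation and footnote links inside list items, so expect to filter the output before using it.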

Conclusion

A process that was once difficult and complex is now frictionless and streamlined thanks to the work of the Genius team and John Miller.

Here’s all the code I used above to connect to the API and write song lyrics to the .txt file.

Thanks a bunch for allowing me to share, more to come. 😃
