
Sex and Drugs and Organic Topic Modeling

Using GPT-J to analyze the lyrics of Rock & Roll songs

Sex and Drugs and Organic Topic Modeling, Diagrams by Author

Question: Which topic is the most common in rock and roll lyrics?

Answer: The Beatles summed it up in their song, "All You Need is Love."

As an experiment in Topic Modeling, I used the latest AI systems to analyze 12,000 songs by 50 rock bands. This article will walk through the processes I used and explain all of the components, including GPT-J, a free, open-source alternative to GPT-3 that runs on a TPU in Google Colab. Note that you can use these techniques to find and analyze the topics in any text dataset.

RockTopics Overview

Here is a high-level diagram for this experiment which I call RockTopics. After a brief discussion of the main components, I’ll explain the processing steps in greater detail in the sections below.

RockTopics Components, Diagram by Author

I started with a database of 128K song lyrics I found on Kaggle and filtered it down to the songs by artists on Rolling Stone Magazine’s list of the 100 Greatest Rock Bands. This yielded roughly 12K songs from 50 bands.

The heart of the system is the open-source GPT-J transformer trained using the Mesh Transformer JAX on the Pile, a large corpus of English text. I fed each of the song lyrics line-by-line into GPT-J, using a few-shot prompt to find the primary topic for each line.

I used the Google Universal Sentence Encoder to transform each discovered topic into an array of 512 numbers. I then analyzed the topics using a combination of TSNE dimensionality reduction and k-means clustering to produce graphs using matplotlib.

Here are the most common topics according to the analysis. Note that similar topics are clustered together, and the size of the circles represents the number of occurrences in the song lyrics.

Rock Topics, Image by Author

System Components

The following sections describe the components and processes I used in detail.

Lyrics Database

For the lyrics, I found a nice dataset by Anderson Neisse, an AI researcher in Brazil. The dataset has 128,083 songs from 2,940 bands in 6 genres. He released the dataset under the Database Contents License.

As I mentioned above, I filtered the list of songs with Rolling Stone Magazine’s list of the 100 greatest recording artists. The resulting list has 11,959 songs from 50 bands, including The Beatles, Bob Dylan, Elvis Presley, The Rolling Stones, Chuck Berry, and Jimi Hendrix. In total, there are 185,003 lines of lyrics.
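
For reference, here’s a minimal sketch of what that filtering step could look like in pandas. The file and column names are placeholders, not necessarily the ones used in the actual dataset or my code.

import pandas as pd

# Load the Kaggle lyrics dataset and the Rolling Stone artist list.
# File and column names are placeholders; adjust them to your copies of the data.
songs = pd.read_csv("lyrics-data.csv")            # columns: artist, song, lyrics
greatest = pd.read_csv("rolling_stone_100.csv")   # column: artist

# Keep only songs whose artist appears on the Rolling Stone list.
rock_songs = songs[songs["artist"].isin(greatest["artist"])].reset_index(drop=True)
print(len(rock_songs), "songs from", rock_songs["artist"].nunique(), "bands")

# Split each song into individual lines of lyrics for the topic queries.
lines = rock_songs.assign(line=rock_songs["lyrics"].str.split("\n")).explode("line")
lines = lines[lines["line"].fillna("").str.strip() != ""]
print(len(lines), "lines of lyrics")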

GPT-J

GPT-J is an AI model for analyzing and generating text trained using Mesh Transformer JAX, a scalable system for training large models using parallel processing. The model was trained on a large corpus of text called The Pile, an 825-gigabyte English text dataset designed for training large-scale language models [1].

GPT-J was created by EleutherAI, a grassroots collective of researchers working on open-sourcing AI research. The system was modeled after OpenAI’s GPT-3, but GPT-J is free to use under the Apache 2.0 open-source license.

You can try out GPT-J for free here: https://6b.eleuther.ai/

EleutherAI’s GPT-J, Image by Author

As you can see, it already knows the number one topic in rock songs. Next, I’ll show you how to find the topics for each line in the song lyrics.

Finding Topics in Lyrics

Like other large transformer models, GPT-J works by taking in a prompt and generating text that continues it. I used the following prompt to get the topic from each line of the songs. Note that this is called "few-shot" inference because the prompt includes a few examples.

Determine the topic for these lyrics.
Line: Ah, look at all the lonely people!
Topic: Loneliness
Line: Wiping the dirt from his hands as he walks from the grave.
Topic: Death
Line: Little darling, the smiles returning to the faces.
Topic: Happiness
Line: Every breath you take.
Topic:

The first line is a general description of the task, followed by three example lines with their topics. The system uses this information to get the gist of what’s being asked and specifies the topic for the last line, "Every breath you take." Running this query multiple times will result in topics like "Breathing," "Health," "Life," etc.
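
To give a sense of how such a query can be run in code, here’s a minimal sketch using the Hugging Face transformers port of GPT-J. My experiment used the Mesh Transformer JAX version on a TPU, so treat this purely as an illustration of the prompt-and-complete pattern (the 6B model also needs roughly 24 GB of memory in full precision).

from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative only: the Hugging Face port of GPT-J, not the TPU/JAX setup used here.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = """Determine the topic for these lyrics.
Line: Ah, look at all the lonely people!
Topic: Loneliness
Line: Wiping the dirt from his hands as he walks from the grave.
Topic: Death
Line: Little darling, the smiles returning to the faces.
Topic: Happiness
Line: Every breath you take.
Topic:"""

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5, do_sample=True,
                        temperature=0.7, pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens and keep the first line as the topic.
completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion.strip().split("\n")[0])   # e.g. "Breathing"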

Here are the topics from the first five lines of "Every Breath You Take" by the Police.

Line                     Topic
Every breath you take    Breathing
Every move you make      Movement
Every bond you break     Leaving
Every step you take      Walking
I'll be watching you     Watching

The system seems to do a pretty good job of finding topics for the lines. Note that it is not just pulling keywords: it always states the topic in noun form and sometimes generalizes the meaning of the line to find the topic, as it did for Leaving and Walking.

I found that using the same examples for each query would sometimes "leak" the example topics "Loneliness," "Death," and "Happiness" into the results, inflating their counts. To minimize the leakage, I culled a list of 300 examples from my intermediate results and wrote some code to choose three examples at random from that larger list for each query. This appears to have reduced (or spread out) the leakage to statistically insignificant levels.
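
Here’s a sketch of that random-example selection; the pool shown is just a stand-in for the full list of 300 examples.

import random

# Stand-in example pool; the real list had roughly 300 (line, topic) pairs
# culled from intermediate results.
example_pool = [
    ("Ah, look at all the lonely people!", "Loneliness"),
    ("Wiping the dirt from his hands as he walks from the grave.", "Death"),
    ("Little darling, the smiles returning to the faces.", "Happiness"),
]

def build_prompt(lyric_line, n_examples=3):
    # Pick a different set of examples for every query so no single topic dominates.
    examples = random.sample(example_pool, n_examples)
    parts = ["Determine the topic for these lyrics."]
    for line, topic in examples:
        parts.append("Line: " + line)
        parts.append("Topic: " + topic)
    parts.append("Line: " + lyric_line)
    parts.append("Topic:")
    return "\n".join(parts)

print(build_prompt("Every breath you take."))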

I found that the leakage issue can be avoided altogether with OpenAI’s GPT-3 davinci-instruct-beta model and a "zero-shot" query, meaning no examples are provided. Here’s the prompt.

Determine the topic for this line of lyrics.
Line: Every breath you take.
Topic:

The results are similar to GPT-J’s, with no leakage because there are no examples. This method is preferable, but it comes at a cost: although this one query would cost only $0.0017 using the paid version of GPT-3, running it on all 185K lines would cost over $300.
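
For reference, a zero-shot query like that could be sent with the openai Python library as it existed at the time; the Completion endpoint and the davinci-instruct-beta engine have since been retired, so take this as a historical sketch rather than current working code.

import openai

# Historical sketch: the Completion endpoint and the davinci-instruct-beta engine
# are no longer available in current versions of the OpenAI API.
openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = """Determine the topic for this line of lyrics.
Line: Every breath you take.
Topic:"""

response = openai.Completion.create(
    engine="davinci-instruct-beta",
    prompt=prompt,
    max_tokens=5,
    temperature=0.7,
)
print(response.choices[0].text.strip())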

Tensor Processing Unit

Tensor Processing Unit, Image Source: Google Blog

For the last 18 months, I have been running my AI experiments on Google Colab using two types of processors: Central Processing Units (CPUs) and Graphics Processing Units (GPUs). CPUs are general-purpose processors that have been around since the 1960s. GPUs were developed in the 1980s for graphics-intensive operations and have been used for AI and ML since the 2010s.

In 2016, Google introduced its Tensor Processing Unit (TPU), designed specifically for training and running AI models. TPUs have been available for use on Google Colab since 2020. Yu Wang led a team from Harvard that benchmarked AI models on GPUs and TPUs and found that TPUs outperform GPUs by a factor of 3X to 7X when running large models [2].

My RockTopics Colab for running GPT-J topic modeling is based on a TPU Colab from Ben Wang at EleutherAI. Processing takes about 1.4 seconds per line, which worked out to roughly three days to get the topics for all 185K lines of lyrics.

Universal Sentence Encoder

After I collected the topics, I used the Universal Sentence Encoder from Google [3] to convert each topic phrase into an embedding, an array of 512 floating-point numbers. Here’s some sample code to show how it works.

import tensorflow_hub as hub

# Version 5 of the encoder is a TF2 SavedModel, so load it with hub.load.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

# Encode a topic phrase into a 512-dimensional embedding.
embedding = embed(["love loss"])
print(embedding)

The result is here.

tf.Tensor( [[-2.79744454e-02 -6.52119750e-03 1.32698761e-02
4.50092048e-04 9.81796882e-04 3.18628997e-02 2.73146853e-02
-1.10505158e-02 -2.71893758e-02 -5.06720766e-02 -3.20894904e-02
...
-1.08678043e-02 7.85474479e-03 -6.44846493e-03 -3.88006195e-02]], shape=(1, 512), dtype=float32)

Although these numbers won’t mean anything to humans, they represent the concept of "love loss" in the encoder’s multidimensional space. I’ll show you how I used statistics to make sense of the topics in the following sections.

Dimensionality Reduction

Each embedding packs a lot of information into its 512 dimensions. To show the topic embeddings graphically, the first step I used is dimensionality reduction (DR), which reduces each topic from 512 dimensions down to two so the data can be plotted on a two-dimensional chart.

Two standard methods for DR are Principal Component Analysis (PCA) and T-distributed Stochastic Neighbor Embedding (TSNE). Each method tries to retain the gist of the data while reducing the number of dimensions.

PCA takes the data, converts it into a matrix, uses some fancy math to find the directions of greatest variance, and projects the data onto them. TSNE works by iteratively minimizing the divergence between a probability distribution over pairs of the original points and a distribution over pairs of their lower-dimensional counterparts. You can read more about the PCA and TSNE methods in Luuk Derksen’s post here.
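
As a rough sketch, both reductions can be run with scikit-learn, assuming embeddings is the (number of topics, 512) array of sentence embeddings (random data stands in for it here):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

embeddings = np.random.rand(100, 512).astype("float32")  # stand-in for the topic embeddings

# PCA straight down to two dimensions.
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# TSNE is usually run on a PCA-reduced matrix to speed it up.
pca_50 = PCA(n_components=50).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=42).fit_transform(pca_50)

print(pca_2d.shape, tsne_2d.shape)  # (100, 2) (100, 2)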

The top 100 topics look like this when reduced from 512 to two dimensions using PCA and TSNE. The size of each circle represents the number of times the topic was found in a line of lyrics. You can click on the images to see a bigger view.

Rock Topics after PCA and TSNE Dimensionality Reduction, Images by Author

You can see that the PCA reduction has more tightly packed clusters, while the TSNE reduction spreads the topics out a bit more. Although the PCA reduction has a more interesting overall grouping, the topic labels are hard to read because they are often stacked on top of one another. In general, I find the TSNE reduction easier to read.

Orienting the Graphs

While looking at the graphs, you might wonder why the topics end up in the spatial positions they do. Although the relative positions have meaning, the overall orientation is arbitrary. The Universal Sentence Encoder assigns each phrase a position in its 512-dimensional space, and although the DR algorithms try to maintain spatial coherence, there is no guarantee that any particular topic will land in any specific place.

To make the graph orientation more predictable, I arbitrarily chose the topic time to orient the X-axis and the topic music to orient the Y-axis. Here is what the graphs look like before and after running the orientation code.

Rock Topics Orientation, Diagrams by Author

After orienting the graph, the time topic is now at the three o’clock position, and the music topic is roughly at the 12 o’clock position. The source code for orienting the graphs is here.
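
As a sketch of the idea (not the exact code in the repository): measure the angle of the time topic, rotate every point by its negative so time lands on the positive X-axis, then flip the Y-axis if music ends up in the lower half.

import numpy as np

def orient(points, time_idx, music_idx):
    # points: (n, 2) array of reduced topic coordinates.
    x, y = points[time_idx]
    angle = np.arctan2(y, x)
    rot = np.array([[np.cos(-angle), -np.sin(-angle)],
                    [np.sin(-angle),  np.cos(-angle)]])
    oriented = points @ rot.T             # "time" now sits at three o'clock
    if oriented[music_idx, 1] < 0:        # put "music" in the upper half-plane
        oriented = oriented * np.array([1.0, -1.0])
    return oriented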

K-Means Clustering

You might have noticed that some of the topics are near-synonyms. For example, you can see dream and dreams, dance and dancing, etc., as individual topics.

To combine similar topics, I used an algorithm called k-means clustering. The algorithm aims to group n samples into k clusters in which each sample belongs to the cluster with the nearest mean.
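
A quick sketch of this step with scikit-learn, using stand-in data for the topic embeddings and their occurrence counts:

from sklearn.cluster import KMeans
import numpy as np

topic_embeddings = np.random.rand(500, 512)    # stand-in for the (n_topics, 512) embeddings
counts = np.random.randint(1, 100, size=500)   # stand-in occurrence counts per topic

kmeans = KMeans(n_clusters=50, random_state=42, n_init=10)
labels = kmeans.fit_predict(topic_embeddings)

# Label each cluster with its most frequent member topic and size it by the
# total number of occurrences, for plotting.
cluster_reps, cluster_sizes = [], []
for c in range(50):
    members = np.where(labels == c)[0]
    cluster_reps.append(members[np.argmax(counts[members])])
    cluster_sizes.append(counts[members].sum())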

Here’s what the topics look like after using k-means to reduce the data to 50 clusters.

Rock Topic Clusters, Image by Author

You can see that similar topics have effectively been grouped together: dream merged into dreams, dance merged into dancing, and so on.

Here’s a graph that shows the bands arranged by the average of the topics in their lyrics. The size of the circles represents the number of lines in their catalogs of songs.

Topics by Bands, Image by Author

It’s rather fun to see where your favorite bands land. Some of the groupings make sense, like seeing Bruce Springsteen next to Elvis Costello. But somehow, The Beach Boys are sandwiched between Radiohead and Nirvana. I didn’t expect that.
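
For reference, here’s a rough sketch of how those band positions could be computed: average the topic embeddings of every line a band has sung, then push the per-band means through the same reduction and orientation steps (stand-in arrays take the place of the real per-line data).

import numpy as np

# Stand-ins for the real per-line data: one band label and one 512-dim topic
# embedding per line of lyrics.
line_bands = np.array(["The Beatles", "Bob Dylan", "The Beatles", "Bob Dylan"])
line_embeddings = np.random.rand(4, 512)

band_names = np.unique(line_bands)
band_means = np.stack([line_embeddings[line_bands == b].mean(axis=0)
                       for b in band_names])
band_sizes = np.array([(line_bands == b).sum() for b in band_names])
# band_means can now go through the same PCA/TSNE and orientation steps,
# with circle sizes proportional to band_sizes.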

Discussion and Next Steps

Using large transformer models for organic topic modeling seems to work well for large datasets, and running GPT-J on Google Colab is a cost-effective way to do the analysis. The next thing I could try is fine-tuning GPT-J to find topics without needing example prompts, which would eliminate the leakage problem.

Another project would be to see how this topic modeling technique compares with other methods. For example, I could run it on the 20 Newsgroups dataset and compare it to the state-of-the-art system, Bidirectional Adversarial Training (BAT) [4]. While the BAT method uses a smaller model trained specifically for topic modeling, my approach of using a large transformer model might leverage its general knowledge to yield better results.

Source Code and Colabs

All source code for this project is available on GitHub. I released the source code under the CC BY-SA license.

Attribution-ShareAlike

Acknowledgments

I want to thank Jennifer Lim and Oliver Strimpel for their help with this project.

References

[1] L. Gao, et al., The Pile: An 800GB Dataset of Diverse Text for Language Modeling (2020)

[2] Y. Wang, et al., Benchmarking TPU, GPU, and CPU Platforms for Deep Learning (2019)

[3] D. Cer, et al., Universal Sentence Encoder (2018)

[4] R. Wang, et al., Neural Topic Modeling with Bidirectional Adversarial Training (2020)

