How We Used NLTK and NLP to Predict a Song’s Genre From Its Lyrics

The aim of this article is to outline our process for using NLTK and Natural Language Processing methods to clean and preprocess text data and turn song lyrics into a matrix of numerical values, so we can train a Machine Learning Algorithm that can classify each song’s genre based on its lyrics.

What is Natural Language Processing (NLP for short)?

NLP refers to analytics tasks that deal with natural human language, in the form of text or speech. These tasks usually involve some sort of machine learning, whether for text classification or for feature generation, but NLP isn’t just machine learning. Tasks such as text preprocessing and cleaning also fall under the NLP umbrella.

One of the most widely used Python libraries for NLP tasks is the Natural Language Toolkit, or NLTK. NLTK is a sort of "one-stop shop" for all things NLP. Unlike most other Python libraries and ML models, NLTK and NLP rely heavily on the field of linguistics in addition to statistics and math. Many of the concepts and methods for working with text data described throughout the rest of this article are grounded in linguistic rules.

Obtaining Data: Where did we get our data?

We found a CSV on Kaggle with 300,000 song lyrics spanning 11 genres and 6–7 languages. The dataset included the Song Title, Artist, Year, Album, and Genre, plus a column with the full song Lyrics.

Cleaning and Pre-Processing Text Data

Now that we have our data, the fun part begins. First, we need to preprocess and clean our text data. As you might have already suspected, preprocessing text data is a bit more challenging than working with more traditional data types because there’s no clear-cut answer for exactly what sort of preprocessing and cleaning we need to do. When working with traditional datasets, our goals are generally pretty clear for this stage – normalize and clean our numerical data, convert categorical data to a numeric format, check for and deal with multicollinearity, etc. The steps we take are largely dependent on what the data already looks like when we get a hold of it.

Text data is different. In its raw format, text data starts with only one dimension: the only feature in our dataset that we are interested in at the initial stages of our project is the column with the full text of each song lyric. This means that we need to extract features from the text that we can later use to train ML models. Before we can begin cleaning and preprocessing our text data, we need to make some decisions about things such as:

  • Do we remove stop words, or not?
  • Do we stem or lemmatize our text data, or leave the words as is?
  • Is basic tokenization enough, or do we need to support special edge cases through the use of regex?
  • Do we stick with English words only or do we allow for other languages?
  • Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?
  • Do we engineer other features, such as bigrams, POS tags, or Mutual Information Scores?
  • What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?

These are all questions that we’ll need to think about pretty much anytime we work with text data.

Exploring Data: Looking at our Pandas Dataframe we found…

The first thing we did was check for null values and drop songs with NaN lyrics. After this cleaning, we still had 200,000 rows.

Then we looked at value counts for Genre and decided to drop Folk, Indie, and Other: the first two didn’t have enough data, and "Other" doesn’t provide any predictive value for our final classification task.

After all of this cleanup, we were left with eight basic genres: Rock, Pop, Hip Hop, Metal, Country, Jazz, Electronic, R&B. These are the target classes that we will be trying to predict.

The distribution between genres was uneven, so we decided to randomly select 900 songs per genre, giving us a total of 900 songs × 8 genres = 7,200 songs.
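The cleanup and balancing steps above might look roughly like this in pandas. This is a minimal sketch, not our exact code: the column names `lyrics` and `genre`, the helper name `balance_genres`, and the tiny stand-in DataFrame are all assumptions made for illustration.

```python
import pandas as pd


def balance_genres(df, drop_genres, n_per_genre, seed=42):
    """Drop rows without lyrics, remove unwanted genres, and draw the
    same number of songs from each remaining genre."""
    cleaned = df.dropna(subset=["lyrics"])                  # remove NaN lyrics
    cleaned = cleaned[~cleaned["genre"].isin(drop_genres)]  # remove unwanted genres
    # GroupBy.sample draws n rows at random from every genre group
    sampled = cleaned.groupby("genre").sample(n=n_per_genre, random_state=seed)
    return sampled.reset_index(drop=True)


# Toy stand-in for the Kaggle CSV (hypothetical values)
songs = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rock", "Pop", "Pop", "Pop", "Folk", "Other"],
    "lyrics": ["a", "b", None, "c", "d", "e", "f", "g"],
})

balanced = balance_genres(songs, drop_genres=["Folk", "Indie", "Other"], n_per_genre=2)
print(balanced["genre"].value_counts())
```

With the real dataset, `n_per_genre` would be 900. Note that `sample` raises an error if any genre has fewer rows than requested.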

Feature Engineering and Model Optimization:

  1. We used a combination of NLTK, Pandas and Regex methods to:
  • clean text from punctuation and odd characters
  • remove stopwords
  • tokenize to only English words
  • return a corpus of stemmed words
  • return a corpus of lemmatized words
  • append the final clean lyrics back to the Pandas DataFrame
  2. We used TF-IDF Vectorizer to turn words into a numerical representation of the importance of each word to a particular song lyric.
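To make the cleaning steps in the list above more concrete, here is a small sketch of one possible cleaning function. It is an illustration under assumptions rather than our actual pipeline: the function name `clean_lyrics` and the tiny hand-written stop word list are hypothetical, and the sketch covers only punctuation removal, stop word removal, and stemming. Lemmatizing (for example with NLTK's `WordNetLemmatizer`) and filtering to English-only words need extra corpus downloads and are left out.

```python
import re

from nltk.stem import PorterStemmer

# Tiny illustrative stop word list (an assumption for this sketch).
# NLTK ships a fuller list in nltk.corpus.stopwords, which requires
# nltk.download("stopwords") first.
STOP_WORDS = {"a", "an", "and", "the", "i", "in", "is", "it",
              "my", "of", "on", "to", "you"}

stemmer = PorterStemmer()


def clean_lyrics(text):
    """Lowercase, strip punctuation and digits, drop stop words, stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and whitespace
    tokens = text.split()                   # basic whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]


print(clean_lyrics("Running in the rain, singing my songs!"))
```

The cleaned tokens can then be joined back into a string and stored in a new DataFrame column, for example with `df["lyrics"].apply(clean_lyrics)`.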

What is TF-IDF?

TF-IDF stands for term frequency-inverse document frequency, and the TF-IDF weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

How is TF-IDF computed?

Typically, the TF-IDF weight is composed of two terms. The first is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document. The second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears.

  1. TF: Term Frequency measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

  2. IDF: Inverse Document Frequency measures how informative a term is, based on how rare it is across all documents in the corpus. Certain terms, such as "is", "of", and "that", may appear very frequently in most documents, but that doesn’t tell us anything about the importance of those commonly used words to a specific document’s meaning. Thus we need to weigh down the excessively frequent terms while scaling up the rare ones that are specific to only a small number of documents, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
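The two formulas above can be checked by hand with a few lines of plain Python. This is a worked sketch on a made-up three-document corpus; the function names `tf`, `idf`, and `tf_idf` are ours for illustration and are not taken from any library.

```python
import math


def tf(term, doc_tokens):
    """Times the term appears in the document / total terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """log_e(total documents / documents containing the term).
    Assumes the term occurs in at least one document."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Made-up corpus of three tokenized "lyrics"
corpus = [
    ["love", "is", "all", "you", "need"],
    ["guitar", "is", "loud"],
    ["love", "is", "blind"],
]

for term in ["is", "love", "need"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In the first document, "is" scores 0 because it appears in every document (log of 3/3), while "need", which appears in only one document, scores higher than "love", which appears in two.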

After stemming and lemmatizing all the song lyrics and creating a TF-IDF feature matrix, we ended up with a final Pandas DataFrame of 7,200 rows and 30,000 columns. Each row represents a particular song lyric, and each column holds the TF-IDF value of a unique word.

Training and Optimizing Our Models

The first thing we wanted to do was to test whether our basic ML models performed better with a corpus of stemmed or lemmatized text. We trained and evaluated the performance of Multinomial Naive Bayes, Random Forest, AdaBoost, Gradient Boost, and K-Nearest Neighbor, using both stemmed and lemmatized words. The chart below shows our results:

We chose to go with lemmatized words over stemmed words because every model consistently performed at least 1% better when using lemmatized text.
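A stemmed-versus-lemmatized comparison like this could be set up roughly as follows with scikit-learn. This is a sketch under assumptions, not our evaluation code: the helper `compare_corpora`, the toy lyrics, and the use of two-fold cross-validation are all illustrative choices, and only two of the five models are shown.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_corpora(corpora, labels, models, cv=2):
    """Return mean cross-validated accuracy for every (corpus, model) pair."""
    scores = {}
    for corpus_name, texts in corpora.items():
        for model_name, model in models.items():
            # Vectorize inside the pipeline so each fold fits its own vocabulary
            pipeline = make_pipeline(TfidfVectorizer(), model)
            acc = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
            scores[(corpus_name, model_name)] = acc.mean()
    return scores


# Toy data: the same eight "songs" in stemmed and lemmatized form
labels = ["Rock"] * 4 + ["Pop"] * 4
corpora = {
    "stemmed": ["guitar loud scream", "loud drum scream", "guitar drum loud",
                "scream guitar drum", "danc love babi", "love babi tonight",
                "danc tonight love", "babi danc tonight"],
    "lemmatized": ["guitar loud scream", "loud drum scream", "guitar drum loud",
                   "scream guitar drum", "dance love baby", "love baby tonight",
                   "dance tonight love", "baby dance tonight"],
}
models = {
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = compare_corpora(corpora, labels, models)
for key, value in scores.items():
    print(key, round(value, 3))
```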

From here we opted to focus on model optimization for our top three models: Multinomial Naive Bayes, Gradient Boost, and Random Forest.

Next, we tried PCA. We first ran a test on our data to see how many components would preserve 80% of the variance. Then we ran PCA with n_components = 1800 on our top three models to see if that improved performance. The graph below shows the result:

As you can see from the graph, PCA didn’t improve performance for any of the models, so we decided not to use PCA moving forward.
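One way to find how many components preserve 80% of the variance is to pass a fraction to scikit-learn's `PCA`, which then picks the smallest number of components that reaches it. The sketch below uses synthetic data as a stand-in for a dense TF-IDF matrix, so the resulting component count says nothing about our dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a dense feature matrix: 200 samples, 50 features
# generated from 5 underlying factors plus a little noise (an assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# A float between 0 and 1 asks PCA to keep just enough components
# to explain at least that share of the variance
pca = PCA(n_components=0.80, svd_solver="full")
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("variance explained:", round(pca.explained_variance_ratio_.sum(), 3))
```

Once the component count is known, it can be passed as an integer (as we did with `n_components = 1800`) to transform the features before fitting each model.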

Next, we ran GridSearch on the three top-performing models and picked the model with the combination of parameters that yielded the highest accuracy score. Summary of results below:

  • Grid Search on the Random Forest improved performance from 41% to 43% accuracy.
  • Grid Search on the Gradient Boost improved performance from 45% to 50% accuracy.
  • Grid Search on Naive Bayes did not improve performance because the default parameters already performed best.
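For readers who have not used it, a grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The example below tunes only the smoothing parameter `alpha` of Multinomial Naive Bayes on toy data; the parameter grid, the data, and the two-fold cross-validation are assumptions for illustration, not the grids we actually searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy lyrics and labels (hypothetical)
texts = ["guitar loud scream", "loud drum scream", "guitar drum loud",
         "scream guitar drum", "dance love baby", "love baby tonight",
         "dance tonight love", "baby dance tonight"]
labels = ["Rock"] * 4 + ["Pop"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__alpha": [0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

The same pattern applies to Random Forest or Gradient Boost by swapping the `clf` step and listing that model's parameters in the grid, at the cost of a longer search.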

Interpreting and communicating the final results:

Below you can see the graph of our top three models’ Final Performance after optimization and hyperparameter tuning using GridSearch.

Our best-performing model, GradientBoost after Grid Search, yielded 50% accuracy, which is four times better than random guessing (guessing a random class out of 8 possible classes = 1/8 or 12.5%). Even though 50% is not a stellar number, we were still impressed that, given only 7,200 lyrics, we were able to train a model that can correctly guess which genre a song belongs to 50% of the time by only scanning that song’s lyrics.

From experimenting with Grid Search and PCA optimizations, we found that Multinomial Naive Bayes was the fastest and simplest model to use right out of the box. Without any extra optimization techniques, it yielded only 5% less accuracy than the top model, the GradientBoosted Classifier.

Conclusion:

Based on our fun experiment, it appears that each song genre has a certain vocabulary specific to it, which makes it possible to train an ML algorithm that can guess a song’s genre only by analyzing its lyrics. Another interesting finding was that the Naive Bayes Classifier performed very well right out of the box. Thus, if you are working with a very large text dataset, where feature generation and model optimization prove to be computationally expensive and time-consuming, you might opt to use Naive Bayes for simplicity and efficiency, without sacrificing performance too much. If you have sufficient time and computational power and you want to optimize performance as much as possible, then running a GridSearch on ensemble models such as Random Forest or GradientBoosted Classifiers would be the way to go.

Fun Add-On: Using an Unsupervised Learning model to identify distinctive topics and keywords for each genre

We used gensim.corpora.Dictionary to create a frequency dictionary for the lemmatized, tokenized word set. We grabbed keywords from each genre and generated a Topic Model score. Using the Word2Vec Dictionary generator, we ran a Topic Modeling LDA algorithm and printed the word clouds for the top Keywords in each genre below.
