Identifying Language Specific to a Musical Genre

Published in Towards Data Science

6 min read · Sep 9, 2019

“Music is the universal language of mankind.” - Henry Wadsworth Longfellow

Human beings have a natural ability to relate to the music they listen to and to communicate with others through the music they create, regardless of the language they speak. As music has progressed over the ages, we have developed new ways of talking about it through spoken and written language. Musical notation, for example, was adopted as a means of recording a piece of music in written form so that anyone who is familiar with this notation could pick it up and play.

In the modern era, music has become more accessible and widespread, and the variety of musical styles is far greater than it once was. This has led to a major focus on categorizing music by genre. In many cases, entire subcultures have developed around these genres. Fans of a particular style of music might identify with these subcultures in more ways than just the music they choose to listen to. They might dress in a certain fashion, frequent certain venues, and even develop a specific way of speaking about the music they love.

There are terms people use to describe music in general, but I wanted to see if I could analyze the differences in the language people use around a specific genre. To do this, I created a machine learning model to test how well a computer would be able to determine what genre a person might be talking about based on the words they are using. As a case study, I chose to compare two specific genres: EDM and Rock. The data used for this analysis came from Reddit posts about these two genres. I pulled one group of posts from the subreddit r/EDM and another group of posts from r/rock to get text data of people talking specifically about these two types of music. A link to the project can be found here.

The next step in this process was to clean the text so that it could be processed properly. Because these were Reddit posts, there were lots of things inside them that were not actual words, such as links to YouTube videos and playlists, so these had to be removed. Since the words themselves were the only relevant content, any punctuation was also removed. I chose to use the titles of the posts in addition to the content of the posts themselves because, for these particular subreddits, the titles were just as relevant as the content in many cases.
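The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: the function name `clean_post`, the URL pattern, and the decision to lowercase the text are all assumptions made for the example.

```python
import re
import string

# Assumed pattern: anything starting with http(s):// or www. counts as a link.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_post(title: str, body: str) -> str:
    """Combine a post's title and body, then strip links and punctuation."""
    text = f"{title} {body}"
    text = URL_PATTERN.sub(" ", text)      # drop links (videos, playlists, ...)
    text = text.translate(PUNCT_TABLE)     # replace punctuation with spaces
    return " ".join(text.lower().split())  # lowercase and collapse whitespace


# Made-up example post
print(clean_post("New remix!", "Check this out: https://youtu.be/abc123 (so good)"))
# → new remix check this out so good
```

Links are removed before punctuation so that a URL is dropped as a whole instead of being split into word-like fragments.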

Once the text had been cleaned, it was time to analyze it using Natural Language Processing. I ran all the posts through a Count Vectorizer, which identified all the unique words in the posts and counted the number of times each word appeared in each post. I also tried a slightly different approach using a TF-IDF Vectorizer, which goes beyond a simple count by also taking into account how many posts contain each word, so that words appearing in nearly every post carry less weight. In both cases I also included every two-word combination that appeared together, which helped capture the context of the words and thus improved the performance of the model. These results could then be run through a model that would predict which of the two subreddits a given post came from.
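To make the difference between the two vectorizers concrete, here is a small standard-library sketch. It assumes that "two words together" means adjacent word pairs (bigrams), and it uses one common smoothed form of the inverse document frequency, ln((1 + N) / (1 + df)) + 1. The function names are hypothetical, and a library vectorizer may differ in details such as tokenization and normalization.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(text: str) -> List[str]:
    """Return the single words of a post plus its adjacent word pairs."""
    words = text.split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def count_vectors(posts: List[str]) -> List[Counter]:
    """Count how often each term appears in each post."""
    return [Counter(ngrams(post)) for post in posts]


def tfidf_vectors(posts: List[str]) -> List[Dict[str, float]]:
    """Weight each count by how rare the term is across all posts."""
    counts = count_vectors(posts)
    n_posts = len(posts)
    # Iterating a Counter yields each distinct term once, so this counts posts.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_posts) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


# Made-up, already-cleaned posts
posts = ["rock band", "rock guitar solo", "edm drop"]
weights = tfidf_vectors(posts)
print(count_vectors(posts)[0])                      # raw counts for the first post
print(weights[0]["band"] > weights[0]["rock"])      # → True
```

In the first post, "rock" and "band" each appear once, so their raw counts are equal. Because "rock" also appears in a second post while "band" does not, TF-IDF gives "band" the larger weight.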

I tried out several models and compared them to see which performed best. Each model uses a different algorithm to classify a post as either from r/EDM or r/rock based on the text of the post. The models I chose to compare were Naive Bayes, Logistic Regression, Random Forest, and K-NN. Each model was tested with both the Count Vectorizer and the TF-IDF Vectorizer. The results of how accurately each model was able to predict the genre of a post are shown here:

The best-performing model was Naive Bayes, and it did better with the TF-IDF Vectorizer than with the Count Vectorizer. This model correctly predicted the genre as either EDM or Rock for close to 90% of the posts. This demonstrates that there is a clear difference in the language people use to describe different genres of music, enough for a model to distinguish between these dialects based solely on the words people use.
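For readers curious how a Naive Bayes text classifier reaches its decision, the following is a rough from-scratch sketch of the multinomial variant with add-one smoothing. It is not the project's model: it works on raw word counts instead of TF-IDF weights, the function names are hypothetical, and the four training posts are invented for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_naive_bayes(docs: List[Tuple[str, str]], alpha: float = 1.0):
    """Learn log priors and smoothed log word likelihoods from (text, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}

    log_prior = {label: math.log(n / len(docs)) for label, n in label_counts.items()}
    log_like = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[label] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict(text: str, log_prior, log_like) -> str:
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, prior in log_prior.items():
        score = prior
        for word in text.split():
            if word in log_like[label]:  # words never seen in training are skipped
                score += log_like[label][word]
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training posts, already cleaned
train = [
    ("sick drop at the festival", "EDM"),
    ("new dj remix", "EDM"),
    ("guitar solo by the band", "rock"),
    ("classic metal band", "rock"),
]
log_prior, log_like = train_naive_bayes(train)
print(predict("festival remix", log_prior, log_like))  # → EDM
print(predict("metal guitar", log_prior, log_like))    # → rock
```

Each word nudges the score toward the genre where it was seen more often, and the genre with the highest total wins.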

We can even identify particular words that are highly associated with a particular genre, which provides some great insight into what specific words make each style unique. As more and more people use these words more frequently when talking about a certain kind of music, they begin to shape the language and the culture around a musical genre. To make its predictions, my model learns which words are the most consequential and weights them accordingly. To determine the most important words, I examined which words were weighted the highest in my model and pulled out the top 15 for EDM and the top 15 for rock.
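One plausible way to rank words by how strongly they point to a genre is a smoothed log-odds score: the log probability of a word in one genre minus its log probability in the other. The article does not spell out exactly how the top words were extracted, so treat this as an assumption; the function name `top_words` and the word counts below are made up for the example.

```python
import math
from collections import Counter
from typing import List, Tuple


def top_words(target: Counter, other: Counter, k: int = 15,
              alpha: float = 1.0) -> List[Tuple[str, float]]:
    """Rank words by smoothed log-odds of appearing in `target` versus `other`."""
    vocab = set(target) | set(other)
    target_total = sum(target.values()) + alpha * len(vocab)
    other_total = sum(other.values()) + alpha * len(vocab)
    scores = {
        word: math.log((target[word] + alpha) / target_total)
        - math.log((other[word] + alpha) / other_total)
        for word in vocab
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Invented per-genre word counts
edm_counts = Counter({"remix": 8, "dj": 6, "song": 5})
rock_counts = Counter({"band": 9, "guitar": 7, "song": 5})

print([word for word, _ in top_words(edm_counts, rock_counts, k=2)])
# → ['remix', 'dj']
print([word for word, _ in top_words(rock_counts, edm_counts, k=2)])
# → ['band', 'guitar']
```

A word such as "song" that is equally common in both genres scores close to zero, so it never reaches the top of either list.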

As can be seen in the charts above, the words that are most associated with a genre fit with the lingo you would expect to see. Unsurprisingly, the top word for each genre is the name of the genre itself. For the rock genre, we can see that the word “band” is ranked almost as high as “rock.” This indicates that the word “band” is extremely prevalent among people talking about rock music, which makes sense given that most rock artists are bands. It also makes sense that we would see words like “guitar” or “metal” ranking high on this graph, as rock usually involves a lot of guitar playing, and metal is a popular subgenre of rock music.

We can see a similar trend for the most important words involving EDM. For example, words like “remix,” “DJ,” “festival,” and “drop” are all highly associated with electronic music for different reasons. Most EDM artists are DJs, and these DJs will very commonly remix a song by another artist. Many live EDM events are giant music festivals, and the songs usually center around a climactic section known as a “drop.” The more we look into these top words, the more we can understand about the culture around each style of music.

This concept could certainly be applied to other genres, and I would expect to find unique language relating to each genre as shown above. People really do talk about different styles of music in very different ways, and there’s always more to explore in this realm. With more data and further tuning of my cleaning and modeling process, I believe I could improve the accuracy of this model beyond 90%, but these results still show the effectiveness of Natural Language Processing in identifying the unique dialect surrounding a particular style of music. I would be interested in trying this with other areas of music in the future, but I found this to be a very interesting case study. I hope you did too!

Written by Josh Robin