NLP Jam: The Grateful Dead and Phish

For this blog post, I will share one of the projects that I completed as part of my data science boot camp.

David Bertsch
Towards Data Science

Introduction

My data science boot camp class was assigned the task of building an NLP model that would take in a reddit post, and classify it as belonging to one subreddit or another, based on the text. The first step of the project was to select two different subreddits.

The Grateful Dead and Phish are often linked together because both bands share an artistic approach that sets them apart from most music: they emphasize the live concert performance over recording studio albums. Both fanbases were early to make use of the internet, so they have very active reddit pages in addition to several other online forums. As someone who is obsessed with both bands, I saw the Grateful Dead and Phish subreddits as an obvious choice.

A very brief description of the bands:

The Grateful Dead

The Grateful Dead rose to prominence in the 60s as part of the countercultural movement in San Francisco. They took a variety of musical influences — bluegrass, classical, folk, country, blues, jazz — and formed their own individual sound. They pioneered a style of playing unique concerts that consisted largely of improvisation. As a result of their concerts being different from one another, they attracted fans that became obsessive followers to the point that some fans would actually “tour” with the band. Fans of The Grateful Dead became known as “Deadheads”.

Phish

Phish formed in 1983 as college students, and quickly developed a wildly original sound and song catalog. Within about 10 years, they had a diehard nationwide fanbase of their own. Despite having different musical influences and style, they followed the Grateful Dead’s model for making their concerts distinct from one another and emphasizing the improvisation in their performance. Obviously, their fans are referred to as “Phish-heads”.

If you are interested in a (long) primer on the music of these respective bands, I’ll link to a couple of signature jams at the end of this post. Obviously, there is a limit to how much you’ll be able to discern from just a single song/jam, but I would hope to impart the following takeaway:

There are some musical similarities in terms of the instrumental makeup of the bands, but overall, the musical styles are fairly distinct. At the same time, these bands have a common approach to the performance of the music in the sense that they use their songs as launchpads to improvisational space.

To me, this is one of the most interesting aspects of the comparisons between Phish and the Dead. Sometimes, I sense that there is a prevailing misconception that these two bands have very similar music, but I think that what really links these bands together is their fans.

Deadheads and Phish-heads are notoriously devoted. Both bands have hundreds of songs, and any given concert consists of some sampling of these songs in whatever order is spontaneously constructed in the moment of performance. Furthermore, many of the songs are appended with improvisational sections that are different with each individual performance. The result is that each concert is a unique performance. This is why these fans are known to travel many miles to repeatedly attend shows. To me, this points to a similarity across the two fanbases in how they connect with and consume music, and how they approach life in general.

Because of this similarity, I would expect these two message boards to be somewhat interchangeable, aside from the primary discussion topic of the bands themselves. Both forums probably contain a lot of similar discussion on music in general. And both forums probably contain a lot of miscellaneous discussion on things like current events, personal stories, philosophy, etc. The point is that I am sure there are many posts that even a relative expert like me would not be able to accurately classify. More than anything, that is what intrigued me about building a model to classify posts from these subreddits: would my NLP classifier be as accurate as someone who frequents these message boards?

Maybe so, maybe not…

Hypotheses

I can tell your future, just look what’s in your hand…

Before doing any data collection, I had a few ideas as to how an NLP classification model might perform.

  1. I thought it would be fairly straightforward to build an accurate model. This was based on the idea that there would be a lot of recurring band-specific terminology: band member names, song titles, album titles, or years during which the bands did not overlap. I figured that a good chunk of the posts to these subreddits would contain at least some of this kind of terminology.
  2. I thought that a model would have a hard time with classifying vague posts. I figured that there would be many posts that wouldn’t explicitly refer to either band.
  3. I thought that the model would have a challenge with classifying “crossover” posts where one band is being discussed in the other band’s subreddit. Since there are many fans of both bands, I know that this kind of discussion does occur fairly often.

Data

The data for this analysis was gathered using the Reddit pushshift API. I compiled about 6,000 posts from each of these subreddits that spanned about 2 years. I only included the original post title and “self text” and did not gather the corresponding comments.

The data called for some routine cleaning…

Once I had the data, I had to perform a few data cleaning operations:

  • The data contained entries that did not include any “self text”. This is the text that the user attaches to their post, separately from the title of the post. These null values were replaced with blank cells.
  • Many of the entries contained text that did not represent a user’s language, like “\n”, “[deleted]”, “[removed]”, or urls. These were filtered from the data (toss away stuff you don’t need in the end).
  • I lemmatized the text, in order to group words together that had the same root (keep what’s important).

After doing this, the data was ready for analysis.
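A minimal sketch of this kind of cleaning, using only Python's standard library, might look like the following. It is an illustration under stated assumptions, not necessarily the code used for this project: the function name `clean_post`, the URL pattern, and the choice to join the title and self text into one string are all hypothetical. Lemmatization is left out because it needs a third-party library such as NLTK or spaCy.

```python
import re

# Placeholders Reddit leaves behind when a post body is gone.
PLACEHOLDERS = ("[deleted]", "[removed]")
# Assumed URL pattern for illustration; real posts may need something stricter.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_post(title, selftext):
    """Combine a post's title and self text into one cleaned string."""
    # Missing self text (None) is treated as an empty string.
    body = selftext or ""
    text = f"{title} {body}"
    # Drop URLs and the deleted/removed placeholders.
    text = URL_PATTERN.sub(" ", text)
    for placeholder in PLACEHOLDERS:
        text = text.replace(placeholder, " ")
    # A literal backslash followed by "n" becomes a space, then every run of
    # whitespace (including real newlines) collapses to a single space.
    text = text.replace("\\n", " ")
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_post("Tweezer", "[removed]")` returns just the title, and a post with no self text at all passes through without raising an error.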

Modeling

Let’s get down to the nitty gritty!

I tried out the following models in my analysis:

  • Logistic Regression with Count Vectorization
  • Naive Bayes with Count Vectorization
  • Logistic Regression with TF-IDF Vectorization
  • Naive Bayes with TF-IDF Vectorization
  • Support Vector Machines with Count Vectorization
  • Random Forests with Count Vectorization
  • Support Vector Machines with TF-IDF Vectorization
  • Random Forests with TF-IDF Vectorization
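To make one of these combinations concrete, here is a from-scratch sketch of Naive Bayes with count vectorization. It is a teaching example and an assumption about how the idea works in miniature, not the implementation behind the results in this post, which would more typically come from a machine learning library. The class name `CountNaiveBayes`, the whitespace tokenizer, and the toy posts in the usage note are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a post and split it on whitespace."""
    return text.lower().split()


class CountNaiveBayes:
    """Multinomial Naive Bayes over raw word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of posts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)
```

Trained on a handful of invented posts such as "jerry garcia dark star" (labeled Dead) and "trey anastasio tweezer jam" (labeled Phish), the model leans on exactly the band-specific vocabulary described in the hypotheses. A post with none of that vocabulary gives it nothing to work with beyond the class priors.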

All of the models performed fairly accurately, and they all had a fair amount of variance.

The logistic regression models and the Naive Bayes models had the highest accuracy by a small margin, but the highest variance by a significant margin (~90% testing accuracy/~99% training accuracy).

The random forests models reduced the variance, but had the highest bias by a substantial margin (~90% testing accuracy/~85% training accuracy).

The support vector machines model with TF-IDF vectorization was my preferred model. It had the lowest variance, and its testing accuracy was close enough to the best performing logistic regression/Naive Bayes models. This model was 90% accurate on testing data classification and 94% accurate on training data classification.

Results

In this section, I’ll go through some of the results that I found to be most interesting.

Which distinguishing words appeared most frequently in the posts for each subreddit?

I wanted to find out which distinguishing words or “buzzwords” were most common for these subreddits. These are words that make specific references to one band or the other. They are mostly band member names, song title words, and relevant dates. One interesting thing is that many of these words occur frequently in both forums, so the mere occurrence of a buzzword is not sufficient for making an accurate classification, although it is probably a strong indicator.

Buzzwords that appeared most frequently on the Dead subreddit
Buzzwords that appeared most frequently on the Phish subreddit
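One way to check how often a buzzword shows up on each side is to count the posts that mention it, per subreddit. The sketch below assumes the posts are available as `(subreddit, text)` pairs and that the buzzword list is chosen by hand; the function name `buzzword_post_counts` is hypothetical, and the whitespace split ignores punctuation attached to words.

```python
from collections import Counter


def buzzword_post_counts(posts, buzzwords):
    """Count, per subreddit, how many posts mention each buzzword.

    `posts` is an iterable of (subreddit, text) pairs.
    """
    wanted = {w.lower() for w in buzzwords}
    counts = {}
    for subreddit, text in posts:
        # A set, so that each post counts at most once per word.
        present = wanted & set(text.lower().split())
        counts.setdefault(subreddit, Counter()).update(present)
    return counts
```

Comparing `counts["dead"]["trey"]` with `counts["phish"]["trey"]` then shows directly whether a supposedly Phish-specific word is also turning up in the Dead forum.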

Which words appeared most frequently in misclassified posts?

In investigating the misclassified posts, I looked into which words occurred most frequently. I filtered out ordinary words in order to build the plot below.

Words that appeared most frequently in misclassified posts
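The filtering step can be sketched as follows: keep only the posts where the predicted label differs from the true one, drop ordinary words, and count what is left. The stop list here is a tiny hand-picked assumption for illustration, and `top_misclassified_words` is a hypothetical name; a real analysis would use a much longer stop list.

```python
from collections import Counter

# A tiny hand-picked stop list for illustration; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "for", "i", "is", "on"}


def top_misclassified_words(texts, true_labels, predicted_labels, n=10):
    """Return the n most common non-stop words among misclassified posts."""
    counter = Counter()
    for text, truth, guess in zip(texts, true_labels, predicted_labels):
        if truth != guess:
            words = (w for w in text.lower().split() if w not in STOP_WORDS)
            counter.update(words)
    return counter.most_common(n)
```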

What proportion of misclassifications were from the Dead subreddit vs. the Phish subreddit?

There were 161 Dead posts that were misclassified compared to 142 Phish posts that were misclassified. This seems relatively balanced. It would be interesting to see if this trend held up with a larger sample of results. It is possible that the slight imbalance is a result of the Phish forum containing a larger chunk of band-specific discussion, since they are still an active band, whereas the Grateful Dead band members perform with off-shooting side projects nowadays.
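As a quick check on how balanced that split is, the two counts above work out to roughly 53% Dead and 47% Phish among the misclassified posts:

```python
dead_errors = 161   # misclassified Dead posts, as reported above
phish_errors = 142  # misclassified Phish posts, as reported above

total_errors = dead_errors + phish_errors
dead_share = dead_errors / total_errors
phish_share = phish_errors / total_errors

print(f"Dead:  {dead_share:.1%} of {total_errors} misclassified posts")
print(f"Phish: {phish_share:.1%} of {total_errors} misclassified posts")
```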

Which types of posts were prone to misclassification?

Scanning through the misclassified posts, I noticed that there were a few common types of posts that the model had a hard time classifying correctly:

  • Posts that pertained to ticket trading/selling were often misclassified, since these posts tend to refer mainly to dates, venues, and locations.

Some examples:

Best place to trade tickets? I’m not looking to scalp or go above face value, just trying to trade my GA for good hard seats. Short wife means I need 1st row of sections.

Anyone get tickets yet? Wondering how much would the price range be so i know what to put aside, thanks!

Chicago ticket warning… For all you heads looking for cheap Chicago tickets, Don’t buy hard tickets from Craigslist. Fake tickets are everywhere on CL right now. Its like the brown acid. Stay AWAY. But if you feel like experimenting buy lottery tickets instead. Be safe be smart

  • Posts that referred to one band in the other band’s subreddit were naturally prone to misclassification.

Some examples:

As of today, Trey has been alive longer than Jerry. Eternally grateful that we have been able to hear and see them each. Their contributions to music will live on for a long long time.

I posted in the Dead subreddit, but are there any Grateful dead/phish/other sweet bands themed fantasy baseball leagues out there?

Fall ’97 Spring ’77 The Dead released the May 1977 box set and Cornell this year. I think its time Phish release the Fall ’97 box set. I spent a good amount of time this summer listening to both tours and they both share the good funk in their own way and with ideal energy and tightness. Just wanted to put that out there. I want my goddamn box set.

  • Vague posts that did not explicitly refer to either band naturally had a tendency to be misclassified.

Some examples:

Favorite Complete Show on YouTube? Thinking about putting on an old show tonight. What would you recommend?

Anyone have/know of a driveway/neighborhood/campground that I could park my car at after the show in Phoenix on Sunday?

What experiences have you had with cults on the lot? I have been reading about the twelve tribes and am interested in hearing your run ins with them or any other cults. I was thinking of doing a research paper on them do you think they are too dangerous to visit? Thanks!

Statements just seem vain at last…

Conclusions

This was a useful exercise in testing out a variety of different NLP models. I think the performance of my models was generally quite good, but there is still potential for improvement.

Overall, my hypotheses — that the models would likely be accurate, but that they would be susceptible to misclassifications on certain types of posts — proved to be correct.

My first idea for improving the model would be to acquire more data. I would do this either by getting more posts or by including the comment text for each of the posts that I analyzed for this modeling process. Having a larger dataset would allow the model to learn more band-specific buzzwords, and it could allow the model to pick up on any ongoing inside jokes or recurring references that are unique to one subreddit or the other.

As is reflected in my hypotheses, I think it is reasonable that there would be some unavoidable amount of bias in any model classifying posts between these subreddits. Both subreddits contain a fair amount of similar conversation about music and concerts in general that doesn’t necessarily pertain to either band. Also, both message boards contain a lot of conversation pertaining to the other band. Phish and the Grateful Dead share a lot of fans, and there is a lot of discussion about both bands in both subreddits.

Interestingly, some of the top buzzwords for each band were common words in both subreddits. Since there is a lot of crossover discussion between the forums, “Phish” and “Dead” are words that, while especially likely to occur in their respective subreddits, are also fairly likely to appear in the other subreddit. This means that simply having a “buzzword” is not enough to correctly classify each post. This would likely be a factor in many of the misclassifications.

It would be interesting to repeat this exercise with two other subreddits of bands that were not as closely linked in terms of their fanbases. It would also be interesting to repeat the exercise on two non-music subreddits. I wonder how the accuracy would compare and if the same kind of model would emerge as the best one.

Thank you for indulging me! I’m sure you found this to be as riveting as I do. I somehow continue finding new ways to be entertained by Phish and the Dead. I’ve still got a long ways to go in developing my understanding of NLP and the tools that go along with it, but at least I’m enjoying the ride!

Finally, let there be songs to fill the air…

10/18/74 Dark Star — Winterland Ballroom
10/20/13 Tweezer — Hampton Coliseum
