
Getting Group Behavior Insights Through Discord Servers


How analyzing Discord messaging data can give valuable insights regarding social media behavior and social groups.

Photo by Chewy on Unsplash

Introduction

When browsing the internet, social media is basically unavoidable. Pretty much everyone uses at least one social media platform for a multitude of reasons and they often represent a significant part of a person’s social life.

It’s for that exact reason that analyzing behavioral data in social media platforms is a great way to get insights about people, and, even more so, about groups. And in social media platforms based around chatting, those group insights become even more apparent.

In this article, I’ll show how to get your hands on Discord messaging data, prepare it for analysis, and play with it a little bit. You can follow along with the Google Colab notebook containing the code for this article.


The Data

In a chat-based platform, especially one like Discord with large servers divided into multiple channels, there are many variables we can keep track of when building a dataset.

For the data I’ll be using here, all the messages were taken from the same channel, which means we don’t need to keep track of where the messages were sent. The only variables that matter for this analysis are the author of the message, the time the message was sent, and the content of the message itself.


Before We Get Started

When working with data like this that has the names – or, in this case, user names – of other people who did not necessarily permit you to make their information public, you should always anonymize the data to avoid any unnecessary problems while also respecting their privacy.
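As a minimal sketch of one way to do this, the snippet below replaces each user name with a neutral label such as "User 1". The sample rows and the column names (author, content) are assumptions made for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical sample data; the real dataset has the same kind of columns.
data = pd.DataFrame({
    "author": ["alice#1234", "bob#5678", "alice#1234"],
    "content": ["hi", "hello", "how are you?"],
})

# factorize() assigns one integer code per distinct author,
# in order of first appearance.
codes, uniques = pd.factorize(data["author"])

# Replace the real user names with neutral labels.
data["author"] = [f"User {code + 1}" for code in codes]
```

Keep in mind that user names can also show up inside the message text itself (for example in mentions), so the content column may need a similar pass.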


Preparing The Data

Before we can begin making our analysis, we have to make sure our data has everything we’re looking for.

If you look at an overview of the file using Pandas, you should notice that there are missing values in the dataset. That is because, in this dataset, images and embeds sent in messages are represented as NaN values.

data.info()
Using the DataFrame.info() method to get an overview of the dataset.

Since these messages represent such a small portion of our dataset and are of no use to us, we can simply drop them. We should also convert the time column to a datetime type to make working with dates easier.

Once we get that out of the way, we can focus on deriving more columns to analyze, such as which emotes a message contains or how many words it has.
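A minimal sketch of those two steps, assuming the columns are named author, time and content (the sample rows are made up):

```python
import pandas as pd

# Hypothetical sample; a message that was only an image or embed
# shows up with missing content.
data = pd.DataFrame({
    "author": ["User 1", "User 2", "User 1"],
    "time": ["2021-03-01 18:05:00", "2021-03-01 18:06:30", "2021-03-02 09:15:10"],
    "content": ["hello there", None, "good morning"],
})

# Drop the rows whose content is missing.
data = data.dropna(subset=["content"]).reset_index(drop=True)

# Parse the time column so that date filtering and the .dt accessor work.
data["time"] = pd.to_datetime(data["time"])
```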


Besides the regular emojis that we’re all used to, Discord also has exclusive guild emotes, which can be either static or animated. To represent all three kinds of emotes, we’ll add a separate column for each, where every entry is an array containing all the emotes of that type sent in the message.

To make those arrays, we can use regular expressions to identify which messages contain emotes and then add them to their respective columns.

I used Python’s emoji library to extract the regular emojis from the messages more easily, since not all of them fit the regular expression pattern I was using.
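As a rough sketch of the regular expression part, the snippet below covers only the two kinds of guild emotes. It assumes the export keeps Discord's raw message text, where a static guild emote is written as `<:name:id>` and an animated one as `<a:name:id>`. The function name extract_guild_emotes and the sample message are hypothetical, and the patterns are not necessarily the ones used in the notebook.

```python
import re

# Static guild emotes look like <:name:id>, animated ones like <a:name:id>.
STATIC_EMOTE = re.compile(r"<:(\w+):\d+>")
ANIMATED_EMOTE = re.compile(r"<a:(\w+):\d+>")

def extract_guild_emotes(message):
    """Return the names of the guild emotes found in a raw message."""
    return {
        "emotes": STATIC_EMOTE.findall(message),
        "animated_emotes": ANIMATED_EMOTE.findall(message),
    }

sample = (
    "nice <:pog:123456789012345678> "
    "<a:partyblob:987654321098765432> <:pog:123456789012345678>"
)
result = extract_guild_emotes(sample)
```

Applying a function like this to the content column, for example with Series.apply, would give the arrays for the two guild emote columns.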

After that, we simply add our word count column to the data frame and we’re ready to start playing with the data.
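One simple way to build that column is to split each message on whitespace and count the pieces. The column name word_count is an assumption here:

```python
import pandas as pd

# Hypothetical sample messages.
data = pd.DataFrame({
    "content": ["hello there", "lol", "see you all tomorrow"],
})

# Split each message on whitespace and count the resulting words.
data["word_count"] = data["content"].str.split().str.len()
```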


Analyzing The Data

With our transformed dataset in hand, we can begin our analysis to gain insight into our messaging history data.

One possible analysis, for example, would be to look at how user activity in the server changed over the period covered by the dataset, which contains over a year’s worth of messages from that server.

However, while doing that we run into a problem: there are too many authors, and many of them lack enough data to contribute anything substantial to the analysis.

That’s because the data also includes messages from bots and users who are inactive in the server and therefore tell us nothing about the group as a whole.

fig = px.histogram(data, x='author', title='Messages Sent Per Author', labels={'author':'Author'})
fig.update_layout(title_font_size=30, template='plotly_white')
fig.show()

By setting a minimum number of messages as a baseline and dropping authors who have sent fewer than that, we can filter out those who have not contributed enough data to matter for this analysis.

With a smaller number of authors who have all contributed a significant number of messages, we can finally plot a graph that better represents the user activity in the server over time.
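A minimal sketch of that filter is shown below. The threshold of 3 and the sample rows are arbitrary values for illustration; a real server would need a much higher baseline.

```python
import pandas as pd

# Hypothetical sample: two active users and a bot with a single message.
data = pd.DataFrame({
    "author": ["User 1"] * 5 + ["User 2"] * 3 + ["Bot"],
    "content": ["msg"] * 9,
})

min_messages = 3  # arbitrary baseline for this example

# Count messages per author and keep only authors at or above the baseline.
counts = data["author"].value_counts()
active_authors = counts[counts >= min_messages].index
filtered = data[data["author"].isin(active_authors)]
```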

fig = px.histogram(data, x='time', color='author', opacity=0.5, title="User Activity Over Time", labels={'time':'Date'})
fig.update_layout(barmode='overlay', title_font_size=30, template='plotly_white')
fig.show()

By making the time window smaller, you can get insights as to how the activity in the server has been lately, even finding out at what hours of the day the server is the most active.
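As a sketch of what narrowing the window could look like, the snippet below keeps only the last 30 days of messages and counts them per hour of the day. The 30-day window and the sample timestamps are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample with an already parsed time column.
data = pd.DataFrame({
    "author": ["User 1", "User 2", "User 1", "User 2"],
    "time": pd.to_datetime([
        "2021-01-10 21:15:00",
        "2021-06-01 21:40:00",
        "2021-06-02 09:05:00",
        "2021-06-03 21:55:00",
    ]),
})

# Keep only the messages from the 30 days before the latest message.
cutoff = data["time"].max() - pd.Timedelta(days=30)
recent = data[data["time"] >= cutoff]

# Count messages per hour of the day to see when the server is busiest.
messages_per_hour = recent["time"].dt.hour.value_counts().sort_index()
busiest_hour = messages_per_hour.idxmax()
```

The recent data frame can then be passed to the same px.histogram call in place of data.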

We can also plot the total number of messages sent per author to see who is the most active and how the members of the server compare.

fig = px.bar(x=data.author.value_counts().index, y=data.author.value_counts(), color=data.author.value_counts().index, title='Messages Sent per User', labels={'x': 'Author', 'y': 'Messages Sent'})
fig.update_layout(title_font_size=30)
fig.show()

Note that the plot above counts all the messages sent that are contained in the data frame. You can also limit the time frame to get more recent data instead of considering such a big window of time.

By applying a simple filter to the dataset, you can make that same plot show the number of messages containing a certain string, still split by author. Note that str.contains is case-sensitive by default, so searching for 'LOL' will not match 'lol'.

term = 'LOL'
term_data = data[data.content.str.contains(term)]
# The exact same plot, but replaced data by term_data
fig = px.bar(x=term_data.author.value_counts().index, y=term_data.author.value_counts(), color=term_data.author.value_counts().index, title=f'Messages Containing "{term}" Per User', labels={'x': 'Author', 'y': 'Messages Sent'})
fig.update_layout(title_font_size=30, template='plotly_white')
fig.show()

Another possible analysis is to compare the total number of emotes sent by each user, split by emote type. This can, for example, indicate who in the group has Nitro, as only those who do can send animated emotes.

For this plot, however, we first need to get the data frame into a ‘tidy’ (long) format, which is what Plotly Express works with for bar graphs like the one we’re about to plot. We can do so with the pd.melt() function, which reshapes the chosen columns of a data frame into a ‘variable’ column and a ‘value’ column.
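A sketch of how the data_line frame used in the plot could be built is shown below. The emote column names (emojis, emotes, animated_emotes) and the sample rows are assumptions, so the notebook's version may differ.

```python
import pandas as pd

# Hypothetical sample: each emote column holds a list per message.
data = pd.DataFrame({
    "author": ["User 1", "User 2", "User 1"],
    "emojis": [["😀"], [], ["🎉", "🎉"]],
    "emotes": [["pog"], ["pog"], []],
    "animated_emotes": [[], ["partyblob"], []],
})
emote_cols = ["emojis", "emotes", "animated_emotes"]

# Turn each list into the number of emotes it contains.
counts = data[["author"]].copy()
for col in emote_cols:
    counts[col] = data[col].map(len)

# Total emotes of each type per author.
per_author = counts.groupby("author", as_index=False)[emote_cols].sum()

# Reshape to long format: one row per (author, emote type) pair,
# with the type in 'variable' and the count in 'value'.
data_line = pd.melt(per_author, id_vars="author", value_vars=emote_cols)
```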

fig = px.bar(data_line, x ='author', y='value', color='variable', labels={'value':'Emotes Sent', 'author':'Author'}, title="Emotes Sent per User")
fig.update_layout(title_font_size=30, template='plotly_white')
fig.show()

Conclusion

After all these examples, it should be apparent that there are many potential insights to be drawn from chat data such as that from Discord servers.

From insights into someone’s routine to their behavior as part of a wider group, data from social media is one of the best ways to discover new things about people, and that’s why it deserves more attention.


References

This article was inspired by the following two repositories, which perform similar analyses for WhatsApp and Telegram respectively:

kurasaiteja/Whatsapp-Analysis

expectocode/telegram-analysis

