A Comprehensive Data Analysis on a WhatsApp Group Chat

Ever wondered which of your friends swore in texts the most?

Faaez Razeen
Towards Data Science


Photo by Yucel Moran on Unsplash

While thinking of what to do for my next project, an idea suddenly popped into my head. Why not do data analysis on a WhatsApp group chat of college students and find interesting insights: the most used emoji, the sentiment score of each person, who swears the most, the most active times of the day, or whether the group uses phones during college teaching hours? These would be some interesting insights for sure, more for me than for you, since the people in this chat are people I know personally.

Note: I represent each of my friends with two letters, an abbreviation of their name as a way to maintain anonymity.

Data Retrieval & Preprocessing

The first step was to gather the data. WhatsApp allows you to export your chats through a .txt format. Opening this file up, you get messages in a format that looks like this:

Telling lies? No, Papa!

Since WhatsApp texts are multi-line, you cannot just read the file line by line and get each message you want. Instead, you need a way to identify whether a line starts a new message or is part of an old one. You could do this using regular expressions, but I went with a simpler method. I created a function called vali_date(), which returns True if the argument passed is a valid date.

I am extremely proud of how I named this function.

While reading each line, I split it on a comma and take the first item returned by split(). If the line is a new message, the first item will be a valid date, and the line is appended as a new message to the list of messages. If not, the line is part of the previous message and is appended to the end of it to form one continuous message.
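A minimal sketch of this check and the accumulation loop might look like the following (the '%d/%m/%y' format string is an assumption here; exports vary by locale and phone settings):

```python
from datetime import datetime

def vali_date(token):
    # True if the token parses as a date; '%d/%m/%y' is an assumed export format
    try:
        datetime.strptime(token.strip(), '%d/%m/%y')
        return True
    except ValueError:
        return False

def collect_messages(lines):
    """Group raw export lines into whole messages."""
    messages = []
    for line in lines:
        if vali_date(line.split(',')[0]):
            messages.append(line)            # line starts a new message
        elif messages:
            messages[-1] += ' ' + line       # continuation of the previous message
    return messages
```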

From here it’s just a matter of extracting the necessary details from each line: the date sent, the time sent, the sender, and the message itself. This takes some simple string processing, mainly split(), and making a DataFrame out of the results. If you’re using the same method, make sure to specify the date and time formats correctly.
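The splits could be sketched like this, assuming the usual 'date, time - sender: text' export layout (system messages without a colon would need separate handling, covered below):

```python
import pandas as pd

def parse_message(raw):
    # Split one assembled line into its four parts; maxsplit=1 keeps
    # commas, dashes, and colons inside the message text intact
    date, rest = raw.split(',', 1)
    time, rest = rest.split('-', 1)
    sender, text = rest.split(':', 1)
    return date.strip(), time.strip(), sender.strip(), text.strip()

messages = ['12/03/19, 9:15 pm - AB: hello there']   # sample assembled lines
df = pd.DataFrame([parse_message(m) for m in messages],
                  columns=['date', 'time', 'name', 'message'])
```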

The export also has lines for when a person was removed, when a person was promoted to admin, and other functional messages that are of no use to us. With the code I wrote, these messages were being set as the sender’s name with blank messages. To filter them out, I use a very rudimentary function that checks whether the sender’s name contains certain keywords. To account for other messages I did not hardcode into the if condition, I checked the number of words in the sender’s name: in my exports, no one had more than 3 words in their name, but if you’re doing something like this, you have to account for different names. I had also deleted a friend’s old number from my contacts, which wouldn’t pass the condition above, so I had to handle that with another check. If you’d like to avoid this painstaking process, you should use regexes, which I embarrassingly failed to do.
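A rudimentary filter along these lines (the keyword list is an assumption; the 3-word cutoff mirrors the text and should be adjusted for your group):

```python
SYSTEM_KEYWORDS = ('added', 'removed', 'left', 'created', 'changed')  # assumed list

def is_system_entry(name):
    # Heuristic: a "sender" containing a system keyword, or an
    # implausibly long name, is really a functional message
    words = name.lower().split()
    return any(k in words for k in SYSTEM_KEYWORDS) or len(words) > 3
```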

Since I loaded the dates and times in a single DateTime format, I can easily extract specific information from them, like the day or month a message was sent, which will be useful for heatmaps.
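With pandas this is a one-liner per column via the `.dt` accessor; a small sketch with made-up timestamps:

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(['2019-11-01 21:30', '2019-12-02 09:05'])})
df['month'] = df['datetime'].dt.month_name()   # e.g. 'November'
df['day'] = df['datetime'].dt.day_name()       # e.g. 'Friday'
df['hour'] = df['datetime'].dt.hour            # e.g. 21
```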

At this point, I think I’m ready to start my analysis so I plot a simple line graph to see the frequency of messages over the years. This group was created in the last quarter of 2017 and the chat was exported in 2020. I expected to see a nice line graph with crests and troughs in odd places. But when I looked at the plot, I saw something weird.

Notice the lack of messages for the first half of the export. This seemed bizarre. At first, I thought something was wrong with my text processing, but I eventually figured out that WhatsApp chat exports have a limit on how many messages they can contain. This limit was around 40,000, which explains why there aren’t many messages in the first half of the export. Looking at the shape of the DataFrame, it had 39,489 rows. At this point, I was disappointed that I had no access to almost half my data.

But wait!

Here comes the light at the end of the tunnel. A friend who was also in the group mentioned that he still had his old phone. He exported the chats from it and sent them over. I applied the same extraction techniques as for the first export, albeit with some small changes to my code due to different date representations and contact names, and this new export had a surprising 68,584 messages (I’m not sure how it exceeded the 40,000 limit). I combined the two DataFrames, removed duplicates, and ended up with a grand total of 74,357 messages spanning over 2.5 years. The line graph looks much better now.
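The merge step itself is short; a sketch with toy frames standing in for the two exports:

```python
import pandas as pd

# Toy stand-ins for the two exports; the second overlaps the first
old = pd.DataFrame({'datetime': ['2018-01-01 10:00', '2018-01-01 10:05'],
                    'name': ['AB', 'RF'], 'message': ['hi', 'yo']})
new = pd.DataFrame({'datetime': ['2018-01-01 10:05', '2019-06-01 18:00'],
                    'name': ['RF', 'AB'], 'message': ['yo', 'hello']})

combined = (pd.concat([old, new])
              .drop_duplicates()            # rows identical across all columns
              .sort_values('datetime')
              .reset_index(drop=True))
```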

Exploratory Data Analysis

Now that we have a clean DataFrame to work with, it’s time to perform analysis on it. First things first, since almost all the plots will be comparing one person with another, I’ll assign a specific colour to each person so that it becomes easy to identify each person among multiple plots. I chose my colour palette from this website.

Next, I made a dictionary where each key is a name and each value is that person’s assigned colour. I then created a function that reorders the colours to match the ordering of a plot: it takes the ordered names as input and returns a correspondingly reordered list of colours, which can be passed to the palette argument of a seaborn plotting function.
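Something like this (the names and hex colours here are hypothetical placeholders):

```python
# One fixed colour per person, so they stay recognisable across plots
palette = {'BA': '#1f77b4', 'RF': '#ff7f0e', 'MM': '#2ca02c'}

def ordered_colors(names):
    # Reorder colours to match the plot's bar order;
    # pass the result to seaborn's `palette=` argument
    return [palette[n] for n in names]
```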

Now we have a nice set of colours for each person which we can visualize using palplot.

My first plot will be the total number of messages sent per person. For this, a simple seaborn countplot will suffice. The next one is the average message length for each person. For this, I create a new column called msg_length containing the length of each message, computed with a lambda function, then group the DataFrame by name and apply mean() on the returned groupby object.
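The length column and the per-person average boil down to (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['BA', 'BA', 'MM'],
                   'message': ['hey', 'hello there', 'a rather long message indeed']})

# Character length of each message, then the mean per person
df['msg_length'] = df['message'].apply(lambda m: len(m))
avg_len = df.groupby('name')['msg_length'].mean()
```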

It’s really interesting to see plots like this side by side. For instance, the person who sent the fewest texts, MM, typed the second-longest messages on average: they don’t send many WhatsApp messages, but when they do, the messages are long. We can see that BA sends the most messages while also having a relatively long average message length. RF seems to have beaten everyone on average message length, but on closer inspection, this was due to some ‘outliers’, which can be seen here:

Someone’s really persistent.

Since there are no rules here on what constitutes an actual message, I’ll keep these messages in and not treat them as outliers. WhatsApp is wild.

The chats were exported without any media files. Any message that contained media was replaced with ‘<Media Omitted>’. We can use this to filter for media messages and see who sends the most media.
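Filtering on that placeholder and counting per person is a simple mask plus value_counts (toy data again):

```python
import pandas as pd

df = pd.DataFrame({'name': ['BA', 'BA', 'RF', 'RF'],
                   'message': ['<Media Omitted>', '<Media Omitted>',
                               '<Media Omitted>', 'hi']})

# Keep only the media placeholder rows, then count per sender
media_counts = df[df['message'] == '<Media Omitted>']['name'].value_counts()
```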

BA is beating everyone by a mile. He also ranks first in total messages and third in average message length. The most dedicated contributor award goes to BA!

Not that it’s worth anything.

Time

The next thing we can check uses the months we extracted from the DateTime stamp of each message. We can use these to make a heatmap of the most active periods. I’ll put two plots side by side.

I group the DataFrame by month and apply the sum function on the returned object. To count the number of messages in each group, I had already created a column called count containing the number 1 for each message. And since months are ordered, I passed a list of month names in order to the order parameter of the plot function.
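The count column and ordered-month aggregation can be sketched like this (reindex keeps calendar order and fills empty months with zero, the same effect as passing an ordered list to the plot):

```python
import pandas as pd

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

df = pd.DataFrame({'month': ['November', 'November', 'December'],
                   'count': 1})  # the constant-1 column described above

monthly = df.groupby('month')['count'].sum().reindex(MONTHS, fill_value=0)
```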

We can see that the last 3 months of the year have much more activity than the others. The most active days are November Fridays and December Mondays; however, we can’t infer much from this, as it may be due to a few specific dates rather than recurring patterns. To check, we can look at the most active dates and see which days of the week they fell on.

The top two days were in November and December, which contribute to the high average during these months.

To get a clearer idea of activity on separate days, we plot message counts aggregated over the 7 days of the week. To go even further, we can look at aggregations over each hour of the day.

It looks like the days themselves show little variation in activity, except Saturday. This is probably because Saturday is the first day of the weekend, and people are usually resting and doing activities other than messaging on their phones.

Looking at the hour-wise activity, we can see that most of the group activity occurs at night. Some other interesting things happen:

  • The fall from 9 am to 10 am. 9 am is when classes start, and not many students are inclined to use their phones at this time since they’re either making it to class or catching up with their friends.
  • The small spike at 1 pm. This is during lunch hours at college, when people tend to use their phones more since they’re out of classes. However, the fact that it’s only a slight increase over 11 am and 12 pm shows that we are in fact active on the group during teaching hours. Not that surprising, but nevertheless fascinating to see behaviour like this proved using data.
  • The jump from 3 pm to 4 pm. This is because everyone in the group is a college student. College finishes at 3:30 pm and all of us start travelling home, during which we either sleep on the bus or are busy with other things.

Emojis

Using the emoji library for Python, we are able to extract emojis from the text messages.

For each person in the group, I create a dictionary with emojis as keys and their occurrence counts as values. I tried having the dictionary contain only emojis with at least 1 occurrence, but something in my code wasn’t working, so I initialized a count of 0 for every available emoji for each person. Inefficient, I know, but it works.

I go through each row in the DataFrame, and for each message, I index the dictionary with each emoji in the message and increase its count by 1. Then I sort the dictionary in descending order to get each person’s most used emojis.
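As a self-contained sketch of the same counting idea: the extraction here is a rough codepoint-range stand-in for the emoji library (not the library itself), and collections.Counter sidesteps the pre-initialise-everything-to-zero inefficiency mentioned above.

```python
from collections import Counter

def extract_emojis(text):
    # Rough heuristic: characters in the main emoji codepoint blocks.
    # A stand-in for the emoji library's proper extraction.
    return [ch for ch in text
            if 0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF]

emoji_counts = {}
for name, message in [('RF', 'nice 😂 👍 👍'), ('BA', 'ok 😂')]:  # toy (name, message) rows
    emoji_counts.setdefault(name, Counter()).update(extract_emojis(message))
```

Counter.most_common() then gives each person's top emojis without any sorting code.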

A barplot is also plotted, giving us the total emoji count of each person.

We can see that the most common emoji is the laugh-cry emoji, which is in line with the report by the Unicode Consortium, which says that ‘😂’ is the most frequently used emoji.

There are some emojis which weren’t rendered by Google Colab; these are probably newer emojis. RF seems to use the most emojis, especially ‘👍’, which is a bit suspicious. On inspection, this is due to a single message rather than their collective texting habits.

Swear Words & Sentiments

Something really exciting! There’s a library called profanity_check which can check whether a given message contains swear words. What I want to find out is how much swear words affect the sentiment of a message.

First things first, I create a new column containing 1 if the message contains a swear word and 0 if it doesn’t, using the predict function from the aforementioned library. We can then group the DataFrame by name and take the mean of the swear column, which gives the percentage of messages containing swear words for each person.
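The shape of that computation, sketched with a tiny word-list check standing in for profanity_check's predict (SWEARS here is a placeholder list, not the library's model):

```python
import pandas as pd

SWEARS = {'damn', 'heck'}  # placeholder; the article uses profanity_check.predict

def has_swear(message):
    # 1 if any word is on the list, 0 otherwise (mimics predict's 0/1 output)
    return int(any(word in SWEARS for word in message.lower().split()))

df = pd.DataFrame({'name': ['BA', 'BA', 'MM'],
                   'message': ['damn that test', 'hello', 'good morning']})
df['swear'] = df['message'].apply(has_swear)

# Mean of a 0/1 column is the fraction of swearing messages per person
swear_pct = df.groupby('name')['swear'].mean() * 100
```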

We can see that BA, RF, and GJ swear the most. MM is last, which fits with how little he contributed overall. The rest of the group falls in between.

Now, to test whether swear words affect sentiment, I need the sentiment of each message. For this, I’ll use two different analyzers: VADER (Valence Aware Dictionary and sEntiment Reasoner) from NLTK, and TextBlob. First, VADER. The polarity_scores function returns a dictionary with several keys: positive sentiment, negative sentiment, and overall/compound sentiment. I’ll use the compound sentiment here, which ranges from -1 (most negative) to +1 (most positive). From the TextBlob library, the TextBlob function returns a sentiment with two components: polarity and subjectivity. Here, we’ll use polarity, which again ranges from -1 to +1.

For VADER, since we’re taking a mean of the sentiments, the bars average around zero because there are a lot of neutral texts among the texts with sentiments. We can see that BA, who swore the most, had the most negative average sentiment. RF and GJ were the second and third highest swear users; however, GJ’s negative sentiment is stronger than RF’s, suggesting that swear words alone don’t determine the negative sentiment of a message. VADER is specifically attuned to sentiments expressed in social media, and since WhatsApp is a social media app, we can trust that these sentiments are reasonably accurate.

It’s interesting to see that TextBlob gives completely different sentiment rankings for everyone except the two with the lowest VADER scores, BA and GJ. Even though VADER classified 3 people with negative average sentiments, TextBlob returned average sentiments that were all positive.

To dig a bit deeper to see which one is more accurate, we can look at the messages sorted by polarity score for both VADER and TextBlob and inspect the messages visually to see if the sentiment given makes sense.

The first message looks interesting. Let’s take a closer look.

It’s a copypasta with a ton of heart emojis, which VADER has classified as having the highest positive sentiment, and rightfully so. The length of the message probably contributed to this. TextBlob, however, gave it only a slightly positive sentiment.

The 5th message from the list above is a promotion from a company:

‘ Hi, Greetings from <redacted>. This is regarding a new Membership plan for those who have attended one of our programs. We are looking for “Student Ambassadors” for our firm now. If you are interested , please let us know through WhatsApp or call @ <number>*By the way, who is a Student Ambassador?* He/She is the one who will be representing <redacted> in their respective institution. *What is his/her responsibility?* To organize workshops/competitions/crash courses/Guest lectures in their institution on behalf of <redacted>. *What do you get by being an Ambassador?* You get free pass to all our programs/events. You will be taken as an intern if found talented. And most importantly support and guidance for your research or project works. And you will continue to one of team members of <redacted> after your studies.’

This message clearly does not have a positive tone to it, yet VADER gives it a high polarity score.

Let’s look at messages sorted in descending order of TextBlob polarity score.

Looks like TextBlob works better with short messages, seeing as it rates short messages with a perfect positive sentiment of 1.0. Out of all messages, there are 251 with perfect positive sentiments of 1.0.

Overall, I think I’ll prefer the sentiments from VADER, seeing as it seems to work with long messages better and is specifically attuned to sentiments expressed in social media.

Conclusion

That was fun! It’s really interesting to see texting habits of people and incidents of daily life reflected in text. I suggest you take a look at my code and apply it to your own group chats. However, some modifications will have to be done at the DataFrame creation part. If you’re interested, shoot me a message and I’ll get you sorted.

Thank you for reading! Let me know what you thought about this article.

Code


I’m a CS undergrad who likes to delve into data science once in a while. I also like cats. And sourdough bread. Even though I’ve never had any.