
For most projects where I have applied NLP techniques, the text data were in English. What can we do when the text data are not in English? This article discusses how I derived insights from tweets in foreign languages by analyzing the universal language: Emojis 🎈.
Project Background
Recently I started a for-fun project analyzing Twitter posts about a Japanese show I am watching. In my previous posts, I discussed using the Twint library to gather all show-related tweets, along with some analysis of the tweets and tweet-related actions such as the number of replies, retweets, and likes. The show has broadcast seven episodes in total, and I have gathered over 222k show-related posts. I have presented the code and some interesting results on my GitHub.
Besides analyzing the quantity of tweets, I am also interested in their contents, to see what fans are posting about the show. Using traditional NLP techniques can be challenging here because these 222k tweets are in 40 different languages. Here is the language share of all show-related tweets:

The graph shows that English accounts for only roughly 4% of all tweets, while 86% are in Japanese. To understand the contents and frequently used words of these posts, we could use NLP models designed for different languages, or we could translate all foreign languages into English before applying text-mining techniques. However, for quick insights, we can start by analyzing the Emojis used in all posts.
Frequently Used Emojis Among All Tweets
Emojis are widely used in social media posts, and they help users express their posts' sentiments graphically. Knowing which Emojis are frequently used in show-related tweets helps us understand how fans react to the show. The first step is to group all posts and extract every Emoji they contain. Here I have all tweets stored in a DataFrame called _all_tweets_:

As shown above, by filtering tweets with hashtags related to the show, there are 222,301 tweets in total between October 1st and November 21st, in at least 40 different languages. Each row of the DataFrame is one tweet, with various features such as the user who posted it and the time and date of posting. Tweet contents are stored in the column called "tweet". To get the frequency of all Emojis used in all tweets, we can combine all 222,301 tweets by concatenating, or "joining", the string variables.
tweets_contents = ','.join(all_tweets['tweet'])
The code above joins all rows in the column "tweet" into one giant string variable, separated by commas. Here is an example with the first two rows:
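As a minimal sketch of the same join (the two tweet strings here are made up for illustration):

```python
import pandas as pd

# Hypothetical two-row example standing in for all_tweets
df = pd.DataFrame({'tweet': ['Great episode! 😭', 'So excited 💕💕']})

# Join every row of the "tweet" column into one comma-separated string
tweets_contents = ','.join(df['tweet'])
print(tweets_contents)  # Great episode! 😭,So excited 💕💕
```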

After combining all tweets, we need to extract the Emojis from the string variable "tweets_contents". Emojis are essentially Unicode characters, so we could use regular expressions to exclude all other characters:
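The original pattern is not shown here, but a minimal sketch of the idea (assuming a crude pattern that drops word characters, whitespace, and common punctuation and keeps everything else) might look like this:

```python
import re

# Crude first pass (an assumption, not the article's exact regex):
# keep any character that is NOT a word character, whitespace,
# or common punctuation
non_text = re.findall(r'[^\w\s,.!?]', 'I love this show & 😭✨!!')
print(non_text)  # ['&', '😭', '✨'] -- note that '&' slips through
```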

As shown here, the result also includes other symbols like "&" that are not what we want. Another way of selecting Emojis from a string is to use the emoji library:

This works much better. We can write a function and apply it to all tweets:
import emoji

def extract_emojis(x):
    # keep only characters found in the emoji library's lookup table
    # (newer versions of the emoji library use emoji.EMOJI_DATA instead)
    return [word for word in x if word in emoji.UNICODE_EMOJI]

all_emojis = extract_emojis(tweets_contents)
After checking the "all_emojis" list, I found that some characters show up frequently but are not Emojis. Including them would bias the results, so I added an extra filter to exclude the unwanted items:
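A sketch of such a filter (the exact set of characters to drop is an assumption; in practice you would choose it after inspecting the list):

```python
# Hypothetical extraction result containing a few false positives
all_emojis = ['😭', '™', '💕', '#', '😭']

# Characters that matched the emoji lookup but are not real Emojis
# (this exact set is an assumption -- adjust after inspecting the list)
unwanted = {'©', '®', '™', '#', '*'}

all_emojis = [e for e in all_emojis if e not in unwanted]
print(all_emojis)  # ['😭', '💕', '😭']
```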

Now I have a list called "all_emojis" that stores all the Emojis appearing in the tweets. The next step is to count the frequency of each unique Emoji and show the most frequent ones. This is the classic task of converting a list into a dictionary of item frequencies. As discussed in my previous article, the easiest way to count item frequencies and sort the results is to use the Counter class from the collections module:
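On a toy list standing in for the real "all_emojis", Counter works like this:

```python
from collections import Counter

# Toy list standing in for the real all_emojis
all_emojis = ['😭', '💕', '😭', '✨', '😭', '💕']

emoji_counts = Counter(all_emojis)
print(len(emoji_counts))            # 3 unique Emojis
print(emoji_counts.most_common(2))  # [('😭', 3), ('💕', 2)]
```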

As shown by the number of "keys" in the dictionary, there are close to 900 unique Emojis used in all tweets. The Counter dictionary records each Emoji's frequency as the dictionary value. To get the most frequent keys, call the _most_common_ method of Counter:
common_emojis = Counter(all_emojis).most_common(30)
Here I have chosen the top 30 most common Emojis. The common_emojis list defined by the code above is a list of tuples, with the first element of each tuple being the Emoji and the second its frequency:

The last step is to plot this list as a bar plot:
import matplotlib.pyplot as plt, numpy as np
vals = [x[1] for x in common_emojis]
legends = [x[0] for x in common_emojis]
plt.figure(figsize=(14,4))
plt.ylim(0, 38000)
plt.title('Top 30 Emojis in over 222k Related Tweets')
# get rid of xticks
plt.tick_params(
    axis='x',
    which='both',
    bottom=False,
    top=False,
    labelbottom=False)
p = plt.bar(np.arange(len(legends)), vals, color="pink")
# Make labels
for rect1, label in zip(p, legends):
    height = rect1.get_height()
    plt.annotate(
        label,
        (rect1.get_x() + rect1.get_width()/2, height+5),
        ha="center",
        va="bottom",
        fontname='Segoe UI Emoji',
        fontsize=15
    )
plt.savefig('words_tweets.png')
plt.show()
After some adjustments, the graph looks like this:

Due to font limitations in Matplotlib, the Emojis in the plot are not colored. If you know a way to plot colored Emojis, please do let me know!
Most Frequently Used Emojis By Week
In the previous section, we looked at all Emojis used in all tweets. We can also segment the tweets by week and look at the most frequently used Emojis in each week, to see whether fans' reactions changed over time. The "date" column shows the tweet posting dates. Based on the date, we can create a column indicating in which week a tweet was posted. Since the show releases a new episode every Thursday at midnight, I think it is more meaningful to segment the weeks by episode release dates than by Mondays. First, I created timestamps for all episode release dates. The first episode was released on October 8th, and I set the starting date one week before that:
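A sketch of those timestamps (the exact dates are inferred from the description above: the start date October 1st, then seven weekly Thursday releases beginning October 8th):

```python
import pandas as pd

# One week before episode 1, then each Thursday release
# (dates inferred from the article's description)
release_dates = pd.to_datetime(
    ['2020-10-01', '2020-10-08', '2020-10-15', '2020-10-22',
     '2020-10-29', '2020-11-05', '2020-11-12', '2020-11-19']
)
print(len(release_dates))  # 8 week boundaries
```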

To create the "week" column that can be used to group all tweets by episode, I first define a function, then apply it to every row of the column "date" using a lambda function:
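A minimal sketch of that step, assuming the hypothetical boundary dates above and a small stand-in DataFrame (the original function is not shown, so `get_week` here is an illustrative implementation):

```python
import pandas as pd

# Hypothetical week boundaries: one week before episode 1,
# then each Thursday release
release_dates = pd.to_datetime(
    ['2020-10-01', '2020-10-08', '2020-10-15', '2020-10-22',
     '2020-10-29', '2020-11-05', '2020-11-12', '2020-11-19']
)

def get_week(date):
    """Return the index of the episode week a posting date falls into."""
    week = 0
    for i, start in enumerate(release_dates):
        if date >= start:
            week = i
    return week

# Toy frame standing in for all_tweets
all_tweets = pd.DataFrame(
    {'date': pd.to_datetime(['2020-10-03', '2020-10-09', '2020-11-20'])}
)
all_tweets['week'] = all_tweets['date'].apply(lambda d: get_week(d))
print(all_tweets['week'].tolist())  # [0, 1, 7]
```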

Checking the number of tweets a week after each episode by _value_counts()_:

The show is getting more and more popular as more episodes are released. Episode 7 in particular was released on November 19th, while the data were collected only until November 21st; within about three days, its number of tweets already exceeded that of every other week.
To get the Emoji frequency counts for different weeks, we need to group the tweets by week and apply the frequency count to each week:

Following the previous procedure, after grouping the tweets by week, we use the join function on each group to get eight giant strings representing the eight weeks.
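On a toy two-week DataFrame, that groupby-and-join step can be sketched as follows (the result is the "strings" Series indexed later by week):

```python
import pandas as pd

# Toy frame standing in for all_tweets with a 'week' column
all_tweets = pd.DataFrame({
    'week':  [0, 0, 1, 1],
    'tweet': ['Hype! 😍', 'Cannot wait 😍', 'So good 😭', 'Crying 😭😭'],
})

# One giant comma-separated string of tweets per week
strings = all_tweets.groupby('week')['tweet'].apply(','.join)
print(strings.loc[0])  # Hype! 😍,Cannot wait 😍
```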
For each week, we create a dictionary to store the Emojis' frequency counts, and we also need to record which week each dictionary belongs to. Here I use a list comprehension to construct a list of tuples:
emojis = [(strings.index[i], Counter(extract_emojis(strings.iloc[i])).most_common(1)) for i in range(8)]
The first element of each tuple, "strings.index[i]", is the week index; the second element, "Counter(extract_emojis(strings.iloc[i])).most_common(1)", is the most frequent Emoji and its count for that week:

We could use the list "emojis" to plot a bar chart:
import matplotlib.pyplot as plt, numpy as np
# Set up plot
freqs = [emojis[i][1][0][1] for i in range(8)]
labels = [emojis[i][1][0][0] for i in range(8)]
xlabels = [emojis[i][0] for i in range(8)]
plt.figure(figsize=(8,4))
plt.ylim(0, 17000)
p1 = plt.bar(xlabels, freqs, color="pink")
plt.title('Most Used Emojis by Episodes')
# Make labels
for rect1, label in zip(p1, labels):
    height = rect1.get_height()
    plt.annotate(label,
                 (rect1.get_x() + rect1.get_width()/2, height+5),
                 ha="center",
                 va="bottom",
                 fontname='Segoe UI Emoji',
                 fontsize=20)
plt.savefig('emoji_eps.png')
plt.show()
The slicing of the list "emojis" can be confusing. As demonstrated below, we need "emojis[i][0]" to get the episode week index, "emojis[i][1][0][1]" to get the frequency count, and "emojis[i][1][0][0]" for the Emoji itself.
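On a toy two-week version of the list (the Emojis and counts here are made up), the indexing looks like this:

```python
# Toy version of the 'emojis' list of (week, most_common) tuples
emojis = [(0, [('😍', 120)]), (1, [('😭', 450)])]

i = 1
print(emojis[i][0])        # week index: 1
print(emojis[i][1][0][0])  # the Emoji: 😭
print(emojis[i][1][0][1])  # its frequency: 450
```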

The bar chart looks like this:

By adjusting the parameter of _most_common_, we could also look at the top 3 most frequently used Emojis by week:

Based on the Emoji analysis above, we can feel the fans' enthusiasm as more episodes are released. Not only does the number of show-related tweets increase dramatically, but the emotions expressed in the tweets also grow more intense. Although the tweets are written in different languages, the feelings conveyed by Emojis are universal: tears and love for a great show that warms up the winter of 2020.
For future steps, I will analyze the tweet word counts, either using the Google Translate API to translate the tweets into English or using NLP models for different languages. Thank you for reading this article. Please let me know if you have any suggestions!
Here is the list of all my blog posts. Check them out if you are interested!