Covid-19 Outbreak: Tweet Analysis on Face Masks

Yanqing Shen
Towards Data Science
6 min read · Mar 5, 2020


A sign on a True Value hardware store says protective masks are sold out, in Orinda, Calif., on Feb. 28, 2020. (JOHN G MABANGLO/EPA-EFE/Shutterstock)

‘STOP BUYING MASKS!’ One day after the U.S. reported its second Covid-19 death, health officials continued to plead with Americans to stop panic buying. While the Centers for Disease Control and Prevention (C.D.C.) doesn’t recommend that healthy people wear face masks to avoid infection, Chinese authorities encourage people to do so in the face of Covid-19. The contrast stems from the different traditions and habits of westerners and easterners. With mask price gouging recently reported in the U.S., I became curious about ordinary people’s attitudes towards wearing a mask. What do they talk about online, and what is the overall emotion? Is there any network effect on their remarks? In this post, I study the recent mask shortage through tweet analysis.

Objectives

There are two main objectives for this project:

  1. Understand people’s attitudes towards the icon of the epidemic: the mask
  2. See if there is a network effect on people’s tweets and identify the most discussed topics

In terms of methodology, I will use data mining techniques, sentiment analysis, and network analysis to achieve the above two objectives.

Data Understanding

With a Twitter developer account, I extracted 1,200 tweets containing the keyword ‘mask’ from Twitter. As shown below, the raw data has 16 columns and 1,200 rows. The first column contains the content of each tweet, while the other columns describe information such as engagement, time, and user ID.

Table 1. Part of raw data

Data Cleaning

As shown in Pic 1, it’s hard to analyze the original messy corpus. Using the NLP package in R, I removed punctuation, numbers, stopwords, URLs, and white space from the corpus. Additionally, as the tweets were pulled using the keyword ‘mask’, I am more concerned with the words that appear together with it than with the keyword itself. Thus, I created a dictionary of unnecessary words such as ‘facemask’ and ‘masks’ and removed them from the text.

Pic 1. Part of the original tweets
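For readers who prefer to see the cleaning steps as code, here is a minimal Python sketch of the same idea. It is not the R code used for this post: the stopword list is a tiny illustrative one, the function name `clean_tweet` is hypothetical, and adding the bare keyword ‘mask’ to the custom dictionary is my assumption.

```python
import re
import string

# Assumption: a tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "is", "are", "i", "and", "of", "in", "for", "do", "we"}
# Custom dictionary of words to drop, since every tweet was retrieved with 'mask'.
# 'facemask' and 'masks' come from the text; 'mask' itself is an assumption.
CUSTOM_WORDS = {"mask", "masks", "facemask"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_tweet(text):
    """Return the tweet as a list of lowercase tokens with noise removed."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs (before punctuation is stripped)
    text = re.sub(r"\d+", " ", text)           # numbers
    text = text.translate(PUNCT_TO_SPACE)      # ASCII punctuation only
    tokens = text.split()                      # also collapses repeated white space
    return [t for t in tokens if t not in STOPWORDS and t not in CUSTOM_WORDS]

print(clean_tweet("Do we NEED 2 masks?? Stores are sold out!  https://t.co/abc123"))
# prints ['need', 'stores', 'sold', 'out']
```

URLs are removed before punctuation so that a link is dropped as a whole instead of being split into fragments.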

It’s also important to acknowledge the complexity of text mining: one word can have multiple meanings. ‘Mask’ can also relate to topics like skincare, which is unrelated to my study. I will address this problem in the next step.

Data Exploration

Visualizing the data gives us a preliminary understanding.

Chart 1. Term frequency

Unsurprisingly, people’s tweets centered on the Covid-19 epidemic. Words like ‘need,’ ‘important,’ and ‘planning’ give us a glimpse of their attitude. They also talk about the imbalance between supply and demand in the current market. Noticeably, there are two odd terms on the right-hand side, ‘Bremner’ and ‘Movie.’ They come from a topic about an upcoming movie called M.A.S.K., written by Chris Bremner. In the word cloud below, I saw more unrelated words like ‘allnighter.’

Pic 2. Word Cloud for terms

With the unrelated words now easy to spot, I went back, deleted all the tweets on these topics, and proceeded with the remaining 1,127 tweets.
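The counting and filtering described in this section can be sketched in a few lines of Python. This is an illustration and not the original R workflow: the three sample tweets are made up, the helper names are hypothetical, and the off-topic terms are the ones spotted in the chart and word cloud above.

```python
from collections import Counter

# Hypothetical, already-cleaned tweets (lists of tokens); not the real data.
tweets = [
    ["coronavirus", "need", "supply"],
    ["bremner", "movie", "script"],
    ["coronavirus", "important", "need"],
]

# Terms that signal an unrelated topic.
OFF_TOPIC = {"bremner", "movie", "allnighter"}

def term_frequency(docs):
    """Count how many times each term occurs across all tweets."""
    return Counter(term for doc in docs for term in doc)

def drop_off_topic(docs, off_topic):
    """Keep only the tweets that contain none of the off-topic terms."""
    return [doc for doc in docs if off_topic.isdisjoint(doc)]

kept = drop_off_topic(tweets, OFF_TOPIC)
print(len(kept), "tweets kept")
print(term_frequency(kept).most_common(2))
```

Filtering whole tweets, instead of only deleting the odd words, removes the rest of the off-topic vocabulary along with them.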

Sentiment Analysis

Words carry emotion. The ‘syuzhet’ package in R helps to capture the emotions people express in text.

Chart 2. Sentiment Scores for mask tweets

The above plot shows that tweets are more related to anticipation and trust. People do have some negative feelings on Twitter: the bar chart shows that around 30% of people expressed fear, and about 50% expressed negative emotions. But overall, people on Twitter are optimistic about the mask, the icon of the epidemic.
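To make the idea of lexicon-based emotion scoring concrete, here is a minimal Python sketch. It does not reproduce ‘syuzhet’ or its lexicon: `TOY_LEXICON` is a four-word dictionary I invented for illustration, and the function names and sample tweets are hypothetical.

```python
from collections import Counter

# Toy word-to-emotion lexicon, invented for illustration only.
TOY_LEXICON = {
    "hope": {"anticipation", "trust", "positive"},
    "protect": {"trust", "positive"},
    "afraid": {"fear", "negative"},
    "shortage": {"negative"},
}

def emotion_counts(tokens):
    """Count, per emotion, how many tokens in one tweet are tagged with it."""
    counts = Counter()
    for token in tokens:
        counts.update(TOY_LEXICON.get(token, ()))
    return counts

def share_of_tweets(docs, emotion):
    """Fraction of tweets with at least one word tagged with the emotion."""
    hits = sum(1 for doc in docs if emotion_counts(doc)[emotion] > 0)
    return hits / len(docs)

# Hypothetical cleaned tweets.
docs = [["hope", "protect"], ["afraid", "shortage"], ["not", "afraid"], ["plan", "trip"]]
print(share_of_tweets(docs, "fear"))  # prints 0.5
```

Note that the third tweet, ‘not afraid’, is still counted as fear. Pure word counting ignores negation, so scores from this kind of approach should be read with some caution.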

Network Analysis

While sentiment analysis helps to learn individuals’ attitudes, network analysis identifies relationships on social platforms. In tweet analysis, each word is a vertex, and the degree of a vertex shows how many other words it is connected to. For example, we can tell from the output below that ‘coronavirus’ usually appears together with ‘get.’

Pic 3. Terms that appear together

In the histogram below, the right skew indicates that most terms have small degree values. There are also some extreme values in the right tail, meaning a few terms are closely connected with many others.

Chart 3. Histogram of node degree
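The term network and its degree distribution can be sketched as follows. This is a minimal Python illustration under one assumption of mine: two terms are linked when they appear in the same tweet at least once. The sample tweets and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_graph(docs):
    """Map each term to the set of terms it shares at least one tweet with."""
    neighbors = defaultdict(set)
    for doc in docs:
        terms = sorted(set(doc))
        for term in terms:
            neighbors[term]  # touch the key so isolated terms still appear
        for a, b in combinations(terms, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

def degrees(neighbors):
    """Degree of a term = number of distinct terms it co-occurs with."""
    return {term: len(linked) for term, linked in neighbors.items()}

# Hypothetical cleaned tweets.
docs = [["coronavirus", "get", "need"], ["coronavirus", "get"],
        ["coronavirus", "supply"], ["travel"]]
deg = degrees(cooccurrence_graph(docs))
print(deg["coronavirus"])    # prints 3
print(Counter(deg.values())) # how many terms have each degree (the histogram)
```

Even in this tiny example, one hub term has the highest degree while the other terms sit at the low end, which is the same right-skewed shape as in the chart.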

What are those popular terms? The network graphs below provide a cleaner look. To avoid a messy display, I only included terms with a frequency greater than 30.

Pic 4. Terms Visualization

The connected terms are those that appear together on Twitter. The word ‘coronavirus’ is at the center of the network graph, related to all the other terms. Then I clustered all the words based on edge betweenness.

Pic 5. Clustering based on edge betweenness

Betweenness measures how often a node (or, for edge betweenness, an edge) lies on the geodesic (shortest) paths between other nodes. The three clusters are mainly about the ongoing epidemic, mask importance, and mask usage.
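To make edge betweenness concrete, here is a small pure-Python sketch. One common way to cluster on edge betweenness is the Girvan-Newman procedure: repeatedly remove the edge with the highest betweenness until the graph splits. Whether the clustering above used exactly this procedure is my assumption, and the six-term graph and function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search: count shortest paths and record predecessors.
        dist, order = {source: 0}, []
        sigma = {v: 0.0 for v in adj}
        sigma[source] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing path credit across edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    # Every undirected path was counted once from each endpoint.
    return {edge: total / 2.0 for edge, total in score.items()}

def components(adj):
    """Connected components, found with a depth-first search."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in group:
                group.add(v)
                stack.extend(adj[v] - group)
        seen |= group
        groups.append(group)
    return groups

def split_once(adj):
    """Remove highest-betweenness edges until the graph gains one component."""
    adj = {v: set(linked) for v, linked in adj.items()}  # work on a copy
    target = len(components(adj)) + 1
    while len(components(adj)) < target:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical term graph: two triangles joined by one bridge (virus-wear).
graph = {
    "outbreak": {"epidemic", "virus"}, "epidemic": {"outbreak", "virus"},
    "virus": {"outbreak", "epidemic", "wear"},
    "wear": {"virus", "wash", "use"}, "wash": {"wear", "use"}, "use": {"wear", "wash"},
}
print(sorted(sorted(group) for group in split_once(graph)))
# prints [['epidemic', 'outbreak', 'virus'], ['use', 'wash', 'wear']]
```

The bridge carries every shortest path between the two triangles, so it has the highest betweenness and is removed first, leaving two clusters.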

After seeing the relationship between terms, I moved on to the network impact on tweets.

Pic 6. Vertices with tweets

The above plot shows the distribution of tweets. We can see that many tweets have no connections (the isolated points in the sparse area). As the tweets with high engagement are of more interest, I removed the less connected ones and got the more detailed network graph below.

Pic 7. A detailed network of tweets
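A tweet-level network like the ones above can be sketched in the same way as the term network. The sketch below rests on an assumption of mine, that two tweets are linked when they share at least one term, and uses a hypothetical degree threshold to drop weakly connected tweets. The sample tweets and function names are also hypothetical.

```python
from itertools import combinations

def tweet_graph(docs):
    """Link two tweets (identified by index) when they share at least one term."""
    adj = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if set(docs[i]) & set(docs[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj

def drop_weakly_connected(adj, min_degree):
    """Keep tweets whose degree in the full graph is at least min_degree."""
    keep = {i for i, linked in adj.items() if len(linked) >= min_degree}
    return {i: adj[i] & keep for i in keep}

# Hypothetical cleaned tweets; the index plays the role of the tweet ID.
docs = [["coronavirus", "need"], ["coronavirus", "travel"],
        ["need", "supply"], ["skincare"]]
adj = tweet_graph(docs)
core = drop_weakly_connected(adj, min_degree=1)
print(sorted(core))  # prints [0, 1, 2]
```

Here the isolated tweet (index 3) disappears, and the remaining vertices are labeled with their position in the data, much like the tweet IDs in the picture above.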

The numbers above represent the IDs of tweets in the raw data. Tweets in the two dense areas are the most frequently liked, reposted, and commented on. I then randomly picked some of the tweets from the circled areas to see what people are saying about masks on Twitter.

Pic 8. Selected tweets with high engagement on Twitter

As these tweets triggered the most discussion online, we now know people’s primary concerns:

  1. They are not sure who needs to wear a mask to minimize infection risk.
  2. They are hesitant about their original travel plans.
  3. They pay attention to the policy changes in Covid-19 prevention.
  4. They remind people who show symptoms to take action.
  5. They note that panic buying of masks is still going on.

Deployment

The critical question is “so what?” Although text mining is a relatively new area of computer science, it has gradually been applied to fields like risk management and customer care service. The Covid-19 outbreak has just started in the U.S.; at this point, there are two central applications for the text mining technique.

  1. Retailers and suppliers can track people’s changing attitudes over time and adjust inventory and production plans accordingly.
  2. Authorities can learn about people’s concerns and uncertainties, give clear directions, and enact new policies beneficial to people.

Finally, there are two points to improve in my study. Firstly, one major problem of the ‘syuzhet’ package is that it does not properly handle negation, which may have some impact on the sentiment analysis. Secondly, a larger number of extracted tweets would better capture the whole picture online.
