A lot happened this year, and if you watch the movie "Death to 2020" on Netflix, you will have an idea of the timeline of events. For this project, I thought it would be interesting to gain insights into what Twitter users had to say about 2020. Twitter receives more than 500 million tweets a day, so all I had to do was find a way to retrieve these tweets and analyze them. However, before embarking on this project, I had no idea what strategy to use, so I found myself browsing through dozens of articles on various concepts in Natural Language Processing (NLP), and while reading them, I became more interested in the topic. The tweets used for this project were created between 12 and 25 December 2020. If you tweeted about 2020 during this period, there is a good chance your tweet is part of this analysis. By the end of this article, you will learn:
- The most common words used by Twitter users to describe the year 2020
- The times of day when Twitter users are most active (by country and continent), via an interactive Tableau dashboard
- The proportion of positive, negative, and neutral tweets
- The country with the most tweets
- The most "retweeted" and "liked" tweet within the period
- How long this project took and how you can implement a similar project yourself
Let’s dive into my analysis; I can’t wait to show you what I discovered.
Project Strategy
This project had several facets, outlined in the flow diagram below. I will explain the basics here; further discussion of some concepts will follow in subsequent Medium posts.

The Python libraries used include Pandas (for data cleaning/manipulation), Tweepy (for tweet mining), NLTK (Natural Language Toolkit, for text analysis), TextBlob (for sentiment analysis), Matplotlib & WordCloud (for Word Cloud visualization), Emot (for emoji identification), Plotly (for data visualization), and other built-in libraries, as shown in my Jupyter Notebook.
Tweets Mining
This was probably the most arduous part of the project because, unlike my previous projects, where I had existing datasets, I had to build this one from scratch. To do that, I used the Tweepy library for Python to scrape tweets.
Thanks to this Towards Data Science article by Tara Boyle, I found my way around the Twitter API (Application Programming Interface). However, some things had changed because Twitter now has stricter limits on mining tweets through its API. One of these limits is that you can only retrieve a maximum of 2,500 tweets every 15 minutes. Since I wanted to work with a large dataset, I set the "wait_on_rate_limit" parameter in Tweepy, which makes the code pause automatically whenever the rate limit is reached. Also, tweets can only be mined as far back as 10 days. After three consecutive days of running the program I wrote, I had scraped 50,780 unique tweets. The highlights of this step are stated below, with a short code sketch after the list. You can see detailed explanations in my Jupyter Notebook.
Highlights of Tweet Mining Task
- Search Query: I passed four different phrases – ["2020 has been", "2020 was a", "this year has been", "this year was a"] – to the API so that it returns tweets containing them. Twitter requires a specific syntax to recognize that you want an "exact phrase" match. Also, I only mined tweets written in English for this analysis, so the views of Twitter users in non-English-speaking countries may be underrepresented.
- Information Returned: I specified that the API returns the following data for each tweet – Tweet ID (primary key), Tweet, Time Created, Location, Number of Retweets and Likes. I did not retrieve Twitter usernames for ethical reasons.
- Successive Mining Attempts: On the second and third day of my code execution, I had to specify the "since_id" parameter so that the Twitter API does not return tweets already in the dataset from the previous day(s).
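For illustration, here is a minimal sketch of what the mining step could look like with Tweepy. This is not my exact notebook code: the credentials are placeholders, the query and item count are examples, and `api.search_tweets` is the Tweepy v4 name (older versions use `api.search`):

```python
import tweepy
import pandas as pd

# Placeholder credentials -- replace with your own Twitter developer keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

# wait_on_rate_limit pauses the program whenever Twitter's rate limit is hit.
api = tweepy.API(auth, wait_on_rate_limit=True)

rows = []
# Quoting the phrase asks Twitter for an exact-phrase match.
for tweet in tweepy.Cursor(
    api.search_tweets,             # api.search in older Tweepy versions
    q='"2020 has been" -filter:retweets',
    lang="en",                     # English tweets only
    since_id=None,                 # on later runs, set to the max Tweet ID already collected
    tweet_mode="extended",
).items(1000):
    rows.append({
        "Tweet ID": tweet.id,
        "Tweet": tweet.full_text,
        "Time Created": tweet.created_at,
        "Location": tweet.user.location,
        "Retweets": tweet.retweet_count,
        "Likes": tweet.favorite_count,
    })

df = pd.DataFrame(rows)
```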

Data Cleaning
Cleaning your data is vital because it helps prevent errors and duplication in your analysis. In this step, I looked for duplicate tweets using the primary key (Tweet ID), checked for empty rows, and replaced "NaN" or Null values in the "Location" column with the string "No Location" (I will explain why in the Location Geocoding section).
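In Pandas, this step boils down to a few lines. A simplified sketch, with column names matching the fields described above:

```python
# Tweet ID is the primary key, so duplicates are dropped on it.
df = df.drop_duplicates(subset="Tweet ID")
df = df.dropna(subset=["Tweet"])                        # drop rows with no tweet text
df["Location"] = df["Location"].fillna("No Location")   # placeholder for missing locations
# Twitter sometimes stores an empty string rather than NaN for location.
df["Location"] = df["Location"].replace("", "No Location")
```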
Tweets Processing
To achieve the ultimate goal, i.e. Sentiment Analysis, there was a need to clean up the individual tweets. To facilitate this task, I created a function called "preProcessTweets" in my Python program, which I later applied to the "Tweets" column to produce the desired results. This function removes punctuation, links, emojis, and stop words from the tweets in a single run. Additionally, I used a concept known as "Tokenization" in NLP: a method of splitting text into smaller units called "tokens", which makes it easier to filter out unnecessary elements. Another noteworthy technique is "Lemmatization", the process of reducing words to their "base" form (for example, "years" becomes "year"). A rough code sketch of this step follows, and a simple illustration of lemmatization is shown below.
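Here is a minimal sketch of what such a preprocessing function can look like with NLTK. It is simplified from my actual notebook code; for instance, it strips emojis with a crude non-ASCII filter rather than the Emot library:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads for the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def pre_process_tweet(tweet: str) -> str:
    """Remove links, punctuation, emojis and stop words, then lemmatize."""
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)          # strip links
    tweet = tweet.encode("ascii", "ignore").decode()             # crude emoji/non-ASCII removal
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = word_tokenize(tweet.lower())                        # tokenization
    tokens = [t for t in tokens if t not in stop_words]          # remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]           # lemmatization: "years" -> "year"
    return " ".join(tokens)

df["Processed Tweet"] = df["Tweet"].apply(pre_process_tweet)
```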

Data Exploration
In this section, I will show you the most common words used by Twitter users to describe 2020. I created a "getAdjectives" function to extract only the adjectives from each tweet into a new column, because adjectives are descriptive words. This was made possible by the POS-tag (Parts of Speech tagging) module of the NLTK library. Using the WordCloud library, you can generate a Word Cloud based on the frequency of words and superimpose these words on any image (in this case, the Twitter logo). I then used the Pyplot module in the Matplotlib library to display the image. The Word Cloud shows the words with a higher frequency in larger text, while the less common words appear in smaller text.

You can view my Jupyter Notebook for the code behind the Word Cloud above. As you can see, "good", "hard", "bad", "great", "tough", "last", and "difficult" were some of the most common words used. The frequency of the top ten words is displayed in the plot below, after a rough sketch of this step.
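Here is a simplified sketch of the adjective extraction, Word Cloud, and word-frequency steps. The "twitter_logo.png" mask path and column names are placeholders; the exact code is in my Jupyter Notebook:

```python
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
from PIL import Image
from wordcloud import WordCloud

# Needed for nltk.pos_tag.
nltk.download("averaged_perceptron_tagger")

def get_adjectives(text: str) -> str:
    """Keep only tokens that POS-tagging marks as adjectives (JJ, JJR, JJS)."""
    tagged = nltk.pos_tag(text.split())
    return " ".join(word for word, tag in tagged if tag in ("JJ", "JJR", "JJS"))

df["Adjectives"] = df["Processed Tweet"].apply(get_adjectives)

# Superimpose the word cloud on a logo image ("twitter_logo.png" is a placeholder path).
mask = np.array(Image.open("twitter_logo.png"))
all_adjectives = " ".join(df["Adjectives"])
cloud = WordCloud(mask=mask, background_color="white").generate(all_adjectives)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Frequency of the top ten adjectives, for the bar plot.
print(Counter(all_adjectives.split()).most_common(10))
```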

Location Geocoding
For my final dashboard, I wanted to add a map showing the number of tweets per country. To do this, Tableau needs basic geographic information, such as the country’s name. I had used the Geopy library for a previous project, but this time I could not use it due to server limit errors. After further research, I ended up using the HERE Developer API to return the longitude, latitude, and country name for each tweet location. One key thing to note: if you send a request with a "NaN" or Null location to the API, it will still return an actual location, which would silently corrupt the results. That is why I replaced the "NaN" values with "No Location" in the Data Cleaning step. It is very important to review the data frame after each code run to ensure you are getting the expected results; that is how I caught this discrepancy.
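A simplified sketch of the geocoding step is shown below. It assumes the endpoint and response format of the HERE Geocoding & Search API v7 (check the current HERE documentation), and the API key is a placeholder:

```python
import pandas as pd
import requests

GEOCODE_URL = "https://geocode.search.hereapi.com/v1/geocode"
API_KEY = "YOUR_HERE_API_KEY"  # placeholder

def geocode(location: str):
    """Return (latitude, longitude, country) for a free-text location string."""
    if location == "No Location":         # skip the placeholder from the cleaning step
        return None, None, "Unknown location"
    resp = requests.get(GEOCODE_URL, params={"q": location, "apiKey": API_KEY})
    items = resp.json().get("items", [])
    if not items:                         # the API found no match
        return None, None, "Unknown location"
    pos = items[0]["position"]
    country = items[0]["address"].get("countryName", "Unknown location")
    return pos["lat"], pos["lng"], country

df[["Latitude", "Longitude", "Country"]] = df["Location"].apply(
    lambda loc: pd.Series(geocode(loc))
)
```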
I will write about using the HERE Developer API in the future because of the challenges I encountered while attempting geocoding with other APIs.
Sentiment Analysis
Now, to the core of this project – Sentiment Analysis. From the words we use in our statements, one can tell whether they are Positive, Negative, or Neutral. But what if we could train a computer or model to do this automatically?

The above illustration is Sentiment Analysis in a nutshell. Thanks to the sentiment analysis algorithms contained in libraries such as TextBlob and VADER, we can analyze text and return a sentiment score without training a model ourselves; these algorithms are largely lexicon- and rule-based, so no labeled data is required. You can also train a Machine Learning model to predict these sentiments, but you would need a dataset of tweets labeled with accurate sentiments to do so.
I must point out that these algorithms have their margins of error, because the text they were built on differs in context from these tweets. For this analysis, I went with TextBlob, which analyzes sentences by giving each one a Subjectivity and a Polarity score.
Based on the Polarity scores, one can define each tweet’s sentiment category: a Polarity score of less than 0 is Negative, exactly 0 is Neutral, and greater than 0 is Positive. I used the Pandas "apply" method on the "Polarity" column in my data frame to return the respective sentiment category, as sketched below. The distribution of the sentiment categories follows, and you can also view the sentiment category distribution by country and continent in the Tableau dashboard HERE.
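Here is a minimal sketch of the scoring and categorization step with TextBlob (simplified from my notebook; column names are illustrative):

```python
from textblob import TextBlob

# TextBlob returns polarity in [-1, 1] and subjectivity in [0, 1].
df["Polarity"] = df["Processed Tweet"].apply(lambda t: TextBlob(t).sentiment.polarity)
df["Subjectivity"] = df["Processed Tweet"].apply(lambda t: TextBlob(t).sentiment.subjectivity)

def to_category(polarity: float) -> str:
    """Map a polarity score to a sentiment category."""
    if polarity < 0:
        return "Negative"
    if polarity == 0:
        return "Neutral"
    return "Positive"

df["Sentiment"] = df["Polarity"].apply(to_category)
print(df["Sentiment"].value_counts(normalize=True))  # proportion of each category
```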

Tableau Dashboard
I was so excited to build this dashboard because I had never used Tableau before starting this project. To develop the final dashboard that you see in the animation below, I exported the results from my Jupyter Notebook (where I ran my Python program) to Tableau Desktop. The Tableau dashboard has six unique elements; explore it by clicking this LINK. The dashboard can be viewed on any device, but for a fuller view, use a computer or tablet.
Fun fact: I learned how to use Tableau in two hours on Dec 21, 2020, thanks to this Tableau Community Tutorial. This was after I discovered that I could not publicly share my dashboard created using Power BI. Also, the entire project took me two weeks, since I had to combine it with my full-time job.
Project Insights
From the dashboard, you can visualize different outcomes and get more insights. Please note that these insights are not representative of the entire Twitter community because I only mined a subset of tweets for that period.
- Tweet Sentiments: I was a bit surprised by the proportion of the sentiment categories. For most people, the end of the year is a time to show gratitude and hope for a better year ahead, hence the higher proportion of Positive tweets (50%). Conversely, the year was filled with many unpleasant events, so 31% of the tweets were Negative.
- Countries with most tweets: About 40% of the total tweets were from the United States, England and Canada. Since many Twitter users do not have their exact location on their profiles, their tweet locations were classified as "Unknown location".
- Time of day with the most tweets: It was interesting to see that most tweets were created at 5 PM (GMT). Thinking about it, this is around lunchtime on the east coast of the United States and Canada, while in countries like Nigeria and England, it is when most individuals finish the workday, so people have ample time to tweet. So, if you are looking for more engagement on Twitter, you should tweet within this period.
- Time of day with the least tweets: 9 AM (GMT) was the time of day with the fewest tweets. The likely reason is that in countries like Nigeria and England, most people are starting their workday at this hour, while it is still bedtime in other countries such as the United States and Canada.
- Most "Retweeted" and "Liked" Tweet: For the period of 12 to 25 December 2020, the most "retweeted" tweet was about a Korean boy band, "BTS" and their songs **** with _10,873 retweets. The most "liked" tweet was from a user who tweeted about how "2020 was a good year for his dog who did not have to be alone for a second". The tweet had 42,295 like_s.
Closing Remarks
Thanks for reading my article. I hope you enjoyed reading it as much as I enjoyed writing it. I am grateful to all my family and friends who took the time to critique and proofread this work. Do not hesitate to leave a comment with the insights you got from the dashboard or project. All the references used are hyperlinked within the article. For the complete Python code in my Jupyter Notebook, the GitHub repo with the dataset, the Tableau dashboard, and my other social media pages, please use the links below: