As was predictable, for the past two weeks it has been all about the Olympics.
A lot of great stories were written, a lot of stars were born, and a lot of incredible events happened. Moreover, in the era of social media, "if you didn’t post it, it never happened". For this reason, a lot of people tweeted about the Olympics, sharing their feelings and thoughts.
Our goal is to model the topics of these tweets: given a tweet as input, we want to automatically identify its topic.
Here are some examples with general sentences:
INPUT SENTENCE:
I'm so angry, so I think I'm going to buy a cheeseburger
TOPIC:
cheeseburger
INPUT SENTENCE:
This fast food makes the best cheeseburger of the entire city
TOPIC:
cheeseburger
INPUT SENTENCE:
I really like Coke, so I think I'm going to buy a couple of bottles.
TOPIC:
Coke
Of course, if you look at the second sentence, another topic could be "fast food" or "city". In fact, we will more likely end up with a list of topics and keep only the most influential ones.
Let’s get started!
0. The Libraries
As the title says, we will use Python. In particular, we will use these libraries:
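A plausible set of imports for the pipeline described below (the choice of sentence-transformers to get BERT vectors is an assumption; any BERT encoder would do):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer  # assumed: one convenient way to get BERT embeddings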
1. The Idea:
The thing that makes an NLP task different from other ones (e.g. tabular data classification) is that you need to convert words/sentences into vectors. So the first thing you want to do is convert your tweets into points in an N-dimensional space.
Usually, N is very large, so you may want to reduce the dimensionality with an appropriate method (e.g. PCA).
Once that is done, you want to use a clustering model to group these tweets in an unsupervised way.
Thus, you will end up with vectors grouped into k classes. The last step is to go back to the original tweets and look at the most frequent words of each class. This gives you the topic of each class and, therefore, the topic of every tweet in the dataset.

Let’s explain it in detail:
2. The Dataset:
I found the dataset here, imported it with pandas, and selected only 5,000 tweets for computational reasons:
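Something along these lines, assuming a CSV with one tweet per row (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("tweets.csv")             # hypothetical file name
tweets = df["text"].head(5000).tolist()    # keep 5,000 tweets for speed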
3. From text to vectors:
A great model for text pre-processing is BERT. A fantastic article showing why and how we should use BERT for this purpose is this one. Here is how to convert your texts into vectors:
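A minimal sketch with the sentence-transformers library (the exact checkpoint is an assumption; this one outputs 768-dimensional vectors, which matches the "700+ dimensions" mentioned below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")           # assumed checkpoint, 768-dim output
embeddings = model.encode(tweets, show_progress_bar=True)  # shape: (5000, 768)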
Let’s use PCA* to plot our vectors in two and three dimensions (see the sketch after this list):
1. Bidimensional PCA:
2. 3D PCA:
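Both plots can be sketched with matplotlib like this (variable names follow the earlier snippets):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 1. Bidimensional PCA
pca_2d = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(pca_2d[:, 0], pca_2d[:, 1], s=5)
plt.title("Tweets on the first two principal components")
plt.show()

# 2. 3D PCA
pca_3d = PCA(n_components=3).fit_transform(embeddings)
ax = plt.figure().add_subplot(projection="3d")
ax.scatter(pca_3d[:, 0], pca_3d[:, 1], pca_3d[:, 2], s=5)
plt.show()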
To understand how many components of the PCA decomposition we actually need, let’s look at the explained variance ratio:
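Roughly like this (capping the decomposition at 10 components just for the plot):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(embeddings)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel("Component index")
plt.ylabel("Cumulative explained variance ratio")
plt.show()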
OK, so 5 components are enough, for sure (remember that, in Python, you start counting from 0).
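We can then build the reduced matrix that feeds the clustering step below (the component count is an assumption read off the plot; with 0-based counting, index 5 means 6 components):

from sklearn.decomposition import PCA

pca_data = PCA(n_components=6).fit_transform(embeddings)  # indices 0-5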
- PCA is a method used to reduce the dimensionality of your data based on the information that each axis contains. We are using it because our vectors have 700+ dimensions, which may hamper the clustering process.
4. Classification and Topic Modelling:
If we want to be precise, we shouldn’t talk about classification, as that is a supervised method. Instead, we are clustering our data in an unsupervised way, as we have no labels at all. The very simple method we will use is scikit-learn's K-Means. It is nothing more complicated than:
from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=20, random_state=0).fit(pca_data)  # 20 clusters, fixed seed for reproducibility
Then, we will use TF-IDF* to understand which words appear the most (excluding the ones that appear too often everywhere, like "this", "the", or "an").
The entire code can be found here:
- TF-IDF is a method that counts, in a smart way, how many times a word appears inside a class of documents. The smart part is that, if the word also appears in all the other classes, it is probably not informative (like "the" or "an") and, for this reason, it is not considered "frequent".
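A sketch of the per-class computation, treating the concatenation of each cluster's tweets as one document (names follow the earlier snippets; the built-in stop-word list handles the "this"/"the"/"an" exclusion):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per cluster: all tweets assigned to it, concatenated
docs = [" ".join(t for t, c in zip(tweets, cluster.labels_) if c == k)
        for k in range(20)]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # shape: (20 clusters, vocabulary size)
scores = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())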
5. Final Results:
Here is what we’ll do to get the topics:
- Order the words in each class by descending values of TF-IDF
- Consider each word only in the class with the highest TF-IDF (the TF-IDF value of a word can change depending on the class we are considering)
- Keep the word with the highest TF-IDF value as the one that represents the entire class.
And here is the code:
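A sketch of those three steps, applied to the scores table built in the previous snippet:

import numpy as np

best_class = scores.values.argmax(axis=0)  # step 2: the class where each word's TF-IDF peaks
best_score = scores.values.max(axis=0)     # its TF-IDF value there

topics = {}
for k in range(scores.shape[0]):
    owned = best_class == k                # words assigned to cluster k
    if owned.any():
        # steps 1 & 3: the owned word with the highest TF-IDF names the cluster
        idx = np.flatnonzero(owned)[best_score[owned].argmax()]
        topics[k] = scores.columns[idx]
print(topics)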
The method is not perfect, of course, but there is something really interesting.
Class 13 is all about "biles" (Simone Biles). Class 5 is about the USA. Almost every class is identified with a specific topic.
Considerations:
When a sport event happens, we feel close together. Entire nations hold their breath for the last second of a tournament, and share the joy of a win, or the sadness of a loss.
If you liked the article and you want to know more about Machine Learning, or you just want to ask me something, you can:
A. Follow me on LinkedIn, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.