Trending Topics on Reddit During the 2020 US Elections

Finding real events in a Reddit dataset using topic modeling and time-series analysis

In this post, we share our work (code and results) on topic modeling and on detecting real events in textual data.

Downloading the Reddit Dataset

First, we download Reddit data using the Pushshift API. The following code downloads the data while staying under the rate limit, given a start date, an end date, a subreddit name, and the type of text (submission or comment). We downloaded the Conspiracy, Conservatives, and Democrats subreddits for the period August 1, 2020 to February 1, 2021.
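The original snippet is not reproduced here, so below is a minimal sketch of the idea, assuming the classic Pushshift REST endpoints at api.pushshift.io (which have changed and been intermittently available over the years); the pause between requests is what keeps us under the rate limit:

```python
import time

import pandas as pd
import requests

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/{}/"

def download_subreddit(subreddit, start_epoch, end_epoch,
                       kind="submission", batch_size=100, pause=1.0):
    """Page through Pushshift results between two epoch timestamps."""
    records, after = [], start_epoch
    while True:
        params = {"subreddit": subreddit, "after": after,
                  "before": end_epoch, "size": batch_size, "sort": "asc"}
        resp = requests.get(PUSHSHIFT_URL.format(kind), params=params)
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:
            break
        records.extend(batch)
        after = batch[-1]["created_utc"]  # resume after the last item seen
        time.sleep(pause)                 # stay under the rate limit
    return pd.DataFrame(records)

# e.g. Conservatives submissions, Aug 1 2020 .. Feb 1 2021 (UTC epoch seconds)
# posts_df = download_subreddit("Conservatives", 1596240000, 1612137600)
```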

Data Preprocessing

After loading the posts and comments into dataframes, we clean and preprocess the data. For each submission we concatenated the "title" and "selftext" fields to get a single text field for the post; for comments we used the "body" field. We dropped "[deleted]" and "[removed]" texts, and removed posts and comments without a datetime.

Then, to preprocess the text, we removed stop words, numbers, and punctuation. We split each text into sentences and each sentence into tokens. To cope with the data volume and to work only with informative texts, we kept only texts longer than 5 tokens. Finally, we prepared a list of all the sentences as input for the models below.
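A minimal sketch of this step, assuming NLTK's tokenizers and English stop-word list (the exact cleaning rules in the original code may differ):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Split a post/comment into sentences of cleaned tokens."""
    sentences = []
    for sent in sent_tokenize(text.lower()):
        sent = re.sub(r"[^a-z\s]", " ", sent)   # drop numbers and punctuation
        tokens = [t for t in word_tokenize(sent) if t not in STOP_WORDS]
        if len(tokens) > 5:                      # keep only richer texts
            sentences.append(tokens)
    return sentences
```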

Topic Modeling

In order to find trending events over time, we first aim to find topics. To get the relevant topics for each subreddit we performed the following three steps.

  1. Textual Embedding
  2. Clustering
  3. Topic Indicative Words

Textual Embedding

We first have to transform each text into a vector using NLP embedding techniques. We trained two models: a weighted word2vec and the Universal Sentence Encoder (USE).

  1. Weighted word2vec — word2vec is a widely used word-level embedding, trained with a single hidden-layer neural network that takes the context of each word as input and tries to predict the word corresponding to that context. We used this method with a window size of 5 and an embedding size of 128. Having an embedding for each word in our vocabulary, we want an embedding for the entire post/comment text. To get it, we took the mean of the word vectors weighted by their tf-idf scores (a code sketch for this weighting appears right after this list).

  2. Universal Sentence Encoder (USE) — the main drawbacks of word2vec are that it ignores word order and that the word embeddings are not contextual to the specific sentence in which they appear. To address these problems, we used the Universal Sentence Encoder. The main idea is an encoder that summarizes any given sentence into a 512-dimensional sentence embedding. USE uses this embedding with a transformer architecture to solve multiple tasks, and the sentence embedding is updated based on the mistakes it makes on them. Since the same embedding has to work on multiple generic tasks, it captures only the most informative features and discards noise. The intuition is that this yields a generic embedding that transfers to a wide variety of NLP tasks such as relatedness, clustering, paraphrase detection, and text classification. To get the USE embedding for each sentence, we load the pretrained model from TensorFlow Hub and apply it in batches (see the sketch after this list).
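A sketch of the tf-idf-weighted word2vec embedding, assuming gensim 4.x and scikit-learn (parameter names such as min_count are assumptions, not the original settings):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# `sentences` is the list of token lists produced by the preprocessing step
w2v = Word2Vec(sentences, vector_size=128, window=5, min_count=2)

# global idf weights; because we average over token *occurrences*,
# term frequency is captured by repetition, giving a tf-idf-like weighting
tfidf = TfidfVectorizer()
tfidf.fit(" ".join(tokens) for tokens in sentences)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def text_embedding(tokens):
    """tf-idf-weighted mean of the word2vec vectors of one text."""
    pairs = [(w2v.wv[t], idf.get(t, 1.0)) for t in tokens if t in w2v.wv]
    if not pairs:
        return np.zeros(w2v.vector_size)
    vecs, weights = zip(*pairs)
    return np.average(vecs, axis=0, weights=weights)
```

And a minimal sketch of applying USE in batches, using the pretrained model from TensorFlow Hub:

```python
import numpy as np
import tensorflow_hub as hub

# pretrained USE model from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def use_embeddings(texts, batch_size=256):
    """Apply USE in batches; returns an (n_texts, 512) array."""
    chunks = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    return np.vstack([embed(chunk).numpy() for chunk in chunks])
```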

Clustering

With a vector representation for each text in hand, we used the K-means algorithm to cluster them. Each submission and comment is then assigned to a cluster (or to its top 5 closest clusters, reflecting that one text may touch on several topics). We used the t-SNE method to visualize the text embedding vectors, with color indicating the cluster each text was assigned to.
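A minimal sketch of this step with scikit-learn; the cluster count of 50 matches the histogram below, but the other parameters are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

N_CLUSTERS = 50  # matches the fifty clusters in the histogram below

# `embeddings` is the (n_texts, dim) matrix from either embedding model
kmeans = KMeans(n_clusters=N_CLUSTERS, random_state=42).fit(embeddings)
labels = kmeans.labels_

# top-5 closest clusters per text, since one text may touch several topics
top5 = kmeans.transform(embeddings).argsort(axis=1)[:, :5]

# 2-D t-SNE projection of the embeddings, colored by assigned cluster
coords = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap="tab20")
plt.show()
```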

A histogram of the number of texts assigned to each of the fifty clusters for the Conservatives subreddit, using the USE embedding.

Topic Indicative Words

We would like to assign to each cluster a set of relevant words describing it. To do this we tried two methods:

  1. TF-IDF: use the words with the highest tf-idf scores in each cluster's text to represent the topic.
  2. Closeness to the centroid: K-means gives us each cluster's centroid vector. We calculated the distance between this centroid and each word in the vocabulary; the closest words represent the topic.

The following code implements the second method, which performed better than tf-idf. For word2vec, we used the word2vec model to get the words most similar to the centroid vector. For USE, we computed an embedding for each word in the vocabulary by applying the USE model to it, then calculated the cosine distance between the centroid embedding and each vocabulary word's embedding to get the closest words.
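A hedged sketch of both variants, reusing the w2v model, kmeans object, and use_embeddings helper defined above (each variant must use centroids from the K-means run on its own embedding space):

```python
from sklearn.metrics.pairwise import cosine_similarity

# word2vec variant: gensim can rank words by similarity to a raw vector;
# pass a centroid from the K-means run on the word2vec text embeddings
def topic_words_w2v(centroid, topn=10):
    return [w for w, _ in w2v.wv.similar_by_vector(centroid, topn=topn)]

# USE variant: embed every vocabulary word once with the batched helper,
# then rank words by cosine similarity to a USE-space centroid
vocab = sorted({t for tokens in sentences for t in tokens})
vocab_emb = use_embeddings(vocab)

def topic_words_use(centroid, topn=10):
    sims = cosine_similarity(centroid.reshape(1, -1), vocab_emb)[0]
    return [vocab[i] for i in sims.argsort()[::-1][:topn]]

# e.g. indicative words for every USE cluster
# topics = [topic_words_use(c) for c in kmeans.cluster_centers_]
```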

Examples of topic words produced by this method:

Detecting Real Events

We now want to look at how each topic's occurrence is distributed over time. The premise is that, given social media's nature, users tend to increase discussion around specific topics when a related event occurs. We therefore expect a highly mentioned topic to correlate with a specific real-world event. We would like to differentiate between real-world events, which stimulate discussion on a specific underlying topic, and ongoing day-to-day conversation. To do so, we plotted the number of texts (submissions and comments) per topic (cluster) over time and looked for peaks that may indicate an event.
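A minimal sketch of the time-series plot, assuming a dataframe df with one row per text and hypothetical created_utc and cluster columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# one row per text, with its creation time and assigned cluster
df["date"] = pd.to_datetime(df["created_utc"], unit="s").dt.date

def plot_topic_over_time(cluster_id):
    daily = df[df["cluster"] == cluster_id].groupby("date").size()
    daily.plot(figsize=(12, 4), title=f"Cluster {cluster_id}: texts per day")
    plt.ylabel("number of texts")
    plt.show()
```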

An example of the time series for the topic cluster [‘rioters’, ‘riot’, ‘protestors’, ‘protest’, ‘protesters’, ‘riots’, ‘protests’, ‘demonstrators’, ‘portland’, ‘rioting’]:

To dig into each peak, we print the texts associated with the relevant topic on the specific date:
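For example, with a hypothetical texts_on_date helper over the same dataframe (the "text" column and riot_cluster_id are assumptions):

```python
# hypothetical helper: texts assigned to a cluster on a given day
def texts_on_date(cluster_id, date_str, n=10):
    day = pd.to_datetime(date_str).date()
    mask = (df["cluster"] == cluster_id) & (df["date"] == day)
    return df.loc[mask, "text"].head(n).tolist()

# e.g. the Capitol-breach peak discussed below
# for text in texts_on_date(riot_cluster_id, "2021-01-07"):
#     print(text)
```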

For example, in the above topic's time series we can see one major peak on January 7, 2021 and one minor peak on August 27, 2020.

The submissions and comments related to that topic on January 7, 2021 surface a real event: rioters breached US Capitol security on Wednesday, January 6, 2021.

For the minor peak, the submissions and comments related to that topic on August 27, 2020 surface another real event: the Minneapolis false-rumors riot of August 26–27, 2020. The riot occurred in downtown Minneapolis in reaction to false rumors about the suicide of Eddie Sole Jr., a 38-year-old African American man whom demonstrators believed had been shot by police officers.

Conclusions

We presented a full pipeline for extracting real events from a Reddit dataset using topic modeling and time-series analysis. The method shows a strong correspondence between spikes in a topic's mentions and specific real-world events. The pipeline, code, and examples above make it possible to find topics and explore them, detect events, and understand them in depth. You can apply this pipeline and its visualizations to other datasets as well.

Thanks for reading!!!

