
Exploring Reddit WallStreetBets Posts Data

EDA and Sentiment Analysis of Reddit Data

Photo by energepic.com on Pexels

Reddit WallStreetBets Posts is a data set available on Kaggle that contains posts from WallStreetBets, a subreddit dedicated to discussing stock and option trading. WallStreetBets is most notable for its role in the GameStop short squeeze, which resulted in $70 billion in losses on short positions in US firms. In this post we will explore the Reddit WallStreetBets Posts data in Python. The data was scraped using the Python Reddit API Wrapper (PRAW) in compliance with Reddit’s rules around API usage. The data can be found here.
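
For context, scraping posts with PRAW typically looks something like the sketch below. The credentials are placeholders and the field names are chosen to mirror the columns we use later; this is an illustrative sketch, not the exact script used to build this data set.

import praw
import pandas as pd

# Placeholder credentials -- register an app at reddit.com/prefs/apps to obtain real ones
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='wsb-eda-example',
)

# Pull a batch of recent posts from r/wallstreetbets
records = []
for submission in reddit.subreddit('wallstreetbets').new(limit=100):
    records.append({
        'title': submission.title,
        'score': submission.score,
        'id': submission.id,
        'url': submission.url,
        'comms_num': submission.num_comments,
        'created': submission.created_utc,
        'body': submission.selftext,
    })

posts = pd.DataFrame(records)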

Let’s get started!

First, let’s read the data into a Pandas data frame:

import pandas as pd
df = pd.read_csv('reddit_wsb.csv')

Next, we will print the columns available in this data:

print(list(df.columns))

Now, let’s relax the display limits on the number of rows and columns:

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Next, let’s print the first five rows of data using the ‘head()’ method:

print(df.head())

We can label our data based on whether or not the title contains the GameStop ticker (GME):

import numpy as np 
df['GME_title'] = np.where(df['title'].str.contains('GME'), 1, 0)
print(df[['title','GME_title']].head())

We can also create a similar column based on the post body. Some posts have no body text, so we pass na=False to treat missing values as non-matches:

df['GME_body'] = np.where(df['body'].str.contains('GME', na=False), 1, 0)
print(df[['title','GME_body']].head())

Moving forward, let’s continue working with the post titles. Let’s define a function that takes a data frame and a column name as input and prints a dictionary of the column’s values and how frequently they appear. Let’s see the distribution in ‘GME_title’:

from collections import Counter

def return_counter(data_frame, column_name):
    # Print each unique value in the column and its frequency
    print(dict(Counter(data_frame[column_name].values)))

return_counter(df, 'GME_title')

Let’s define a column called ‘ticker’ that has a value of ‘GME’ where ‘GME_title’ is 1 and ‘Other’ where ‘GME_title’ is 0:

df['ticker'] = np.where(df['GME_title'] ==1, 'GME', 'Other')

We can also filter our data to only include posts that have GME in the text title:

df_GME = df[df['GME_title']==1]
print(df_GME.head())

Next, we will use boxplots to visualize the distribution in numeric values based on the minimum, maximum, median, first quartile, and third quartile. If you are unfamiliar with them, take a look at the article Understanding Boxplots.
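
As a quick sanity check, these same statistics can be read off directly with pandas before plotting (a minimal example using the ‘score’ column):

# count, mean, std, min, 25% (first quartile), 50% (median), 75% (third quartile), max
print(df['score'].describe())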

Building on those summary statistics, let’s define a function that takes a data frame, a categorical column, and a numerical column and displays boxplots of the numerical values for each category:

import seaborn as sns
from collections import Counter

def get_boxplot_of_categories(data_frame, categorical_column, numerical_column):
    # Collect the categories present in the categorical column
    keys = list(Counter(data_frame[categorical_column].values).keys())
    print(keys)
    df_new = data_frame[data_frame[categorical_column].isin(keys)]
    sns.set()
    sns.boxplot(x=df_new[categorical_column], y=df_new[numerical_column])

Let’s generate boxplots for ‘score’ in the ‘ticker’ categories:

get_boxplot_of_categories(df, 'ticker', 'score')

Finally, let’s define a function that takes a data frame and a numerical column as input and displays a histogram:

def get_histogram(data_frame, numerical_column):
    # Plot a histogram of the numerical column with 100 bins
    data_frame[numerical_column].hist(bins=100)

Let’s call the function with the data frame and generate a histogram from ‘score’:

get_histogram(df, 'score')

We can also get sentiment scores for each of the posts. To do so, we need a Python package called textblob. The documentation for textblob can be found here. To install textblob, open a command line and type:

pip install textblob

Next, import TextBlob:

from textblob import TextBlob

We will use the polarity score as our measure for positive or negative sentiment. The polarity score is a float with values from -1 to +1.
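
To get a feel for the polarity score, here is a quick illustration on two made-up titles (the example strings are ours, not from the data set):

from textblob import TextBlob

# One positive-leaning and one negative-leaning example title (illustrative only)
for text in ["Best trade of my life, GME to the moon", "Terrible day, lost everything on puts"]:
    print(text, TextBlob(text).sentiment.polarity)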

Let’s define a function that will generate a column of sentiments from text titles:

def get_sentiment():
    df['sentiment'] = df['title'].apply(lambda title: TextBlob(title).sentiment.polarity)
    df_pos = df[df['sentiment'] > 0.0]
    df_neg = df[df['sentiment'] < 0.0]
    print("Number of Positive Posts", len(df_pos))
    print("Number of Negative Posts", len(df_neg))

If we call this function, we get:

get_sentiment()

We can apply this to the GME-filtered data frame. Let’s modify our function so that it takes a data frame as input. Let’s also return the new data frame:

def get_sentiment(df):
    df['sentiment'] = df['title'].apply(lambda title: TextBlob(title).sentiment.polarity)
    df_pos = df[df['sentiment'] > 0.0]
    df_neg = df[df['sentiment'] < 0.0]
    print("Number of Positive Posts", len(df_pos))
    print("Number of Negative Posts", len(df_neg))
    return df 
df = get_sentiment(df_GME)
print(df.head())

Finally, let’s modify our function so that we can visualize the sentiment scores with Seaborn and Matplotlib:

import matplotlib.pyplot as plt
import seaborn as sns
def get_sentiment(df):
    df['sentiment'] = df['title'].apply(lambda title: TextBlob(title).sentiment.polarity)
    df_pos = df[df['sentiment'] > 0.0]
    df_neg = df[df['sentiment'] < 0.0]
    print("Number of Positive Posts", len(df_pos))
    print("Number of Negative Posts", len(df_neg))

    sns.set()
    labels = ['Positive', 'Negative']
    heights = [len(df_pos), len(df_neg)]
    plt.bar(labels, heights, color = 'navy')
    plt.title('GME Posts Sentiment')
    return df
df = get_sentiment(df_GME)

I will stop here, but please feel free to play around with the data and code yourself.

CONCLUSIONS

To recap, we went over several methods for analyzing the Reddit WallStreetBets Posts data set. This included defining functions for counting categorical values, functions for visualizing data with boxplots and histograms, and generating sentiment scores from the post titles. Understanding distributions in categorical values, like the ticker labels we generated, can give insight into how balanced the data set is in terms of categories/labels, which is very useful for developing machine learning models used for classification. Further, boxplots and histograms reveal the spread in values for numerical columns and provide insight into the presence of outliers. Finally, sentiment scores are useful for understanding whether positive or negative sentiment is expressed in text, which can be useful for tasks such as predicting the direction of stock price changes. I hope you found this post interesting. The code from this post is available on GitHub. Thank you for reading!

