This semester, I took a class called ‘Developing Meaningful Indicators’ under Dr Charles. I was inspired when he mentioned that he was trying to do some natural language processing on a huge dataset of Reddit comments. In fact, two of my classmates (Nicole and Min Yi) had also used Microsoft Azure AI in Excel to conduct sentiment analysis on tweets.
So I teamed up with my classmate Ling Hui, who is aiming to make a real change in the world, and we attempted to do this ourselves. We initially hypothesised that people on Reddit would generally leave comments with positive sentiment; after all, Reddit is a community platform where people share their ideas and stories with one another. We also thought that there would be some positive correlation between sentiment score and comment score (higher score = higher sentiment score). However, after a class discussion on fault-finding and critical thinking, we realised that we might be wrong… which made us even more eager to dive in.
The Technical Bits
The first challenge was downloading the dataset. We found a massive dataset on Kaggle of all Reddit comments from May 2015, which was 30GB in size! (Yes, I know it says 20GB on Kaggle, but the proof is below.)

The next step was figuring out how to open the file. Surprisingly, computing student that I am, I had never used SQL before, so downloading the file and opening it up in SQLite was a new experience for me. After much Googling and a short online crash course on SQL, I found that SQL is much easier than most of the other languages I’ve learnt. I was also blown away by the size of the dataset: a whopping 54 MILLION rows.
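(If you want to check the row count yourself, here is a minimal sketch using Python’s built-in sqlite3 module. The file and table names are my assumptions based on the Kaggle download; adjust them to match yours.)

## count the rows in the raw Kaggle dump
import sqlite3

conn = sqlite3.connect("database.sqlite")   ## filename as downloaded from Kaggle (assumption)
(total,) = conn.execute("SELECT COUNT(*) FROM May2015").fetchone()   ## table name is an assumption
print("rows:", total)
conn.close()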

We had our very, very rough plan set out, but for most of the technical details, I had no idea what to do. The idea was to first filter the data, then upload it to Amazon Web Services (AWS). I would then write a Python script that would read the data, send it to Amazon Comprehend for sentiment analysis, and compile the results. We would then download the data again and perform data visualisation in Tableau, Excel, or other visualisation tools. Things were starting to look decently alright…

Wrong. As I started filtering the dataset (dropping columns that I didn’t need), my next frustration appeared. Whenever I tried to create tables or delete columns, it resulted in massive crashes.


Being a generally impatient person, I would then force-close SQLite. However, as I (very) slowly learnt, that can corrupt the SQL file (or maybe not, but I did not have the energy to figure out why). Nevertheless, it left me in confusion and intense frustration.

After several tries, I gave up trying to filter the data and wondered if I could directly upload the file to AWS. However, it seemed that AWS RDS only supported MySQL, PostgreSQL, MariaDB, etc. BUT. NOT. SQLITE. I spent several days Googling for alternatives, but most seemed too complicated.

All hope was not lost. The turning point came during a (very boring) lecture of another module I was attending. I realised that I did not need to upload the dataset to AWS; I could bring AWS to me. I could use Python to access both the SQLite database and AWS Comprehend, and then write the results into a new (clean) CSV file that we could use for visualisations.

Luckily for me, I had experience with Python, so coding the Python script was decently smooth. The first step was to connect to AWS. I have an existing AWS Educate account, which I used to send requests to AWS Comprehend. After configuring the credentials and the AWS SDK locally on my machine, I was able to code a function that takes a piece of text, sends it to Comprehend, and returns the results. The instructions in the documentation are easy to follow.
## script for Sentiment Analysis
import boto3

def get_sentiment(text):
    ## send the text to Amazon Comprehend and flatten the response into a dict
    comprehend = boto3.client(service_name='comprehend', region_name='us-east-1')
    response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
    response = dict(response)
    result = {}
    result["sentiment"] = response["Sentiment"]
    result["positive"] = response["SentimentScore"]["Positive"]
    result["negative"] = response["SentimentScore"]["Negative"]
    result["neutral"] = response["SentimentScore"]["Neutral"]
    result["mixed"] = response["SentimentScore"]["Mixed"]
    ## single summary score: positive comments end up above 0, negative ones below 0
    result["sentiment_score"] = (result["positive"] - result["negative"]) / 2
    return result
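To get a feel for the output, here is a hypothetical call (the exact numbers returned depend on Comprehend’s model, so the values below are purely illustrative):

## example call; output values are illustrative, not actual Comprehend results
print(get_sentiment("I love this community!"))
## something like: {'sentiment': 'POSITIVE', 'positive': 0.98, 'negative': 0.01, ...}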
Connecting to SQLite needed a bit more research. This link pretty much covered all that I needed to know, and I would highly recommend it for anyone trying out SQLite in Python.
## connect to sqlite database
import sqlite3

connection = sqlite3.connect("master.sqlite")
print("opened database successfully")
data = connection.execute("SELECT id, subreddit, body, ups, downs, score FROM comments_cleaned")
The next chunk of code dealt with most of the processing. The biggest takeaway I got from coding this chunk was the error handling. Over several executions of the code, I discovered many errors that I had not expected, so I had to handle each one appropriately and rerun the code. Ultimately, the solution I came up with was to let user input decide whether to skip (filter out) a problematic row, or to fix the problem locally and process that row again, without disrupting or terminating the programme.
import csv
import botocore.exceptions

NUMBER_OF_ROWS = 5000   ## set per run: we generated 50-, 500- and 5000-row datasets

with open("comments_with_scores_" + str(NUMBER_OF_ROWS) + ".csv", 'w', newline='') as file:
    writer = csv.writer(file)
    ## create headers
    writer.writerow(["id", "subreddit", "body", "ups", "downs", "score", "sentiment",
                     "positive", "negative", "neutral", "mixed", "sentiment_score"])
    count = 1
    skipped = 0
    for row in data:
        ## get sentiment results
        body = row[2]
        try:
            result = get_sentiment(body)
        except botocore.exceptions.ClientError as e:
            ## recoverable AWS error: let the user fix the problem and retry, or skip the row
            print(e)
            answer = ''
            while answer not in ('ok', 'skip'):
                print("type 'ok' or 'skip' to continue")
                answer = input()
            if answer == 'skip':
                skipped += 1
                print("error occurred with sentiment analysis, skipping row")
                continue
            result = get_sentiment(body)
        except Exception:
            ## anything else: skip the row and move on
            skipped += 1
            print("error occurred with sentiment analysis, skipping row")
            continue
        ## write to csv
        try:
            writer.writerow([count, row[1], row[2], row[3], row[4], row[5],
                             result["sentiment"], result["positive"], result["negative"],
                             result["neutral"], result["mixed"], result["sentiment_score"]])
            print("scanned and accessed data index", count)
        except Exception:
            skipped += 1
            print("bad data, cannot write to csv, skipping row")
            continue
        ## maintain count
        if count == NUMBER_OF_ROWS:
            break
        count += 1
Ultimately, the code churned out datasets. I had kept the initial runs small just in case of future problems, producing datasets of 50, 500, and 5,000 rows. We were ambitious and wanted to go for 1 million rows, but unfortunately the run took too long, and my AWS session kept timing out and creating more errors. So we decided to work with around 8,000 rows, the largest my computer could churn out. After producing the CSV datasets, I sent them to Ling Hui so we could work on the visualisations together.
The Visualisation Bits
With such a huge dataset, there were a lot of things we could do with it. I have to admit, we were a little lost at first. With about a dozen different tools at our disposal, and about another dozen ways to represent the data, Ling Hui and I thought really hard about the best way to present it. We decided to work with Tableau, and realised that setting the variables as attribute, dimension, or measure would bring out different meanings.
The first graph we plotted was sentiment against average score. The score for each comment is calculated by subtracting the number of downvotes from the number of upvotes on that comment. As seen from the graph, Reddit comments with a negative sentiment have the highest average score, while comments with a positive sentiment have the lowest average score! Our guess is that we humans are hardwired for negativity. We have a tendency to pay more attention to bad things and overlook good things, likely as a result of evolution. Criticisms often have a greater impact than compliments, and bad news frequently draws more attention than good, which could explain why comments with a greater score carry a negative sentiment. In fact, after some research, we found out that this phenomenon is called the ‘negativity bias’ and has been widely studied by psychologists.

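(As a sanity check outside Tableau, the same aggregation can be reproduced in a few lines of pandas. This is just a sketch; I am assuming the ~8,000-row CSV from earlier was named comments_with_scores_8000.csv.)

## average comment score (ups - downs) for each sentiment label
import pandas as pd

df = pd.read_csv("comments_with_scores_8000.csv")
print(df.groupby("sentiment")["score"].mean().sort_values(ascending=False))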
We decided to play around with the functions in Tableau, and instead of aggregating the score by average, we aggregated it by maximum. This graph supports our reading of the first one, as the greatest score goes to a comment with a negative sentiment, and the lowest to one with a positive sentiment.

Next, we wanted to find out which subreddits contain higher-scoring comments. So if one is thinking of trying to increase their karma on Reddit, maybe they should try replying to posts in Unexpected. However, we realised that sorting our data in this manner may not be the best, as the subreddit ‘Unexpected’ only has one comment in our dataset. The ‘average’ measure is still better than the ‘sum’ measure, as sums are usually misleading because of population size, but averages over a handful of comments are unreliable too. Apart from Unexpected, the subreddit cringepics has the highest average score. After going through the subreddit together, we realised that the majority of its posts are supposed to be funny, so the replies are sarcastic and humorous, which might lead to many upvotes.

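(Filtering out the tiny subreddits makes the comparison fairer. A quick sketch, reusing the df loaded above; the minimum of 5 comments is an arbitrary threshold I picked.)

## average score per subreddit, keeping only subreddits with at least 5 comments
stats = df.groupby("subreddit")["score"].agg(["mean", "count"])
print(stats[stats["count"] >= 5].sort_values("mean", ascending=False).head(10))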
Following that, we wanted to plot a scatter plot of average sentiment score against average score, to see if there is a correlation between the two. The sentiment score was calculated as (positive − negative) / 2, so strongly negative comments sit closer to −0.5 and strongly positive ones closer to +0.5. We were hoping to see a relationship between average sentiment score and average score, but after plotting it we realised that the majority of comments have a low number of upvotes, and hence a low score, so the points were all clustered around a score of 0. Most comments won’t score highly, but if one wants a high score, one’s chances are greater if one’s words carry a negative sentiment and say something unconventional.

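(For anyone without Tableau, here is a rough matplotlib equivalent of this chart; I am assuming the averages were taken per subreddit, and df is the one loaded earlier.)

## scatter of average sentiment score vs average score, aggregated per subreddit
import matplotlib.pyplot as plt

per_sub = df.groupby("subreddit")[["sentiment_score", "score"]].mean()
plt.scatter(per_sub["sentiment_score"], per_sub["score"], s=10)
plt.xlabel("average sentiment score")
plt.ylabel("average score")
plt.show()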
In this next chart, we hoped to see which subreddit has the greatest score. In the previous charts we used the measure function, so Tableau would give us options like average, sum, max, min, etc. This time we wanted to play around with the dimension function. For those unsure of the difference between the two: measures are numerical values that mathematical functions work on, while dimensions are qualitative and do not total a sum. We filtered the score so that the score axis would start at 300 (essentially picking out the top few comments). As you can see in the graph, the subreddit AskReddit has the greatest score.

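(The same filter is easy to express in pandas, again reusing df from above.)

## comments with score >= 300, and the top score per subreddit among them
top = df[df["score"] >= 300]
print(top.groupby("subreddit")["score"].max().sort_values(ascending=False))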
We were still unsure why the sentiment score here would be negative, until we noticed a small flaw in the sentiment analysis, which we will elaborate on later. Anyway, it is understandable why AskReddit would have the highest scores, as people usually pose questions on that subreddit to get responses from other Redditors. So the comment with the highest score must have been a reply that was helpful to others. Intrigued, we went to find the body of the comment, and this is what it said:
"When I was in 6th grade, there was this kid. I’ll call him Danny. Danny was an idiot of the highest caliber. The kid just sat there with a blank look on his face all day, every day. We’re going over the human body and digestion one day and teacher decides to throw Danny an easy question out to boost him up a bit. She asks him how we get liquid into our bodies. Danny sits there with a blank look and says "I don’t know". She prompts him again by asking how he gets liquid into his body when he’s thirsty. Danny replies, "I don’t know." She is amazingly frustrated by this point and goes as far to ask "Danny, how do you get water into your body when you’re thirsty?" and he replies, "I don’t know. Through your skin, maybe?"
She freaks out and loses her shit. He’s sitting there looking dazed as always and Mrs. Teacher charges to the back of the room and grabs his arm. She pulls him out of his seat and drags him to the front of the classroom where there was a sink. The water gets turned on and she shoves his arm into the stream and keeps yelling at him "Through the skin?? THROUGH THE SKIN!? How are you feeling now? Are you getting full? Are you getting enough water? YOU DRINK WATER WITH YOUR MOUTH!"
I’m nearly 40 now and I will remember that scene for the rest of my life with mixed amounts of horror and amusement."
Not what we were expecting, but in hindsight, it was a pretty funny story. We realised the power of social media: its ability to bring people all over the globe together to forge a sense of camaraderie, or even just to make the lonely ones feel less alone.
This was another chart we obtained while playing around with Tableau, where we managed to get the top 10 comments and see which subreddit they came from, which was, expectedly, AskReddit.

However, one flaw of sentiment analysis in general is that the software scans through the individual words themselves, and two negatively connoted words placed together can actually express a positive meaning. Take, for instance, comment id 446:
THIS INTRO IS FUCKIN SICK.
‘Sick’ and ‘f**kin’ are two negatively connoted words, but placed together they actually mean that the Redditor thought the content was cool and interesting. Yet the sentiment analysis categorised this as negative with 99.68% confidence. So we realised that although sentiment analysis can generally help with obtaining an overview of wider public opinion, it is flawed in the sense that it struggles to pick up the sentiment of a phrase whose meaning differs from that of its individual words.
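(This is easy to re-check with the get_sentiment function from earlier; the exact confidence you get back will vary with Comprehend’s model version.)

## re-checking the misclassified comment from id 446
print(get_sentiment("THIS INTRO IS FUCKIN SICK."))
## in our run, Comprehend returned NEGATIVE with ~99.68% confidence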
Conclusion
We have finally come to the end of our sentiment analysis on Reddit! So how can our findings serve as meaningful indicators? We were able to understand the attitudes of Reddit users: most comments carried a negative sentiment. This could help people who are new to Reddit look at a subreddit’s polarity before joining, so that they know the general sentiment and what to expect in that subreddit. We feel this is especially important since Reddit is a social sharing website. This links to our next point: since the Reddit user base is so large, it could be used to gauge the general sentiment on more important topics like politics, or even simply to gauge consumer experience of a new product put out on the market.
On a more meta note, we believe this project gave us a better understanding of how social media works. With Reddit being one of the more accepting and social platforms, we would have expected the community to bond over positivity or general happiness. However, our results clearly show otherwise: people bond over negativity, or perhaps negative experiences. This gives a whole new perspective on humanity and social interactions as a whole, and even on how we (Ling Hui and Ryan) bonded as friends. Maybe we need more negativity in the world.
P.S. We were also disappointed that r/dataisbeautiful only appeared in our dataset once 🙁 It is unfortunate that we couldn’t utilise the entire dataset due to technical constraints, since it would have definitely been fun to see the sentiment analysis of r/dataisbeautiful.