Introduction
Two months ago, I completed CS50's Introduction to Artificial Intelligence with Python course. During the course, I was especially intrigued by the concept of Sentiment Analysis: extracting features from textual data and funneling them into Machine Learning algorithms to determine emotional tone. I decided to embark on a project that involves text classification and sentiment analysis.
Dataset Search
Having been part of Singapore's education system for the past 12 years, I have found the r/SGExams subreddit to be the preferred platform for student discussions, ranging from rants about how difficult a particular O-Level paper was to inquiries about various university courses.
The myriad of such discussions provides the perfect dataset for text classification and sentiment analysis of student opinions at each education level.
Data Extraction
I used PRAW, a Reddit API wrapper, to extract the title, body text, URL, date, and time of the top 1000 posts on r/SGExams. I extracted the data into a nested dictionary, which I then converted to a pandas DataFrame. Meme posts (image-only posts) were represented as the text "meme post" by looping over the submissions and checking their is_self attribute before adding them to the DataFrame.
Do note that to use PRAW, you have to create a Reddit account and register an application to obtain OAuth2 keys (client_id, client_secret, user_agent).
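Here is a minimal sketch of that extraction step, assuming placeholder credentials and illustrative column names rather than my exact notebook code:

```python
import praw
import pandas as pd

# Placeholder OAuth2 credentials -- replace with your registered app's values
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="sgexams-scraper by u/your_username",
)

rows = {"title": [], "body": [], "url": [], "created_utc": []}
for submission in reddit.subreddit("SGExams").top(limit=1000):
    rows["title"].append(submission.title)
    # Image/meme posts are not self posts, so store a placeholder string instead
    rows["body"].append(submission.selftext if submission.is_self else "meme post")
    rows["url"].append(submission.url)
    rows["created_utc"].append(submission.created_utc)

df = pd.DataFrame(rows)
```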
Data Preprocessing
Having extracted the data from Reddit, I went on to refine it further to fit the model's requirements. The data preprocessing steps are as follows:
- Removing any blank rows
- Changing all text to lowercase
- Word Tokenization: The process of breaking a stream of text up into words and phrases
- Removing Stop words
- Removing non-alphanumeric text
- Word Lemmatization: The process of reducing inflectional forms and sometimes derivationally related forms of a word to a common base form while considering the context.
I only applied Lemmatization and not Stemming to keep the tokens readable. For those who are unsure about the difference between the two, Stemming usually removes the last few characters of a word, often leading to incorrect meanings and spellings, while Lemmatization considers the context and converts a word to its base form, called a lemma. Here's an example:
| Original  | Stemming | Lemmatization |
|-----------|----------|---------------|
| Having    | Hav      | Have          |
| The going | The go   | The going     |
| Am        | Am       | Be            |
One problem I faced is that words often have multiple possible lemmas, so they are not always converted correctly; words like 'studying' were not converted to 'study'. To solve this, I added a get_wordnet_pos_tag function that takes in a word and returns its part-of-speech (POS) tag, which I then pass as the second argument to the lemmatize() call. The POS tag allows the function to lemmatize the word more accurately.
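Below is a rough sketch of the whole preprocessing pipeline with POS-aware lemmatization in NLTK. The helper name get_wordnet_pos_tag follows the description above, while the downloads and tag map are standard NLTK boilerplate rather than my exact code:

```python
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-off downloads of the NLTK resources used below
for resource in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(resource)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def get_wordnet_pos_tag(word):
    """Map NLTK's POS tag to the WordNet tag expected by lemmatize()."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_map = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_map.get(tag, wordnet.NOUN)

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # lowercase + tokenize
    tokens = [t for t in tokens if t.isalnum()]          # drop non-alphanumeric tokens
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    return " ".join(lemmatizer.lemmatize(t, get_wordnet_pos_tag(t)) for t in tokens)

print(preprocess("Students are studying hard for the upcoming papers"))
# e.g. "student study hard upcoming paper"
```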
Prepare Training and Testing Datasets
The corpus will be split into two datasets: Training and Testing. The training dataset is used to fit the model while predictions are made on the testing dataset, all of which is achieved with the train_test_split() function from the sklearn library.
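A minimal sketch of the split, assuming the preprocessed text lives in a df['text_final'] column and the education-level label in df['flair'] (both column names, the 70/30 split, and the random seed are my own assumptions):

```python
from sklearn.model_selection import train_test_split

Train_X, Test_X, Train_Y, Test_Y = train_test_split(
    df["text_final"], df["flair"], test_size=0.3, random_state=42
)
```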
Encoding
After that, the string labels in Train_Y and Test_Y are transformed into a numerical format, a process called encoding, so that the model can understand them. E.g. ["JC", "Uni", "Poly", "Sec"] → [0, 1, 2, 3]
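A sketch of this step with scikit-learn's LabelEncoder; note that LabelEncoder assigns integers in alphabetical order, so the actual mapping may differ from the example above:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
Train_Y = encoder.fit_transform(Train_Y)  # learn the label -> integer mapping on the training labels
Test_Y = encoder.transform(Test_Y)        # reuse the same mapping for the test labels
```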
Word Vectorization
There are multiple methods to transform text data into vectors, such as Word2Vec, but for this project I will be using the most popular method by far, TF-IDF, which stands for Term Frequency – Inverse Document Frequency.
- Term Frequency: How often a word appears in a document
- Inverse Document Frequency: Measures how rare a word is across documents
The TF-IDF model is first fitted to the entire corpus to build up its vocabulary. Train_X and Test_X are then vectorized into Train_X_tfidf and Test_X_tfidf, both of which contain, for each document, the indices of its words in the learned vocabulary along with their associated TF-IDF scores.
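A sketch of that vectorization step, reusing the assumed df['text_final'] column from earlier (max_features=5000 is my own choice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_features=5000)
tfidf_vect.fit(df["text_final"])               # build the vocabulary and IDF weights on the full corpus

Train_X_tfidf = tfidf_vect.transform(Train_X)  # sparse matrices of TF-IDF scores
Test_X_tfidf = tfidf_vect.transform(Test_X)
```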
Support Vector Machine Algorithm
So what is a Support Vector Machine (SVM)? It is a supervised learning model with associated learning algorithms that analyze data for classification and regression.
I'll start by explaining a few key concepts:
- Margin: The distance between the decision boundary and the nearest data points of each class
- Decision boundaries: Also known as hyperplanes, they appear as a line in a 2D feature space or a plane in a 3D feature space (and become hard to visualize in 4 or more dimensions). Decision boundaries are what classify the data points.
- Cross-validation: Used to determine how many misclassifications and observations to allow inside the margin for the best results.
Without going deep into the math, let’s start with the figure below. There are two classes of data points, separated by a decision boundary along with its margin. To find the margin’s width, introduce a unit vector w that is perpendicular to the decision boundary.
The goal of SVM is to maximize the margin between the data points and the decision boundary while allowing some misclassification; this is otherwise known as a Soft Margin Classifier or Support Vector Classifier (SVC). An SVC is a decision boundary with higher bias and lower variance, whose support vectors are the data points lying inside the soft margin or on its edge, and the soft margin itself is obtained via cross-validation. A Maximum Margin Classifier (MMC) will not work well here because it squeezes the margin between the decision boundary and the nearest data points, especially in the presence of outliers, increasing the chance of overfitting.
However, both SVC and MMC are unable to deal with non-linear data that has a lot of overlapping classes; this is where SVM comes into play. To put it simply, SVM takes data in a relatively low dimension, transforms it into a higher dimension, and finds an SVC that can separate the higher-dimensional data. Explicitly transforming the data, however, requires a lot of computation. SVM therefore uses kernel functions to find SVCs in higher dimensions by computing the relationships between every pair of points as if they were in the higher-dimensional space, without actually transforming the data. This is commonly known as the Kernel Trick.
With some understanding of SVM, we can explore our dataset and build the SVM model.
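Here is a minimal sketch of fitting and evaluating the classifier with scikit-learn's SVC; the linear kernel and C=1.0 are reasonable defaults rather than tuned values:

```python
from sklearn import svm
from sklearn.metrics import accuracy_score

# Fit a support vector classifier on the TF-IDF features
SVM = svm.SVC(C=1.0, kernel="linear", gamma="auto")
SVM.fit(Train_X_tfidf, Train_Y)

# Predict on the held-out test set and report accuracy
predictions = SVM.predict(Test_X_tfidf)
print("SVM accuracy:", accuracy_score(Test_Y, predictions) * 100)
```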
Results
The results are more or less what I had expected. In the interest of time, I didn't train the model on all the data available on the subreddit; in fact, I only used the top 1000 posts of all time, just to test the entire flow. The model had a subpar accuracy of 58.3% and was definitely overfitting. Also, because all the posts revolve around the same topic, education, it can be difficult to differentiate between education levels since many common terms are used interchangeably.
Sentiment Analysis
I ran a Sentiment Analysis tool on the same dataset, taking the categories (Secondary School, Junior College, Polytechnic and University) into account.
I used VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool, for this project as it is specifically tuned to sentiments expressed on social media. Furthermore, it is sensitive to both the polarity (positive/negative) and the intensity of emotions. VADER scores sentiment on a scale from -1 (most negative) to 1 (most positive), with compound scores between -0.05 and 0.05 considered neutral. I tested the tool to see if it could understand language intensity and detect mixed polarities:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer(sentence):
    # Return only the compound score, which summarizes the overall sentiment
    score = analyser.polarity_scores(sentence)
    return score['compound']
sentence = "I dislike mushroom"
sentiment_analyzer(sentence)
OUTPUT: -0.3818
sentence = "I really hate mushroom"
sentiment_analyzer(sentence)
OUTPUT: -0.6115
sentence = "I dislike mushroom but I love fried chicken"
sentiment_analyzer(sentence)
OUTPUT: 0.7184
sentence = "I dislike mushroom but I like fried chicken"
sentiment_analyzer(sentence)
OUTPUT: 0.3506
Additionally, I updated its lexicon so that "meme" carries a positive sentiment, as r/SGExams contains quite a lot of meme posts on current education affairs.
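Updating the lexicon is a one-liner; the valence of 2.0 below is my own choice on VADER's roughly -4 to +4 word-valence scale:

```python
# Treat "meme" as a mildly positive word from now on
analyser.lexicon.update({"meme": 2.0})
```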
I created a chart of the overall distribution of sentiments across posts from all education levels to make comparisons and perhaps draw some conclusions. I used Matplotlib to visualize the data.
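A rough sketch of how such a chart can be produced, assuming the DataFrame has illustrative 'category' (education level) and 'sentiment' (positive/neutral/negative label) columns:

```python
import matplotlib.pyplot as plt

# Count posts per (education level, sentiment label) pair and plot a grouped bar chart
counts = df.groupby(["category", "sentiment"]).size().unstack(fill_value=0)
counts.plot(kind="bar")
plt.xlabel("Education level")
plt.ylabel("Number of posts")
plt.title("Sentiment distribution by education level")
plt.tight_layout()
plt.show()
```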
Noticeably, the number of negative sentiments outnumbered the positive ones across all education levels, with Junior College having the highest number of negative posts. A deeper analysis shows that posts whose sentiments were classified as neutral were mainly query related, explaining their neutrality.
I went further and plotted timeline charts of each education level with its rolling average to see if there were any observable trends.
The chart turned out to be quite messy and difficult to visualize, so I edited my code to remove the raw sentiment data and just plot each education level’s rolling average.
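A sketch of the rolling-average plot, assuming illustrative 'date', 'category', and 'compound' columns and a 30-post window:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for level, group in df.sort_values("date").groupby("category"):
    # Smooth each education level's compound scores with a rolling mean
    group.set_index("date")["compound"].rolling(window=30).mean().plot(ax=ax, label=level)
ax.set_ylabel("Rolling average compound sentiment")
ax.legend()
plt.show()
```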
Evaluating charts
Generally, sentiments across all education levels are mostly negative throughout the year, as predicted from the bar chart. I noticed that for Junior College and Secondary School, sentiments are most negative from June until February of the following year, which fits the period leading up to the final-year exams and the subsequent O- and A-Level results release, when students take to the platform to express their stress, anxiety, and doubts about getting into their dream school or course.
For Polytechnic and University, no conclusions can be drawn, perhaps due to their modular academic systems as compared to the final-exam systems of JC and Secondary School. A deeper analysis showed that their posts are mainly queries about various specialized courses, AMAs, rants, and scholarships rather than exams and results, which explains the unpredictable fluctuations of posts with different sentiments.
Ending Notes
Before I end this post, I would like to add some disclaimers.
The results obtained from my project should be taken lightly, as there were quite a few things that could have been done better.
For example, when classifying posts, I mainly distinguished them by their flairs. Flairs such as META, Rant, and Advice cut across all the education levels, so some posts that look general are actually directed at a specific education level, leading to misclassification.
In addition, the dataset I used is really small, with just 1000 posts; the numbers for each education level are not balanced, and the spread of data was limited to about a year, making it hard to establish solid trends.
Anyway, it was a very interesting project and a good learning experience. I look forward to learning more and doing more projects in the future.
All the code is available on my GitHub here:
LinkedIn Profile: Sean Yap
Cheers!
References
[1] Prof. Patrick Henry Winston, Support Vector Machines (2014), MIT OpenCourseWare on Artificial Intelligence
[2] Felippe Rodrigues, Scrape Reddit with Python (2018), storybench.org from Northeastern University School of Journalism