Image by Gerd Altmann from Pixabay

Reddit Flair Prediction Series

Predicting Reddit Flairs using Machine Learning and Deploying the Model using Heroku — Part 2

Text Analysis and Model Building

Towards Data Science
12 min read · May 27, 2020


Welcome to Part 2 of this series, where I continue working on the Reddit Flair Detection problem. In Part 1, I discussed the background of the problem and the data collection method. I highly recommend going through Part 1 before starting this one, because it shares the insights and reasoning behind the data collection process and describes the various indicators I used in my project and model building. If you haven't completed Part 1 but want to continue with Part 2, you can get the data.csv file from here. Once you have obtained the data, let's begin.

Introduction

In this part of the series, I will be working on data analysis, text analysis and text classification. While this tutorial is specific to the project I worked on, the techniques can be applied to any text classification problem. It is important to note that this is a multi-class text classification problem, which comes with certain caveats exclusive to this type of problem. Most tutorials online deal with binary text classification, such as spam filters, but in the real world you'll mostly deal with multi-class problems. Hence, this tutorial should be a good starting point.

Problem Recap

As I have already discussed, this is a supervised multi-class text classification problem. We have collected the features and the labels, and our goal is to build a model that predicts the flair of a post from the features we collected for it. Let's begin.

Important Libraries

These are the libraries that we will be using in this process.

Necessary libraries for the task
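The embedded gist isn't reproduced here, but a minimal set of imports covering everything used in this tutorial would look something like this (a sketch, not the exact original):

# Data handling
import pandas as pd
import numpy as np

# Text processing: the stopword list needs a one-time download
import nltk
nltk.download('stopwords')

# Feature extraction, feature selection, models and evaluation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score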

In case they are not installed on your system, you can download this file and run the following command in your terminal.

pip install -r requirements.txt

Note: This will install all the packages in your current Python environment. If you want to install them in a virtual environment instead, refer to this link.

Exploratory Data Analysis

Start by reading the data from the csv file into a DataFrame.

# Reading Data 
data = pd.read_csv('data.csv')

If you obtained the data from the GitHub link, the next step is important for you; others can skip it. An explanation can be found here.

# Drop the stray index column, then shuffle the rows
data.drop(['Unnamed: 0'], inplace=True, axis=1)
data[:] = data.sample(frac=1).values
data.head()
First look at the data

Since this is a text classification problem, we will only use the features containing text for the machine learning model, namely Title, Body, Comments and URL (optional). Let's look at the columns' data types and the missing values.

# Display data types and null values
data.info()
Output after data.info()

There are a lot of null values in the Body column and some missing values in the Comments column. We cannot impute them since they contain user-generated content. However, every entry has a Title and a Flair, so we don't have to drop any rows and can use all of them for analysis. The Flair column contains many classes to predict.

print(len(data['Flair'].unique()))
data['Flair'].unique()
OUTPUT: 11
['Sports' 'Politics' '[R]eddiquette' 'Business/Finance' 'Food' 'AMA'
'AskIndia' 'Photography' 'Non-Political' 'Science/Technology'
'Policy/Economy']

So, there are 11 unique classes, and every new post we get needs to be classified into one of them. I have already mentioned the important features that we will use for our analysis, so let's reduce the DataFrame to just the relevant ones.

# List of relevant features
features = ['Flair', 'URL', 'Title', 'Comments', 'Body']
data = data[features]
data.head()
Filtered Data

Now that we have more relevant data, we need to create a couple of dictionaries for future use[1]. The first step is generating a unique ID for each flair; then we create the dictionaries from those. These dictionaries will let us refer to each flair by the unique ID generated for it and vice versa.

Unique IDs for each flair type
Output of the dictionaries

We have created two dictionaries:

  1. category_labels : flairs as keys and their assigned IDs as values; used to assign a label to each prediction.
  2. category_reverse : the reverse of the first dictionary, with IDs as keys and flairs as values.
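The embedded snippet isn't shown above, so here is a minimal sketch of how these two dictionaries might be built; the Label column name is my own choice:

# Assign a unique integer ID to each flair
data['Label'] = data['Flair'].factorize()[0]

# Flair -> ID lookup, and its reverse
category_labels = dict(data[['Flair', 'Label']].drop_duplicates().values)
category_reverse = {v: k for k, v in category_labels.items()}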

The next step is to create a combined feature from the Title, Body and Comments. I am not using URL for now and leave it to you to analyse it; there are many creative ways to do that, and you can mention them in the comments below. I will create a new feature, Combine, which incorporates the aforementioned features.

Creating a new feature Combine
DataFrame after adding the Combine feature
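The gist for this step isn't embedded here; a short sketch that merges the three columns, treating missing Body and Comments values as empty strings, could be:

# Combine Title, Body and Comments into a single text feature
data['Combine'] = (data['Title'].fillna('') + ' '
                   + data['Body'].fillna('') + ' '
                   + data['Comments'].fillna(''))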

Text Cleaning

This is one of the most important aspects of a text classification project. Not all words are equally important: words like and, the and is occur so commonly that they appear in the data of every flair category and only confuse the classifier, so it is highly advisable to remove them. In a sentiment analysis project, we might keep the punctuation marks, since the number of exclamation marks could completely change the meaning; here, however, I did not feel the need to keep them, so I will remove them in the next step. The common words I just mentioned are available in the nltk library, so you don't have to make your own list.

# Collect all the english stopwords and display them
STOPWORDS = nltk.corpus.stopwords.words('english')
print(STOPWORDS)
List of nltk STOPWORDS in the English language

Let’s define a cleaning function. We will pass our features through this function to clean them.

Function to clean our data
Combined features after cleaning
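The embedded cleaning function isn't reproduced here. A minimal sketch that lowercases the text, strips punctuation and drops the nltk stopwords (the original may include additional steps) is:

import re

def clean_text(text):
    # Lowercase and keep only letters, digits and spaces
    text = re.sub(r'[^a-z0-9 ]', '', str(text).lower())
    # Remove the common English stopwords collected above
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

data['Combine'] = data['Combine'].apply(clean_text)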

Text Representation

Classifiers and learning algorithms cannot directly process text documents in their original form: most of them expect fixed-size numerical feature vectors rather than raw text documents of variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation[1]. Vectorization converts words into vectors of numbers, which can encode structure that a machine learning or data mining algorithm can exploit. [2]

I would like to give credit to Susan Li for this part of the article. The most correlated unigrams and bigrams test she introduced in her article is a very insightful technique: it lets us find out which words appear most often in a particular type of flair and gives us insight into how the model makes its predictions. If a particular flair has a lot of unrelated words, we might consider adding more data or removing some.

Now, for each term that occurs in our dataset, we will calculate a measure called Term Frequency-Inverse Document Frequency, abbreviated as tf-idf. We will use sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for each of our posts:

  • sublinear_tf is set to True to use a logarithmic form for frequency.
  • min_df is the minimum number of documents a word must be present in to be kept.
  • norm is set to l2, to ensure all our feature vectors have a Euclidean norm of 1.
  • ngram_range is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.
  • stop_words is set to "english" to remove common words ("a", "the", ...) and reduce the number of noisy features. [1]
tfidf Vectorization on Combined Data
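The embedded snippet isn't reproduced here; a sketch of the vectorization using the parameters listed above (min_df=5 is my assumption, and Label comes from the earlier dictionary sketch) might be:

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                        ngram_range=(1, 2), stop_words='english')

# One tf-idf vector per post, plus the matching numeric labels
features = tfidf.fit_transform(data['Combine']).toarray()
labels = data['Label']
print(features.shape)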
Output:
(1650, 3299)

Now, each of the 1650 posts is represented by 3299 features, one for the tf-idf score of each unigram and bigram.

We can use sklearn.feature_selection.chi2 to find the terms that are most correlated with each of the flairs:

Print the list of most correlated unigrams and bigrams for each flair
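The embedded code isn't shown; a sketch in the spirit of Susan Li's approach, assuming the features, labels, tfidf and category_labels names from the sketches above, is:

N = 5  # number of terms to show per flair
for flair, category_id in sorted(category_labels.items()):
    # Chi-squared score of every tf-idf feature against this flair
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names_out() in newer sklearn; get_feature_names() in older versions
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [w for w in feature_names if len(w.split(' ')) == 1]
    bigrams = [w for w in feature_names if len(w.split(' ')) == 2]
    print("Flair '{}':".format(flair))
    print("Most correlated unigrams:\n. " + "\n. ".join(unigrams[-N:]))
    print("Most correlated bigrams:\n. " + "\n. ".join(bigrams[-N:]))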

You’ll find the output below to be pretty intuitive for each flair.

Flair 'AMA':
Most correlated unigrams:
. hi
. anything
. ask
. questions
. ama
Most correlated bigrams:
. ask us
. us anything
. hi reddit
. answer questions
. ask anything

Flair 'AskIndia':
Most correlated unigrams:
. advice
. dad
. situation
. afraid
. family
Most correlated bigrams:
. ive seen
. want know
. feel like
. work home
. dont want

Flair 'Business/Finance':
Most correlated unigrams:
. firms
. emi
. hdfc
. mukesh
. bank
Most correlated bigrams:
. credit card
. mukesh ambani
. share market
. reliance jio
. yes bank

Flair 'Food':
Most correlated unigrams:
. restaurant
. chutney
. recipe
. chicken
. food
Most correlated bigrams:
. im trying
. every day
. couldnt find
. dont eat
. indian food

Flair 'Non-Political':
Most correlated unigrams:
. rural
. dads
. found
. bored
. comics
Most correlated bigrams:
. im gonna
. palghar lynching
. amazon prime
. india live
. amid lockdown

Flair 'Photography':
Most correlated unigrams:
. mm
. beach
. nikon
. shot
. oc
Most correlated bigrams:
. stay home
. equipment nikon
. one plus
. da mm
. nikon da

Flair 'Policy/Economy':
Most correlated unigrams:
. gdp
. govt
. investments
. nirmala
. economy
Most correlated bigrams:
. health workers
. https internetfreedomin
. petrol diesel
. indian economy
. raghuram rajan

Flair 'Politics':
Most correlated unigrams:
. sonia
. removed
. modi
. arnab
. muslims
Most correlated bigrams:
. home minister
. arnab goswami
. pm modi
. rahul gandhi
. john oliver

Flair 'Science/Technology':
Most correlated unigrams:
. vpn
. iit
. develop
. zoom
. users
Most correlated bigrams:
. anyone else
. covid virus
. home affairs
. ministry home
. cow urine

Flair 'Sports':
Most correlated unigrams:
. ipl
. football
. sports
. cricket
. cup
Most correlated bigrams:
. india pakistan
. know people
. one time
. times india
. world cup

Flair '[R]eddiquette':
Most correlated unigrams:
. boop
. askaway
. beep
. creator
. bot
Most correlated bigrams:
. bot problem
. bot bot
. askaway creator
. beep boop
. discussion thread

You'll see that for most flairs, the most correlated words are pretty self-explanatory.

Modelling the input features and labels

Our next task is to model the input data in a way the classifiers can understand: we need to convert the inputs into a vector of numbers tied to a numeric label. Once we have this vector representation of the text, we can train supervised classifiers to predict the flair of each Reddit post a user submits. Let's start by splitting the data into training and testing sets. There is a reason I am not vectorizing the data first: if you do that, the vectorizer treats the whole dataset as its sample and fits to it, meaning your .fit() or .fit_transform() uses both the training and the test data. When you then split, the test features have already been influenced by the full dataset. This model is going to be deployed, and we will not have that luxury with unseen data, so we cannot transform it based on the combined data. This might reduce the test accuracy, but in my opinion it makes for a better model in the long run because it eliminates this source of bias.

Vectorize and transform data
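The embedded snippet isn't reproduced here. A sketch that splits first and fits the vectorizer on the training data only (test_size and random_state are my own choices) could be:

# Split the raw text first so the vectorizer never sees the test set
X_train, X_test, y_train, y_test = train_test_split(
    data['Combine'], data['Flair'], test_size=0.25, random_state=42)

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# Fit on the training data only, then transform both splits
X_train_counts = count_vect.fit_transform(X_train)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test))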

After all the above data transformations, now that we have the features and labels, it is time to train our classifiers. We can use a number of different classifiers for this problem. I will be using four different types of models and, due to the length of this article, I will only discuss baseline results, which can act as a good comparator. I will write a separate article on Google's BERT and on hyperparameter tuning for the current models. The models that I will be using are:

  1. Multinomial Naive Bayes Classifier
  2. Random Forest Classifier
  3. Logistic Regression
  4. Linear Support Vector Classifier (SVC)

Each of these classifiers has its own merits and demerits. It is up to you to figure out which one suits your needs the best. I will just walk you through the process of implementing and pipelining them. Here’s how you can train your data.

# Create an instance 
model = MultinomialNB()
# Fit to training data
model.fit(X_train_tfidf, y_train)
# Predictions on X_test_tfidf
# Obtain X_test_tfidf in the manner described above
model.predict(X_test_tfidf)

This is pretty basic, right? You must have done this multiple times before if you have ever trained a simple classifier. Let’s learn something new then.

Pipelining

There are many moving parts in a Machine Learning (ML) model that have to be tied together for an ML model to execute and produce results successfully. Each stage of a pipeline is fed data processed from its preceding stage; that is, the output of a processing unit is supplied as the input to the next step. In software engineering, people build pipelines to develop software that is exercised from source code to deployment. Similarly, in ML, a pipeline is created to allow data flow from its raw format to some useful information. The data flows through the pipeline just as water flows in a pipe. Mastering the pipeline concept is a powerful way to create error-free ML models, and pipelines are a crucial element of an AutoML system. It provides a mechanism to construct a multi-ML parallel pipeline system in order to compare the results of several ML methods. [3] Here’s what our pipeline will look like.

A flowchart of our pipeline (Made with SmartDraw)

Let’s start with the Multinomial Naive Bayes Classifier.

nb_fit = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])

Similarly, we can make functions for each of our classifiers for a more streamlined approach.

Making our prediction functions
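The embedded gist isn't shown here; one of these functions might look like the sketch below, and the other three follow the same pattern with a different final estimator:

def nb_classifier(X_train, X_test, y_train, y_test):
    # Vectorize, apply tf-idf weighting and classify in a single pipeline
    nb_fit = Pipeline([('vect', CountVectorizer()),
                       ('tfidf', TfidfTransformer()),
                       ('clf', MultinomialNB())])
    nb_fit.fit(X_train, y_train)
    y_pred = nb_fit.predict(X_test)
    print("Model Accuracy: {}".format(accuracy_score(y_test, y_pred)))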

Making Predictions and Evaluating Results

Making functions like the ones created above modularizes your code and makes your task easier. Now you can make predictions and evaluate results conveniently.

print("Evaluate Naive Bayes Classifier")
nb_classifier(X_train, X_test, y_train, y_test)
print("Evaluate Random Forest Classifier")
random_forest(X_train, X_test, y_train, y_test)
print("Evaluate Logistic Regression Model")
log_reg(X_train, X_test, y_train, y_test)
print("Evaluate SVC Model")
svc(X_train, X_test, y_train, y_test)

The commands above print the following results. Your results may vary based on the data you used and the pre-processing you did. These are baseline results and were later improved using hyperparameter tuning; however, this article is long enough without that, so I will cover it in another article.

Evaluate Naive Bayes Classifier
Model Accuracy: 0.53951612903225806
Evaluate Random Forest Classifier
Model Accuracy: 0.6074193548387097
Evaluate Logistic Regression Model
Model Accuracy: 0.6645161290322581
Evaluate SVC Model
Model Accuracy: 0.5248387096774194

We can see that the logistic regression model works best. However, that can quickly change after hyperparameter tuning, so I will leave that to you for now. There are many reasons why the performance is low at this stage, including data quality, and they would make for a good discussion in the comments below. In the next part, I will serialise this model for deployment, and we will use Flask to deploy our machine learning model. The web app will work in such a way that the user posts a link and gets the predicted class back. To learn how to create the web app, move on to Part 3. You can find all the articles in this series here.

