Predicting Reddit Comment Upvotes with Machine Learning

Adam Reevesman
Towards Data Science
10 min read · Dec 31, 2018


In this article, we will use Python and the scikit-learn package to predict the number of upvotes of a comment on Reddit. We fit a variety of regression models and compare their performance using the following metrics:

  • R² to measure the goodness of fit
  • mean absolute error (MAE) and root mean squared error (RMSE) on a test set to measure accuracy.
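
As a quick illustration with made-up numbers, all three metrics come straight from scikit-learn (the same calls appear in the diagnostics helper defined later):

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Toy actual and predicted scores, just to show the metric calls
y_true = np.array([3, 10, 250, 1, 42])
y_pred = np.array([5, 8, 180, 2, 40])

print(f"R-Sq: {r2_score(y_true, y_pred):.4}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_true, y_pred)):.4}")
print(f"MAE: {mean_absolute_error(y_true, y_pred):.4}")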

This article is based on the work from this GitHub repository. The code can be found in this notebook.

Background

Reddit is a popular social media site. On this site, users post threads in various subreddits, like the one below.

A thread in the “AskReddit” subreddit

Users can comment on threads or other comments. They can also give upvotes or downvotes to other threads and comments.

Our goal is to predict the number of upvotes that comments will receive.

Data

The data, a pickle file containing 1,205,039 rows (comments), all posted in May 2015, is hosted on Google Drive and can be downloaded using this link.

The target variable and relevant features that will be used for modeling are listed below. They can be divided into several categories.

Target variable

  • score: number of upvotes on the comment

Comment level features

  • gilded: the number of times the comment was gilded (awarded Reddit Gold)
  • distinguished: the type of user who posted the comment: either ‘moderator’, ‘admin’, or ‘user’
  • controversiality: a Boolean indicating whether (1) or not (0) the comment is controversial (a popular comment receiving close to the same number of upvotes as downvotes)
  • over_18: Whether or not the thread has been marked as NSFW
  • time_lapse: the time in seconds between the comment and the first comment on the thread
  • hour_of_comment: the hour of the day the comment was posted
  • weekday: the day of the week the comment was posted
  • is_flair: whether or not there is flair text for the comment (https://www.reddit.com/r/help/comments/3tbuml/whats_a_flair/)
  • is_flair_css: whether or not there is a CSS class for the comment flair
  • depth: depth of comment in thread (number of parent comments that comment has)
  • no_of_linked_sr: number of subreddits mentioned in the comment
  • no_of_linked_urls: number of urls linked in the comment
  • subjectivity: number of instances of “I”
  • is_edited: whether or not the comment has been edited
  • is_quoted: whether or not the comment quotes another
  • no_quoted: number of quotes in the comment
  • senti_neg: negative sentiment score
  • senti_neu: neutral sentiment score
  • senti_pos: positive sentiment score
  • senti_comp: compound sentiment score (a sketch of how such scores can be computed follows this list)
  • word_counts: number of words in the comment
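
The article's pipeline does not show how the four sentiment scores were produced, but they match the output of NLTK's VADER analyzer, so a sketch along those lines might look like this (the original feature code may differ):

# An assumption about the original pipeline, not a confirmed detail:
# computing senti_neg, senti_neu, senti_pos and senti_comp with VADER.
# Requires: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("This thread is great, thanks for sharing!")
# scores is a dict with keys 'neg', 'neu', 'pos' and 'compound'
senti_neg, senti_neu = scores['neg'], scores['neu']
senti_pos, senti_comp = scores['pos'], scores['compound']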

Parent level features

  • time_since_parent: the time in seconds between the comment and its parent comment
  • parent_score: score of the parent comment (NaN if the comment doesn’t have a parent)
  • parent_cosine: cosine similarity between the GloVe embeddings (https://nlp.stanford.edu/projects/glove/) of the comment and its parent comment (see the sketch after this list)
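
As a rough sketch of how such a similarity feature can be built, assuming each comment is represented by the average of its words' GloVe vectors (the glove lookup below is a hypothetical word-to-vector dict loaded from the GloVe files):

import numpy as np

def embed(text, glove):
    """Average the GloVe vectors of the words that have one."""
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else None

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# parent_cosine = cosine_similarity(embed(comment_text, glove),
#                                   embed(parent_text, glove))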

Comment tree root features

  • is_root: whether or not the comment is a root
  • time_since_comment_tree_root: the time in seconds between the comment and the comment tree root
  • comment_tree_root_score: score of comment tree root

Thread level features

  • link_score: score of the thread the comment is on
  • upvote_ratio: the percentage of upvotes out of all votes on the thread the comment is on
  • link_ups: number of upvotes on thread
  • time_since_link: time in seconds since the thread was created
  • no_of_past_comments: number of comments on the thread before the comment was posted
  • score_till_now: score of thread at the time this comment was posted
  • title_cosine: cosine similarity between the embeddings of the comment and its thread’s title
  • is_selftext: whether or not the thread has selftext

Setup

Let’s load all of the libraries we’ll need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import ElasticNetCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import warnings
warnings.filterwarnings('ignore')

We also define some functions for interacting with the models.

def model_diagnostics(model, pr=True):
    """
    Returns and prints the R-squared, RMSE and the MAE for a trained model
    """
    y_predicted = model.predict(X_test)
    r2 = r2_score(y_test, y_predicted)
    mse = mean_squared_error(y_test, y_predicted)
    mae = mean_absolute_error(y_test, y_predicted)
    if pr:
        print(f"R-Sq: {r2:.4}")
        print(f"RMSE: {np.sqrt(mse)}")
        print(f"MAE: {mae}")
    return [r2, np.sqrt(mse), mae]


def plot_residuals(y_test, y_predicted):
    """
    Plots the distributions of the actual and predicted values of the
    target variable, followed by the distribution of the residuals
    """
    fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True)
    sns.distplot(y_test, ax=ax0, kde=False)
    ax0.set(xlabel='Test scores')
    sns.distplot(y_predicted, ax=ax1, kde=False)
    ax1.set(xlabel='Predicted scores')
    plt.show()
    fig, ax2 = plt.subplots()
    sns.distplot(y_test - y_predicted, ax=ax2, kde=False)
    ax2.set(xlabel='Residuals')
    plt.show()


def y_test_vs_y_predicted(y_test, y_predicted):
    """
    Produces a scatter plot of the actual and predicted values of the
    target variable
    """
    fig, ax = plt.subplots()
    ax.scatter(y_test, y_predicted)
    ax.set_xlabel('Test Scores')
    ax.set_ylabel('Predicted Scores')
    ax.set_ylim([-75, 1400])
    plt.show()


def get_feature_importance(model):
    """
    For fitted tree-based models, prints the feature importances as a
    tidy output and returns them as a dictionary
    """
    # Rebuild the feature names in the order the training matrix is
    # assembled below: numeric columns, then booleans, then dummies
    X_non_text = pd.get_dummies(df[cat_cols])
    features = numeric_cols + bool_cols + list(X_non_text.columns)
    feature_importance = dict(zip(features, model.feature_importances_))
    for name, importance in sorted(feature_importance.items(),
                                   key=lambda x: x[1], reverse=True):
        print(f"{name:<30}: {importance:>6.2%}")
    print(f"\nTotal importance: {sum(feature_importance.values()):.2%}")
    return feature_importance

Read in data

df = pd.read_pickle('reddit_comments.pkl')

Handle missing values

The data has some missing values, which are handled either by imputation or by dropping observations. Missing values occurred in the following columns for the following reasons:

  • parent_score: some comments did not have a parent (imputed)
  • comment_tree_root_score and time_since_comment_tree_root: some comments were the root of a comment tree (imputed)
  • parent_cosine, parent_euc, title_cosine, title_euc: some comments lacked words that had GloVe word embeddings (dropped). In addition, some comments did not have a parent (parent_cosine, parent_euc imputed)
df = df[~df.title_cosine.isna()]  # drop rows where title_cosine is NaN
parent_score_impute = df.parent_score.mode()[0]  # impute with mode of parent_score column
comment_tree_root_score_impute = df.comment_tree_root_score.mode()[0]  # impute with mode of comment_tree_root_score column
time_since_comment_tree_root_impute = df.time_since_comment_tree_root.mode()[0]  # impute with mode of time_since_comment_tree_root column
parent_cosine_impute = 0
parent_euc_impute = 0
df.loc[df.parent_score.isna(), 'parent_score'] = parent_score_impute
df.loc[df.comment_tree_root_score.isna(), 'comment_tree_root_score'] = comment_tree_root_score_impute
df.loc[df.time_since_comment_tree_root.isna(), 'time_since_comment_tree_root'] = time_since_comment_tree_root_impute
df.loc[df.parent_cosine.isna(), 'parent_cosine'] = parent_cosine_impute
df.loc[df.parent_euc.isna(), 'parent_euc'] = parent_euc_impute

Select variables

In the next step, we define which variables to use when training the model. We make lists of the boolean variables, the categorical variables, and the numeric variables.

bool_cols = ['over_18', 'is_edited', 'is_quoted', 'is_selftext']
cat_cols = ['subreddit', 'distinguished', 'is_flair', 'is_flair_css', 'hour_of_comment', 'weekday']

numeric_cols = ['gilded', 'controversiality', 'upvote_ratio', 'time_since_link',
                'depth', 'no_of_linked_sr', 'no_of_linked_urls', 'parent_score',
                'comment_tree_root_score', 'time_since_comment_tree_root',
                'subjectivity', 'senti_neg', 'senti_pos', 'senti_neu',
                'senti_comp', 'no_quoted', 'time_since_parent', 'word_counts',
                'no_of_past_comments', 'parent_cosine', 'parent_euc',
                'title_cosine', 'title_euc', 'link_score']

Using our lists of variables, we can prepare the data for modeling. The step below uses scikit-learn’s LabelBinarizer to make dummy variables out of the categorical columns, then combines all variables into a single feature matrix.

lb = LabelBinarizer()
cat = [lb.fit_transform(df[col]) for col in cat_cols]  # one-hot encode each categorical column
bol = [df[col].astype('int') for col in bool_cols]  # cast booleans to 0/1
t = df.loc[:, numeric_cols].values  # numeric features as an array
final = [t] + bol + cat  # numeric, then boolean, then categorical
y = df.score.values
x = np.column_stack(tuple(final))  # single feature matrix

We split the data into a training and test set using an 80–20 split.

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

Modeling

In this section, we use scikit-learn to fit models on the Reddit data. We start with a baseline model, then try to improve results with Lasso, Ridge, and Elastic Net Regression. In addition, we try K-Nearest Neighbors, Decision Tree, Random Forest and Gradient Boosted Regression.

First, let’s define a dictionary that will store the results of the model diagnostics.

model_performance_dict = dict()

Linear Regression Models

Baseline Model

We fit a simple model to establish a baseline. This model always predicts the mean number of upvotes.

baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train,y_train)
model_performance_dict["Baseline"] = model_diagnostics(baseline)

Linear Regression

linear = LinearRegression()
linear.fit(X_train,y_train)
model_performance_dict["Linear Regression"] = model_diagnostics(linear)

Lasso Regression

lasso = LassoCV(cv=30).fit(X_train, y_train)
model_performance_dict["Lasso Regression"] = model_diagnostics(lasso)

Ridge Regression

ridge = RidgeCV(cv=10).fit(X_train, y_train)
model_performance_dict["Ridge Regression"] = model_diagnostics(ridge)

Elastic Net Regression

elastic_net = ElasticNetCV(cv = 30).fit(X_train, y_train)
model_performance_dict["Elastic Net Regression"] = model_diagnostics(elastic_net)

Nonlinear Regression Models

K-Nearest Neighbor Regression

knr = KNeighborsRegressor()
knr.fit(X_train, y_train)
model_performance_dict["KNN Regression"] = model_diagnostics(knr)

Decision Tree Regression

dt = DecisionTreeRegressor(min_samples_split=45, min_samples_leaf=45, random_state = 10)
dt.fit(X_train, y_train)
model_performance_dict["Decision Tree"] = model_diagnostics(dt)

Random Forest Regression

rf = RandomForestRegressor(n_jobs=-1, n_estimators=70, min_samples_leaf=10, random_state = 10)
rf.fit(X_train, y_train)
model_performance_dict["Random Forest"] = model_diagnostics(rf)

Gradient Boosting Regression

gbr = GradientBoostingRegressor(n_estimators=70, max_depth=5)
gbr.fit(X_train, y_train)
model_performance_dict["Gradient Boosting Regression"] = model_diagnostics(gbr)

Model comparison

We compare the models based on three metrics: R², MAE, and RMSE. To do so, we define the function below.

def model_comparison(model_performance_dict, sort_by='RMSE', metric='RMSE'):
    """
    Plots a bar chart comparing the models on the chosen metric
    """
    Rsq_list = []
    RMSE_list = []
    MAE_list = []
    for key in model_performance_dict.keys():
        Rsq_list.append(model_performance_dict[key][0])
        RMSE_list.append(model_performance_dict[key][1])
        MAE_list.append(model_performance_dict[key][2])

    props = pd.DataFrame([])
    props["R-squared"] = Rsq_list
    props["RMSE"] = RMSE_list
    props["MAE"] = MAE_list
    props.index = model_performance_dict.keys()
    props = props.sort_values(by=sort_by)

    fig, ax = plt.subplots(figsize=(12, 6))
    ax.bar(props.index, props[metric], color="blue")
    plt.title(metric)
    plt.xlabel('Model')
    plt.xticks(rotation=45)
    plt.ylabel(metric)
    plt.show()

Let’s use this function to compare the models based on each metric.

model_comparison(model_performance_dict, sort_by = 'R-squared', metric = 'R-squared')
model_comparison(model_performance_dict, sort_by = 'R-squared', metric = 'MAE')
model_comparison(model_performance_dict, sort_by = 'R-squared', metric = 'RMSE')

Interpreting results

The random forest model is a reasonable choice when taking both performance and training time into account. Its mean absolute error is approximately 9.7, which means that, on average, the model’s estimate is off by about 9.7 upvotes. Let’s look at some plots for more information about model performance.

y_predicted = rf.predict(X_test)
plot_residuals(y_test,y_predicted)

Comparing the histograms of test scores and predicted scores, we notice that the model tends to overestimate the target variable when it is small, and it never predicts that the target variable will be much larger than 2,000. It appears that results are skewed by the few cases where the target variable is large: the majority of comments have only a small number of upvotes, but the model expects these to receive more than they do, while comments with an extreme number of upvotes are underestimated.
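
The scatter-plot helper defined earlier gives another view of the same pattern:

y_test_vs_y_predicted(y_test, y_predicted)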

This distribution of residuals suggests that a logical next step would be to explore a stacked model. Stacking is an ensembling technique (like random forests, gradient boosting, etc.) that can often improve performance. Here, we would first fit a classifier to predict a coarse upvote class (with classes like few, some, many), and its output would be used as an additional predictor in the regression model. This has the potential to reduce errors and improve the goodness of fit because, in addition to our original features, the regression model would also have a hint about the rough number of upvotes to help it make a prediction.
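
A minimal sketch of this idea is shown below. The class boundaries are illustrative guesses, and for brevity the classifier’s in-sample predictions are used on the training set; in practice, out-of-fold predictions (e.g. with scikit-learn’s cross_val_predict) would avoid leakage.

from sklearn.ensemble import RandomForestClassifier

# Bin the target into rough classes: 0 (<1), 1 (1-9), 2 (10-99), 3 (>=100)
bins = [1, 10, 100]
y_train_class = np.digitize(y_train, bins)

clf = RandomForestClassifier(n_estimators=70, n_jobs=-1, random_state=10)
clf.fit(X_train, y_train_class)

# Append the predicted class as an extra feature for the regressor
X_train_stacked = np.column_stack([X_train, clf.predict(X_train)])
X_test_stacked = np.column_stack([X_test, clf.predict(X_test)])

stacked_rf = RandomForestRegressor(n_estimators=70, min_samples_leaf=10,
                                   n_jobs=-1, random_state=10)
stacked_rf.fit(X_train_stacked, y_train)
print(f"MAE: {mean_absolute_error(y_test, stacked_rf.predict(X_test_stacked))}")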

Tree-based models also allow us to quantify the importance of the features they used.

rf_importances = get_feature_importance(rf)

The least important features are the indicator variables for the different subreddits. Since this data only includes comments from five of the most popular, and rather generic, subreddits (food, world news, movies, science, and gaming), we would not expect these features to be very important. More generally, many features have little or no importance and could be removed, which could help avoid overfitting and decrease the time it takes to train models.

The five most important features all describe either the thread the comment is on or the comment’s parent. We might expect this, since popular and trending content is shown to more users, so comments close to content with a lot of upvotes are more likely to get a lot of upvotes as well.

It is also important to note that many of the features with high importance had missing values. For this reason, a deeper analysis of the way missing values are handled could lead to improved model performance (for example, when we dropped comment tree roots instead of imputing, parent_score was by far the most important feature, at ~25%). Imputation using the mean or median, or prediction using a simple linear regression, would be worth testing as well.
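
For example, swapping the mode imputation used above for the median is a one-line change with pandas (a sketch only; whether it actually helps would need to be checked against the test metrics):

# Impute parent_score with the column median instead of the mode
df.loc[df.parent_score.isna(), 'parent_score'] = df.parent_score.median()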

Conclusion

In this article, we outlined a machine learning workflow that uses the scikit-learn Python library to predict Reddit comment upvotes. We compared the performance of linear and nonlinear regression models and found that a random forest regressor offered the best balance of accuracy and training time.

After a quick examination of this model’s residuals, we saw lots of room for improvement. Possible next steps for this project include:

  • Fitting models using fewer features and comparing their performance to the originals
  • Analyzing missing values and their effect on model performance
  • Stacking models for improved performance

This article is based on a project that was originally completed by Adam Reevesman, Gokul Krishna Guruswamy, Hai Le, Maximillian Alfaro, and Prakhar Agrawal during the Introduction to Machine Learning course in the University of San Francisco’s Master of Science in Data Science program. Relevant work can be found in this GitHub repository, and the code from this article can be found in this notebook.

I would be pleased to receive feedback on any of the above. I can always be reached on LinkedIn or via email at areevesman@gmail.com.
