Hey Model, Why Do You Say This Is Spam?

Attaching a Reason to a Model Prediction

Alessio Tamburro
Towards Data Science


Shapley values are used in machine learning to explain the predictions of a complex predictive model, a.k.a. a “black box”. In this post, I will use Shapley values to identify the key terms in YouTube comments that explain why a comment was predicted as spam or legitimate by a predictive model. The “coalitions” of specific key terms can be thought of as “reasons” for the model to return a given prediction. In addition, I will use clustering to identify groups of comments that are similar in how these key terms were used. Finally, I will generalize the prediction reasons further by classifying groups of comments with a small dictionary of key terms that represent reason categories.

Preamble

Following the code snippets presented in this post requires Python and R. The complete Jupyter notebook with all code snippets can be found here. The Python libraries needed to run the code are below. We will run instances of R within Python using the rpy2 library.

import rpy2.ipython
from rpy2.robjects import pandas2ri
pandas2ri.activate()
%reload_ext rpy2.ipython
from sklearn import model_selection, preprocessing, metrics, svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import decomposition, ensemble
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import xgboost
import shap, time

Shapley Method

Shapley values are the result of work in cooperative game theory: they tell us how to fairly distribute the “payout” of a game among its players. When predicting the label of a data instance with a previously trained model, the prediction can be explained by treating each feature value of the instance as a player in a game in which the prediction is the payout. The Shapley value of a feature value is its average marginal contribution across all possible coalitions of features. More details are available in the online book by Christoph Molnar.
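To make “average marginal contribution across all possible coalitions” concrete, here is a minimal brute-force sketch (not from the original post) that computes exact Shapley values for a made-up three-player payout function:

from itertools import permutations

def shapley_values(players, value):
    # Exact Shapley values: average each player's marginal contribution over all orderings.
    contrib = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition = coalition | {p}
            contrib[p] += value(coalition) - before
    return {p: c / len(orderings) for p, c in contrib.items()}

# Made-up payout: each player adds 1, and players A and B together add a bonus of 2.
def payout(coalition):
    bonus = 2 if {'A', 'B'} <= coalition else 0
    return len(coalition) + bonus

print(shapley_values(['A', 'B', 'C'], payout))
# {'A': 2.0, 'B': 2.0, 'C': 1.0}: the A/B bonus is split fairly, C gets only its own contribution.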

We can introduce the Shapley method using a simple case: the XOR table. We will go into more details in the following sections.

data = {'F1':[0,0,1,1], 'F2':[0,1,0,1], "Target":[0,1,1,0]} #XOR
df = pd.DataFrame(data)
X = df[["F1","F2"]]
y = df["Target"].values.tolist()
df

To calculate Shapley values, we first need to train a model. For example, let’s use SVM with a radial basis function kernel. We can now calculate the Shapley values using the Python library shap.

clf = svm.SVC(kernel='rbf', probability=False).fit(X, y)
df["Prediction"] = clf.predict(X)
# Explain the prediction results
explainer = shap.KernelExplainer(clf.predict, X)
shap_values = explainer.shap_values(X)
pd.concat([df, pd.DataFrame(shap_values,
                            columns='shap_' + X.columns.values)], axis=1)

As can be observed from the results above, when the calculated contributions (Shapley values) of the two features for a data instance are both negative, the prediction is 0; when they are both positive, the prediction is 1. Note that the two contributions sum to the difference between the prediction for that instance, f̂(x), and the expected prediction E[f̂], which here is the average of the actual target values.
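The snippet below is a quick sanity check of this additivity property (a minimal sketch reusing the explainer and shap_values computed above; explainer.expected_value is the average model output over the background data):

# For every XOR row, the Shapley values should sum to the model output minus the expected output.
base_value = explainer.expected_value   # average of clf.predict over the background data X
preds = clf.predict(X)
for i in range(len(df)):
    print(i, "prediction:", preds[i],
          "base + sum(SHAP):", round(base_value + shap_values[i].sum(), 3))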

YouTube Spam Comments

We will work with a text classification dataset of 1956 comments collected from 5 different YouTube videos. The comments were collected via the YouTube API from five of the ten most-viewed videos on YouTube in the first half of 2015. The dataset contains non-encoded messages labeled as legitimate or spam. Since I will be using R again later in this post, I decided to download the dataset using an R snippet I found while searching for the dataset (it is included in the notebook linked above).
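If you prefer to stay in Python, a roughly equivalent download sketch might look like the following; note that the UCI mirror URL and the output path are my assumptions, so adjust them to wherever you obtain the dataset.

import io, os, zipfile
import requests
import pandas as pd

# Assumed UCI mirror of the YouTube Spam Collection; adjust the URL and paths as needed.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip"

os.makedirs("youtube_data", exist_ok=True)
archive = zipfile.ZipFile(io.BytesIO(requests.get(url).content))
# Concatenate the five per-video csv files into a single TubeSpam.csv.
frames = [pd.read_csv(archive.open(name)) for name in archive.namelist() if name.endswith(".csv")]
pd.concat(frames, ignore_index=True).to_csv("youtube_data/TubeSpam.csv", index=False)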

Either way, we end up with the file in csv format in a local folder. We can use the pandas library to explore the content.

youtube_data = pd.read_csv("youtube_data/TubeSpam.csv")
youtube_data.head()

We can select the columns CONTENT and CLASS, which will be used for modeling later on, and drop rows with missing content. Running the snippet below returns 1954 rows and shows that label 0 (legitimate comment) and 1 (spam comment) have about the same number of occurrences; we also remove duplicated rows.

youtube_df = pd.DataFrame()
youtube_df['text'] = youtube_data['CONTENT']
youtube_df['label'] = youtube_data['CLASS']
youtube_df.dropna(inplace=True)
youtube_df.groupby('label').describe()
youtube_df.drop_duplicates(inplace=True)

We can look at whether there is any apparent difference in length between legitimate and spam comments

youtube_df['length'] = youtube_df['text'].apply(len)
youtube_df.hist(column='length', by ='label', bins=50, figsize = (10,4))

Spam comments appear, on average, to be a bit longer. One of the longest comments, for example, is:

youtube_df[youtube_df['length'] == 1077]['text'].iloc[0]

Pre-Processing Comments

Before using the text data for modeling purposes, we will clean it up with some standard Natural Language Processing (NLP) techniques. In particular, we will remove numbers, punctuation, and extra white space, convert all words to lower case, remove stop words, and stem and lemmatize the remaining words. To do so, we will use the Python library nltk.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords # nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer #nltk.download('wordnet')
import string, re
stop_words = list(set(stopwords.words('english')))
stemmer= PorterStemmer()
lemmatizer=WordNetLemmatizer()
translate_table = dict((ord(char), None) for char in string.punctuation)

def nltk_text_preproc(text_in):
    text_out = re.sub(r'\d+', '', text_in)          # remove numbers
    text_out = text_out.translate(translate_table)  # remove punctuation
    text_out = text_out.strip()                     # remove leading/trailing white space
    return text_out

def nltk_token_processing(tokens):
    tokens = [i.lower() for i in tokens]
    tokens = [i for i in tokens if i not in stop_words]
    tokens = [stemmer.stem(i) for i in tokens]
    tokens = [lemmatizer.lemmatize(i) for i in tokens]
    return tokens

We can look at the first 10 comments in the dataset, before

pd.options.display.max_colwidth = 1000
youtube_df['text'].head(10)

and after, using our pre-processing steps

(youtube_df['text'].head(10)
    .map(lambda x: nltk_text_preproc(x))
    .map(lambda x: nltk_token_processing(word_tokenize(''.join(x))))
    .map(lambda x: ' '.join(x)))

We are now ready to process all of the text in the dataset before proceeding.

youtube_df['text'] = youtube_df['text'].map(lambda x: nltk_text_preproc(x))
youtube_df['text'] = youtube_df['text'].map(lambda x: nltk_token_processing(word_tokenize(''.join(x))))
youtube_df['text'] = youtube_df['text'].map(lambda x: ' '.join(x))

The new dataset contains 1735 rows.

Creating Features From Text

In this section, the pre-processed text data will be transformed into feature vectors. We will use the CountVectorizer class from the scikit-learn library, which transforms the text rows into a matrix in which every column represents a term from the whole corpus and every cell contains the number of occurrences of that term in a given row.

count_vect = CountVectorizer(min_df=0.01, 
max_df=1.0, ngram_range=(1,3))
count_vect.fit(youtube_df['text'])
youtube_df_text = count_vect.transform(youtube_df['text'])

Here we drop terms that appear in less than 1% of the comments (min_df=0.01), keep terms no matter how many comments they appear in (max_df=1.0), and ask CountVectorizer to count n-grams of up to three contiguous terms.
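As a quick illustration of what CountVectorizer produces, here is a toy example (the two short “comments” are made up and are not part of the dataset):

# Toy illustration: two short "comments" turned into a term-count matrix.
toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_counts = toy_vect.fit_transform(["check my channel", "love this song"])
print(toy_vect.get_feature_names())   # unigrams and bigrams found in the toy corpus
print(toy_counts.toarray())           # one row per comment, one column per term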

For example, the second comment (row)

text_example_orig = youtube_df['text'][1]
print(text_example_orig)

becomes

text_example_transf = count_vect.transform([text_example_orig])
print(text_example_transf)

If we want to see what terms correspond to the indexes returned from the CountVectorizer, we can use the following

for row, col in zip(*text_example_transf.nonzero()):
    val = text_example_transf[row, col]
    #print((row, col), val)
    print(count_vect.get_feature_names()[col])

The matrix of counts is not a standard matrix but a sparse matrix

print('Shape of Sparse Matrix: ', youtube_df_text.shape)
print('Amount of Non-Zero occurrences: ', youtube_df_text.nnz)
print('Sparsity: %.2f%%' % (100.0 * youtube_df_text.nnz /
      (youtube_df_text.shape[0] * youtube_df_text.shape[1])))

A sparse matrix is a memory-efficient representation of a matrix that contains very few non-zero elements; storing it as a dense 2-dimensional array would waste a lot of memory. In our example, 1,735 rows × 166 terms would produce a dense matrix with 288,010 elements, but only 7,774 of them are non-zero (about 2.7%), since not all comments contain all terms.
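To see the saving concretely, we can compare the memory footprint of the sparse representation with that of its dense counterpart (a quick sketch using the CSR attributes of the matrix returned by CountVectorizer):

# Rough memory comparison between the sparse matrix and an equivalent dense array.
sparse_bytes = (youtube_df_text.data.nbytes
                + youtube_df_text.indptr.nbytes
                + youtube_df_text.indices.nbytes)
dense_bytes = youtube_df_text.todense().nbytes
print('sparse (bytes):', sparse_bytes)
print('dense (bytes):', dense_bytes)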

Modeling

We will split the dataset into training and validation sets.

train_x, valid_x, train_y, valid_y, index_train, index_val = model_selection.train_test_split(
    youtube_df_text, youtube_df['label'], range(len(youtube_df['label'])),
    stratify=youtube_df['label'], random_state=1, train_size=0.8)

The numbers of legitimate and spam comments that now belong to the training and validation sets are

np.unique(train_y, return_counts=True)
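For completeness, the same check can be run on the validation split:

np.unique(valid_y, return_counts=True)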

It is now time to train our model

def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_label):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)

    return classifier, metrics.accuracy_score(valid_label, predictions), metrics.classification_report(valid_label, predictions)

and look at its performance on the validation set. We will use the XGBoost classifier.

classifier, accuracy, report = train_model(xgboost.XGBClassifier(),
                                           train_x, train_y, valid_x, valid_y)
print("Xgb, Accuracy: ", accuracy)
print(report)

For the purpose of this post, a highly performing model is not necessary: we are looking for a general description of how features and labels relate to each other, so a decent model is enough.

Computing Shapley Values with SHAP

The Python library SHAP (SHapley Additive exPlanations) can explain the output of any machine learning model and, in particular, provides a high-speed exact algorithm for tree ensemble methods; that is one of the reasons our model is an XGBoost model. In our case, computing SHAP values takes only fractions of a second. Here, we calculate SHAP values for the whole dataset. Note that SHAP values are log-odds when using XGBoost with the logistic objective function. To convert a log-odds margin to a probability, we can use odds = exp(log-odds), from which p = odds / (1 + odds). At the end of this section we will do this exercise, but for now let’s first calculate the SHAP values and explore them.
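As a reference for later, the conversion just described can be written as a tiny helper function (a minimal sketch using numpy, which is already imported):

# Convert a log-odds margin to a probability: odds = exp(log-odds), p = odds / (1 + odds).
def log_odds_to_prob(log_odds):
    odds = np.exp(log_odds)
    return odds / (1.0 + odds)

print(log_odds_to_prob(0.0))  # 0.5: a zero log-odds margin is a 50/50 prediction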

t0 = time.time()
explainer = shap.TreeExplainer(classifier)
shap_values_train = explainer.shap_values(youtube_df_text)
t1 = time.time()
timeit=t1-t0
print('time to compute Shapley values (s):', timeit)

It is convenient to convert the sparse matrices into dense matrices.

txt_dense_df = pd.DataFrame(youtube_df_text.todense(), 
columns=count_vect.get_feature_names())
shap_values_train_df = pd.DataFrame(shap_values_train,
columns=txt_dense_df.columns)

With this data frame, we can calculate the overall feature importance

shap_sum = np.abs(shap_values_train_df).mean(axis=0)
importance_df = pd.DataFrame([txt_dense_df.columns.tolist(),
shap_sum.tolist()]).T
importance_df.columns = ['column_name',
'shap_importance (log-odds)']
importance_df = importance_df.sort_values('shap_importance (log-odds)', ascending=False)
importance_df['shap_importance (%)'] = importance_df['shap_importance (log-odds)'].apply(lambda x: 100*x/np.sum(importance_df['shap_importance (log-odds)']))

and select, for example, the top 20 features

topN = 20
top20 = importance_df.iloc[0:topN]["column_name"]
print('Cumulative Importance',
np.sum(importance_df.iloc[0:topN]["shap_importance (%)"]))
shap_values_imp = shap_values_train_df[top20]
shap.summary_plot(shap_values_train_df,
                  txt_dense_df, plot_type="bar")
importance_df.iloc[0:topN]

whose cumulative “importance” is about 94%. The bar chart above shows the mean absolute value of the SHAP values for each feature.

More visualizations to interpret the model results can also be used. For example,

j = 1
youtube_df.iloc[j]

is a typical spam comment. We can show the terms that push the model output from the base value (the average model output over the training dataset we used) to the final prediction. Terms pushing the prediction toward higher log-odds values are shown in red, those pushing the prediction toward lower log-odds values in blue. This particular example contains all the “ingredients” of a typical spam comment; in fact, the SHAP values of its terms all push the model output to a value much higher than the average output value.

shap.initjs()
# visualize the j-th prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values_imp.iloc[j].to_numpy(),
                txt_dense_df.iloc[j][top20])

For a legitimate comment, the same force plot shows the opposite behavior, with the net contribution of the terms pushing the prediction toward lower log-odds values.

My notebook includes additional visualizations that go beyond the primary purpose of this post.

Finally, let’s do the exercise of converting log-odds to probability and see how SHAP values relate to them.

j = 1
# log-odds margin to probability
# odds = exp(log odds), p = odds/(1+odds)
log_odds = np.sum(shap_values_train_df.iloc[j].to_numpy())
avg_model_output = np.mean(youtube_df['label']) # prob
log_odds_avg_model_output = np.log(avg_model_output/(1-avg_model_output))
predicted_prob = classifier.predict_proba(youtube_df_text.tocsc())[j][1] #target=1
predicted_log_odds = np.log(predicted_prob/(1-predicted_prob))
print("Sum of Shaphley values (log-odds)for j-instance:", log_odds,
'prob:', np.exp(log_odds)/(1.0+np.exp(log_odds)))
print("Average model output:", avg_model_output)
print("Predicted probability value for j-instance:", predicted_prob,
"Predicted value:", classifier.predict(youtube_df_text.tocsc())[j])
print('log_odds:', log_odds, 'is expected to be equal to pred-expected:', predicted_log_odds-log_odds_avg_model_output)
print('pred-expected (prob):', predicted_prob-avg_model_output)

For comment j = 1, the sum of the SHAP values (6.59) corresponds to a probability of 99.86% that this comment is spam (label=1), based on the log-odds-to-probability conversion. The predicted probability from the model is 99.85%, which is slightly different but close enough. As discussed at the beginning of this post, we expect the sum of the SHAP values (in log-odds) to equal the predicted log-odds minus the expected log-odds, where the expected probability is the average of all actual targets (0.48).

Clustering SHAP Values

The top 20 terms selected previously in this post can be summarized into more generic “topics” using a dictionary:

top20_dict = {'ACTION': ['check','subscrib','view','share','follow','help', 'visit'],
'REQUEST': ['plea','hey'],
'VALUE': ['money','free','billion', 'new', "good", 'best'],
'YOUTUBE': ['channel', 'song', 'onlin'],
'COMMENT': ['love', 'like comment']
}

For each comment, we can thus identify the top reason by running the following snippet.

from itertools import chain

ntopreason = 1  # change this to allow more reasons to be captured
top20_dict_values = list(top20_dict.values())
top20_dict_keys = list(top20_dict.keys())
shap_values_imp_r = shap_values_imp.copy()
predictions = classifier.predict(youtube_df_text)  # model predictions for the full dataset (not shown in the original snippet)
target_values_r = pd.Series(predictions)
# Create summarizing labels
top_reasons_all = []
for i in range(shap_values_imp_r.shape[0]):

    shap_feat = shap_values_imp_r.iloc[i]
    shap_feat = shap_feat.iloc[np.lexsort([shap_feat.index, shap_feat.values])]

    topN = shap_feat.index.to_list()[-1:]
    topN_value = shap_feat.values[-1:]

    topN_idx = []
    for topn in topN:
        for idx in range(len(top20_dict_values)):
            if topn in top20_dict_values[idx] and idx not in topN_idx:
                topN_idx.append(idx)

    #topN_idx = [idx for idx in range(len(top20_dict_values)) for topn in topN if topn in top20_dict_values[idx]]

    # Ordered by increasing importance
    top_reasons = [top20_dict_keys[x] for x in topN_idx]
    #print(i, topN, topN_idx, top_reasons)
    top_reasons_all.append(','.join(top_reasons))

shap_values_imp_r['target'] = target_values_r
shap_values_imp_r['top_reasons'] = top_reasons_all

In a separate Jupyter cell, we will use R to select the optimal number of clusters for grouping the SHAP values of the comments.

%%R -i shap_values_imp_r -w 800 -h 800
library(gplots)
d_shap <- dist(shap_values_imp_r[1:(ncol(shap_values_imp_r)-2)])
hc <- hclust(d_shap, method = "ward.D2")
dend <- as.dendrogram(hc)
library(dendextend)
library(colorspace)
## Find optimal number of clusters
clusters_test = list()
for (ncl in 2:50){
clusters_test[[ncl]] <- cutree(hc, k=ncl)
}

Here, we use hierarchical clustering with the number of clusters varying between 2 and 50. Back in Python, we calculate the silhouette score for each clustering result. The number of clusters is then chosen by looking at both the silhouette scores and the visualization of the clustering; selecting the number of clusters always involves a judgment call.

h_clusters = rpy2.robjects.r['clusters_test']
h_clusters_sil = []
cl_id = 0
for cl in h_clusters:
    if cl is not rpy2.rinterface.NULL:
        sil = metrics.silhouette_score(shap_values_imp_r.drop(shap_values_imp_r.columns[len(shap_values_imp_r.columns)-2:], axis=1, inplace=False),
                                       cl, metric='euclidean')
        h_clusters_sil.append(sil)
        #print(cl_id, sil)
        cl_id += 1
    else:
        cl_id += 1
plt.plot(range(2, 51), h_clusters_sil)
plt.title('Silhouette')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Index') #within cluster sum of squares
plt.show()

We clustered using Ward’s method, which minimizes the total within-cluster variance: at each step, the pair of clusters with the minimum between-cluster distance is merged. The merging height (see the plot below) indicates how similar two instances are; the higher the merging height, the less similar the instances. The average silhouette of the data is one criterion for assessing the optimal number of clusters. The silhouette of a data instance measures how closely it matches the data within its own cluster and how loosely it matches the data of the neighboring cluster. Silhouette scores range between -1 and 1, with 1 being best. In our case, we select 30 clusters in order to have enough granularity in the types of comments while still getting a good silhouette score of about 0.70.
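Whatever the final choice, the k that maximizes the silhouette score can be read off directly from the scores computed above (a small sketch; recall that h_clusters_sil starts at k = 2):

# h_clusters_sil[0] corresponds to k = 2 clusters, so the best k is the argmax shifted by 2.
best_k = int(np.argmax(h_clusters_sil)) + 2
print('k with the highest silhouette:', best_k, 'score:', max(h_clusters_sil))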

%%R -i shap_values_imp_r -w 800 -h 800
library(gplots)
d_shap <- dist(shap_values_imp_r[1:(ncol(shap_values_imp_r)-2)], 'euclidean')
hc <- hclust(d_shap, method = "ward.D2")
dend <- as.dendrogram(hc)
library(dendextend)
library(colorspace)
n_clust <- 30
dend <- color_branches(dend, k=n_clust)
clusters <- cutree(hc, k=n_clust)
#print(head(clusters))
print(format(prop.table(table(clusters,shap_values_imp_r$top_reasons, shap_values_imp_r$target), margin=1),
digit=2, scientific = F))
print(format(prop.table(table(clusters,shap_values_imp_r$top_reasons), margin=1),
digit=2, scientific = F))
heatmap.2(as.matrix(shap_values_imp_r[1:(ncol(shap_values_imp_r)-2)]),
dendrogram = "row",
Rowv = dend,
Colv = "NA", # this to make sure the columns are not ordered
key.xlab = "Predicted - Average log-odds",
#hclustfun=function(d) hclust(d, method="complete"),
srtCol=45, adjCol = c(1,1))

In the heatmap above, we combine the hierarchical clustering (left y-axis), which shows similar comments closer together (the numbers on the right y-axis are the comment row numbers), with the importance of the top 20 terms (x-axis, cells colored according to how the corresponding term in the comment impacted the model prediction).

We can further summarize the heatmap above by aggregating the comments in each cluster. To make this step more generic, we treat our dataset as a new dataset whose cluster assignments are unknown and predict those assignments from the clustering results using a k-nearest-neighbor classifier with k=1. We can then aggregate the SHAP values in each cluster after grouping by the predicted cluster labels.

from sklearn.neighbors import KNeighborsClassifier

cluster_model = KNeighborsClassifier(n_neighbors=1)
cluster_model.fit(shap_values_imp, rpy2.robjects.r['clusters'])
predicted = cluster_model.predict(shap_values_imp_r[shap_values_imp_r.columns[:-2]])
grouped = pd.concat([shap_values_imp, pd.Series(youtube_df['label'].tolist(), name='avg_tgt')],
                    axis=1).groupby(predicted)
# compute the average impact on the model prediction output
sums = grouped.apply(lambda x: np.mean(x))

The heatmap below shows the impact of the top 20 terms on the model predictions (log-odds, encoded by the color scale) for each of the 30 clusters (y-axis).

import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode()
data = [go.Heatmap(z=sums[sums.columns[1:-1]].values.tolist(),
y=sums.index,
x=sums.columns[1:-1],
colorscale='Blues')]
plotly.offline.iplot(data, filename='pandas-heatmap')

From the heatmap, we can for instance see that cluster 30 is dominated by the “like comment” 2-gram and, looking at the average label per cluster (the “avg_tgt” column of the “sums” data frame), that it is a cluster of legitimate comments. Cluster 8, on average a spam cluster, is dominated by the term “view”.
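To check these observations numerically, we can print the average label per cluster together with the SHAP impact of the two terms just mentioned (a small sketch; the column names come from the top 20 terms and the “avg_tgt” column created earlier):

# Average label per cluster, plus the impact of the two dominant terms discussed above.
print(sums[['avg_tgt', 'like comment', 'view']].round(2))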

We can also aggregate the SHAP values of the comments in each cluster to show what “topic” dominated each cluster.

agg_sums = pd.DataFrame({k: sums[v].mean(axis=1) for (k, v) in top20_dict.items()})

data = [go.Heatmap(z=agg_sums[agg_sums.columns].values.tolist(),
y=agg_sums.index,
x=agg_sums.columns,
colorscale='Blues')]
plotly.offline.iplot(data, filename='pandas-heatmap')

The comments of cluster 30 are

pd.options.display.max_colwidth = 1000
cluster_no = 30
ex_2 = youtube_df.iloc[rpy2.robjects.r['clusters'] == cluster_no]
ex_2_pred = pd.Series(predictions[rpy2.robjects.r['clusters']==cluster_no])
ex_2_top = shap_values_imp_r.iloc[rpy2.robjects.r['clusters']==cluster_no]['top_reasons']
ex_2

The comments of cluster 8 can be listed in the same way.

Conclusions

In this post, I showed how Shapley values can add information to a model score: a “reason” for the predicted label. A model score should always be accompanied by an explanation of that score. Clustering Shapley values can also be used to identify groups of data instances that can be explained with a more generic topic representing multiple reasons.

I would like to thank Sundar Krishnan and Praveen Thoranathula for the helpful discussions.
