Introduction
I have always been fascinated by Google’s Gmail spam detection system, which seemingly effortlessly judges whether incoming emails are spam and therefore unworthy of our limited attention.
In this article, I seek to recreate such a spam detection system, but for SMS messages. I will use a few different models and compare their performance.
The models are as below:
- Multinomial Naive Bayes Model (count vectorizer)
- Multinomial Naive Bayes Model (TF-IDF vectorizer)
- Support Vector Classifier Model
- Logistic Regression Model with n-gram features
Using a train-test split, each of the 4 models was put through the same stages: vectorizing X_train, fitting the model on X_train and y_train, making predictions, and generating the respective confusion matrices and the area under the receiver operating characteristic curve (AUC-ROC) for evaluation.
The best performing model was the Logistic Regression Model, although it should be noted that all 4 models performed reasonably well at detecting spam messages (all AUC > 0.9).


The Data
The data was obtained from UCI’s Machine Learning Repository; alternatively, I have also uploaded the dataset to my GitHub repo. In total, the dataset has 5571 rows and 2 columns: spamorham, indicating a message’s spam status, and the message’s text. I found it quite funny how relatable the texts are.
Definitions: Spam refers to spam messages as they are commonly known, ham refers to non-spam messages.

Data Preprocessing
As the dataset is relatively simple, not much preprocessing was needed. Spam messages were marked with a 1, while ham messages were marked with a 0.
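A minimal sketch of this step (the file name spam.csv is illustrative; the column names follow the dataset description above, so adjust both to match your local copy):

```python
import pandas as pd

# Load the dataset (file name is illustrative)
spam_data = pd.read_csv('spam.csv')

# Encode the labels: spam -> 1, ham -> 0
spam_data['target'] = (spam_data['spamorham'] == 'spam').astype(int)
```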

Exploratory Data Analysis
Now, let’s look at the dataset in detail. Taking the average of the ‘target’ column, we find that 13.409% of the messages were marked as spam.
Further, maybe the message length has some correlation with the target outcome? Splitting the spam and ham messages into separate dataframes, we add the number of characters in each message as a third column, ‘len’.

Taking the averages of message lengths, we find that spam and ham messages have average lengths of 139.12 and 71.55 characters respectively.
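Both figures can be reproduced with a couple of pandas one-liners (assuming the ‘text’ and ‘target’ columns from the preprocessing step; the ‘len’ column is created here):

```python
# Share of messages marked as spam (~13.4%)
print(spam_data['target'].mean() * 100)

# Number of characters per message as a new 'len' column
spam_data['len'] = spam_data['text'].str.len()

# Average length per class: ham (~71.55) vs spam (~139.12)
print(spam_data.groupby('target')['len'].mean())
```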
Data Modelling
Now it’s time for the interesting stuff.
Train-test split
We begin by creating a train-test split using sklearn’s default of 75% train and 25% test.
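A sketch of the split (test_size defaults to 0.25; the random_state is an illustrative choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

# Default split: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(
    spam_data['text'], spam_data['target'], random_state=0
)
```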
Count Vectorizer
A count vectorizer converts a collection of text documents into a sparse matrix of token counts, which is necessary before model fitting can be done.
We fit the CountVectorizer on X_train, then convert the messages to count matrices using the transform method.
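In code, this step looks roughly like the following (the test split is transformed with the same fitted vocabulary, ready for the predictions later):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the training text only...
vect = CountVectorizer().fit(X_train)

# ...then map both splits to sparse matrices of token counts
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)
```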
MNNB Model Fitting
Let’s first try fitting a classic Multinomial Naive Bayes Classifier Model (MNNB), on X_train and Y_train.
A Naive Bayes model assumes that each of the features it uses are conditionally independent of one another given some class. In practice Naive Bayes models have performed surprisingly well, even on complex tasks where it is clear that the strong independence assumptions are false.
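A minimal sketch of the fitting step (sklearn’s default smoothing is assumed; the exact hyperparameters in my notebook may differ):

```python
from sklearn.naive_bayes import MultinomialNB

# Fit the classifier on the count-vectorized training data
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
```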
MNNB Model Evaluation
In evaluating the model’s performance, we can generate some predictions then look at the confusion matrix and AUC-ROC score to evaluate performance on the test dataset.
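A sketch of the evaluation step (here the AUC is computed from the hard 0/1 predictions, which is consistent with the figures below; using predicted probabilities would be an equally valid choice):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

predictions = model.predict(X_test_vectorized)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, predictions))

# Area under the ROC curve on the test set
print(roc_auc_score(y_test, predictions))
```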
The confusion matrix is generated as below:

The results seem promising, with a True Positive Rate (TPR) of 92.6%, specificity of 99.7% and a False Positive Rate (FPR) of 0.3%. These results show that the model performs quite well in predicting whether messages are spam, based solely on the text in the messages.
The Receiver Operator Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.

The model produced an AUC score of 0.962, which is significantly better than if the model made random guesses of the outcome.
Although the Multinomial Naive Bayes Classifier seems to have worked quite well, I felt that the result could possibly be improved further with a different model.
MNNB (TF-IDF Vectorizer) Model Fitting
I then use a TF-IDF vectorizer instead of a count vectorizer to see if it improves the results.
The goal of using TF-IDF is to scale down the impact of tokens that occur very frequently in a given corpus, and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
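The swap is a small change to the pipeline sketched earlier (default TfidfVectorizer settings assumed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same pipeline as before, with TF-IDF weighting instead of raw counts
tfidf = TfidfVectorizer().fit(X_train)
X_train_tfidf = tfidf.transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

model_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)
```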
MNNB (TF-IDF Vectorizer) Model Evaluation
In evaluating the model’s performance, we look at the AUC-ROC score and the confusion matrix again. The model generates an AUC score of 91.67%.
The results seem promising, with a True Positive Rate (TPR) of 83.3%, specificity of 100% and a False Positive Rate (FPR) of 0.0%.

Comparing the two models on AUC scores, the TF-IDF vectorizer did not improve model accuracy; it even introduced more noise into the predictions! However, TF-IDF greatly improved the model’s ability to recognise ham messages, correctly classifying every ham message in the test set.
Being a stubborn person, I still believed that better performance could be obtained with a few tweaks.
SVC Model Fitting
I now fit and transform the training data X_train using a TF-IDF vectorizer, ignoring terms that have a document frequency strictly lower than 5. Adding an additional feature, the length of each document (number of characters), I then fit a Support Vector Classification (SVC) model with regularization parameter C=10000.
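A sketch of this setup, with a hypothetical add_feature helper for appending the document-length column to the sparse TF-IDF matrix:

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Ignore terms appearing in fewer than 5 training documents
vect = TfidfVectorizer(min_df=5).fit(X_train)

def add_feature(X, feature_to_add):
    """Append an extra column (e.g. document length) to a sparse matrix."""
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

X_train_aug = add_feature(vect.transform(X_train), X_train.str.len())
X_test_aug = add_feature(vect.transform(X_test), X_test.str.len())

# A large C means weak regularization
clf = SVC(C=10000).fit(X_train_aug, y_train)
```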
SVC Model Evaluation
This results in the following:
- AUC score of 97.4%
- TPR of 95.1%
- Specificity of 99.7%
- FPR of 0.3%

Logistic Regression Model (n-grams) Fitting
Using logistic regression, I further include n-grams, which allow the model to take into account groups of up to 3 consecutive words when considering whether a message is spam.
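A sketch under the assumption that the same min_df=5 TF-IDF setup carries over (the vectorizer is not specified here, so treat these settings as illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Unigrams, bigrams and trigrams (n-grams up to size 3)
vect = TfidfVectorizer(min_df=5, ngram_range=(1, 3)).fit(X_train)
X_train_ng = vect.transform(X_train)
X_test_ng = vect.transform(X_test)

# max_iter raised to help convergence on the larger feature set
clf = LogisticRegression(max_iter=1000).fit(X_train_ng, y_train)
```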
Logistic Regression Model (n-grams) Evaluation
This results in the following:
- AUC score of 97.7%
- TPR of 95.6%
- Specificity of 99.7%
- FPR of 0.3%
Model Comparison
After training and testing these 4 models, it’s time to compare them. I primarily compare them based on AUC scores, as the TPR and specificity figures are all somewhat similar.
The Logistic Regression Model had the highest AUC score, with the SVC Model and the count-vectorized MNNB Model marginally behind. In relative terms, the TF-IDF MNNB Model underperformed the rest. However, all 4 models produce AUC scores well above 0.5, showing that each performs well enough to beat a model that randomly guesses the target.

Thanks for the read!
Do find the code here.
Do feel free to reach out to me on LinkedIn if you have questions or would like to discuss ideas on applying Data Science techniques in a post-Covid-19 world!