
Predicting Spam Messages

I used ML models and NLP techniques to predict spam messages. Python code is available.

Using machine learning algorithms to predict spam messages

Photo by Jason Leung on Unsplash

In this article, I try to predict spam messages. You can find all the Python code on my GitHub account, linked at the end of the article.

Nowadays, the first thing that comes to our mind when we hear the word "spam" is a junk message. However, that was not the case eighty years ago. The term spam was created in 1937 by a man named Ken Daigneau to name Hormel Foods’ new meat product. Ken won a $100 prize for naming the new item. Some people think that spam stands for Spiced Ham, and many others believe that it stands for "Specially Processed American Meat."

SPAM Brand. Photo by Hannes Johnson on Unsplash

The word "spam" started to be considered something "annoying" in the 1970s, after Monty Python’s Flying Circus aired a scene in which the Vikings sing the "spam song," repeating the word "spam" over and over. Later, in the 1980s and 1990s, people used the repetition of the word "spam," and sometimes even the Vikings’ entire song, to flood chats. Hence, people started associating the word "spam" with annoying junk messages.

The goal of this article is to use different machine learning techniques to predict whether or not a message is spam. I used the dataset provided by UC Irvine, which contains around 5,500 text messages. Below is the structure of the article:

  1. Exploratory Data Analysis
  2. Text Preprocessing and Feature Engineering
  3. Modeling
  4. Conclusion

Exploratory Data Analysis

First, let’s do some exploratory data analysis. By checking the shape of the dataset, we can see that it has 5,572 observations and two columns. However, out of the 5,572 observations, only 5,169 are unique, which means there are around 400 duplicate rows. I also checked for missing values and found that neither column had any. Below are the first five rows and the shape of the dataset:

The Initial Dataset

After removing the duplicates, the dataset has 5,169 observations, of which 4,516 are ham and 653 are spam. The relatively large share of ham messages means that if I simply label every message as ham, I will get around 87% accuracy. Some people use the accuracy of Naive Bayes (a very simple model) as a benchmark, but I will take this overall accuracy as the benchmark and try to build models that predict with higher accuracy. In case you want to learn more about Naive Bayes, check this video.
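For reference, here is a minimal pandas sketch of these checks. The file name and raw column names (v1, v2) follow the popular public copy of the dataset and are assumptions, so adjust them to your own copy:

```python
import pandas as pd

# Assumed file and column names -- adjust to the actual copy of the dataset
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "message"]

print(df.shape)            # (5572, 2)
print(df.isnull().sum())   # no missing values in either column
df = df.drop_duplicates()  # 5,572 -> 5,169 rows

counts = df["label"].value_counts()
print(counts)                   # ham: 4516, spam: 653
print(counts["ham"] / len(df))  # ~0.87 accuracy if every message is labeled ham
```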


Text Preprocessing and Feature Engineering

Before starting the modeling part, I need to work with the data and make it suitable for modeling. In this part, I convert emails, web addresses, money signs, phone numbers, and other numbers to specific placeholder words and remove all the punctuation. To be more precise, imagine one spam message that says "use www.wearespam.com link to get $1,000,000" and another that says "use www.wearealsospam.com link to get $500,000." In this case, we do not want the algorithm to treat the two different links or prize amounts as separate features, but to represent them with the same placeholder words. After the conversion, both sentences would look like "use webaddr link to get money number." Besides replacing some words in the text, I normalized the lexicon and removed all the stopwords. Below, I describe each step in more detail.

Removing "unnecessary" words – As I already explained, we do not need the distinct numbers themselves; we only need to know whether a message contains an email address, phone number, and so on. You can use Python’s regular expressions package (re) to identify the emails and phone numbers in the text. You can find more about regular expressions on the following website.
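As an illustration, here is a minimal sketch of this kind of replacement with the re module. The exact patterns below are simplified assumptions, not the ones used in the article:

```python
import re

def clean_text(text):
    """A simplified sketch: swap emails, web addresses, money signs, phone
    numbers, and remaining numbers for placeholder tokens, drop punctuation."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " emailaddr ", text)               # email addresses
    text = re.sub(r"(https?://\S+|www\.\S+)", " webaddr ", text)  # web addresses
    text = re.sub(r"[$£€]", " money ", text)                      # currency signs
    text = re.sub(r"\b\d{10,12}\b", " phonenumbr ", text)         # phone-like digit runs
    text = re.sub(r"\d+(?:[.,]\d+)*", " number ", text)           # remaining numbers
    text = re.sub(r"[^\w\s]", " ", text)                          # punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("use www.wearespam.com link to get $1,000,000"))
# -> "use webaddr link to get money number"
```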

Normalize the lexicon – I used the lemmatization technique for lexicon normalization. In this process, we bring each word to its "base" form. For example, the words "go," "goes," and "going" share the same base but are used differently. The other approach to lexicon normalization is stemming, but lemmatization has some advantages over it. For example, lemmatization transforms a word to its base form using a vocabulary, whereas stemming simply chops the word down without taking its context into account. As a result, lemmatization can produce a better transformation of the words without changing their meaning. To learn more about lemmatization and stemming, check here.
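A minimal sketch of the difference, using NLTK’s WordNet lemmatizer and Porter stemmer on the example words above:

```python
import nltk
nltk.download("wordnet", quiet=True)  # vocabulary the lemmatizer relies on
from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

for word in ["go", "goes", "going"]:
    # pos="v" tells the lemmatizer to treat the word as a verb
    print(word, "->", lemmatizer.lemmatize(word, pos="v"), "|", stemmer.stem(word))
# go    -> go | go
# goes  -> go | goe   (stemming produces a non-word here)
# going -> go | go
```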

After cleaning the text, let’s look at the most common words. The following chart shows that the most common word in the text is "number." That is because I converted every number to that single token, and the text contained many different numbers.

Occurrence of Words
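A sketch of how such a count can be produced, assuming cleaned_messages is a list holding the preprocessed message strings:

```python
from collections import Counter

# cleaned_messages is assumed to hold the preprocessed message strings
word_counts = Counter(word for msg in cleaned_messages for word in msg.split())
print(word_counts.most_common(10))  # "number" tops the list after the conversions above
```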

Finally, after removing the stopwords and the "unnecessary" words and normalizing the lexicon, it is time to tokenize the text. Tokenization is the process of treating each word, or sequence of words, as a distinct unit. For example, applying unigram tokenization to the sentence "I like apple," we get three separate tokens: [I], [like], and [apple]. I used both unigram and bigram tokenization, because two-word expressions can also matter for the analysis. For example, in the phrase "good morning," the two words together, not separately, capture the exact meaning of the expression. I did not tokenize with three or more words, to avoid inflating the number of variables and to limit the required computing power.

After tokenizing the text, I made a bag of words to start working with the data. A bag of words assigns numeric values to the tokens by counting the occurrence of each token in a sentence. It is a simple technique and easy to use. However, using only a BoW has its drawbacks as well. Because BoW measures the occurrence of each word in the text, more frequent words get higher importance. For example, take the following two sentences: "Bombardment, barrage, curtain-fire, mines, gas, tanks, machine-guns, hand-grenades – words, words, words, but they hold the horror of the world" (Erich Maria Remarque, All Quiet on the Western Front) and "This is a table." After making the bag of words, we see that the word "words" is more common than "table," so "words" receives more importance. To address this issue, I used TF-IDF (Term Frequency-Inverse Document Frequency) weights. A TF-IDF weight captures the significance of each token within a message while also accounting for how often the token appears across the whole corpus, so very common tokens are down-weighted. To learn more about BoW and TF-IDF, you can check here.
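Tokenization, the bag of words, and the TF-IDF weighting can all be done in one pass with scikit-learn’s TfidfVectorizer. A minimal sketch, again assuming cleaned_messages holds the preprocessed message strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) produces both unigram and bigram tokens
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(cleaned_messages)  # sparse matrix: messages x tokens
print(X.shape)  # roughly 5,169 rows and tens of thousands of token columns
```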

After tokenizing the data and calculating TF-IDF weights, I got the final dataset, which consists of 5,169 rows and 37,891 columns. That’s a lot!!!


Modeling

With the data prepared, it is time to build the models. For my analysis, I used four different models: SVM, random forest, logistic regression, and XGBoost. I divided the dataset into two parts, using one to train the models and the other to test their performance. To improve each model’s performance, I tried different parameters and used Bayesian Optimization for hyperparameter tuning: I defined ranges for the parameters and used cross-validation on the training dataset to find the best combination. Later, I tested the models on the test dataset and used accuracy and AUC scores to compare their performance. Even though I used Bayesian Optimization for hyperparameter tuning, I want to quickly go through the alternative methods, which are also widely used. Possible substitutes for Bayesian Optimization are grid search and random search, but I chose Bayesian Optimization because the other two have some drawbacks. I will briefly explain how each method works and what its possible downsides are.

  • Grid Search method – Evaluates all possible combinations of the given parameters. For example, if we want to choose between a learning rate of 0.01 or 0.1 and a max depth of 4 or 5, grid search will build four different models (trying all possible combinations) and keep the parameters with which the model performs best. The disadvantage of grid search is that with many parameters, comparing all possible combinations requires too much computing power.
  • Random Search method – Takes random combinations of the provided parameters and builds models with them; we need to specify how many combinations to try. The main disadvantage of random search is that, because it chooses the parameters randomly, one can never be sure it has tried the best combination (see the sketch of both methods after this list).
  • Bayesian Optimization method – Uses Bayes’ theorem to direct the search over parameters: it moves in the direction in which the objective function improves (increases or decreases, depending on the objective).
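For comparison, here is a minimal sketch of grid search and random search with scikit-learn. The parameter ranges are illustrative, and X_train/y_train are assumed to be the TF-IDF training features and labels:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

# Grid search: tries all 2 x 2 = 4 combinations from the example above
grid = GridSearchCV(
    XGBClassifier(),
    param_grid={"learning_rate": [0.01, 0.1], "max_depth": [4, 5]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

# Random search: samples a fixed number (n_iter) of random combinations
rand = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions={"learning_rate": uniform(0.01, 0.29),
                         "max_depth": randint(3, 11)},
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=42,
)
rand.fit(X_train, y_train)

print(grid.best_params_, rand.best_params_)  # the winning combinations
```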

Below you can find the code for hyperparameter tuning for XGBoost:

Hyperparameter Tuning for XGBoost
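A minimal runnable sketch of this kind of Bayesian tuning, using the bayes_opt package (BayesianOptimization). The parameter bounds are illustrative assumptions, not necessarily the ones used in the article, and X_train/y_train are assumed to be the TF-IDF training features and labels:

```python
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def xgb_cv(max_depth, learning_rate, n_estimators, gamma):
    # Cross-validated AUC on the training set for one parameter combination
    model = XGBClassifier(
        max_depth=int(max_depth),        # bayes_opt passes floats; cast to int
        learning_rate=learning_rate,
        n_estimators=int(n_estimators),
        gamma=gamma,
    )
    return cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

# Illustrative parameter bounds -- not necessarily those used in the article
pbounds = {
    "max_depth": (3, 10),
    "learning_rate": (0.01, 0.3),
    "n_estimators": (100, 500),
    "gamma": (0, 5),
}

optimizer = BayesianOptimization(f=xgb_cv, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best cross-validated AUC and the parameters that achieved it
```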

After finding the optimal parameters and fitting the models, I found that the Support Vector Machine (SVM) performs best for this prediction task. It has 98% accuracy and an AUC score of 0.92. The performance of XGBoost and random forest was not significantly different, but those models had lower accuracy and AUC. A relatively simple model, logistic regression, did not perform well and had the lowest accuracy and AUC score of all. Also, looking at the confusion matrix, we can see that most of the errors are false positives: the algorithm predicts a message to be spam when it is in fact ham. Below is the summary of results for SVM:

SVM Performance Summary
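For reference, a sketch of how these numbers can be computed, assuming svm_model is a fitted classifier (e.g., sklearn.svm.SVC(probability=True)) and X_test/y_test are the held-out test set:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# svm_model, X_test, and y_test are assumed to exist from the modeling step
y_pred = svm_model.predict(X_test)
y_prob = svm_model.predict_proba(X_test)[:, 1]  # predicted probability of spam

print(accuracy_score(y_test, y_pred))    # ~0.98 reported in the article
print(roc_auc_score(y_test, y_prob))     # ~0.92 reported in the article
print(confusion_matrix(y_test, y_pred))  # off-diagonal cells show the errors
```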

In the end, I plotted ROC curves for all the models to compare their performance visually. Below you can find the ROC plot:

ROC Curve
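Such a plot can be produced with scikit-learn’s roc_curve and matplotlib. A minimal sketch for one model (repeat per model to overlay the curves), assuming y_prob holds the SVM’s predicted spam probabilities on the test set:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, color="red",
         label=f"SVM (AUC = {roc_auc_score(y_test, y_prob):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # random-guess baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```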

Again, you can see that SVM (the red line) performs best, as it has the highest AUC.

To sum up, in this article I tried to see which model best predicts spam messages. First, I cleaned the data, performed the necessary transformations on the text, and converted the words to numeric values so that models could be built. I then used four models and got the highest accuracy and AUC score with SVM. Such a model can help many businesses better identify which messages are spam. However, I believe the predictions can be improved further by using more sophisticated models such as neural networks.

You can find the Python code used for this article on my GitHub account.

