How To Design A Spam Filtering System with Machine Learning Algorithm

Explore, Plot and Visualize Your Data

edricgsh
Towards Data Science


As a software developer, email is one of the most important tools for communication. For that communication to be effective, spam filtering is an essential feature.

So how does a spam filtering system actually work? Can we design something similar from scratch?

Outline

The main goal of this two-part article is to show how you can design a spam filtering system from scratch.

The outline of this article is summarized below:

  1. EDA (Exploratory data analysis)
  2. Data Preprocessing
  3. Feature Extraction
  4. Scoring & Metrics
  5. Improvement by using Embedding + Neural Network (Part 2)
  6. Comparison of ML algorithm & Deep Learning (Part 2)

Let’s get started!

Exploratory Data Analysis (EDA)

Exploratory Data Analysis is a very important process in data science. It helps the data scientist understand the data at hand and relate it to the business context.

The open-source tool that I will be using to visualize and analyze the data is Word Cloud.

Word Cloud is a data visualization tool for representing text data. The size of each word in the image represents its frequency or importance in the training data.

Steps to take in this section:

  1. Get the email data
  2. Explore and analyze the data
  3. Visualize the training data with Word Cloud & Bar Chart

Get the spam data

Data is the essential ingredient before we can develop any meaningful algorithm. Knowing where to get your data can be very handy, especially when you are just a beginner.

Below are a few well-known repositories where you can easily get thousands of data sets for free:

  1. UC Irvine Machine Learning Repository
  2. Kaggle datasets
  3. AWS datasets

The email spam data set used here is distributed by SpamAssassin; you can click this link to go to the data set. There are a few categories of data, and you can read the readme.html to get more background information on them.

In short, there are two types of data present in this repository: ham (non-spam) and spam. Furthermore, the ham data is split into easy and hard, which means some of the non-spam data has a very high similarity with the spam data. This might make it harder for our system to make the right decision.

If you are using Linux or Mac, simply run the following in the terminal; wget is a command that downloads a file from a given URL:

wget https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2
wget https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2
wget https://spamassassin.apache.org/old/publiccorpus/20030228_spam.tar.bz2
wget https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2
wget https://spamassassin.apache.org/old/publiccorpus/20030228_hard_ham.tar.bz2

Let’s run some code and see what content is inside all these emails!
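Before exploring, we need to extract the archives and load the raw messages into memory. Below is a minimal sketch of how this could be done, assuming the archives above have been extracted into directories named easy_ham, easy_ham_2, hard_ham, spam and spam_2; the helper read_email_body is just an illustrative name.

import glob
import email
from email import policy

# Assumed directory names after extracting the downloaded archives
ham_files = glob.glob("easy_ham/*") + glob.glob("easy_ham_2/*") + glob.glob("hard_ham/*")
spam_files = glob.glob("spam/*") + glob.glob("spam_2/*")

def read_email_body(path):
    # Parse one raw email file and return its plain-text parts
    with open(path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            parts.append(part.get_content())
    return "\n".join(parts)

ham_texts = [read_email_body(p) for p in ham_files]
spam_texts = [read_email_body(p) for p in spam_files]
print(len(ham_texts), len(spam_texts))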

Explore and Analyze the Data

Let’s take a look at the email message content and have a basic understanding of the data

Ham

This looks like a normal email reply to another person, which is not difficult to classify as ham:

This is a bit of a messy solution but might be useful -

If you have an internal zip drive (not sure about external) and
you bios supports using a zip as floppy drive, you could
use a bootable zip disk with all the relevant dos utils.

Hard Ham (Ham email that is trickier)

Hard ham is indeed more difficult to differentiate from the spam data, as it contains key phrases such as "limited time order" and Special "Back to School" Offer, which make it look very suspicious!

Hello Friends!

We hope you had a pleasant week. Last weeks trivia questions was:


What do these 3 films have in common: One Crazy Summer, Whispers in the =
Dark, Moby Dick?=20

Answer: Nantucket Island



Congratulations to our Winners:

Caitlin O. of New Bedford, Massachusetts

Brigid M. of Marblehead, Massachusetts


Special "Back to School" Offer!

For a limited time order our "Back to School" Snack Basket and receive =
20% Off & FREE SHIPPING!

Spam

One of the spam training examples looks exactly like the spam advertisement emails in our junk folders:

IMPORTANT INFORMATION:

The new domain names are finally available to the general public at discount prices. Now you can register one of the exciting new .BIZ or .INFO domain names, as well as the original .COM and .NET names for just $14.95. These brand new domain extensions were recently approved by ICANN and have the same rights as the original .COM and .NET domain names. The biggest benefit is of-course that the .BIZ and .INFO domain names are currently more available. i.e. it will be much easier to register an attractive and easy-to-remember domain name for the same price. Visit: http://www.affordable-domains.com today for more info.

Visualization

Wordcloud

Word Cloud is a useful visualization tool for getting a rough estimate of which words have the highest frequency in your data.

Visualization for spam email
Visualization for non spam email

From this visualization, you can notice something interesting about the spam emails: a lot of them contain a high number of “spammy” words such as free, money, product, etc. Being aware of this might help us make better decisions when designing the spam detection system.

One important thing to note is that the word cloud only displays the frequency of the words, not necessarily their importance. Hence, it is necessary to do some data cleaning, such as removing stopwords and punctuation, before visualizing the data.
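As a minimal sketch of how such a cloud can be generated with the open-source wordcloud package (assuming spam_texts is the list of spam email bodies loaded earlier):

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Join all spam messages into one big string; spam_texts is assumed from the loading step
spam_words = " ".join(spam_texts).lower()

wc = WordCloud(stopwords=set(STOPWORDS), background_color="white",
               max_words=200, width=800, height=400).generate(spam_words)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Word cloud of spam emails")
plt.show()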

N-grams model visualization

Another visualization technique is to use a bar chart to display the frequency of the most common terms. The n in n-gram refers to how many consecutive words are treated as a single unit when calculating word frequencies.

In this article, I show examples for 1-grams and 2-grams; a sketch of how such charts can be produced follows the figures below. You can definitely experiment with larger n-gram models.

Bar chart visualization of 1-gram model
Bar chart visualization of 2-gram model
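A minimal sketch of producing these bar charts using CountVectorizer's ngram_range parameter (spam_texts and the helper top_ngrams are assumed, illustrative names):

from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

def top_ngrams(texts, n=1, top_k=20):
    # Count all n-grams across the documents and return the top_k most frequent
    vec = CountVectorizer(ngram_range=(n, n), stop_words="english")
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1          # total frequency of each n-gram
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, totals), key=lambda x: x[1], reverse=True)[:top_k]

labels, freqs = zip(*top_ngrams(spam_texts, n=2))
plt.barh(labels[::-1], freqs[::-1])
plt.title("Most frequent 2-grams in spam emails")
plt.show()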

Train Test Split

It is important to split your data set into a training set and a test set, so that you can evaluate the performance of your model on the test set before deploying it to a production environment.

One important thing to note when doing the train-test split is to make sure the distribution of the data between the training set and the test set is similar.

What it means in this context is that the percentage of spam email in the training set and test set should be similar.

Target Count For Train Data
Train Data Distribution
Target Count For Test Data
Test Data Distribution

The spam proportions in the train data and test data are quite similar, around 20–21%, so we are good to go and can start processing our data!
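A minimal sketch of a stratified split with sklearn, assuming texts holds the email bodies and labels marks spam as 1 and ham as 0 (both are assumed names):

from sklearn.model_selection import train_test_split

# stratify=labels keeps the spam/ham ratio similar in both splits
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)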

Data Preprocessing

Text Cleaning

Text cleaning is a very important step in machine learning because your data may contain a lot of noise and unwanted characters such as punctuation, white space, numbers, hyperlinks and so on.

Some standard procedures that people generally use are listed below (a minimal cleaning function covering most of these steps follows the list):

  • converting all letters to lower/upper case
  • removing numbers
  • removing punctuation
  • removing white spaces
  • removing hyperlinks
  • removing stop words such as a, about, above, down, doing and the list goes on…
  • word stemming
  • word lemmatization
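Here is a sketch of such a cleaning function, assuming the NLTK stopword list has been downloaded beforehand with nltk.download('stopwords'):

import re
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()                              # convert to lower case
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # remove hyperlinks
    text = re.sub(r"\d+", " ", text)                 # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = [w for w in text.split() if w not in stop_words]          # remove stop words
    return " ".join(words)                           # this also collapses extra white space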

The two techniques that might seem unfamiliar to most people are word stemming and word lemmatization. Both try to reduce words to their most basic form, but they do so with different approaches.

  • Word stemming — Stemming algorithms work by removing the end or the beginning of a word, using a list of common prefixes and suffixes found in that language. An example on English words is shown in the code below.
  • Word lemmatization — Lemmatization uses the dictionary of a particular language and tries to convert words back to their base form. It takes the meaning of the word into account and converts it to the most suitable base form.

Implementing these two algorithms might be tricky and requires a lot of thinking and design to deal with different edge cases.

Luckily, the NLTK library provides implementations of both algorithms, so we can use them out of the box!

Let's import the library and write some small functions to help us understand the basic workings of these two algorithms.

# Just import them and use them
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

dirty_text = "He studies in the house yesterday, unluckily, the fans breaks down"

def word_stemmer(words):
    stem_words = [stemmer.stem(o) for o in words]
    return " ".join(stem_words)

def word_lemmatizer(words):
    lemma_words = [lemmatizer.lemmatize(o) for o in words]
    return " ".join(lemma_words)

The output of the word stemmer is quite obvious: some of the endings of the words have been chopped off.

clean_text = word_stemmer(dirty_text.split(" "))
clean_text
#Output
'He studi in the hous yesterday, unluckily, the fan break down'

The lemmatization has converted studies -> study, breaks -> break

clean_text = word_lemmatizer(dirty_text.split(" "))
clean_text
#Output
'He study in the house yesterday, unluckily, the fan break down'

Feature Extraction

Our algorithms always expect numerical input (integers/floats), so we need a feature extraction layer in the middle to convert the words into numbers.

There are a couple of ways to do this, and today I am going to introduce:

  1. CountVectorizer
  2. TfidfVectorizer
  3. Word Embedding

CountVectorizer

First, we fit the CountVectorizer on all the training data. It builds a vocabulary that maps every word to an index, and when a document is transformed, that index corresponds to the position of the word's count in the resulting vector.

For example, a sentence like ‘I like to eat apple and drink apple juice’

from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["I like to eat apple and drink apple juice"]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

# Output
# The number following each word is the index of the word
{'like': 5, 'to': 6, 'eat': 3, 'apple': 1, 'and': 0, 'drink': 2, 'juice': 4}
# The index relates to the position in the word count array below
# "I like to eat apple and drink apple juice" -> [1 2 1 1 1 1 1]
# apple, which has index 1, corresponds to the word count of 2 in the array

TfidfVectorizer

Word counts are good, but can we do better? One issue with simple word counts is that some words like 'the' and 'and' appear many times, and they don't really add much meaningful information.

Another popular alternative is the TfidfVectorizer. Besides counting the occurrences of every word, it downscales words that frequently appear across many documents or sentences, since those carry less discriminative information.
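As a small illustrative sketch (the two toy sentences here are made up for this example):

from sklearn.feature_extraction.text import TfidfVectorizer

text = ["I like to eat apple and drink apple juice",
        "The shop sells fresh apple juice"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(text)

print(vectorizer.vocabulary_)    # word -> column index, as with CountVectorizer
print(tfidf.toarray().round(2))  # words shared by both sentences, e.g. "apple", get lower weights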

For more info about CountVectorizer and TfidfVectorizer, please read this great article, which is also where I gained most of my understanding.

Word Embedding

There are a lot of great articles online that explain the details of word embeddings and the algorithms used to generate them. So here, I will skip most of the details and try to give you a rough idea of what they are.

Word embedding converts a word into a vector, and this vector represents the position of the word in a higher-dimensional space.

For words that have similar meanings, the cosine distance between their word vectors is smaller, so they lie closer to each other.

And since these words are vectors, you can even perform math operations on them! The result of such an operation is another vector that maps to a word, and surprisingly, these operations produce some amazing results!

Example 1 : King- Man + Woman = Queen

Example 2: Madrid-Spain+France = Paris

Example 3: Walking-Swimming+Swam= Walked

Simply put, word embedding is a very powerful representation of words, and one of the most well-known techniques for generating such embeddings is Word2Vec.
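As a rough sketch of training such embeddings with gensim's Word2Vec (gensim 4.x API; clean_texts is an assumed name for the cleaned email bodies, and the query word must actually occur in the corpus at least min_count times):

from gensim.models import Word2Vec

# Word2Vec expects a list of tokenised sentences
sentences = [doc.split() for doc in clean_texts]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

# Words used in similar contexts end up close together in the embedding space
print(model.wv.most_similar("free", topn=5))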

Hooh! After converting all the sentences into some form of vectors, we come to the most exciting part of this article → algorithm implementation!

Algorithm Implementation

TfidfVectorizer + Naive Bayes Algorithm

The first approach I took was to use the TfidfVectorizer as the feature extraction tool and the Naive Bayes algorithm to do the prediction. Naive Bayes is a simple, probabilistic, traditional machine learning algorithm.

It has long been popular for solving problems like spam detection. For the details of Naive Bayes, check out this article by Devi Soni, which is a concise and clear explanation of the theory behind the algorithm.

Using the Naive Bayes implementation provided by the sklearn library saves us a lot of the hassle of implementing this algorithm ourselves. We can easily get this done in a few lines of code.

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(x_train_features.toarray(), y_train)

# Output of the score is the accuracy of the prediction
clf.score(x_train_features.toarray(), y_train)
# Accuracy: 0.995
clf.score(x_test_features.toarray(), y_test)
# Accuracy: 0.932

We achieve an accuracy of 93.2% on the test set. But accuracy is not the only metric for evaluating the performance of an algorithm. Let's try out other scoring metrics, which may help us understand more thoroughly how well this model is doing.

Scoring & Metrics

Disadvantages of accuracy

When it comes to evaluation of a data science model’s performance, sometimes accuracy may not be the best indicator.

Some problems that we solve in real life have very imbalanced classes, and using accuracy might not give us enough confidence to understand the algorithm's performance.

In the email spamming problem that we are trying to solve, the spam data is approximately 20% of our data. If our algorithm predicts every email as non-spam, it will still achieve an accuracy of 80%.

And for a problem that has only 1% positive data, predicting every sample as negative gives an accuracy of 99%, but we all know this kind of model is useless in a real-life scenario.

Precision & Recall

Precision and recall are the common evaluation metrics people use when evaluating class-imbalanced classification models.

Let's try to understand which questions precision and recall are trying to answer:

Precision: What proportion of positive identifications was actually correct?

Recall: What proportion of actual positives was identified correctly?

So, precision evaluates how accurate the model is when it predicts something as positive. Recall, on the other hand, evaluates how good the model is at finding all the positive samples.

The mathematical equations for precision and recall are:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

TP: True Positive

FP : False Positive

TN: True Negative

FN: False Negative

Confusion Matrix

A confusion matrix is a very good way to understand results in terms of true positives, false positives, true negatives, and so on.

The sklearn documentation provides sample code for plotting a nice-looking confusion matrix to visualize your results. You can check it out here, or you can find the code in the notebook that I share at the end of the article.
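A minimal sketch of how these numbers can be computed with sklearn.metrics, assuming the clf, x_test_features and y_test from the Naive Bayes section above, with spam encoded as the positive class:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_pred = clf.predict(x_test_features.toarray())

print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))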

Confusion Matrix of the result
Precision: 87.82%
Recall: 81.01%

The recall of this model is rather low; it might not be doing a good enough job at discovering spam emails. How do we do better than this?

Summary

In this article, I have shown you all the steps needed to design a spam detection algorithm. Just a brief recap:

  1. Explore and understand your data
  2. Visualize the data at hand to gain a better intuition — Wordcloud, N-gram Bar Chart
  3. Text Cleaning — Word Stemmer and Word Lemmatization
  4. Feature Extraction — Count Vectorizer, Tfidf Vectorizer, Word Embedding
  5. Algorithm — Naive Bayes
  6. Scoring & Metrics — Accuracy, Precision, Recall

This concludes the first part of the demonstration of designing a spam detection algorithm.

There is a part 2 of this article! In it, I demonstrate how you can improve the accuracy, precision, and recall of the model by using word embeddings and a deep learning model. Read the article here if you are interested!

You can clone the notebook from GitHub or run it from Colab directly. Leave any questions you have in the comments section below.
