Filtering Spam Using Naive Bayes

Atiar Rahman
4 min read · May 15, 2019

You, out of the billions of people on Earth, have been called upon by the Prince of Nigeria to help move his fortune out of the country. You can’t refuse such a request, so you send him the money he needs in return for a piece of his fortune. Unfortunately, you never get any money from this fictional prince and realize you’ve just been scammed. This type of scam has been fooling people for ages and is surprisingly still going strong. Ridiculously enough, Americans lost $703,000 to these types of scams in 2018 alone, according to a report by ADT Security Services. One way to reduce the number of people falling for these scams is spam filtering. But how does your email service actually filter out spam emails?

One way spam emails are sorted is with a Naive Bayes classifier. The Naive Bayes algorithm relies on Bayes Rule. It classifies each object by looking at all of its features individually, “naively” assuming that the features are independent of one another given the class. Bayes Rule below shows us how to calculate the posterior probability for just one feature. A probability is calculated for each feature, and these per-feature probabilities are then multiplied together, along with the overall probability of the class, to get a final score for that class. The same score is calculated for the other class as well. Whichever class has the greater score is the class the object is assigned to.

Bayes Rule
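In its general form, Bayes Rule can be written as follows. The symbols A (the class) and B (the observed feature) are just the usual textbook notation, chosen here for illustration:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A | B) is the posterior, P(B | A) is the likelihood, P(A) is the prior, and P(B) is the evidence.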

For our purposes, the object is an email and the features are the unique words in the email. Thus, a posterior probability is calculated for each unique word in the email. Plugging this into Bayes Rule, our formula will look something like this:
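One common way to write this, using w for a single word and w_1, ..., w_n for all the words in an email (the notation is an assumption of this sketch, not necessarily the one used in the original figure):

```latex
P(\text{spam} \mid w) = \frac{P(w \mid \text{spam})\, P(\text{spam})}{P(w)}

P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
```

The second line is the usual Naive Bayes form: because the words are treated as independent given the class, the per-word likelihoods are simply multiplied. The same expression is evaluated with "ham" in place of "spam", and the larger of the two wins.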

Now that we understand Naive Bayes, we can create our own spam filter.
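Before reaching for a library, it can help to see the idea in plain Python. The following is only a minimal sketch under a few assumptions of mine: words are split on whitespace, add-one (Laplace) smoothing is used so unseen words don't zero out the product, and logarithms are summed instead of multiplying raw probabilities. The toy emails and the function names `train`, `log_score`, and `classify` are made up for illustration.

```python
import math
from collections import Counter

def train(emails, labels):
    """Count words per class (1 = spam, 0 = ham) and emails per class."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(emails, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def log_score(text, label, word_counts, class_counts, vocab):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    total_docs = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_docs)
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen during training
        numerator = word_counts[label][word] + 1
        denominator = total_words + len(vocab)
        score += math.log(numerator / denominator)
    return score

def classify(text, model):
    """Return the class with the larger score."""
    return max((0, 1), key=lambda label: log_score(text, label, *model))

# Hypothetical toy data, not taken from the Kaggle dataset
toy_emails = [
    "win free money now",
    "claim your free prize today",
    "meeting agenda for monday",
    "lunch at noon tomorrow",
]
toy_labels = [1, 1, 0, 0]
model = train(toy_emails, toy_labels)

print(classify("free money prize", model))
print(classify("monday lunch meeting", model))
```

Summing logs gives the same ranking as multiplying probabilities, but avoids numerical underflow on long emails.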

Creating A Spam Filter Using Python/Scikit-Learn

Creating your own spam filter is surprisingly easy. The first step is to get a dataset of emails. One can be found on Kaggle and will need to be read into a pandas DataFrame. Your DataFrame should look something like this:

Sample DataFrame containing emails
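As a rough sketch of the loading step, the snippet below reads a tiny in-memory CSV so that it runs on its own. The four rows are invented for illustration; with the real Kaggle download you would pass the path of your CSV file to `pd.read_csv` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV file
csv_text = """text,label_num
Subject: claim your free prize now,1
Subject: meeting agenda for monday,0
Subject: urgent transfer of funds needed,1
Subject: lunch at noon tomorrow,0
"""

# With a real file: df = pd.read_csv("<path to your CSV>")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df["label_num"].value_counts())  # how many spam (1) vs ham (0) emails
```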

In this case, the ‘text’ column contains the message within each email. The ‘label_num’ column has the outcomes for these emails. For this dataset, a 1 represents an email that is ‘spam’ while a 0 represents an email that is not spam, or ‘ham’. Besides pandas, you will also need to import the following scikit-learn libraries:
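A plausible set of imports for the steps described in this post is sketched below. `MultinomialNB` is my assumption for the Naive Bayes variant, since it is the one typically paired with word counts; the rest match the tools mentioned in the text (a train-test split, `CountVectorizer`, accuracy, and F1 score).

```python
import pandas as pd

# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split

# Turning each email into a vector of word counts
from sklearn.feature_extraction.text import CountVectorizer

# Naive Bayes classifier for count features (assumed variant)
from sklearn.naive_bayes import MultinomialNB

# Metrics for evaluating the classifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
```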

Now that you have your dataset ready, it’s time to train your classifier with just a few lines of code:
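The training step might look roughly like the sketch below. To keep it runnable on its own it uses eight made-up emails rather than the Kaggle data, and the choices of `MultinomialNB`, `stratify`, and `random_state=42` are my assumptions, not necessarily what the original code used.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for df["text"] and df["label_num"]
emails = [
    "win free money now",
    "claim your free prize today",
    "free cash offer click now",
    "urgent transfer of funds needed",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch at noon tomorrow",
    "notes from the project call",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the emails for testing, keeping the spam/ham ratio
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=42
)

# Learn the vocabulary on the training emails only, then count words
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Fit the Naive Bayes classifier on the word counts
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)

print(X_train_counts.shape, X_test_counts.shape)
```

Note that the vectorizer is fitted on the training emails only and then reused on the test emails, so both end up with the same columns.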

In the code above, the first thing I did was split the data into training and test sets. This isn’t strictly necessary to build your classifier, but it lets you evaluate it on emails it hasn’t seen. In the next step I used CountVectorizer() to turn each email into a vector counting the number of times each word occurs. Now we’re ready to see the classifier in action. The following sample emails were created. One is clearly spam while the other is a normal email.

In order to use your classifier, you must first vectorize the example emails with the same vectorizer used for training. Finally, you can classify the emails. For our examples above, the first email was classified as ‘spam’ and the second as ‘not spam’.
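Here is a small, self-contained sketch of that last step. Both the training emails and the two sample emails are invented for illustration (they are not the ones from the original post), and `MultinomialNB` is again an assumed choice of classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = spam, 0 = ham)
train_emails = [
    "win free money now",
    "claim your free prize today",
    "free cash offer click now",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch at noon tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_emails), train_labels)

# Hypothetical sample emails: the first looks like spam, the second does not
sample_emails = [
    "Congratulations! Claim your free prize money now",
    "Please review the attached report before the meeting",
]

# Use transform (not fit_transform) so the training vocabulary is reused
sample_counts = vectorizer.transform(sample_emails)
predictions = classifier.predict(sample_counts)

for email, label in zip(sample_emails, predictions):
    print("spam" if label == 1 else "not spam", "|", email)
```

Words that never appeared in the training emails, such as "Congratulations" here, are simply dropped by `transform`.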

The classifier was successful. To get a better understanding of the model’s performance, the accuracy and F1 score were measured. In total there were only 22 misclassified emails (false positives plus false negatives) in a test set of 1,293 emails, an error rate of about 1.7%.
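The metrics can be computed along the lines of the sketch below. The `y_true` and `y_pred` lists are small made-up examples, not the actual test results; only the last few lines use the figures reported above (22 errors out of 1,293 emails) to show the accuracy they imply.

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # F1 for the positive class (spam)

# Rows are true labels, columns are predicted labels, in the order [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("accuracy:", accuracy)
print("F1 score:", f1)
print("false positives:", fp, "false negatives:", fn)

# Accuracy implied by the figures reported in the text
reported_errors = 22
reported_test_size = 1293
implied_accuracy = 1 - reported_errors / reported_test_size
print("implied accuracy:", round(implied_accuracy, 3))
```

For a spam filter, false positives (real emails sent to the spam folder) are usually the more costly kind of error, which is why it is worth looking at the confusion matrix and not just the accuracy.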

Conclusion

Naive Bayes is a powerful yet simple algorithm that is especially useful for filtering out spam emails. To get an even better understanding of Naive Bayes and Bayes Rule, this article may be very helpful.
