Email Classification is a Machine Learning problem that falls under the category of Supervised Learning.

This email classification mini-project is inspired by J.K. Rowling’s publishing of a book under a pen name; Udacity’s "Introduction to Machine Learning" course provides a comprehensive study of the algorithms and of the project itself.
A couple of years ago, Rowling wrote a book, "The Cuckoo’s Calling," under the name Robert Galbraith. The book received some good reviews, but no one paid much attention to it – until an anonymous tipster on Twitter said it was J.K. Rowling. The London Sunday Times enlisted two experts to compare the linguistic patterns of "Cuckoo" to Rowling’s "The Casual Vacancy," as well as to books by several other authors. After the results of their analysis pointed strongly toward Rowling as the author, the Times directly asked the publisher if they were the same person, and the publisher confirmed. The book exploded in popularity overnight.
Email Classification works on the same basic concepts. By going through the text of the email, we will use Machine Learning algorithms to predict whether the email has been written by one person or the other.
The Dataset
The dataset may be taken from the following GitHub repository:
In this dataset, we have a set of emails, half of which were written by one person (Sara) and the other half by another person (Chris) at the same company. The data is a list of strings; each string is the text of an email, which has undergone some basic preprocessing.
We will classify the emails as written by one person or the other based only on the text of the email. We will use the following algorithms one by one: Naïve Bayes, Support Vector Machine, Decision Trees, Random Forest, KNN, and AdaBoost Classifier.
The repository has 2 pickle files: word_data and email_authors.
The email_preprocess Python file processes the data from the pickle files. It splits the data into training and testing sets, holding out 10% of the data for testing.
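The exact contents of email_preprocess.py are defined in the repository, but a minimal sketch of what such a preprocess() function might look like is shown below; the file names, the TF-IDF vectorization, and the 10% feature selection are assumptions, not a copy of the original file:
import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
def preprocess(words_file="word_data.pkl", authors_file="email_authors.pkl"):
    # load the email texts and the corresponding author labels
    with open(words_file, "rb") as f:
        word_data = pickle.load(f)
    with open(authors_file, "rb") as f:
        authors = pickle.load(f)
    # hold out 10% of the data for testing
    features_train, features_test, labels_train, labels_test = train_test_split(
        word_data, authors, test_size=0.1, random_state=42)
    # turn the raw email text into TF-IDF features
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    features_train = vectorizer.fit_transform(features_train)
    features_test = vectorizer.transform(features_test)
    # keep only the most informative 10% of the features to speed up training
    selector = SelectPercentile(f_classif, percentile=10)
    features_train = selector.fit_transform(features_train, labels_train).toarray()
    features_test = selector.transform(features_test).toarray()
    return features_train, features_test, labels_train, labels_test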
Naïve Bayes
Naïve Bayes methods are a set of supervised learning algorithms based on Bayes’ theorem, with the "naïve" assumption that every pair of features is conditionally independent, and contributes equally, given the value of the class variable. Bayes’ theorem is a simple mathematical formula used for calculating conditional probabilities.
Gaussian Naïve Bayes is a type of Naïve Bayes where the likelihood of the features is assumed to be Gaussian. The continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values.
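To make the bell-shaped curve concrete, here is a minimal, purely illustrative sketch (the numbers are made up, not taken from the dataset) of how Gaussian Naïve Bayes scores a single feature value for one class:
import numpy as np
def gaussian_likelihood(x, mean, var):
    # probability density of x under the Gaussian fitted to one feature for one class
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
# if a feature has mean 0.3 and variance 0.01 for one author's emails,
# a value of 0.32 is far more likely under that class than a value of 0.9
print(gaussian_likelihood(0.32, mean=0.3, var=0.01))
print(gaussian_likelihood(0.90, mean=0.3, var=0.01))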

We will use the Gaussian Naïve Bayes algorithm from the scikit-learn library to classify the emails between the two authors.
Following is the Python code, which you may run in any Python IDE with the required libraries installed on your system.
import sys
from time import time
sys.path.append(r"C:\Users\HP\Desktop\ML Code")
from email_preprocess import preprocess
import numpy as np
# using the Gaussian Bayes algorithm for classification of emails.
# the algorithm is imported from the Sklearn library
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# initializing the test and train features and labels
# the function preprocess is imported from email_preprocess.py
features_train, features_test, labels_train, labels_test = preprocess()
# defining the classifier
clf = GaussianNB()
# measuring the training and prediction times
t0 = time()
clf.fit(features_train, labels_train)
print("\nTraining time:", round(time()-t0, 3), "s\n")
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s\n")
#calculating and printing the accuracy of the algorithm
print("Accuracy of Naive Bayes: ", accuracy_score(pred,labels_test))
Running the code gives us the following results:

The accuracy of Naïve Bayes for this particular problem is 0.9203. Pretty good, right? Even the training and predicting times of the algorithm are quite reasonable.
Support Vector Machines
Support Vector Machines are another type of Supervised Learning algorithm, used for classification and regression as well as outlier detection. We can use the SVM algorithm to classify data points into 2 classes through a plane that separates them. With a linear kernel, SVM produces a straight decision boundary. The SVM algorithm is quite versatile: different kernel functions can be specified for the decision function.
The SVM algorithm is based on the hyperplane that separates the two classes: the greater the margin between the classes, the better the classification (this is also called margin maximization).
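For a linear kernel, the maximized margin can be read directly off the fitted model. As a minimal, illustrative sketch (assuming clf is the linear SVC fitted in the code below), the margin width is 2 divided by the norm of the learned weight vector:
import numpy as np
# w is the weight vector of the separating hyperplane learned by the linear SVC
w = clf.coef_[0]
print("Margin width:", 2 / np.linalg.norm(w))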
Our classifier is the C-Support Vector Classification (SVC) with a linear kernel and a value of C = 1:
clf = SVC(kernel='linear', C=1)
import sys
from time import time
sys.path.append(r"C:\Users\HP\Desktop\ML Code")
from email_preprocess import preprocess
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#defining the classifier
clf = SVC(kernel='linear', C=1)
# measuring the training and prediction times
t0 = time()
clf.fit(features_train, labels_train)
print("\nTraining time:", round(time()-t0, 3), "s\n")
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s\n")
#calculating and printing the accuracy of the algorithm
print("Accuracy of SVM Algorithm: ", clf.score(features_test, labels_test))

The accuracy of the SVM algorithm is 0.9596. There is a visible tradeoff between accuracy and training time: the higher accuracy comes at the cost of a much longer training time (22.7s, compared to 0.13s in the case of Naïve Bayes). We can play with the amount of training data as well as the kernels to arrive at a selection that yields a good accuracy score with less training time!
We will first slice the training dataset down to 1% of its original size, tossing out 99% of the training data. With the rest of the code unchanged, we observe a significant reduction in training time and a corresponding reduction in accuracy: the tradeoff is that accuracy almost always goes down when we cut down the training data.
Use the following code to slice the training data to 1%:
features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]
As can be seen, with 1% of the training data, the training time of the algorithm is reduced to 0.01s, with a reduced accuracy of 0.9055.

With 10% of the training data, the accuracy is 0.9550 with a training time of 0.47s.

We may also change the kernel and the value of C in scikit-learn’s C-Support Vector Classification.
With 100% of the training data, an RBF kernel, and the value of C set to 10000, we get an accuracy of 0.9891 with a training time of 14.718s.
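As an illustrative tweak (the parameter values come from the experiment above, but the snippet itself is just a sketch), only the classifier definition needs to change:
# an RBF kernel with a large C penalizes misclassified training points heavily,
# producing a more flexible decision boundary than the linear kernel
clf = SVC(kernel='rbf', C=10000)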

Decision Trees
Decision Trees are a non-parametric supervised learning method used for classification and regression. Decision Trees can perform multi-class classification on a dataset. Data is classified stepwise at each node using decision rules inferred from the data features. Decision Trees are easy to visualize: we can understand the algorithm by following a dataset down the tree, with a decision made at each node.

Let’s see how this algorithm works on our dataset.
import sys
from time import time
sys.path.append(r"C:\Users\HP\Desktop\ML Code")
from email_preprocess import preprocess
from sklearn import tree
from sklearn.metrics import accuracy_score
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
# defining the classifier
clf = tree.DecisionTreeClassifier()
print("nLength of Features Train", len(features_train[0]))
#predicting the time of train and testing
t0 = time()
clf.fit(features_train, labels_train)
print("nTraining time:", round(time()-t0, 3), "sn")
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "sn")
#calculating and printing the accuracy of the algorithm
print("Accuracy of Decision Trees Algorithm: ", accuracy_score(pred,labels_test))
Running the code above gives us an accuracy of 0.9880 and a training time of 6.116s. That is a very good accuracy score, isn’t it? Here we used 100% of the training data to train the model.
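Since Decision Trees are easy to visualize, we can also inspect the fitted tree itself. A minimal sketch (using the clf fitted in the code above; the depth limit is an arbitrary choice to keep the output readable):
from sklearn import tree
import matplotlib.pyplot as plt
# print a text summary of the top decision rules learned from the features
print(tree.export_text(clf, max_depth=2))
# or draw the top levels of the tree
tree.plot_tree(clf, max_depth=2, filled=True)
plt.show()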

Random Forests

Random Forests are an ensemble Supervised Learning algorithm built on Decision Trees and used for both regression and classification tasks. The algorithm takes its name from the random sampling of data and features used when building the individual trees.
We can use the Random Forests algorithm from the sklearn library on our dataset: RandomForestClassifier().
The following is the code used for running the random forest algorithm on our email classification problem.
import sys
from time import time
sys.path.append(r"C:\Users\HP\Desktop\ML Code")
from email_preprocess import preprocess
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
# defining the classifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
# measuring the training and prediction times
t0 = time()
clf.fit(features_train, labels_train)
print("\nTraining time:", round(time()-t0, 3), "s\n")
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s\n")
#calculating and printing the accuracy of the algorithm
print("Accuracy of Random Forest Algorithm: ", accuracy_score(pred,labels_test))

The accuracy of the algorithm is quite low, i.e. 0.7707. The training time is 1.2s, which is reasonable, but overall it does not prove to be a good tool for our problem. The low accuracy here is partly a consequence of the very shallow trees (max_depth=2) and partly of the randomness of feature selection, which is a defining property of random forests. A random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a "forest"), this model uses two key ideas that give it the name random: random sampling of training data points when building each tree, and random subsets of features considered when splitting each node.
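As an illustrative adjustment (these parameters are not tuned for this dataset), letting the trees grow deeper and averaging more of them usually recovers much of the lost accuracy at the cost of a longer training time:
# max_depth=None lets each tree grow until its leaves are pure,
# and n_estimators controls how many trees are averaged
clf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)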
KNN – K Nearest Neighbors
K Nearest Neighbors is a Supervised Machine Learning algorithm that may be used for both classification and regression problems. KNN is a lazy learner. It relies on distances for classification, so normalizing the training data can improve its accuracy dramatically.
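Since KNN relies on distances, a common refinement, not part of the original code, is to scale the features before fitting. A minimal sketch using scikit-learn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
# fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid information leakage
scaler = MinMaxScaler()
features_train = scaler.fit_transform(features_train)
features_test = scaler.transform(features_test)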

Let us see the results of classifying the emails using the KNN algorithm from the sklearn library (KNeighborsClassifier()) with 5 nearest neighbors and the Euclidean metric.
import sys
from time import time
sys.path.append(r"C:\Users\HP\Desktop\ML Code")
from email_preprocess import preprocess
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
# defining the classifier
clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
# measuring the training and prediction times
t0 = time()
clf.fit(features_train, labels_train)
print("\nTraining time:", round(time()-t0, 3), "s\n")
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s\n")
#calculating and printing the accuracy of the algorithm
print("Accuracy of KNN Algorithm: ", accuracy_score(pred,labels_test))
The accuracy of the algorithm is 0.9379 with a training time of 2.883s. However, it may be noticed that the model took a considerably longer time to predict the classes.

AdaBoost Classifier

AdaBoost, or Adaptive Boosting, is also an ensemble boosting classifier. It is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on the difficult cases.
We will use the classifier from the scikit-learn library. Following is the code:
import sys
from time import time
sys.path.append(r"C:\Users\HP\Desktop\ML Code")
from email_preprocess import preprocess
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
# defining the classifier
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
# measuring the training and prediction times
t0 = time()
clf.fit(features_train, labels_train)
print("\nTraining time:", round(time()-t0, 3), "s\n")
t1 = time()
pred = clf.predict(features_test)
print("Predicting time:", round(time()-t1, 3), "s\n")
#calculating and printing the accuracy of the algorithm
print("Accuracy of Ada Boost Classifier: ", accuracy_score(pred,labels_test))
The accuracy of the classifier comes out to be 0.9653 with a training time of 17.946s. The accuracy is quite good; however, the training time is on the longer side.
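Because AdaBoost builds its ensemble one weak learner at a time, we can also inspect how the test accuracy evolves as estimators are added. This is a small, optional exploration (not part of the original code) using the classifier's staged_score method:
# staged_score yields the accuracy after each boosting round, showing
# how much each additional weak learner contributes
for i, score in enumerate(clf.staged_score(features_test, labels_test), start=1):
    if i % 20 == 0:
        print(i, "estimators:", round(score, 4))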

Conclusion
Throughout this article, we have used several Machine Learning algorithms to classify emails between Chris and Sara. The algorithms produced accuracy scores ranging from 0.77 to 0.98. Comparing the results:
- the Random Forests algorithm had the lowest accuracy score
- SVM algorithm had the longest training time
- the SVM algorithm with optimized parameters of C=10000 and RBF kernel had the highest accuracy score
- Naive Bayes algorithm had the quickest predicting time

Although there are many other classification algorithms that could be used for our task, a comparison of these basic algorithms on the dataset shows that, for our particular problem, SVM is the most accurate, provided its parameters are optimized for the task we are dealing with.
Do you think other algorithms or models would do the job better, or maybe equally well?
Share your experience and follow me for more articles!