If you are into data science and looking for starter projects, SMS spam classification is one you should work on! In this tutorial, we will go step by step from importing libraries to training a model, making predictions, and finally measuring the accuracy of the model.

About SMS Spam Classification
A good text classifier is one that categorizes large sets of text documents efficiently, in a reasonable time frame and with acceptable accuracy, and that provides classification rules that are human-readable for possible fine-tuning. If training the classifier is also quick, that can be a real asset in some application domains. Many techniques and algorithms for automatic text categorization have been devised.
The text classification task can be defined as assigning category labels to new documents based on the knowledge gained by a classification system at the training stage. In the training phase, we are given a set of documents with class labels attached, and a classification system is built using a learning method. Classification is an important task in both the data mining and machine learning communities; however, most of the learning approaches in text categorization come from machine learning research.
Building SMS Spam Classification using Python, Pandas
For this project, I will be using Google Colab, but you can use a Jupyter Notebook for the same purpose.
Importing of Libraries
First, we import the required libraries, such as pandas, matplotlib, numpy, and sklearn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from google.colab import drive
from sklearn import feature_extraction, model_selection, naive_bayes, metrics, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
%matplotlib inline
drive.mount('/content/drive')
Note: the Colab-specific lines (the google.colab import and the final drive.mount call) can be removed if you are not using Google Colab. The last line mounts my Google Drive in Colab so that I can use the dataset stored in my drive.
Importing the dataset
I will be uploading the dataset to my GitHub repo, which can be found here.
After downloading the dataset, we import it using pandas’ read_csv function.
dataset = pd.read_csv("/content/drive/My Drive/SMS_Spam_Classification/spam.csv", encoding='latin-1')
Note: Please use your own path for the dataset.
Now that we have imported the dataset, let’s check whether it was imported in the correct format by using the head() function.
dataset.head()

From the above snippet, I see that the dataset has unnamed columns that we don’t need. So now comes the task of cleaning and reformatting the data so that we can use it to build our model.
Data Cleaning & Exploration
Now we have to remove the unnamed columns. To do so, we use the drop function.
# Removing the unnamed columns (the columns= keyword also works on recent pandas versions,
# where the positional axis argument is no longer accepted)
dataset = dataset.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
Now, the next task is to rename the columns v1 and v2 to label and message respectively!
dataset = dataset.rename(columns = {'v1':'label','v2':'message'})
Now, additionally (it’s an optional step, but it’s always good to do some data exploration 😛):
dataset.groupby('label').describe()

Next, we want to know how many messages are ham and how many are spam in our dataset. For that:
# Count the messages per class, most frequent class first
count_Class = dataset["label"].value_counts(sort=True)
count_Class.plot(kind='bar', color=["green", "red"])
plt.title('Bar Plot')
plt.show()
Explanation: Here we use the value_counts method of pandas with sort=True, so the most frequent class comes first. The code draws a bar plot with a green bar for ham (the larger class) and a red bar for spam.
The output should look similar to this:

We see that we have many ham messages and far fewer spam messages. In this tutorial, we will go forward with the dataset as it is, without any oversampling or undersampling.
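Because the classes are this unbalanced, accuracy on its own can look better than it really is. As a rough sketch with made-up counts (the numbers below are hypothetical and not taken from the dataset), a "model" that always answers ham already scores high:

```python
from collections import Counter

# Hypothetical label list: 87 ham and 13 spam messages (illustrative only).
labels = ["ham"] * 87 + ["spam"] * 13

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = majority_count / len(labels)

print(counts)
print(f"Always predicting '{majority_label}' gives accuracy {baseline_accuracy:.2f}")
```

This is why it helps to look at precision and recall for the spam class as well as accuracy.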
Implementing the Naive Bayes for SMS Spam Classification
So first let me encode spam and not-spam messages as 1 and 0 respectively.
# Classifying spam and not spam msgs as 1 and 0
dataset["label"]=dataset["label"].map({'spam':1,'ham':0})
# Assumption: X was not defined in the original snippet; here it is taken to be
# a bag-of-words count matrix built from the message text
X = feature_extraction.text.CountVectorizer().fit_transform(dataset["message"])
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, dataset['label'], test_size=0.70, random_state=42)
The split needs a numeric feature matrix X, which the snippet assumes to be word counts produced by CountVectorizer. The last line then uses sklearn’s train_test_split function to split the data into training and testing datasets. Here I have set the test size to 70 percent of the whole dataset. (You can change it as you wish.)
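To make the feature matrix X concrete: a bag-of-words count matrix has one row per message and one column per vocabulary word, which is the kind of matrix scikit-learn's CountVectorizer produces. The sketch below builds such counts by hand on two made-up messages, just to show the idea. The tokenizer (lowercase, split on whitespace) is a simplifying assumption and does not match the library's behavior exactly.

```python
from collections import Counter

# Two made-up messages (hypothetical examples, not from the dataset).
messages = ["win a free prize now", "are you free now"]

# Simplified tokenizer: lowercase and split on whitespace.
tokenized = [m.lower().split() for m in messages]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per message, one column per vocabulary word.
X_toy = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in X_toy:
    print(row)
```

Each row is what a classifier such as Multinomial Naive Bayes would see in place of the raw text.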
Bonus: don’t know about splitting a dataset and its benefits? Read my article where I explain it all!
Now I will use the Multinomial Naive Bayes algorithm!
list_alpha = np.arange(1/100000, 20, 0.11)
score_train = np.zeros(len(list_alpha))
score_test = np.zeros(len(list_alpha))
recall_test = np.zeros(len(list_alpha))
precision_test= np.zeros(len(list_alpha))
count = 0
for alpha in list_alpha:
    bayes = naive_bayes.MultinomialNB(alpha=alpha)
    bayes.fit(X_train, y_train)
    score_train[count] = bayes.score(X_train, y_train)
    score_test[count] = bayes.score(X_test, y_test)
    recall_test[count] = metrics.recall_score(y_test, bayes.predict(X_test))
    precision_test[count] = metrics.precision_score(y_test, bayes.predict(X_test))
    count = count + 1
As you can see, I have also recorded recall and precision on the test set to assess more accurately how well my model is performing.
Now, for the different values of alpha, I will build a table of the measures: Train Accuracy, Test Accuracy, Test Recall, and Test Precision.
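For reference, both measures come straight from the confusion counts, with spam as the positive class: precision = TP / (TP + FP) and recall = TP / (TP + FN). Here is a minimal sketch on made-up labels (both lists are hypothetical):

```python
# 1 = spam, 0 = ham. Both lists are made-up illustrations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

High precision with lower recall, as in this toy case, means the model rarely flags ham as spam but lets some spam through.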
# np.c_ already returns a 2-D array, so the discouraged np.matrix wrapper is not needed
matrix = np.c_[list_alpha, score_train, score_test, recall_test, precision_test]
models = pd.DataFrame(data = matrix, columns =
    ['alpha', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])
models.head(n=10)
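The alpha in this table is the additive (Laplace/Lidstone) smoothing parameter of Multinomial Naive Bayes. The smoothed estimate of a word's probability within a class is (N_wc + alpha) / (N_c + alpha * V), where N_wc is the count of the word in that class, N_c is the total word count of the class, and V is the vocabulary size. The counts below are made up; the sketch only shows how a larger alpha pulls the estimate for an unseen word away from zero and toward the uniform value 1 / V:

```python
def smoothed_probability(word_count, class_total, vocab_size, alpha):
    """Additive-smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts: the word never appears among 100 ham tokens,
# and the vocabulary holds 50 distinct words.
for alpha in (0.00001, 1.0, 10.0):
    p = smoothed_probability(word_count=0, class_total=100, vocab_size=50, alpha=alpha)
    print(f"alpha={alpha:<8} P(word | ham) = {p:.6f}")
```

With a tiny alpha, an unseen word almost vetoes a class; with a large alpha, the word counts matter less and less.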

Now we look for the row with the best Test Precision, as that is what I care most about here. Note that precision is not always the right measure for evaluating a model; it depends on your use case.
best_index = models['Test Precision'].idxmax()
models.iloc[best_index, :]
Output:
alpha 10.670010
Train Accuracy 0.977259
Test Accuracy 0.962574
Test Recall 0.720307
Test Precision 1.000000
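Note that idxmax returns the first row that reaches the maximum, so if several alpha values tie on precision, the choice among them is arbitrary. One possible refinement, sketched here on a small made-up table (the numbers are hypothetical, and this is not necessarily how the output above was produced), is to keep the rows with the best precision and then pick the one with the highest test accuracy:

```python
import pandas as pd

# Made-up results table reusing column names from `models`.
toy_models = pd.DataFrame({
    "alpha": [0.1, 5.0, 10.0, 15.0],
    "Test Accuracy": [0.98, 0.97, 0.96, 0.95],
    "Test Precision": [0.95, 1.00, 1.00, 1.00],
})

# Keep only the rows that reach the best precision ...
top = toy_models[toy_models["Test Precision"] == toy_models["Test Precision"].max()]

# ... then choose the most accurate of those.
best_toy_index = top["Test Accuracy"].idxmax()
print(toy_models.loc[best_toy_index])
```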
Implementing Random Forest
I will use RandomForestClassifier with n_estimators set to 100 (you can change this value to tune the results).
rf = RandomForestClassifier(n_estimators=100,max_depth=None,n_jobs=-1)
rf_model = rf.fit(X_train,y_train)
In the above code snippet, the last line fits the model on X_train and y_train.
Now, let’s see the predictions. I will use the predict function and calculate precision, recall, F-score, and accuracy.
y_pred=rf_model.predict(X_test)
precision,recall,fscore,support =score(y_test,y_pred,pos_label=1, average ='binary')
print('Precision : {} / Recall : {} / fscore : {} / Accuracy: {}'.format(round(precision,3),round(recall,3),round(fscore,3),round((y_pred==y_test).sum()/len(y_test),3)))
Model Evaluation
Precision : 0.995 / Recall : 0.726 / fscore : 0.839 / Accuracy: 0.963
Thus we see that our model’s accuracy is approximately 96 percent, which I think is pretty decent. Its precision is also close to 1, again a decent value. Recall is lower, at about 0.73, which means roughly a quarter of the spam messages are missed.
In my next article, I will use NLP and a neural network and explain how we can get a more accurate model!
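The F-score reported above is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). As a quick check, plugging in the rounded precision and recall from the output reproduces the reported value:

```python
# Rounded values taken from the printed output above.
precision = 0.995
recall = 0.726

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.839
```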
If you liked this tutorial please do share it with your friends or on social media!
Want to have a chat about Data Science? Ping me on LinkedIn!