Naive Bayes and LSTM Based Classifier Models

Building and comparing the accuracy of NB and LSTM models on a given dataset using Python, Keras and the NLTK library.

Ruthu S Sanketh
Towards Data Science


If you’re not familiar with the NLTK library and data pre-processing, take a look at this article. If you’re interested in language models and how to build them, read this article. If you’re familiar with NLP and language models, continue reading!

Contents

  1. What are Language Based Classifier Models?
  2. Naive Bayes Classifier
  3. Initial Steps
  4. Building the Model
  5. LSTM Model
  6. Initial Steps
  7. Training the Model
  8. Predictions on Random Samples
  9. Conclusion
  10. Further Reading

What are Language Based Classifier Models?

A statistical language model is a probability distribution over sequences of words which can be used to predict the next word for text generation and many other applications. Classifiers such as Naive Bayes make use of a language model to assign class labels to some instances, based on a set of features which can be numerically represented using statistical techniques. For example, given a set of movie reviews, we can train a model to predict whether the review has a positive or a negative sentiment.

Text generation can now be done by AI — Image by MILKOVÍ on Unsplash
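To make "numerically represented features" concrete, here is a minimal sketch of turning text into bag-of-words counts with scikit-learn's CountVectorizer, the same vectorizer we use later in this article. The two toy reviews are made up for illustration and are not part of the dataset used below.

from sklearn.feature_extraction.text import CountVectorizer

# Two toy reviews (hypothetical examples, not from the IMDB dataset)
toy_reviews = ["a touching and brilliant movie", "a boring and terrible movie"]

toy_vectorizer = CountVectorizer()
features = toy_vectorizer.fit_transform(toy_reviews) #sparse matrix of word counts

print(sorted(toy_vectorizer.vocabulary_)) #the learned vocabulary
print(features.toarray()) #one row of counts per review, one column per word

A classifier then learns how these counts relate to the positive or negative label.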

Many machine learning techniques apart from Naive Bayes have been applied to automatic text classification, such as K-Nearest Neighbors and Support Vector Machines (SVM). The NB classifier is widely used in text classification for its simplicity and efficiency. An LSTM, or Long Short-Term Memory, classifier is an artificial recurrent neural network with both feedforward and feedback connections, and is usually used for classifying and making predictions on sequential or time-series data. In this article, we will use the NB classifier on an IMDB movie review dataset which contains reviews and their sentiments, and then compare its accuracy and other metrics with those of an LSTM classifier.

Naive Bayes Classifier

Naive Bayes is a conditional probability model based on Bayes' theorem, which states that for each of the K possible outcomes (classes) C_k and a feature vector x = (x_1, ..., x_n), the conditional probability is given by

P(C_k | x_1, ..., x_n) = P(C_k) P(x_1, ..., x_n | C_k) / P(x_1, ..., x_n)

Using the chain rule, i.e. repeated applications of the definition of conditional probability, the joint probability in the numerator can be rewritten as

P(C_k, x_1, ..., x_n) = P(x_1 | x_2, ..., x_n, C_k) P(x_2 | x_3, ..., x_n, C_k) ... P(x_n | C_k) P(C_k)

Under the naive assumption that all the features x_i are mutually independent given the class, the probability simplifies to

P(C_k | x_1, ..., x_n) ∝ P(C_k) Π_i P(x_i | C_k)

This formula is the basis of the Multinomial Naive Bayes classifier we will build, where the features are the occurrences of each word in a document.
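
In practice we work with log probabilities to avoid numerical underflow, so the decision rule our classifier will implement further below is, summing over the in-vocabulary words w of a review,

predicted class = argmax_k [ log P(C_k) + Σ_w log P(w | C_k) ]

where P(C_k) is estimated from the class frequencies in the training set and P(w | C_k) from the per-class word counts, with smoothing applied so that unseen words do not send the whole score to minus infinity.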

Initial Steps

First we import the required libraries and tools.

import pandas as pd
import numpy as np
import nltk, keras, string, re, html, math

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter, defaultdict
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import accuracy_score, classification_report

We then import the IMDB movie reviews dataset, which can be found here, and load it as a pandas dataframe. We then look at some important properties of the data, such as the number of null values in each column. If a dataset has many null values, imputation methods such as filling with the median or mode are used for better results. We also convert the ‘sentiment’ column to lowercase for ease of encoding later on.

#Loads the IMDB dataset using pandas as a dataframe
data = pd.read_csv('/Users/ruthu/Desktop/IMDB Dataset.csv')
print("Data shape - ", data.shape, "\n") #prints the number of rows and columns

for col in data.columns:
    print("The number of null values - ", col, data[col].isnull().sum())
    #prints the number of null values in each column

data["review"] = data["review"].str.lower()
data["sentiment"] = data["sentiment"].str.lower() #converts every value in the column to lowercase
data.head()
A preview of the data — Image by author
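
If the null-value check above had reported missing entries, a small pandas sketch of the median or mode filling mentioned earlier would look like this (the numeric column is hypothetical; this dataset does not need it).

# Hypothetical imputation, only needed if isnull().sum() reports missing values
data['sentiment'] = data['sentiment'].fillna(data['sentiment'].mode()[0]) #mode for a categorical column
# data['some_numeric_column'] = data['some_numeric_column'].fillna(data['some_numeric_column'].median()) #median for a numeric column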

Looking at a preview of the data, we see that there are some HTML tags, URLs, and other special characters that need to be removed from the ‘review’ column. We clean these with regular expressions, then tokenize the reviews, remove stopwords, and lemmatize the remaining tokens using NLTK.

def cleaning(data):
    clean = re.sub('<.*?>', ' ', str(data))
    #removes HTML tags
    clean = re.sub(r'\'.*?\s', ' ', clean)
    #removes hanging letters after apostrophes (the s in it's)
    clean = re.sub(r'http\S+', ' ', clean)
    #removes URLs
    clean = re.sub(r'\W+', ' ', clean)
    #replaces non-alphanumeric characters with spaces
    return html.unescape(clean)
data['cleaned'] = data['review'].apply(cleaning)


def tokenizing(data):
    review = data['cleaned']
    tokens = nltk.word_tokenize(review)
    #word tokenizing is done
    return tokens
data['tokens'] = data.apply(tokenizing, axis=1)


stop_words = set(stopwords.words('english'))
def remove_stops(data):
    my_list = data['tokens']
    meaningful_words = [w for w in my_list if not w in stop_words]
    #stopwords are removed from the tokenized data
    return (meaningful_words)
data['tokens'] = data.apply(remove_stops, axis=1)


lemmatizer = WordNetLemmatizer()
def lemmatizing(data):
    my_list = data['tokens']
    lemmatized_list = [lemmatizer.lemmatize(word) for word in my_list]
    #lemmatizing is performed. It's generally more accurate than stemming.
    return (lemmatized_list)
data['tokens'] = data.apply(lemmatizing, axis=1)


def rejoin_words(data):
    my_list = data['tokens']
    joined_words = (" ".join(my_list))
    #rejoins all lemmatized words into a single string
    return joined_words
data['cleaned'] = data.apply(rejoin_words, axis=1)

data.head()
Preview of the data after preprocessing — Image by author

We now have a cleaned set of reviews as well as a list of lemmatized and preprocessed tokens for each review. Now let us print some statistics of the data, such as the number of sentences and tokens and the proportion of each class, which help in choosing sensible parameters for our models.

# Prints statistics of the data, like average sentence length and the proportion of each class label
def sents(data):
    clean = re.sub('<.*?>', ' ', str(data))
    #removes HTML tags
    clean = re.sub(r'\'.*?\s', ' ', clean)
    #removes hanging letters after apostrophes (the s in it's)
    clean = re.sub(r'http\S+', ' ', clean)
    #removes URLs
    clean = re.sub(r'[^a-zA-Z0-9\.]+', ' ', clean)
    #removes all non-alphanumeric characters except periods
    tokens = nltk.sent_tokenize(clean)
    #sentence tokenizing is done
    return tokens
sents = data['review'].apply(sents)

length_s = 0
for i in range(data.shape[0]):
    length_s += len(sents[i])
print("The number of sentences is - ", length_s)
#prints the number of sentences

length_t = 0
for i in range(data.shape[0]):
    length_t += len(data['tokens'][i])
print("\nThe number of tokens is - ", length_t)
#prints the number of tokens

average_tokens = round(length_t/length_s)
print("\nThe average number of tokens per sentence is - ", average_tokens)
#prints the average number of tokens per sentence

positive = negative = 0
for i in range(data.shape[0]):
    if (data['sentiment'][i] == 'positive'):
        positive += 1
        #counts the number of positive and negative sentiments
    else:
        negative += 1

print("\nThe number of positive examples is - ", positive)
print("\nThe number of negative examples is - ", negative)
print("\nThe proportion of positive to negative sentiments is -", positive/negative)
Statistics of the data — Image by author

Since the ratio of positive to negative samples is 1, the classes are balanced and we don’t need any resampling. Let us create the NB classifier now. We encode the ‘sentiment’ column by assigning the integer 0 to negative sentiments and 1 to positive sentiments. This makes training as well as data storage easier.

# gets reviews column from df
reviews = data['cleaned'].values

# gets labels column from df
labels = data['sentiment'].values
# Uses label encoder to encode labels. Convert to 0/1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
data['encoded']= encoded_labels
print(data['encoded'].head())

# prints the mapping between the original classes and their encoded labels
encoder_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print("\nThe encoded classes are - ", encoder_mapping)

labels = data['encoded']
Encoded sentiments — Image by author

Now that we have the columns ready for the model, we can split the data into train and test sets. We use an 80-20% train-test split and the stratify parameter, so that the ratio of positive to negative sentiments is the same in both the train and test sets.

# Splits the data into train and test (80% - 20%). 
# Uses stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, labels, test_size=0.2, random_state=42, stratify=labels)

We now have train and test sentences, along with the corresponding train and test labels. There are two possible approaches for building the vocabulary for the Naive Bayes classifier -

  1. We can take the whole data (train + test) to build the vocab. This way, no word will be out of vocabulary at test time.
  2. We can take only the train data to build the vocab. In this case, some words from the test set may not be in the vocab, so we will need to perform smoothing so that none of the probability terms are zero.

We use the second approach, since the first one needs the test data up front and gives nearly the same results. Building a vocab from every word in the train set is also memory intensive, so we restrict the vocab to the 3000 most frequent words in the training corpus.
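
Concretely, the Laplace (add-one) smoothing used in the classifier below estimates

P(w | c) = (count(w, c) + 1) / (N_c + |V|)

where count(w, c) is the number of times word w appears in the training reviews of class c, N_c is a per-class normalising count (the number of class-c training reviews in the implementation below), and |V| is the vocabulary size. The +1 in the numerator guarantees that no word ever gets a probability of exactly zero.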

Smoothing to remove 0 probability — Image by author

# Uses Count vectorizer to get frequency of the words
vectorizer = CountVectorizer(max_features = 3000)

sents_encoded = vectorizer.fit_transform(train_sentences) #encodes all training sentences
counts = sents_encoded.sum(axis=0).A1
vocab = list(vectorizer.get_feature_names()) #in newer scikit-learn versions, use get_feature_names_out() instead

Building the Model

Now that we have the vocabulary, let us build the MNB classifier. Libraries such as NLTK and scikit-learn ship with built-in Naive Bayes classifiers which we could use, but in this article we will write the classifier ourselves to get a better idea of how it works.

# Uses Laplace smoothing for words in the test set that are not present in the train vocab.
class MultinomialNaiveBayes:

    def __init__(self, classes, tokenizer):
        #self.tokenizer = tokenizer
        self.classes = classes

    def group_by_class(self, X, y):
        data = dict()
        for c in self.classes:
            #groups reviews by positive and negative sentiment
            data[c] = X[np.where(y == c)]
        return data

    def fit(self, X, y):
        self.n_class_items = {}
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = vocab
        #uses the pre-built vocabulary of the 3000 most frequent training words

        n = len(X)

        grouped_data = self.group_by_class(X, y)

        for c, data in grouped_data.items():
            self.n_class_items[c] = len(data)
            self.log_class_priors[c] = math.log(self.n_class_items[c] / n)
            #takes logs for easier calculation
            self.word_counts[c] = defaultdict(lambda: 0)

            for text in data:
                counts = Counter(nltk.word_tokenize(text))
                for word, count in counts.items():
                    self.word_counts[c][word] += count

        return self

    def laplace_smoothing(self, word, text_class): #smoothing
        num = self.word_counts[text_class][word] + 1
        denom = self.n_class_items[text_class] + len(self.vocab)
        return math.log(num / denom)

    def predict(self, X):
        result = []
        for text in X:

            class_scores = {c: self.log_class_priors[c] for c in self.classes}

            words = set(nltk.word_tokenize(text))
            for word in words:
                if word not in self.vocab: continue

                for c in self.classes:

                    log_w_given_c = self.laplace_smoothing(word, c)
                    class_scores[c] += log_w_given_c

            result.append(max(class_scores, key=class_scores.get))

        return result

Now that we have built the MNB classifier, let us pass the train sentences and labels to train the model. After training the model, we obtain its accuracy and other performance metrics on the test set.

MNB = MultinomialNaiveBayes(
    classes=np.unique(labels),
    tokenizer=Tokenizer()
).fit(train_sentences, train_labels)

# Tests the model on the test set and reports the accuracy
predicted_labels = MNB.predict(test_sentences)
print("The accuracy of the MNB classifier is ", accuracy_score(test_labels, predicted_labels))
print("\nThe classification report with metrics - \n", classification_report(test_labels, predicted_labels))
Performance metrics of the MNB model — Image by author

We see that the accuracy of the MNB model on the test set is around 85%.
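
As an optional sanity check, the same vectorized data can be fed to scikit-learn's built-in MultinomialNB. The sketch below is not part of the article's pipeline, and its score may differ slightly from the hand-written classifier because the smoothing and prediction details are not identical.

from sklearn.naive_bayes import MultinomialNB

# Reuses the CountVectorizer fitted on the training sentences earlier
X_train = vectorizer.transform(train_sentences)
X_test = vectorizer.transform(test_sentences)

sk_mnb = MultinomialNB() #alpha=1.0, i.e. Laplace smoothing, by default
sk_mnb.fit(X_train, train_labels)
sk_predicted = sk_mnb.predict(X_test)
print("scikit-learn MNB accuracy - ", accuracy_score(test_labels, sk_predicted))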

LSTM Model

We use the same train and test splits to build and evaluate this model. Keras provides a built-in LSTM layer, which we will use after setting its parameters to suit our needs.

Initial Steps

We define some hyperparameters of the model, tokenize the train and test sentences, and pad them to a fixed length.

# Hyperparameters of the model
oov_tok = '<OOV>' #out-of-vocabulary token (defined here but not passed to the Tokenizer below)
embedding_dim = 100
max_length = 150
padding_type='post'
trunc_type='post'

# tokenizes sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_sentences)

# vocabulary size
word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1

# converts train dataset to sequence and pads sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)

# converts Test dataset to sequence and pads sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

Now we initialize the model, using the Adam optimizer and the binary cross-entropy loss function. Feel free to try out other optimizers such as SGD. We use word embeddings, a technique in which words are encoded as real-valued vectors in a high-dimensional space so that words with similar meanings end up close together, via the Embedding layer provided by Keras.

# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compiles the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()
LSTM model summary — Image by author

Training the Model

Now that we have defined the model and set up the hyperparameters, we train it. This might take a while if you’re running the code on your local machine; Google Colab with a GPU runtime is much faster, so we use that.

#trains the model
num_epochs = 5
history = model.fit(train_padded, train_labels,
                    epochs=num_epochs, verbose=1,
                    validation_split=0.1)
Training the model — Image by author

Now that we have trained the model, we can find its accuracy on the test set. First we find the probabilities, and convert these to predictions using a 0.5 threshold. Then we get a classification report.

# Gets probabilities
prediction = model.predict(test_padded)
print("The probabilities are - ", prediction, sep='\n')

# Gets labels based on probability: 1 if p >= 0.5 else 0
for each in prediction:
    if each[0] >= 0.5:
        each[0] = 1
    else:
        each[0] = 0
prediction = prediction.astype('int32')
print("\nThe labels are - ", prediction, sep='\n')

# Calculates accuracy on the test data
print("\nThe accuracy of the model is ", accuracy_score(test_labels, prediction))
print("\nThe accuracy and other metrics are \n", classification_report(test_labels, prediction, labels=[0, 1]), sep='\n')
Performance metrics of the LSTM model — Image by author

We see that the accuracy of the LSTM model on the test set is around 87%.

Predictions on Random Samples

We can also use the model to get predictions for random samples and look at the output.

# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming",
            "I have never seen a terrible movie like this",
            "the movie plot is terrible but it had good acting"]

# converts to sequences
test_sequences = tokenizer.texts_to_sequences(sentence)

# pads the sequences
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

# Gets probabilities
prediction = model.predict(test_padded)
print("The probabilities are - ", prediction, sep='\n')

# Gets labels based on probability: 1 if p >= 0.5 else 0
for each in prediction:
    if each[0] >= 0.5:
        each[0] = 1
    else:
        each[0] = 0
prediction = prediction.astype('int32')
print("\nThe labels are - ", prediction, sep='\n')
LSTM random sample predictions — Image by author

We see that the LSTM model predicts the sentiments with a decent accuracy.

Conclusion

The LSTM model, with an accuracy of 87%, is slightly better than the MNB model with an accuracy of 85%. Both models can be used to predict the sentiments of random samples. Their accuracy can be improved by training for more epochs or by varying the hyperparameters. The entire code used in this article as well as the dataset can be found here.

Further Reading
