Sentiment Analysis using Logistic Regression and Naive Bayes

Let's compare which algorithm performs better at classifying tweets by their sentiment.

Atharva Mashalkar
Towards Data Science


Supervised ML

In supervised machine learning, you usually have an input X, which goes into your prediction function to get your prediction ŷ. You can then compare your prediction with the true value Y. This gives you the cost, which you use to update the parameters θ.

But what is Sentiment Analysis?

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

So, let's start sentiment analysis using Logistic Regression

Sentiment Analysis using Logistic Regression

We will be using the sample Twitter data set for this exercise.

Given a tweet, or some text, we can represent it as a vector of dimension V, where V corresponds to our vocabulary size. For example, if you had the tweet “I am learning sentiment analysis”, you would put a 1 in the corresponding index for every word that appears in the tweet, and a 0 otherwise. As V gets larger, this vector becomes more sparse. Furthermore, we end up with many more features and have to train V parameters in θ, which would lead to longer training and prediction times. Hence, we will instead extract the frequency of every word and build a frequency dictionary.

The idea is to split the training set into positive and negative tweets, count all the words, and build a Python dictionary that stores each word's frequency in positive tweets and in negative tweets.

For every tweet, we then build a three-element vector: a bias unit, the sum of the positive frequencies of its words (how often those words appear in positive tweets), and the sum of their negative frequencies. We will go into detail regarding this in the following paragraphs.
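
As a small illustration with made-up counts (not the actual data set), the frequency dictionary and the resulting three-element feature vector might look like this:

# hypothetical counts, purely for illustration
toy_freqs = {('happi', 1): 25, ('happi', 0): 2,
             ('sad', 1): 1,    ('sad', 0): 30}

tweet_words = ['happi', 'sad']  # a preprocessed tweet
x = [1,                                                      # bias unit
     sum(toy_freqs.get((w, 1), 0) for w in tweet_words),     # positive frequencies: 25 + 1 = 26
     sum(toy_freqs.get((w, 0), 0) for w in tweet_words)]     # negative frequencies: 2 + 30 = 32
print(x)  # [1, 26, 32]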

Preprocessing a tweet

When preprocessing, you have to perform the following:

  1. Eliminate handles and URLs.
  2. Tokenize the string into words.
  3. Remove stop words like “and”, “is”, “a”, “on”, etc.
  4. Stemming, i.e. convert every word to its stem. For example, dancer, dancing and danced all become ‘danc’. You can use the Porter stemmer to take care of this.
  5. Convert all your words to lower case.

To carry out the above steps, follow the code snippets given below.

Import the libraries and the sample Twitter data set provided by the NLTK (Natural Language Toolkit) package, which contains 5,000 positive and 5,000 negative tweets. We also import a few additional libraries that help us work with regular expressions in Python.

import re
import string
import numpy as np
import nltk
from nltk.corpus import stopwords, twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# the corpora only need to be downloaded once
nltk.download('twitter_samples')
nltk.download('stopwords')

Here we remove stop words (words which don’t add any value to the model; without them the model provides the same accuracy, e.g. ‘the’, ‘is’, ‘are’) and carry out stemming (removing word suffixes in order to reduce the vocabulary size). We also import the English stop word list from the NLTK library.

Note: Here we also tokenize the string into a list of words after removing retweet markers, hashtags, and URLs.

# Preprocessing tweets
def process_tweet(tweet):
    # Remove old-style retweet text "RT"
    tweet2 = re.sub(r'^RT[\s]', '', tweet)

    # Remove hyperlinks
    tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)

    # Remove hashtags
    # Only removing the hash # sign from the word
    tweet2 = re.sub(r'#', '', tweet2)

    # instantiate tokenizer class
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

    # tokenize the tweet
    tweet_tokens = tokenizer.tokenize(tweet2)

    # the English stop word list from NLTK
    stopwords_english = stopwords.words('english')

    # creating a list of words without stop words and punctuation
    tweets_clean = []
    for word in tweet_tokens:
        if word not in stopwords_english and word not in string.punctuation:
            tweets_clean.append(word)

    # instantiate the stemming class
    stemmer = PorterStemmer()

    # creating a list of stems of the words in the tweet
    tweets_stem = []
    for word in tweets_clean:
        stem_word = stemmer.stem(word)
        tweets_stem.append(stem_word)

    return tweets_stem
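
A quick check of the preprocessing on a made-up tweet (the exact tokens can vary slightly with the NLTK version):

# example tweet, not from the data set
sample = "RT @user I am LOVING this sunny day! #happy https://example.com"
print(process_tweet(sample))
# expected output, roughly: ['love', 'sunni', 'day', 'happi']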

Building Frequency dictionary

Now, we will create a function that takes tweets and their labels as input, goes through every tweet, preprocesses it, counts the occurrence of every word in the data set, and creates a frequency dictionary.

Note: The np.squeeze call is necessary; without it, each label would come out as a one-element list rather than a number, and the (word, label) pairs could not be used as dictionary keys.

# Frequency-generating function
def build_freqs(tweets, ys):
    yslist = np.squeeze(ys).tolist()

    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1

    return freqs

The required functions for processing tweets are ready, now let's build our logistic regression model.

Sigmoid Function

Logistic regression makes use of the sigmoid function, which outputs a probability between 0 and 1. The sigmoid function with weight parameter θ and input x^(i) is defined as follows:

h(x^(i), θ) = 1/(1 + e^(-θ^T x^(i)))

The sigmoid function gives values between 0 and 1, so we can classify the predictions using a particular cutoff (say 0.5).

Note that as θ^T x^(i) gets closer and closer to −∞, the denominator of the sigmoid function gets larger and larger and, as a result, the sigmoid gets closer to 0. On the other hand, as θ^T x^(i) gets closer and closer to ∞, the denominator of the sigmoid function gets closer to 1 and, as a result, the sigmoid also gets closer to 1.

As we have understood the sigmoid function now let's code it!

Note: The function should work for a scalar as well as an array

def sigmoid(z):
    '''
    Input:
        z: the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    # calculate the sigmoid of z
    h = 1/(1 + np.exp(-z))

    return h
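
A quick check that the function behaves as expected for both a scalar and an array:

print(sigmoid(0))                     # 0.5
print(sigmoid(np.array([-5, 0, 5])))  # approximately [0.0067 0.5    0.9933]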

Cost Function and Gradient Descent

The logistic regression cost function is defined as

J(θ) = (−1/m) * Σ_{i=1..m} [ y^(i) log(h(x^(i), θ)) + (1 − y^(i)) log(1 − h(x^(i), θ)) ]

We aim to reduce the cost by updating θ using the following rule:

θ_j := θ_j − α * ∂J(θ)/∂θ_j

Here, α is called the learning rate. The above process of forming the hypothesis (h) using the sigmoid function and updating the weights (θ) using the derivative of the cost function and a specific learning rate is called the Gradient Descent algorithm.

Note: You initialize your parameters θ, use them in the sigmoid to make predictions, compute the gradient and use it to update θ, and then calculate the cost. You keep repeating these steps until the cost is low enough.

Let's code what we learned.

def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m, n+1)
        y: corresponding labels of the input matrix x, dimensions (m, 1)
        theta: weight vector of dimension (n+1, 1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''
    m = len(x)

    for i in range(0, num_iters):

        # get z, the dot product of x and theta
        z = np.dot(x, theta)

        # get the sigmoid of z
        h = sigmoid(z)

        # calculate the cost function
        J = (-1/m)*(np.dot(y.T, np.log(h)) + np.dot((1-y).T, np.log(1-h)))

        # update the weights theta
        theta = theta - (alpha/m)*np.dot(x.T, h-y)

    J = float(J)
    return J, theta

Now, let's create a function that extracts features from a tweet using the ‘freqs’ dictionary and the preprocessing function defined above (process_tweet).

def extract_features(tweet, freqs):
    '''
    Input:
        tweet: a string containing one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1, 3)
    '''
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    # bias term is set to 1
    x[0, 0] = 1

    # loop through each word in the list of words
    for word in word_l:

        # increment the word count for the positive label 1
        x[0, 1] += freqs.get((word, 1), 0)

        # increment the word count for the negative label 0
        x[0, 2] += freqs.get((word, 0), 0)

    assert(x.shape == (1, 3))
    return x

Now, we will load the data set from NLTK and split it into a training set and a test set.

# load the positive and negative tweets from the NLTK twitter_samples corpus
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
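
The feature extraction step below relies on the frequency dictionary, so we build it from the training set with the build_freqs function defined earlier:

# build the (word, sentiment) -> count dictionary from the training tweets
freqs = build_freqs(train_x, train_y)
print("Number of (word, sentiment) pairs:", len(freqs))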

Now that all the required functions are ready, we can finally train our model on the training data set and test it on the test data set.

# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(theta)]}")

J is the final cost and “theta” holds the final weights after training the model.

Before testing on the test data set, let's sanity-check the feature extraction on a training example.

# Check your function
# test 1: test on training data
tmp1 = extract_features(train_x[0], freqs)
print(tmp1)

# Expected output:
# [[1.00e+00 3.02e+03 6.10e+01]]

Let's write two more functions. The first, given a tweet, predicts the result using the ‘freqs’ dictionary and theta. The second uses the predict function to compute the accuracy of the model on the given test data set.

def predict_tweet(tweet, freqs, theta):
    '''
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability of the tweet being positive
    '''
    # extract the features of the tweet and store them in x
    x = extract_features(tweet, freqs)

    # make the prediction using x and theta
    z = np.dot(x, theta)
    y_pred = sigmoid(z)

    return y_pred


def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    # the list for storing predictions
    y_hat = []

    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            # append 1 to the list
            y_hat.append(1)
        else:
            # append 0 to the list
            y_hat.append(0)

    # y_hat is a list, but test_y is an (m, 1) array;
    # convert both to one-dimensional arrays to compare them with the '==' operator
    y_hat = np.array(y_hat)
    test_y = test_y.reshape(-1)
    accuracy = np.sum((test_y == y_hat).astype(int)) / len(test_x)

    return accuracy
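
We can now evaluate the trained model by calling the test function defined above on the held-out tweets:

tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")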

On testing the model on the test data set, we get an accuracy of 99.5%.

Sentiment Analysis using Naive Bayes

The Naive Bayes algorithm is based on Bayes' rule, which can be represented as follows:

P(X|Y) = P(Y|X) * P(X) / P(Y)

Here, the process up to creating the frequency dictionary (importing libraries, preprocessing, etc.) is the same. The algorithm then works as follows:

  1. Find the log of the ratio of the number of positive tweets to the number of negative tweets, i.e.

logprior = log(P(D_pos)) − log(P(D_neg)) = log(D_pos) − log(D_neg)

  2. Instead of keeping the raw frequency of each word under the positive and negative labels, we take the ratio of its frequency under a label to the total number of word occurrences under that label. This gives the probability of the word occurring given that the tweet is positive/negative.

  3. Then we compute another quantity called the loglikelihood. It is the log of the ratio of the positive probability of a particular word to its negative probability. But what if one of these probabilities is zero (i.e. the word's frequency is zero in either the positive or the negative case)? The log would become +/− infinity. To overcome this we use additive smoothing; the Wikipedia article on additive smoothing explains it in more detail.

Therefore, to compute the positive probability and the negative probability for a specific word in the vocabulary, we’ll use the following inputs:

  • freq_pos and freq_neg are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label 1.
  • N_pos and N_neg are the total numbers of positive and negative words across all documents (all tweets), respectively.
  • V is the number of unique words in the entire set of documents, across all classes, whether positive or negative.

We’ll use these to compute the positive and negative probability for a specific word using this formula:

P(W_pos) = (freq_pos + 1)/(N_pos + V)

P(W_neg) = (freq_neg + 1)/(N_neg + V)

Notice that we add the “+1” in the numerator for additive smoothing.

And the loglikelihood can be represented as:

loglikelihood = log(P(W_pos)/P(W_neg))
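
As a tiny worked example with made-up counts (say N_pos = N_neg = 100 and V = 50), a word that appears 20 times in positive tweets and twice in negative tweets would get:

freq_pos, freq_neg = 20, 2       # hypothetical counts for one word
N_pos, N_neg, V = 100, 100, 50   # hypothetical totals

p_w_pos = (freq_pos + 1) / (N_pos + V)           # 21 / 150 = 0.14
p_w_neg = (freq_neg + 1) / (N_neg + V)           # 3 / 150  = 0.02
loglikelihood_word = np.log(p_w_pos / p_w_neg)   # log(7) ≈ 1.95, a strongly "positive" word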

That's it! We just need to code the above in order to train our Naive Bayes model. So, first, let's write a function that does all of the above work.

def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0, 1)
    Output:
        logprior: the log prior
        loglikelihood: the log likelihood of your Naive Bayes equation
    '''
    loglikelihood = {}
    logprior = 0

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            # increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[pair]
        # else, the label is negative
        else:
            # increment the number of negative words by the count for this (word, label) pair
            N_neg += freqs[pair]

    # calculate D, the number of documents
    D = len(train_y)

    # calculate D_pos, the number of positive documents (np.sum gives a scalar here)
    D_pos = np.sum(train_y)

    # calculate D_neg, the number of negative documents
    D_neg = D - D_pos

    # calculate the logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # for each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)

        # calculate the probability that the word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    return logprior, loglikelihood


logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

Predicting using Naive Bayes

In order to predict the sentiment of a tweet, we simply sum the loglikelihoods of the words in the tweet together with the logprior. If the value is positive, the tweet expresses positive sentiment; if the value is negative, the tweet expresses negative sentiment.

So let's write the prediction function (which takes a tweet, the loglikelihood dictionary, and the logprior, and returns the prediction) and a testing function (to evaluate the model on the test data set).

def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)
    '''
    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # initialize the score to zero
    p = 0

    # add the logprior
    p += logprior

    for word in word_l:
        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the score
            p += loglikelihood[word]

    return p


def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: a list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihood for each word
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    accuracy = 0  # return this properly
    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0
        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # error is the average of the absolute differences between y_hats and test_y
    # (flatten test_y so the shapes line up before subtracting)
    error = np.mean(np.absolute(np.array(y_hats) - np.squeeze(test_y)))

    # accuracy is 1 minus the error
    accuracy = 1 - error

    return accuracy
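
For reference, the accuracy is obtained by calling the test function on the held-out tweets:

nb_accuracy = test_naive_bayes(test_x, test_y, logprior, loglikelihood)
print(f"Naive Bayes model's accuracy = {nb_accuracy:.4f}")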

On testing the model on the test data set we get an accuracy of 99.4%, which is slightly lower. This may be due to the assumptions that the Naive Bayes algorithm makes; in fact, it is called “Naive” because of these assumptions.

The assumptions are as follows:-

  1. Independence assumption

For example, the words “sunny” and “hot” tend to depend on each other and are correlated to a certain extent with the word “desert”. Naive Bayes, however, assumes independence throughout. Furthermore, if you were asked to fill in a blank in a sentence about the seasons, this naive model would assign equal weight to the words “spring”, “summer”, “fall” and “winter”.

2. Relative frequencies

On Twitter, there are usually more positive tweets than negative ones. However, some “clean” data sets you may find are artificially balanced to contain the same number of positive and negative tweets. Just keep in mind that, in the real world, the data can be much noisier.

Conclusion

From the above results, we can see that the Logistic Regression model performed slightly better than the Naive Bayes model. This may be because Logistic Regression does not make as many assumptions as Naive Bayes does.
