
Multinomial Naive Bayes classifier using pointwise mutual information

A multinomial classifier with feature selection in Python

Photo by v2osk on Unsplash

Text classification means assigning documents to a list of categories based on the content of each document. We can improve the performance of a classifier if we select the training features in a way that maximizes the information gain. Pointwise Mutual Information (PMI) is a feature-scoring metric that estimates the association between a feature and a class. You can read this article to learn more about PMI.
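
For a word w and a class c, PMI compares how often they occur together with how often they would occur together if they were independent:

PMI(w, c) = log( P(w, c) / (P(w) · P(c)) )

A positive score means the word appears in documents of class c more often than chance, so it carries information about that class.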

If we treat words as features and document categories as classes, we can use PMI to reduce the dimensionality of the document space. In other words, when we convert documents with different topics into a bag of words (a document-term matrix), the feature set contains a huge number of words, but only a small subset of them is helpful for predicting the topic of a document. The PMI score between a word and a category represents the information gain of that word given the topic. Thus, we can calculate the PMI score of every word for each topic in a corpus, rank the words in each topic by their PMI scores, and finally select the top k words per topic. With this approach, we reduce the dimensionality of the training data while maximizing the information gain of the selected words. In this article, I will show you how to use PMI in a Naïve Bayes classifier. Multinomial Naïve Bayes is a probabilistic, supervised learning method that is widely used for text classification.


Multinomial Naïve Bayes

Suppose that we have a set of classes C = {c_1, c_2, …, c_|C|}, where |C| is the number of classes, and a vocabulary V = {w_1, w_2, …, w_|V|}, where |V| is the total number of vocabulary words. We also have a set of documents D = {d_1, d_2, …, d_|D|} distributed among the classes, where |D| is the number of documents. Each document d contains a set of words W = {w_1, w_2, …, w_|W|}, where W is a subset of V and |W| is the number of words in d. The method calculates the posterior probability of a class given a document, P(c|d), and assigns the document to the class with the highest probability. In other words, we choose the category of a document so as to maximize the joint probability of the document’s words given the class. This is known as the maximum likelihood function.
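
Under the naïve conditional-independence assumption, this means choosing

c* = argmax_c P(c|d) = argmax_c [ P(c) · ∏_{w_t ∈ W} P(w_t|c) ]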

Since multiplying a set of probabilities may lead to floating-point underflow, it is common to apply a logarithm to the maximum likelihood function. The log-maximum likelihood function for Naïve Bayes is therefore:
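
c* = argmax_c [ log P(c) + Σ_{w_t ∈ W} log P(w_t|c) ]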

We use the relations below to estimate P(c) and P(w_t|c). It is important to mention that we estimate these two probabilities from the training set. Suppose N is the total number of documents in the corpus and N_c is the number of documents in class c; then:
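
P(c) = N_c / N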

Also, given that n(c, w_t) is the number of occurrences of w_t in the training documents of class c (counting repetitions) and n(c) is the total number of words in class c, then:
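
P(w_t|c) = n(c, w_t) / n(c)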

To avoid zero probabilities (a single zero factor in the joint probability of the maximum likelihood function makes the whole result zero), we can add a Laplace correction to the above formula. Then we have:
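
P(w_t|c) = (n(c, w_t) + 1) / (n(c) + |V|)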

In the Naïve Bayes classifier with Pointwise Mutual Information, instead of estimating the probability of every word given a class, we only use the words that rank in the top k by PMI score. To do so, we first select a list of words (features) that maximizes the information gain based on their PMI scores and then apply the Naïve Bayes classifier to this set of words.


A case study in Python

Here is step-by-step Python code to apply this classifier. Since this article focuses on the Multinomial Naïve Bayes classifier using PMI, I will not cover how to convert documents into a bag of words. Instead, we assume that we already have a vector-space matrix with documents as rows and words as columns, plus an extra column, "label", that identifies the category of each document. Here is the head of the dataset:

"Image by author": A sample data set with 1426 words as features
"Image by author": A sample data set with 1426 words as features

To start the process, we first need to split the dataset into a training set and a test set. We use 30% of the dataset as the test set.

import pandas as pd
from sklearn.model_selection import train_test_split

cl_dt = pd.read_csv(r'C:...Classification_data.csv')
X_train, X_test = train_test_split(cl_dt, test_size=0.30, random_state=42)

Now, to calculate the PMI score of each word, we first need a dictionary that stores, for each word, the number of documents in each category that contain it.

"list_token" is a function that returns a list of words in a document.

def list_token(doc, ack):
    temp1 = pd.DataFrame(doc).transpose()
    # keep only the columns (words) with a non-zero count in this document
    temp2 = temp1.loc[:, (temp1 != 0).any(axis=0)]
    if ack == 0:
        freq = temp2.drop(['Doc_ID', 'label'], axis=1)
    else:
        freq = temp2.drop(['label'], axis=1)
    list_words = freq.columns.tolist()
    return list_words
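
As a quick check, "list_token" applied to a hypothetical row (the column names below are made up for illustration) keeps only the words with non-zero counts:

sample = pd.Series({'Doc_ID': 'Doc1', 'get': 2, 'split': 0, 'error': 1, 'label': 'ntw'})
print(list_token(sample, 0))  # ['get', 'error'] -- zero-count words and metadata columns are dropped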

We use this function to build the dictionary:

word_cl_dic = {}
n = len(X_train)
for i in range(0, len(X_train)):
    dt = X_train.iloc[i, ]
    lst_words = list_token(dt, 0)
    cl = dt['label']
    for w in set(lst_words):
        word_cl_dic[w] = word_cl_dic.get(w, {})
        word_cl_dic[w][cl] = word_cl_dic[w].get(cl, 0)
        word_cl_dic[w][cl] += 1

This is a part of "word_cl_dic" as output:

{'get': {'cloud': 51, 'ntw': 47, 'img': 18, 'inf': 56, 'nlp': 28},
 'environ': {'cloud': 18, 'img': 9, 'inf': 16, 'ntw': 14, 'nlp': 10},
 'error': {'ntw': 23, 'img': 4, 'inf': 20, 'cloud': 8, 'nlp': 9},
 'flush': {'ntw': 6, 'img': 6, 'nlp': 6, 'inf': 6, 'cloud': 6},
 'struct': {'ntw': 8, 'img': 1, 'inf': 4, 'cloud': 2},
 'split': {'ntw': 25, 'cloud': 22, 'img': 22, 'nlp': 36, 'inf': 24},
 'connect': {'ntw': 12, 'img': 7, 'inf': 9, 'nlp': 6, 'cloud': 1},
 'close': {'ntw': 26, 'img': 12, 'inf': 16, 'nlp': 16, 'cloud': 8},
 'sys': {'ntw': 29, 'cloud': 13, 'img': 23, 'inf': 20, 'nlp': 25},
 'select': {'ntw': 2, 'inf': 13, 'img': 2, 'cloud': 1, 'nlp': 1},
 'logging': {'ntw': 12, 'img': 12, 'nlp': 16, 'inf': 21, 'cloud': 16},...}

Now, we calculate the PMI of each word given a class and then choose the top k=100 words per class:

from math import log
import operator

word_features = {}
k = 100  # number of top words to keep per class

# number of documents in each class, N_c
num_doc_cl = X_train.groupby(['label'])['Doc_ID'].count()

for w in word_cl_dic.keys():
    dic = word_cl_dic[w]
    n_kw = 0  # number of documents in the trainset that contain word "w"
    for cl in dic.keys():
        n_kw += dic[cl]
    for cl in dic.keys():
        word_features[cl] = word_features.get(cl, [])
        n_kw_cl = dic[cl]  # number of documents in class "cl" that contain "w"
        n_dev_cl = num_doc_cl.loc[cl]  # number of documents in class "cl"

        p_w = n_kw / n
        p_cl = n_dev_cl / n
        p_w_cl = n_kw_cl / n

        pmi = log(p_w_cl / (p_w * p_cl))  # pointwise mutual information score

        # keep only the k words with the highest PMI score per class
        if len(word_features[cl]) < k:
            word_features[cl].append((w, pmi))
            if len(word_features[cl]) == k:
                word_features[cl].sort(key=operator.itemgetter(1), reverse=True)
        else:
            if word_features[cl][k - 1][1] < pmi:
                word_features[cl][k - 1] = (w, pmi)
                word_features[cl].sort(key=operator.itemgetter(1), reverse=True)

The result in "word_features" shows a list of top k=100 words in each class based on their PMI score:

{'cloud': [('delete_function', 1.5032890640590577),
  ('describe_stacks', 1.5032890640590577),
  ('s3', 1.5032890640590577),
  ('create_stack', 1.5032890640590577),
  ('update_stack', 1.5032890640590577),
  ('delete_stack', 1.5032890640590577),
  ('tag_resource', 1.5032890640590577),
  ('list_tags_for_resource', 1.5032890640590577),
  ('untag_resource', 1.5032890640590577),
  ('delete_policy', 1.5032890640590577),
  ('get_waiter', 1.5032890640590577),
  ('validate_configuration', 1.5032890640590577),
  ('codebuild', 1.5032890640590577),
  ('rekognition', 1.5032890640590577),
  ('apigateway', 1.5032890640590577),
  ('xray', 1.5032890640590577),
  ('codepipeline', 1.5032890640590577),...}

Here we create a list of all top k words in all classes:

word_list=[]
for cat in word_features.keys():
    word_features[cat]=dict(word_features[cat])
    for w in word_features[cat]:
        word_list.append(w)

Now that we have found the top k=100 words in each class that give us the maximum information gain, it’s time to train our classifier. We build two dictionaries: the first stores n(c, w_t) for each word in "word_features" (the number of occurrences of w_t in the training documents of class c), and the second stores n(c) (the total number of words in class c):

dt_per_cl = X_train.groupby(['label']).sum().reset_index()
cl_word_count_dic = {}
cl_word_dic = {}
for j in range(0, len(dt_per_cl)):
    cl = dt_per_cl.iloc[j]['label']
    temp_per_cl = dt_per_cl.iloc[j, ]
    list_word = list_token(temp_per_cl, 1)
    # keep only the words selected as features for this class
    list_words = [w for w in list_word if word_features[cl].get(w, -100000) != -100000]

    cl_word_count_dic[cl] = cl_word_count_dic.get(cl, 0)
    cl_word_dic[cl] = cl_word_dic.get(cl, {})
    sum_all = 0
    for w in set(list_words):
        cl_word_dic[cl][w] = temp_per_cl[w]  # n(c, w_t)
        sum_all += temp_per_cl[w]

    cl_word_count_dic[cl] = sum_all  # n(c)

Here, we calculate |V|:

vocab_length = 0
for cl in cl_word_dic.keys():
    length = len(cl_word_dic[cl])
    vocab_length += length

Now, it’s time to apply the log-maximum likelihood function to the test set to predict the category of each document. We can either maximize the likelihood function or minimize its negative; in our approach, we minimize the negative log-likelihood.

li_results = []
for i in range(0, len(X_test)):

    minimum_neg_log_prob = 1000000000
    min_category = ''
    dt = X_test.iloc[i, ]
    list_word = list_token(dt, 0)
    list_words = [w for w in list_word if w in word_list]
    cl_doc = dt['label']  # ground-truth class
    Doc_ID = dt['Doc_ID']

    for cl in cl_word_count_dic:
        # this is -log p(c); n is the total number of documents in the trainset
        neg_log_prob = -log(num_doc_cl.loc[cl] / n)

        word_dict = cl_word_dic[cl]
        count_cl = cl_word_count_dic[cl]
        for w in list_words:
            count_word_train = word_dict.get(w, 0)
            # this is the Laplace correction of p(w_t|c)
            ratio = (count_word_train + 1) / (count_cl + vocab_length)
            # minimizing the negative version
            neg_log_prob -= log(ratio)

        # keep the class with the smallest negative log-likelihood
        if minimum_neg_log_prob > neg_log_prob:
            min_category = cl
            minimum_neg_log_prob = neg_log_prob

    li_results.append((Doc_ID, min_category, cl_doc))

The result shows each document with its predicted class in the second column (min_category) and the ground truth in the third column (cl_doc):

[('Doc868', 'cloud', 'img'),
 ('Doc440', 'ntw', 'ntw'),
 ('Doc334', 'cloud', 'cloud'),
 ('Doc736', 'nlp', 'nlp'),
 ('Doc785', 'img', 'img'),
 ('Doc523', 'nlp', 'nlp'),
 ('Doc226', 'cloud', 'nlp'),
 ('Doc68', 'cloud', 'inf'),
 ('Doc12', 'cloud', 'ntw'),
 ('Doc549', 'cloud', 'cloud'),...]

We calculate the performance of the classifier as follows by comparing the last two columns of "li_results":

precision= 0.5929707286187414
recall= 0.5346135118242998
F-measure= 0.5622820047774385
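
The scoring code itself is not shown above; a minimal sketch, assuming macro-averaged scores computed with scikit-learn from the predicted and true labels stored in "li_results", could look like this:

from sklearn.metrics import precision_recall_fscore_support

y_pred = [pred for _, pred, _ in li_results]  # predicted class (min_category)
y_true = [true for _, _, true in li_results]  # ground-truth class (cl_doc)

precision, recall, f_measure, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
print('precision=', precision)
print('recall=', recall)
print('F-measure=', f_measure)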

Conclusion

In this article, we presented how to select document features in a way that maximizes the information gain those features provide about the document categories. We used the Multinomial Naive Bayes method as the classifier and applied Pointwise Mutual Information (PMI) for feature selection. All the code for this article is available on GitHub.


References

[1] Schneider, K. M. (2005, October). Weighted average pointwise Mutual Information for feature selection in text categorization. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 252–263). Springer, Berlin, Heidelberg.

[2] Manning, C., Raghavan, P., & Schütze, H. (2008). Text classification and Naive Bayes. In Introduction to Information Retrieval (pp. 234–265). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511809071.014

[3] https://github.com/parthasm/Naive-Bayes-Document-Classifier

