Using clustering to improve classification — a use case

Boosting Naive Bayes for text classification

Dr. Saptarsi Goswami
Towards Data Science

--

In today’s blog, we are going to give the intuition behind one of our early articles, published in the Hindawi journal “International Scholarly Research Notices” [Reference 1]. The plot of the story is given in the title. The genre is text classification. The main protagonists are naive Bayes and k-means.

This article will serve a couple of purposes

  • Motivate you to try out your own intuitions (maybe a little crazy at the outset, like using clustering for classification; I actually got a scolding from one of my professors) and convert them into a formal approach
  • Serve as a blueprint for converting an intuition into a full-blown research article. The most important part is how systematically you establish that your method works.

It can be divided into the following tasks

Task 1: Establish that Naive Bayes needs an improvement

We picked three publicly available datasets, picked three other ML algorithms, and did an empirical study.

Fig 1: Naive Bayes vs Other Classifiers (Source: Author)

This was enough to illustrate that naive Bayes indeed performs poorly compared to the other algorithms.

Tip: give as many procedural details as possible for reproducibility, and pick algorithms that are popular enough to be meaningful comparison points.

Is it still worth the effort? I will give two arguments: first, naive Bayes is one of the easiest methods to understand; second, it is quite fast.

For a quick understanding of naive Bayes, you can use Reference 3.

Task 2: Developing the intuition

This does not have a set formula. Our starting point was that naive Bayes assumes the features (attributes) are independent of each other. This is one of the reasons it is very fast; at the same time, it is also a limiting weakness. For text classification, the words are our features. To be specific, the age-old tf-idf technique is used to convert the text into numbers. For more details, you can go through Reference 5.
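To make that concrete, here is a minimal sketch of the tf-idf plus naive Bayes setup, assuming scikit-learn; the tiny corpus and labels below are placeholders of mine, not the datasets used in the paper.

```python
# A minimal sketch of the tf-idf + naive Bayes baseline (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the pizza was great", "the model failed to converge",
        "great crust and cheese", "training loss did not converge"]
labels = [0, 1, 0, 1]  # toy labels: 0 = food review, 1 = ML note

# tf-idf turns each document into a weighted bag-of-words vector;
# MultinomialNB then treats every word as an independent feature.
baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(docs, labels)
print(baseline.predict(["cheese pizza with thin crust"]))
```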

The intuition was: instead of using all the words, can we give the algorithm a set of words that are relatively independent of each other? In other words, if it wants pizza, serve it pizza rather than a full-course buffet.

How do we find words which are independent or relatively independent?

Embedding: the first step was to have a feature representation of the words. Our point was that if two words have similar frequencies across documents, then they are similar. Let’s use an illustration.

Figure 2: Embedding of the words (Source: Author)

If we look at the above table and assume that a tick mark indicates the presence and a cross the absence of a word in a document, then we can conclude that

  • Word 1 and Word 2 are not similar
  • Word 2 and Word 3 are similar

Now, if we extend this intuition loosely, we would not be very wrong to conclude that words that are not very similar are relatively independent.
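As a toy version of this idea (assuming scikit-learn; the presence/absence matrix below is my own stand-in for Figure 2, not data from the paper), we can represent each word by its row in the transposed term-document matrix and compare rows:

```python
# Toy illustration: each word is represented by its presence/absence pattern
# across documents (a row of the transposed term-document matrix).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = word1, word2, word3; columns = documents; 1 = present, 0 = absent.
term_doc = np.array([
    [1, 0, 1, 0],  # word1
    [0, 1, 0, 1],  # word2
    [0, 1, 0, 1],  # word3
])

sim = cosine_similarity(term_doc)
print(sim[0, 1])  # word1 vs word2 -> 0.0: dissimilar, hence "relatively independent"
print(sim[1, 2])  # word2 vs word3 -> 1.0: identical occurrence pattern
```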

How to find a set of relatively independent words easily?

The number of words, as well as the number of documents, is huge. How do we possibly work this out on such a large matrix? Another intuition: if we can group the features based on similarity, can we not get a set of words (features) that are relatively independent? How do we do it? Doesn’t it look like a clustering problem? Bingo! We simply use k-means on the word embedding:

  • Form k clusters; each cluster will have a set of similar words
  • Pick one representative word from each cluster (the one nearest to the cluster mean), as in the sketch below
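A minimal sketch of this clustering step, assuming scikit-learn (the toy corpus and the choice of k are mine, and the chi-square filter from the paper is omitted here):

```python
# Cluster word vectors with k-means and keep the word nearest to each centroid.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

docs = ["cheese pizza crust", "pizza with extra cheese",
        "gradient descent converged", "the gradient step diverged"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)        # documents x words
word_vectors = X.T.toarray()       # words x documents: each row describes a word
words = vec.get_feature_names_out()

k = 3                              # number of clusters = number of words kept (a choice)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(word_vectors)

# The representative of each cluster is the word closest to the cluster centre.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, word_vectors)
selected_words = [words[i] for i in closest]
print(selected_words)
```

The selected words are then the only features passed on to naive Bayes.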

That’s the crux of the paper. There are some minor nuances, for which I recommend reading the paper. We first use a chi-square test to remove some features (more on that some other day); let’s just assume it helps weed out inconsequential words. We called this feature selection method FSCHICLUST.
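For the chi-square step, here is a hedged sketch using scikit-learn’s SelectKBest with the chi2 score; the paper’s exact thresholding may differ, and the toy corpus, labels, and k below are mine:

```python
# Keep only the words most associated with the class labels before clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheese pizza crust", "pizza with extra cheese",
        "gradient descent converged", "the gradient step diverged"]
labels = [0, 0, 1, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# k-means is then run on this reduced word set instead of the full vocabulary.
selector = SelectKBest(chi2, k=5).fit(X, labels)
kept = vec.get_feature_names_out()[selector.get_support()]
print(kept)
```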

We may be interested to know: by how much does this method reduce the number of words?

Fig 3: Feature Reduction Percentage achieved by FSCHICLUST (Source: Author)

We can see from Figure 3 that there is a dramatic reduction in the number of words being used. You may well be skeptical: with so few words, does it really work? We answer that now.

Task 3: Establish your scheme works

This is a tricky part, and we need to plan it out well by breaking it up.

  • Firstly, is it better than vanilla naive Bayes?
  • Secondly, if yes, is its performance comparable with the other classifiers?
  • Thirdly, does our method work better than naive Bayes combined with a standard feature selection algorithm?

First, we have to establish whether this reduced set of words gives better classification accuracy; this time we increased the number of datasets.
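To show the shape of such a comparison (not the paper’s actual experiments), here is a minimal sketch using scikit-learn and the public 20 Newsgroups data as a stand-in dataset, with SelectKBest standing in for FSCHICLUST’s cluster-based selection:

```python
# Compare vanilla naive Bayes against naive Bayes on a reduced word set.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

vanilla = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
reduced = make_pipeline(TfidfVectorizer(stop_words="english"),
                        SelectKBest(chi2, k=200), MultinomialNB())

# Cross-validated accuracy for each setup on the same data.
for name, model in [("vanilla NB", vanilla), ("NB + reduced words", reduced)]:
    scores = cross_val_score(model, data.data, data.target, cv=5)
    print(f"{name}: {np.mean(scores):.3f}")
```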

Fig 4: Vanilla Naive Bayes and Feature Selection using FSCHICLUST (Source: Author)

Secondly, does it do well compared to other classifiers? The classifiers were compared on the same datasets.

Fig 5: Improved Naive Bayes vs Other Classifiers (Source: Author)

Thirdly, does it do well compared to other feature selection algorithms?

Fig 6: Proposed method vs other feature selection algorithms (Source: Author)

So we see that all our questions are answered satisfactorily, and that may satisfy the reviewers as well.

Conclusion:

In this article, we discussed how naive Bayes can be improved for text classification using k-means.

  • We first established that vanilla naive Bayes does a poor job compared to other classifiers.
  • Then we developed a clustering-based feature selection scheme, which selects only a few words.
  • Finally, we demonstrated that the scheme works; as mentioned earlier, this has to be done systematically, which is probably the most important point.

References:

[1] Dey Sarkar S., Goswami S., Agarwal A., Aktar J., “A novel feature selection technique for text classification using Naive Bayes,” International Scholarly Research Notices, vol. 2014, 2014. (https://www.hindawi.com/journals/isrn/2014/717092/)

[2] Sarkar, Subhajit Dey, and Saptarsi Goswami, “Empirical study on filter-based feature selection methods for text classification,” International Journal of Computer Applications 81.6 (2013).

[3] https://towardsdatascience.com/all-about-naive-bayes-8e13cef044cf

[4] https://archive.ics.uci.edu/ml/datasets.php

[5] https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76

--

Assistant Professor, Computer Science, Bangabasi Morning College; Lead Researcher, University of Calcutta Data Science Lab; ODSC Kolkata Chapter Lead