Feature scoring metrics in a word-document matrix

A comparison of TF-IDF and Pointwise Mutual Information (PMI)

Arghavan Moradi
Towards Data Science


TF-IDF is a very common feature scoring metric for measuring the importance of a word in a document. It gives a lower score to words such as “this” or “is” that are common across documents, and a higher score to more specific words such as “ethernet” in a document on the “network” topic. In mathematical form, the TF-IDF score of a word “w” in a document “d” from a set of documents “D” is calculated as below:

TF-IDF(w, d, D) = tf(w, d) · log( |D| / df(w) )

where tf(w, d) is the number of times “w” occurs in “d”, |D| is the total number of documents, and df(w) is the number of documents that contain “w”.

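As a quick illustration, here is a minimal Python sketch of this formula; the function name and the toy counts are assumptions for illustration, not code from the original article.

```python
import math

def tfidf(tf: int, df: int, n_docs: int) -> float:
    """Term frequency times log(total documents / documents containing the word)."""
    return tf * math.log(n_docs / df)

# Toy counts: a word occurring 4 times in one document,
# and present in 2 of 9 documents overall.
print(tfidf(tf=4, df=2, n_docs=9))  # 4 * log(4.5) ~ 6.02
```
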
But what if there are “C” categories of documents in the set “D”, and we want to calculate the importance of a word with respect to the category “c” of document “d”? To solve this problem, we can use another group of feature scoring metrics known as “association measures”. One of the most common association measures is “Pointwise Mutual Information (PMI)”, a popular score for feature selection in text classification. For example, we can estimate whether the tone of a text message is positive or negative by calculating the PMI score of its words with respect to the “positive” and “negative” categories. Another use of this metric, and the focus of this article, is to normalize a word-document vector-space matrix so that it reflects the importance of a word “w” in a category “c”. Here, only the presence or absence of a word in the documents of a specific category matters. In mathematical form, the PMI score between a word (feature) “w” and a category (topic or class) “c” is calculated as follows:

PMI(w, c) = log( p(w|c) / p(w) )

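A minimal Python sketch of this score, assuming presence/absence counts as inputs (the function name and the numbers are illustrative, not from the article):

```python
import math

def pmi(docs_with_w_in_c: int, docs_in_c: int,
        docs_with_w: int, n_docs: int) -> float:
    """PMI(w, c) = log(p(w|c) / p(w)), with both probabilities
    estimated from document presence/absence counts."""
    p_w_given_c = docs_with_w_in_c / docs_in_c
    p_w = docs_with_w / n_docs
    return math.log(p_w_given_c / p_w)

# Toy counts: "w" appears in 3 of 3 documents of category "c",
# and in 3 of 9 documents overall.
print(pmi(3, 3, 3, 9))  # log(1 / (1/3)) = log(3) ~ 1.10
```
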
p(w|c) is the fraction of documents in category “c” that contain the word “w”. Similarly, p(w) is the fraction of documents that contain the word “w” out of the total number of documents, regardless of their categories. PMI gives a higher score to words that are frequent in a specific topic. Suppose we have two words, w1 and w2, such that the probability of w1 in documents of category c_i is equal to the probability of w2 in documents of the same category:

p(w1|c_i) = p(w2|c_i)

Now suppose that w1 is more specific to category c_i, while w2 is a more general word that appears across all categories. Since w2 then occurs in more documents overall, we have:

p(w1) < p(w2)

Thus, if we calculate PMI for these two words, the score assigned to w1 is higher than the score assigned to w2:

PMI(w1, c_i) > PMI(w2, c_i)

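To make this concrete, here is a small numeric check of that claim; the counts are invented for illustration.

```python
import math

# Both words appear in 2 of the 3 documents of category c_i,
# so p(w1|c_i) = p(w2|c_i) = 2/3.
p_w_given_c = 2 / 3

# w1 is specific to c_i: it appears in 2 of 9 documents overall.
# w2 is general: it appears in 6 of 9 documents overall.
p_w1, p_w2 = 2 / 9, 6 / 9

print(math.log(p_w_given_c / p_w1))  # PMI(w1, c_i) = log(3) ~ 1.10
print(math.log(p_w_given_c / p_w2))  # PMI(w2, c_i) = log(1) = 0.0
```
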
A sample case

Now, let's compare these two scores with a simple example. In our sample, we have 9 documents (D = 9) in 3 different categories: network, biology, and mathematics (C = 3). In addition, we track two words: “authentication” and “evaluation”.

[Figure: a sample word-document matrix]

If we calculate the TF-IDF of these two words in Doc1, we find that the TF-IDF scores of “authentication” and “evaluation” in “Doc1” are equal. However, “Doc1” belongs to the “network” category, and the word “authentication” is clearly more related to “network”.

Now, if we calculate the PMI between each of these words and the “network” category, the PMI score of “authentication” is greater than the PMI score of “evaluation” in documents of that category. Note that we only consider the presence/absence of a word in a document when calculating PMI.

If a word never occurs in a category, p(w|c) is zero and the logarithm is undefined, so we can add a small epsilon inside the logarithm. Thus:

PMI(w, c) = log( p(w|c) / p(w) + ε )

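Below is a sketch of the whole sample case in Python. Since the original matrix is an image whose exact counts are not reproduced here, the presence/absence data below is invented to match the story: both words occur in 3 of the 9 documents (so their document frequencies can tie in Doc1), but “authentication” is concentrated in the network category.

```python
import math

EPS = 1e-9

# Hypothetical presence/absence data: 9 documents in 3 categories.
docs = {
    "Doc1": ("network",     {"authentication", "evaluation"}),
    "Doc2": ("network",     {"authentication"}),
    "Doc3": ("network",     {"authentication"}),
    "Doc4": ("biology",     {"evaluation"}),
    "Doc5": ("biology",     set()),
    "Doc6": ("biology",     set()),
    "Doc7": ("mathematics", {"evaluation"}),
    "Doc8": ("mathematics", set()),
    "Doc9": ("mathematics", set()),
}

def pmi(word: str, category: str) -> float:
    """PMI(w, c) = log(p(w|c) / p(w) + eps) from presence/absence counts."""
    in_c = [words for cat, words in docs.values() if cat == category]
    p_w_given_c = sum(word in words for words in in_c) / len(in_c)
    p_w = sum(word in words for _, words in docs.values()) / len(docs)
    return math.log(p_w_given_c / p_w + EPS)

# Both words appear in 3 of 9 documents, so their document frequencies
# (and hence TF-IDF in Doc1) tie, but PMI separates them for "network".
print(pmi("authentication", "network"))  # log(1 / (1/3)) ~ 1.10
print(pmi("evaluation", "network"))      # log((1/3) / (1/3)) ~ 0.0
```
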
Conclusion

In this article, we compared two feature scoring metrics: TF-IDF and Pointwise Mutual Information (PMI). TF-IDF measures the importance of a word in a document, while the PMI score measures the importance of a word in a category; in other words, it captures the association between a word and a class. It is worth mentioning that there is another use of PMI: calculating the association between different words in a corpus of documents for the purpose of dimensionality reduction. You can read the article in [3] for more information.

References

[1] Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 31–40.

[2] Forman, G. (2008, October). BNS feature scaling: an improved representation over tf-idf for SVM text classification. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (pp. 263–270).

[3] Understanding Pointwise Mutual Information in NLP. https://medium.com/dataseries/understanding-pointwise-mutual-information-in-nlp-e4ef75ecb57a
