How and Why TF-IDF Works

You’d like a numerical representation of how important a word is across a set of documents. Maybe you want to use this representation to summarize a current event from several articles just written about it, or to perform sentiment classification on Yelp reviews. After some googling, you’ve found TF-IDF and its mathematical definition. But how and why does it work?
Mathematical definition

- t=term
- d=document
- D=set of documents
TF-IDF defines the importance of a term by considering the importance of that term in a single document, then scaling it by that term’s importance across all documents.
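In symbols, with the notation above, that works out to:

tf-idf(t, d, D) = tf(t, d) × idf(t, D), where
tf(t, d) = (# of times t appears in d) / (total # of words in d)
idf(t, D) = log( (# of documents in D) / (# of documents in D that contain t) )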
Importance of a term in a document (term frequency): tf(t,d)
Term frequency answers the question: out of all the word occurrences in this document, how many are this particular word? In other words, how important is this word to this specific document?

Given a document containing only the sentence:
The cat is in the box.
You would say that the word ‘house’ appears 0 times out of all 6 words that appear in the document, or tf(‘house’, document1)=0/6=0.
Similarly, in a different document containing a single sentence:
Yes, the cat is in the box that’s in the house.
The word ‘house’ appears 1 time out of all 11 words that appear in the document, or tf(‘house’, document2)=1/11.
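Those two numbers are easy to reproduce in code. Here’s a minimal sketch of term frequency in Python, assuming a naive tokenizer that lowercases and splits on punctuation (purely illustrative, not how you’d tokenize in production):

```python
import re

def tokenize(text):
    # Lowercase, then split on anything that isn't a letter or apostrophe,
    # so "that's" stays one token and punctuation is dropped.
    return [w for w in re.split(r"[^a-z']+", text.lower()) if w]

def tf(term, document):
    # Term frequency: occurrences of the term / total number of words.
    words = tokenize(document)
    return words.count(term) / len(words)

doc1 = "The cat is in the box."
doc2 = "Yes, the cat is in the box that's in the house."

print(tf("house", doc1))  # 0.0       -> 0 out of 6 words
print(tf("house", doc2))  # 0.0909... -> 1 out of 11 words
```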
Importance of a term across all documents (inverse document frequency): idf(t,D)
Inverse document frequency answers the question: how common (or uncommon) is this word among all the documents I have?

We have 2 documents at this point. The resulting inverse document frequency of the word ‘house’ is idf(‘house’, D)=log(2/1), because it appears in 1 out of the 2 documents in our collection.
Similarly, the inverse document frequency of the word ‘the’ is idf(‘the’, D)=log(2/2). You might have the following questions:
For IDF, why do we take the inverse? Why do we use a logarithmic scale? And why is it multiplied by TF?
The ratio (total # of documents) / (# of documents that contain the word) is inverted to give a higher value to words that are less common across all the documents. In the example above, ‘house’ is the more uncommon word across our 2 documents, since it appears in one and not the other, so it gets a value of 2/1, or 2 (instead of 1/2). The word ‘the’ is the more common word, since it appears in every document, so it gets a value of 2/2, or 1. If these values weren’t inverted, ‘the’ would be deemed "more important" because it would have the higher value (1 > 1/2). This is why IDF matters to the overall calculation: it "takes care" of words like ‘the’ that naturally appear in the English language by giving them a lower IDF value. I’ll add more about what "takes care" means in a little bit.
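Here’s a quick sketch of that inversion, reusing the tokenize helper and the two example documents from the term-frequency sketch above (again, illustrative only):

```python
corpus = [doc1, doc2]

def doc_ratio(term, documents):
    # Uninverted ratio: (# of documents containing the term) / (total # of documents).
    containing = sum(1 for doc in documents if term in tokenize(doc))
    return containing / len(documents)

# Uninverted, the common word looks more important:
print(doc_ratio("the", corpus))    # 2/2 = 1.0
print(doc_ratio("house", corpus))  # 1/2 = 0.5

# Inverted, the rarer, more descriptive word wins:
print(1 / doc_ratio("the", corpus))    # 1.0
print(1 / doc_ratio("house", corpus))  # 2.0
```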
Running these values through the logarithm serves to put them on the "same scale" as the term frequency.

If IDF were not logarithmically scaled, high values would have an astronomical effect on the TF-IDF score. Imagine a term with an un-logged ratio of 2, and another term with a ratio of 4. Now imagine those same terms, but with ratios of 20 million and 40 million, respectively. Their resulting TF-IDF values would be tremendously different: the term with the higher ratio would completely overpower the other, and the other term might as well not be considered for importance at all. With the logarithmic scale, the effect of these values is "smoothed."

This way, TF-IDF values are on a more equal playing field.
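To see that smoothing numerically, here’s a tiny sketch (natural log here, purely for illustration; the base of the log doesn’t change the relative ordering):

```python
import math

for ratio in (2, 4, 20_000_000, 40_000_000):
    print(f"log({ratio:,}) = {math.log(ratio):.2f}")

# log(2) = 0.69
# log(4) = 1.39
# log(20,000,000) = 16.81
# log(40,000,000) = 17.50
```

A raw gap of 20 million collapses to a gap of about 0.7 on the log scale, comparable to the gap between ratios of 2 and 4.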
Coming back to how lower IDF values "take care" of words that naturally appear in the English language: a ratio of 1 means that a term appears in every single document in our collection, since (# of documents in our collection) = (# of documents the term appears in). This could occur with a term like ‘the’. Since log(1) = 0, the term is given a value of zero, and is thus "taken care" of by being removed as a candidate for term importance.
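Putting the ratio and the log together gives the full idf calculation. A minimal sketch, reusing tokenize and corpus from the earlier snippets (a real implementation would also guard against terms that appear in no document, which would divide by zero here):

```python
import math

def idf(term, documents):
    # log( (total # of documents) / (# of documents containing the term) )
    containing = sum(1 for doc in documents if term in tokenize(doc))
    return math.log(len(documents) / containing)

print(idf("house", corpus))  # log(2/1) ≈ 0.69 -> uncommon, kept as a candidate
print(idf("the", corpus))    # log(2/2) = 0.0  -> appears everywhere, "taken care" of
```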
TF-IDF is a popular approach to weighting terms for NLP tasks because it assigns a value to a term according to its importance in a document, scaled by its importance across all documents in your corpus. This mathematically down-weights words that occur naturally in the English language (eliminating those that appear in every document) and surfaces the words that are more descriptive of your text. NLP tasks such as text summarization, information retrieval, and sentiment classification all make use of this weighting. Now that you understand how and why it works, try your hand at some of those tasks!
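To close the loop, here is the full calculation for our tiny two-document corpus, combining the tf and idf sketches from above. (In practice you’d likely reach for a library implementation such as scikit-learn’s TfidfVectorizer; note that its defaults smooth the IDF and normalize the resulting vectors, so its numbers won’t exactly match this hand calculation.)

```python
def tfidf(term, document, documents):
    # Importance within the document, scaled by distinctiveness across documents.
    return tf(term, document) * idf(term, documents)

# Rank the words of document 2 by their TF-IDF score.
for term in sorted(set(tokenize(doc2)), key=lambda t: -tfidf(t, doc2, corpus)):
    print(f"{term}: {tfidf(term, doc2, corpus):.3f}")

# 'yes', "that's", and 'house' come out on top (each ≈ 0.063), while the words
# document 2 shares with document 1 ('the', 'cat', 'is', 'in', 'box') score 0.
```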