
TF-IDF Demystified

Learn all you need to know about a key player in text vectorization

There are a plethora of ways to convert text into vectors for machine learning and natural language applications. Here, we explore one of the more headache-inducing methods: the TF-IDF.

Of course, when I call it headache-inducing, I mainly speak for myself. This is a topic that I struggled to get my head around when I first started learning Data Science.

TF-IDF is utilized in many projects that require analyzing text or building products that leverage NLP. As I delved into a variety of NLP projects, I often found myself running into this topic time and time again.

For that reason, I stress the importance of being familiar with the TF-IDF.

That being said, understanding the concept alone may not be sufficient; you might also have to be able to explain it verbally.

TF-IDF is often covered in data science interviews for positions that require NLP expertise. Although it is not the most complicated word vectorization method, it can be difficult to explain in words.

As a result, this has become a go-to topic for many well-intentioned (or sadistic) interviewers who wish to weed out candidates.

I write this article with some emotional investment, hoping that this five-letter enigma doesn’t leave you running around in circles as it did me.

TF-IDF

The term frequency-inverse document frequency, or TF-IDF, is an approach to representing words as vectors to extract insights from textual data.

One of the reasons beginners struggle to wrap their heads around the TF-IDF is that its numerical representation of words is abstract. The values assigned to each word for each document do not have a concrete meaning.

The abstract nature of the TF-IDF stems from the fact that it is not an actual statistic; it is a product of 2 separate statistics.

The TF-IDF of a word is determined by 2 factors: the term frequency (TF) and the inverse document frequency (IDF).

Let’s break down each component slowly, lest you get a headache and click off this article too soon.

Term Frequency

The term frequency refers to the number of occurrences of a word in a given document, relative to the document’s length. It can be derived with the following formula:
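One common formulation normalizes a term’s raw count by the length of the document (other variants, such as raw or log-scaled counts, also exist):

$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

where $f_{t,d}$ is the number of times the term $t$ occurs in the document $d$, and the denominator is the total number of terms in $d$.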

Representing each word based on its frequency in a document alone would strongly resemble the bag-of-words approach.

The bag-of-words model simply records the number of occurrences of each word in each document. Relying on word frequency alone can give undue weight to words that are prevalent across many documents (e.g. stop words).

Inverse Document Frequency

This issue can be addressed by introducing the inverse document frequency (IDF) component to offset words that occur at a high frequency across many documents.

The inverse document frequency can be derived by the following formula:
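In its basic form (scikit-learn applies a smoothed variant of this, noted later):

$$\text{IDF}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}$$

where $N$ is the total number of documents in the corpus $D$, and the denominator counts the documents that contain the term $t$.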

With the inclusion of this statistic, a word won’t get an inflated valuation just because it is present in many documents. For example, a stop word like "the" that appears in every document gets an IDF of log(N/N) = 0, which zeroes out its TF-IDF entirely.

Think of the IDF component as a counterbalance that ensures that the TF component does not overvalue a word’s relevance in a document.

With the TF and IDF values of a word, you are now able to effectively quantify that word’s importance in a document.

By combining the TF component and the IDF component, you get the TF-IDF.
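The combination is a simple product:

$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$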

Shocking, I know.

The numerical representation of a word in a document gives insight into how important that word is in that document. If one word has a higher TF-IDF value than another in a vector, its relevance to that document is deemed to be greater.

Use Cases

The TF-IDF has proven to be an effective text vectorization method that is applicable in many real-life scenarios.

Due to its ability to quantify a word’s importance in a document, it is ideal for keyword extraction. Given the TF-IDF values of every word in a document, you can pick out the highest-scoring words and treat them as keywords (this is also useful for text classification and text summarization).

The TF-IDF is also commonly used for information retrieval. Think of the search engines that you use on a regular basis. They are designed to provide you with the most relevant documents by evaluating each document based on its relevance to the user query and returning the highest-ranking results.
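As a minimal sketch of that idea (the documents and the query below are made up for illustration), you can fit a TF-IDF vectorizer on the corpus, transform the query with the same vocabulary, and rank the documents by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the king ruled the kingdom wisely",
    "the chef cooked a royal banquet",
    "the monarch addressed the kingdom",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Vectorize the query with the vocabulary learned from the corpus,
# then score every document by cosine similarity to the query
query_vector = vectorizer.transform(["king of the kingdom"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
print(scores.argsort()[::-1])  # document indices, most relevant first
```

The first document, which shares "king" and "kingdom" with the query, ranks highest.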

Limitations

Although the TF-IDF is able to give a lot of insight into the words present in documents, it comes with a few drawbacks.

Firstly, the TF-IDF does not pay any attention to the sequential order of the words in the text. This alone leads to some context being lost after the text is vectorized.

Secondly, the TF-IDF does not take into consideration the semantic values of the words. Each word is assumed to be independent of all the others.

Suppose that you are searching for information related to the word "king". In such a case, the TF-IDF will only consider the word "king". Synonyms like "ruler" or "monarch" would be treated as completely different entities even though they have similar semantic meanings.

Using TF-IDF In Python

A word’s TF-IDF value is not difficult to compute; it’s simple algebra. You could easily develop your own vectorizer that converts text into TF-IDF vectors using the formulas above.
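For instance, here is a minimal from-scratch sketch (a toy corpus, the unsmoothed IDF formula, and no handling of terms that appear in no document):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term over the document's length
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency: log of
    # (total documents / documents containing the term)
    n_containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs are pets".split(),
]

print(tf_idf("cat", corpus[0], corpus))  # ~0.068: appears in 2 of 3 documents
print(tf_idf("the", corpus[0], corpus))  # 0.0: appears in every document
```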

That being said, if you are aiming to use the TF-IDF in any of your NLP projects, it is best to rely on the TfidfVectorizer from the scikit-learn library.

Scikit-learn’s TfidfVectorizer performs all the tedious calculations for you, but it goes beyond doing just that.

The vectorizer also allows you to:

  • normalize the TF-IDF vectors (this mitigates bias from documents that are unusually long or short)
  • choose the n-gram range used in the vectorization (I wrote an article covering n-grams if you need a refresher)
  • reduce dimensionality by keeping only the top terms ranked by term frequency

Here, we will load 3 pieces of text from the built-in corpora in the NLTK package.
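For instance, three documents from NLTK’s Gutenberg corpus (the specific texts are an arbitrary choice):

```python
import nltk
nltk.download("gutenberg", quiet=True)  # fetch the corpus on first use
from nltk.corpus import gutenberg

documents = [
    gutenberg.raw("austen-emma.txt"),
    gutenberg.raw("shakespeare-hamlet.txt"),
    gutenberg.raw("bible-kjv.txt"),
]
```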

For the sake of the demonstration, let’s build a vectorizer that removes stop words, considers unigrams and bigrams, and only chooses the 10 words with the highest term frequency.
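A sketch of that configuration (parameter names as in scikit-learn’s TfidfVectorizer; its IDF is a smoothed variant of the formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer(
    stop_words="english",  # drop common English stop words
    ngram_range=(1, 2),    # consider unigrams and bigrams
    max_features=10,       # keep only the 10 highest-frequency terms
)

tfidf_matrix = vectorizer.fit_transform(documents)

# One row of TF-IDF values per document, one column per term
df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=["emma", "hamlet", "bible"],
)
print(df.round(3))
```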

From the output, you can see how each row of TF-IDF values represents each document.

Note: The chosen parameters are not optimal. The best n-gram range and dimensionality for the vectors depend on the text in question and can only be determined through experimentation.

Conclusion

Now you are familiar with how the TF-IDF is computed, what its values represent, and why it is so prevalent in NLP applications.

If you have gained a solid understanding of this approach to text vectorization, you have reached a significant milestone on your road to mastering natural language processing.

I wish you the best of luck in your data science endeavors!

