Data Science

TF-IDF: Term Frequency and Inverse Document Frequency Techniques

With examples!

Delal Tomruk
Towards Data Science
3 min read · Dec 27, 2020


Photo by William Iven on Unsplash

TF-IDF measures how important a word is to a document within a collection of documents. It is particularly useful for scoring words in text-related computations, such as text analysis and Natural Language Processing (NLP) algorithms.

We measure TF-IDF scores using the following formula:

TF-IDF = TF × IDF

Source: R-bloggers, https://www.r-bloggers.com/2014/02/the-tf-idf-statistic-for-keyword-extraction/

Simply put:

TF = (number of times the term appears in a document) / (total number of words in the document)

IDF = log(number of documents / number of documents in which the term appears)

From this formula, we can see that TF-IDF rewards a word for appearing frequently within a document, but the score shrinks as the word shows up in more of the other documents. In that case the word is not particularly important to any specific document, since it appears commonly across the whole collection.
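As a sketch, the two formulas above can be written out in a few lines of Python (a minimal, illustrative implementation, not code from the article; it uses a base-10 log, consistent with the worked example later on):

```python
import math

def tf(term, document):
    # term frequency: occurrences of the term / total words in the document
    words = document.split()
    return words.count(term) / len(words)

def idf(term, documents):
    # inverse document frequency: log(total docs / docs containing the term)
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log10(len(documents) / containing)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)
```

Note that a word appearing in every document gets an IDF of log(1) = 0, so its TF-IDF score is zero no matter how often it occurs.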

This explains why we don’t need to exclude stop words from a TF-IDF computation. Stop words are words that appear frequently in a text without carrying much meaning (some examples are ‘the’, ‘a’ and ‘is’). Since the calculation offsets the effect of a word appearing in many documents, and stop words are likely to appear across all documents, they will receive a low TF-IDF score in any case.

If a word’s TF-IDF score is high, it means the word is frequent in that document but rare across the other documents.

A Quick Example

Assume that a document has 20 words and 5 of them are the word “great”. The TF will be calculated as:

tf: 5/20 = 0.25

Now assume that we have 5 documents in total and the word “great” appears in 2 of them. The IDF will be calculated as:

idf: log(5/2) = 0.398 (using a base-10 log)

Therefore, the TF-IDF will be:

tf-idf: (0.25)(0.398) = 0.0995
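These numbers are easy to check in a couple of lines of Python (math.log10 is the base-10 log used above):

```python
import math

tf = 5 / 20                  # "great" appears 5 times in a 20-word document
idf = math.log10(5 / 2)      # 5 documents in total, the word appears in 2
print(round(tf, 2), round(idf, 3), round(tf * idf, 4))  # → 0.25 0.398 0.0995
```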

Another Quick Example — with Sample Code!

To compute the TF-IDF score, we first need to remove all punctuation and lowercase the words.

# replace punctuation characters with a space
df['example'] = df['example'].str.replace(r'[^\w\s]', ' ', regex=True)
# store words in lowercase form
df['example'] = df['example'].str.lower()

Count how many times each word appears in the document. (You can also calculate TF-IDF with scikit-learn.)
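For instance, a minimal scikit-learn version looks like this (the documents are made up for illustration; note that TfidfVectorizer uses a smoothed, natural-log IDF and L2-normalizes each row by default, so its scores differ from the hand formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was great",
        "the movie was bad",
        "a great great film"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # one row per document

# TF-IDF score of "great" in each document
col = vectorizer.vocabulary_['great']
print(matrix[:, col].toarray().ravel())
```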

# build a dictionary of per-document word counts
d = {}
for p, examples_df in df.groupby('example'):
    examples_dict = {}
    for i, row in examples_df.iterrows():
        examples_dict[row['words']] = row['count']
    d[p] = examples_dict

Define the IDF computation in a lambda function and apply it to the respective columns.

import math

# idf = log(number of documents / number of documents containing the word)
df['idf'] = df.apply(lambda x: math.log(total / x.total_word_count), axis=1)
# multiply the stored term frequencies by idf to get tf-idf
df['example_tfidf'] = df.apply(lambda x: x.example_tfidf * x.idf, axis=1)
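To make the apply/lambda pattern concrete, here is a self-contained sketch on a made-up DataFrame (the column names and counts are invented for illustration; a base-10 log keeps it consistent with the earlier worked example). Note how the stop word ‘the’, which appears in every document, ends up with a score of zero:

```python
import math
import pandas as pd

# hypothetical counts: 'count' = occurrences in one 20-word document,
# 'docs_with_word' = number of documents (out of 5) containing the word
df = pd.DataFrame({
    'words': ['great', 'movie', 'the'],
    'count': [5, 2, 6],
    'docs_with_word': [2, 3, 5],
})
total_docs = 5
doc_length = 20

df['tf'] = df['count'] / doc_length
df['idf'] = df.apply(lambda x: math.log10(total_docs / x.docs_with_word), axis=1)
df['tfidf'] = df['tf'] * df['idf']
print(df[['words', 'tfidf']])
```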

You can find an extensive example on my GitHub.

Thanks for reading! Let me know if you have any feedback :)
