
A Glance at Text Similarity

How To Compute the Similarity between Documents

Natural Language Processing Notes

Photo by Tim J on Unsplash

What Is Text Similarity?

Consider the following 2 sentences:

  • My house is empty
  • There is nobody at mine

A human could easily determine that these 2 sentences convey a very similar meaning despite being written in 2 completely different ways. The only word the sentences have in common is "is", and it provides no insight into how similar the sentences are. Nonetheless, we'd still expect a similarity algorithm to return a score that informs us that the sentences are very similar.

This phenomenon describes what we’d refer to as semantic text similarity, where we aim to identify how similar documents are based on the context of each document. This is quite a difficult problem because of the complexities that come with natural language.

On the other hand, we have another phenomenon called lexical text similarity. Lexical text similarity aims to identify how similar documents are on a word level. Many of the traditional techniques tend to focus on lexical text similarity and they are often much faster to implement than the new deep learning techniques that have slowly risen to stardom.

Essentially, we may define text similarity as attempting to determine how "close" 2 documents are in lexical similarity and semantic similarity.

This is a common, yet tricky, problem within the Natural Language Processing (NLP) domain. Some example use cases of text similarity include modeling the relevance of a document to a query in a search engine and understanding similar queries in various AI systems in order to provide uniform responses to users.

Popular Evaluation Metrics for Text Similarity

Whenever we are performing some sort of Natural Language Processing task, we need a way to interpret the quality of the work we are doing. "The documents are pretty similar" is subjective and not very informative compared to "the model has a 90% accuracy score". Metrics provide us with objective and informative feedback to evaluate a task.

Popular metrics include:

  • Euclidean Distance
  • Cosine Similarity
  • Jaccard Similarity

I covered the Euclidean Distance and Cosine Similarity in Vector Space Models, and Sanket Gupta's article on an Overview of Text Similarity Metrics covers the Jaccard similarity metric in good detail.
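If you'd like a quick refresher on how these metrics are computed before diving into the full example, here is a minimal sketch using made-up toy vectors and token sets (the numbers are purely illustrative):

# a quick sketch of the three metrics on toy inputs
import numpy as np
# two made-up document vectors (e.g. word counts)
u = np.array([1.0, 0.0, 2.0])
v = np.array([0.0, 1.0, 2.0])
# euclidean distance: straight-line distance between the vectors
euclidean_distance = np.linalg.norm(u - v)
# cosine similarity: cosine of the angle between the vectors
cosine_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# jaccard similarity: shared tokens over all unique tokens
tokens_a = {"my", "house", "is", "empty"}
tokens_b = {"there", "is", "no", "one", "at", "mine"}
jaccard_sim = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
print(euclidean_distance, cosine_sim, jaccard_sim)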

To gain a better understanding of the two ways we evaluate text similarity, let's code the example above in Python.

Lexical Text Similarity Example in Python

# importing libraries
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
# utility function to evaluate jaccard similarity
def jaccard_similarity(doc_1, doc_2):
    a = set(doc_1.split())
    b = set(doc_2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
# defining the corpus
corpus = ["my house is empty", "there is no one at mine"]
# to evaluate cosine similarities we need vector representations
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# printing results
print(f"Cosine Similarity: {cosine_similarity(X, X)[0][1]}\nJaccard Similarity: {jaccard_similarity(corpus[0], corpus[1])}")
>>>> Cosine Similarity: 0.11521554337793122
     Jaccard Similarity: 0.1111111111111111

Contrary to our intuition that these documents are similar, both evaluation metrics say that our sentences aren't very similar at all. This is expected: as we said previously, the documents share only the word "is", so on a purely lexical level they are not considered similar (for example, the Jaccard score is 1 shared word out of 9 unique words, roughly 0.11).
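You can see this for yourself by inspecting the TF-IDF vectors directly. The short snippet below is a sketch assuming scikit-learn 1.0 or later (where the vectorizer exposes get_feature_names_out); it shows that "is" is the only dimension the two documents share:

# inspect the TF-IDF vocabulary and the two document vectors
# (get_feature_names_out assumes scikit-learn >= 1.0)
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["my house is empty", "there is no one at mine"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray().round(2))           # one row per document, one column per word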

Semantic Text Similarity Example in Python

# note: this example assumes gensim < 4.0 (softcossim and
# KeyedVectors.similarity_matrix were removed in gensim 4.x)
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.matutils import softcossim
# defining the corpus
corpus = ["my house is empty", "there is no one at mine"]
# build a dictionary from the tokenized documents
dictionary = corpora.Dictionary([simple_preprocess(doc) for doc in corpus])
# load pre-trained GloVe word embeddings (downloaded on first use)
glove = api.load("glove-wiki-gigaword-50")
# term-to-term similarity matrix derived from the embeddings
sim_matrix = glove.similarity_matrix(dictionary=dictionary)
# bag-of-words representations of the two sentences
sent_1 = dictionary.doc2bow(simple_preprocess(corpus[0]))
sent_2 = dictionary.doc2bow(simple_preprocess(corpus[1]))
# printing results
print(f"Soft Cosine Similarity: {softcossim(sent_1, sent_2, sim_matrix)}")
>>>> Soft Cosine Similarity: 0.7836213218781843

As expected, once we take the context of the sentences into account (soft cosine similarity uses the GloVe word embeddings to give credit to related words, not just identical ones), we are able to identify that our texts are very similar despite having almost no words in common.

Wrap Up

After reading this article, you now know what text similarity is and the different ways you can go about measuring how similar texts are (i.e. lexical text similarity and semantic text similarity). Here's an idea for aspiring Data Scientists: if you're a jobseeker looking for a break into NLP, you could build a resume parser that tells you how similar your resume is to a job description, as sketched below.
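As a rough starting point for that idea, here is a minimal sketch; the resume and job description strings are placeholders, and a real parser would first need to extract and clean the text from a PDF or Word document:

# rough sketch: score a resume against a job description with TF-IDF + cosine
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# placeholder texts - swap in the real resume and job description
resume_text = "data scientist with experience in python, nlp and machine learning"
job_description = "looking for an nlp engineer skilled in python and machine learning"
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform([resume_text, job_description])
score = cosine_similarity(vectors[0], vectors[1])[0][0]
print(f"Resume / job description similarity: {score:.2f}")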

Thanks for reading! Connect with me on LinkedIn and on Twitter to stay up to date with my latest posts on Artificial Intelligence, Data Science, and Freelancing.

Related Articles

5 Ideas For Your Next NLP Project

Sentiment Analysis: Predicting Whether A Tweet Is About A Disaster

Introduction To Machine Translation

