
The Best Document Similarity Algorithm in 2020: A Beginner’s Guide



Picking the winner from 5 popular algorithms based on an experiment

Photo by David Clode on Unsplash

If you want to know the best algorithm on document similarity task in 2020, you’ve come to the right place.

Using 33,914 New York Times articles, I've tested 5 popular algorithms for the quality of their document similarity results. They range from a traditional statistical approach to a modern deep learning approach.

Each implementation is less than 50 lines of code, and all the models used are publicly available pretrained models. So you will be able to use them out of the box with no prior data science knowledge and expect similar results.

In this post, you’ll learn how to implement each algorithm and how the best one is chosen. The agenda is as follows:

  1. Defining the best;
  2. Experiment goal statement;
  3. Data setup;
  4. Comparison criteria;
  5. Algorithm setup;
  6. Picking the winner;
  7. Suggestion for starters;

You want to dip your toes into natural language processing and AI. You want to spice up the user experience with relevant suggestions. You want to upgrade the old existing algorithm. Then you will love this post.

Shall we get started?

Data Scientists Argue For The Absolute Best

You decide to search for the term "best document similarity algorithms".

Then you will get search results ranging from academic papers and blogs to Q&A sites. Some focus on a tutorial for a specific algorithm, while others focus on a theoretical overview.

In academic papers, a headline says this algorithm achieved 80% accuracy, beating all the others that achieved only 75%. Ok. But is that difference enough to be noticeable in our eyes? What about a 2% increase? How easy is it to implement that algorithm? Scientists have a bias towards chasing the best score on a given test set while leaving out the practical implications.

The language of data scientists. Source: Google Research BERT repository.

In Q&A forums, hyped enthusiasts dominate the conversation. Some say the best algorithm today is BERT; the concept is so revolutionary it beats everything else. The cynics, on the other hand, say everything depends on the job. Some answers are from years ago, predating deep learning. Take a look at this Stack Overflow thread: when the most upvoted answer was written in 2012, it’s hard to judge what it truly means for us.

Google would happily throw millions of dollars of engineering effort and the latest computational power at making their search just 1% better. That will likely be neither practical nor meaningful for us.

What is the trade-off between the performance gain and the technical expertise required for implementation? How much memory does it require? How fast does it run with minimal preprocessing?

What you want to see is how one algorithm is better than another in a practical sense.

This post will provide you with a guideline as to which algorithm to implement for your next document similarity problem.

Diverse algorithms, full-length popular articles, pretrained models

There are 4 goals in this experiment:

  1. By running multiple algorithms on the same dataset, you will see how each algorithm fares against the others and by how much.
  2. By using full-length articles from popular media as our dataset, you will discover the effectiveness of real-world applications.
  3. By accessing article URLs, you will be able to compare the differences in result quality.
  4. By only using the pretrained models available publicly, you will be able to set up your own document similarity and expect similar output.

"Pre-trained models are your friend."

Data Setup – 5 base articles

For this experiment, 33,914 New York Times articles from 2018 to June 2020 were selected. The data was mostly collected from the RSS feed and parsed to full content. The average article length is 6,500 characters.

From the pool, we choose 5 as the basis for similarity search. Each represents a different category.

On top of the semantic categories, we will measure the written format as well. More detail is given below.

  1. How My Worst Date Ever Became My Best (Lifestyle, Human Interest)
  2. A Deep-Sea Magma Monster Gets a Body Scan (Science, Informational)
  3. Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled (Business, News)
  4. Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal (Sports, News)
  5. 2020 Democrats Seek Voters in an Unusual Spot: Fox News (Politics, News)

Judgment Criteria

We will use 5 criteria to judge the nature of similarities. Please skip this section if you just want to see the results.

  1. Tag Overlap
  2. Section
  3. Subsections
  4. Story Style
  5. Theme

Tags are the closest proxy to human judgments of content similarity. Journalists write the tags by hand themselves. You can inspect them in the news_keywords meta tags in the HTML headers. The best part of using tags is that we can objectively measure how much overlap two pieces of content have. Each article carries anywhere from 1 to 12 tags. The more tags 2 articles share, the more similar they are.
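As a quick illustration (not code from the repo), here is one way to pull the news_keywords meta tag out of an article’s HTML header with requests and BeautifulSoup and count the overlap between two articles. The URLs and the helper name are placeholders.

```python
import requests
from bs4 import BeautifulSoup

def get_tags(url):
    """Fetch an article and return its news_keywords meta tags as a set."""
    html = requests.get(url).text
    meta = BeautifulSoup(html, "html.parser").find(
        "meta", attrs={"name": "news_keywords"}
    )
    if meta is None:
        return set()
    return {tag.strip().lower() for tag in meta["content"].split(",")}

# Tag overlap between two articles (URLs are placeholders)
tags_a = get_tags("https://www.nytimes.com/some-article-a.html")
tags_b = get_tags("https://www.nytimes.com/some-article-b.html")
print(len(tags_a & tags_b), "tags in common")
```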

Second, we look at the section. That’s how The New York Times categorizes articles at the highest level: science, politics, sports, and so on. The first part of the URL path, right after the domain (nytimes.com/…), displays the section (or slug).

The third is the subsection. For example, the opinion section can be subdivided into world, or world into Australia. Not all articles contain one, and it is not as significant as the other two.

The fourth is the style of writing. Most document-comparison analyses only look at the semantics. But since we are comparing recommendations for a practical use case, we want similar writing styles as well. For example, you do not want a commercially focused read about "the top 10 running shoes" right after "running shoes and orthotics" from an academic journal. We will group articles based on the writing guidelines taught at Jefferson County Schools. The list covers human interest, personality, the best (e.g., product reviews), news, how-to, past events, and informational.

5 Algorithm Candidates

These are the algorithms we will look at.

  1. Jaccard
  2. TF-IDF
  3. Doc2vec
  4. USE
  5. BERT

Each algorithm was run against 33,914 articles to find the top 3 articles with the highest scores. That process is repeated for each of the base articles.

The input was the article content in full length. Titles were ignored.

Be aware that some of these algorithms were not built with document similarity in mind. But with such diverse opinions on the internet, we shall see the outcome with our own eyes.

We will not focus on conceptual understanding nor on a detailed code review. Rather, the aim is to demonstrate how easy the setup is. Don’t worry if you do not understand every detail explained here; it is not needed to follow along with the post. For the theory, check the reading list at the bottom for excellent blogs written by others.

You can find the entire codebase in the Github repo.

If you just want to see the results, skip this section.

Jaccard

Paul Jaccard proposed this formula over a century ago. And the concept has been the standard go-to for similarity tasks for a long time.

Luckily, you will find Jaccard the easiest algorithm to understand. The math is straightforward with no vectorization, and it lets you write the code from scratch.

Also, Jaccard is one of the few algorithms here that does not use cosine similarity. It tokenizes the words and calculates the intersection over union.

We use NLTK to preprocess the text.

Steps:

  1. Lowercase all text
  2. Tokenize
  3. Remove stop words
  4. Remove punctuation
  5. Lemmatize
  6. Calculate intersection/union in 2 documents
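Here is a minimal sketch of those six steps with NLTK, not the exact code from the repo; it assumes the punkt, stopwords, and wordnet data have been downloaded.

```python
import string
import nltk  # run nltk.download("punkt"), ("stopwords"), ("wordnet") once
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, tokenize, drop stop words and punctuation, lemmatize
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    return {lemmatizer.lemmatize(t) for t in tokens}

def jaccard(doc_a, doc_b):
    # Intersection over union of the two token sets
    a, b = preprocess(doc_a), preprocess(doc_b)
    return len(a & b) / len(a | b)

print(jaccard("The volcano erupted under the sea.",
              "Scientists scanned the undersea volcano."))
```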

TF-IDF

This is another well-established algorithm, around since 1972. With decades of battle testing behind it, it is the basis of the default search scoring in Elasticsearch.

Scikit-learn offers a nice out-of-the-box implementation of TF-IDF. TfidfVectorizer lets anyone try it in the blink of an eye.

The similarities between TF-IDF word vectors are then calculated with scikit-learn’s cosine similarity. We will be using cosine similarity for the rest of the examples. It is such an important concept in many machine learning tasks that it may be worth your time to familiarize yourself with it (academic overview).

Thanks to scikit-learn, this algorithm required the fewest lines of code.
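A minimal sketch of that setup, using TfidfVectorizer and cosine_similarity on toy documents (the real experiment feeds in the full article texts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# documents[0] is the base article; the rest are candidates (toy examples here)
documents = [
    "A deep-sea volcano gets a 3D body scan.",
    "Scientists scanned an undersea volcano in the Pacific.",
    "Dominic Thiem beats Rafael Nadal in the Australian Open.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Cosine similarity of the base article against every other document
scores = cosine_similarity(tfidf[0:1], tfidf[1:]).flatten()
print(scores)  # higher score = more similar
```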

Doc2vec

Word2vec came out in 2013 and took developers’ breath away at the time. You might have heard of the famous demonstration:

King – Man + Woman ≈ Queen

Word2vec is great at understanding individual words, but vectorizing a whole sentence with it takes a long time, let alone an entire document.

Instead, we will use Doc2vec – a similar embedding algorithm that vectorizes paragraphs instead of individual words (2014, Google Inc). For a more digestible introduction, check out this blog by Gidi Shperber.

Unfortunately for Doc2vec, no corporation-sponsored pretrained model has been published. We will use the pretrained enwiki_dbow model from this repo. It is trained on English Wikipedia (the article count is unspecified, but the model size is a decent 1.5 GB).

The official documentation for Doc2vec states that you can feed in input of any length. Once tokenized, we feed in the entire document using the gensim library.
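A rough sketch of that flow with gensim is below. The model file path is an assumption (it depends on where you unpack the enwiki_dbow download), and the model may require a gensim version compatible with how it was trained.

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess
from sklearn.metrics.pairwise import cosine_similarity

# Path to the downloaded enwiki_dbow model is an assumption
model = Doc2Vec.load("enwiki_dbow/doc2vec.bin")

def embed(text):
    # Tokenize the full document and infer a single paragraph vector
    return model.infer_vector(simple_preprocess(text))

vec_a = embed("A deep-sea volcano gets a 3D body scan.")
vec_b = embed("Scientists scanned an undersea volcano in the Pacific.")
print(cosine_similarity([vec_a], [vec_b])[0][0])
```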

Universal Sentence Encoder (USE)

This is a popular algorithm published by Google much more recently, in May 2018 (the famous Ray Kurzweil was behind the publication 🙃). The implementation details are well documented in Google’s TensorFlow.

We will use the latest official pretrained model by Google: Universal Sentence Encoder 4.

As the name suggests, it was built with sentences in mind. But the official documentation does not constrain the input size, so nothing stops us from using it for a document comparison task.

The whole document is inserted into TensorFlow as is. No tokenization is done.
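A minimal sketch of loading USE 4 from TensorFlow Hub and comparing two documents; the model is downloaded on the first run, and the texts here are toy examples.

```python
import numpy as np
import tensorflow_hub as hub

# Downloads the pretrained model from TF Hub on the first run
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

docs = [
    "A deep-sea volcano gets a 3D body scan.",
    "Scientists scanned an undersea volcano in the Pacific.",
]
vectors = embed(docs).numpy()  # one 512-dimensional vector per document

# Cosine similarity between the two documents
a, b = vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```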

Bidirectional Encoder Representations from Transformers (BERT)

This is the big shot. Google open-sourced the BERT algorithm in November 2018. The following year, Google’s vice president of Search published a blog post calling BERT their biggest leap forward in the past five years.

It was built specifically to understand your search query. When it comes to understanding the context of a single sentence, BERT seems to outperform every other algorithm mentioned here.

The original BERT task was not meant to handle large amounts of text input. To embed multiple sentences, we will use the Sentence Transformers open-source project published by UKPLab (from a German university), whose calculation speed is superior. They also provide a pretrained model that is comparable to the original.

So each document is tokenized into sentences, and the resulting sentence vectors are averaged to represent the document as a single vector.
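Here is a rough sketch of that approach with the sentence-transformers library. The model name is one of the pretrained models UKPLab offered around 2020; treat it as an example rather than the exact model used in the experiment, and note that NLTK’s punkt data is needed for sentence splitting.

```python
import numpy as np
from nltk.tokenize import sent_tokenize          # requires nltk punkt data
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Example pretrained model name; the experiment's exact model may differ
model = SentenceTransformer("bert-base-nli-mean-tokens")

def embed_document(text):
    # Encode every sentence, then average into one document vector
    sentence_vectors = model.encode(sent_tokenize(text))
    return np.mean(sentence_vectors, axis=0)

vec_a = embed_document("A deep-sea volcano gets a body scan. Scientists mapped it in 3D.")
vec_b = embed_document("An undersea volcano was scanned by researchers.")
print(cosine_similarity([vec_a], [vec_b])[0][0])
```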

Winner Algorithms

Let’s see how each algorithm performs in our 5 different types of articles. We select the top 3 articles by the highest scores for comparison.

In this blog post, we will go through only the results of the best performing algorithm for each of the five. For the full results along with individual article links, please see the algorithm directories in the repo.

1. How My Worst Date Ever Became My Best

BERT wins.

The article is a human interest story about a romantic date for a divorced woman in her 50s.

This style of writing does not carry specific nouns like celebrity names, and it is not time-sensitive either. A human interest story from 2010 would likely be just as relevant today. As a result, no algorithm was far off in the comparisons.

It was a close call against USE. While one USE pick took a detour into social issues such as LGBTQ rights, BERT focused solely on romance and dating. The other algorithms diverged into topics of family and children, possibly from seeing the word "ex-husband".

BERT Results For The 1st Article (dating) – Document Similarity Experiment

2. A Deep-Sea Magma Monster Gets a Body Scan

TF-IDF wins.

This science article is about 3D scanning an active volcano on the ocean floor.

3D scanning, volcanoes, and the ocean are rare terms that are easy to pick up on. All algorithms fared well.

TF-IDF correctly picked articles that only talk about volcanoes in Earth’s oceans. USE was a good contender as well, though it focused on volcanoes on Mars instead of in oceans. The others picked up articles about Russian military submarines, which are not science-related and are off-topic.

TF-IDF Results For The 2nd Article (volcano) – Document Similarity Experiment

3. Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled

TF-IDF wins.

The article talks about what has happened to Renault and Nissan after the former CEO Carlos Ghosn escaped.

The ideal matches would talk about those 3 entities. Compared to the previous two, this article is much more event-driven and time-sensitive: the relevant news should be from around this date or later (the article is from November 2019).

TF-IDF correctly chose articles that focus on Nissan’s CEO. The others picked articles about generic automotive industry news, such as the alliance between Fiat Chrysler and Peugeot.

It’s also worth mentioning that Doc2vec and USE produced exactly the same results.

TF-IDF Results For The 3rd Article (Nissan) – Document Similarity Experiment

4. Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal

Tie among Jaccard, TF-IDF and USE.

The article is about tennis player Dominic Thiem at the 2020 Australian Open.

The news is event-driven and very specific to an individual, so ideally the matches would be about Thiem and the Australian Open.

Unfortunately, the results suffered from a lack of sufficient data. All the matches talked about tennis, but some were about Thiem at the 2018 French Open, or about Roger Federer at the Australian Open.

The result is a tie among the 3 algorithms. This speaks to a crucial point: we need to do our best at collecting, diversifying, and expanding the data pool to get the best similarity matches.

Jaccard Results For The 4th Article (tennis) – Document Similarity Experiment

5. 2020 Democrats Seek Voters in an Unusual Spot: Fox News

USE wins.

The article is about Democrats, with a particular focus on Bernie Sanders, appearing on Fox News ahead of the 2020 election.

Each topic can be big on its own, and there is an abundance of articles about Democratic candidates and the election. Since the gist of the story is its novelty, we prioritized matches that discuss the relationship between a Democratic candidate and Fox.

A side note: in practice, you want to be careful with suggestions in politics. Mixing liberal and conservative news can easily upset readers. Since we are dealing with The New York Times alone, that is not our concern here.

USE found articles about Bernie Sanders and cable TV networks such as Fox and MSNBC. The others picked articles discussing other Democratic candidates in the 2020 election, which were considered too generic.

USE Results For The 5th Article (Fox) – Document Similarity Experiment

King of Speed

Before crowning the winner, we need to talk about running time. Each algorithm performed very differently in terms of speed.

The result: the TF-IDF implementation was far faster than any other. To run through 33,914 documents on a single CPU from start to end (tokenization, vectorization, and comparison), it took:

  • TF-IDF: 1.5 min.
  • Jaccard: 13 min.
  • Doc2vec: 43 min.
  • USE: 62 min.
  • BERT: 50+ hours (each sentence was vectorized).

TF-IDF took only one and a half minutes. That’s about 2.5% of what USE took. Of course, you can incorporate multiple efficiency enhancements, but the potential gain needs to be weighed first. It gives us another reason to take a hard look at the development-difficulty trade-off.

Here are the winning algorithms for each article.

  1. BERT
  2. TF-IDF
  3. TF-IDF
  4. Tie among Jaccard, TF-IDF, and USE
  5. USE

From the results, we can say that for document similarity among news articles, TF-IDF is the best candidate. That is particularly true if you use it with minimal customization. It is also surprising, given that TF-IDF is the second-oldest algorithm here. You might rather be disappointed that modern state-of-the-art deep learning does not mean much for this task.

Of course, each deep learning technique can be improved by training your own model and preprocessing the data better. But all of that comes with development costs. You want to think hard about how much improvement that effort will bring relative to the naive TF-IDF method.

Lastly, it is fair to say we should forget about Jaccard and Doc2vec altogether for document similarity. They do not bring any benefit compared to the alternatives today.

Recommendation For Starters

Say you are deciding to implement a similarity algorithm in your application from scratch. Here is my recommendation.

1. Implement TF-IDF first

Despite the deep learning hype, the out-of-the-box state of the art for document similarity matching is TF-IDF. It gives you high-quality results, and best of all, it is lightning fast.

As we’ve seen, upgrading to a deep learning method may or may not give you better performance. A lot of thought must go into calculating the trade-off beforehand.

2. Accumulate Better Data

Andrew Ng offered the analogy "data is the new oil" back in 2017. You can’t expect your car to run without oil, and that oil has to be good.

Document similarity relies as much on data diversity as on the specific algorithm. You should put most of your effort into finding unique data that enhances your similarity results.

3. Upgrade to Deep Learning

If and only if you are dissatisfied with the results of TF-IDF, migrate to USE or BERT. Set up the data pipeline and upgrade your infrastructure. You will need to take the explosive calculation time into account. You will likely want to precompute the document embeddings so that similarity matching at runtime is much faster, as sketched below. Google wrote a tutorial on that topic.
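A minimal sketch of that precompute-then-lookup pattern; the vectors here are random stand-ins for whatever embeddings (USE, Sentence-BERT, etc.) you produce offline.

```python
import numpy as np

# Stand-ins: in practice these come from your embedding model run offline
article_vectors = np.random.rand(33914, 512).astype("float32")
query_vector = np.random.rand(512).astype("float32")

# Offline step: normalize and store the corpus vectors once
article_vectors /= np.linalg.norm(article_vectors, axis=1, keepdims=True)
np.save("article_vectors.npy", article_vectors)

# Online step: at query time, only the new document needs to be embedded
vectors = np.load("article_vectors.npy")
query = query_vector / np.linalg.norm(query_vector)
scores = vectors @ query                # cosine similarity via dot products
top3 = np.argsort(scores)[::-1][:3]     # indices of the three closest articles
print(top3, scores[top3])
```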

4. Tweak the Deep Learning Algorithm

You can upgrade your model gradually: train your own model, fine-tune a pretrained one to your specific domain, and so on. There are also many different deep learning models available today, so you can try them one by one to see which fits your requirements best.

Document Similarity Is One Of Many NLP Tasks

You can achieve document similarity with various algorithms: some take a traditional statistical approach, and others are cutting-edge deep learning methods. We have seen how they compare to one another on real-world New York Times articles.

With TF-IDF, you can easily start your own document similarity on your local laptop. No fancy GPU is necessary. No large memory is necessary. With high-quality data, you will still get competitive results.

Granted, if you want to do other tasks such as sentiment analysis or classification, deep learning may well suit the job. But while researchers push the boundaries of deep learning efficiency and performance, it’s not healthy for the rest of us to live in the hype loop. It creates tremendous anxiety and insecurity for newcomers.

Staying empirical can keep our eyes on reality.

Hopefully, the blog has encouraged you to start your own NLP project.

Let’s start getting our hands dirty.


This post was made in conjunction with the document clustering for the Kaffae project I’m working on. Along that line, I am planning another series on sentence similarity.

Stay tuned😃

Further Reading

