
Calculating Document Similarities using BERT, word2vec, and other models

Varun
Towards Data Science
Sep 26, 2020


Introduction

Document similarity is one of the most crucial problems in NLP. Finding the similarity between documents is used in several domains, such as recommending similar books and articles, identifying plagiarised documents, legal document search, and so on.

We can call two documents similar if they are semantically similar and define the same concept or if they are duplicates.

To make machines figure out the similarity between documents, we need a way to measure similarity mathematically, and it should be comparable so that the machine can tell us which documents are most similar and which are least. We also need to represent the text of a document in a quantifiable form (a mathematical object, usually a vector), so that we can perform similarity calculations on top of it.

So, converting a document into a mathematical object and defining a similarity measure are the two main steps required to make machines perform this exercise. We will look into different ways of doing both.

Similarity Function

Some of the most common and effective ways of calculating similarity are:

Cosine Distance/Similarity - It is the cosine of the angle between two vectors, which gives us the angular distance between the vectors. The formula to calculate the cosine similarity between two vectors A and B is:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

In a two-dimensional space it will look like this,

angle between two vectors A and B in 2-dimensional space (Image by author)

You can easily work out the math and prove this formula using the law of cosines.

The cosine is 1 at θ = 0° and -1 at θ = 180°, which means that for two overlapping vectors the cosine is the highest, and for two exactly opposite vectors it is the lowest. For this reason it is called a similarity. You can consider 1 - cosine as a distance.

Euclidean Distance - This is one of the forms of the Minkowski distance, with p=2. It is defined as follows:

d(A, B) = √( Σᵢ (Aᵢ − Bᵢ)² ), the square root of the sum of squared coordinate differences.

In two-dimensional space, Euclidean distance will look like this,

Euclidean distance between two vectors A and B in 2-dimensional space (Image by author)

Jaccard Distance - The Jaccard Index is used to calculate the similarity between two finite sets: J(A, B) = |A ∩ B| / |A ∪ B|, the size of the intersection divided by the size of the union. The Jaccard Distance can be considered as 1 - Jaccard Index.

We can use Cosine or Euclidean distance if we can represent documents in a vector space. Jaccard Distance can be used if we consider our documents as just sets or collections of words, without any semantic meaning.

Cosine and Euclidean distance are the most widely used measures, and we will use these two in the examples below.
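To make the three measures concrete before moving on, here is a minimal sketch on toy data; the vectors and word sets below are made up purely for illustration.

import numpy as np

# two toy document vectors (made up for illustration)
A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 1.0])

# cosine similarity: cos(theta) = (A . B) / (||A|| * ||B||)
cosine_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
cosine_dist = 1 - cosine_sim

# euclidean distance: square root of the sum of squared coordinate differences
euclidean_dist = np.sqrt(np.sum((A - B) ** 2))

# jaccard: documents treated as plain sets of words, with no semantics
doc1 = {"data", "science", "is", "fun"}
doc2 = {"data", "science", "is", "hard"}
jaccard_index = len(doc1 & doc2) / len(doc1 | doc2)
jaccard_dist = 1 - jaccard_index

print(cosine_sim, euclidean_dist, jaccard_dist)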

Embeddings

Embeddings are vector representations of text, where words or sentences with a similar meaning or context have similar representations.

vector representation of words in 3-D (Image by author)

Following are some of the algorithms to calculate document embeddings with examples,

Tf-idf - Tf-idf is a combination of term frequency and inverse document frequency. It assigns a weight to every word in a document, calculated from the frequency of that word in the document and the frequency of documents containing that word in the entire corpus. For more details on tf-idf please refer to this story.
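As a quick reminder, one classic formulation of the weight is tf-idf(t, d) = tf(t, d) × log(N / df(t)). The sketch below is only illustrative; scikit-learn's TfidfVectorizer, used later, applies a smoothed idf and L2-normalises the resulting vectors, so its numbers will differ slightly.

import math

def tf_idf(term, document, corpus):
    # term frequency: share of this document's tokens that are this term
    tf = document.count(term) / len(document)
    # inverse document frequency: penalise terms that occur in many documents
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# toy tokenised corpus, made up for illustration
corpus = [["data", "science", "is", "fun"],
          ["machine", "learning", "is", "fun"],
          ["data", "is", "everywhere"]]
print(tf_idf("data", corpus[0], corpus))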

Let’s define the following as the corpus (collection) of documents for which we want to calculate similarities:

Document corpus (Image by author)

We will perform basic text cleaning: removing special characters, removing stop words, and converting everything to lower case. Then we will convert the documents to their tf-idf vectors and calculate pairwise similarities using cosine and euclidean distance.
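A minimal sketch of what that cleaning step might look like; here documents stands for the list of raw strings from the corpus above, and the documents_df DataFrame, its column names, and the NLTK stop-word list are assumptions based on the variables used later in this story.

import re
import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

# hypothetical DataFrame holding the raw corpus shown above
documents_df = pd.DataFrame({'documents': documents})
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # keep letters only, lower-case everything, and drop stop words
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
    return ' '.join(word for word in text.split() if word not in stop_words)

documents_df['documents_cleaned'] = documents_df.documents.apply(clean_text)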

Pairwise cosine similarity is then just the dot product of the tf-idf vectors, because tf-idf vectors from sklearn are already normalised and their L2 norm is 1, so the denominator of the cosine similarity formula is 1 in this case.
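The vectorisation and pairwise-distance code is roughly as follows; the exact arguments and the most_similar helper here are reconstructions based on the calls shown below, so treat them as a sketch rather than the exact notebook code (the full version is in the notebook linked at the end). Note that tfidf_vectors is converted to a dense array here so the later element-wise indexing works.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

tfidfvectoriser = TfidfVectorizer()
tfidf_vectors = tfidfvectoriser.fit_transform(documents_df.documents_cleaned).toarray()

# the tf-idf vectors are already L2-normalised, so the dot product is the cosine similarity
pairwise_similarities = np.dot(tfidf_vectors, tfidf_vectors.T)
pairwise_differences = euclidean_distances(tfidf_vectors)

def most_similar(doc_id, similarity_matrix, matrix_name):
    # print the corpus documents ranked against document doc_id
    print(f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print(f'Similar documents ({matrix_name}):')
    if matrix_name == 'Cosine Similarity':
        similar_ix = np.argsort(similarity_matrix[doc_id])[::-1]  # higher similarity first
    else:
        similar_ix = np.argsort(similarity_matrix[doc_id])        # smaller distance first
    for ix in similar_ix:
        if ix == doc_id:
            continue
        print(f'{documents_df.iloc[ix]["documents"]} : {similarity_matrix[doc_id][ix]}')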

print (tfidf_vectors[0])
print (pairwise_similarities.shape)
print (pairwise_similarities[0][:])
# documents similar to the first document in the corpus
most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')
Documents similar to first document based on cosine similarity and euclidean distance (Image by author)

Word2vec - As the name suggests, word2vec embeds words into a vector space. Word2vec takes a text corpus as input and produces word embeddings as output. There are two main learning algorithms in word2vec: continuous bag of words and continuous skip-gram.

Continuous bag of words (CBOW) and skip-gram model (Image by https://arxiv.org/pdf/1301.3781.pdf)

We can train our own embeddings if we have enough data and computation available, or we can use pre-trained embeddings. We will use the pre-trained embeddings provided by Google.

We will start with tokenising and padding each document to make them all of the same size.

# tokenize and pad every document to make them of the same size
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer=Tokenizer()
tokenizer.fit_on_texts(documents_df.documents_cleaned)
tokenized_documents=tokenizer.texts_to_sequences(documents_df.documents_cleaned)
tokenized_paded_documents=pad_sequences(tokenized_documents,maxlen=64,padding='post')
vocab_size=len(tokenizer.word_index)+1
print (tokenized_paded_documents[0])
tokenised document (Image by author)

Let’s load the pre-trained embeddings. Each word is represented as a 300-dimensional vector.

# loading pre-trained embeddings, each word is represented as a 300-dimensional vector
import gensim

W2V_PATH="GoogleNews-vectors-negative300.bin.gz"
model_w2v = gensim.models.KeyedVectors.load_word2vec_format(W2V_PATH, binary=True)

Using this embedding we can convert every word of our document corpus into a 300-dimensional vector. As we have 6 documents and have padded each document to a maximum length of 64, the vector representation of the corpus is of shape 6x64x300.

# creating embedding matrix, every row is a vector representation from the vocabulary indexed by the tokenizer index.
embedding_matrix=np.zeros((vocab_size,300))
for word,i in tokenizer.word_index.items():
    if word in model_w2v:
        embedding_matrix[i]=model_w2v[word]

# creating document-word embeddings
document_word_embeddings=np.zeros((len(tokenized_paded_documents),64,300))
for i in range(len(tokenized_paded_documents)):
    for j in range(len(tokenized_paded_documents[0])):
        document_word_embeddings[i][j]=embedding_matrix[tokenized_paded_documents[i][j]]

document_word_embeddings.shape
document-word embedding shape (Image by author)

Now we have to represent every document as a single vector. We can either average or sum over every word vector and convert each 64x300 representation into a 300-dimensional one. But averaging or summing over all the words would lose the semantic and contextual meaning of the documents. The differing lengths of the documents would also have an adverse effect on such operations.
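For reference, the naive unweighted pooling described above is a one-liner over the tensor we just built; note that the zero vectors used for padding are included in this mean.

# naive document vectors: average the 64 word vectors of each document (shape 6x300)
naive_document_embeddings = document_word_embeddings.mean(axis=1)
print(naive_document_embeddings.shape)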

A better way of doing this could be to take a weighted average of the word vectors using the tf-idf weights. This can handle the variable-length problem to a certain extent, but cannot keep the semantic and contextual meaning of the words. After doing that, we can use the pairwise distances to find similar documents, as we did with the tf-idf model.

# calculating average of word vectors of a document weighted by tf-idf
document_embeddings=np.zeros((len(tokenized_paded_documents),300))
words=tfidfvectoriser.get_feature_names()

for i in range(len(document_word_embeddings)):
    for j in range(len(words)):
        document_embeddings[i]+=embedding_matrix[tokenizer.word_index[words[j]]]*tfidf_vectors[i][j]

print (document_embeddings.shape)

pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)

most_similar(0,pairwise_similarities,'Cosine Similarity')
most_similar(0,pairwise_differences,'Euclidean Distance')
Documents similar to first document based on cosine similarity and euclidean distance (Image by author)

GloVe - Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm for producing vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

We will use the pre-trained GloVe embeddings from Stanford. All the steps remain the same as for the word2vec embeddings; the only difference is that we load the GloVe pre-trained vectors instead, as sketched below. We are using the 100-dimensional GloVe embeddings because of the large size of the embedding files; you can use higher dimensions as well.
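A minimal sketch of how the GloVe vectors might be loaded into the embedding matrix; the file name glove.6B.100d.txt is an assumed local path to the Stanford download, and the rest of the pipeline (weighted averaging and pairwise distances) is the same as the word2vec code above with 300 replaced by 100.

# building the embedding matrix from the 100-dimensional GloVe file
import numpy as np

GLOVE_PATH = "glove.6B.100d.txt"  # assumed local path to the Stanford GloVe file

embedding_matrix = np.zeros((vocab_size, 100))
with open(GLOVE_PATH, encoding="utf8") as f:
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        if word in tokenizer.word_index:
            embedding_matrix[tokenizer.word_index[word]] = vector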

Documents similar to first document based on cosine similarity and euclidean distance (Image by author)

Doc2Vec - Doc2vec is an unsupervised learning algorithm that produces vector representations of sentences/paragraphs/documents. It is an adaptation of word2vec. Doc2vec can represent an entire document as a vector, so we don't have to average the word vectors to create a document vector.

Distributed Bag of Words version of Paragraph Vectors (PV-DBOW) and Distributed Memory version of Paragraph Vectors (PV-DM) (Image from https://arxiv.org/pdf/1405.4053.pdf)

We will use gensim to train a Doc2vec model on our corpus and create vector representations of the documents, roughly as sketched below.
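A minimal sketch of this step, assuming gensim's Doc2Vec API and reusing the most_similar helper and distance functions from above; the hyperparameters and tag scheme are illustrative rather than the exact values from the linked notebook.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tag every cleaned document with its index so its vector can be looked up later
tagged_data = [TaggedDocument(words=doc.split(), tags=[str(i)])
               for i, doc in enumerate(documents_df.documents_cleaned)]

model_d2v = Doc2Vec(vector_size=100, min_count=1, epochs=100)
model_d2v.build_vocab(tagged_data)
model_d2v.train(tagged_data, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)

# one learnt 100-dimensional vector per document (model_d2v.docvecs in gensim < 4.0)
document_embeddings = np.zeros((len(tagged_data), 100))
for i in range(len(tagged_data)):
    document_embeddings[i] = model_d2v.dv[str(i)]

pairwise_similarities = cosine_similarity(document_embeddings)
pairwise_differences = euclidean_distances(document_embeddings)
most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')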

Documents similar to first document based on cosine similarity and euclidean distance (Image by author)

BERT - Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art technique for natural language processing pre-training developed by Google. BERT is trained on unlabelled text, including Wikipedia and the BookCorpus. BERT uses the Transformer architecture, an attention-based model, to learn embeddings for words.

BERT consists of two pre-training tasks: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). In BERT, the training text is represented using three embeddings: Token Embeddings + Segment Embeddings + Position Embeddings.

BERT training architecture (Image from https://arxiv.org/pdf/1810.04805.pdf)
BERT input representation (Image from https://arxiv.org/pdf/1810.04805.pdf)

We will use a pre-trained BERT model from Hugging Face to embed our corpus, roughly as sketched below. We are loading the BERT base model, which has 12 layers (transformer blocks), 12 attention heads, 110 million parameters, and a hidden size of 768.
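A minimal sketch of this step using the sentence-transformers wrapper around a BERT base model; the model name 'bert-base-nli-mean-tokens' and the mean-pooling strategy are assumptions rather than necessarily the exact configuration of the linked notebook, and the helper functions from the tf-idf section are reused.

from sentence_transformers import SentenceTransformer

# a BERT-base encoder with mean pooling over token embeddings (assumed model name)
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

# one 768-dimensional vector per document
document_embeddings = sbert_model.encode(list(documents_df.documents_cleaned))

pairwise_similarities = cosine_similarity(document_embeddings)
pairwise_differences = euclidean_distances(document_embeddings)
most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')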

Documents similar to first document based on cosine similarity and euclidean distance (Image by author)

So you have seen multiple ways of representing documents in vector forms and measuring similarities. You can customise them for your own problem and see what works best for you.

Complete code of this story is available here - https://github.com/varun21290/medium/blob/master/Document%20Similarities/Document_Similarities.ipynb
