A novel approach to Document Embedding using Partition Averaging on Bag of Words

How to take a collection of vector embeddings and average them while preserving the multi-sense topicality of their manifold structures.

Photo by Beatriz Pérez Moya on Unsplash

This is the third article of the "Embed, Cluster, and Average" series. Before diving deep into this tutorial, I recommend first reading the previous two articles: Extracting rich embedding features from pictures using PyTorch and ResNeXt-WSL and Manifold clustering in the embedding space using UMAP and GMM.

In this tutorial, we will take the embeddings extracted from COCO pictures using the ResNeXt-WSL model, the sparse topic representation provided by the UMAP transformation, and the GMM clustering model, and we will produce an embedding representation for collections of pictures (bag-of-words documents). We will show why traditional averaging techniques don’t work very well for documents made of several multi-topic objects, and we will extend the Partitioned Smooth Inverse Frequency (P-SIF) methodology to work in a general-purpose domain beyond words and texts.

The bag-of-words model

The bag-of-words model represents documents as a weighted collection of objects (the words). The corpus is represented as a term-frequency matrix where the rows are the documents, the columns are the words of the vocabulary (the set of possible values), and the values are the number of occurrences of each word in each document.

For this tutorial, we will assume that the words are the pictures and that a document can be any collection of them. For convenience, we will use the COCO categories to define the documents, with the term frequency corresponding to the number of annotations of that category in each picture.

Term-frequency matrix of the bag-of-words model of the COCO categories
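As a rough sketch (not the exact notebook code), such a term-frequency matrix could be assembled from the COCO annotations with the pycocotools API; the annotation file path below is a placeholder:

```python
import numpy as np
from pycocotools.coco import COCO

# Hypothetical path to a COCO instances annotation file.
coco = COCO("annotations/instances_val2017.json")

cat_ids = coco.getCatIds()
img_ids = coco.getImgIds()
img_index = {img_id: j for j, img_id in enumerate(img_ids)}

# Rows: documents (the COCO categories); columns: words (the pictures);
# values: number of annotations of that category in the picture.
tf = np.zeros((len(cat_ids), len(img_ids)), dtype=int)
for i, cat_id in enumerate(cat_ids):
    for ann in coco.loadAnns(coco.getAnnIds(catIds=[cat_id])):
        tf[i, img_index[ann["image_id"]]] += 1
```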

If we plot the sum of the term frequencies in each row we have the following distribution:

Sum of term frequencies for each document
Distribution of document sizes

We can observe that person and chair are very popular annotations in the dataset, but on average each document (a category) is made of about a hundred words (the pictures), and we know that each word may represent multiple contexts.

Word weighted averaging

The naive way to represent each document would be to take the embedding vectors of its pictures and perform a weighted average based on the term frequency. By doing so we would represent the document in the same embedding space as the pictures, which is a convenient property. Nevertheless, as we will see in the comparison section at the bottom, averaging over many multi-topic words ends up producing almost identical vectors for all of the documents. It is very hard to preserve the original information in the words, as they compensate each other, resulting in flat, less informative vectors.
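A minimal sketch of this naive averaging, assuming tf is the term-frequency matrix built above and embeddings holds one picture embedding per row, in the same order as the columns of tf:

```python
import numpy as np

def tf_weighted_average(tf, embeddings):
    """Naive document embedding: term-frequency weighted mean of the word vectors.

    tf:         (n_documents, n_words) term-frequency matrix
    embeddings: (n_words, dim) embedding vector of each word (picture)
    """
    weights = tf / tf.sum(axis=1, keepdims=True)  # normalize each document's counts
    return weights @ embeddings                   # (n_documents, dim)
```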

Term frequency-inverse document frequency (tf-idf) averaging

The tf-idf is a statistic that measures how important a word is to a document in a collection of documents (the corpus). The more frequent a word is across the corpus, the less important it will be; likewise, rare words will have a higher weight.

This technique works quite well for text data in order to filter out common words such as "the", "a", "an", "is", "at", "on", and to give importance to keywords. When the word dictionary (here, the set of all possible pictures) does not follow a language-like distribution, this technique does not give much improvement.
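For completeness, a sketch of the tf-idf weighted variant using scikit-learn’s TfidfTransformer on the same term-frequency matrix (an illustration, not necessarily the notebook code):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

def tfidf_weighted_average(tf, embeddings):
    """Document embedding as a tf-idf weighted mean of the word vectors."""
    tfidf = TfidfTransformer().fit_transform(tf).toarray()  # (n_documents, n_words)
    weights = tfidf / tfidf.sum(axis=1, keepdims=True)
    return weights @ embeddings                              # (n_documents, dim)
```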

The major issue with our kind of averaging is not in the weighting of each word but rather in the algebraic operation (the sum/mean), which strips off all of the informative components of the vectors.

PCA 3D projection of the TF-IDF document matrix colored by COCO supercategories

We can see that documents belonging to the same supercategory tend to cluster near each other but the separation is not very clear.

Partitioned SIF weighted averaging

The novel approach we introduce is inspired by the P-SIF averaging methodology.

The key intuitions are:

  1. Cluster all of the "words" into "topics" using sparse coding (the paper’s authors refer to this as dictionary learning, but for us these are manifold clusters).
  2. For each topic, compute the weighted average of the pictures using Smooth Inverse Frequency (SIF) weights.
  3. Concatenate the averaged vectors of each topic.
  4. Remove the first principal component.

Overview of the P-SIF document averaging algorithm. Image by Author.

Let’s go through each step one by one.

Topic clustering

The original paper used a sparse encoding via dictionary learning algorithms such as K-SVD, a generalization of k-means for sparse coding. K-SVD learns a sparse representation of the words via singular value decomposition (SVD), similarly to PCA, and represents each word as a linear combination of atoms, where each atom acts like a cluster centroid. The algorithm works by alternating between learning a sparse coding of the data and updating the atoms to better fit the data. It is like optimizing SVD and k-means in a single algorithm.

In the previous article, Manifold clustering in the embedding space using UMAP and GMM, we already discussed the limitations of those linear projections and of k-means in the case of data arranged in manifold structures. Moreover, as of today, we could not find any mature implementation of the K-SVD algorithm beyond what the paper’s authors provide in the related GitHub repository. Therefore, we have replaced K-SVD with the combination of UMAP and GMM.
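A minimal sketch of this replacement step, assuming embeddings holds the 2048-dimensional picture vectors from the first article; the hyperparameters below are illustrative placeholders rather than the tuned values discussed in the previous article:

```python
import umap
from sklearn.mixture import GaussianMixture

# Project the picture embeddings onto a lower-dimensional, manifold-friendly space.
reducer = umap.UMAP(n_components=50, n_neighbors=15, min_dist=0.0, metric="cosine")
embeddings_umap = reducer.fit_transform(embeddings)        # (n_words, 50)

# Soft-assign each word (picture) to the topics with a Gaussian Mixture Model.
gmm = GaussianMixture(n_components=40, covariance_type="full", random_state=0)
gmm.fit(embeddings_umap)
topic_probs = gmm.predict_proba(embeddings_umap)           # (n_words, n_topics)
```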

SIF Averaging

In order to perform the Smooth Inverse Frequency (SIF) weighting, we first need to calculate the probability p of each word appearing in a document.

Distribution of p(word), the probability of the word to appear in a document

We can then calculate the SIF weight of each word as a function of the smoothing parameter alpha (we have used a value of 0.001): weight(word) = alpha / (alpha + p(word)).
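A minimal sketch of these weights, where p(word) is estimated from the term-frequency matrix as a document frequency (an assumption about the exact estimator used in the notebooks):

```python
import numpy as np

alpha = 0.001

# p(word): probability of each word (picture) appearing in a document,
# estimated here as its document frequency (an assumption on the estimator).
p_word = (tf > 0).mean(axis=0)            # (n_words,)

# SIF weight: frequent words are damped, rare words keep most of their weight.
sif_weights = alpha / (alpha + p_word)    # (n_words,)
```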

In order to perform the SIF document averaging, we multiply each word in the document by its SIF weight and by its word-topic probabilities. We can then average the words in each topic with the combined weights.
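Putting the pieces together, a sketch of the per-topic weighted averaging, combining the term frequencies, the SIF weights, and the word-to-topic probabilities from the clustering step:

```python
import numpy as np

def topic_sif_averages(tf, embeddings, sif_weights, topic_probs):
    """Return a list of (n_documents, dim) matrices: one SIF-weighted average per topic."""
    averages = []
    for k in range(topic_probs.shape[1]):
        # combined weight of each word in each document for topic k
        w = tf * sif_weights[None, :] * topic_probs[:, k][None, :]  # (n_docs, n_words)
        norm = w.sum(axis=1, keepdims=True)
        norm[norm == 0] = 1.0                                        # guard empty topics
        averages.append((w / norm) @ embeddings)
    return averages
```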

Concatenate the topic vectors

In the previous step, we generated one vector for each document and topic. The concatenation simply stacks all of those topic vectors together. We can think of the concatenated vector as a way to preserve most of the original information in the bag of words: only the words belonging to a particular topic contribute to the averaging, i.e. to the document coordinates in that topic’s embedding sub-space.
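Continuing the sketch, the concatenation is just a horizontal stack of the per-topic matrices:

```python
import numpy as np

topic_averages = topic_sif_averages(tf, embeddings, sif_weights, topic_probs)

# Stack the per-topic averages side by side: (n_documents, n_topics * dim).
doc_vectors = np.hstack(topic_averages)
```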

In our tutorial, we had embedding vectors of dimensionality 2048 and 40 clusters. Thus, each document is now represented in 2048*40 = 81920 dimensions. Pretty high, huh?

We can reduce this dimensionality either by reducing the embedding space of the words before averaging or by reducing the number of topics. Both aspects were already tuned during the previous clustering step. We could, although it is not mandatory, replace the original word embedding space with the reduced UMAP space (50 dimensions). Hence, we could obtain document vectors of size 50*40 = 2000 dimensions.

Generally speaking, we can apply any dimensionality reduction transformation, not necessarily the same one used during the clustering. The convenient choice of re-using the same UMAP space used for clustering should not limit the numerous options available. The aim of the clustering step is only to discover topics and assign word-to-topic probabilities.

Remove the first principal component

In order for the SIF averaging technique to be effective, and to be consistent with the theoretical math behind it, we need to remove the common discourse vector. The idea is that all of the words are generated according to a process that depends on the micro-topic (the local discourse). Nevertheless, all of the documents will share a component of this discourse, and that component can be removed using SVD. As a proxy, we can calculate the first principal component using PCA and then subtract it from the document embedding matrix obtained after averaging and concatenation.
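A sketch of the common-component removal, here following the original SIF recipe with TruncatedSVD (the PCA-as-a-proxy variant mentioned above would additionally center the data first):

```python
from sklearn.decomposition import TruncatedSVD

# Estimate the common discourse direction as the first singular vector of the
# document matrix, then remove its projection from every document vector.
svd = TruncatedSVD(n_components=1)
svd.fit(doc_vectors)
u = svd.components_                                     # (1, n_topics * dim)
doc_vectors_psif = doc_vectors - doc_vectors @ u.T @ u
```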

Reconstruction error after removing principal components from the document vectors. The left plot is the reconstruction curve for the document vectors concatenated in the original space, while the right plot is the reconstruction in the UMAP space. The dot represents the first principal component.

We can also observe that, by retaining the remaining 50 components, we could likely have reduced the dimensionality of the document vectors further without significant loss of information.

Results comparison

In order to test our results, we will calculate the cosine similarity between each pair of documents (in our example, the COCO picture categories). We should expect similar documents, i.e. documents belonging to the same COCO supercategory, to have a higher similarity compared to the remaining ones.
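As a sketch of the evaluation, we can compute and plot the pairwise cosine similarities for any of the configurations (here the p-SIF vectors from the previous step):

```python
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# (n_documents, n_documents) matrix of pairwise cosine similarities.
similarity = cosine_similarity(doc_vectors_psif)

# Heatmap of the similarity matrix for a quick visual comparison.
plt.imshow(similarity, cmap="viridis")
plt.colorbar()
plt.title("Similarity matrix (P-SIF)")
plt.show()
```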

We compare 6 configurations:

  1. bag of words
  2. bag of words in the UMAP space
  3. TF-IDF
  4. TF-IDF in the UMAP space
  5. p-SIF
  6. p-SIF in the UMAP space

We can plot a heatmap of the similarity matrix of each configuration:

Similarity matrix (bag of words)
Similarity matrix (bag of words in UMAP space)
Similarity matrix (TF-IDF)
Similarity matrix (TF-IDF in the UMAP space)
Similarity matrix (P-SIF)
Similarity matrix (P-SIF in the UMAP space)

We can observe that the UMAP transformation combined with the traditional bag-of-words and TF-IDF averaging leads to completely misleading results, where all of the documents end up with a similar representation (and a similar cosine similarity), likely in areas of the UMAP space without any real meaning. The TF-IDF transformation does not make any relevant difference; we already expected the technique not to work well for non-language distributions. We can finally see that our revised p-SIF, both in the original embedding space and in the UMAP space, does provide sparser and more consistent results.

In order to better visualize how they perform, let’s calculate the ratio between the average cosine similarity of pairs of documents belonging to the same supercategory (within) and pairs belonging to different supercategories (without). A higher ratio corresponds to a better representation.
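A sketch of this metric, assuming supercategories is a hypothetical list holding the COCO supercategory of each document (i.e. of each category):

```python
import numpy as np

def within_without_ratio(similarity, supercategories):
    """Mean cosine similarity of same-supercategory pairs over different-supercategory pairs."""
    labels = np.asarray(supercategories)
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # exclude each document with itself
    within = similarity[same & off_diagonal].mean()
    without = similarity[~same].mean()
    return within / without
```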

Average ratio of within and without cosine similarity of all of the pairs in each supercategory.

We can observe that p-SIF dominates for all of the supercategories, and for a few of them (e.g. animal, appliance, electronic) it is more than 7x better. By micro-averaging the results over all of the supercategories, we can summarize the comparison in a single metric: the total ratio of within/without mean cosine similarity.

Average ratio of within and without cosine similarity of all of the pairs in all of the supercategories.

Our p-SIF is confirmed to be a better technique for document embedding averaging in the case of multiple contexts. The difference between the p-SIF in the original space and the p-SIF in the UMAP space is minimal, but the latter allows us to reduce the size of our vectors by a factor of ~40.

Conclusions

In this article, we have proposed a re-adapted version of the partitioned Smooth Inverse Frequency (p-SIF) algorithm. We have replaced the K-SVD dictionary learning step with the combination of UMAP + GMM. We believe the latter is a more suitable technique for bag-of-words models of embedding vectors generated by deep neural networks and arranged in manifolds in high-dimensional spaces. We have demonstrated its effectiveness in preserving the multiple topics and contexts of the original pictures that are averaged into the COCO supercategory documents. This novel technique can be used for measuring the similarity between documents, but it can also be used for generating averaged embedding vectors in a new feature space tailored to represent the multi-sense topics that generic documents might contain.

Read the previous two articles of the "Embed, Cluster, and Average" series:

Extracting rich embedding features from COCO pictures using PyTorch and ResNeXt-WSL.

Manifold clustering in the embedding space using UMAP and GMM.

You can find the code and the notebooks at https://github.com/gm-spacagna/docem.


Originally published at https://datasciencevademecum.com on January 9, 2021.

