
This is the third article of the "Embed, Cluster, and Average" series. Before diving into this tutorial, I recommend first reading the previous two articles: Extracting rich embedding features from pictures using PyTorch and ResNeXt-WSL and Manifold clustering in the embedding space using UMAP and GMM.
In this tutorial, we will take the embeddings extracted from COCO pictures using the ResNeXt-WSL model, the sparse topic representation provided by the UMAP transformation, and the GMM clustering model, and we will produce an embedding representation for collections of pictures (bag-of-words documents). We will show why traditional averaging techniques don't work very well for documents made of several multi-topic objects, and we will extend the Partitioned Smooth Inverse Frequency (P-SIF) methodology to work in a general-purpose domain beyond words and text.
The bag-of-words model
The bag-of-words model represents documents as weighted collections of objects (the words). The corpus is represented as a term-frequency matrix where the rows are the documents, the columns are the words in the set of possible values (the vocabulary), and the values are the number of occurrences of each word in each document.
For this tutorial, we will assume that the words are the pictures and that a document can be any collection of them. For convenience, we will use the COCO categories to define the documents, with the term frequency corresponding to the number of annotations of that category in each picture.
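As a minimal sketch of how such a term-frequency matrix could be built with pycocotools (the annotation file path and variable names below are illustrative, not the original code):

```python
import numpy as np
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")  # illustrative path
cat_ids = coco.getCatIds()                 # documents = categories
img_ids = sorted(coco.getImgIds())         # words = pictures
img_index = {img_id: j for j, img_id in enumerate(img_ids)}

# term-frequency matrix: one row per category, one column per picture,
# values = number of annotations of that category in that picture
tf = np.zeros((len(cat_ids), len(img_ids)), dtype=np.int64)
for i, cat_id in enumerate(cat_ids):
    for ann in coco.loadAnns(coco.getAnnIds(catIds=[cat_id])):
        tf[i, img_index[ann["image_id"]]] += 1
```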

If we plot the sum of the term frequencies in each row we have the following distribution:


We can observe that person and chair are very popular annotations in the dataset. On average, though, each document (the category) is made of a hundred words (the pictures), and we know that each word may represent multiple contexts.
Weighted word averaging
The naive way to represent each document would be to take the embedding vectors of its pictures and perform a weighted average based on the term frequency. By doing so we would represent the document in the same embedding space as the pictures, which is a convenient property. Nevertheless, as we will see in the comparison section at the bottom, averaging over many multi-topic words ends up producing almost identical vectors for all of the documents. It is very hard to preserve the original information in the words, as they compensate for each other, resulting in flat, less informative vectors.
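A minimal sketch of this naive averaging, assuming `tf` is the term-frequency matrix built above and `embeddings` is the (n_pictures, 2048) matrix from the first article, with rows aligned to the columns of `tf`:

```python
import numpy as np

def tf_weighted_average(tf, embeddings):
    """Average the picture embeddings of each document, weighted by term frequency."""
    weights = tf / tf.sum(axis=1, keepdims=True)   # each document row sums to 1
    return weights @ embeddings                    # (n_documents, 2048)

doc_vectors_bow = tf_weighted_average(tf.astype(float), embeddings)
```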
Term frequency-inverse document frequency (tf-idf) averaging
The tf-idf is a statistic that measures how important a word is to a document in a collection of documents (the corpus). The more frequent a word is across the corpus, the less important it will be. Conversely, rare words will have a higher weight.
This technique works quite well for text data, where it filters out common words such as the, a, an, is, at, on and gives importance to keywords. When the word dictionary (here, the set of all possible pictures) does not follow a natural-language distribution, this technique does not give much improvement.
The major issue with our kind of averaging is not the weighting of each word but rather the algebraic operation (the sum/mean) that strips off all of the informative components of the vectors.
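For reference, a sketch of tf-idf weighted averaging using scikit-learn's TfidfTransformer on the same term-frequency matrix (an assumption of how the weights could be computed, reusing `tf` and `embeddings` from above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer().fit_transform(tf)            # sparse (n_documents, n_pictures)
tfidf = np.asarray(tfidf.todense())
weights = tfidf / tfidf.sum(axis=1, keepdims=True)      # re-normalize per document
doc_vectors_tfidf = weights @ embeddings                # still a plain weighted mean
```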

We can see that documents belonging to the same supercategory tend to cluster near each other but the separation is not very clear.
Partitioned SIF weighted averaging
The novel approach we introduce is inspired by the P-SIF averaging methodology.
The key intuitions are:
- Cluster all of the "words" into "topics" using sparse coding (the paper authors refer to this as dictionary learning, but for us these are manifold clusters).
- For each topic, compute the weighted average of the pictures using Smooth Inverse Frequency (SIF) weights.
- Concatenate the averaged vectors of each topic.
- Remove the first principal component.

Let’s go through each step one by one.
Topic clustering
The original paper used a sparse encoding via dictionary-learning algorithms such as K-SVD, a generalization of k-means for sparse coding. K-SVD learns a sparse representation of the words via singular value decomposition (SVD), similarly to PCA, and represents each word as a linear combination of atoms. Each atom is like a cluster centroid. The algorithm works by alternating between learning a sparse coding of the data and updating the atoms to better fit the data. It is like optimizing SVD and k-means within a single algorithm.
In the previous article, Manifold clustering in the embedding space using UMAP and GMM, we have already discussed the limitations of such linear projections and of k-means in the case of data arranged in manifold structures. Moreover, as of today, we could not find any mature implementation of the K-SVD algorithm beyond what the paper authors provided in the related GitHub repository. Therefore, we have replaced K-SVD with the combination of UMAP and GMM.
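A minimal sketch of this replacement step, assuming `embeddings` is the picture-embedding matrix; the hyper-parameter values are illustrative and not necessarily the ones tuned in the previous article:

```python
import umap
from sklearn.mixture import GaussianMixture

# reduce the picture embeddings to a manifold-friendly space
reducer = umap.UMAP(n_components=50, n_neighbors=15, min_dist=0.0, random_state=42)
embeddings_umap = reducer.fit_transform(embeddings)     # (n_pictures, 50)

# soft word-to-topic probabilities from a Gaussian mixture with 40 components
gmm = GaussianMixture(n_components=40, covariance_type="full", random_state=42)
gmm.fit(embeddings_umap)
topic_probs = gmm.predict_proba(embeddings_umap)        # (n_pictures, 40)
```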
SIF Averaging
In order to perform the Smooth Inverse Frequency (SIF) weighting, we first need to calculate the probability p of each word appearing in a document.

We can then calculate the SIF weights as a function of the smoothing parameter alpha (we have used a value of 0.001).
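A sketch of this computation, where p is estimated as the relative frequency of each word across all documents, following the usual SIF formulation a(w) = alpha / (alpha + p(w)):

```python
import numpy as np

alpha = 0.001
# empirical probability of each word (its relative frequency across all documents)
p_w = tf.sum(axis=0) / tf.sum()            # shape (n_pictures,)
sif_weights = alpha / (alpha + p_w)        # smooth inverse frequency weights
```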
In order to perform the SIF document averaging, we multiply each word in the document by its SIF weight and by its word-topic probabilities. We can then average the words of each topic with the combined weights.
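One possible formulation of this partitioned averaging, reusing `tf`, `embeddings`, `sif_weights`, and `topic_probs` from the previous steps (a sketch, not the original implementation):

```python
import numpy as np

n_docs, n_words = tf.shape
n_topics = topic_probs.shape[1]
dim = embeddings.shape[1]

doc_topic_vectors = np.zeros((n_docs, n_topics, dim))
for k in range(n_topics):
    # combined weight of each word, in each document, for topic k
    w = tf * sif_weights[None, :] * topic_probs[:, k][None, :]   # (n_docs, n_words)
    norm = np.maximum(w.sum(axis=1, keepdims=True), 1e-12)       # avoid division by zero
    doc_topic_vectors[:, k, :] = (w / norm) @ embeddings         # weighted average per topic
```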
Concatenate the topic vectors
In the previous step, we generated one vector for each document and topic pair. The concatenation simply stacks all of those topic vectors together. We can think of the concatenated vector as a way to preserve most of the original information in the bag of words: only the words belonging to a particular topic contribute to the average, i.e. to the document coordinates in that topic's embedding sub-space.
In our tutorial, we had embedding vectors of dimensionality 2048 and 40 clusters. Thus, each document is now represented in 2048*40 = 81920 dimensions. Pretty high, huh?
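Concretely, the concatenation amounts to flattening the per-topic axis (assuming `doc_topic_vectors` from the previous sketch):

```python
n_docs, n_topics, dim = doc_topic_vectors.shape
doc_vectors_psif = doc_topic_vectors.reshape(n_docs, n_topics * dim)
print(doc_vectors_psif.shape)   # (n_documents, 81920) with dim=2048 and 40 topics
```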
We can reduce this dimensionality either by reducing the embedding space of the words before averaging or by reducing the number of topics. Both aspects were already tuned during the previous clustering step. We could, although it is not mandatory, replace the original word embedding space with the reduced UMAP space (50 dimensions), obtaining document vectors of size 50*40 = 2000 dimensions.
Generally speaking, we can apply any dimensionality reduction transformation, not necessarily the same one used during the clustering. The convenient choice of re-using the UMAP space from the clustering step should not limit the numerous options available. The aim of the clustering step is to discover topics and assign word-to-topic probabilities.
Remove the first principal component
In order for the SIF averaging technique to be effective, and consistent with the theoretical math behind it, we need to remove the common discourse vector. The idea is that all of the words are generated according to a process that depends on the micro-topic (the local discourse). Nevertheless, all of the documents share a component of this discourse, and that component can be removed using SVD. As a proxy, we can therefore calculate the first principal component with PCA and subtract it from the document embedding matrix obtained after averaging and concatenation.
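A sketch of this step on the concatenated matrix `doc_vectors_psif`, using truncated SVD to estimate the first principal direction (one common way to implement the SIF common-component removal):

```python
from sklearn.decomposition import TruncatedSVD

# first principal direction of the concatenated document vectors
svd = TruncatedSVD(n_components=1, n_iter=7, random_state=42)
svd.fit(doc_vectors_psif)
u = svd.components_                                     # shape (1, n_topics * dim)

# subtract the projection of every document vector onto that common direction
doc_vectors_psif = doc_vectors_psif - doc_vectors_psif @ u.T @ u
```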


We can also observe that by retaining the remaining 50 components we could have likely reduced the dimensionality of document vectors further without significant loss of information.
Results comparison
In order to test our results, we will calculate the cosine similarity between each pair of documents (in our example, the COCO picture categories). We should expect similar documents, i.e. documents belonging to the same COCO supercategory, to have a higher similarity compared to the remaining ones.
We compare 6 configurations:
- bag of words
- bag of words in the UMAP space
- TF-IDF
- TF-IDF in the UMAP space
- p-SIF
- p-SIF in the UMAP space
We can plot a heatmap of the similarity matrix of each configuration:
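A sketch of how each heatmap could be produced, assuming `doc_vectors` holds the document vectors of one configuration and `doc_names` the category labels (both names are illustrative):

```python
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns

similarity = cosine_similarity(doc_vectors)            # (n_documents, n_documents)
sns.heatmap(similarity, xticklabels=doc_names, yticklabels=doc_names, cmap="viridis")
```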






We can observe that the traditional bag-of-words and TF-IDF averaging in the UMAP space lead to completely misleading results, where all of the documents end up with a similar representation (and a similar cosine distance), likely in areas of the UMAP space without any real meaning. The TF-IDF weighting does not make any relevant difference; we already expected this technique not to work well for non-language distributions. We can finally see that our revised p-SIF, whether in the original embedding space or in the UMAP space, does provide sparser and more consistent results.
In order to better visualize how they perform, let's calculate the ratio between the average cosine similarity of document pairs within the same supercategory and that of pairs outside it (within/without). A higher ratio corresponds to a better representation.
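A sketch of this metric, assuming `similarity` is the cosine-similarity matrix of a configuration and `supercategories` lists the supercategory of each document (both names are illustrative):

```python
import numpy as np

def within_without_ratio(similarity, supercategories):
    """Ratio of mean similarity of same-supercategory pairs vs. cross-supercategory pairs."""
    supercategories = np.asarray(supercategories)
    same = supercategories[:, None] == supercategories[None, :]
    off_diagonal = ~np.eye(len(supercategories), dtype=bool)
    within = similarity[same & off_diagonal].mean()
    without = similarity[~same].mean()
    return within / without
```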

We can observe that p-SIF dominates for all of the supercategories, and for a few of them (e.g. animal, appliance, electronic) it is more than 7x better. By micro-averaging the results over all of the supercategories, we can summarize them in a single metric: the total ratio of within/without mean cosine similarity.

Our p-SIF is confirmed to be a better technique for document embedding averaging in the case of multiple contexts. The difference between p-SIF in the original space and p-SIF in the UMAP space is minimal, but the latter allows us to reduce the size of our vectors by a factor of ~40.
Conclusions
In this article, we have proposed a re-adapted version of the Partitioned Smooth Inverse Frequency (p-SIF) algorithm, replacing the K-SVD dictionary-learning step with the combination of UMAP + GMM. We believe the latter is a more suitable technique for bag-of-words models of embedding vectors generated by deep neural networks and arranged in manifolds in high-dimensional spaces. We have demonstrated its effectiveness in preserving the multiple topics and contexts of the original pictures that are averaged into the COCO category documents. This technique can be used for measuring the similarity between documents, but also for generating averaged embedding vectors in a new feature space tailored to represent the multi-sense topics that generic documents may contain.
Read the previous two articles of the "Embed, Cluster, and Average" series:
Extracting rich embedding features from COCO pictures using PyTorch and ResNeXt-WSL.
Manifold clustering in the embedding space using UMAP and GMM.
You can find the code and the notebooks at https://github.com/gm-spacagna/docem.
Originally published at https://datasciencevademecum.com on January 9, 2021.