Tag2Image and Image2Tag — Joint representations for images and text

Amine Aoullay
Towards Data Science
5 min read · Jun 3, 2018


Parsing a complex scene and describing its content is an easy task for humans: people can quickly summarize a complicated scene in a few words. For computers, it is much harder. To build systems that can achieve this, we need to combine current state-of-the-art techniques from both Computer Vision and Natural Language Processing.

As a first step, we will see how to generate feature vectors for both images and text. We will then describe the CCA algorithm, which lets us join the constructed features in a common space, and finally present the results of two pipelines (Tag2Image and Image2Tag) on the Microsoft COCO dataset.

Transfer Learning

Image Features

A Convolutional Neural Network (CNN) can be used to extract features from images. One example is the 16-layer VGGNet pre-trained on ImageNet, which was the state-of-the-art model in the ImageNet Challenge 2014. We simply remove the last fully-connected layer and treat the rest of the CNN as a fixed feature extractor for our dataset, which yields a 4096-D vector for every image.

VGG-16 Architecture
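
In practice, this takes only a few lines. Here is a minimal sketch using Keras (an assumption; the article does not name a framework), reading out the 4096-D activations of the fc2 layer:

```python
# Minimal sketch: VGG-16 as a fixed 4096-D feature extractor (Keras assumed)
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights="imagenet")                        # 16-layer VGGNet pre-trained on ImageNet
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("fc2").output) # drop the final 1000-way classifier

img = image.load_img("example.jpg", target_size=(224, 224))  # "example.jpg" is a placeholder
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = extractor.predict(x)                         # shape: (1, 4096)
```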

Tag Features

Word embeddings are a family of natural language processing techniques that map words into a high-dimensional geometric space. In other words, a word embedding function takes a text corpus as input and produces word vectors as output, such that the distance between any two vectors captures part of the semantic relationship between the two associated words.

As an example, “student” and “aircraft” are semantically different words, so a reasonable embedding space would represent them as vectors that are far apart. But “breakfast” and “kitchen” are related words, so they should be embedded close to each other.

Word embedding space example

To achieve this mapping, we can use any state-of-the-art pre-trained model: Word2Vec (300-dimensional word vectors pre-trained on the Google News dataset) or GloVe (300-dimensional word vectors pre-trained on Common Crawl with a 1.9M-word vocabulary).
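
As a concrete example, the Word2Vec vectors can be loaded in one line with gensim's downloader (a tooling assumption; any embedding loader works):

```python
# Minimal sketch: loading pre-trained 300-D Word2Vec vectors with gensim
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")     # large download on first use

print(wv["student"].shape)                    # (300,)
print(wv.similarity("breakfast", "kitchen"))  # related words -> higher similarity
print(wv.similarity("student", "aircraft"))   # unrelated words -> lower similarity
```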

CCA (Canonical Correlation Analysis)

Now we are going to give a high-level overview of a popular and successful approach [1] for mapping visual and textual features to the same latent space.

Two-view CCA minimizes the distance (equivalently, maximizes the correlation) between images (triangles) and their corresponding tags (circles)

Given two sets of N vectors, let X refer to the image features and Y to the textual features. Let their covariances be Σxx and Σyy respectively, and let Σxy be the cross-covariance.

Linear Canonical Correlation Analysis (CCA) seeks a pair of linear projections $w_x$, $w_y$ that maximizes the correlation of the two views:

$$\max_{w_x, w_y} \; \frac{w_x^\top \Sigma_{xy} w_y}{\sqrt{w_x^\top \Sigma_{xx} w_x}\,\sqrt{w_y^\top \Sigma_{yy} w_y}}$$

Since this ratio is invariant to rescaling of $w_x$ and $w_y$, the CCA objective can equivalently be written as the following constrained optimization problem:

$$\max_{w_x, w_y} \; w_x^\top \Sigma_{xy} w_y \quad \text{subject to} \quad w_x^\top \Sigma_{xx} w_x = w_y^\top \Sigma_{yy} w_y = 1$$
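
We don't have to solve this by hand; scikit-learn ships a CCA implementation (an assumption about tooling; the reference paper uses its own solver), which is enough to sketch the idea:

```python
# Minimal sketch: projecting both views into a shared CCA space with scikit-learn
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 512))   # image features (e.g., CNN activations), toy sizes
Y = rng.normal(size=(500, 300))   # tag features (e.g., averaged word vectors)

cca = CCA(n_components=128)       # dimensionality of the shared latent space
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)    # both views projected into the common space
```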

Let x and y be points referring to the visual and textual data respectively, projected into the common space. To compare x and y we can use the cosine similarity:

$$\text{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

Qualitative Results

In the MS COCO dataset, each image is described by 5 captions. A first step consists of pre-processing these captions by removing all the stop words and then concatenating them to get one bag of words (BoW) per image. Afterwards, we take a weighted average of the word embeddings using TF-IDF weights, so that words which are frequent in an image’s captions but rare across the corpus contribute more.

Example of an image and its corresponding captions
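
A sketch of this caption-to-vector step might look as follows (variable names are illustrative; wv is the embedding model loaded earlier):

```python
# Minimal sketch: one TF-IDF-weighted average embedding per image
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# docs: one string per image -- its 5 captions with stop words removed, concatenated
docs = ["man riding skateboard ramp", "plates breakfast table kitchen"]  # toy examples

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)      # TF-IDF weight of each word, per image
vocab = tfidf.get_feature_names_out()

def doc_vector(i, wv, dim=300):
    """TF-IDF-weighted average of the word vectors of image i's caption words."""
    row = weights[i].toarray().ravel()
    vec, total = np.zeros(dim), 0.0
    for word, w in zip(vocab, row):
        if w > 0 and word in wv:         # skip zero weights and out-of-vocabulary words
            vec += w * wv[word]
            total += w
    return vec / total if total else vec
```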

Tag2Image

For this task, we aim to retrieve images that are described by a given query text. We first project the query’s feature vector into the CCA space, and then use it to retrieve the most similar image features from the database.
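
The lookup itself is a nearest-neighbour search under cosine similarity, sketched below (names are illustrative; the projected vectors come from the CCA step above):

```python
# Minimal sketch: rank database items by cosine similarity in the CCA space
import numpy as np

def retrieve(query_cca, db_cca, k=5):
    """Return the indices of the k database items most similar to the query."""
    q = query_cca / np.linalg.norm(query_cca)
    db = db_cca / np.linalg.norm(db_cca, axis=1, keepdims=True)
    return np.argsort(-(db @ q))[:k]

# Tag2Image: text query against the image side of the database
# top_images = retrieve(text_query_cca, image_feats_cca)
```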

Query 1: “A man playing tennis”.

Query 2: “A man jumping in the air in a skateboard”.

We can clearly see that the retrieved images match the query text closely.

Image2Tag

Here, we aim to find a set of tags that properly describe a query image. Given a query image, we first project its feature vector into the CCA space, and then use it to retrieve the most similar text features.
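
This is the same lookup as before with the two views swapped, reusing the retrieve sketch above (the variable names remain illustrative):

```python
# Image2Tag: image query against the text side of the database
# top_tags = retrieve(image_query_cca, tag_feats_cca)
```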

In general, the retrieved keywords describe the query images quite well. However, we can identify some errors (in red). For instance, in the last example the word “walking” was wrongly retrieved, possibly because many images in the training set contain both “people” and “walking”.

Conclusion

Canonical Correlation Analysis can be used to build multimodal retrieval pipelines. Given a dataset of images and their tags, CCA maps their corresponding feature vectors to the same latent space, where a common similarity measure can be used to perform Image2Tag and Tag2Image search tasks.

Stay tuned and if you liked this article, please leave a 👏!

Reference

[1] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, “A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics,” IJCV, 2014.
