This blog post describes the TRIC model – an architecture for the Relative Image Captioning task that was created as part of my master's thesis. Below is the list of questions that will be answered in this post:
- What is Relative Image Captioning, how does it differ from Image Captioning, and what can it be used for?
- What was the original way of solving this task?
- How can the Transformer architecture be adapted as a solution to the Relative Image Captioning problem?
- What are the main challenges while training such a system?
Before diving into the post you should probably brush up on the following topics:
All of them are described in my thesis in a pretty concise way, so I highly recommend it – you can find the link right below. But if you prefer to check them from another source, that is covered too: to each of the topics listed above, I have attached a link to my personal favorite resource on that particular subject.
The codebase for this project can be found here: CODE REPO. The full thesis can be found here: THESIS.
Intro
Earlier this month I defended my master's thesis in Computer Science at the Warsaw University of Technology. My specialization was AI, so it was natural for me to write in the field of Deep Learning. When I was starting out with Deep Learning, there was one task that captured my attention the most – Image Captioning. Initially, it seemed almost magical that a model is able to generate a caption describing an image's content. Then, of course, I dug deeper into how it really works – still amazing, but the magic disappeared.
My personal goal was to pick a topic that gives the possibility to research a new idea or approach to an already (to some extent) studied problem. I also wanted it to be connected to Image Captioning in some way. Hence, I picked Relative Image Captioning.
What is Relative Image Captioning, how does it differ from Image Captioning, and what can it be used for?
The Relative Image Captioning (RIC) task is a variation of Image Captioning, established in 2018 by IBM researchers (source). The goal is to generate a caption that describes the relative differences between the target and candidate image with respect to the target one.

Looking at the image above, it can be seen that the model generated a caption describing the relative differences between the two shirts – the bottom one is the target image and the top one is the candidate. In contrast to Image Captioning, RIC receives two images as input. While generating the caption, the model not only has to capture the content of those two images but also has to tell what the differences between them are, always with respect to the target one.
Okay, but why do we even need such a model?
The answer is directly connected to Dialog-based Retrieval.

A Dialog-based Retrieval system is designed to communicate with the user through dialog and, based on its turns, retrieve the desired information. In the image above we can see an example of such a system built for fashion retrieval. The goal of this particular system is to substitute for the shopping assistant – imagine that instead of browsing hundreds of pages, each with different filters, you could just chat with a bot and simply describe what you are looking for. The idea is great, but so is the complexity of the task.
In order to train such a system, one would need to acquire a lot of sample dialogs conducted by humans. Collecting such a dataset would be incredibly expensive and time-consuming. Instead of doing that, IBM researchers proposed a RIC model acting as the user in the dialog loop (paper).
So what was the original approach to this problem?
The first approach to solving this problem was an adaptation of the two most popular Image Captioning models: Show and Tell and Show, Attend and Tell. In short, both follow the Encoder-Decoder architecture.
As the Encoder, we have some kind of CNN that encodes the information from the images into feature vectors X_target and X_candidate. There is a single Encoder for both images, and it processes them independently.
The Decoder is some kind of recurrent network, so it can be a vanilla RNN, LSTM, GRU, etc. Between the Encoder and the Decoder there is a way of joining the feature vectors of the two input images – in the original work it was an element-wise difference, but any other operation would work similarly (concatenation, addition, multiplication, etc.) – a minimal code sketch of this setup follows the figure below.
Below you can see the visualization of the architectures proposed by the authors of the RIC.

From top to bottom: the first one follows the Encoder-Decoder architecture described above. The second one adds encoded information about the clothes' attributes (Texture, Fabric, Shape, Part, and Style) to the images' representations. The third one is the same as the second, but with an added Attention Mechanism, as in the Show, Attend and Tell model.
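To make the fusion step more concrete, here is a minimal PyTorch sketch of the first (plain Encoder-Decoder) variant. The module layout, the feature size, and the ResNet18 backbone are my own illustrative assumptions, not the exact setup used by the RIC authors.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleRelativeCaptioner(nn.Module):
    """Illustrative Encoder-Decoder baseline: a shared CNN encodes both
    images, the feature vectors are joined by an element-wise difference,
    and an LSTM decodes the relative caption."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(pretrained=True)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the FC head
        self.project = nn.Linear(cnn.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, target_img, candidate_img, caption_tokens):
        # A single shared Encoder processes both images independently.
        x_target = self.project(self.encoder(target_img).flatten(1))
        x_candidate = self.project(self.encoder(candidate_img).flatten(1))
        joint = x_target - x_candidate  # element-wise difference, as in the original work
        # The joint image representation conditions the first decoder step.
        inputs = torch.cat([joint.unsqueeze(1), self.embed(caption_tokens)], dim=1)
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)  # logits over the vocabulary for each time step
```

Swapping the element-wise difference for addition or multiplication only changes the `joint` line; concatenation would additionally change the input size of the decoder.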
When I learned about this approach, I thought that one thing that could improve this model is the use of a Transformer instead of an RNN. In general, there are two main reasons to do that (if you want to get a better grasp of the topic, I recommend this link):
- RNNs are sequential and cannot be parallelized, while Transformers are fully parallelizable, so the power of the GPU can be fully utilized -> faster training.
- Transformers are able to handle long-range dependencies because they process the sentence as a whole, leveraging the Self-Attention mechanism. RNNs do it sequentially, token by token.
After a quick chat with my supervisor, we came to the conclusion that it was worth trying, so I came up with two precise objectives for my master's thesis:
- Propose, implement, train, and analyze the performance of a Transformer-based architecture for the Relative Image Captioning problem.
- Identify key challenges in training a Relative Image Captioner and point out further research directions in User Modeling.
How can the Transformer architecture be adapted as a solution to the Relative Image Captioning problem?

The diagram above presents the architecture of TRIC (Transformer-based Relative Image Captioner), which was implemented as part of my master's thesis. It adapts the Transformer and BERT embeddings to the Relative Image Captioning task.
Let's start with the two images at the bottom. Firstly, both images are processed by a pre-trained ResNet101 in order to produce feature maps F1 and F2 of size 196×768. Those representations are then combined (by element-wise multiplication) into one, called F = F1 ⊙ F2. F is fed into the Transformer Relative Image Encoder, which is the Encoder part of the Transformer architecture. It accepts 196 vectors, each of size 768, so the output FE is also of size 196×768. FE is one of the inputs to the Transformer Multimodal Decoder (TMD).
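A rough PyTorch sketch of this image path is shown below. The 1×1 convolution that projects ResNet101's 2048 channels down to 768, the input resolution that yields 196 spatial positions, and the encoder hyperparameters are my own assumptions, not necessarily the exact choices made in TRIC.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RelativeImageEncoder(nn.Module):
    """Sketch of the TRIC image path: ResNet101 feature maps are projected to
    768 channels, the two images are fused by element-wise multiplication,
    and the result goes through a Transformer encoder."""

    def __init__(self, d_model=768, n_layers=6, n_heads=8):
        super().__init__()
        resnet = models.resnet101(pretrained=True)
        # Keep everything up to the last convolutional stage (drop pooling and FC).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.project = nn.Conv2d(2048, d_model, kernel_size=1)  # assumed 1x1 projection to 768
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def image_features(self, img):
        fmap = self.project(self.backbone(img))  # (B, 768, H, W); 14x14 = 196 positions for 448x448 inputs
        return fmap.flatten(2).permute(2, 0, 1)  # (H*W, B, 768)

    def forward(self, target_img, candidate_img):
        f1 = self.image_features(target_img)
        f2 = self.image_features(candidate_img)
        f = f1 * f2             # element-wise multiplication: F = F1 ⊙ F2
        return self.encoder(f)  # FE, same shape as F
```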
The other input is the representation of the relative caption. In order to obtain it, one needs to process the relative caption through the BERT model and choose a strategy for extracting the embeddings. That is because BERT produces several representations for each token – one representation per layer. I used the 12-layer BERT, so there are 12 vectors representing each token in the caption. I adopted the strategy of summing up (element-wise) the last 4 vectors for each token – this is based on various experiments performed by the authors of BERT (source). Having n vectors, each of size 768 (where n is the length of the caption and 768 is the hidden dimension of BERT), one has to add information about the position of the tokens within the caption. That is because the Transformer processes all tokens in parallel, so there is no natural notion of position like in the sequential approach. The Positional Encoding layer is a straight copy-paste from the Attention Is All You Need paper, so it is based on sine and cosine functions (not learned, as in other approaches). The caption's embeddings with positional information are passed into the TMD.
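Below is a sketch of this caption path using the Hugging Face transformers library. Summing the last four hidden states follows the strategy described above, and the positional encoding is the standard sinusoidal formulation from Attention Is All You Need; the helper function names are mine.

```python
import math
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def caption_embeddings(caption: str) -> torch.Tensor:
    """Sum the last 4 BERT hidden layers for every token -> (n, 768)."""
    inputs = tokenizer(caption, return_tensors="pt")
    with torch.no_grad():
        hidden_states = bert(**inputs).hidden_states  # 13 tensors of shape (1, n, 768)
    return torch.stack(hidden_states[-4:]).sum(dim=0).squeeze(0)

def positional_encoding(n: int, d_model: int = 768) -> torch.Tensor:
    """Fixed sine/cosine positional encodings, as in Vaswani et al."""
    position = torch.arange(n, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

emb = caption_embeddings("is solid white in color and has no sleeves")
emb = emb + positional_encoding(emb.size(0))  # caption representation fed into the TMD
```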
In the TMD we have two inputs: FE and the caption's embeddings. This is the part of the model that combines textual and visual information, hence it is called multimodal. The TMD is the Decoder part of the Transformer architecture. Lastly, the output of the TMD is passed through a linear layer followed by Softmax, which produces probability distributions for the tokens being generated.
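And a minimal sketch of the TMD itself, again with assumed hyperparameters: the caption embeddings (with positional information) act as the decoder's target sequence, FE is the memory it attends to, and a causal mask keeps the generation autoregressive.

```python
import torch
import torch.nn as nn

class TransformerMultimodalDecoder(nn.Module):
    """Sketch of the TMD: combines the caption embeddings with the encoded image
    representation FE and outputs a probability distribution over the vocabulary."""

    def __init__(self, vocab_size, d_model=768, n_layers=6, n_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, caption_emb, fe):
        # caption_emb: (n, B, 768)   caption embeddings + positional encoding
        # fe:          (196, B, 768) output of the Transformer Relative Image Encoder
        n = caption_emb.size(0)
        # Causal mask so each position can only attend to earlier tokens.
        causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        decoded = self.decoder(tgt=caption_emb, memory=fe, tgt_mask=causal_mask)
        return torch.softmax(self.out(decoded), dim=-1)  # token probabilities
```

In practice one would train on the raw logits with a cross-entropy loss and apply Softmax only at inference time; it is kept here to mirror the description above.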
If you are interested in the details of this model, you can check my thesis along with the code repository linked above.
OK, so that is the model architecture. Besides that, I promised some insights into the main challenges of training such systems.
What are the main challenges while training such a system?
Below you can see a list of challenges that I was able to identify while training the RIC model. For now, I will just list them; each point will be expanded on below.
- Cost of training.
- The problem of distinguishing between the target and the candidate image.
- The issue of choosing a proper evaluation metric.
- Representation of captions.
Cost of training. Training Transformer-based models isn't cheap. In order to take advantage of parallelization, one has to utilize the power of GPU computing. In general, you have two options: own powerful GPUs or outsource them using cloud solutions. I personally went with the second option, using the Google Cloud Platform. Below you can see the loss curves of the best model I was able to obtain within the 1000 zł of free GCP credits.

As you can see, I was able to overfit the model to the training set, but nothing more. The performance on the validation set is very weak. Due to the lack of free credits, I was not able to add regularization to the model and further improve the loss on the validation set. However, I was able to identify the key challenges of training such a system.
The problem of distinguishing between the target and the candidate image.

One of the problems I was able to capture was the model not being able to distinguish between the candidate and the target image. As can be seen in the image above, the model is able to generate meaningful captions, but the direction of the relationship is wrong. The model generates a caption describing the candidate image with respect to the target one, not the other way around as it is supposed to. This may be caused by the fact that, during training, the Image Encoder is provided only with the joint representation of the two images. That choice was dictated by the limited GPU resources available during training. One of the future research directions could be a study of ways of representing two separate images with an emphasis on the direction of the relative difference between them – the differences between the target and candidate images with respect to the target one.
The issue of choosing a proper evaluation metric. The BLEU metric, originally invented for the automatic evaluation of machine translation, was used to evaluate the original RIC model. Due to the poor scores, the authors decided to test the captioning model with a group of human evaluators, showing that the BLEU metric does not fully demonstrate the model's capabilities. A similar issue was observed in the TRIC model's evaluation, when the generated caption was semantically correct but the model used different words than those present in the ground-truth caption.

RIC architectures should be evaluated with semantic-based metrics in order to fully grasp the models' capabilities. However, the use of semantic metrics further adds to the computational cost of training the entire architecture. One could try to incorporate semantic-based evaluation into the RIC task while at the same time keeping the computational cost of such evaluation reasonable.
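To illustrate why a word-overlap metric falls short here, below is a small, hypothetical example using NLTK's sentence-level BLEU: the generated caption means roughly the same as the reference but shares few exact words, so the score comes out low.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Hypothetical example: semantically similar captions with little word overlap.
reference = ["is", "solid", "white", "and", "has", "no", "sleeves"]
generated = ["is", "plain", "off-white", "and", "is", "sleeveless"]

smooth = SmoothingFunction().method1  # avoid zero scores when higher n-grams have no matches
score = sentence_bleu([reference], generated, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # low score despite the similar meaning
```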
Representation of captions. In the TRIC model, the embeddings for tokens come from the BERT language model, trained on large corpora of text: BookCorpus (800M words) and English Wikipedia (2,500M words). While this is a good starting point for obtaining contextualized embeddings of relative captions, one could pre-train or fine-tune such a language model on a fashion-related corpus. As a result, better representations of domain-specific words would be obtained – for example, phrases like "v-neck", "off-white", and other dash-separated expressions could obtain more meaningful embeddings based on the appropriate context. I believe that further research in this area can really push RIC models towards producing human-like relative captions and therefore become a strong basis for more advanced Dialog-based Image Retrieval Systems.
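As a quick illustration of the problem, the standard bert-base-uncased tokenizer splits such dash-separated fashion terms into generic sub-word pieces, so the model never sees them as single, domain-specific units (the exact splits may vary with the tokenizer version and vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Dash-separated fashion terms end up as generic pieces rather than single tokens.
for phrase in ["v-neck", "off-white"]:
    print(phrase, "->", tokenizer.tokenize(phrase))
# e.g. "v-neck" -> ["v", "-", "neck"] (exact splits depend on the vocabulary)
```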
Conclusions
I believe that Dialog-based Retrieval systems are the future of e-commerce. While e-commerce companies strive to continually improve the quality of online shopping for their customers, their search engines are still mostly keyword- or filter-based, imposing many restrictions on users' search options. Those restrictions can be lifted by conversational interfaces in the form of a chatbot. Such a solution provides the ability to hold a conversation with a bot that resembles an in-store conversation with a shopping assistant. The RIC model plays a crucial part in training such a system, and it was very exciting to be able to research this topic.
Besides that, I am very happy to write this blog post because it is my first one and I have always wanted to do this. If you have any questions regarding the model (or anything else), feel free to contact me on LinkedIn or in the comments.
References
[1] Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, and Rogério Schmidt Feris. Dialog-based interactive image retrieval. CoRR, abs/1805.00145, 2018.
[2] Xiaoxiao Guo, Hui Wu, Yupeng Gao, Steven Rennie, and Rogério Schmidt Feris. Fashion IQ: A New Dataset towards Retrieving Images by Natural Language Feedback. CoRR, abs/1905.12794, 2019.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[5] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
[6] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.