I have been working on an exploratory project to build a conceptual description model that generates interpretations of artworks. To begin, I started with a CNN-LSTM architecture that works as a simple caption generator. In this post, I will describe how to build a basic CNN-LSTM architecture to create a model that outputs a text caption given an image. Please stay tuned for more on my project later.
CNN-LSTM
The main approach to this image captioning task has three parts: 1. use a pre-trained object-recognition network to extract features from images; 2. map these extracted feature embeddings to text sequences; and 3. use a long short-term memory (LSTM) network to predict the word that follows, given the image features and the text sequence so far.
Data
This is a supervised learning task where we map each image to a set of specific captions, so we need a dataset that pairs images with captions. Here are a few datasets that are publicly available to explore.
- Flickr30K Dataset: Bryan Plummer and his colleagues’ collection of about 30,000 images from Flickr, each with multiple corresponding descriptions.
- Conceptual Captions Dataset: Google AI’s effort to collect conceptual captions from the web. It contains captions and URL links for more than 3 million images.
I will not go into the details of how to download or resize images here. I have resized all images to fit within 500×500 pixels and saved them in JPEG format. Each file name should be an identifying ID that links the file to its image descriptions.
Feature Extraction
The first step is to extract features using a pre-trained network. This code takes the local directory containing the jpg files and outputs a reference dictionary with each image ID as a key and its feature embedding as the value.
This process may take a while, so I added code that dumps the progress to a local drive every 1,000 iterations.
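The original extraction code is not reproduced here, but below is a minimal sketch of the idea, assuming a Keras VGG16 backbone with the final classification layer removed; the function name, file layout, and checkpointing details are illustrative.

import os
import pickle
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def extract_features(img_dir, dump_path='features.pkl', dump_every=1000):
    # drop the final classification layer and keep the 4096-d fc2 embedding
    base = VGG16()
    model = Model(inputs=base.inputs, outputs=base.layers[-2].output)
    features = {}
    for i, fname in enumerate(os.listdir(img_dir)):
        if not fname.endswith('.jpg'):
            continue
        img = load_img(os.path.join(img_dir, fname), target_size=(224, 224))
        x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
        image_id = os.path.splitext(fname)[0]  # the file name doubles as the ID
        features[image_id] = model.predict(x, verbose=0)
        if (i + 1) % dump_every == 0:  # checkpoint progress to a local drive
            pickle.dump(features, open(dump_path, 'wb'))
    pickle.dump(features, open(dump_path, 'wb'))
    return features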
Description Preprocessing
The accompanying captions need to be preprocessed. This function assumes that the set of descriptions takes the format of a list of (id, description) tuples. Some images may have more than one description associated with them, and each pair should be represented as a separate tuple. In this step, we are also adding special words to each sentence to mark the beginning and the end of the sequence (seqini and seqfin).
[(id1, 'description1-1'), (id1, 'description1-2'), (id2, 'description2-1') ... ]
Additionally, I added a function that randomly selects n descriptions, so I can set a maximum number of captions per image in case some images have far too many.
This function converts the list of description tuples into a dictionary of preprocessed descriptions in the below format:
{ 'id1': ['description1-1', 'description1-2', ...], 'id2': ... }
# or
{ 'id1': 'description1', 'id2': 'description2' }
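As a rough illustration, here is a minimal sketch of such a preprocessing function; the function name, the cleaning choices (lowercasing and stripping punctuation), and the max_per_image argument are assumptions for illustration rather than the project's exact code. It returns the dictionary-of-lists format shown above.

import random
import string

def preprocess_descriptions(desc_tuples, max_per_image=None):
    table = str.maketrans('', '', string.punctuation)
    descriptions = {}
    for image_id, desc in desc_tuples:
        # lowercase, strip punctuation, and wrap with the special boundary tokens
        words = desc.lower().translate(table).split()
        descriptions.setdefault(image_id, []).append('seqini ' + ' '.join(words) + ' seqfin')
    if max_per_image is not None:
        # randomly keep at most max_per_image captions per image
        for image_id, descs in descriptions.items():
            if len(descs) > max_per_image:
                descriptions[image_id] = random.sample(descs, max_per_image)
    return descriptions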
Cross-validation
Now that we have our dataset, let’s divide it into train, validation, and test sets (here I’m using a 70:15:15 split).
from sklearn.model_selection import train_test_split
# hold out 30% of the image IDs, then split that holdout evenly into validation and test
train_list, test_list = train_test_split(list(descriptions.keys()), test_size=0.3)
val_list, test_list = train_test_split(test_list, test_size=0.5)
Sequence Generator
To recap, we now have three lists of IDs for the train, validation, and test sets. We also have a description dictionary and a feature dictionary. Now we need to convert each description into sequences. For example, let’s say an image with an ID of 1234 has the description "dog is running". This needs to be broken down into input and output sets as below:
- 1234: [seqini] → [dog]
- 1234: [seqini][dog] → [is]
- 1234: [seqini][dog][is] → [running]
- 1234: [seqini][dog][is][running] → [seqfin]
But not only do we need to tokenize the sentences into text sequences, we also need to map them to integer labels. So, for instance, let’s say our entire corpus has these unique words: [seqini, fast, dog, is, running, seqfin]. We can map these to the integers [0, 1, 2, 3, 4, 5], which results in the sets below:
- 1234: [0] → [2]
- 1234: [0, 2] → [3]
- 1234: [0, 2, 3] → [4]
- 1234: [0, 2, 3, 4] → [5]
So the logic is to assign each unique word in the entire training description set a unique index, creating a text-to-sequence map. Then we iterate through each of the descriptions, tokenize them, create the (input → output) subsets of the growing sequence, and convert them into integer sequences. Since several variables created while processing the training set are reused to process the validation and test sets, I put them all into a sequence generator class.
This will return two sets of inputs (the image features and the partial text sequences) and one set of outputs (the next word) for each of the training and validation sets.
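The class itself is not reproduced here, but below is a minimal sketch of the logic, assuming the Keras Tokenizer for the text-to-integer map and the (1, feature_dim) arrays produced in the extraction step; the class and attribute names are illustrative.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

class SequenceGenerator:
    def __init__(self, train_descriptions):
        # fit the text-to-integer map on the training captions only
        all_captions = [d for descs in train_descriptions.values() for d in descs]
        self.tokenizer = Tokenizer()
        self.tokenizer.fit_on_texts(all_captions)
        self.vocab_size = len(self.tokenizer.word_index) + 1
        self.max_length = max(len(c.split()) for c in all_captions)

    def make_sequences(self, descriptions, features):
        X_img, X_seq, y = [], [], []
        for image_id, descs in descriptions.items():
            for desc in descs:
                seq = self.tokenizer.texts_to_sequences([desc])[0]
                for i in range(1, len(seq)):
                    # input: the sequence so far (padded); output: the next word, one-hot encoded
                    in_seq = pad_sequences([seq[:i]], maxlen=self.max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=self.vocab_size)[0]
                    X_img.append(features[image_id][0])
                    X_seq.append(in_seq)
                    y.append(out_word)
        return np.array(X_img), np.array(X_seq), np.array(y)

# usage (assuming the descriptions and features dictionaries from the earlier steps)
train_descriptions = {k: descriptions[k] for k in train_list}
gen = SequenceGenerator(train_descriptions)
X_img, X_seq, y = gen.make_sequences(train_descriptions, features)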

Model Training
Now let’s train the model using Keras. As mentioned in the approach, the idea is to feed the sequence of words through the LSTM layers, together with the image features, so the model can predict the most likely word (actually the integer assigned to a word) to follow the sequence.
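As a reference point, here is a minimal sketch of a common "merge" style CNN-LSTM captioning model in Keras; the layer sizes and training settings are illustrative choices rather than the exact configuration used in this project, and the gen, X_img, X_seq, and y names come from the generator sketch above.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def build_model(vocab_size, max_length, feature_dim=4096):
    # image-feature branch
    inputs1 = Input(shape=(feature_dim,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # text-sequence branch
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merge the two branches and predict the next word over the vocabulary
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = build_model(gen.vocab_size, gen.max_length)
model.fit([X_img, X_seq], y, epochs=20)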
Prediction
Predicting a caption for a test image follows the above steps in reverse. You compute the feature embedding for the test image and feed it into the model along with the initial sequence, which is the integer representation of the initiating word ‘seqini’. Then you take the prediction, append it to the sequence, and feed it into the model again, repeating until the model predicts the integer for the ending word ‘seqfin’. Then, using the tokenizer, we convert the integers back to the mapped vocabulary. The function below makes a prediction given an image ID.
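The original function is not shown here, so below is a minimal sketch of the greedy decoding loop, reusing the hypothetical names from the earlier sketches (gen for the sequence generator, features for the feature dictionary).

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, gen, features, image_id):
    feature = features[image_id]  # shape (1, feature_dim) from the extraction step
    text = 'seqini'
    for _ in range(gen.max_length):
        seq = gen.tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=gen.max_length)
        yhat = int(np.argmax(model.predict([feature, seq], verbose=0)))
        word = gen.tokenizer.index_word.get(yhat)
        if word is None or word == 'seqfin':  # stop at the ending token
            break
        text += ' ' + word
    return text.replace('seqini', '').strip()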
Evaluation
Evaluating a machine-generated caption is not as straightforward as we might imagine. Instead of exact matching, we can look at how well the n-grams of the prediction match the reference captions. This measure is called the BLEU (Bilingual Evaluation Understudy) score. The idea is to average the 1- to 4-gram precisions (the proportion of n-grams in the machine-generated caption that also appear in the reference descriptions) and apply a brevity penalty to short outputs that would otherwise inflate the score.
from nltk.translate.bleu_score import corpus_bleu
bleu_1 = corpus_bleu(references,  # list of lists of tokenized reference captions
                     predictions,  # list of tokenized predicted captions
                     weights=(1, 0, 0, 0))  # weights over 1- to 4-grams (BLEU-1 here)
We looked at how to create an image captioning model using a CNN-LSTM architecture. For humans, describing a visual scene involves a different level of language representation, and our perception depends on past experience and current context. Moreover, composing a sentence in human cognition may not actually be ‘sequential’, since syntax does not directly follow visual saliency. These are important distinctions to take into consideration when pushing this model toward more successful, human-like performance.
Happy Learning!
