Building a Deep Learning Image Captioning Model on Azure

What’s going on in this image? A tutorial on how to create a deep learning computer vision and NLP model on the cloud.

While onboarding for my new role working with AI/ML, I set out to build a deep learning model on the cloud from scratch. During college, I interned at an art museum, where I saw firsthand the industry-wide challenge of digitizing artworks to make them accessible to broad audiences.

Even when an artwork was digitized for online access, it wasn’t always easy to find. A user would have a hard time looking up a painting in the museum’s database that they perhaps saw but couldn’t remember the name of. They would likely use semantic language to describe the image (e.g., "man sitting on a bench with an instrument") instead of key phrases such as the painting’s title ("Mezzetin"), date (1718–20), or the artist’s name (Antoine Watteau). This use case inspired me to build a neural network that could identify the key components of an image and generate a descriptive caption based on the scene.

The applications of an image captioning model, however, extend beyond the art museum space. Organizations that deal with many images that require easily searchable descriptions – such as factories, newspapers, or digital archives – could benefit from the automatic generation of captions without relying on the labor-intensive process of manual description. Moreover, captions have the added benefit of supporting accessibility standards. Written captions can be read aloud by adaptive technology for individuals with vision disabilities.

Azure Cognitive Services offers image tagging and image description services. However, I wanted to see how I could build and deploy a model in the cloud from scratch. My goal was to train, test, and deploy a deep learning model using Azure’s tools and services, and ultimately see how its performance compared to some of Azure’s existing offerings.

To build the deep learning model, Jason Brownlee’s image captioning model served as a jumping-off point. I used Azure Databricks and Azure Machine Learning as the platforms for creating my deep learning model. The general training and deployment workflow, however, is similar regardless of which cloud provider platform you use:

  1. Store data
  2. Set up the training and testing environment on Azure
  3. Load and pre-process data
  4. Train and evaluate a model
  5. Register model
  6. Prepare model for deployment
  7. Deploy model to compute target
  8. Consume deployed model (i.e., web service) for inference

In what follows, I walk through how I built the model and share tips for spinning up a version for yourself. At the end, I discuss why the model performs as it does and identify opportunities for improvement.

Overview of Tech Stack and Architecture

What happens when an image gets described? There are two main considerations at play. First, the model must identify the key subjects and supporting details in an image; this is the computer vision component, in which the model recognizes what is visually represented and outputs a unique feature vector. Second, natural language processing: the model must translate what it sees into a coherent sentence that summarizes what is happening in the image.

The project was built on the Microsoft Azure cloud. Specifically, the model was trained on Databricks using data ingested from Blob Storage. The models were then registered in an Azure Machine Learning Workspace so that the model could be deployed as a RESTful API endpoint for end usage. A user could send multiple public URLs to images to the endpoint and receive a caption generated from the Azure Computer Vision Cognitive Service and the deep learning model.

Below is the architectural diagram that maps out the tools and services used:

The dataset I used was the Flickr8k dataset, which consists of over 8,000 images scraped from Flickr and made available under a Creative Commons license. It also includes a CSV file containing five human-written captions for each image (over 40,000 captions total). Each image has multiple captions in different syntactical styles because there are many ways to describe an image, with some more accurate, detailed, or concise than others.

A pre-trained word embedding model, GloVe, was used to map the words in the training data into a high-dimensional vector space that captures the relationships between words.

The model architecture I used combined a Recurrent Neural Network (RNN) with a data generator to load the data progressively. The former helped build a sentence by choosing each word based on the previous ones in the sentence. The latter avoided the burden of loading all the data into memory at once on the cluster node. I also took advantage of the pre-trained VGG16 model to apply transfer learning (the InceptionV3 model could serve as an alternative). This way, I could customize an existing model to meet the needs of my solution using Keras.

To assess the quality of the generated captions, I used the BLEU score, a quantitative measure of how closely a generated caption matches the "ground truth" (i.e., human-written) captions.

Process

Learn how to train and deploy a deep learning model on Azure in eight steps.

1. Store data in Blob Storage

Blob Storage is optimal for storing large amounts of unstructured data such as images. (You can learn about creating a Blob Storage account here.) Though you can upload your downloaded data to Blob Storage using the SDK or the Azure Portal, Azure Storage Explorer has a handy UI that lets you upload your data directly without much effort.

2. Setting up your Azure environment and virtual machines (VMs), and loading data using a Databricks mount point

The environment for creating a deep learning model is set up by creating an instance of an Azure Machine Learning (AML) Workspace and a Databricks Workspace through the Azure Portal. From there, you can launch the Databricks Workspace, create notebooks, and spin up compute clusters.

There are many different configurations you can choose for your compute cluster. For the Databricks Runtime Version, it’s best to use a runtime with "ML" in the version name to ensure that the installed library versions are compatible with the popular machine learning libraries. The beauty of spinning up clusters on Databricks is that the nodes shut down after a period of inactivity, which is a financially beneficial feature.

Side note: if you use a larger dataset such as the Flickr30k dataset, I recommend training on GPUs to speed things up. Horovod, a distributed training framework that was born out of Uber, could accelerate deep learning training even further.

In Databricks, the data is ingested via a mount point, which authorizes the notebook in Databricks to read and write data in the Blob Storage container where all the folders (i.e., blobs) live. The benefit of creating a mount point to your data is that you don’t have to finagle the tooling for your specific data sources but can instead use the Databricks File System as though the data were stored locally.

Here’s the code snippet on how to set up a mount point to load the data:
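(A minimal sketch; the container, storage account, and secret scope/key names below are placeholders for your own values.)

```python
# Sketch: mount a Blob Storage container in Databricks so it can be read
# through the Databricks File System. Keep the account key in a secret scope.
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/mycontainer",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-account-key")
    }
)
```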

You can then navigate to the directory using the prefix /dbfs/mnt/mount-name. For example, you can write something like open('/dbfs/mnt/mycontainer/Flickr8K_data/captions.csv', 'r') to read the file.

3. Pre-processing image and caption data

To prepare the data for model training, the captions dataset and images directory required preprocessing. This step consisted of creating a "vocabulary dictionary" of all the words present in the training dataset (the word bank the model draws from to create captions), reshaping the images to the VGG16 model’s target input size (224 by 224 pixels), and extracting the features of each image using the VGG16 model.
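As an illustration of the feature-extraction step, here is a minimal sketch using Keras’s pre-trained VGG16 model; the image directory path and helper function name are placeholders of my own:

```python
# Sketch: extract a feature vector for each image with a pre-trained VGG16
# model whose final classification layer has been removed.
import os
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

vgg = VGG16()
# Keep everything up to the second-to-last layer (a 4,096-element vector).
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_features(image_dir):
    features = {}
    for name in os.listdir(image_dir):
        img = load_img(os.path.join(image_dir, name), target_size=(224, 224))
        arr = preprocess_input(img_to_array(img).reshape((1, 224, 224, 3)))
        features[name.split('.')[0]] = vgg.predict(arr, verbose=0)
    return features

features = extract_features('/dbfs/mnt/mycontainer/Flickr8K_data/images')
```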

After processing all the captions and applying standard Natural Language Processing (NLP) cleaning practices (e.g., making all words lowercase and removing punctuation), the vocabulary size was 8,763 words.
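For reference, the cleaning steps might look something like this minimal sketch (the exact filters, such as dropping non-alphabetic tokens, are assumptions on my part):

```python
# Sketch: standard NLP cleaning for the caption text.
import string

def clean_caption(caption):
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().split()              # lowercase and tokenize
    words = [w.translate(table) for w in words]  # strip punctuation
    words = [w for w in words if w.isalpha()]    # drop numeric/empty tokens
    return ' '.join(words)

print(clean_caption('A child in a pink dress is climbing up a set of stairs.'))
# -> 'a child in a pink dress is climbing up a set of stairs'
```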

I won’t go into much further detail about preprocessing the data because that step is outlined well in Brownlee’s walk-through.

4. Defining the model and training it on data

Sometimes in NLP, your training data isn’t sufficient for the model to learn accurate relationships between words. GloVe is an algorithm that learns word vectors in which similar words sit close to one another in the vector space.

Using pre-trained word vectors available through Stanford’s NLP project helps distinguish relationships among the words in the training vocabulary set to create a comprehensible sentence. I used the dataset with 400,000 words represented in 200-dimensional vectors. By mapping the training vocabulary to the pre-trained GloVe model, the resulting embedding matrix could be loaded into the model for training.
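As a sketch, mapping the training vocabulary to the GloVe vectors might look like the following; the file path is a placeholder, and tokenizer and vocab_size come from the preprocessing step:

```python
# Sketch: build an embedding matrix from the pre-trained 200-dimensional
# GloVe vectors (glove.6B.200d.txt) for every word in the training vocabulary.
import numpy as np

embedding_dim = 200
embeddings_index = {}
with open('/dbfs/mnt/mycontainer/glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Rows align with the tokenizer's word indices; unknown words stay zero.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
```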

The use of a data generator function to load the captions and image data avoided the expense of storing the entire dataset in memory at once. I recommend checking out Brownlee’s explanations for more details on how the model architecture works.
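A minimal sketch of such a generator, assuming the descriptions, photos (VGG16 features), tokenizer, max_length, and vocab_size objects produced in the earlier steps:

```python
# Sketch: yield (image feature, partial caption) -> next-word training pairs
# one photo at a time, so the full dataset never sits in memory at once.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(descriptions, photos, tokenizer, max_length, vocab_size):
    while True:
        for key, caption_list in descriptions.items():
            photo = photos[key][0]
            X1, X2, y = [], [], []
            for caption in caption_list:
                seq = tokenizer.texts_to_sequences([caption])[0]
                # Each prefix of the caption becomes one training sample.
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_word)
            yield (np.array(X1), np.array(X2)), np.array(y)
```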

When building the model, a sentence can be thought of as a sequence of words. A Long Short-Term Memory (LSTM) network is a special type of RNN, the standard neural network architecture for processing sequences, and it preserves the context of a sequence by paying special attention to preceding inputs. This lets the LSTM build a caption word by word, where chronological ordering matters, by predicting the word most likely to occur next based on the previous words.

Below is an architecture diagram of the neural network, which takes two inputs: the feature vector representing the image and the incomplete caption. You can learn more about the different parts of the model (e.g., photo feature extractor, sequence processor, and decoder) here.

The following function translates the neural network architecture into a Keras-defined deep learning model and outputs the model object for training. The data generator passed the training data to the model through the model.fit() function, and the model was trained for 20 epochs, which took 1 hour and 36 minutes in total.
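(A minimal sketch of what such a function might look like, assuming the 4,096-element VGG16 feature vector, 200-dimensional GloVe embeddings, and the vocab_size, max_length, and embedding_matrix values from the earlier steps; the layer sizes are assumptions.)

```python
# Sketch: merge-style captioning model with a photo feature extractor,
# a sequence processor, and a decoder that predicts the next word.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length, embedding_matrix):
    # Photo feature extractor: compress the 4,096-element VGG16 vector.
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # Sequence processor: embed the partial caption and run it through an LSTM.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 200, weights=[embedding_matrix],
                    mask_zero=True, trainable=False)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # Decoder: merge both representations and predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
```

With the generator from the previous step, training might then look like model.fit(data_generator(train_descriptions, train_features, tokenizer, max_length, vocab_size), epochs=20, steps_per_epoch=len(train_descriptions)).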

5. Register model for deployment as an Azure Container Instance (ACI)

With the model trained, it can now be deployed as an Azure Container Instance (ACI) so that anyone can use it to create captions for any directory of images. Typically, if the model were meant for production, it would be deployed to Azure Kubernetes Service (AKS), which you can learn about here.

The Azure Machine Learning (AML) Workspace created a while back now comes in handy. The Workspace is called through the following code snippet, which requires you to log into your Azure account for security.
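(A minimal sketch of that call, using the AML SDK with interactive authentication; the subscription, resource group, and workspace names are placeholders.)

```python
# Sketch: connect to the AML Workspace; a browser login prompt appears.
from azureml.core import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication

ws = Workspace.get(name='my-aml-workspace',
                   subscription_id='<subscription-id>',
                   resource_group='my-resource-group',
                   auth=InteractiveLoginAuthentication())
print(ws.name, ws.location, sep='\n')
```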

The second step is to register the model and any essential assets the endpoint requires in the Azure Machine Learning Workspace so that the endpoint can access them. I registered both the model and the tokenizer (vocabulary) because the construction of the output sentences references the tokenizer throughout the process of predicting the next most likely word in the sentence.
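A sketch of registering both assets, assuming the trained model and tokenizer were saved locally as model.h5 and tokenizer.pkl (the file and model names are placeholders):

```python
# Sketch: register the trained model and the tokenizer in the AML Workspace.
from azureml.core.model import Model

registered_model = Model.register(workspace=ws,
                                  model_path='model.h5',
                                  model_name='image-caption-model')
registered_tokenizer = Model.register(workspace=ws,
                                      model_path='tokenizer.pkl',
                                      model_name='caption-tokenizer')
```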

Let’s also confirm that the model and tokenizer were successfully registered.
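(One way to check, as a sketch:)

```python
# Sketch: list the registered models in the Workspace to confirm both assets.
from azureml.core.model import Model

for m in Model.list(ws):
    print(m.name, m.version)
```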

6. Prepare deployment configurations by writing an entry script

How does the endpoint know how to use the trained model and give an output? All of that is defined in the entry script for the ACI. The entry script defines the kind of user input to expect, what to do with the data, and how to format the results.

The below code defines an endpoint that accomplishes four steps:

  1. Receives a SAS URI to a directory of images and accesses the registered model and tokenizer
  2. Calls the trained model to build a caption for each image
  3. Calls the Azure Computer Vision Cognitive Services resource to generate a supplemental caption
  4. Returns the captions from the custom model and the Azure resource in JSON format

The entry script must have two functions in particular, init() and run(). The former loads the important assets; the latter intakes the data and performs most of the hard work preprocessing the image(s) and constructing captions. The code snippet below writes the script to a local file called "score.py," which will then be included in the configuration of the ACI deployment.
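(A skeleton of what score.py might contain, with the caption-generation and Computer Vision calls only indicated by comments; the registered asset names match the placeholders used during registration.)

```python
# Sketch: write a minimal score.py with the required init() and run() functions.
score_py = '''
import json, pickle
from azureml.core.model import Model
from tensorflow.keras.models import load_model

def init():
    global model, tokenizer
    model = load_model(Model.get_model_path('image-caption-model'))
    with open(Model.get_model_path('caption-tokenizer'), 'rb') as f:
        tokenizer = pickle.load(f)

def run(raw_data):
    urls = json.loads(raw_data)['image_urls']
    captions = {}
    for url in urls:
        # 1. Download the image and extract its VGG16 features.
        # 2. Generate a caption word by word with the model and tokenizer.
        # 3. Call the Computer Vision Cognitive Service for a second caption.
        captions[url] = {'custom_model': '...', 'cognitive_services': '...'}
    return json.dumps(captions)
'''

with open('score.py', 'w') as f:
    f.write(score_py)
```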

7. Deploy model to compute target

The ACI also needs a defined environment with the requisite dependencies it needs to score the input. The environment and entry script get defined in the inference configuration.

The moment of truth: once the ACI gets defined, it’s ready for deployment as an Azure Web Service so it can get called. The scoring URI that is returned from the deployment is what will be used to test the endpoint through a REST API call, so copy the URI and save it for later.
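As a sketch, the inference configuration and ACI deployment might look like this (the environment name, pip packages, and service name are assumptions):

```python
# Sketch: define the environment and entry script, then deploy to ACI.
from azureml.core import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

env = Environment(name='caption-env')
env.python.conda_dependencies.add_pip_package('azureml-defaults')
env.python.conda_dependencies.add_pip_package('tensorflow')
env.python.conda_dependencies.add_pip_package('pillow')
env.python.conda_dependencies.add_pip_package('requests')

inference_config = InferenceConfig(entry_script='score.py', environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

service = Model.deploy(workspace=ws,
                       name='image-caption-service',
                       models=[registered_model, registered_tokenizer],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)  # save this URI for testing the endpoint
```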

8. Score new data by calling the Azure Web Service

There are two ways to test the model endpoint: in Python or Postman. The endpoint takes any public URL of an image. As expected, the endpoint returns both the caption from the trained model and Azure Cognitive Service’s image captioning service for comparison. Test it with an image from any website or one of your own images to see the different results.

In Python, using the requests library to construct the API call is as follows:
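(A minimal sketch; the scoring URI and image URL are placeholders, and the JSON key matches the entry-script sketch above.)

```python
# Sketch: call the deployed ACI web service with a public image URL.
import json
import requests

scoring_uri = 'http://<your-service>.azurecontainer.io/score'
payload = json.dumps({'image_urls': ['https://example.com/dog-on-beach.jpg']})
headers = {'Content-Type': 'application/json'}

response = requests.post(scoring_uri, data=payload, headers=headers)
print(response.json())
```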

If you prefer to work in Postman for faster debugging and the ease of a UI, the process is similar: send a POST request to the scoring URI with the same JSON body.

Et voilà! You’ve now successfully trained and deployed a deep learning model on Azure!

Learn more about deploying your machine learning models on the cloud here.

Project results

Not all captions are created equally. There are many ways to describe an image, each of which may be syntactically correct. But some are more concise and fluid than others.

As mentioned earlier, the BLEU score is a standard way of measuring how closely generated text matches reference text on a scale of 0 to 1. The NLTK BLEU score package can be used to evaluate the performance of the model by comparing the predicted captions to the actual captions written by humans.

The resulting BLEU score with weights (1.0, 0, 0, 0) was 0.48. Typically, BLEU scores between 0.50 and 1 indicate good results. Learn more about calculating the BLEU score here.
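For reference, a minimal sketch of that calculation, assuming actual is a list of tokenized reference captions per image and predicted is the corresponding list of tokenized generated captions:

```python
# Sketch: corpus-level BLEU-1 score comparing generated captions against
# the human-written references.
from nltk.translate.bleu_score import corpus_bleu

bleu_1 = corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0))
print('BLEU-1: %.2f' % bleu_1)
```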

Model Evaluation and Improvement

The model’s performance still has room for improvement. It does reasonably well on images that have simple backgrounds and contain no more than a handful of distinct subjects.

However, it sometimes struggles to identify the subjects accurately (e.g., labeling an ambiguous figure as a man or inanimate objects as dogs), and it struggles with photos taken indoors or, if outdoors, in certain seasons. I have thoughts on why this might be, as well as what could be done in the future to make the captions more accurate and realistic sounding.

My hypothesis is that the training dataset is biased. The data was scraped from Flickr in 2014, a time when digital cameras were widely available and used by casual shooters. Many of the images are taken in family-oriented settings such as sports games or backyards. While there are images taken in other settings, too, an overwhelming number contain dogs fetching a ball, kids playing in the grass, and groups of individuals.

With a larger dataset that had more variance in the types of scenes depicted, we could expect more accurate results on images the model has not seen before. The trade-off of a larger dataset, however, is longer training time. To mitigate this, an option would be to partition the data or convert it to Delta Lake or Parquet format, in tandem with distributed training or GPU VMs.

If one wants to increase model accuracy, one option would be to experiment with the hyperparameters used in training. For example, increasing the number of epochs (i.e., how many times the model passes through the entire training dataset) while keeping the batch size (i.e., how many samples it trains on at a time) relatively small helps the model catch more nuances in the dataset, at the cost of slower gradient descent convergence. The right balance between the number of epochs and the batch size could lead to a model that performs better overall without skyrocketing the training time.

Another way to improve results is to add complexity to the neural network, such as adding more layers to the model architecture. A more complex model can potentially produce better results because it is able to learn the data in finer detail.

Once the model is improved and production-ready, it could be deployed to AKS rather than as an ACI instance for use on large directories.

Final Thoughts

Like many, I found Andrew Ng’s Deep Learning and Machine Learning courses immensely helpful in learning about the deep learning space. If these topics are new to you, I recommend giving his videos a shot.

To recap, we walked through how to set up a machine learning environment on Azure, define and train a deep learning model using Keras within Databricks, and deploy the model as a container instance for evaluation on new images.

