Multi-label image classification with Inception net

Radek Bartyzal
Towards Data Science
6 min read · Apr 2, 2017


Inception v3 architecture. (source)

Update: This article has been updated to fix possible issues with accuracy calculation pointed out by Theresa Barton. The git repo has also been updated and you can find all the changes there.

Inception v3 is a deep convolutional neural network trained for single-label image classification on the ImageNet data set. The TensorFlow team has already prepared a tutorial on retraining it to tell apart a number of classes based on our own examples. We are going to modify the retraining script retrain.py from that tutorial to turn the network into a multi-label classifier.

If you just want to jump to the resulting code, it’s here with all the necessary files and information required to make it work. From now on, I will assume that you have cloned the mentioned repository and refer to its files.

So what needs to be done? First of all, we have to somehow tell the network which labels are correct for each image. Then we have to modify both the last layer that is being retrained and the method of evaluating the generated predictions, so that the network can actually be trained with multiple correct classes per image.

Requirements

Data preprocessing

Prepare training images

  1. Put all the training images into one folder inside the images directory.
    Try to remove all duplicate images, as they could artificially inflate the test and validation accuracy.
    The name of the folder does not matter. I use multi-label.

Prepare labels for each training image

We need to prepare files with the correct labels for each image. Name the files <image_file_name.jpg>.txt: if you have an image car.jpg, the accompanying file should be named car.jpg.txt.

Put each label on a new line inside the file, nothing else.

Now copy all the created files into the image_labels_dir directory located in the project root. You can change the path to this folder by editing the global variable IMAGE_LABELS_DIR in retrain.py.
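
For example, if the (hypothetical) image car.jpg belongs to the classes car and accident, the file image_labels_dir/car.jpg.txt would contain just:

car
accident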

Create a file containing all labels

The original retraining script used the folder structure to derive the list of classes. In our case, all of the training images are inside one folder, so we need to list the classes in an external file.

Create a file labels.txt in the project root and fill it with all the possible labels, each label on a new line, nothing else. It looks just like an image label file for an image that belongs to every possible class.
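
Continuing the hypothetical example above, with three possible classes the labels.txt file would simply be:

car
accident
person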

Modifying main

The main() method originally loaded the directory structure containing the images for each label in separate folders and created validation, testing and training sets for each class by calling:

image_lists = create_image_lists(FLAGS.image_dir, FLAGS.testing_percentage, FLAGS.validation_percentage)

We now have all the images inside one directory, so image_lists.keys() contains only one element: the folder with all of our images (e.g. multi-label). All the training images are split into validation, testing and training sets accessible through this key.
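
To make that concrete, the returned dictionary looks roughly like this (the file names are made up):

image_lists = {
    'multi-label': {
        'dir': 'multi-label',
        'training': ['img001.jpg', 'img002.jpg', ...],
        'testing': ['img003.jpg', ...],
        'validation': ['img004.jpg', ...],
    }
}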

Now that we have our data correctly split up, we just need to load the list of labels and calculate the class count:

with open(ALL_LABELS_FILE) as f:
    labels = f.read().splitlines()
class_count = len(labels)

Creating ground_truth vectors

  1. Add a get_image_labels_path() method, which is just a slightly edited get_image_path() method that returns the path to the file with the correct labels for an image, e.g. image_labels_dir/car.jpg.txt for car.jpg.
  2. Edit the get_random_cached_bottlenecks() method:

This method creates the ground_truth vectors containing the correct labels of each returned image. Originally it simply created a vector of zeroes:

ground_truth = np.zeros(class_count, dtype=np.float32)

and then put a 1.0 at the position of the correct label, which we knew because it is the name of the folder we took the image from:

ground_truth[label_index] = 1.0

It’s not that simple with multi-label classification. We will need to load all the correct labels for the given image from its image_label_file.

Get a path to the file with correct labels:

labels_file = get_image_labels_path(image_lists, label_name, image_index, IMAGE_LABELS_DIR, category)

Read all lines (one label per line) from the file and save them into a list true_labels:

with open(labels_file) as f:
    true_labels = f.read().splitlines()

Initialize the ground_truth vector with zeroes:

ground_truth = np.zeros(class_count, dtype=np.float32)

Indicate the correct labels in the ground_truth vector with 1.0:

idx = 0
for label in labels:
    if label in true_labels:
        ground_truth[idx] = 1.0
    idx += 1

The labels list is an added parameter of the get_random_cached_bottlenecks() method and contains the names of all the possible classes.

That’s it! We can improve this solution by caching the created ground_truth vectors. That avoids recreating the ground_truth vector every time we request it for the same image, which is bound to happen if we train for multiple epochs. That is what the global dictionary CACHED_GROUND_TRUTH_VECTORS is for; a minimal sketch of the idea follows.
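
Here is that sketch; the helper name get_ground_truth() is chosen purely for illustration, the actual code lives inside get_random_cached_bottlenecks():

import numpy as np

CACHED_GROUND_TRUTH_VECTORS = {}

def get_ground_truth(labels_file, labels, class_count):
    # Reuse the vector if we have already built it for this image.
    if labels_file in CACHED_GROUND_TRUTH_VECTORS:
        return CACHED_GROUND_TRUTH_VECTORS[labels_file]

    with open(labels_file) as f:
        true_labels = f.read().splitlines()

    # Mark every label of this image with 1.0, everything else stays 0.0.
    ground_truth = np.zeros(class_count, dtype=np.float32)
    for idx, label in enumerate(labels):
        if label in true_labels:
            ground_truth[idx] = 1.0

    CACHED_GROUND_TRUTH_VECTORS[labels_file] = ground_truth
    return ground_truth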

Modifying training

The add_final_training_ops() method originally added a new fully-connected layer with a softmax activation for training. We just need to replace the softmax function with a different one.

Why?

The softmax function squashes all values of a vector into the range [0, 1] so that they sum to 1, which is exactly what we want in single-label classification. For our multi-label case, however, we would like the resulting class probabilities to be independent, so that an image of a car can belong to the class car with 90% probability and to the class accident with 30% probability, and so on.

We can achieve that by using, for example, the sigmoid function, which squashes each value into [0, 1] independently of the others.

Specifically we will replace:

final_tensor = tf.nn.softmax(logits, name=final_tensor_name)

with:

final_tensor = tf.nn.sigmoid(logits, name=final_tensor_name)
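
To see the difference on concrete (made-up) logits:

import numpy as np

logits = np.array([2.0, 1.0, -1.0])

softmax = np.exp(logits) / np.sum(np.exp(logits))
# -> approximately [0.71, 0.26, 0.04], forced to sum to 1

sigmoid = 1 / (1 + np.exp(-logits))
# -> approximately [0.88, 0.73, 0.27], each value independent of the others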

We also have to update the way cross entropy is calculated to properly train our network:

Again, simply replace softmax with sigmoid:

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits, ground_truth_input)
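
For context, retrain.py then averages this per-label loss over the batch before handing it to the optimizer; note that newer TensorFlow releases require these arguments to be passed by keyword. Roughly:

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=ground_truth_input, logits=logits)
cross_entropy_mean = tf.reduce_mean(cross_entropy)
train_step = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(
    cross_entropy_mean)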

Modifying evaluation

The method add_evaluation_step() inserts the operations we need to evaluate the accuracy of the predicted labels. Originally it looked like this:

correct_prediction = tf.equal(tf.argmax(result_tensor, 1), tf.argmax(ground_truth_tensor, 1))

Okay, what is happening here?

Both result_tensor and ground_truth_tensor can be imagined as 2D arrays:

|        | label1 | label2 | label3 |
| image1 |   0    |   1    |   0    |
| image2 |   1    |   0    |   0    |

Therefore this line:

tf.argmax(result_tensor, 1)

returns the index of the maximal value in each row; each row because of the axis = 1 parameter.

We then compare these indices, knowing that ground_truth_tensor contains exactly one 1 in each row because only one label can be correct.

To adapt this approach to our multi-label case, we simply replace argmax() with round(), which turns the predicted probabilities into 0s and 1s. Then we compare result_tensor with ground_truth_tensor, which already contains only 0s and 1s:

correct_prediction = tf.equal(tf.round(result_tensor), ground_truth_tensor)
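
On made-up numbers, the comparison now happens per label rather than per image:

import numpy as np

predictions  = np.array([[0.9, 0.4, 0.2],
                         [0.4, 0.1, 0.8]])
ground_truth = np.array([[1.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])

correct = np.round(predictions) == ground_truth
# -> [[ True, False,  True],
#     [ True,  True,  True]]
accuracy = correct.astype(np.float32).mean()  # 5/6, averaged over every label of every image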

Those are all the changes we need to make to classify images with multiple labels.

Running the retraining

Simply run this command from project root:

python retrain.py \
--bottleneck_dir=bottlenecks \
--how_many_training_steps 500 \
--model_dir=model_dir \
--output_graph=retrained_graph.pb \
--output_labels=retrained_labels.txt \
--summaries_dir=retrain_logs \
--image_dir=images

I recommend playing with the number of training steps to prevent overfitting your model.

Testing the retrained model

Run:

python label_image.py <image_name>

I slightly modified label_image.py to write the resulting class percentages into results.txt.
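
The actual script is in the repository; its core looks roughly like the following, assuming the default tensor names produced by the retraining command above (treat it as a sketch, not the exact label_image.py):

import sys
import tensorflow as tf

# Class names written out by the retraining step.
labels = [line.strip() for line in open('retrained_labels.txt')]
image_data = tf.gfile.FastGFile(sys.argv[1], 'rb').read()

# Load the retrained graph.
with tf.gfile.FastGFile('retrained_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session() as sess:
    final_tensor = sess.graph.get_tensor_by_name('final_result:0')
    predictions = sess.run(final_tensor, {'DecodeJpeg/contents:0': image_data})[0]

# Write one line per class, highest score first.
with open('results.txt', 'w') as f:
    for label, score in sorted(zip(labels, predictions), key=lambda x: -x[1]):
        f.write('%s (score = %.2f)\n' % (label, score))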

Visualizing the training

After the retraining is done you can view the logs by running:

tensorboard --logdir retrain_logs

and navigating to http://127.0.0.1:6006/ in your browser.

The end

I hope that I made all the changes and reasons behind them as clear as possible and that you learned something new today :)

If you have further questions, you can find me on LinkedIn or email me directly.
