Multi-Label Classification and Class Activation Map on Fashion-MNIST

franky
Towards Data Science
5 min read · Jul 2, 2018


Fashion-MNIST is a dataset of fashion product images for benchmarking machine learning algorithms in computer vision. It comprises 60,000 training images and 10,000 test images, each a 28×28 grayscale image drawn from 10 categories of fashion products. Figure 1 shows all the labels and some images in Fashion-MNIST.

Figure 1. Fashion-MNIST dataset. [github and arxiv]

There are many articles about Fashion-MNIST [ref]. The goal of this post, however, is to present a study of deep learning on Fashion-MNIST in the context of multi-label classification rather than multi-class classification. Furthermore, this post investigates whether we can visualize how a convolutional neural network sees an image and identify the regions that activate a particular label.

1. Visualization, Visualization, Visualization

Before jumping into all the details, let’s visualize some results we could achieve.

Figure 2. (a) A multi-label image, (b) The predicted probabilities over labels, (c) The class activation maps for the labels with higher probabilities.

Let’s break down the diagrams in Figure 2 one by one.

  1. The first diagram shows a multi-label image consisting of four fashion product images from Fashion-MNIST: a trouser, a sneaker, a bag, and a dress. The labels associated with this multi-label image are 1, 7, 8, and 3. Note that in this study we consider the presence of labels (e.g., 1/3/7/8) rather than their order (e.g., 1/7/8/3). If an image comprises only two categories of fashion products, only two labels are assigned to it.
  2. Given the multi-label image, the second diagram shows the predicted probabilities, calculated by our trained convolutional neural network, across all the labels. Note that the probabilities for Label 1, Label 3, Label 7, and Label 8 are all close to 1, which indicates that our trained convolutional neural network has done a good job of tagging the right labels on the given image (a minimal sketch of turning such probabilities into tags follows this list).
  3. The third diagram shows the heat maps associated with the top four predicted labels, i.e., the regions that activate the labels with higher probabilities. The purpose of highlighting the heat maps (a.k.a. the class activation maps) is to open the black box of our convolutional neural network and to see whether it really pinpoints the critical areas when making its predictions.
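To make the tagging step concrete, here is a minimal sketch (not the exact code from this post) of how sigmoid outputs can be turned into a set of labels with a fixed threshold; the probabilities and the 0.5 threshold below are illustrative assumptions:

```python
import numpy as np

# Hypothetical sigmoid outputs from the network for one image,
# one probability per label (10 labels in Fashion-MNIST).
probs = np.array([0.02, 0.98, 0.03, 0.97, 0.01, 0.04, 0.02, 0.99, 0.96, 0.05])

# Tag every label whose predicted probability exceeds a threshold.
# 0.5 is a common default; the post does not state which threshold it uses.
threshold = 0.5
predicted_labels = np.where(probs > threshold)[0]

print(predicted_labels)  # -> [1 3 7 8], matching the image in Figure 2
```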

Two more examples: one with 3 labels (Figure 3) and one with 2 labels (Figure 4).

Figure 3. An image with 3 labels: 2, 3, and 8.
Figure 4. An image with 2 labels: 3 and 9.

Figure 5 shows the overall training history, with 9,000 images used for training and 1,000 for validation. The accuracy on the 1,000 test images is around 95% to 96%. The program is implemented in Keras with Theano as the backend.

Figure 5. Training history on accuracy and loss.

2. Multi-Label Fashion-MNIST

Here is a brief summary of our new dataset for multi-label classification:

  • 10,000 training images and 1,000 test images, each 646 × 184 pixels;
  • each image contains four fashion product images randomly selected from Fashion-MNIST (a sketch of this composition follows the list);
  • the meta-data file keeps the ordered labels for each image, together with its one-hot encoding.
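The post does not show the generation code itself, so here is a minimal sketch of the composition idea, assuming a Keras version that ships Fashion-MNIST in keras.datasets; the simple horizontal-strip layout is my assumption (the post's actual images are 646 × 184, so it presumably adds scaling and padding):

```python
import numpy as np
from keras.datasets import fashion_mnist

(x_train, y_train), _ = fashion_mnist.load_data()

def make_multilabel_image(images, labels, n=4, num_classes=10, rng=np.random):
    """Compose one multi-label image from n randomly chosen products."""
    idx = rng.choice(len(images), size=n, replace=False)
    # Place the n tiles side by side: shape (28, 28 * n).
    canvas = np.concatenate([images[i] for i in idx], axis=1)
    # Multi-hot target: presence of labels, order ignored.
    multi_hot = np.zeros(num_classes, dtype=np.float32)
    multi_hot[labels[idx]] = 1.0
    return canvas, multi_hot

canvas, target = make_multilabel_image(x_train, y_train)
print(canvas.shape, target)  # (28, 112) and a 10-dimensional multi-hot vector
```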

Figure 6 shows some sample images from our dataset, and Figure 7 shows some sample entries from the meta-data file.

Figure 6. Sample images in Multi-Label Fashion-MNIST.
Figure 7. Sample meta-data in Multi-Label Fashion-MNIST.

3. Architecture of Neural Network

The focus of our neural network is to integrate, in a single model, the mechanisms for performing multi-label classification and for generating class activation maps. We will first walk through the basic framework for multi-label classification, then examine the enhancement needed for generating class activation maps.

First of all, let’s compare and contrast the common practices for implementing multi-class classification (where one image belongs to exactly one class) and multi-label classification (where one image can be associated with multiple labels).

  • A basic neural network architecture for image classification usually consists of a stack of convolutional layers and max pooling layers for feature extraction, followed by some fully connected layers for class/label prediction.
  • For multi-class classification, the last layer in the model uses a softmax function for class prediction, and the training process uses a categorical_crossentropy function as the loss function.
  • For multi-label classification, the last layer in the model uses a sigmoid function for label prediction, and the training process uses a binary_crossentropy function as the loss function (see the sketch after this list).
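To make the contrast concrete, here is a minimal Keras sketch of the two kinds of output heads; the 256-dimensional input feature vector and the optimizer are illustrative assumptions, not details from the post:

```python
from keras.models import Sequential
from keras.layers import Dense

# Multi-class: exactly one of 10 classes per image, so the outputs
# compete with each other through softmax.
multi_class_head = Sequential([
    Dense(10, activation='softmax', input_shape=(256,)),
])
multi_class_head.compile(optimizer='adam',
                         loss='categorical_crossentropy',
                         metrics=['accuracy'])

# Multi-label: each of the 10 labels is an independent yes/no decision,
# so each output gets its own sigmoid and binary cross-entropy term.
multi_label_head = Sequential([
    Dense(10, activation='sigmoid', input_shape=(256,)),
])
multi_label_head.compile(optimizer='adam',
                         loss='binary_crossentropy',
                         metrics=['accuracy'])
```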

The practices above follow the conventional wisdom on choosing the last-layer activation function and the loss function in deep learning, as summarized in Figure 8.

Figure 8. Summary of last-layer activation function and loss function. [Deep Learning with Python, Ch4/Ch9]

Now, let’s consider the mechanism for generating class activation maps. There are many ways to create class activation maps; our idea and code are based on global average pooling layers for object localization. There are two major points in this approach. First, the class activation map for a given class is computed as a weighted sum of the feature maps produced by the last convolutional layer. Second, a global average pooling layer converts each feature map into a single value and acts as the glue for obtaining the associated weights. Figure 9 from the original paper illustrates the complete process.

Figure 9. Class Activation Mapping. [ref]
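In the notation of the original CAM paper, the map for class c is

M_c(x, y) = Σ_k w_k^c · f_k(x, y)

where f_k(x, y) is the activation of feature map k of the last convolutional layer at spatial position (x, y), and w_k^c is the weight connecting the global-average-pooled value of feature map k to the output unit for class c.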

Figure 10 shows the code in Keras for building and training the deep neural network. Figure 11 shows the summary of the whole model.

Figure 10. The code for building and training the deep neural network.
Figure 11. The summary of the whole model.
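Since Figures 10 and 11 are images, here is a minimal sketch of an architecture in this spirit; the layer counts, filter sizes, channels-last input shape, and layer names (last_conv, predictions) are my assumptions, not the post's exact model:

```python
from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D,
                          GlobalAveragePooling2D, Dense)

# Input shape matches the 646 x 184 multi-label images (channels last).
inputs = Input(shape=(184, 646, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='last_conv')(x)

# Global average pooling reduces each feature map to a single value,
# which is what lets us read the CAM weights off the final Dense layer.
x = GlobalAveragePooling2D()(x)
outputs = Dense(10, activation='sigmoid', name='predictions')(x)

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
```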

Figure 12 shows the code for creating the class activation map.

Figure 12. The code for creating a class activation map.
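Again, the post's actual code is in Figure 12; the following is a minimal re-implementation of the weighted-sum formula above, using the hypothetical layer names from the architecture sketch:

```python
import numpy as np
from keras.models import Model

def class_activation_map(model, image, class_idx,
                         conv_name='last_conv', dense_name='predictions'):
    """Weighted sum of the last conv layer's feature maps for one class."""
    # Sub-model that exposes the last convolutional feature maps.
    conv_model = Model(model.input, model.get_layer(conv_name).output)
    feature_maps = conv_model.predict(image[np.newaxis, ...])[0]  # (h, w, k)

    # Weights from the pooled value of each feature map to the chosen class.
    class_weights = model.get_layer(dense_name).get_weights()[0][:, class_idx]

    # cam[x, y] = sum_k w_k * f_k(x, y)
    cam = np.dot(feature_maps, class_weights)  # (h, w)
    cam = np.maximum(cam, 0)                   # keep positive evidence only
    cam /= cam.max() + 1e-8                    # normalize to [0, 1] for display
    return cam
```

The resulting map can then be resized to the input image's resolution and overlaid as a heat map, as in Figures 2 to 4.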

4. Conclusions

This post has presented a study on deep learning for Fashion-MNIST in the context of multi-label classification. We have also tried to open the black box of our convolutional neural network to see the class activation maps for the predicted labels. The most interesting but challenging issue in this study is implementing and validating the class activation maps with the global average pooling function, particularly when one map doesn’t look right. Finally, let me use two figures to illustrate the directions for future work.

Ref-1. DeepFashion
Ref-2. Grad-CAM

One more thing … DO YOU MNIST?

Ref-3. DO YOU MNIST from anztee.com and teezily.com
