Review: IDW-CNN — Learning from Image Descriptions in the Wild Dataset Boosts the Accuracy (Semantic Segmentation)

Outperforms FCN, CRF-RNN and DeepLabv2

Sik-Ho Tsang
Towards Data Science

--

In this story, IDW-CNN, by Sun Yat-sen University, The Chinese University of Hong Kong, and SenseTime Group (Limited), is briefly reviewed.

  • Segmentation accuracy is increased by learning from an Image Descriptions in the Wild (IDW) dataset.
  • Unlike previous image captioning datasets, where captions were manually and densely annotated, the images and their descriptions in IDW are automatically downloaded from the Internet without any manual cleaning or refinement.

This is a 2017 CVPR paper with tens of citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Constructing an Image Description in the Wild (IDW) Dataset
  2. IDW-CNN Architecture
  3. Training Approaches
  4. Experimental Results

1. Constructing an Image Description in the Wild (IDW) Dataset

  • IDW is built with 2 stages.

1.1. First Stage

  • 21 prepositions and verbs that frequently appear, such as ‘hold’, ‘play with’, ‘hug’, ‘ride’, and ‘stand near’, and 20 object categories from VOC12, such as ‘person’, ‘cow’, ‘bike’, ‘sheep’, and ‘table’, are prepared.
  • Their combinations in terms of ‘subject + verb/prep. + object’ lead to 20×21×20 = 8400 different phrases, such as ‘person ride bike’, ‘person sit near bike’, and ‘person stand near bike’ (a short enumeration sketch follows this list).
  • Some rarely appear in practice, for example ‘cow hug sheep’.
  • Hundreds of meaningful phrases are collected after removing the meaningless ones.
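As a minimal sketch (not the authors' code) of this enumeration, the candidate phrases can be generated as below; the objects and actions lists are abbreviated placeholders for the full 20 categories and 21 verbs/prepositions.

```python
from itertools import product

# Abbreviated placeholders for the 20 VOC12 object categories and the
# 21 frequent verbs/prepositions mentioned above (full lists assumed).
objects = ['person', 'cow', 'bike', 'sheep', 'table']          # ... 20 in total
actions = ['hold', 'play with', 'hug', 'ride', 'stand near']   # ... 21 in total

# 'subject + verb/prep. + object' combinations; with the full lists this
# enumeration yields 20 x 21 x 20 = 8400 candidate phrases.
phrases = [f'{subj} {act} {obj}'
           for subj, act, obj in product(objects, actions, objects)]

print(len(phrases))    # 5 * 5 * 5 = 125 with the abbreviated lists here
print(phrases[:2])     # ['person hold person', 'person hold cow']
```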

1.2. Second Stage

  • These phrases are used as key words to search images and their surrounding texts from the Internet.
  • Invalid phrases, such as ‘person ride cow’, are discarded if the number of retrieved images is smaller than 150, to prevent rare cases or outliers, which may lead to over-fitting in training.
  • As a result, 59 valid phrases are obtained. Finally, IDW has 41,421 images and descriptions.
The number of images in IDW with respect to each object category in PASCAL VOC 2012
  • The above histogram reveals the image distribution of these objects in the real world, without any manual cleaning or refinement.

1.3. Image Description Representation

  • Each image description is automatically turned into a parse tree, from which useful objects (e.g. nouns) and actions (e.g. verbs) are selected as supervision during training.
  • Each configuration of two objects and the action between them can be considered an object interaction, which is valuable information for image segmentation but is not present in the labelmaps of VOC12.
The constituency tree generated by language parser
  • First, the Stanford Parser is used to parse image descriptions and produce constituency trees, as above. However, a constituency tree still contains irrelevant words that describe neither object categories nor interactions.
  • Then, the constituency trees are converted into semantic trees, which contain only objects and their interactions: 1) filter the leaf nodes by their part-of-speech, preserving only nouns as object candidates, and verbs and prepositions as action candidates; 2) convert the nouns to objects, using the lexical relation data in WordNet to unify synonyms; nouns that do not belong to the 20 object categories are removed from the tree; 3) map the verbs to the 21 defined actions using word2vec; 4) extract the object interactions from the semantic tree through its nodes (a toy sketch is given after the figure captions below).
  • e.g. ‘girl plays with lamb, holding lamb’ is first filtered out from the full description, and then further transformed into ‘person plays with sheep, holding sheep’.
  • After parsing all image descriptions in IDW, 62,100 object interactions are obtained in total.
The number of images with respect to the number of interactions, showing that each image has 1.5 interactions on average.

The construction of IDW has no manual intervention and has extremely low expense compared to previous datasets.

The constituency tree after POS tag filtering (Left), and object interactions (Right)
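As referenced above, here is a toy sketch of steps 2)–4): mapping parsed nouns and verbs to the 20 objects and 21 actions, and pairing them into interactions. The dictionaries and the extract_interactions helper are hypothetical stand-ins; the paper relies on the Stanford Parser, WordNet, and word2vec for these steps.

```python
# A toy sketch of steps 2)-4), assuming the description has already been
# parsed and POS-tagged. The small dictionaries are illustrative stand-ins
# for WordNet-based synonym unification and word2vec-based action mapping.
NOUN_TO_OBJECT = {'girl': 'person', 'boy': 'person', 'lamb': 'sheep',
                  'bicycle': 'bike'}                       # synonym unification
VERB_TO_ACTION = {'plays with': 'play with', 'holding': 'hold',
                  'rides': 'ride'}                         # map to the 21 actions
VOC_OBJECTS = {'person', 'cow', 'bike', 'sheep', 'table'}  # abbreviated list

def extract_interactions(tagged_phrases):
    """tagged_phrases: list of (subject_noun, verb_phrase, object_noun)."""
    interactions = []
    for subj, verb, obj in tagged_phrases:
        subj = NOUN_TO_OBJECT.get(subj, subj)
        obj = NOUN_TO_OBJECT.get(obj, obj)
        act = VERB_TO_ACTION.get(verb)
        # keep only interactions whose objects belong to the 20 VOC categories
        if subj in VOC_OBJECTS and obj in VOC_OBJECTS and act is not None:
            interactions.append((subj, act, obj))
    return interactions

print(extract_interactions([('girl', 'plays with', 'lamb'),
                            ('girl', 'holding', 'lamb')]))
# [('person', 'play with', 'sheep'), ('person', 'hold', 'sheep')]
```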

1.4. Three Test Sets

  • int-IDW: randomly choose 1,440 images from IDW as a test set of object interaction prediction.
  • seg-IDW: annotate the per-pixel labelmap for each image in int-IDW, resulting in a segmentation test set. seg-IDW is more challenging than VOC12 in terms of the object diversity in each image.
  • zero-IDW: a zero-shot test set of 1,000 images of unseen object interactions. For instance, the image of ‘person ride cow’ is a rare case (e.g. in a bullfight) and does not appear in training.

2. IDW-CNN Architecture

(a) IDW-CNN, which has two streams. (b) Each subnet has the same network structure.
  • The network can be divided into three main parts.

2.1. Feature Extraction

  • IDW-CNN employs DeepLabv2 as a building block for feature extraction.
  • IDW-CNN only inherits ResNet-101 from DeepLabv2, removing the other components such as multi-scale fusion and CRF.
  • Given an image I, ResNet-101 produces features of 2048 channels. The size of each channel is 45 × 45.
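A rough sketch of this feature extractor (assuming PyTorch/torchvision and an input resolution of 360×360, chosen so that the output matches the 2048×45×45 shape above; the paper's actual crop size may differ):

```python
import torch
import torchvision

# A dilated ResNet-101 trunk with output stride 8, as used in DeepLab-style
# models; the avgpool and fc layers are dropped since only features are needed.
backbone = torchvision.models.resnet101(
    replace_stride_with_dilation=[False, True, True])
trunk = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 360, 360)   # a dummy RGB image I (assumed resolution)
features = trunk(image)               # shape: (1, 2048, 45, 45)
print(features.shape)
```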

2.2. Seg-stream

  • The above features are fed into a convolutional layer to predict the segmentation labelmap (denoted as Is), the size of which is 21×45×45.
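Continuing the sketch, the Seg-stream head is then just a classification convolution over those features (the 1×1 kernel size here is an assumption):

```python
import torch

# Seg-stream sketch: map the 2048-channel backbone features to the 21 VOC
# classes; 'features' stands in for the backbone output from the sketch above.
features = torch.randn(1, 2048, 45, 45)
seg_head = torch.nn.Conv2d(2048, 21, kernel_size=1)   # assumed 1x1 kernel
I_s = seg_head(features)              # segmentation map Is: (1, 21, 45, 45)
print(I_s.shape)
```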

2.3. Int-stream

  • This stream has three stages.
  • In the first stage, the number of feature channels is reduced from 2048 to 512 by a convolutional layer, and the resulting feature is denoted as h, so as to decrease computation in the subsequent stages.
  • Each feature map in hm, hm_i, is obtained by performing the elementwise product (“⊗”) between h and each channel of Is, which represents a mask. Therefore, each hm_i ∈ R^(512×45×45) represents the masked features of the i-th object class.
  • In the second stage, each hm_i is utilized as input to train a corresponding object subnet, which outputs a probability characterizing whether object i is present in image I.
  • 21 object subnets are trained; they have the same network structure, but their parameters are not shared, except for the fully connected layers, which are shared. They are shown in orange at the right of the above figure.
  • Overall, the second stage determines which objects appear in I.
  • In the third stage, 22 action subnets are trained, each of which predicts the action between two present objects. They are shown in blue at the right of the above figure.
  • For instance, if both ‘person’ and ‘bike’ are present in I, the combination of their features, hm_person + hm_bike ∈ R^(512×45×45), is propagated to all action subnets.
  • The largest response is likely to be produced by one of the following action subnets: ‘ride’, ‘sit near’, or ‘stand near’.
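Below is a schematic sketch of the Int-stream's first two stages (masking and object subnets). The internal layer sizes, the shared FC width, and the use of softmax to turn Is into per-class masks are assumptions; only the 2048→512 reduction, the elementwise masking, and the 21 object subnets with shared FC layers come from the description above.

```python
import torch
import torch.nn as nn

# Stand-ins for the tensors from the earlier sketches.
features = torch.randn(1, 2048, 45, 45)     # backbone features
I_s = torch.randn(1, 21, 45, 45)            # predicted segmentation map Is

reduce_conv = nn.Conv2d(2048, 512, kernel_size=1)   # stage 1: 2048 -> 512 (h)

class ObjectSubnet(nn.Module):
    """Predicts whether object i is present, given its masked feature hm_i."""
    def __init__(self, shared_fc):
        super().__init__()
        self.conv = nn.Conv2d(512, 128, kernel_size=3, padding=1)  # not shared
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = shared_fc                  # shared across all 21 subnets
    def forward(self, hm_i):
        x = self.pool(torch.relu(self.conv(hm_i))).flatten(1)
        return torch.sigmoid(self.fc(x))     # presence probability of object i

shared_fc = nn.Linear(128, 1)
object_subnets = nn.ModuleList(ObjectSubnet(shared_fc) for _ in range(21))

h = reduce_conv(features)                    # (1, 512, 45, 45)
masks = torch.softmax(I_s, dim=1)            # assumed per-class soft masks
hm = [h * masks[:, i:i + 1] for i in range(21)]   # hm_i = h ⊗ mask_i
obj_scores = torch.cat([net(x) for net, x in zip(object_subnets, hm)], dim=1)
print(obj_scores.shape)                      # (1, 21) object probabilities
```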

2.4. Object-Pair Selection (OPS)

  • OPS merges the features of the present objects. It is shown in purple at the left of the above figure.
  • For example, if the object subnets of ‘person’, ‘bike’, and ‘car’ have high responses, each pair of features among hm_person, hm_bike, and hm_car is summed elementwise, resulting in three combined features denoted as hm_person+bike, hm_person+car, and hm_bike+car.
  • Each merged feature is then forwarded to all 22 action subnets.
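A sketch of OPS plus the action subnets, continuing with the same names as above; the 0.5 presence threshold and the action-subnet layer sizes are assumptions.

```python
from itertools import combinations
import torch
import torch.nn as nn

# Stand-ins for the masked features and object scores from the sketch above.
hm = [torch.randn(1, 512, 45, 45) for _ in range(21)]
obj_scores = torch.rand(1, 21)

# 22 action subnets, as described above (assumed internal architecture).
action_subnets = nn.ModuleList(
    nn.Sequential(nn.Conv2d(512, 128, kernel_size=3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))
    for _ in range(22))

present = [i for i in range(21) if obj_scores[0, i] > 0.5]   # assumed threshold
for i, j in combinations(present, 2):                        # each object pair
    pair_feat = hm[i] + hm[j]                                # elementwise sum
    logits = torch.cat([net(pair_feat) for net in action_subnets], dim=1)
    action_probs = torch.softmax(logits, dim=1)              # (1, 22)
    print(i, j, action_probs.argmax(dim=1).item())           # most likely action
```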

2.5. Refinement

  • The i-th object subnet produces a score (probability), and all 21 scores are concatenated as a vector.
  • It is treated as a filter to refine the segmentation map Is using convolution.
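One plausible reading of this refinement, sketched below, is a channel-wise reweighting: the 21 scores act as a depthwise 1×1 convolution over Is, so each class channel is scaled by the corresponding object probability. This interpretation is an assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Refinement sketch (assumed reading): use the 21 object scores as one 1x1
# filter per class channel of Is.
I_s = torch.randn(1, 21, 45, 45)            # predicted segmentation map
obj_scores = torch.rand(1, 21)              # concatenated object probabilities

kernel = obj_scores.view(21, 1, 1, 1)       # one 1x1 filter per class channel
I_s_refined = F.conv2d(I_s, kernel, groups=21)
print(I_s_refined.shape)                    # (1, 21, 45, 45)
```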

3. Training Approaches

  • Each image in IDW contains object interactions but no labelmap.
  • Each image in VOC12 has a labelmap but no interactions.
  • IDW-CNN estimates a pseudo label for each sample and treats it as ground truth in back-propagation (BP).
  • For the Seg-stream, a latent labelmap Îs_idw is estimated as the ‘pseudo ground truth’ by combining the predicted segmentation map, Is_idw, and the predicted object labels, lo_idw.
  • For the Int-stream, a prior distribution is obtained with respect to the actions between each pair of objects. For ‘bike’ and ‘person’, this prior produces high probabilities over the above actions and low probabilities over the others. In the training stage, the loss function gives a low penalty if the predicted action is among the above, and a high penalty otherwise.
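A simplified sketch of this prior idea is given below: the pairwise prior turns the action target into a soft distribution, so predictions concentrated on plausible actions incur a low loss. The specific prior values, the hypothetical action indices, and the KL-divergence form are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

# Sketch of a prior-weighted penalty for the action prediction of one object
# pair (e.g. 'person' and 'bike') on an image without interaction labels.
num_actions = 22
prior = torch.full((num_actions,), 0.01)     # low probability for most actions
for likely in (3, 17, 20):                   # hypothetical indices for 'ride',
    prior[likely] = 0.3                      # 'sit near', 'stand near'
prior = prior / prior.sum()                  # normalise to a distribution

action_logits = torch.randn(1, num_actions)  # predicted by the action subnets
log_pred = F.log_softmax(action_logits, dim=1)

# Low penalty when the prediction mass sits on the plausible actions,
# high penalty otherwise.
loss = F.kl_div(log_pred, prior.unsqueeze(0), reduction='batchmean')
print(loss.item())
```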

4. Experimental Results

VOC12 test set
  • ResNet-101: 74.2% mIoU.
  • IDW-CNN(10k): 10k IDW images for training, 81.8% mIoU.
  • IDW-CNN(20k): 20k IDW images for training, 85.2% mIoU.
  • IDW-CNN(40k): 40k IDW images for training, 86.3% mIoU. It outperforms SOTA approaches such as FCN, CRF-RNN and DeepLabv2.
seg-IDW dataset
  • Similarly, on the seg-IDW dataset, IDW-CNN(40k) has the best performance.
Recall of object interaction prediction
  • Recall-n (n = 5, 10) measures the probability that the true interaction is among the top 5 or 10 predicted interactions (a minimal computation is sketched after this list).
  • IDW-CNN outperforms the others by 3% at Recall-5.
Recall on zero-IDW
  • The full IDW-CNN model still achieves the highest recall.
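As referenced above, a minimal sketch of the Recall-n computation (toy scores; the 59 columns stand for the valid interaction phrases from Section 1.2):

```python
import numpy as np

# Recall-n: the fraction of test images whose ground-truth interaction appears
# among the top-n scored interactions.
def recall_at_n(scores, true_idx, n):
    """scores: (num_images, num_interactions); true_idx: (num_images,)."""
    top_n = np.argsort(-scores, axis=1)[:, :n]          # highest scores first
    hits = [t in row for t, row in zip(true_idx, top_n)]
    return float(np.mean(hits))

scores = np.random.rand(4, 59)               # toy predictions for 4 images
true_idx = np.array([3, 10, 27, 58])         # toy ground-truth interactions
print(recall_at_n(scores, true_idx, 5), recall_at_n(scores, true_idx, 10))
```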
Some Visualizations

Reference

[2017 CVPR] [IDW-CNN]
Learning Object Interactions and Descriptions for Semantic Image Segmentation

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [LC] [FC-DenseNet] [IDW-CNN]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]
