Review: SegNet (Semantic Segmentation)

Encoder Decoder Architecture, Using Max Pooling Indices to Upsample, Outperforms FCN, DeepLabv1, DeconvNet

Sik-Ho Tsang
Towards Data Science

--

SegNet by Authors (https://www.youtube.com/watch?v=CxanE_W46ts)

In this story, SegNet, by the University of Cambridge, is briefly reviewed. It was originally submitted to 2015 CVPR but in the end was not published there (though the 2015 arXiv tech report version still received over 100 citations). Instead, it was published in 2017 TPAMI, with more than 1800 citations. The first author has since become the Director of Deep Learning and AI at Magic Leap Inc. (Sik-Ho Tsang @ Medium)


There is also an interesting online demo where we can choose a random image, or even upload our own image, to try SegNet. I tried it as below:

The segmentation result for a road scene image that I found on the internet

Outline

  1. Encoder Decoder Architecture
  2. Differences from DeconvNet and U-Net
  3. Results

1. Encoder Decoder Architecture

SegNet: Encoder Decoder Architecture
  • SegNet has an encoder network and a corresponding decoder network, followed by a final pixelwise classification layer.

1.1. Encoder

  • At the encoder, convolutions and max pooling are performed.
  • There are 13 convolutional layers from VGG-16. (The original fully connected layers are discarded.)
  • While doing 2×2 max pooling, the corresponding max pooling indices (locations) are stored.
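The index-storing step can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code; in PyTorch the equivalent is `nn.MaxPool2d(2, return_indices=True)`):

```python
import numpy as np

def max_pool_2x2_with_indices(x):
    """2x2 max pooling (stride 2) over a single feature map that also
    records the flat index of each maximum, as SegNet's encoder does."""
    h, w = x.shape
    pooled = np.empty((h // 2, w // 2), dtype=x.dtype)
    indices = np.empty((h // 2, w // 2), dtype=np.int64)
    for i in range(h // 2):
        for j in range(w // 2):
            window = x[2*i:2*i+2, 2*j:2*j+2]
            k = int(np.argmax(window))           # 0..3, position inside the window
            r, c = 2*i + k // 2, 2*j + k % 2     # position in the full feature map
            pooled[i, j] = window.flat[k]
            indices[i, j] = r * w + c            # flat index, stored for the decoder
    return pooled, indices

x = np.array([[1., 2., 0., 3.],
              [4., 0., 1., 0.],
              [0., 0., 2., 1.],
              [5., 6., 0., 0.]])
pooled, idx = max_pool_2x2_with_indices(x)
# pooled = [[4., 3.], [6., 2.]], idx = [[4, 3], [13, 10]]
```

Only these small integer indices, not the feature values themselves, need to be kept for the decoder.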

1.2. Decoder

Upsampling Using Max-Pooling Indices
  • At the decoder, upsampling and convolutions are performed.
  • During upsampling, the max pooling indices stored at the corresponding encoder layer are recalled to place the values, as shown above.
  • Finally, a K-class softmax classifier is used to predict the class for each pixel.
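The unpooling step itself can be sketched as follows (a minimal NumPy sketch, not the authors' code; the flat indices are assumed to have been recorded during encoder max pooling, and the sparse result is then densified by the decoder's convolutions):

```python
import numpy as np

def max_unpool_2x2(pooled, indices, out_shape):
    """SegNet-style upsampling: place each pooled value back at the flat
    index where the encoder found its maximum; every other position is
    zero, and subsequent convolutions produce dense feature maps."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

pooled = np.array([[4., 3.],
                   [6., 2.]])
indices = np.array([[4, 3],
                    [13, 10]])   # flat indices recorded during 2x2 max pooling
up = max_unpool_2x2(pooled, indices, (4, 4))
# up = [[0., 0., 0., 3.],
#       [4., 0., 0., 0.],
#       [0., 0., 2., 0.],
#       [0., 6., 0., 0.]]
```

In PyTorch, `nn.MaxUnpool2d` provides the same operation as a built-in layer.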

2. Differences from DeconvNet and U-Net

DeconvNet and U-Net have structures similar to that of SegNet.

2.1. Differences from DeconvNet

  • DeconvNet uses a similar upsampling approach, called unpooling.
  • However, DeconvNet has fully-connected layers, which make the model much larger.

2.2. Differences from U-Net

  • U-Net is used for biomedical image segmentation.
  • Instead of reusing pooling indices, the entire encoder feature maps are transferred to the decoder and concatenated before convolution.
  • This makes the model larger and requires more memory.
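Back-of-envelope arithmetic makes the memory difference concrete. The feature-map size below is an illustrative assumption; the 2-bits-per-2×2-window figure for storing an index is quoted from the SegNet paper:

```python
# Assumed encoder feature map after one pooling stage (illustrative only):
h, w, c = 180, 240, 64

# U-Net-style skip connection: the full float32 feature map is transferred.
feature_bytes = h * w * c * 4

# SegNet: only the argmax position of each 2x2 window is kept,
# which needs just 2 bits per window per channel.
index_bits = (h // 2) * (w // 2) * c * 2
index_bytes = index_bits // 8

ratio = feature_bytes // index_bytes   # 64x less memory per skip connection
```

Under these assumptions, storing indices costs about 64× less memory than transferring the full feature maps, which is why SegNet is far more memory-efficient at inference.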

3. Results

  • Two datasets are tried: the CamVid dataset for Road Scene Segmentation, and the SUN RGB-D dataset for Indoor Scene Segmentation.

3.1. CamVid dataset for Road Scene Segmentation

Compared With Conventional Approaches on CamVid dataset for Road Scene Segmentation
  • As shown above, SegNet obtains very good results for many classes. It also achieves the highest class average and global average accuracies.
Compared With Deep Learning Approaches on CamVid dataset for Road Scene Segmentation
  • SegNet obtains the highest global average accuracy (G), class average accuracy (C), mIoU, and boundary F1-measure (BF). It outperforms FCN, DeepLabv1 and DeconvNet.
Qualitative Results

3.2. SUN RGB-D Dataset for Indoor Scene Segmentation

  • Only RGB is used; the depth (D) information is not used.
Compared With Deep Learning Approaches on SUN RGB-D Dataset for Indoor Scene Segmentation
Class Average Accuracy for Different Classes
  • Higher accuracy for large-size classes.
  • Lower accuracy for small-size classes.
Qualitative Results

3.3. Memory and Inference Time

Memory and Inference Time
  • SegNet is slower than FCN and DeepLabv1 because it contains a decoder network, but faster than DeconvNet because it does not have fully-connected layers.
  • SegNet has a low memory requirement during both training and testing, and its model size is much smaller than those of FCN and DeconvNet.
