Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+

Ethan Yanjia Li
Towards Data Science
Jul 4, 2020 · 15 min read

Image from Thalles’ blog

Foreword

Semantic segmentation is a computer vision task that segments a target object or area from an image. It is very useful in many industries such as filming and augmented reality. In the deep learning era, the Convolutional Neural Network has become the most effective method for semantic segmentation. Instead of trying to understand the boundary of an object through visual signals like contrast and sharpness, a deep convolutional neural network converts this task into a classification problem: if we know the class of every pixel in the image, we get the boundary of objects for free. Under this assumption, we can simply modify the final layers of a typical CNN classification network so that it outputs H*W values representing the class of each pixel, instead of a single value representing the class of the whole picture. The output has a class id at each pixel location and is usually encoded as a PNG-like mask. This idea comes from the paper Fully Convolutional Networks for Semantic Segmentation (FCN), and soon every researcher started to follow suit.
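To make this concrete, here is a minimal PyTorch sketch of that idea (my own illustration, not code from any of the papers): a toy convolutional backbone followed by a 1x1 convolution that produces a class score at every spatial location, then upsampled back to the input size.

```python
import torch
import torch.nn as nn

# Minimal sketch of the FCN idea: a convolutional feature extractor, then a
# class score at every spatial location via a 1x1 convolution, upsampled back
# to the input resolution. Layer sizes are illustrative, not from any paper.
num_classes = 21

backbone = nn.Sequential(            # stand-in for a real backbone (e.g. VGG)
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
)
pixel_classifier = nn.Conv2d(128, num_classes, kernel_size=1)  # per-pixel logits

x = torch.randn(1, 3, 224, 224)
features = backbone(x)                       # (1, 128, 56, 56)
logits = pixel_classifier(features)          # (1, 21, 56, 56)
logits = nn.functional.interpolate(          # back to H*W: one score map per class
    logits, size=x.shape[2:], mode="bilinear", align_corners=False)
mask = logits.argmax(dim=1)                  # (1, 224, 224): a class id per pixel
```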

The DeepLab series is one of the followers of this FCN idea. From 2015 to 2018, it published four iterations called V1, V2, V3, and V3+. DeepLab V1 sets the foundation of the series, and V2, V3, and V3+ each bring some improvement over the previous version. These four iterations borrowed innovations from recent image classification work to improve semantic segmentation and also inspired lots of other research in this area. Going into 2020, since there are no more updates to the DeepLab series, it's a good time to summarize this work, which witnessed the progression of deep convolutional neural networks in semantic segmentation.

DeepLab V1

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Multi-scale Feature Aggregation

As mentioned before, semantic segmentation networks evolve from image classification networks, so there is usually a backbone network that has proven effective for image classification. Back in 2014, the VGG network surpassed AlexNet by going deeper with stacked 3x3 convolution kernels and was the state-of-the-art solution for image classification.

DeepLabV1 and FCN both use VGG-16 as the backbone network to extract features before some fine-grained classification over pixels. Unfortunately, the author of DeepLabV1 didn't make it very clear in the paper what the network looks like, so we have to read the source code to get the full picture. To save you some time reading verbose Caffe prototxt, I've made a diagram below to visualize the architecture.

DeepLabV1 diagram by myself

DeepLabV1 proposed three variants in the paper: DeepLab, DeepLab-Msc, and DeepLab-Msc-LargeFOV, because the author wants to show the ablation of each component used in the DeepLab network. The diagram above shows the DeepLab-Msc variant. Although the author claims that simplicity is one of the virtues of this network, the best DeepLab-Msc (multi-scale) version is far more complicated than FCN. As we can see from the diagram, the green blocks on the left side are convolutional blocks from the VGG backbone, the yellow blocks in the middle are DeepLab pixel-wise classification layers, and the red blocks on the right side are post-processing of the output. Because semantic segmentation needs very fine-grained details and high-level global features at the same time, merging features from multiple scales is a very common way to combat coarse classification predictions. DeepLab aggregates features from several intermediate convolutional layers and from the input, then interpolates the element-wise sum back to the original resolution as the output mask. Usually, the deeper a convolutional layer is, the lower the resolution (and the larger the scale) of the features it produces, so DeepLab is essentially merging features of multiple scales.

A detailed view of classification head (yellow blocks from the diagram before). Diagram by myself.

Let's take a closer look at those classification heads that process multi-scale features. The diagram above shows the internals of one classification block. The design is similar to FCN: a 3x3 kernel condenses features first, and then two 1x1 kernels predict the final classes. Dropout is also introduced to combat overfitting.
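Here is a rough PyTorch sketch of such a head; the channel counts and dropout rate are illustrative assumptions rather than the exact values from the released prototxt.

```python
import torch.nn as nn

# Rough sketch of one DeepLabV1-style classification head: a 3x3 conv to
# condense features, then two 1x1 convs to predict per-pixel class scores,
# with dropout in between. Channel sizes here are illustrative.
def classification_head(in_channels, num_classes, hidden=1024, dilation=1):
    return nn.Sequential(
        nn.Conv2d(in_channels, hidden, kernel_size=3,
                  padding=dilation, dilation=dilation),  # condense features
        nn.ReLU(inplace=True),
        nn.Dropout2d(p=0.5),
        nn.Conv2d(hidden, hidden, kernel_size=1),        # 1x1 "FC" layer
        nn.ReLU(inplace=True),
        nn.Dropout2d(p=0.5),
        nn.Conv2d(hidden, num_classes, kernel_size=1),   # per-pixel logits
    )

head = classification_head(in_channels=512, num_classes=21)
```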

However, using multi-scale features isn't really a big innovation over FCN, and DeepLab needs something else to stand out. Next, let's talk about another "issue" of FCN that DeepLab tried to fix.

Larger Feature Map

Stride 1 for pooling. Diagram by myself.

In the diagram above, FCN (left side) converts the last few layers of VGG into a classification head for segmentation. So instead of outputting 1 value with 4096 channels in the final fully-connected layer, FCN outputs a 7x7 mask with 4096 channels. However, the author of DeepLab (right side) thinks that downsampling an original 224x224 image to 7x7 features is too aggressive and could lead to poor segmentation quality. Therefore, DeepLab sets the stride of the max-pooling layers in the last VGG blocks to 1, so that the feature map stays at 28x28 even in the last few layers.
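In code, the change is tiny. A sketch of the two pooling configurations (my own illustration; the actual prototxt may use slightly different kernel and padding values):

```python
import torch.nn as nn

# The VGG default halves the feature map; the DeepLab variant keeps a pooling
# window but uses stride 1 (plus padding), so spatial resolution is preserved.
pool_fcn     = nn.MaxPool2d(kernel_size=2, stride=2)             # halves H and W
pool_deeplab = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # keeps H and W
```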

This change ensures that we have larger-scale features to predict on, but it also introduces another problem: the computation is now much more expensive, and we also lose some of the global signal that a 7x7 feature map would have distilled. One mitigation the author tried is reducing the number of channels in the last FC layers. Another one, which we will focus on, is the true innovation of DeepLab: atrous convolution.

Atrous Convolution

From “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”

The idea of atrous (hole) convolution is simple but effective. A regular convolution kernel combines values from a few neighboring pixels to calculate a new value, like (a) in the figure from the paper. The downside of such a kernel in semantic segmentation is that feature extraction considers mostly local relationships rather than more global ones. To improve this, DeepLab V1 borrows an idea from a signal processing algorithm and introduces a stride inside the kernel window, like (b). In this atrous version, we put holes between the pixels we sample, so the same kernel size covers a wider range of the input. For example, when the atrous rate (dilation rate) is 2 and the kernel size is 3x3, rather than taking pixels from a 3x3 area, we take 9 pixels from a 5x5 area by skipping the pixels in between.

With atrous convolution, covering a bigger area isn't as expensive as before, because the same 3x3 kernel now reaches a wider field of view. Also, an atrous convolution over a 28x28 feature map can provide a global signal similar to a regular convolution over a 7x7 feature map. Moreover, if we increase the atrous rate, we keep the computation of a 3x3 kernel but achieve a much larger field-of-view (FOV), which the paper also shows to be useful.
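A quick PyTorch sketch to illustrate: the dilated kernel has exactly the same number of parameters as the regular one and preserves the 28x28 resolution; it just samples a wider neighborhood.

```python
import torch
import torch.nn as nn

# Atrous (dilated) convolution: same 3x3 kernel (same number of weights), but
# with dilation=2 each output value is computed from a 5x5 neighborhood,
# skipping every other pixel.
regular = nn.Conv2d(256, 256, kernel_size=3, padding=1, dilation=1)
atrous  = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 256, 28, 28)
print(regular(x).shape, atrous(x).shape)   # both (1, 256, 28, 28): resolution kept
print(sum(p.numel() for p in regular.parameters()) ==
      sum(p.numel() for p in atrous.parameters()))   # True: identical cost
```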

Fully-connected CRF

Finally, DeepLab V1 also uses a module called fully-connected Conditional Random Field (CRF) to further polish the segmentation mask. A CRF is a probabilistic model that predicts a pixel's label conditioned on the pixels around it. Simply put, we learn a probability distribution that gives the probability of the output labels given the input pixel values and their relationships. A fully-connected CRF uses all pixels of an image as conditional inputs.
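For reference, the energy that the fully-connected CRF minimizes looks roughly like this (paraphrased from the DeepLab papers): a unary term from the CNN's per-pixel prediction, plus a pairwise term over all pixel pairs that compares their positions p and colors I and only penalizes pairs with different labels.

```latex
% Energy of the fully-connected CRF (roughly as formulated in the DeepLab papers).
% \mu(x_i, x_j) = 1 if x_i \neq x_j, else 0; p = pixel position, I = pixel color.
E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),
\qquad \theta_i(x_i) = -\log P(x_i)

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[
  w_1 \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                  -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2}\right)
+ w_2 \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2}\right)\right]
```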

From “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”

This is especially useful to polish the boundary and the inner body of an object mask based on pixel correlation. For example, if two pixels are very different in color (blue for the sky and black for the plane in the diagram below), it's more likely that they belong to two different categories. After a few rounds of iterations comparing output and ground truth, the learned probability distribution can correctly reflect this.

From “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”

This CRF isn’t used any more in recent segmentation networks because 1) our segmentation network becomes much better nowadays 2) CRF itself is not end-to-end trainable with the network and runs very slow.

DeepLab V2

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Now that you understand how DeepLabV1 works, it's very easy to grasp the idea of DeepLabV2. All the core ideas, such as atrous convolution and the CRF, remain the same, so I will only talk about the incremental changes.

ASPP

Atrous Spatial Pyramid Pooling (ASPP) is a new module introduced in DeepLabV2, which borrows the idea from “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”. The spatial pyramid is a classic computer vision technique from well before the deep learning era and an important way to achieve scale invariance in algorithms such as SIFT. The SPP network introduced the spatial pyramid into convolutional networks, and DeepLabV2 created an atrous version of that SPP module.

From “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”

With an SPP module, the network can encode features from different scales into a fixed-size feature vector. This is somewhat like a bag-of-words approach, but instead of bagging features, you assign elements of a new feature vector to different scales. For example, you could have 1 feature value corresponding to the global scale, 2 feature values corresponding to a large scale, and 3 feature values corresponding to a small scale; you would then get a feature vector with a fixed size of 6 values. In the real ASPP module, there are four different scales defined by atrous rates from 6 to 24.

DeepLab V2 ASPP. Diagram by myself

In FCN and DeepLabV1, as illustrated in the diagram above, the input is fixed to 224x224 in order to match the exact number of connections (28x28x1024) of the FC layer. With this SPP module, the network can handle inputs of various sizes without changing the FC layers. And since the module uses atrous convolution, it also inherits the advantage of a large field of view at small computation cost.
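Here is a rough PyTorch sketch of a V2-style ASPP head (my own reconstruction; the channel counts are assumptions): four parallel atrous branches whose per-class outputs are fused by element-wise sum.

```python
import torch
import torch.nn as nn

# Sketch of a DeepLabV2-style ASPP: four parallel 3x3 atrous convolutions with
# rates 6, 12, 18, 24, each followed by 1x1 layers, and the per-branch class
# logits fused by element-wise sum.
class ASPPv2(nn.Module):
    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, 1024, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True),
                nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True),
                nn.Conv2d(1024, num_classes, 1),
            )
            for r in rates
        ])

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

aspp = ASPPv2(in_channels=2048, num_classes=21)
out = aspp(torch.randn(1, 2048, 28, 28))   # (1, 21, 28, 28)
```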

New Multi-scale Structure and Backbone

Another difference between V1 and V2 is the new multi-scale structure. Instead of classifying on features computed at different depths of the network, DeepLabV2 runs the network three times in parallel over the image downscaled by factors of 1.0, 0.75, and 0.5 to achieve multi-scale feature fusion.

DeepLab V2 Multi-scale. Diagram by myself.

It’s hard to say if this design is wise, but it indeed makes the inference and training really slow by running the same computation three times. But no doubt, this is very useful for the author to improve metrics on the PASCAL dataset.

To further improve the metrics and the convergence speed, DeepLabV2 also starts to use ResNet as the backbone network, with the ASPP module connected right after ResNet's Conv5 block. Note that the original ResNet blocks use stride 2 to downsample the features, but DeepLabV2 sets the strides of the Conv4 and Conv5 blocks to 1, for the same coarse-feature concern we discussed in the DeepLabV1 chapter. To increase the field-of-view, DeepLabV2 also replaces the convolutions in Conv4 with atrous convolutions of rate 2 and those in Conv5 with atrous convolutions of rate 4. In practice, I found that only setting Conv5 to stride 1 with atrous rate 2 is more effective than the original proposal.
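If you want to reproduce this kind of atrous backbone today, a convenient shortcut (not the original Caffe/TensorFlow implementation) is torchvision's replace_stride_with_dilation flag, which performs exactly this stride-to-dilation swap:

```python
import torch
from torchvision.models import resnet101

# Dilate the last two stages (Conv4 and Conv5 in the text): their strides become
# 1 and their convolutions are dilated instead, giving an output stride of 8.
backbone = resnet101(weights=None,
                     replace_stride_with_dilation=[False, True, True])

x = torch.randn(1, 3, 224, 224)
features = torch.nn.Sequential(*list(backbone.children())[:-2])(x)  # drop pool/fc
print(features.shape)   # (1, 2048, 28, 28): 8x reduction instead of 32x
```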

DeepLab V3

Rethinking Atrous Convolution for Semantic Image Segmentation

Upgraded ASPP

The use of atrous convolution and spatial pyramid pooling brought great success to DeepLabV1 and V2, so the author kept exploring in this direction and made V3 of the DeepLab series, with a focus on the ASPP module.

From “Rethinking Atrous Convolution for Semantic Image Segmentation”

DeepLabV3 still uses the ResNet backbone, as illustrated in the diagram above. In V2, ASPP merges features from 4 different scales into one fixed-size feature. However, the nature of atrous convolution makes it hard for the network to pick up both small local features, like tiny edges, and very large global features that take all pixels into consideration.

To fuse more information, DeepLabV3 redesigned the ASPP module to include a separate global image pooling branch for global features, and to concatenate the resulting features with a 1x1 convolution over the ASPP input for fine-grained details. Another change to the ASPP module is the use of Batch Normalization after each convolution and before the ReLU.

DeepLabV3 ASPP. Diagram by myself.

Unlike the diagram in the paper, the official implementation also adds a 50% dropout to the ASPP output, and an additional 3x3 convolution with 10% dropout before predicting the final classes.
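Putting the pieces together, here is a rough PyTorch sketch of a V3-style ASPP (my own reconstruction; the rates, channel counts, and dropout follow the description above but are still approximate):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a DeepLabV3-style ASPP: a 1x1 conv branch, three 3x3 atrous
# branches, and a global-image-pooling branch, all concatenated and projected
# by a 1x1 conv. Every conv is followed by BatchNorm and ReLU.
class ASPPv3(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(k, r=1):
            pad = 0 if k == 1 else r
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [conv_bn_relu(1)] + [conv_bn_relu(3, r) for r in rates])
        self.image_pool = nn.Sequential(                 # global pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Dropout2d(0.5))   # dropout on the ASPP output, as noted above

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

aspp = ASPPv3(2048).eval()          # eval mode so BatchNorm works on one example
with torch.no_grad():
    out = aspp(torch.randn(1, 2048, 28, 28))   # (1, 256, 28, 28)
```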

Multi-grid, Multi-scale and Output Strides

There’re also some small changes to the atrous ResNet backbone network as well. Previously, we have the same atrous rate for convolution layers within the same part of the network, such as Conv4 and Conv5. DeepLabV3 introduced a new hyper-parameter called Multi-grid (MG) to adjust the atrous rate. For example, a multi-grid value of {1, 2, 4} means we will multiple the atrous rate of three convolution layers in the same bottleneck block by 1, 2, and 4 respectively. If the base atrous rate is 2, this will lead to 2, 4, and 8 atrous rates for each layer. According to the experiments, this adjustment can better fuse features from a residual block perspective.

The way DeepLabV3 uses multi-scale is also different from V2. The author realized how inefficient DeepLabV2's training was and moved multi-scale entirely to inference time. When you train the network, it only uses the original image scale and a single branch of computation instead of 3. At inference time, the network runs 6 times over input images scaled by {0.5, 0.75, 1.0, 1.25, 1.5, 1.75} and averages the 6 outputs (MS). To get even better metrics on the PASCAL dataset, the author also flips the query image and runs inference on these 6 scales again. This is the part many people dislike: the author is clearly just trying to get a higher benchmark score rather than proposing a universal solution, which makes the whole paper less convincing.
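A sketch of this test-time scheme in PyTorch (my own illustration; whether to average raw logits or softmax probabilities is an implementation detail I am glossing over here):

```python
import torch
import torch.nn.functional as F

# Multi-scale + flip inference: run the model on rescaled (and horizontally
# flipped) copies of the image, resize all predictions back to the original
# resolution, and average them. `model` returns (N, num_classes, h, w) logits.
def multi_scale_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    n, c, h, w = image.shape
    summed = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])   # un-flip the prediction
            summed = summed + F.interpolate(logits, size=(h, w), mode="bilinear",
                                            align_corners=False)
    return summed / (2 * len(scales))   # averaged prediction; argmax(dim=1) -> mask
```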

With unlimited computing power from Google, the author also experiments with different output strides (OS) of the network. OS = 8 means the network reduces resolution by at most 8 times, which translates to a 28x28 feature map when the input is 224x224. The key to controlling OS is the downsampling stride in the residual blocks. When the strides of Conv4 and Conv5 are set to 1, the OS is 8; if we keep stride 2 in Conv4 and only set Conv5 to stride 1, we get OS = 16 and a 14x14 feature map. The experiments show that the lower the OS, the better the result, but also the more computation is needed.

No More CRF

With the various tricks used in DeepLabV3 to improve benchmark metrics, the network reached a point where the additional CRF step can't improve the result anymore, so the CRF is removed in V3 and the following version. This result is understandable: with aggressive multi-scale and flipping at inference time, plus the large feature map from a small output stride, the occasional uncertainty in the network's predictions is mostly eliminated in the ensembled results.

DeepLab V3+

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

In 2018, the DeepLab team announced its final version, DeepLabV3+, as a minor improvement over V3.

New Backbone Network

The most impactful change from DeepLabV3 is the use of a new backbone network, Xception. By using depthwise separable convolutions, Xception achieves much better results with less computation than ResNet. It also achieved state-of-the-art results on benchmarks like ImageNet.

From “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”

DeepLabV3+ borrows a modified version of Xception from MSRA's Deformable Convolutional Networks work and also creates an atrous version of the depthwise separable convolution to retain feature resolution and field-of-view, just as it did with ResNet. As shown in diagram (c) above, an atrous separable convolution simply changes the depthwise conv part into a dilated version with holes in between. The new backbone is indeed very powerful: it drives up the metrics on PASCAL by 2% compared with a ResNet-101 backbone.
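A small PyTorch sketch of an atrous depthwise separable convolution (my own illustration; BatchNorm/ReLU placement varies between implementations):

```python
import torch.nn as nn

# Atrous depthwise separable convolution: a per-channel (depthwise) 3x3 conv
# that can be dilated, followed by a 1x1 pointwise conv that mixes channels.
def atrous_separable_conv(in_ch, out_ch, rate=2):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=rate, dilation=rate,
                  groups=in_ch, bias=False),   # depthwise, with "holes"
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```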

Encoder-Decoder Architecture

Encoder-decoder is a very common design in semantic segmentation. FCN, U-Net, and E-Net all adopt this idea: first encode the image into small feature maps using a backbone network, then learn a decoder network to upsample the features, typically with transposed convolution (ConvTranspose) layers. A transposed convolution is roughly the reverse of a convolution: instead of condensing features from multiple locations into one, it spreads the value from one location out over a wider area. The goal of the decoder is to learn a transformation that recovers information lost in the encoder stage. In DeepLabV1 and V2, however, the author argued against using a decoder, because atrous convolution over higher-resolution features can achieve a similar up-sampling effect without losing too much information. Even though the DeepLab series avoided transposed convolutions for decoding, other architectures with this decoder design were widely adopted because of their simplicity. So in V3+, DeepLab finally incorporates a decoder to further improve the metrics on common benchmark datasets.

From “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”

In the diagram above, the encoder is the original DeepLabV3 network, consisting of a backbone DCNN and an ASPP module. Taking the ASPP output together with an intermediate feature map from the backbone, the decoder uses convolutions and bilinear upsampling to gradually recover the output resolution. Please note that the official TensorFlow implementation is slightly different from the paper; for example, there are two 3x3 conv layers and one 1x1 conv layer before the final 4x upsampling.
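Here is a rough PyTorch sketch of such a decoder (my own reconstruction from the description above; the 48/256 channel counts follow the paper, the rest is approximate):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# DeepLabV3+-style decoder: reduce the low-level backbone feature with a 1x1
# conv, upsample the ASPP output to the same size, concatenate, refine with
# 3x3 convs, then bilinearly upsample to the input resolution.
class DecoderV3Plus(nn.Module):
    def __init__(self, low_level_ch, num_classes):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(low_level_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.refine = nn.Sequential(
            nn.Conv2d(256 + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, aspp_out, low_level_feat, input_size):
        low = self.reduce(low_level_feat)
        aspp_up = F.interpolate(aspp_out, size=low.shape[2:], mode="bilinear",
                                align_corners=False)
        logits = self.refine(torch.cat([aspp_up, low], dim=1))
        return F.interpolate(logits, size=input_size, mode="bilinear",
                             align_corners=False)   # final 4x upsampling

dec = DecoderV3Plus(low_level_ch=256, num_classes=21).eval()
out = dec(torch.randn(1, 256, 14, 14),      # ASPP output at OS = 16
          torch.randn(1, 256, 56, 56),      # low-level feature at OS = 4
          input_size=(224, 224))            # -> (1, 21, 224, 224)
```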

Ironically, the image-level feature (global pooling) fusion introduced into the ASPP module in the V3 paper actually degrades performance on the Cityscapes dataset when used together with the decoder. This probably means the feature fusion used in DeepLab is reaching saturation and none of these components is truly essential; the network architecture may already be overfitted to the PASCAL dataset.

Other Details

So far, we've discussed the evolution of the DeepLab network over time. In this section, I want to briefly cover some training details beyond the network architecture.

Loss Function

In DeepLabV1, the loss function is a simple cross-entropy loss over classes. In V2, since three branches work on different scales of the input, the loss becomes the sum of three cross-entropy losses. In V3 and V3+, it reverts to a single cross-entropy loss because multi-scale is used at inference time instead of during training.
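A sketch of the V2-style multi-branch loss (my own illustration; the ignore index for unlabeled pixels is the usual PASCAL convention, not something spelled out here):

```python
import torch.nn.functional as F

# One cross-entropy term per scale branch, summed (V1, V3, and V3+ use a single
# cross-entropy term instead).
def multi_scale_loss(logits_per_scale, target):
    # logits_per_scale: list of (N, num_classes, H, W) tensors, already resized
    # to the target resolution; target: (N, H, W) with class ids.
    return sum(F.cross_entropy(logits, target, ignore_index=255)
               for logits in logits_per_scale)
```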

Learning Rate Policy

In DeepLabV1, the learning rate policy is a simple step decay starting from 1e-3. In DeepLabV2, the author found that a new "poly" decay policy works better: the base learning rate is scaled by (1 - iteration/max_iteration)^power, where power is 0.9. This gradually decaying learning rate helps the network converge to a good solution more steadily, and the policy is also used in V3 and V3+.
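In code, the poly policy is a one-liner (a sketch; the commented PyTorch scheduler line assumes a hypothetical optimizer and max_iteration variable):

```python
# "Poly" learning rate policy: the base learning rate decays by
# (1 - iter/max_iter)^power, with power = 0.9.
def poly_lr(base_lr, iteration, max_iteration, power=0.9):
    return base_lr * (1 - iteration / max_iteration) ** power

# Equivalent PyTorch scheduler (hypothetical optimizer / max_iteration):
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda it: (1 - it / max_iteration) ** 0.9)
```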

Upsampling

To calculate the loss, DeepLabV1 and V2 downsize the ground truth to the same size as the network output. This contrasts with FCN, where the result is upsampled to match the original input resolution. Since V3, the author found that downsizing the ground truth prevents some fine details from contributing to back-propagation, so the ground truth is no longer downsized during training; the network output is upsampled instead.
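A sketch of the two options (my own illustration): downsizing the ground truth with nearest-neighbor interpolation so class ids are not blended, versus upsampling the logits.

```python
import torch.nn.functional as F

# V1/V2 style: shrink the (N, H, W) integer mask to the logits' resolution.
def downsize_target(target, logits):
    return F.interpolate(target.unsqueeze(1).float(), size=logits.shape[2:],
                         mode="nearest").squeeze(1).long()

# V3/V3+ style: enlarge the logits to the ground truth's resolution instead.
def upsample_logits(logits, target):
    return F.interpolate(logits, size=target.shape[-2:], mode="bilinear",
                         align_corners=False)
```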

Conclusion

The DeepLab series is one of the most important families of semantic segmentation networks so far. It started with atrous convolution and gradually added more components from other networks, such as SPP and a decoder, to make itself even stronger. DeepLab indeed witnessed the progression of image classification and semantic segmentation, and it became a summary of many of the cool techniques researchers discovered in recent years.

However, since DeepLabV3, it has also started to use tons of tricks to improve benchmark metrics, which makes the results harder to reproduce. Among those tricks, ensembling predictions from multiple scales of the same image and increasing the feature resolution are the most useful, so they still offer valuable insights for engineering. Hopefully, in the next decade, when computing power is much cheaper, we will finally understand the essence of DeepLab after stripping out the boosting effects of these tricks.

References

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille, Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille, DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam, Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam, Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
