The Evolution of DeepLab for Semantic Segmentation

From classical image segmentation methods, through deep-learning-based semantic segmentation, to DeepLab and its variants.

Beeren Sahu
Towards Data Science


In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels), and it is a long-standing problem in the field. Semantic segmentation goes a step further: it segments the image with an “understanding” of its content at the pixel level. In other words, semantic segmentation analyzes and classifies every pixel into one of multiple classes (labels).

Figure 1: a) Image segmentation using thresholding, b) Semantic segmentation

Semantic segmentation has many applications, including autonomous driving, human-machine interaction, computational photography, image search engines, and augmented reality, to name a few.

A Brief History of Semantic Segmentation

Quite a few classical algorithms have been designed for this task, such as the watershed algorithm, image thresholding, k-means clustering, and graph partitioning methods. The simplest is thresholding, in which a gray-scale image is converted into a binary image based on a threshold value. In spite of the many traditional image processing techniques, however, deep learning methods have been the game changer.
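
Thresholding is simple enough to write in a few lines. The sketch below is a minimal illustration (the threshold value is arbitrary): every pixel brighter than the threshold becomes foreground, the rest background.

```python
import numpy as np

def threshold_segment(gray, t=128):
    """Segment a 2-D gray-scale image (values in [0, 255]) by thresholding:
    pixels brighter than t become foreground (1), the rest background (0)."""
    return (gray > t).astype(np.uint8)

gray = np.random.randint(0, 256, size=(4, 4))
print(threshold_segment(gray, t=128))
```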

In order to properly understand how semantic segmentation is tackled by modern deep learning architectures, it is important to know that it is not an isolated field. Rather, it is a natural step in the progression from coarse to fine inference.

Figure: Coarse to fine inference

The origin can be located at classification, which consists of making a prediction for a whole input, i.e., predicting the object in an image. The next step towards fine-grained inference is localization or detection, which provides not only the classes but also additional information about their spatial location. Semantic segmentation can be seen as a dense-prediction task: the objective is to generate an output map of the same size as the input image, inferring a label for every pixel. This way, each pixel is labeled with the class of its enclosing object or region, making semantic segmentation the natural next step towards fine-grained inference. Further refinements are possible, such as instance segmentation (separate labels for different instances of the same class).

A general semantic segmentation architecture can be broadly thought of as an encoder network followed by a decoder network:

  • The encoder is usually a pre-trained classification network such as VGG or ResNet.
  • The task of the decoder is to semantically project the discriminative features (lower resolution) learnt by the encoder onto the pixel space (higher resolution) to get a dense classification.

One of the very early Deep Convolutional Neural Networks (DCNNs) used for semantic segmentation is the Fully Convolutional Network (FCN). The FCN pipeline is an extension of the classical CNN, and the main idea is to let the network take arbitrary-sized images as input. The restriction of classical CNNs to inputs of a specific size comes from their fully-connected layers, whose sizes are fixed. In contrast, FCNs have only convolutional and pooling layers, which gives them the ability to make predictions on arbitrary-sized inputs.
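
The following minimal PyTorch sketch illustrates the idea; the layer sizes are illustrative and not those of the original FCN. Because every layer is convolutional, the network accepts inputs of any spatial size and produces a spatial map of class scores instead of a single label vector:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 1/2 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                  # 1/4 resolution
        )
        # A 1x1 convolution replaces the fixed fully-connected classifier,
        # producing per-location class scores.
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        scores = self.classifier(self.features(x))            # coarse score map
        # Up-sample the coarse map back to the input resolution.
        return nn.functional.interpolate(
            scores, size=x.shape[-2:], mode="bilinear", align_corners=False)

model = TinyFCN()
out = model(torch.randn(1, 3, 224, 224))   # works for any input height/width
print(out.shape)                           # torch.Size([1, 21, 224, 224])
```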

Figure: FCN Architecture [ref: https://arxiv.org/abs/1411.4038]

One issue with this FCN is that, after propagating through several alternating convolutional and pooling layers, the resolution of the output feature maps is heavily down-sampled. Therefore, the direct predictions of the FCN are typically at low resolution, resulting in relatively fuzzy object boundaries. A variety of more advanced FCN-based approaches have been proposed to address this issue, including SegNet, U-Net, DeepLab, and dilated convolutions.

In the following sections we will discuss DeepLab for semantic segmentation and its evolution.

The Evolution of DeepLab

DeepLab is a state-of-the-art semantic segmentation model designed and open-sourced by Google. Dense prediction is achieved by simply up-sampling the output of the last convolutional layer and computing a pixel-wise loss, and DeepLab relies on atrous convolution to keep that output dense enough for such simple up-sampling to work.

Atrous Convolution

The repeated combination of max-pooling and striding at consecutive layers of a DCNN significantly reduces the spatial resolution of the resulting feature maps. One solution is to use deconvolution layers to up-sample the resulting maps, but this requires additional memory and time. Atrous convolution offers a simple yet powerful alternative to deconvolutional layers: it effectively enlarges the field of view of filters without increasing the number of parameters or the amount of computation.

Atrous convolution is shorthand for convolution with up-sampled filters. Filter up-sampling amounts to inserting holes (‘trous’ in French) between non-zero filter taps.

Figure: (Top) Regular convolution followed by up-sampling; (Bottom) atrous convolution, resulting in a denser feature map. [ref: https://arxiv.org/abs/1606.00915]

Mathematically, the atrous convolution y[i] of a one-dimensional signal x[i] with a filter w[k] of length K and rate r is defined as:
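
Written out (following the definition in the DeepLab paper):

y[i] = Σₖ x[i + r·k] · w[k],  for k = 1, …, K

where the rate r determines the spacing at which the input signal is sampled; r = 1 recovers standard convolution. The NumPy sketch below is a direct, minimal implementation of this formula for illustration (a 'valid' output is assumed, i.e. only positions where the dilated filter fits inside the signal):

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous convolution: y[i] = sum_k x[i + rate * k] * w[k]."""
    K = len(w)
    span = rate * (K - 1)                 # extent of the dilated filter minus one
    y = np.empty(len(x) - span)
    for i in range(len(y)):
        y[i] = sum(x[i + rate * k] * w[k] for k in range(K))
    return y

x = np.arange(10, dtype=float)
w = np.array([1.0, 0.0, -1.0])
print(atrous_conv1d(x, w, rate=1))        # rate 1: ordinary convolution
print(atrous_conv1d(x, w, rate=2))        # same filter, wider field of view
```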

DeepLab (DeepLabv1)

The success of DeepLabv1 in semantic segmentation is due to several advancements over the previous state-of-the-art models, specifically the FCN model. These advancements address the following two challenges.

Challenge 1: reduced feature resolution

Due to the repeated pooling and down-sampling (‘striding’) in a DCNN, there is a significant reduction in spatial resolution. DeepLabv1 removes the down-sampling operator from the last few max-pooling layers of the DCNN and instead up-samples the filters (atrous convolution) in the subsequent convolutional layers, resulting in feature maps computed at a higher sampling rate.
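
The effect is easy to see in a small PyTorch sketch (layer and tensor sizes are illustrative): dropping a stride-2 pooling step and dilating the following convolution keeps the feature map at full resolution while preserving the filter's field of view.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 64, 64)

# Standard block: stride-2 pooling halves the spatial resolution.
strided = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(64, 64, 3, padding=1))

# DeepLab-style replacement: drop the pooling and dilate the following
# convolution instead, so the field of view is unchanged but the feature
# map keeps its resolution.
atrous = nn.Conv2d(64, 64, 3, padding=2, dilation=2)

print(strided(x).shape)   # torch.Size([1, 64, 32, 32])
print(atrous(x).shape)    # torch.Size([1, 64, 64, 64])
```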

Challenge 2: reduced localization accuracy due to DCNN invariance

Fine details are captured by employing a fully connected Conditional Random Field (CRF). The CRF potentials incorporate smoothness terms that maximize label agreement between similar pixels, and can integrate more elaborate terms that model contextual relationships between object classes. The following figure illustrates how the segmentation map improves after a few mean-field iterations of the CRF.

Figure: [Top] Score map (input before softmax function), [Bottom] belief map (output map after mean field iterations) [ref: https://arxiv.org/abs/1606.00915]
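
In practice the fully connected CRF is usually applied as a post-processing step. The sketch below assumes the third-party pydensecrf package (an assumption for illustration, not something used in the original post) and uses illustrative parameter values rather than the ones tuned in the paper:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, iterations=5):
    """image: HxWx3 uint8 RGB image; probs: (n_labels, H, W) softmax output
    of the DCNN. Returns a refined HxW label map."""
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))       # -log(p) unary potentials
    d.addPairwiseGaussian(sxy=3, compat=3)            # smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,           # appearance kernel
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(iterations)                       # mean-field iterations
    return np.argmax(np.array(q), axis=0).reshape(h, w)
```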

The DeepLabv1 model takes an image as input, passes it through the usual DCNN layers followed by one or two atrous convolutional layers, and produces a coarse score map. This map is then up-sampled to the original image size using bilinear interpolation. Finally, a fully connected CRF is applied to refine the segmentation result.

Figure: Deeplabv1 flowchart [ref: https://arxiv.org/abs/1606.00915]

DeepLabv2

To further improve on the DeepLabv1 architecture, the next challenge to address is the existence of objects at multiple scales.

Challenge: existence of objects at multiple scales

A standard way to deal with objects at multiple scales is to present the DCNN with re-scaled versions of the same image and then aggregate the resulting feature or score maps.

Solution: Atrous Spatial Pyramid Pooling (ASPP). The idea is to apply multiple atrous convolutions with different sampling rates to the same input feature map and fuse the results together. Since objects of the same class can appear at different scales in the image, ASPP helps account for those different scales, which improves accuracy.

Figure: Atrous Spatial Pyramid Pooling (ASPP). [ref: https://arxiv.org/abs/1606.00915]
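
A minimal ASPP-style module can be sketched in PyTorch as below. The rates and channel sizes are assumptions chosen for illustration, and the branches are fused here by concatenation followed by a 1 × 1 convolution (DeepLabv2 itself fuses the branch score maps by summation):

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel atrous convolutions at different rates over the same input."""
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Each branch sees the same feature map with a different field of view.
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

aspp = SimpleASPP()
print(aspp(torch.randn(1, 256, 33, 33)).shape)   # torch.Size([1, 256, 33, 33])
```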

DeepLabv3

The networks so far encode multi-scale contextual information by probing the incoming features with filters at multiple rates (atrous convolution) and multiple effective fields-of-view (ASPP). The next challenge is to capture sharper object boundaries by gradually recovering the spatial information.

Challenge: capture sharper object boundaries

The DeepLabv3 architecture adopts a novel encoder-decoder structure with atrous separable convolution to address this issue. The encoder-decoder model is able to obtain sharp object boundaries. Encoder-decoder networks have been successfully applied to many computer vision tasks, including object detection, human pose estimation, and semantic segmentation.

Typically, the encoder-decoder networks contain:

  • An encoder module that gradually reduces the spatial resolution of the feature maps and captures higher-level semantic information.
  • A decoder module that gradually recovers the spatial information.

In addition to the encoder-decoder structure, depth-wise separable convolution is applied to increase computational efficiency. This is achieved by factorizing a standard convolution into a depth-wise convolution followed by a point-wise convolution (i.e., a 1 × 1 convolution). Specifically, the depth-wise convolution performs a spatial convolution independently for each input channel, while the point-wise convolution combines the outputs of the depth-wise convolution across channels.

Figure: 3×3 Depth-wise separable convolution for atrous convolution. [ref: https://arxiv.org/abs/1802.02611]
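
As a quick illustration of the computational savings, here is a minimal PyTorch sketch of a 3 × 3 depth-wise separable convolution (channel sizes are illustrative; a dilation argument could be added to the depth-wise convolution to make it atrous):

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 128

# Depth-wise convolution: groups=in_ch filters each channel independently.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
# Point-wise (1x1) convolution: mixes information across channels.
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

x = torch.randn(1, in_ch, 56, 56)
print(pointwise(depthwise(x)).shape)   # torch.Size([1, 128, 56, 56])

# Compare parameter counts with a standard 3x3 convolution.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                       # 73856
print(count(depthwise) + count(pointwise))   # 8960 (far fewer parameters)
```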

DeepLabv3+

DeepLabv3+ extends DeepLabv3 by adding a simple yet effective decoder module to further refine the segmentation results, especially along object boundaries.

Encoder: Compared to DeepLabv3, DeepLabv3+ uses Aligned Xception instead of ResNet-101 as its main feature extractor (encoder), but with a significant modification: all max-pooling operations are replaced by depth-wise separable convolutions.

Decoder: The encoder is based on an output stride of 16, i.e. its output is down-sampled by a factor of 16 with respect to the input image. Instead of using bilinear up-sampling by a factor of 16 directly, the encoded features are first up-sampled by a factor of 4 and concatenated with corresponding low-level features from the encoder module that have the same spatial dimensions. Before concatenation, 1 x 1 convolutions are applied to the low-level features to reduce their number of channels. After concatenation, a few 3 x 3 convolutions are applied and the features are up-sampled again by a factor of 4. This gives an output of the same size as the input image.

Figure: Deeplabv3+ model. [ref: https://arxiv.org/abs/1802.02611]
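
A minimal PyTorch sketch of this decoder path is given below. The channel sizes (256 encoder channels, low-level features reduced to 48 channels) are assumptions chosen for illustration rather than the exact published configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDecoder(nn.Module):
    def __init__(self, enc_ch=256, low_ch=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1)   # 1x1 conv shrinks low-level channels
        self.refine = nn.Sequential(
            nn.Conv2d(enc_ch + 48, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(256, num_classes, 1)

    def forward(self, enc_feat, low_feat, out_size):
        # 1) Up-sample encoder output by 4 to match the low-level features.
        x = F.interpolate(enc_feat, size=low_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        # 2) Concatenate with the channel-reduced low-level features.
        x = torch.cat([x, self.reduce(low_feat)], dim=1)
        # 3) A few 3x3 convolutions, then up-sample by 4 to the input size.
        x = self.classify(self.refine(x))
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)

decoder = SimpleDecoder()
enc = torch.randn(1, 256, 32, 32)     # encoder output at output stride 16 (512/16)
low = torch.randn(1, 256, 128, 128)   # low-level features at output stride 4
print(decoder(enc, low, (512, 512)).shape)   # torch.Size([1, 21, 512, 512])
```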

Hope you enjoyed reading!!!

Cover Photo by Suzanne D. Williams on Unsplash
