🤖 Deep Learning

Cityscape Image Segmentation With TensorFlow 2.0

Image segmentation using the UNet architecture.

Shubham Panchal
Towards Data Science
4 min read · Nov 3, 2019


Photo by Andrea Cau on Unsplash

Image segmentation is a detection technique used in various computer vision applications, in which we "segment" the parts of an image we are interested in. In this story, we'll be creating a UNet model for semantic segmentation ( not to be confused with instance segmentation 😕 ).

You can check out the implementation for this story here ->

What does the UNet architecture look like?


Implementing UNet is a bit easier if you use tf.keras or PyTorch. In simple words,

UNet has an encoder-decoder structure. The encoder takes in the image, performs various convolution and max-pooling operations on it and builds a latent representation of it. The decoder then takes this representation and upsamples it ( with the help of skip connections ), finally giving us the segmentation mask.

What makes UNet different from a convolutional autoencoder is that it has skip-connections 😎. A skip-connection, as the name suggests ( maybe ;-) ), preserves spatial information for the decoder. The encoder compresses the image into small, channel-heavy tensors ( like of shape [ 8, 8, 256 ] ), which may lose some important spatial features. So, as we feed an image through the encoder, we store the result of every max-pooling operation. Then, while performing transposed convolutions in the decoder, we concatenate the previous decoder output with the corresponding tensor we stored in the encoder part earlier. This way, the decoder receives some of the original image's spatial detail to construct the mask.
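The concatenation step above can be sketched with plain tensor ops. This is a toy illustration ( the shapes are made up, not taken from the notebook ):

```python
import numpy as np

# Toy illustration: a decoder output upsampled to 16x16 with 64 channels,
# and the encoder feature map of the same spatial size that was stored
# before the corresponding max-pooling operation.
decoder_out = np.zeros((1, 16, 16, 64), dtype=np.float32)   # from a transposed conv
encoder_skip = np.zeros((1, 16, 16, 64), dtype=np.float32)  # stored in the encoder

# The skip connection: concatenate along the channel axis.
# Spatial dimensions must match; only the channel count grows.
merged = np.concatenate([decoder_out, encoder_skip], axis=-1)
print(merged.shape)  # (1, 16, 16, 128)
```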

A creepy 👽 drawing of a UNet!

In Keras, we have the Conv2D, Conv2DTranspose, MaxPooling2D and UpSampling2D layers to make your life easy. But we'll be using raw TensorFlow 2.0 APIs to make the model.

Maybe the model was named "UNet" because of the U-shape its architecture diagram forms, with the skip connections bridging the two arms. Let's leave that to the inventors! 😅

Discussing the Data

Our dataset hails from the Cityscapes Image Pairs dataset by DanB on Kaggle. The images look like this:

Sample images from the dataset.

The right part is the mask and the left part is the actual image. We will split these image pairs using Pillow.
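A sketch of how such a side-by-side pair might be split with Pillow ( the helper name and the left/right layout are assumptions based on the sample image above ):

```python
from PIL import Image

def split_pair(path):
    """Split a side-by-side ( image | mask ) pair into its two halves."""
    pair = Image.open(path)
    w, h = pair.size
    image = pair.crop((0, 0, w // 2, h))   # left half: the actual image
    mask = pair.crop((w // 2, 0, w, h))    # right half: the mask
    return image, mask
```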

The dataset's masks contain multiple classes, each with its own colour. For simplicity, we will only try to segment the "road" present in the image. Note that the road has an RGB colour of ( 128, 63, 126 ) in the mask. Our model will output a binary mask, consisting of 1s and 0s only. Our input will be an image of shape ( 128, 128, 3 ) and the target will be a mask of shape ( 128, 128, 1 ). So, every pixel with an RGB value of ( 128, 63, 126 ) will have a value of 1 in the target mask; all other pixels will hold a value of 0.
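Building that binary target can be done with a simple NumPy comparison. A minimal sketch ( the function name is mine, not the notebook's ):

```python
import numpy as np

ROAD_RGB = (128, 63, 126)  # the road colour used above

def road_mask(mask_rgb):
    """Convert an RGB mask of shape (H, W, 3) into a binary road mask (H, W, 1)."""
    # True wherever all three channels match the road colour exactly.
    road = np.all(mask_rgb == np.array(ROAD_RGB), axis=-1)
    # 1.0 for road pixels, 0.0 everywhere else, with a trailing channel axis.
    return road.astype(np.float32)[..., np.newaxis]
```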

Getting the ops ready!

We will define methods for four operations:

  1. conv2d_down: Regular Convolution along with Leaky ReLU activation.
  2. maxpool_down: Max Pooling operation with valid padding.
  3. conv2d_up: Transposed convolution for upsampling the image.
  4. maxpool_up: Upsampling the input like the UpSampling2D Keras layer.
Snippet 1
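A sketch of what these four ops might look like with raw TensorFlow 2.0 APIs. The kernel sizes, strides and the Leaky ReLU slope here are assumptions; the notebook's exact values may differ:

```python
import tensorflow as tf

def conv2d_down(inputs, filters, stride_size):
    # Regular convolution followed by Leaky ReLU activation.
    out = tf.nn.conv2d(inputs, filters,
                       strides=[1, stride_size, stride_size, 1], padding='SAME')
    return tf.nn.leaky_relu(out, alpha=0.2)

def maxpool_down(inputs, pool_size, stride_size):
    # Max pooling with valid padding, halving the spatial dimensions.
    return tf.nn.max_pool2d(inputs, ksize=pool_size,
                            strides=stride_size, padding='VALID')

def conv2d_up(inputs, filters, stride_size, output_shape):
    # Transposed convolution for upsampling, followed by Leaky ReLU.
    out = tf.nn.conv2d_transpose(inputs, filters, output_shape=output_shape,
                                 strides=[1, stride_size, stride_size, 1],
                                 padding='SAME')
    return tf.nn.leaky_relu(out, alpha=0.2)

def maxpool_up(inputs, size):
    # Nearest-neighbour upsampling, like the UpSampling2D Keras layer.
    in_shape = tf.shape(inputs)
    return tf.image.resize(inputs, [in_shape[1] * size, in_shape[2] * size],
                           method='nearest')
```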

We will create some weights for our UNet using the Glorot uniform initializer,

Snippet 2
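The weight creation might look like this; the filter shapes below are hypothetical examples, not the notebook's exact shapes:

```python
import tensorflow as tf

initializer = tf.initializers.GlorotUniform()

def get_weight(shape, name):
    # A trainable convolution filter initialized with the Glorot uniform initializer.
    return tf.Variable(initializer(shape), name=name,
                       trainable=True, dtype=tf.float32)

# Hypothetical shapes for the first two encoder filters ( [h, w, in_ch, out_ch] ).
weights = [
    get_weight([3, 3, 3, 16], 'filter_1'),
    get_weight([3, 3, 16, 32], 'filter_2'),
]
```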

We are now ready to assemble the final model.

Making the UNet model

Assembling the UNet model with all the ops we created earlier.

Snippet 3
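To show how the pieces fit together, here is a deliberately tiny, self-contained UNet-style forward pass with a single encoder level and a single skip connection. The real model in the notebook is deeper; treat this as a shape-level sketch:

```python
import tensorflow as tf

init = tf.initializers.GlorotUniform()
w1 = tf.Variable(init([3, 3, 3, 16]))   # encoder conv
w2 = tf.Variable(init([3, 3, 16, 32]))  # bottleneck conv
w3 = tf.Variable(init([3, 3, 16, 32]))  # transposed conv ( 32 -> 16 channels )
w4 = tf.Variable(init([1, 1, 32, 1]))   # final 1x1 conv -> 1-channel mask

def model(x):
    # Encoder: conv + pool, keeping the pre-pool tensor for the skip connection.
    c1 = tf.nn.leaky_relu(tf.nn.conv2d(x, w1, strides=1, padding='SAME'))
    p1 = tf.nn.max_pool2d(c1, ksize=2, strides=2, padding='VALID')
    c2 = tf.nn.leaky_relu(tf.nn.conv2d(p1, w2, strides=1, padding='SAME'))
    # Decoder: transposed conv back to full resolution.
    out_shape = tf.stack([tf.shape(x)[0], 128, 128, 16])
    up = tf.nn.leaky_relu(tf.nn.conv2d_transpose(c2, w3, out_shape,
                                                 strides=2, padding='SAME'))
    # The skip connection: concatenate the decoder output with the stored
    # encoder tensor along the channel axis.
    merged = tf.concat([up, c1], axis=-1)
    # 1x1 conv + sigmoid gives the per-pixel road probability.
    return tf.nn.sigmoid(tf.nn.conv2d(merged, w4, strides=1, padding='SAME'))
```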

Notice the tf.concat() operation? That's where we actually concatenate the output from the previous decoder layer and the skip-connection coming from the encoder.

The model() takes in the input, passes it through the UNet and returns the sigmoid activated output. 🆒

Training and optimization

We optimize our model with the Adam optimizer and the Binary Crossentropy loss function ( remember the sigmoid activation in the last layer? ).

Snippet 4
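A sketch of such a train step using tf.GradientTape ( the function signature is an assumption; the notebook's train() may take different arguments ):

```python
import tensorflow as tf

optimizer = tf.optimizers.Adam(learning_rate=0.001)
bce = tf.losses.BinaryCrossentropy()

def train(model_fn, weights, images, masks):
    # One optimization step on a single batch.
    with tf.GradientTape() as tape:
        preds = model_fn(images)   # sigmoid-activated predictions in [0, 1]
        loss = bce(masks, preds)   # binary crossentropy against the 0/1 mask
    grads = tape.gradient(loss, weights)
    optimizer.apply_gradients(zip(grads, weights))
    return loss
```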

We are now ready to train the model. We will call the train() method on each batch in our dataset.

Snippet 5
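The epoch loop might look like the sketch below, built on tf.data. To keep it self-contained, it uses random stand-in data and a minimal one-filter stand-in for Snippet 4's model and train() step; the real inputs are ( 128, 128, 3 ) images and ( 128, 128, 1 ) masks:

```python
import tensorflow as tf

# Minimal stand-ins for the real UNet and train() step ( see Snippet 4 ).
w = tf.Variable(tf.initializers.GlorotUniform()([1, 1, 3, 1]))
model = lambda x: tf.nn.sigmoid(tf.nn.conv2d(x, w, strides=1, padding='SAME'))
optimizer = tf.optimizers.Adam(0.001)
bce = tf.losses.BinaryCrossentropy()

def train(images, masks):
    with tf.GradientTape() as tape:
        loss = bce(masks, model(images))
    optimizer.apply_gradients(zip(tape.gradient(loss, [w]), [w]))
    return loss

# Random stand-in data; in the notebook this comes from the Cityscapes pairs.
x = tf.random.normal([16, 8, 8, 3])
y = tf.cast(tf.random.uniform([16, 8, 8, 1]) > 0.5, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(8)

for epoch in range(25):  # the article trains for 25 epochs
    for images, masks in dataset:
        loss = train(images, masks)
```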

You will find a code cell in the notebook to generate a mask for an image. After training for 25 epochs, here are the results 😂 on the validation dataset.

The Results!


That’s All!

Hope that was interesting. Happy Machine Learning!

