UNet, which evolved from the traditional convolutional neural network, was first designed and applied in 2015 to process biomedical images. A general convolutional neural network focuses on image classification, where the input is an image and the output is a single label; in biomedical cases, however, we need not only to distinguish whether there is a disease, but also to localise the area of abnormality.
UNet is dedicated to solving this problem. The reason it is able to localise and distinguish borders is that it performs classification on every pixel, so the input and output share the same size. For example, for an input image of size 2×2:
[[255, 230], [128, 12]] # each number is a pixel
the output will have the same size of 2×2:
[[1, 0], [1, 1]] # in general, each value can be any number in [0, 1]
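To make this concrete, here is a small NumPy sketch (the probability values are made up) of how such raw per-pixel outputs get thresholded into the mask above:

```python
import numpy as np

probs = np.array([[0.98, 0.12],
                  [0.71, 0.87]])     # per-pixel probabilities from the network
mask = (probs > 0.5).astype(int)     # threshold into a binary segmentation mask
print(mask)                          # [[1 0]
                                     #  [1 1]]
```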
Now let’s get into the detailed implementation of UNet. I will:
- Show the overview of UNet
- Break down the implementation line by line and explain it further
Overview
The network’s basic architecture looks like this:

[Figure: the U-shaped UNet architecture diagram]
At first sight, it has a "U" shape. The architecture is symmetric and consists of two major parts – the left part is called the contracting path, which is constituted by the general convolutional process; the right part is the expansive path, which is constituted by transposed 2D convolutional layers (you can think of them as an upsampling technique for now).
Now let’s have a quick look at the implementation:
The code is adapted from a kernel of a Kaggle competition; in general, most UNet implementations follow the same structure.
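A condensed sketch of that structure in Keras might look like this (the variable names, the dropout rates, and the 448×448 input size are my own assumptions; the filter counts follow the original paper):

```python
from tensorflow.keras.layers import (Input, Conv2D, Conv2DTranspose,
                                     MaxPooling2D, Dropout, concatenate)
from tensorflow.keras.models import Model

def conv_block(x, filters):
    # each step of the U consists of two 3x3 convolutions
    x = Conv2D(filters, (3, 3), activation='relu', padding='same')(x)
    return Conv2D(filters, (3, 3), activation='relu', padding='same')(x)

inputs = Input((448, 448, 1))   # 448 / 2**4 = 28 at the bottom of the U

# contracting path: conv block -> max pooling -> dropout, filters doubling
c1 = conv_block(inputs, 64)
p1 = Dropout(0.25)(MaxPooling2D((2, 2))(c1))
c2 = conv_block(p1, 128)
p2 = Dropout(0.25)(MaxPooling2D((2, 2))(c2))
c3 = conv_block(p2, 256)
p3 = Dropout(0.25)(MaxPooling2D((2, 2))(c3))
c4 = conv_block(p3, 512)
p4 = Dropout(0.25)(MaxPooling2D((2, 2))(c4))

# bottommost block: no pooling afterwards
c5 = conv_block(p4, 1024)

# expansive path: transposed conv -> concatenate with the mirrored block
u6 = Conv2DTranspose(512, (2, 2), strides=(2, 2), padding='same')(c5)
c6 = conv_block(concatenate([u6, c4]), 512)
u7 = Conv2DTranspose(256, (2, 2), strides=(2, 2), padding='same')(c6)
c7 = conv_block(concatenate([u7, c3]), 256)
u8 = Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(c7)
c8 = conv_block(concatenate([u8, c2]), 128)
u9 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(c8)
c9 = conv_block(concatenate([u9, c1]), 64)

outputs = Conv2D(1, (1, 1), activation='sigmoid')(c9)  # one value per pixel
model = Model(inputs, outputs)
```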
Now let’s break down the implementation line by line and map each part to the corresponding section of the UNet architecture diagram.
Line by Line Explanation
Contracting Path
The contracting path follows the formula:
conv_layer1 -> conv_layer2 -> max_pooling -> dropout(optional)
So the first part of our code is:
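(A sketch written out from the condensed version above; the 0.25 dropout rate is an assumed value.)

```python
c1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)  # 1 -> 64 channels
c1 = Conv2D(64, (3, 3), activation='relu', padding='same')(c1)
p1 = MaxPooling2D((2, 2))(c1)   # halves the spatial size
p1 = Dropout(0.25)(p1)          # the optional dropout from the formula
```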
which matches this part of the diagram:

[Figure: the first block of the contracting path]
Notice that each process consists of two convolutional layers, and the number of channels changes from 1 → 64, as the convolution process increases the depth of the image. The red arrow pointing down is the max pooling process, which halves the size of the image (the reduction from 572×572 → 568×568 in the paper is due to its unpadded convolutions; the implementation here uses padding="same", so only the pooling changes the spatial size).
The process is repeated 3 more times:

[Figure: the remaining three blocks of the contracting path]
with code:
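(Again a sketch; the filter counts double at each step, following the paper.)

```python
c2 = Conv2D(128, (3, 3), activation='relu', padding='same')(p1)
c2 = Conv2D(128, (3, 3), activation='relu', padding='same')(c2)
p2 = Dropout(0.25)(MaxPooling2D((2, 2))(c2))

c3 = Conv2D(256, (3, 3), activation='relu', padding='same')(p2)
c3 = Conv2D(256, (3, 3), activation='relu', padding='same')(c3)
p3 = Dropout(0.25)(MaxPooling2D((2, 2))(c3))

c4 = Conv2D(512, (3, 3), activation='relu', padding='same')(p3)
c4 = Conv2D(512, (3, 3), activation='relu', padding='same')(c4)
p4 = Dropout(0.25)(MaxPooling2D((2, 2))(c4))
```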
and now we reach the bottommost part of the network:

[Figure: the bottommost block of the U]
Still, two convolutional layers are built, but this time with no max pooling:
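```python
c5 = Conv2D(1024, (3, 3), activation='relu', padding='same')(p4)
c5 = Conv2D(1024, (3, 3), activation='relu', padding='same')(c5)
```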
The image at this point has been resized to 28×28×1024. Now let’s get to the expansive path.
Expansive Path
In the expansive path, the image is upsampled back to its original size. Each block follows the formula:
conv_2d_transpose -> concatenate -> conv_layer1 -> conv_layer2
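Written out in Keras, one such block might look like this sketch (the numbered comments are mine, for the explanation below; the 0.5 dropout rate is an assumption):

```python
u6 = Conv2DTranspose(512, (2, 2), strides=(2, 2), padding='same')(c5)  # line 1
u6 = concatenate([u6, c4])                                             # line 2
u6 = Dropout(0.5)(u6)                                                  # line 3
c6 = Conv2D(512, (3, 3), activation='relu', padding='same')(u6)        # line 4
c6 = Conv2D(512, (3, 3), activation='relu', padding='same')(c6)        # line 5
```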

Transposed convolution is an upsampling technique that expands the size of images. There is a visualised demo and an explanation [here](https://medium.com/activating-robotic-minds/up-sampling-with-transposed-convolution-9ae4f2df52d0). Basically, it does some padding on the original image followed by a convolution operation.
After the transposed convolution, the image is upsized from 28×28×1024 → 56×56×512; this image is then concatenated with the corresponding image from the contracting path, and together they make an image of size 56×56×1024. The reason is to combine the information from the previous layers in order to get a more precise prediction.
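A quick sanity check of this shape arithmetic, using dummy tensors and the imports from the sketch above:

```python
import tensorflow as tf

bottom = tf.zeros((1, 28, 28, 1024))                  # dummy bottom-of-the-U tensor
up = Conv2DTranspose(512, (2, 2), strides=(2, 2), padding='same')(bottom)
print(up.shape)                                       # (1, 56, 56, 512)
skip = tf.zeros((1, 56, 56, 512))                     # stand-in for c4
print(concatenate([up, skip]).shape)                  # (1, 56, 56, 1024)
```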
In lines 4 and 5, two more convolutional layers are added.
Same as before, this process is repeated 3 more times:
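(Sketch continued, with the filter counts halving at each step:)

```python
u7 = Conv2DTranspose(256, (2, 2), strides=(2, 2), padding='same')(c6)
u7 = Dropout(0.5)(concatenate([u7, c3]))
c7 = Conv2D(256, (3, 3), activation='relu', padding='same')(u7)
c7 = Conv2D(256, (3, 3), activation='relu', padding='same')(c7)

u8 = Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(c7)
u8 = Dropout(0.5)(concatenate([u8, c2]))
c8 = Conv2D(128, (3, 3), activation='relu', padding='same')(u8)
c8 = Conv2D(128, (3, 3), activation='relu', padding='same')(c8)

u9 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(c8)
u9 = Dropout(0.5)(concatenate([u9, c1]))
c9 = Conv2D(64, (3, 3), activation='relu', padding='same')(u9)
c9 = Conv2D(64, (3, 3), activation='relu', padding='same')(c9)
```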
Now we’ve reached the uppermost part of the architecture; the last step is to reshape the image to satisfy our prediction requirements.

[Figure: the final block and the 1×1 output convolution]
The last layer is a convolution layer with 1 filter of size 1×1 (notice that there is no dense layer in the whole network). The rest is the same as for any other neural network training.
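A sketch of this last layer together with the usual model-building and compile step (binary cross-entropy and Adam are assumed choices for a binary mask):

```python
outputs = Conv2D(1, (1, 1), activation='sigmoid')(c9)  # one probability per pixel

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```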
Conclusion
UNet is able to do image localisation by predicting the image pixel by pixel, and the author of UNet claims in his paper that the network is strong enough to make good predictions even with few training samples by using excessive data augmentation techniques. There are many applications of image segmentation using UNet, and it also appears in lots of competitions. You should try it out yourself, and I hope this post can be a good starting point for you.
References:
- https://github.com/hlamba28/UNET-TGS/blob/master/TGS%20UNET.ipynb
- https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47
- https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
- https://medium.com/activating-robotic-minds/up-sampling-with-transposed-convolution-9ae4f2df52d0
- https://www.kaggle.com/phoenigs/u-net-dropout-augmentation-stratification