Review: U-Net (Biomedical Image Segmentation)

Sik-Ho Tsang
Towards Data Science
5 min read · Nov 5, 2018

--

In this story, U-Net is reviewed. U-Net is one of the famous Fully Convolutional Networks (FCNs) for biomedical image segmentation. It was published at MICCAI 2015 and had more than 3000 citations at the time of writing. (Sik-Ho Tsang @ Medium)

In the field of biomedical image annotation, we always need experts who have acquired the related knowledge to annotate each image, and annotation consumes a large amount of their time. If the annotation process becomes automatic, less human effort and lower cost can be achieved. Alternatively, it can act in an assisting role to reduce human mistakes.

You may ask: “Isn’t it too narrow to read about biomedical image segmentation?”

However, we can learn its techniques and apply them to other industries: for example, quality control, automatic inspection, or automatic robotics in construction, fabrication, and manufacturing processes. These activities involve quantitative diagnosis. If we can automate them, cost can be saved with even higher accuracy.

In this paper, the authors segment/annotate electron microscopy (EM) images. They also modify the network a little to segment/annotate dental X-ray images at ISBI 2015.

EM Images

What Are Covered

A. EM Image Segmentation

  1. U-Net Network Architecture
  2. Overlap Tile Strategy
  3. Elastic Deformation for Data Augmentation
  4. Separation of Touching Objects
  5. Results

B. Dental X-Ray Image Segmentation

  1. Some Modifications of U-Net
  2. Results

A.1. U-Net Network Architecture

U-Net

The U-Net architecture is shown above. It consists of a contracting path and an expansive path.

Contracting path

  • At each step, two consecutive 3×3 convolutions are followed by a 2×2 max pooling. This helps to extract more advanced features, but it also reduces the size of the feature maps.

Expansive path

  • At each step, a 2×2 up-convolution is followed by two consecutive 3×3 convolutions to recover the size of the segmentation map. However, the contracting path reduces the “where” while it increases the “what”: we gain advanced features, but we also lose localization information.
  • Thus, after each up-convolution, the feature maps from the same level of the contracting path are concatenated (gray arrows). This passes localization information from the contracting path to the expansive path.
  • At the end, a 1×1 convolution maps the feature map from 64 channels to 2, since the output has only 2 classes: cell and membrane.
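Because the 3×3 convolutions are unpadded, the spatial sizes along the U can be traced with simple arithmetic. Here is a minimal sketch (the function name and `depth` parameter are my own, not from the paper):

```python
def unet_output_size(n: int, depth: int = 4) -> int:
    """Track the spatial size through a U-Net with unpadded 3x3 convolutions."""
    # Contracting path: two 3x3 valid convs (-2 each), then 2x2 max pool (/2).
    for _ in range(depth):
        n = (n - 4) // 2
    n = n - 4                       # bottleneck: two more 3x3 valid convs
    # Expansive path: 2x2 up-conv (x2), then two 3x3 valid convs (-2 each).
    for _ in range(depth):
        n = n * 2 - 4
    return n

print(unet_output_size(572))        # 388, matching the paper's 572 -> 388 tiles
```

This reproduces the tile sizes in the paper’s figure: a 572×572 input tile yields a 388×388 segmentation map.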

A.2. Overlap Tile Strategy

Overlap Tile Strategy

Since unpadded convolutions are used, the output size is smaller than the input size. Instead of downsizing the image before the network and upsampling afterwards, an overlap tile strategy is used: the whole image is predicted part by part, as in the figure above. Each yellow area in the image is predicted using the corresponding blue area. At the image boundary, the image is extrapolated by mirroring.
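The tiling can be sketched with NumPy as follows; `predict` stands in for the trained network, and for simplicity I assume the image dimensions are multiples of the output tile size (the function name is illustrative, not from the paper):

```python
import numpy as np

def predict_by_tiles(image, predict, tile_out=388, margin=92):
    """Overlap-tile strategy: each output tile (yellow area) is predicted
    from a larger input tile (blue area) that adds `margin` pixels of
    context on each side; boundary context comes from mirror padding."""
    h, w = image.shape
    padded = np.pad(image, margin, mode="reflect")   # extrapolate by mirroring
    out = np.zeros_like(image, dtype=float)
    for y in range(0, h, tile_out):
        for x in range(0, w, tile_out):
            tile = padded[y:y + tile_out + 2 * margin,
                          x:x + tile_out + 2 * margin]
            out[y:y + tile_out, x:x + tile_out] = predict(tile)
    return out
```

With the paper’s numbers, a 572×572 input tile yields a 388×388 output tile, so `margin` = (572 − 388) / 2 = 92.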

A.3. Elastic Deformation for Data Augmentation

Elastic Deformation

Since the training set can only be annotated by experts, it is small. To increase its size, data augmentation is done by randomly deforming the input image and the output segmentation map in the same way.
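A sketch of such an augmentation with NumPy/SciPy is shown below. The paper generates smooth random per-pixel displacements; the parameter values here are illustrative, not the paper’s exact settings, and bilinear interpolation is used for brevity:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, label, alpha=34.0, sigma=4.0, rng=None):
    """Apply the same smooth random displacement field to an image and
    its segmentation map (alpha scales magnitude, sigma smooths the field)."""
    rng = np.random.default_rng() if rng is None else rng
    shape = image.shape
    # Smooth per-pixel random displacements with a Gaussian filter.
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]),
                         indexing="ij")
    coords = [ys + dy, xs + dx]
    warped_img = map_coordinates(image, coords, order=1, mode="reflect")
    # Nearest-neighbour for the label map so class ids stay discrete.
    warped_lbl = map_coordinates(label, coords, order=0, mode="reflect")
    return warped_img, warped_lbl
```

Warping the image and the label with the same displacement field keeps the pair consistent, which is the key point of this augmentation.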

A.4. Separation of Touching Objects

Segmentation Map (Left) and Weight Map (Right)

Since touching objects are placed closely to each other, they are easily merged by the network. To separate them, a weight map is applied to the network output in the loss.

In the weight map, d1(x) is the distance to the border of the nearest cell at position x, and d2(x) is the distance to the border of the second nearest cell. The border term of the weight is w0·exp(−(d1(x)+d2(x))²/(2σ²)), with w0 = 10 and σ ≈ 5 pixels in the paper. Thus, at borders between cells, the weight is much higher, as in the figure.

The cross-entropy loss is then weighted at each position by this weight map. This helps to force the network to learn the small separation borders between touching cells.
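The border term of the weight map can be sketched with SciPy distance transforms. This omits the class-balancing term w_c from the paper, and assumes `labels` is an instance map with 0 = background and 1..K = individual cells:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(labels, w0=10.0, sigma=5.0):
    """Border weight w(x) = w0 * exp(-(d1 + d2)^2 / (2 * sigma^2)) on
    background pixels, where d1/d2 are distances to the nearest and
    second-nearest cell (class-balancing term omitted for brevity)."""
    ids = [i for i in np.unique(labels) if i != 0]
    if len(ids) < 2:
        return np.zeros(labels.shape)
    # Distance from every pixel to each cell (0 inside that cell).
    dists = np.stack([distance_transform_edt(labels != i) for i in ids])
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]     # nearest and second-nearest cell
    w = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    w[labels > 0] = 0               # weight applies to background pixels
    return w
```

Background pixels squeezed between two cells get small d1 + d2 and hence a large weight, exactly the narrow separation borders the network should not merge.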

A.5. Results

A.5.1. ISBI 2012 Challenge

Some Difficult Parts in EM Images
U-Net had the rank-1 result at that time
  • Warping Error: A segmentation metric that penalizes topological disagreements.
  • Rand Error: A measure of similarity between two clusters or segmentations.
  • Pixel Error: A standard pixel-wise error.
  • Training time: 10 hours
  • Testing speed: around 1 s per image

A.5.2. PhC-U373 and DIC-HeLa Datasets

PhC-U373 and DIC-HeLa Datasets

U-Net achieved the highest IoU on both datasets.

B.1. Some Modifications of U-Net

Dental X-Ray Image with 7 classes
U-Net for Dental X-Ray Images

This time, a 4×4 up-convolution is used, and the final 1×1 convolution maps the feature maps from 64 channels to 7, because the output at each location has 7 classes.

Zero padding instead of mirroring at the image boundary

In the overlap tile strategy, zero padding is used instead of mirroring at the image boundary, because mirrored teeth would not make sense.

Loss function at multiple levels

Additional loss layers with softmax loss are attached to the low-resolution feature maps, in order to guide the deep layers to learn the segmentation classes directly.
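One way to picture this deep supervision: a plain per-pixel softmax cross-entropy is applied to logit maps at several resolutions and the losses are summed. A NumPy sketch (the level weights are my own illustration, not from the paper):

```python
import numpy as np

def softmax_xent(logits, target):
    """Per-pixel softmax cross-entropy, averaged over pixels.
    logits: (C, H, W) array; target: (H, W) integer class map."""
    z = logits - logits.max(axis=0, keepdims=True)        # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = target.shape
    return -logp[target, np.arange(h)[:, None], np.arange(w)].mean()

def multi_level_loss(outputs, targets, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of softmax losses at several resolutions: `outputs` are
    (C, Hi, Wi) logit maps, `targets` the matching downsampled label maps."""
    return sum(wt * softmax_xent(o, t)
               for wt, o, t in zip(weights, outputs, targets))
```

Each low-resolution head receives its own gradient signal, so the deep layers are pushed toward the segmentation classes even before the expansive path restores full resolution.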

B.2. Results

Some Visualization Results

I have also reviewed CUMedVision1 and CUMedVision2. Please feel free to read them if interested.

References

My Related Reviews

[CUMedVision1] [CUMedVision2] [FCN] [DeconvNet]
