Review: 3D U-Net+ResNet — Volumetric Convolutions + Long & Short Residual Connections (Biomedical Image Segmentation)

Outperforms V-Net-like Network

Sik-Ho Tsang
Towards Data Science


Example of prostate MR images displaying large variations (Only centre part)

In this story, the paper “Volumetric ConvNets with Mixed Residual Connections for Automated Prostate Segmentation from 3D MR Images” is reviewed. The network combines the concepts of 3D U-Net and ResNet. Manual segmentation of the prostate from 3D Magnetic Resonance (MR) images is time-consuming and subjective with limited reproducibility: it heavily depends on experience and suffers from large inter- and intra-observer variations. On the other hand, automated segmentation is very challenging:

  • First, different MR images have global inter-scan variability and intra-scan intensity variation due to different MR scanning protocols, such as acquisition with or without an endorectal coil.
  • Second, clear prostate boundaries are lacking due to the similar appearance of the prostate and the surrounding tissues.
  • Third, the prostate has a wide variation in size and shape among different subjects due to pathological changes or different image resolutions.

In this work, a U-Net-like network with volumetric convolutions is proposed, with a mixed use of short and long residual connections. This is the work by the team CUMED from The Chinese University of Hong Kong (CUHK) on the MICCAI Prostate MR Image Segmentation (PROMISE12) challenge dataset. It is a 2017 AAAI paper with more than 90 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Network Architecture
  2. Mixed Short and Long Residual Connections
  3. Ablation Study
  4. Comparisons with State-of-the-art Approaches

1. Network Architecture

(a) Network Structure, (b) ResBlock

1.1. Basic Volumetric ConvNet

  • A 2D fully ConvNet (FCN) is extended into a volumetric ConvNet to enable volume-to-volume prediction.
  • From the down-sampling path alone, we can only obtain a coarse prediction, which is sufficient for some detection and classification tasks but not for voxel-wise semantic segmentation.
  • Three 2×2×2 max pooling layers with stride of 2 are applied between ResBlocks.
  • An up-sampling path, consisting of deconvolutional and convolutional layers, is implemented to generate dense predictions with much higher resolutions.
  • With convolutions, deconvolutions and pooling all working in a 3D manner, the network can fully preserve and exploit the 3D spatial information when extracting features and making predictions (a minimal sketch of these building blocks follows this list).
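Below is a minimal PyTorch sketch of the volumetric building blocks mentioned above (3D convolution, 2×2×2 max pooling with stride 2, and 3D deconvolution). The channel counts and tensor sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
pool3d = nn.MaxPool3d(kernel_size=2, stride=2)                   # halves D, H, W
deconv3d = nn.ConvTranspose3d(16, 16, kernel_size=2, stride=2)   # doubles D, H, W

x = torch.randn(1, 1, 16, 64, 64)   # (batch, channel, depth, height, width)
f = conv3d(x)                       # (1, 16, 16, 64, 64) volumetric features
down = pool3d(f)                    # (1, 16, 8, 32, 32)  coarse features after down-sampling
up = deconv3d(down)                 # (1, 16, 16, 64, 64) dense resolution restored by deconvolution
```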

1.2. Deep Supervision Mechanism

  • Deep supervision mechanism in the network is used to accelerate its convergence speed.
  • One convolutional layer (kernel size 1×1×1) is added at the end of the network to generate the main prediction.
  • Besides, several convolutional layers (kernel size 1×1×1) are applied to hidden feature maps in the up-sampling path to obtain auxiliary coarse predictions, and deconvolutional layers are then used to obtain auxiliary dense predictions with the same size as the input.
  • The weighted sum of the cross-entropy losses of the main prediction and the auxiliary predictions is minimized when training the volumetric ConvNet (a minimal loss sketch follows this list).
  • In principle, the deep supervision mechanism can function as a strong “regularization” during training and thus it is important for training ConvNet with limited training data.
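A minimal sketch of the deep supervision loss, assuming one main prediction and two auxiliary dense predictions already upsampled to the input size; the auxiliary weights here are illustrative, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(main_logits, aux_logits_list, target, aux_weights=(0.4, 0.2)):
    """Weighted sum of cross-entropy losses over the main and auxiliary predictions.

    main_logits: (N, C, D, H, W); aux_logits_list: tensors of the same shape;
    target: (N, D, H, W) voxel-wise class labels.
    """
    loss = F.cross_entropy(main_logits, target)
    for w, aux in zip(aux_weights, aux_logits_list):
        loss = loss + w * F.cross_entropy(aux, target)  # auxiliary supervision
    return loss
```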

2. Mixed Short and Long Residual Connections

  • Long and short residual connections are used just like U-Net+ResNet.

2.1. Short Residual Connections

  • The first kind of residual connections are employed to construct the local residual blocks (ResBlocks), as in the (b) part of the figure.
  • Each ResBlock is composed of two convolutional layers and two rectified linear units (ReLUs), with a short residual connection summing the block input to its output (see the sketch below).
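A minimal PyTorch sketch of such a ResBlock: two 3D convolutions with ReLUs, and the block input summed to the output via the short residual connection. The exact layer ordering and channel counts are assumptions for illustration.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # short residual connection via summation
```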

2.2. Long Residual Connections

  • Long Residual Connections: The second kind of residual connections are applied to connect the residual blocks with the same resolution in the downsampling and up-sampling paths, as in the (a) part of the figure.
  • These residual connections can explicitly propagate two kinds of important information within the ConvNet.
  • First, they can propagate the spatial location information forward to the up-sampling path in order to recover the spatial information loss caused by down-sampling operations for more accurate segmentation.
  • Second, since summation operations are employed to construct the residual connections, the architecture can propagate the gradient flow backward more smoothly, and hence improve the training efficiency and network performance.
  • Thus, these connections can effectively propagate context and gradient information both forward and backward during the end-to-end training process (a minimal sketch follows).
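A minimal sketch of a long residual connection: the encoder feature map at a given resolution is added by summation (not concatenation) to the decoder feature map of the same resolution. Shapes and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

enc_feat = torch.randn(1, 32, 16, 64, 64)      # encoder ResBlock output at this resolution
up = nn.ConvTranspose3d(64, 32, kernel_size=2, stride=2)
dec_feat = up(torch.randn(1, 64, 8, 32, 32))   # decoder feature upsampled to the same resolution

fused = dec_feat + enc_feat                    # long residual connection via summation
```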

3. Ablation Study

3.1. Dataset

  • MICCAI Prostate MR Image Segmentation (PROMISE12) challenge dataset is used.
  • The training dataset contains 50 transversal T2-weighted MR images of the prostate and corresponding segmentation ground truth.
  • The testing dataset consists of 30 MR images and the ground truth is held out by the organizer for independent evaluation.
  • All MR volumes are resampled to a fixed resolution of 0.625×0.625×1.5 mm and then normalized to zero mean and unit variance.
  • The augmentation operations include rotations (90, 180 and 270 degrees) and flipping in the axial plane (a minimal preprocessing sketch follows this list).
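A minimal NumPy sketch of the preprocessing and augmentation described above: zero-mean/unit-variance normalization plus 90/180/270-degree rotations and flips in the axial plane. The resampling to 0.625×0.625×1.5 mm is assumed to be done beforehand (e.g. with a medical imaging library) and is not shown.

```python
import numpy as np

def normalize(volume):
    """Normalize an MR volume of shape (D, H, W) to zero mean and unit variance."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)

def augment(volume, label):
    """Random axial-plane rotation (0/90/180/270 degrees) and flip, applied to image and label."""
    k = np.random.randint(4)
    volume, label = np.rot90(volume, k, axes=(1, 2)), np.rot90(label, k, axes=(1, 2))
    if np.random.rand() < 0.5:
        volume, label = volume[:, :, ::-1], label[:, :, ::-1]
    return volume.copy(), label.copy()
```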

3.2. Training & Testing

  • Caffe is used.
  • Due to the limited memory of the NVIDIA TITAN X GPU, 64×64×16 sub-volumes are randomly cropped from every sample as input when training the network.
  • During testing, an overlapping sliding-window strategy is used to crop sub-volumes, and the probability maps of these sub-volumes are averaged to obtain the whole-volume prediction (sketched after this list).
  • The sub-volume size is also 64×64×16 and the stride is 50×50×12. In general, it takes about 4 hours to train the network and about 12 seconds to process one MR image of size 320×320×60.
  • 10-fold cross validation is used for ablation study.
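A minimal sketch of the overlapping sliding-window inference described above: sub-volumes are cropped with the given stride, predicted, and the probability maps averaged over the overlaps. The `model` callable is an assumption and is expected to return a per-voxel foreground probability map of the same shape as its input.

```python
import numpy as np

def _positions(size, win, step):
    """Window start positions along one axis, including a final boundary window."""
    pos = list(range(0, max(size - win, 0) + 1, step))
    if pos[-1] + win < size:
        pos.append(size - win)
    return pos

def sliding_window_predict(model, volume, window=(64, 64, 16), stride=(50, 50, 12)):
    """Average overlapping sub-volume probability maps into a whole-volume prediction."""
    prob = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    for x in _positions(volume.shape[0], window[0], stride[0]):
        for y in _positions(volume.shape[1], window[1], stride[1]):
            for z in _positions(volume.shape[2], window[2], stride[2]):
                sub = volume[x:x + window[0], y:y + window[1], z:z + window[2]]
                p = model(sub)  # foreground probability map, same shape as `sub`
                prob[x:x + window[0], y:y + window[1], z:z + window[2]] += p
                count[x:x + window[0], y:y + window[1], z:z + window[2]] += 1.0
    return prob / np.maximum(count, 1.0)
```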
Training and validation loss of networks with and without mixed residual connections
Cross validation performance with different configurations.
  • Of course, with both long and short residual connections, the Dice Coefficient is the highest.

4. Comparisons with State-of-the-art Approaches

Quantitative comparison between the proposed method and other methods
  • The evaluation metrics used in the PROMISE12 challenge include the Dice coefficient (DSC), the percentage of the absolute difference between the volumes (aRVD), the average over the shortest distances between the boundary points of the volumes (ABD) and the 95% Hausdorff distance (95HD). The organizers then calculated a total score, as shown above (a minimal Dice sketch follows this list).
  • In total, 21 teams had submitted their results by the time of paper submission, and only the top 10 teams are listed in the table.
  • Seven of the top ten teams employed various hand-crafted features. Besides the team CUMED, the other two teams that utilized ConvNets are SIATMIDS and CAMP-TUM2.
  • The team CAMP-TUM2 uses a V-Net-like network.
  • Again, of course, the team CUMED, using the 3D U-Net-like network with long and short residual connections as in U-Net+ResNet, outperforms the others with a score of 86.65.
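A minimal sketch of the Dice similarity coefficient (DSC) on binary masks; the other metrics (aRVD, ABD, 95HD) require boundary and distance computations and are not shown here.

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """DSC = 2 |P ∩ G| / (|P| + |G|) for binary masks pred and gt."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)
```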
Qualitative segmentation results of case 4 (first row) and case 22 (second row) at the apex (left), center (middle) and base (right) of the prostate in the testing dataset.
