
V-Net, U-Net’s big brother in Image Segmentation

A guide to V-Net, U-Net's big brother, for 3D image segmentation. By the end, you will know it inside out!

Welcome to an exciting journey through the world of deep learning architectures! You may already be familiar with U-Net, a game-changer in computer vision that has significantly reshaped the landscape of image segmentation.

Today, let’s turn the spotlight onto U-Net’s big brother, the V-Net.

Introduced by researchers Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi in the paper "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation," this architecture brought a breakthrough methodology to 3D image analysis.

This article will take you on a tour of this groundbreaking paper, shedding light on its unique contributions and architectural advancements. Whether you’re a seasoned data scientist, a budding AI enthusiast, or just someone interested in the latest tech trends, there’s something here for you!

A short reminder about U-Net

Before diving into the heart of V-Net, let’s take a moment to appreciate its architectural inspiration – U-Net. Don’t worry if this is your first introduction to U-Net; I’ve got you covered with a quick and easy tutorial on the U-Net architecture. It’s so concise that you’ll grasp the concept in no more than five minutes!

Here’s a brief overview of U-Net for a refresher:

U-Net Architecture, from the U-Net article

U-Net is famed for its symmetrical structure, taking the form of a ‘U’. This architecture is composed of two distinct pathways:

  1. Contracting Pathway (Left): Here, we progressively decrease the resolution of the image while increasing the number of filters.
  2. Expanding Pathway (Right): This pathway acts as the mirror image of the contracting pathway. We gradually decrease the number of filters while increasing the resolution until it matches the original image size.

The beauty of U-Net lies in its innovative use of 'skip connections'. These connect corresponding layers in the contracting and expanding paths, allowing the network to retain high-resolution details that would otherwise be lost in the contracting process.

Skip Connection, from the U-Net paper

Why does this matter? Because it eases the gradient flow during backpropagation, particularly in the early layers. In essence, we circumvent the risk of vanishing gradients – a common problem where gradients approach zero, hindering the learning process:

Image from Author
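To see this gradient effect concretely, here is a toy PyTorch sketch (my own contrived illustration, not code from either paper) comparing the input gradient of a deep squashing stack with and without a skip path:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 4, requires_grad=True)

# A stand-in for a deep stack of layers whose sigmoids shrink gradients.
layers = []
for _ in range(6):
    layers += [nn.Linear(4, 4), nn.Sigmoid()]
deep_stack = nn.Sequential(*layers)

# Without a skip connection: the gradient must traverse every layer.
deep_stack(x).sum().backward()
print(x.grad.abs().mean())  # tiny: repeated sigmoids squash the gradient

# With a skip connection: the identity path adds a direct gradient route,
# so d(output)/d(x) contains an identity term regardless of stack depth.
x.grad = None
(deep_stack(x) + x).sum().backward()
print(x.grad.abs().mean())  # close to 1, dominated by the identity path
```

Note this uses an additive skip for simplicity; U-Net's skip connections concatenate feature maps instead, but the gradient shortcut to the early layers works the same way.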

Now, bearing this understanding of U-Net in mind, let’s transition into the world of V-Net. At its core, V-Net shares a similar encoder-decoder philosophy. But as you’ll soon discover, it comes with its own set of unique traits that set it apart from its sibling, U-Net.

V-Net Architecture, from the V-Net paper

What sets V-Net apart from U-Net?

Let’s dive in!

Difference 1: 3D Convolutions instead of 2D convolutions

The first difference is as clear as day. While U-Net was tailored for 2D image segmentation, medical images often require a 3D perspective (think of volumetric brain scans, CT scans, etc.).

This is where V-Net comes into play. The ‘V’ in V-Net stands for ‘Volumetric,’ and this dimensionality shift requires the replacement of 2D convolutions with 3D convolutions.
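To make this concrete, here is a minimal PyTorch sketch (PyTorch is my assumption for illustration; this is not the authors' original code) of the operator swap. The paper uses 5x5x5 kernels in its convolutional stages:

```python
import torch
import torch.nn as nn

# 2D convolution, as in U-Net: input is (batch, channels, height, width).
conv2d = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2)

# 3D convolution, as in V-Net: input is (batch, channels, depth, height, width).
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=5, padding=2)

volume = torch.randn(1, 1, 64, 128, 128)  # e.g. a 64-slice MR volume
print(conv3d(volume).shape)  # torch.Size([1, 16, 64, 128, 128])
```

The price of this extra dimension is a steep growth in activations and kernel sizes, which we will come back to in the limitations.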

Difference 2: Activation Functions, PReLU instead of ReLU

The realm of Deep Learning has fallen in love with the ReLU function, owing to its simplicity and computational efficiency. Compared to other functions like sigmoid or tanh, ReLU is "non-saturating," meaning it reduces the issue of vanishing gradients.

(Left) ReLU, (Middle) LeakyReLU, and (Right) PReLU, from the PReLU paper

But ReLU isn’t perfect. It’s notorious for a phenomenon known as the ‘Dying ReLU problem,’ where many neurons always output zero, becoming ‘dead neurons.’ To counter this, LeakyReLU was introduced, which has a small but nonzero slope for negative inputs.

Pushing the reasoning even further, V-Net leverages the Parametric ReLU (PReLU). Instead of hardcoding the slope of LeakyReLU, why not let the network learn it?

After all, this is a core philosophy of deep learning: impose as little inductive bias as possible and let the model learn everything by itself, assuming we have enough data.
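As a quick PyTorch sketch (again my illustration, not the paper's code), the difference between the three activations comes down to what happens left of zero, and whether that slope is learnable:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])

relu = nn.ReLU()             # zero for negative inputs
leaky = nn.LeakyReLU(0.01)   # fixed slope of 0.01 for negative inputs
prelu = nn.PReLU(init=0.25)  # learnable slope, here initialized to 0.25

print(relu(x))   # tensor([0., 0., 0., 1.])
print(leaky(x))  # tensor([-0.0200, -0.0050,  0.0000,  1.0000])
print(prelu(x))  # tensor([-0.5000, -0.1250,  0.0000,  1.0000], grad_fn=...)

# PReLU's slope is a genuine parameter, updated by backpropagation
# alongside the network's weights:
print(list(prelu.parameters()))  # [Parameter containing: tensor([0.2500], ...)]
```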

Difference 3: Different loss function based on the Dice Score

Now, we arrive at perhaps the most impactful contribution of V-Net – a shift in the loss function. Unlike U-Net’s cross entropy loss function, V-Net uses the Dice loss function.

Cross Entropy Function, Image from Author

But the main issue with this function is that it does not handle unbalanced classes well. This problem is very frequent in medical images, where most of the time the background occupies far more of the volume than the zone of interest.

For example consider this picture:

Background is omnipresent, Image from Author

As a result, some models can get "lazy" and predict background everywhere, because they will still get a small loss. For instance, if 95% of the voxels are background, a model that predicts background everywhere is right 95% of the time and incurs a deceptively low cross-entropy loss.

So V-Net uses a loss function that is much more effective for this problem: one based on the Dice coefficient.

The reason it is better is that it measures the overlap between the predicted zone and the ground truth as a proportion, so the size of the class is taken into account: even when the background dominates the image, the Dice score remains a number between 0 and 1 that directly reflects how well the region of interest is segmented.

Dice Coefficient, from the V-Net paper

I would argue this is perhaps the main contribution of the paper: moving from 2D to 3D convolutions is a very natural idea for handling 3D images, whereas the Dice loss has since been very widely adopted across image segmentation tasks.

In practice, a hybrid approach often proves effective, combining the Cross Entropy Loss and Dice Loss to leverage the strengths of both.
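Here is a minimal PyTorch sketch (my illustration, not the authors' code) of the Dice loss as formulated in the paper, D = 2·Σᵢ pᵢgᵢ / (Σᵢ pᵢ² + Σᵢ gᵢ²), together with one common way of combining it with cross entropy (the 50/50 weighting is a hypothetical choice):

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss following the V-Net formulation:
    D = 2 * sum(p_i * g_i) / (sum(p_i^2) + sum(g_i^2)).
    probs: predicted foreground probabilities; target: binary ground truth."""
    p, g = probs.flatten(1), target.flatten(1)  # (batch, num_voxels)
    intersection = (p * g).sum(dim=1)
    denominator = (p ** 2).sum(dim=1) + (g ** 2).sum(dim=1)
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()  # minimizing 1 - D maximizes overlap

def hybrid_loss(logits, target, alpha=0.5):
    """Weighted sum of binary cross entropy and Dice loss."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return alpha * bce + (1 - alpha) * dice_loss(torch.sigmoid(logits), target)

# Toy 3D example: one 16x32x32 volume with sparse foreground.
logits = torch.randn(1, 16, 32, 32)
target = (torch.rand(1, 16, 32, 32) > 0.9).float()
print(hybrid_loss(logits, target))
```

Note how the Dice term is a per-volume ratio: a model that predicts background everywhere gets an intersection of zero, and therefore the worst possible Dice loss, no matter how large the background is.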

The Performance of the V-Net

So, we’ve journeyed through the unique aspects of V-Net, but you’re probably thinking, "All this theory is great, but does V-Net really deliver in practice?" Well, let’s put V-Net to the test!

The authors evaluated the V-Net performance on the PROMISE12 dataset.

The PROMISE12 dataset was made available for the MICCAI 2012 prostate segmentation challenge.

The V-Net was trained on just 50 Magnetic Resonance (MR) images, which is not a lot!

Segmentation of V-Net on the PROMISE 2012 challenge dataset, from the V-Net paper
Quantitative metrics on the PROMISE 2012 challenge dataset, from the V-Net paper

As we can see, even with few labels, V-Net is able to produce good qualitative segmentations and obtain a very good Dice score.

Main Limitations of the V-Net

V-Net has set a new benchmark in the realm of image segmentation, particularly in medical imaging. However, every innovation has room for growth. Here, we’ll discuss some of the prominent areas where V-Net could improve:

Limitation 1: Size of the model

Transitioning from 2D to 3D brings with it a significant increase in memory consumption (see the quick calculation after this list). The ripple effects of this increase are multifold:

  • The model demands substantial memory space.
  • It severely restricts the batch size (as loading multiple 3D tensors into GPU memory becomes challenging).
  • Medical imaging data is sparse and expensive to label, making it harder to fit a model with so many parameters.
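Here is the quick calculation promised above (illustrative numbers of my choosing, not figures from the paper):

```python
# Why 3D feature maps are heavy: a back-of-the-envelope estimate.
voxels = 128 * 128 * 128       # a modest 128^3 input volume
bytes_per_float32 = 4

input_mb = voxels * bytes_per_float32 / 1024**2
print(f"1-channel input volume: {input_mb:.0f} MB")    # 8 MB

channels = 32                  # a typical early-stage feature map width
feature_mb = voxels * channels * bytes_per_float32 / 1024**2
print(f"32-channel feature map: {feature_mb:.0f} MB")  # 256 MB per sample
```

And that is a single activation tensor; training also stores every intermediate activation for backpropagation, which is why batch sizes shrink so quickly in 3D.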

Limitation 2: Does not use unsupervised or self-supervised learning

  • V-Net operates purely in a supervised learning context, neglecting the potential of unsupervised learning. In a field where unlabelled scans significantly outnumber the annotated ones, incorporating unsupervised learning could be a game-changer.

Limitation 3: No uncertainty estimation

  • V-Net doesn’t estimate uncertainties, meaning it cannot assess its own confidence in its predictions. This is an area where Bayesian Deep Learning shines. (Refer to this post for a Gentle Introduction to Bayesian Deep Learning).

Limitation 4: Lack of Robustness

  • Convolutional Neural Networks (CNNs) traditionally struggle with generalization. They are not robust against variations such as contrast changes, multimodal distributions, or different resolutions. This is another area where V-Net could improve.

Conclusion

V-Net, the lesser-known yet powerful sibling of U-Net, has revolutionized computer vision, especially image segmentation. Its transition from 2D to 3D images and its introduction of the Dice loss, now a ubiquitous tool in segmentation, set new standards in the field.

Despite its limitations, V-Net should be the go-to model for anyone embarking on a 3D image segmentation task. For further improvement, exploring unsupervised learning and integrating attention mechanisms seems like promising avenues.

Thanks for reading!

References

Milletari, F., Navab, N., & Ahmadi, S.-A. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. arXiv:1606.04797.

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852.