Original Photo by Adrian Williams on Unsplash

Image Segmentation — Choosing the Correct Metric

Comparing Metrics and Loss Functions for Image Segmentation tasks as an example of imbalanced data problems.

Laurenz Reitsam
Towards Data Science
7 min read · Aug 12, 2020


From robotics to autonomous driving, image segmentation has a wide range of applications, which makes it an active field of research in computer vision and machine learning. Thanks to their architectural flexibility, convolutional neural networks (CNNs) have proven to be the state-of-the-art algorithms in this field. Compared to common classification, supervised image segmentation has a special characteristic: the class distributions in the data are typically imbalanced. This article illustrates why it is worth taking a second look at the scoring metric used for model evaluation and introduces the Jaccard index and the F1 score as alternatives to the most frequently used accuracy score. It also shows how to develop loss functions that directly optimize these scores. Finally, we will compare the results of the developed loss functions to cross-entropy, based on the introduced metrics.

Image Segmentation

Basically, image segmentation is nothing more than classification. But instead of one label for a given input image, there is a label for every individual pixel in that image. Consequently, the classifier needs to output a matrix with the same dimensions as the input image.

In common classification tasks, the classes to be separated are usually equally distributed in the training set. If this is not the case, it is easy to drop samples of one class, or to collect or generate more samples of another, until the classes are balanced. This is not possible in image segmentation tasks: to reduce the frequency of one class, pixels would have to be dropped, which results in incomplete images.

What’s wrong with Accuracy?

Think about an image of a bustling city scene. The objective of a classifier is to recognize all pixels showing a street sign. The picture may contain many other objects such as cars, people, and houses, so the total area of the street sign might be very small compared to the overall picture size. If we labeled every object other than the street sign as ‘background’, so that only two labels remain, the number of pixels in the ‘background’ class would be far larger than in the ‘street sign’ class.

Original Photo by Adrian Williams on Unsplash

Now let the street sign cover only 10% of the total image area and consider a trivial classifier that predicts ‘background’ for every pixel in the whole image. This classifier does not recognize a single street sign pixel.

For this binary problem, accuracy is defined as
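$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

with true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).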

So, the resulting accuracy for this classifier would be 90%. This sounds like a pretty good score, although intuitively it clearly does not describe the classification result well.

Accuracy, then, does not coincide with the objective of correctly labeling objects, at least when these objects are very small compared to the image size. This means we have to consider other scoring metrics instead.

Alternative Metrics

As alternatives to accuracy, the Jaccard index or the F1 score can be used as scoring metrics:

The Jaccard index, also called the IoU score (Intersection over Union), is defined as the intersection of two sets divided by their union:
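$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$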

The basic idea is to regard the image masks as sets that can overlap within the picture. If both masks are identical, the two sets have exactly the same size and overlap completely, so that the intersection equals the union. In this case, the IoU score is 1, the optimal value. If, on the other hand, the predicted mask is shifted or differs in size from the original mask, the union becomes larger than the intersection and the IoU score decreases.

Consider the street sign example again, with 10% of the image showing a street sign and the remaining area being background. For this binary example, the main goal is to correctly classify the street sign, so let the street sign pixels be the positives and the background pixels the negatives. In these terms, the Jaccard index can be written as

$$J = \frac{TP}{TP + FP + FN}$$

Because the classifier labels every pixel in the image as ‘background’, the number of true positives (TP) must be 0. Therefore, the resulting Jaccard index is also 0.

Compared to the accuracy score of 90%, this score says that the classification result is completely wrong. Given that the classifier did not classify a single street sign pixel correctly, the Jaccard index intuitively describes the result of this classifier much better than accuracy does.

The F1 score, also called the Dice score, is related to the Jaccard index and defined as
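$$F_1 = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,TP}{2\,TP + FP + FN}$$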

The F1 score, being the harmonic mean of precision and recall, is by definition well suited for imbalanced datasets. From the formula it can be seen that the F1 score must also be 0 for the given example.

For more information about the F1 score and other scoring metrics for imbalanced data, see https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

Developing a cost function

The previous section introduced two possible metrics for image segmentation tasks. In this section, we will show how to develop a cost function that directly optimizes these scores. The procedure is shown for the Jaccard index, but the steps are the same for an F1-score-based cost function.

If we bring the Jaccard index into a differentiable form, as needed for backpropagation, and let 𝑦, 𝑦̂ be one-hot encoded vectors whose length is the number of classes C, we arrive at the formula
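$$J(y, \hat{y}) = \frac{\sum_{i=1}^{C} y_i \hat{y}_i}{\sum_{i=1}^{C} y_i + \sum_{i=1}^{C} \hat{y}_i - \sum_{i=1}^{C} y_i \hat{y}_i}$$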

Unfortunately, the resulting gradients are not as simple as those of cross-entropy and are prone to exploding gradients. It is therefore useful to add a smoothing term s to the equation to achieve more stable training results. Also, we want a cost function that should be minimized, so, because the Jaccard index naturally scales from 0 to 1, we subtract the result from 1. This yields the following loss function:
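$$L_{Jaccard}(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{C} y_i \hat{y}_i + s}{\sum_{i=1}^{C} y_i + \sum_{i=1}^{C} \hat{y}_i - \sum_{i=1}^{C} y_i \hat{y}_i + s}$$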

Implemented in Python and TensorFlow, this results in the following function:
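The complete notebook is linked at the end of the article; the following is a minimal sketch of such a function, assuming `y_true` is one-hot encoded and `y_pred` holds softmax probabilities:

```python
import tensorflow as tf

def jaccard_loss(y_true, y_pred, smooth=1.0):
    """Soft Jaccard (IoU) loss: 1 - (intersection + s) / (union + s)."""
    y_true = tf.cast(y_true, y_pred.dtype)
    # The elementwise product approximates the intersection of the two masks.
    intersection = tf.reduce_sum(y_true * y_pred, axis=-1)
    # |A| + |B| - |A ∩ B| approximates the union.
    union = (tf.reduce_sum(y_true, axis=-1)
             + tf.reduce_sum(y_pred, axis=-1)
             - intersection)
    # The smoothing term s stabilizes gradients and avoids division by zero.
    return 1.0 - (intersection + smooth) / (union + smooth)
```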

Experiment — Comparison of Loss Functions

In the following, we will train three identical CNN models: one using cross-entropy, one using the Jaccard loss, and one using the Dice loss. Afterwards, we will look at the results using the accuracy score and the F1 score as evaluation metrics.
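The Dice loss follows the same pattern as the Jaccard loss above; a minimal sketch under the same assumptions:

```python
def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice (F1) loss: 1 - (2 * intersection + s) / (|A| + |B| + s)."""
    y_true = tf.cast(y_true, y_pred.dtype)
    intersection = tf.reduce_sum(y_true * y_pred, axis=-1)
    total = tf.reduce_sum(y_true, axis=-1) + tf.reduce_sum(y_pred, axis=-1)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)
```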

The neural network architecture used here is essentially the one from https://www.tensorflow.org/tutorials/images/segmentation with some slight changes. The basic architecture is called U-Net, an encoder-decoder network with additional skip connections between the encoder and decoder layers. The main idea is to first compress the image information and then combine this compressed information with pixel-specific knowledge from the skip connections to create an output mask.
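To make the idea concrete, here is a minimal, hypothetical Keras sketch of the U-Net pattern. It is not the model actually trained here (the tutorial's version uses a pretrained MobileNetV2 encoder); the input shape and filter counts are illustrative assumptions:

```python
from tensorflow.keras import Input, Model, layers

def build_unet(input_shape=(128, 128, 3), num_classes=3):
    inputs = Input(shape=input_shape)
    x, skips = inputs, []

    # Encoder: compress spatial information while increasing channels.
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)  # keep pixel-level detail for the decoder
        x = layers.MaxPooling2D(2)(x)

    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)  # bottleneck

    # Decoder: upsample and merge the matching encoder map (skip connection).
    for filters, skip in zip((256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Per-pixel class probabilities, same spatial size as the input.
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return Model(inputs, outputs)
```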

The dataset we will use for this experiment is the Oxford-IIIT Pet dataset. It consists of 7,349 images showing different kinds of pets. For every image there is a given mask classifying each pixel as pet, background, or boundary, so there are 3 classes in total.
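It can be loaded via TensorFlow Datasets, as in the linked TensorFlow tutorial:

```python
import tensorflow_datasets as tfds

# Images plus per-pixel trimap masks (pet / background / boundary).
dataset, info = tfds.load('oxford_iiit_pet:3.*.*', with_info=True)
```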

After training the CNNs for 20 epochs under exactly the same conditions, we arrive at the following results:

A look at the bar chart shows that it can really matter which metric is considered. If the models were compared using the accuracy score only, the model based on the F1 loss would be judged the worst of the three variants. Using the F1 score instead, the F1-loss model achieves significantly better results than the model trained with cross-entropy.

Conclusion

We have seen that for meaningful model evaluation, the choice of the scoring metric can be crucial, especially when dealing with imbalanced datasets. So, before starting to tune hyperparameters and optimize the result for a given metric, it may be worth questioning that metric and asking whether it is used because it is the best suited one or simply because it is the default option.

We have also seen how to create a custom cost function that directly optimizes a specific metric. For suitable problems and datasets, this can achieve better results than cross-entropy.

A Jupyter Notebook containing the whole code can be found here: https://gist.github.com/LaurenzReitsam/05e3bb42024ff76955adbf92356d79f2

You might find this interesting:

  • A great overview of different imbalance problems in image data: https://neptune.ai/blog/imbalanced-data-in-object-detection-computer-vision

References

  • van Beers, F., Lindström, A., Okafor, E. & Wiering, M., Deep Neural Networks with Intersection over Union Loss for Binary Image Segmentation, 2019.
  • https://www.tensorflow.org/tutorials/images/segmentation
  • https://www.robots.ox.ac.uk/~vgg/data/pets/
