Image segmentation using fastai

Learn how to color code every pixel of an image using a U-net

Dipam Vasani
Towards Data Science


Introduction

Image segmentation is an application of computer vision wherein we color-code every pixel in an image. Each pixel then represents a particular object in that image. If you look at the images above, every street is coded in violet, every building is orange, every tree is green and so on. Why do we do this and how is it different from object detection?

Image segmentation is usually used when we care about edges and regions, when we want to separate important objects from the background. We want to know the specifics of an object and conduct further analysis on it from there. Think about it in terms of a self-driving car: it will not only want to identify a street, but also know its edges and curves in order to make the correct turn.

Image segmentation has a lot of significance in the field of medicine. Parts that need to be studied are color coded and viewed in scans taken from different angles. They are then used for things like automatic measurement of organs, cell counting, or simulations based on the extracted boundary information.

The process

We treat image segmentation as a classification problem, where for every pixel in the image we try to predict what it is. Is it a bicycle, a road line, a sidewalk, or a building? In this way we produce a color-coded image where every pixel belonging to the same class gets the same color.

The code

As usual we start by importing the fastai libraries.
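A minimal sketch of this step, assuming fastai v1 and the CamVid dataset that ships with the library (the folder names follow the standard fastai CamVid layout):

    from fastai.vision import *     # fastai v1 vision module

    # Download the CamVid dataset bundled with fastai
    path = untar_data(URLs.CAMVID)
    path_img = path/'images'        # original images
    path_lbl = path/'labels'        # labelled (segmentation mask) images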

Let’s first take a look at one of the images.
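Something along these lines, using the folders defined above:

    fnames = get_image_files(path_img)   # list all the original image files
    img_f = fnames[0]                    # pick one image
    img = open_image(img_f)
    img.show(figsize=(5, 5))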

Next, we take a look at what the image looks like after segmentation. Since the values in the labelled image are integers, we cannot use the same functions to open it. Instead, we use open_mask with show to display the image.
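A sketch of that step (get_y_fn is the filename-matching function defined a little further down):

    mask = open_mask(get_y_fn(img_f))    # open the labelled image as a mask of integers
    mask.show(figsize=(5, 5), alpha=1)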

Notice the function get_y_fn inside open_mask. In every segmentation problem, we are given two sets of images: the original ones and the labelled ones. We need to match each labelled image with its original. We do this using the filenames. Let's take a look at the filenames of some of the images.
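A quick way to look at a few filenames from each folder:

    fnames[:3]                        # a few original image paths
    get_image_files(path_lbl)[:3]     # a few labelled image paths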

We see that the filenames of the original images and the labelled images are the same, except that the labelled image has an _P towards the end of its name. Hence we write a function which, for every image, identifies its corresponding labelled counterpart.
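A one-line version of such a function, matching the _P naming convention described above:

    # For images/<name>.png the corresponding label is labels/<name>_P.png
    get_y_fn = lambda x: path_lbl/f'{x.stem}_P{x.suffix}'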

We also have a file called codes.txt that tells us what object the integers in our labelled image correspond to. Let’s open the data for our labelled image.
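For instance:

    src_size = np.array(mask.shape[1:])  # remember the source image size; we'll reuse it when choosing a training size
    mask.data                            # a tensor of integers, one per pixel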

Now let’s check the codes file for the meaning of these integers.
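The codes file is plain text with one class name per line, so it can be loaded with numpy:

    codes = np.loadtxt(path/'codes.txt', dtype=str)
    codes          # an array of class names; codes[26] is the class for the 26s we saw above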

The labelled data had a lot of 26s in it. Counting from index 0 in our codes file, we see that the object referred to by the integer 26 is a tree.

Now that we’ve understood our data, we can move on to creating a data bunch and training our model.

We will not use the whole dataset, and we will also keep the batch size relatively small, since classifying every pixel in every image is a resource-intensive task.
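Concretely, something like the following; the exact numbers are assumptions and depend on your GPU memory:

    size = src_size // 2   # train on half-size images first
    bs = 8                 # small batch size; reduce it further if you run out of memory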

As usual we create our data bunch; a sketch of this step follows the list below. Reading that code:

  • Create the data bunch from a folder
  • Split the data into training and validation sets based on the filenames listed in valid.txt
  • Find the labelled images using the function get_y_fn and use the codes as classes to be predicted.
  • Apply transforms on the images (note the tfm_y = True here; it means that whatever transform we apply to an input image should also be applied to its target image. For example, if we flip an image horizontally, we should also flip the corresponding labelled image).
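A sketch of the data bunch creation, using fastai v1's data block API:

    src = (SegmentationItemList.from_folder(path_img)   # 1. images from a folder
           .split_by_fname_file('../valid.txt')         # 2. validation split from valid.txt
           .label_from_func(get_y_fn, classes=codes))   # 3. labels via get_y_fn, classes from codes

    data = (src.transform(get_transforms(), size=size, tfm_y=True)  # 4. transforms applied to the masks too
            .databunch(bs=bs)
            .normalize(imagenet_stats))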

For training, we will use a CNN architecture called a U-Net, since U-Nets are good at reconstructing images.
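A sketch of the learner; the ResNet-34 backbone and the weight decay value here are assumptions:

    # acc_camvid is the accuracy metric defined just below
    learn = unet_learner(data, models.resnet34, metrics=acc_camvid, wd=1e-2)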

Before explaining what a U-Net is, notice the metrics used in the above code. What's acc_camvid?

The accuracy in an image segmentation problem is the same as that in any classification problem.

Accuracy = no. of correctly classified pixels / total no. of pixels

However, in this case some pixels are labelled as Void (this label also exists in codes.txt) and shouldn't be considered when calculating the accuracy. Hence we write a new accuracy function that ignores those pixels.
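A sketch of such a function, using the Void entry from codes:

    name2id = {v: k for k, v in enumerate(codes)}
    void_code = name2id['Void']

    def acc_camvid(input, target):
        # Pixel accuracy that ignores pixels labelled as Void
        target = target.squeeze(1)
        mask = target != void_code
        return (input.argmax(dim=1)[mask] == target[mask]).float().mean()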

The way a CNN works is that it breaks an image down into smaller and smaller representations until it has just one thing to predict (the contracting, left half of the U-Net architecture). A U-Net then takes this representation and makes it bigger and bigger again, upsampling once for every downsampling stage of the CNN. However, reconstructing an image from a small vector is a difficult job, so we add connections from the original convolution layers to the corresponding layers of our deconvolution network.

As always, we find a good learning rate and train our model. Even with half the dataset, we get a pretty good accuracy of 92%.
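The training recipe is the usual fastai one; the learning rate and number of epochs below are assumptions, picked the way one normally would from the lr_find plot:

    learn.lr_find()               # plot loss against learning rate
    learn.recorder.plot()

    lr = 3e-3                     # chosen from the plot
    learn.fit_one_cycle(10, slice(lr))
    learn.save('stage-1')         # keep the weights so we can resume training later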

Checking some of the results.
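For example:

    learn.show_results(rows=3, figsize=(8, 9))   # plots predictions next to their ground truths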

The ground truths are the actual targets and the predictions are what our model labelled.

We can now train on the full data set.
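One way to do that (a sketch, assuming the stage-1 weights saved above) is to rebuild the data bunch at full image size, reload the weights, and keep training:

    size = src_size               # full-size images now
    bs = 4                        # smaller batch to fit the larger images in memory

    data = (src.transform(get_transforms(), size=size, tfm_y=True)
            .databunch(bs=bs)
            .normalize(imagenet_stats))

    learn = unet_learner(data, models.resnet34, metrics=acc_camvid, wd=1e-2)
    learn.load('stage-1')         # start from the weights trained at half size
    learn.fit_one_cycle(10, slice(lr))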

Conclusion

That's it for this article. We saw how to color-code every pixel of an image using a U-Net. U-Nets are gaining popularity because they have performed better than GANs on applications like generating high-resolution images from blurry ones. Hence, it is really useful to know what they are and how to use them.

The full notebook can be found here.

If you want to learn more about deep learning, check out my series of articles on the topic:

~happy learning
