
Semantic Segmentation of Aerial Imagery captured by a Drone using Different U-Net Approaches

Implementing a configured U-Net architecture from scratch in Python and performing semantic segmentation of aerial imagery captured by a drone using different U-Net approaches.

Photo by Jaromír Kavan on Unsplash

In machine learning, models are trained for a wide range of applications, particularly in deep learning on image datasets. Methods based on the convolution operation are used in many fields, such as hand and arm detection in augmented reality, self-driving cars, drone-based aerial imaging, and defense technologies. The human eye can easily classify and distinguish what it sees; the equivalent of this ability in artificial intelligence, that is, the problem of understanding images, is studied under the title of computer vision. As the name suggests, the goal is to represent (classify) images in a way the computer can understand, and the next step is to make operations on these images possible using different methods. This article explains the U-Net architecture, a segmentation method originally developed for biomedical image segmentation, and includes a real-world project that segments aerial imagery captured by a drone using U-Net.

Table of Contents
1. Semantic Segmentation
2. U-Net Architecture
3. Tutorial
3.1. Data Preprocessing
3.2. Semantic Segmentation using U-Net from scratch
3.3. Semantic Segmentation using U-Net with Transfer Learning
4. Conclusion
5. References

Semantic Segmentation

Images are matrices of numeric pixel values. Image processing techniques adjust these numbers so that the image can be expressed in a different way and made suitable for the relevant study or interpretation. The convolution operation, a fundamental pixel-level computation, makes it possible to evaluate images from different perspectives. For example, an applied filter can detect the edges of the image, or converting the image from RGB to grayscale lets it be interpreted and used from a different angle. Building on deep learning models and convolutional layers, more comprehensive studies such as feature extraction and classification of image content have been carried out.
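As a small illustration of these pixel operations (the kernel and the file name are just examples, not from the project code), an edge-detection filter and a grayscale conversion in OpenCV might look like this:

```python
# Illustration of basic pixel operations: grayscale conversion and
# edge detection by convolving with a Laplacian-style kernel
import cv2
import numpy as np

img = cv2.imread('drone_photo.jpg')            # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # color -> grayscale

edge_kernel = np.array([[0,  1, 0],
                        [1, -4, 1],
                        [0,  1, 0]], dtype=np.float32)
edges = cv2.filter2D(gray, -1, edge_kernel)    # highlights edges
```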

Figure 1. Object detection, semantic segmentation, instance segmentation (source: A. Arnab et al., "Conditional Random Fields Meet Deep Neural Networks for Semantic Segmentation," IEEE Signal Processing Magazine, 2018)

As seen in the figure above, detecting the objects in an image with bounding boxes is called object detection. Semantic segmentation is a pixel-wise labeling operation that displays all objects of the same type (sky, cat, dog, human, road, car, mountain, sea, etc.) with the same label, that is, the same color. Instance segmentation, in which every instance is labeled individually, separates each object by displaying it in a different color. Various complex CNN models have been developed for the different uses behind these operations; PSPNet, DeepLab, LinkNet, U-Net, and Mask R-CNN are just some of them. We can say that the segmentation process is the eye of machine-learning-based applications such as self-driving cars. The video below shows a real-time semantic segmentation process that compares the human view and the PSPNet perspective.

In a nutshell, semantic segmentation in computer vision is a pixel-wise labeling method. If objects of the same type are expressed with a single color, it is called semantic segmentation, and if each object is expressed with a unique color (label), it is called instance segmentation.

U-Net Architecture

U-Net is a convolutional neural network architecture that was developed for biomedical images (computed tomography, microscopy images, MRI scans, etc.) at the Computer Science Department and BIOSS Centre for Biological Signalling Studies, University of Freiburg, Germany, in 2015. The original paper, "U-Net: Convolutional Networks for Biomedical Image Segmentation," is listed in the references. Technically, the model consists of an encoder (contraction) that performs the down-sampling (mostly with pre-trained weights when transfer learning is used) and a decoder (expansion) that performs the up-sampling, and it is named U-Net because its scheme is U-shaped, as shown in Figure 2. The model can be configured for different studies.

Figure 2. U-Net architecture (source: O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," 2015, doi: 10.1007/978-3-319-24574-4_28)

The U-Net model configured for semantic segmentation of aerial imagery in the following tutorial is as follows:
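What follows is a minimal tensorflow.keras sketch of that configuration: the layer names (conv_1, pool_1, u6, ...) match the step-by-step walkthrough below, while the 3x3 kernels, dropout rates, and 'same' padding are assumptions filled in around the dimensions described there.

```python
# A minimal sketch of the configured U-Net (3x3 kernels, dropout rates,
# and 'same' padding are assumptions; names follow the walkthrough below)
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Dropout,
                                     Conv2DTranspose, concatenate)
from tensorflow.keras.models import Model

def multiclass_unet_architecture(n_classes, height=256, width=256, channels=3):
    inputs = Input((height, width, channels))

    # Encoder (contraction): convolutions + dropout, then 2x2 max pooling
    c1 = Conv2D(16, (3, 3), activation='relu', padding='same')(inputs)
    c1 = Dropout(0.1)(c1)
    c1 = Conv2D(16, (3, 3), activation='relu', padding='same')(c1)   # 256x256x16
    p1 = MaxPooling2D((2, 2))(c1)                                    # 128x128x16

    c2 = Conv2D(32, (3, 3), activation='relu', padding='same')(p1)
    c2 = Dropout(0.1)(c2)
    c2 = Conv2D(32, (3, 3), activation='relu', padding='same')(c2)   # 128x128x32
    p2 = MaxPooling2D((2, 2))(c2)                                    # 64x64x32

    c3 = Conv2D(64, (3, 3), activation='relu', padding='same')(p2)
    c3 = Dropout(0.2)(c3)
    c3 = Conv2D(64, (3, 3), activation='relu', padding='same')(c3)   # 64x64x64
    p3 = MaxPooling2D((2, 2))(c3)                                    # 32x32x64

    c4 = Conv2D(128, (3, 3), activation='relu', padding='same')(p3)
    c4 = Dropout(0.2)(c4)
    c4 = Conv2D(128, (3, 3), activation='relu', padding='same')(c4)  # 32x32x128
    p4 = MaxPooling2D((2, 2))(c4)                                    # 16x16x128

    # Bottleneck
    c5 = Conv2D(256, (3, 3), activation='relu', padding='same')(p4)
    c5 = Dropout(0.3)(c5)
    c5 = Conv2D(256, (3, 3), activation='relu', padding='same')(c5)  # 16x16x256

    # Decoder (expansion): transposed convolution, skip connection, convolutions
    u6 = Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same')(c5)
    u6 = concatenate([u6, c4])                                       # 32x32x256
    c6 = Conv2D(128, (3, 3), activation='relu', padding='same')(u6)
    c6 = Dropout(0.2)(c6)
    c6 = Conv2D(128, (3, 3), activation='relu', padding='same')(c6)  # 32x32x128

    u7 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(c6)
    u7 = concatenate([u7, c3])                                       # 64x64x128
    c7 = Conv2D(64, (3, 3), activation='relu', padding='same')(u7)
    c7 = Dropout(0.2)(c7)
    c7 = Conv2D(64, (3, 3), activation='relu', padding='same')(c7)   # 64x64x64

    u8 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same')(c7)
    u8 = concatenate([u8, c2])                                       # 128x128x64
    c8 = Conv2D(32, (3, 3), activation='relu', padding='same')(u8)
    c8 = Dropout(0.1)(c8)
    c8 = Conv2D(32, (3, 3), activation='relu', padding='same')(c8)   # 128x128x32

    u9 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same')(c8)
    u9 = concatenate([u9, c1])                                       # 256x256x32
    c9 = Conv2D(16, (3, 3), activation='relu', padding='same')(u9)
    c9 = Dropout(0.1)(c9)
    c9 = Conv2D(16, (3, 3), activation='relu', padding='same')(c9)   # 256x256x16

    # One softmax probability per class at every pixel
    outputs = Conv2D(n_classes, (1, 1), activation='softmax')(c9)
    return Model(inputs=[inputs], outputs=[outputs])
```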

If we take the above code block and Figure 2 (the flow runs from top left to top right, following the letter 'U') step by step:

  1. The input is defined with dimensions 256x256x3.
  2. conv_1 with 16 filters produces 256x256x16, which max pooling in pool_1 reduces to 128x128x16.
  3. conv_2 with 32 filters produces 128x128x32, and pool_2 similarly gives 64x64x32.
  4. conv_3 with 64 filters produces 64x64x64, and pool_3 gives 32x32x64.
  5. conv_4 with 128 filters produces 32x32x128, and pool_4 gives 16x16x128.
  6. conv_5 with 256 filters produces 16x16x256, and up-sampling starts at this point. In u6, conv_5 is converted to 32x32x128 by a Conv2DTranspose with 128 filters and a (2×2) kernel, and u6 is concatenated with conv_4, updating u6 to 32x32x256. conv_6 with 128 filters then reduces it to 32x32x128.
  7. u7, a Conv2DTranspose with 64 filters and a (2×2) kernel applied to conv_6, becomes 64x64x64; concatenating u7 with conv_3 makes it 64x64x128, and conv_7 reduces it to 64x64x64.
  8. u8, with 32 filters and a (2×2) kernel applied to conv_7, becomes 128x128x32; concatenating u8 with conv_2 makes it 128x128x64, and conv_8 reduces it to 128x128x32.
  9. u9, with 16 filters and a (2×2) kernel applied to conv_8, becomes 256x256x16; concatenating u9 with conv_1 makes it 256x256x32, and conv_9 reduces it to 256x256x16.
  10. The output layer completes the pixel-wise classification with softmax activation, so the final output has the form 256x256xn_classes, one probability per class at every pixel.

Dropout at various rates is used to prevent overfitting.

Tutorial

In the coding part, the dataset can be handled with different approaches. In this study, the RGB (raw) images are defined as x, and the model is trained using the ground truth (segmented, labeled) images as y. Approaches that use a separate mask dataset will be discussed in future articles. An RGB image and its ground truth are shown in Figure 3. The aim is to train the model with this approach so that externally supplied images can be segmented in the same way as the training data.

Figure 3. Raw RGB image (left) and ground truth (right), Image by Author

The focus here is on the coding architecture rather than on achieving high performance, because of the computational cost of working with image datasets. For example, the raw images are 6000×4000 pixels but are resized to 256×256 pixels to keep the computation manageable. Such operations trade accuracy for a pipeline that demonstrably works.

Dataset link: https://www.kaggle.com/awsaf49/semantic-drone-dataset

License: CC0: Public Domain

Data Preprocessing

1- Libraries are imported. multiclass_unet_architecture, jacard, and jacard_loss are defined in the section above and imported with from architecture import multiclass_unet_architecture, jacard, jacard_loss.

2- RGB raw images with 6000×4000 pixels and corresponding labels are resized to 256×256 pixels.

3- MinMaxScaler is used to scale RGB images.
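A minimal sketch of steps 2 and 3 (the folder layout, the use of OpenCV, and the per-image scaling choice are assumptions):

```python
# Sketch of steps 2-3: resize the raw 6000x4000 images to 256x256 and scale
# pixel values to [0, 1] with MinMaxScaler (folder path is hypothetical)
import os
import cv2
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

def load_images(folder, size=(256, 256)):
    images = []
    for name in sorted(os.listdir(folder)):
        img = cv2.imread(os.path.join(folder, name))
        img = cv2.resize(img, size)
        # Flatten to (n_pixels, 3), scale each channel to [0, 1], restore shape
        img = scaler.fit_transform(img.reshape(-1, 3)).reshape(img.shape)
        images.append(img)
    return np.array(images)

X = load_images('dataset/original_images/')   # hypothetical path
```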

4- The ground truth labels are imported. 23 classes are detected in the ground truth dataset, and an integer label is assigned to each pixel based on its color value.
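A sketch of this pixel-value-to-label assignment (the dataset ships a 23-entry class color table, abbreviated here):

```python
# Sketch of step 4: turn RGB-coded ground truth into integer label maps
# (the full 23-entry class color table comes with the dataset)
import numpy as np

def rgb_to_label(mask, class_colors):
    """mask: (H, W, 3) RGB ground truth; class_colors: list of 23 RGB triples."""
    label = np.zeros(mask.shape[:2], dtype=np.uint8)
    for idx, color in enumerate(class_colors):
        # All pixels matching this class color receive the class index
        label[np.all(mask == np.array(color), axis=-1)] = idx
    return label
```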

5- The label dataset is one-hot-encoded for classification, and the data is split into a training set and a test set.
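A sketch of step 5 (the 80/20 split ratio and the random seed are assumptions):

```python
# Sketch of step 5: one-hot encode the (n, 256, 256) integer label maps
# built with rgb_to_label above, then split the data
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

y = to_categorical(labels, num_classes=23)        # (n, 256, 256, 23)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
```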

Semantic Segmentation using U-Net (from scratch)

6- Accuracy and the Jaccard index are used as metrics in the training process. The optimizer is set to 'adam' and the loss to 'categorical_crossentropy', since the task is treated as a multi-class pixel-wise classification problem. The model is fitted with these settings.
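A minimal sketch of the jacard metric and the training call (the smoothing constant, batch size, and number of epochs are assumptions):

```python
# Sketch of step 6: Jaccard (intersection over union) metric and model training
from tensorflow.keras import backend as K

def jacard(y_true, y_pred):
    # Intersection over union computed on the flattened one-hot masks
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (intersection + 1.0) / (
        K.sum(y_true_f) + K.sum(y_pred_f) - intersection + 1.0)

def jacard_loss(y_true, y_pred):
    return 1.0 - jacard(y_true, y_pred)

model = multiclass_unet_architecture(n_classes=23)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy', jacard])
history = model.fit(X_train, y_train, batch_size=16, epochs=100,
                    validation_data=(X_test, y_test))
```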

7- validation_jaccard and loss of the training process are visualized. Figure 4 illustrates the val_jaccard.

Figure 4. Jaccard value by epochs, Image by Author

8- The Jaccard index value of the test dataset is calculated as 0.5532.

9- 5 random images are selected from the test dataset, predictions are made with the trained model, and the results are shown in Figure 5.
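A sketch of the prediction step (the plotting code is omitted; model and X_test come from the previous steps):

```python
# Sketch of step 9: predict on 5 random test images and collapse the class
# probabilities to an integer mask with argmax for display
import random
import numpy as np

idx = random.sample(range(len(X_test)), 5)
preds = model.predict(X_test[idx])        # (5, 256, 256, 23)
pred_masks = np.argmax(preds, axis=-1)    # (5, 256, 256) integer labels
```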

Figure 5. Prediction results of 5 random test images, Image by Author

Semantic Segmentation using U-Net with Transfer Learning

10- The inputs are re-prepared with the resnet34 backbone's preprocessing, "adam" is set as the optimizer, "categorical_crossentropy" as the loss function, and the model is trained.
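One way to realize this step is the segmentation_models package; the package choice and the training settings below are assumptions:

```python
# Sketch of step 10 with segmentation_models: a U-Net whose resnet34 encoder
# carries ImageNet weights (batch size and epochs are assumptions)
import segmentation_models as sm

BACKBONE = 'resnet34'
preprocess_input = sm.get_preprocessing(BACKBONE)

# Re-prepare the inputs to match what the pre-trained encoder expects
X_train_tl = preprocess_input(X_train)
X_test_tl = preprocess_input(X_test)

model_tl = sm.Unet(BACKBONE, encoder_weights='imagenet',
                   classes=23, activation='softmax')
model_tl.compile(optimizer='adam', loss='categorical_crossentropy',
                 metrics=['accuracy', jacard])
history_tl = model_tl.fit(X_train_tl, y_train, batch_size=16, epochs=100,
                          validation_data=(X_test_tl, y_test))
```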

11- validation_jaccard and loss of the training process are visualized. Figure 6 illustrates the val_jaccard.

Figure 6. Jaccard value by epochs, Image by Author

12- The Jaccard index value of the test dataset is calculated as 0.6545.

13- 5 random images are selected from the test dataset, predictions are made with the trained model, and the results are shown in Figure 7.

Figure 7. Prediction results of 5 random test images, Image by Author

Conclusion

This article presents semantic segmentation of aerial drone imagery using U-Net, an architecture originally developed for biomedical image segmentation. Two main approaches are considered. The first trains the configured U-Net model implemented from scratch; the second trains the model with the transfer learning technique, that is, pre-trained weights. In the implementation, the corresponding ground truth images are one-hot-encoded and the model is trained as in a multi-class classification task. The Jaccard index is used as the evaluation metric.

Resizing is not a recommended method for segmentation, since the size change distorts the images and their pixel-wise labels, but due to computational constraints the dataset is resized from 6000×4000 to 256×256. As a result, the success rate of the models is considerably low. The main remedies are using a high-resolution dataset and/or patchifying, that is, cropping the images and the corresponding ground truth images into tiles, as sketched below.
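A sketch of the patchifying idea with the patchify package (the package choice, tile size, and step are assumptions):

```python
# Sketch of the patchifying remedy: cut each 6000x4000 image into 256x256
# tiles instead of resizing; step == tile size gives non-overlapping patches
from patchify import patchify

# image: (4000, 6000, 3) raw RGB array
patches = patchify(image, (256, 256, 3), step=256)
patches = patches.reshape(-1, 256, 256, 3)    # flatten the tile grid
# Applying the same call to the ground truth keeps images and labels aligned
```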

Figure 8. Comparison of two approaches, Image by Author

With the resized dataset, 2 different approaches are evaluated and the results are shown in Figure 8. Looking at the Jaccard index values, 0.6545 is obtained with the transfer learning method, while 0.5532 is obtained with the scratch-built model. It is seen that the segmentation process obtained with the pre-trained model is more successful.

Different methods with different coding approaches will be covered in future articles.


References

O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," 2015, doi: 10.1007/978-3-319-24574-4_28.

A. Arnab et al., "Conditional Random Fields Meet Deep Neural Networks for Semantic Segmentation," IEEE Signal Process. Mag., vol. XX, 2018.

J. Y. C. Chen and G. Fragomeni, Eds., Virtual, Augmented and Mixed Reality. 2020.

J. Maurya, R. Hebbalaguppe, and P. Gupta, "Real-Time Hand Segmentation on Frugal Head-mounted Device for Gestural Interface," Proc. Int. Conf. Image Process. (ICIP), pp. 4023–4027, 2018, doi: 10.1109/ICIP.2018.8451213.

