In machine learning, models are trained for a wide variety of applications, particularly in deep learning on image datasets. Methods based on the convolution operation are used in many fields: hand and arm detection in augmented reality, self-driving cars, aerial imagery from drones, and defense technologies. The human eye can easily classify and distinguish what it sees; the equivalent of this ability in artificial intelligence, that is, the problem of understanding images, is studied under the title of Computer Vision. As the name suggests, the goal is to represent (classify) images in a way the computer can understand, and the next step is to make operations on these images possible by using different methods. This article explains one such segmentation method, the U-Net architecture, which was developed for biomedical image segmentation, and includes a real-world project that uses U-Net to segment aerial imagery captured by a drone.
Table of Contents
1. Semantic Segmentation
2. U-Net Architecture
3. Tutorial
3.1. Data Preprocessing
3.2. Semantic Segmentation using U-Net from scratch
3.3. Semantic Segmentation using U-Net with Transfer Learning
4. Conclusion
5. References
Semantic Segmentation
Images are matrices of pixels represented by numbers. In image processing, adjustments are made to these numbers so that the images are expressed in a different way and made suitable for the relevant study or interpretation. The convolution operation, a fundamental pixel-level mathematical operation, provides the opportunity to evaluate images from different perspectives. For example, edge detection can be performed with an appropriate filter, or the image can be interpreted and used from a different angle by converting it from RGB to grayscale. Building on deep learning models and convolutional layers, more comprehensive studies such as feature extraction and classification of image content have been carried out.
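For instance, edge detection by convolution can be sketched in plain NumPy. The helper function and Sobel kernel below are illustrative, not from the original code:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2-D convolution of a grayscale image with a small kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * flipped)
    return out

# Sobel filter that responds to vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# A toy image with a vertical edge: dark left half, bright right half.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
edges = convolve2d(img, sobel_x)  # strong responses along the edge column
```

The same mechanism, stacked into learnable convolutional layers, underlies the feature extraction described above.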
As seen in the picture above, detecting the objects in an image with bounding boxes is called object detection. Semantic segmentation, a pixel-wise labeling operation, displays objects of the same type (sky, cat, dog, human, road, car, mountain, sea, etc.) with the same label, that is, the same color. Instance segmentation, in which every instance is labeled individually, separates each object by displaying it in a different color. As mentioned above, various complex CNN models, different from each other, have been developed for these operations and their different uses; PSPNet, DeepLab, LinkNet, U-Net, and Mask R-CNN are just some of them. The segmentation process can be considered the eyes of machine-learning-based applications such as self-driving cars. The video below shows real-time semantic segmentation, comparing the human view perspective and the PSPNet perspective.
In a nutshell, semantic segmentation in computer vision is a pixel-wise labeling method. If objects of the same type are expressed with a single color, it is called semantic segmentation, and if each object is expressed with a unique color (label), it is called instance segmentation.
U-Net Architecture
U-Net is a specific type of convolutional neural network architecture that was developed for biomedical images (computed tomography, microscopic images, MRI scans, etc.) at the Computer Science Department and BIOSS Center for Biological Signaling Studies, University of Freiburg, Germany, in 2015. The article, "U-Net: Convolutional Networks for Biomedical Image Segmentation," can be accessed at the link here. Technically, the model consists of an encoder (contraction) path, which performs down-sampling (often with pre-trained weights in transfer learning), and a decoder (expansion) path, which performs up-sampling. It is named U-Net because its scheme is U-shaped, as shown in Figure 2. The model can be configured differently for different studies.
The U-Net model configured for semantic segmentation of aerial imagery in the following tutorial is as follows:
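The original code listing is not reproduced in this excerpt; a hedged Keras reconstruction matching the layer names and dimensions in the step-by-step walkthrough might look like this (kernel sizes and dropout rates are assumptions):

```python
# Hypothetical reconstruction of the tutorial's model; the layer names
# (conv_1 ... conv_9, pool_1 ... pool_4, u6 ... u9) follow the walkthrough,
# while kernel sizes and dropout rates are assumptions.
from tensorflow.keras import layers, models

def multiclass_unet_architecture(n_classes=23, img_height=256, img_width=256, img_channels=3):
    inputs = layers.Input((img_height, img_width, img_channels))

    def conv_block(x, filters, dropout):
        # Two 3x3 convolutions with a dropout layer in between.
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = layers.Dropout(dropout)(x)
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        return x

    # Encoder (contraction): each pooling step halves the spatial size.
    conv_1 = conv_block(inputs, 16, 0.1)              # 256x256x16
    pool_1 = layers.MaxPooling2D((2, 2))(conv_1)      # 128x128x16
    conv_2 = conv_block(pool_1, 32, 0.1)              # 128x128x32
    pool_2 = layers.MaxPooling2D((2, 2))(conv_2)      # 64x64x32
    conv_3 = conv_block(pool_2, 64, 0.2)              # 64x64x64
    pool_3 = layers.MaxPooling2D((2, 2))(conv_3)      # 32x32x64
    conv_4 = conv_block(pool_3, 128, 0.2)             # 32x32x128
    pool_4 = layers.MaxPooling2D((2, 2))(conv_4)      # 16x16x128
    conv_5 = conv_block(pool_4, 256, 0.3)             # 16x16x256, bottom of the "U"

    # Decoder (expansion): Conv2DTranspose doubles the spatial size,
    # and skip connections concatenate the matching encoder output.
    u6 = layers.Conv2DTranspose(128, (2, 2), strides=(2, 2), padding="same")(conv_5)
    u6 = layers.concatenate([u6, conv_4])             # 32x32x256
    conv_6 = conv_block(u6, 128, 0.2)                 # 32x32x128
    u7 = layers.Conv2DTranspose(64, (2, 2), strides=(2, 2), padding="same")(conv_6)
    u7 = layers.concatenate([u7, conv_3])             # 64x64x128
    conv_7 = conv_block(u7, 64, 0.2)                  # 64x64x64
    u8 = layers.Conv2DTranspose(32, (2, 2), strides=(2, 2), padding="same")(conv_7)
    u8 = layers.concatenate([u8, conv_2])             # 128x128x64
    conv_8 = conv_block(u8, 32, 0.1)                  # 128x128x32
    u9 = layers.Conv2DTranspose(16, (2, 2), strides=(2, 2), padding="same")(conv_8)
    u9 = layers.concatenate([u9, conv_1])             # 256x256x32
    conv_9 = conv_block(u9, 16, 0.1)                  # 256x256x16

    # Pixel-wise classification head: one softmax score per class.
    outputs = layers.Conv2D(n_classes, (1, 1), activation="softmax")(conv_9)
    return models.Model(inputs=[inputs], outputs=[outputs])
```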
If we take the above code block and Figure 2 (the flow from top left to top right, following the "U") step by step:

- Input is defined with dimensions 256×256×3.
- `conv_1` with 16 filters produces 256×256×16, which is reduced to 128×128×16 by max pooling in `pool_1`.
- `conv_2` with 32 filters produces 128×128×32, and similarly `pool_2` reduces it to 64×64×32.
- `conv_3` with 64 filters produces 64×64×64, and `pool_3` reduces it to 32×32×64.
- `conv_4` with 128 filters produces 32×32×128, and `pool_4` reduces it to 16×16×128.
- `conv_5` with 256 filters produces 16×16×256, and upsampling starts at this point. In `u6`, a `Conv2DTranspose` with 128 filters and a (2×2) kernel converts `conv_5` to 32×32×128, which is then concatenated with `conv_4`, so `u6` is updated to 32×32×256. With `conv_6` (128 filters), it becomes 32×32×128.
- `u7` with 64 filters and (2×2), applied to `conv_6`, gives 64×64×64; concatenating `u7` with `conv_3` makes it 64×64×128, and `conv_7` reduces it to 64×64×64.
- `u8` with 32 filters and (2×2), applied to `conv_7`, gives 128×128×32; concatenating `u8` with `conv_2` makes it 128×128×64, and `conv_8` reduces it to 128×128×32.
- `u9` with 16 filters and (2×2), applied to `conv_8`, gives 256×256×16; concatenating `u9` with `conv_1` makes it 256×256×32, and `conv_9` reduces it to 256×256×16.
- The output layer completes the pixel-wise classification using softmax activation; the final output has shape 256×256×n_classes (one channel per class).
- Dropout at various rates is used to prevent overfitting.
Tutorial
In the coding part, the dataset can be trained with different approaches. In this study, the RGB (raw) images are defined as x, and the model is trained using the ground truth (segmented, labeled) images as y. In future articles, approaches using the mask dataset will also be discussed. An RGB image and its ground truth are shown in Figure 3. The aim is to train on the dataset with this approach so that externally supplied images can be segmented as in the training data.

The focus is on the coding architecture rather than on achieving high performance, due to the computational complexity involved when working with image datasets. For example, while the raw images are 6000×4000 pixels, they have been resized to 256×256 pixels to reduce computational cost. With such operations, the aim is for the coding architecture to work correctly at the expense of accuracy.
Dataset link: https://www.kaggle.com/awsaf49/semantic-drone-dataset
License: CC0: Public Domain
Data Preprocessing
1- Libraries are imported. `from architecture import multiclass_unet_architecture, jacard, jacard_loss` is defined and imported from the section above.
2- RGB raw images with 6000×4000 pixels and the corresponding labels are resized to 256×256 pixels.
3- `MinMaxScaler` is used to scale the RGB images.
4- Labels of the ground truth are imported. 23 labels are detected in the ground truth dataset, and labels are assigned to the content of the images based on the pixel values.
5- The labels dataset is one-hot encoded for classification, and the data is split into a training set and a test set.
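Steps 2-5 can be sketched as follows; the random arrays below stand in for the real resized dataset, and the 23-class count follows the label detection above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Stand-ins for the resized dataset: 10 RGB images and ground-truth label maps.
rng = np.random.default_rng(42)
images = rng.integers(0, 256, (10, 256, 256, 3), dtype=np.uint8)
masks = rng.integers(0, 23, (10, 256, 256), dtype=np.uint8)  # 23 labels

# Scale pixel values to [0, 1] with MinMaxScaler, fit on the flattened images.
scaler = MinMaxScaler()
x = scaler.fit_transform(images.reshape(-1, 1)).reshape(images.shape)

# One-hot encode the per-pixel labels for categorical cross-entropy.
y = to_categorical(masks, num_classes=23)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)
```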
Semantic Segmentation using U-Net (from scratch)
6- Accuracy and the Jaccard index are used as metrics in the training process. The optimizer is set to `'adam'` and the loss to `'categorical_crossentropy'`, since the task is treated as a pixel-wise classification problem. The model is fitted with these settings.
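The `jacard` metric imported earlier is not shown in this excerpt; a common intersection-over-union implementation on the flattened one-hot maps might look like this sketch (the smoothing constant is an assumption):

```python
from tensorflow.keras import backend as K

def jacard(y_true, y_pred):
    """Jaccard index (IoU) on flattened one-hot segmentation maps."""
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    union = K.sum(y_true_f) + K.sum(y_pred_f) - intersection
    # +1.0 smoothing avoids division by zero on empty masks.
    return (intersection + 1.0) / (union + 1.0)

# Usage during compilation (model from the architecture section):
# model.compile(optimizer='adam', loss='categorical_crossentropy',
#               metrics=['accuracy', jacard])
```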
7- `validation_jaccard` and the loss of the training process are visualized. Figure 4 illustrates the val_jaccard.

8- The Jaccard index value of the test dataset is calculated as 0.5532.
9- 5 random images are selected from the test dataset, predictions are made with the trained model, and the results are shown in Figure 5.
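For the prediction step, the softmax output is reduced back to a single-channel label map with `argmax`; a minimal sketch, using a random array in place of the real `model.predict(x_test)` output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for model.predict(x_test): 5 images, 256x256, 23 class scores.
pred = rng.random((5, 256, 256, 23))

# Per-pixel class ids, ready to be colored and plotted next to the ground truth.
pred_labels = np.argmax(pred, axis=-1)
```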

Semantic Segmentation using U-Net with Transfer Learning
10- The dataset is re-prepared with the resnet34 backbone. `"adam"` is set as the optimizer, `"categorical_crossentropy"` is set as the loss function, and the model is trained.
11- `validation_jaccard` and the loss of the training process are visualized. Figure 6 illustrates the val_jaccard.

12- The Jaccard index value of the test dataset is calculated as 0.6545.
13- 5 random images are selected from the test dataset, predictions are made with the trained model, and the results are shown in Figure 7.

Conclusion
This article presents semantic segmentation of drone-captured aerial images using U-Net, an architecture originally developed for biomedical image segmentation. Two main approaches are considered in the study. The first trains the configured U-Net model implemented from scratch. The second trains the model with the transfer learning technique, that is, with pre-trained weights. In the implementation, the corresponding ground truth images are one-hot encoded and the model is trained as a classification process. The Jaccard index is used as the metric.
Resizing is not a recommended method, since it introduces undesired shifts in the segmentation masks, but due to the computational complexity the dataset is resized from 6000×4000 to 256×256. Therefore, the success rate of the models is comparatively low. The main remedies are using a high-resolution dataset and/or patchifying (cropping the images and the corresponding ground truth images into tiles).
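A patchifying step can be sketched with plain NumPy (the helper below is illustrative; libraries such as `patchify` offer similar functionality):

```python
import numpy as np

def extract_patches(image, patch=256):
    """Crop non-overlapping patch x patch tiles, discarding any remainder."""
    h, w = image.shape[:2]
    return [image[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, patch)
            for x in range(0, w - patch + 1, patch)]

# A stand-in for one raw 6000x4000 drone image (height x width x channels).
img = np.zeros((4000, 6000, 3), dtype=np.uint8)
tiles = extract_patches(img)  # 15 rows x 23 columns of full-resolution tiles
```

Applying the same cropping to the ground truth keeps every pixel label aligned, unlike resizing.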

With the resized dataset, the two approaches are evaluated and the results are shown in Figure 8. Looking at the Jaccard index values, 0.6545 is obtained with the transfer learning method, while 0.5532 is obtained with the model built from scratch. The segmentation obtained with the pre-trained model is more successful.
Different methods with different coding approaches will be covered in future articles.
References
O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS 9351, 2015, doi: 10.1007/978-3-319-24574-4_28.
A. Arnab et al., "Conditional Random Fields Meet Deep Neural Networks for Semantic Segmentation," IEEE Signal Process. Mag., vol. XX, 2018.
J. Y. C. Chen and G. Fragomeni (Eds.), Virtual, Augmented and Mixed Reality, Springer, 2020.
J. Maurya, R. Hebbalaguppe, and P. Gupta, "Real-Time Hand Segmentation on Frugal Head-mounted Device for Gestural Interface," Proc. – Int. Conf. Image Process. ICIP, pp. 4023–4027, 2018, doi: 10.1109/ICIP.2018.8451213.