EagleView high-resolution image semantic segmentation with Mask-RCNN/DeepLabV3+ using Keras and ArcGIS Pro

Chunguang (Wayne) Zhang
Towards Data Science
8 min read · Dec 10, 2019


Computer vision in machine learning provides enormous opportunities for GIS. Its tasks include methods for acquiring, processing, analyzing and understanding digital images, and for extracting high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions.[1][2][3][4] In the last several years, computer vision has increasingly shifted from traditional statistical methods to state-of-the-art deep learning neural network techniques.

In this blog, I will share several empirical practices using Keras and ESRI ArcGIS Pro tools, together with deep learning and transfer learning techniques, to build a building footprint image segmentation network model from super-high-resolution 3-inch EagleView (Pictometry) imagery.

In 2018, ESRI and Microsoft collaborated with the Chesapeake Conservancy to train a deep neural network model to predict land cover from 1-meter resolution NAIP aerial imagery. The neural network used in that case was similar in architecture to Ronneberger et al.'s U-Net (2015), a commonly used semantic segmentation model. Each year, the GIS Core group in Cobb County, Georgia receives 3-inch super-high-resolution ortho imagery from EagleView (Pictometry). Could deep learning models be applied to this super-high-resolution ortho imagery to classify land cover or extract building footprints? There are several challenges: super-high-resolution imagery usually presents many overlapping vegetation types, and buildings and trees cast heavy shadows in the images, which can cause true ground objects to be misclassified.

In the beginning, I was very conservative, as I decided to use a CPU-only laptop to train on roughly 3,800 images. Considering the complexity of land cover and building footprints, this is quite a small dataset for deep learning; textbooks often say deep learning requires a huge amount of training data for good performance. But it is also a realistic classification problem: in real-world cases, even small-scale image data can be extremely hard to collect, expensive, or sometimes almost impossible to obtain. Being able to train a powerful classifier on a small dataset is a key skill for a competent data scientist. After many trial runs, the results turned out very promising, especially with the state-of-the-art DeepLabv3+ and Mask-RCNN models.

Study Area and training image dataset preparation

fig. 1 — Cobb County 2018 3-inch EagleView imagery coverage: 433 1x1-mile tiles.

The geographical area of Cobb County is covered by 433 1x1-mile Pictometry image tiles at a resolution of 3 inches. The county GIS group maintains a building footprint polygon layer in certain areas. For training purposes, one image tile close to the center of the county was chosen for the training dataset (fig. 1). The building footprint polygon feature layer was processed into ground truth mask labels. The "Export Training Data for Deep Learning" geoprocessing tool in ArcGIS Pro 2.4 was used to export images and masks as instance segmentation datasets (fig. 2). The output images are 512x512x3, and the rotation is set to 90 degrees to generate more images, which helps prevent overfitting and lets the model generalize better.

Fig. 2 — ArcGIS “Export Training Data for Deep Learning”
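For repeatability, the same export can also be scripted with the arcpy Image Analyst module. Below is a minimal sketch assuming the ArcGIS Pro 2.4 tool signature; the raster, footprint layer, and output paths are hypothetical placeholders.

import arcpy
arcpy.CheckOutExtension("ImageAnalyst")

# Hypothetical inputs: one 3-inch imagery tile and the footprint layer.
in_raster = r"C:\data\eagleview_2018_tile.tif"
in_class_data = r"C:\data\gis.gdb\building_footprints"
out_folder = r"C:\data\training_chips"

# Export 512x512 chips; rotation_angle=90 generates rotated copies of each
# chip for augmentation, as described above. RCNN_Masks emits per-instance
# masks suitable for Mask-RCNN training.
arcpy.ia.ExportTrainingDataForDeepLearning(
    in_raster, out_folder, in_class_data, "TIFF",
    tile_size_x=512, tile_size_y=512,
    stride_x=256, stride_y=256,
    output_nofeature_tiles="ONLY_TILES_WITH_FEATURES",
    metadata_format="RCNN_Masks",
    rotation_angle=90)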

1. Training with Mask-RCNN model

The resulting training datasets contained over 18,000 images and labels. After further processing to remove images with no labels, the final datasets had over 15,000 training images and labels. However, on a CPU-only laptop with 32 GB of memory, it is impossible to feed such large datasets into the Mask-RCNN model, which requires a large amount of memory for training.

The training strategy was to test the proof of concept, so I gradually increased the amount of data fed into the network, ending with a trial of 3,800 samples.

I used Matterport's impressive open-source Mask-RCNN implementation on GitHub here to train the model.

Mask-RCNN efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition[5] (fig. 3). You can read the research paper to better understand the model.

fig. 3. — Mask R-CNN framework for instance segmentation. Source: https://arxiv.org/abs/1703.06870

Three main functions need to be modified in the utils.Dataset class to load your own datasets into the framework; see below for the data loading implementation. The anchor scales are set to (16, 32, 64, 128, 256) to detect the smaller residential buildings, and IMAGES_PER_GPU is set to 1 so a CPU can be used to train the model (fig. 4). An example image and mask are shown in fig. 5.

fig. 4 — Load Cobb Pictometry datasets to Mask-RCNN framework.
fig. 5 — An example of a random image and mask from datasets.
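The loading code in fig. 4 follows the pattern below: a minimal sketch against Matterport's mrcnn package, with a hypothetical chip/mask file layout (one mask stack per image chip). After load_buildings(), calling prepare() builds the internal indices the framework expects.

import os
import numpy as np
import skimage.io
from mrcnn import utils
from mrcnn.config import Config

class BuildingConfig(Config):
    NAME = "building"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1                           # lets a CPU train the model
    NUM_CLASSES = 1 + 1                          # background + building
    RPN_ANCHOR_SCALES = (16, 32, 64, 128, 256)   # smaller anchors for houses

class BuildingDataset(utils.Dataset):
    def load_buildings(self, dataset_dir, image_ids):
        # Register the single class and each exported image chip.
        self.add_class("building", 1, "building")
        for image_id in image_ids:
            self.add_image(
                "building", image_id=image_id,
                path=os.path.join(dataset_dir, "images", "%s.tif" % image_id))

    def load_mask(self, image_id):
        # Hypothetical layout: an (H, W, instance_count) mask stack per chip.
        info = self.image_info[image_id]
        mask_path = info["path"].replace("images", "masks")
        masks = skimage.io.imread(mask_path).astype(bool)
        class_ids = np.ones(masks.shape[-1], dtype=np.int32)  # all buildings
        return masks, class_ids

    def image_reference(self, image_id):
        return self.image_info[image_id]["path"]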

Here, transfer learning was applied with a ResNet-101 model backbone. I first trained the head layers for 5 epochs to adapt the model to the residential building class, then trained the full network for 35 epochs.
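A minimal sketch of that two-stage schedule with the Matterport API, reusing BuildingConfig and BuildingDataset from the sketch above; the COCO weights file and the stage-2 learning-rate drop are my assumptions, not the post's exact settings.

import mrcnn.model as modellib

config = BuildingConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# Transfer learning: start from COCO weights, skipping the head layers whose
# shapes differ now that there is a single building class.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

dataset_train = BuildingDataset()
dataset_train.load_buildings("training_data", train_ids)  # train_ids: your chip ids
dataset_train.prepare()
dataset_val = BuildingDataset()
dataset_val.load_buildings("training_data", val_ids)
dataset_val.prepare()

# Stage 1: train only the head layers to adapt to the new class.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, epochs=5, layers="heads")

# Stage 2: fine-tune the full network, ResNet-101 backbone included.
# Matterport counts epochs cumulatively, so this runs up to epoch 35.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10, epochs=35, layers="all")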

On the 32-GB, CPU-only laptop, it took nearly 48 hours to finish the training process (fig. 6 and fig. 7).

fig. 6 — Model training result. The loss is reasonably good.
fig. 7 — Loss charts.

Here are two inference results on images that were not used in training (fig. 8 and fig. 9). Interestingly, the inferred mask delineates the building more accurately than the original mask.

fig. 8 — Original image that was not used in training
fig. 9 — The inference mask delineates the building more accurately than the original mask.

Another interesting example shows an image that was not used in training and its mask inference result (fig. 10 and fig. 11).

fig. 10 — An image that was not used in training.
fig. 11 — Inference instance masks.
fig. 12 — A cropped image and inference mask not used in training. The orange line indicates the crop position. With 3,000 training samples, the result is very promising.

2. Training with Deeplabv3+ model

DeepLabv3+ is the latest state-of-the-art semantic image segmentation model developed by the Google research team. Its distinctive feature is the use of atrous convolution, in cascade or in parallel, to capture multi-scale context by adopting multiple atrous rates (fig. 13).

fig.13 — https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html

With the same training datasets from the ArcGIS Export Training Data for Deep Learning tool, the images and masks were augmented and saved to a compressed HDF5 file for convenient loading into the training model (fig. 14).

fig. 14 — A random example of image and mask.
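A minimal sketch of that packaging step using h5py; the array shapes, placeholder contents, and file name are illustrative.

import h5py
import numpy as np

# Placeholder arrays standing in for the augmented chips and masks:
# (N, 512, 512, 3) uint8 images and (N, 512, 512) binary masks.
images = np.zeros((3800, 512, 512, 3), dtype=np.uint8)
masks = np.zeros((3800, 512, 512), dtype=np.uint8)

# Write both arrays to one compressed HDF5 file.
with h5py.File("cobb_training_data.h5", "w") as f:
    f.create_dataset("images", data=images, compression="gzip")
    f.create_dataset("masks", data=masks, compression="gzip")

# At training time, load them back in one call each.
with h5py.File("cobb_training_data.h5", "r") as f:
    X = f["images"][:]
    y = f["masks"][:]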

I used a Keras implementation of DeepLabv3+ on GitHub here. Below is the Keras training model with the MobileNetV2 backbone, which has far fewer parameters than the Xception model (fig. 15).

fig. 15 — Define the DeepLabv3+ model.
fig. 16 — Five epochs training result.

With only 5 training epochs, the result turns out very promising (fig. 16).
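As a sketch of what figs. 15 and 16 cover, assuming the bonlime keras-deeplab-v3-plus implementation (which exposes a Deeplabv3 factory function and returns unactivated logits); the optimizer, loss, and batch size are my illustrative choices.

import h5py
import numpy as np
from keras.layers import Activation
from keras.models import Model
from keras.optimizers import Adam
from model import Deeplabv3  # keras-deeplab-v3-plus repository

# MobileNetV2 backbone: far fewer parameters than Xception, which keeps
# CPU-only fine-tuning feasible.
base = Deeplabv3(weights="pascal_voc", input_shape=(512, 512, 3),
                 classes=1, backbone="mobilenetv2")
# The repo leaves the output unactivated, so add a sigmoid for binary masks.
deeplab = Model(inputs=base.input, outputs=Activation("sigmoid")(base.output))
deeplab.compile(optimizer=Adam(lr=1e-4), loss="binary_crossentropy",
                metrics=["accuracy"])

# Load the chips and masks packaged in the HDF5 file above.
with h5py.File("cobb_training_data.h5", "r") as f:
    X = f["images"][:].astype("float32") / 255.0
    y = f["masks"][:][..., np.newaxis].astype("float32")

deeplab.fit(X, y, batch_size=2, epochs=5, validation_split=0.1)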

fig. 17 — Training loss converged plot.
fig. 18 — Image and inference from the DeepLabv3+ model trained on 2018 imagery.

I ran Python scripts to infer on an arbitrary cropped image of 2064x1463 pixels; the scripts crop and process 16 (512x512x3) image chips to obtain the inference rasters (fig. 19). On closer inspection of the images and inference, we can see that building shadows can lower the accuracy along the edges of the buildings.

fig. 19 — A cropped 2018 EagleView image with inference raster (25 rasters of 512x512 dim)
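The inference script follows a tile-and-stitch pattern like the minimal NumPy sketch below; the reflective edge padding and lack of tile overlap are my simplifications.

import numpy as np

def tiled_inference(model, image, tile=512):
    # Pad so both dimensions are exact multiples of the tile size.
    h, w = image.shape[:2]
    pad_h, pad_w = (-h) % tile, (-w) % tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")
    out = np.zeros(padded.shape[:2], dtype=np.float32)
    # Predict each 512x512 chip and write it back into one output raster.
    for ty in range(0, padded.shape[0], tile):
        for tx in range(0, padded.shape[1], tile):
            chip = padded[ty:ty + tile, tx:tx + tile].astype("float32") / 255.0
            out[ty:ty + tile, tx:tx + tile] = model.predict(
                chip[np.newaxis])[0, ..., 0]
    return out[:h, :w]  # drop the padding

# Example usage: raster = tiled_inference(deeplab, cropped_image) > 0.5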

Predicting with the same trained model on a crop of the same area from 2019 imagery, the result is very similar, with only minor localized differences (fig. 20). The model can really help with inference on future years' imagery.

fig. 20 — A cropped 2019 EagleView image with inference raster (25 rasters of 512x512 dim)
fig. 21 — 2018 inference raster overlaid on the original image.

The image and inference above were then added to ArcGIS Pro (fig. 21).

The inference raster above was converted to polygon features, and the Regularize Building Footprint tool in ArcGIS Pro 3D Analyst was then used with appropriate parameters to regularize the raw detections (fig. 22).

fig. 22 — Use ArcGIS Pro Regularize Building Footprint tool to clean up building polygons.
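A minimal arcpy sketch of that raster-to-polygon and regularization step, assuming a licensed 3D Analyst extension; the paths and tolerances are placeholders to tune per dataset.

import arcpy
arcpy.CheckOutExtension("3D")

# Convert the inference raster (building = 1) to raw polygon features.
arcpy.conversion.RasterToPolygon(
    r"C:\data\inference_2018.tif", r"C:\data\gis.gdb\buildings_raw",
    "NO_SIMPLIFY", "Value")

# Square up the raw detections with right-angle regularization.
arcpy.ddd.RegularizeBuildingFootprint(
    r"C:\data\gis.gdb\buildings_raw", r"C:\data\gis.gdb\buildings_clean",
    method="RIGHT_ANGLES", tolerance=1, densification=1, precision=0.25)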

Then I ran the Python inference scripts on two complete 2018 tile images of 20,000 x 20,000 pixels, located about 3 miles from each other. The scripts crop and process 1,600 (512x512x3) image chips per tile for the inference. It took approximately one hour to finish each tile on the CPU-only laptop with 32 GB of RAM (fig. 23). There are misclassified buildings, mostly because of the very small training dataset and trees covering the tops of buildings. Choosing several representative tiles from different locations in the county as training data could improve the accuracy of the result.

fig. 23 — Two complete tiles of EagleView 2018 image with inference raster (1600 rasters of 512x512 dim on each)
fig. 24 — Inference raster overlaid on the image tile in ArcGIS Pro.

Conclusion:

Although the dataset is relatively small, the Mask-RCNN and DeepLabv3+ deep learning models both produce promising results for super-high-resolution image segmentation using transfer learning techniques. Due to the limited accuracy of the original building-footprint ground truth polygons and the laptop's CPU and memory limitations, the performance may not surpass a human digitizer in some image classification and instance segmentation cases. However, the accuracy of this deep learning training process can be further enhanced by adding high-quality training datasets from different locations in the county and applying data augmentation methods. The model can be used on multi-year imagery to infer feature detections for comparison, or even for low-cost feature delineation, with ArcGIS ModelBuilder automating the business tasks. More importantly, the above deep learning training process can be applied to other types of image instance or semantic segmentation cases. (Please see my next blog.)

1. Reinhard Klette (2014). Concise Computer Vision. Springer. ISBN 978-1-4471-6320-6.

2. Linda G. Shapiro; George C. Stockman (2001). Computer Vision. Prentice Hall. ISBN 978-0-13-030796-5.

3. Tim Morris (2004). Computer Vision and Image Processing. Palgrave Macmillan. ISBN 978-0-333-99451-1.

4. Bernd Jähne; Horst Haußecker (2000). Computer Vision and Applications, A Guide for Students and Practitioners. Academic Press. ISBN 978-0-13-085198-7.

5. Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick (2018). Mask R-CNN. https://arxiv.org/abs/1703.06870v3

6. DeepLabv3+ model: https://github.com/tensorflow/models/tree/master/research/deeplab

7. https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

8. http://pro.arcgis.com/en/pro-app/tool-reference/image-analyst/export-training-data-for-deep-learning.htm

9. https://pro.arcgis.com/en/pro-app/tool-reference/3d-analyst/regularize-building-footprint.htm

10. https://blogs.technet.microsoft.com/machinelearning/2018/03/12/pixel-level-land-cover-classification-using-the-geo-ai-data-science-virtual-machine-and-batch-ai/

11. U-Net: Convolutional Networks for Biomedical Image Segmentation: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/


Solution Analyst in Cobb County. I am interested in and passionate about AI, ML, and deep learning to benefit humanity.