Towards accelerating disaster response with automated analysis of overhead imagery
A review of the SpaceNet Challenge for off-nadir building footprint extraction
SpaceNet’s mission is to accelerate geospatial machine learning and is supported by the SpaceNet member organizations. To learn more visit https://spacenet.ai.
In this post, we’ll describe the challenges associated with automated mapping from overhead imagery. The key takeaway: though algorithms are fantastic at mapping from “ideal” imagery taken directly overhead, the types of imagery that exist in urgent collection environments — such as after natural disasters — pose a currently unsolved problem for state-of-the-art computer vision algorithms.
The current state of affairs
As computer vision methods improve and overhead imagery becomes more accessible, scientists are exploring ways to unite these domains for many applications: among them, monitoring deforestation and tracking population dynamics in refugee situations. Rapid disaster response could be aided by automated overhead image analysis: new maps are often essential after disasters, as infrastructure (e.g. roads) needed to coordinate response efforts may be disrupted. At present, this is done manually by teams in government, the private sector, or volunteer organizations like the Humanitarian OpenStreetMap Team (HOT-OSM), which created a base map (roads and buildings) of Puerto Rico after Hurricane Maria at the request of the USA’s Federal Emergency Management Agency (FEMA). But manual labeling is time-consuming and labor-intensive: even with 5,300 mappers working on the project, the first base map of Puerto Rico was not delivered until over a month after the hurricane hit, and the project was not officially concluded for another month. This is by no means a criticism of the HOT-OSM team or their fantastic community of labelers — they had 950,000 buildings and 30,000 km of roads to label! Even a preliminary automated labeling step, corrected manually afterward, could improve map delivery time.
Computer vision-based map creation from overhead imagery has come a long way as deep learning models have grown from the nascency of AlexNet, implemented before TensorFlow even existed, to today’s advanced architectures such as Squeeze-and-Excitation Networks and Dual Path Networks, alongside advanced model training methods implemented in easy-to-use packages like Keras. Alongside these developments, automated mapping challenges have seen steadily improving performance, as evidenced by the SpaceNet competition series: building extraction scores on imagery taken directly overhead improved almost three-fold from the first challenge in 2016 to the most recent challenge (discussed here) at the end of 2018.
Why don’t we automatically map after natural disasters?
A major barrier still exists to automated analysis of overhead imagery in disaster response scenarios: look angle.
A satellite’s location — and therefore the area that can be visualized by satellite imagery — is constrained by satellite orbits. In urgent collection situations where there isn’t time for the satellite to get directly overhead, this means taking an image at an angle — sometimes a substantial one. The first cloud-free publicly available collection over San Juan, Puerto Rico after Hurricane Maria, taken 2 days after the hurricane, was at a 52 degree off-nadir angle according to DigitalGlobe’s Discover platform. For comparison, the most off-nadir image in the animation at the top of this story was taken at 54 degrees.
There are a lot of features of so-called “off-nadir” imagery that pose a challenge to automated analysis:
Displacement and distortion
In off-nadir imagery, the tops of tall objects (trees, buildings, etc.) are displaced from their ground footprint. This makes segmentation challenging!
Shadows and varied lighting
In overhead imagery, views of an area can vary dramatically in appearance depending on shadows. See the example below, where buildings are obvious when sunlight reflects back at the satellite, but much less so when the image is collected from the shadowed side of buildings.
Occlusion of objects
It’s hard to identify an object if you can’t even see it in the image! In off-nadir looks, tall objects block out other structures:
Resolution
Images taken at a greater angle cover more ground, but still contain the same number of pixels, reducing ground resolution. In the SpaceNet Multi-View Overhead Imagery (MVOI) dataset described below, images taken “at nadir” had a resolution of 0.51 m/pixel, whereas at the most off-nadir angles resolution dropped to 1.67 m/pixel. This is obvious in the animation at the top of this story — as the image gets more off-nadir, resolution gets worse. To find out how much these factors influence model performance, we need a well-labeled dataset that controls for all variables except look angle.
The SpaceNet Multi-View Overhead Imagery (MVOI) Dataset
To explore how much look angle influences model performance, we released the SpaceNet MVOI Dataset, which is open source and freely available on AWS S3 — all you need is an AWS account! Download instructions and additional metadata are available here.
The imagery
The dataset is derived from a unique series of collections from a DigitalGlobe WorldView-2 satellite during a single pass over Atlanta, GA USA. The satellite took 27 images, ranging from 7 degrees to 54 degrees off-nadir, including both North- and South-facing looks:
Each of these collects covers the same 665 square kilometers in and around Atlanta. Because they were all acquired within 5 minutes of one another, changes to lighting, land usage, and other temporal factors are minimized. Really, the only thing that varies from collect to collect is look angle.
The exact same geography looks very different depending upon the look direction and angle. Understanding the effect this has upon algorithm performance has applications not only in remote sensing analysis, but also in the general computer vision domain.
The labels
Radiant Solutions, a SpaceNet partner, undertook a rigorous labeling campaign to draw polygon footprints around 126,747 buildings in the dataset. We used footprints instead of bounding boxes because bounding boxes are often insufficient for foundational mapping — for example, if a building footprint overlaps with a road, it’s important to know that’s actually true, not an artifact of the labeling. We performed careful quality control to ensure that all buildings were labeled, as high-quality labels are critical for training computer vision models (i.e., how can an algorithm learn to find buildings if only half of them are marked as such in the dataset?). The building footprints range in size from 20 square meters to over 2,000 square meters, and density varies dramatically across different geographies within the dataset:
This variability poses a challenge of its own, as the algorithm can’t learn that roughly the same number of objects should be present in each image. This is particularly difficult for object detection (bounding box prediction) algorithms, which often require an estimate of the number of objects per image as a hyperparameter.
After labeling was complete we re-sampled the images so that they all have the same ground area covered per pixel, then broke the images and labels up into 900-by-900 pixel tiles to enable easier creation of model inputs. Once processing was complete, we split the dataset into three parts: a large training set and two smaller test sets, one for public testing and one for final competition validation. We released the imagery and labels for the training set, the imagery alone for the public testing set, and held back everything from the final test set.
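The tiling step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the actual SpaceNet preprocessing code; for simplicity it drops edge chips smaller than the tile size.

```python
import numpy as np

def tile_image(image, tile_size=900):
    """Split an (H, W, C) array into non-overlapping tile_size x tile_size chips.

    Edge chips smaller than tile_size are dropped here for simplicity; the
    real pipeline handled tile alignment more carefully.
    """
    h, w = image.shape[:2]
    chips = []
    for row in range(0, h - tile_size + 1, tile_size):
        for col in range(0, w - tile_size + 1, tile_size):
            chips.append(image[row:row + tile_size, col:col + tile_size])
    return chips

# A fake 1800 x 2700 pixel image yields a 2 x 3 grid of chips.
chips = tile_image(np.zeros((1800, 2700, 3), dtype=np.uint8))
```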
The SpaceNet 4: Off-Nadir Building Footprint Extraction Challenge
Competition summary
Once the dataset was complete, we put algorithms to the test through the public SpaceNet Off-Nadir Building Footprint Extraction Challenge hosted by TopCoder. There, competitors trained algorithms to identify building footprints in images at every different look angle, competing for $50,000 of prize money.
Scoring
A common assessment tool for segmentation tasks is pixel-by-pixel scoring, but we asked competitors for more: actual polygon footprints of the buildings. These are much more useful in a deployment context, where one often needs to know where individual buildings are. To score their polygon predictions, we used an Intersection over Union (IoU) metric:
We set an IoU threshold of 0.5 for positive predictions, and after scoring, calculated the number of true positives, false positives, and false negatives in each look angle group (Nadir/Off-nadir/Very off-nadir). We calculated F1 score for each, as that metric penalizes both false positive and false negative predictions. We then averaged those three F1 scores for the final competition score.
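The scoring logic can be sketched as follows (an illustrative sketch, not the official SpaceNet evaluation code; the per-bin counts below are toy numbers). IoU is shown here on binary masks, and the final score is the mean of the three per-bin F1 scores:

```python
import numpy as np

def mask_iou(pred, truth):
    """Intersection over Union between two boolean masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter / union) if union else 0.0

def f1(tp, fp, fn):
    """F1 score from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy counts for the Nadir / Off-nadir / Very off-nadir bins; the final
# competition score averages the three per-bin F1 scores.
bin_f1s = [f1(90, 10, 10), f1(80, 20, 20), f1(60, 40, 40)]
score = sum(bin_f1s) / len(bin_f1s)
```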
Nearly 250 competitors registered for the competition, producing over 500 unique submissions. The results are below!
The winning algorithms
Score summary
First, let’s take a look at the scores from the top five competitors:
There are a few key takeaways from these results that are worth highlighting:
- Building footprint extraction scores have improved dramatically since the first SpaceNet building identification competition two years ago, where the winning algorithm achieved an IoU-F1 of ~0.3. Geospatial computer vision algorithms have come a long way, and fast!
- Building footprint extraction from off-nadir (25–40 degree look angles) can be achieved nearly as well as nadir imagery. This type of knowledge can inform imagery acquisition decisions, since we now know that perfect looks aren’t required for this task. That said,
- Building footprint extraction is still challenging very off-nadir (>40 degree look angles). Performance was about 25% lower on those images. We’ll come back to why that might be soon.
Summary of the winning algorithms
The top competitors provided Dockerized versions of their algorithms, including full solution code and written summaries, which have been open sourced here. Along with those summaries, we’ve examined the algorithms each competitor used and found a few interesting details to highlight.
Let’s break down a few key details here:
A variety of deep learning models (but no classic machine learning):
We were struck by how many different deep learning model architectures were used, and how similar the scores were among them. Some competitors used substantial ensembles — over 20 independently trained models — to generate predictions of which pixels corresponded to buildings, and then averaged those predictions for their final results. Another competitor (number13) trained a different set of weights for each individual collect, then generated predictions for each image using the corresponding look’s model weights.
We weren’t terribly surprised that every winning algorithm used deep learning. This is consistent with general trends in computer vision: almost all high-performance segmentation algorithms utilize deep learning. The only “classical” machine learning algorithms used were Gradient Boosted Trees that the top two competitors — cannab and selim_sef — used to filter out “bad” building footprint predictions from the neural nets.
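A minimal sketch of that filtering idea, using scikit-learn’s GradientBoostingClassifier on made-up per-footprint features (area, mean pixel confidence, and compactness are assumptions for illustration — the winners’ actual feature sets and implementations differ):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical per-footprint features: [area_m2, mean_pixel_confidence, compactness].
# In practice these would be measured from each predicted polygon and the
# segmentation probability map beneath it.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 1] > 0.5).astype(int)  # toy label: keep footprints with high confidence

clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
keep = clf.predict(X)  # 1 = keep the footprint, 0 = discard as a false positive
```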
Model tailoring to geospatial-specific (and related) problems
These algorithms taught us a lot about tailoring models to overhead imagery. Building pixels only account for 9.5% of the entire training dataset. Segmentation algorithms are trained to classify individual pixels as belonging to an object (here, a building) or background. In this case, an algorithm can achieve high pixel-wise accuracy by predicting “non-building” for everything! This causes algorithms trained with “standard” loss functions — such as binary cross-entropy — to collapse to predicting zeros (background) everywhere. Competitors overcame this through two approaches: 1. using the relatively new Focal Loss, which is a cross-entropy variant that hyper-penalizes low-confidence predictions, and 2. combining this loss function with an IoU-based loss such as Jaccard Index or Dice Coefficient. These loss functions guard against the “all-zero valley” by strongly penalizing under-prediction.
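The combined loss can be sketched in NumPy as follows. The equal weighting between the two terms (alpha, beta) is an assumption for illustration; the winning solutions used framework-specific implementations with their own weightings.

```python
import numpy as np

def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma, which
    down-weights easy, confident predictions and hyper-penalizes the rest."""
    pred = np.clip(pred, eps, 1 - eps)
    pt = np.where(target == 1, pred, 1 - pred)
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt)))

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|A∩B| / (|A| + |B|); strongly penalizes
    under-prediction, guarding against the all-background collapse."""
    inter = (pred * target).sum()
    return float(1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps))

def combined_loss(pred, target, alpha=1.0, beta=1.0):
    return alpha * focal_loss(pred, target) + beta * dice_loss(pred, target)
```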
An additional challenge with overhead imagery (and related problems like instance segmentation of small, densely packed objects) is object merging. The semantic segmentation approach described above does nothing to separate individual objects (a task called “instance segmentation”, which is what competitors were asked to do in this challenge). Instances are usually extracted from semantic masks by labeling contiguous objects as a single instance; however, semantic segmentation can produce pixel masks where nearby objects are erroneously connected to one another (see the example below). This can cause problems:
This is a problem if the use case requires an understanding of how many objects exist in an image or where their precise boundaries are. Several competitors addressed this challenge by creating multi-channel learning objective masks, like the one below:
Rather than just predicting building/no building for each pixel, the algorithm is now effectively predicting three things: 1. building/no building, 2. edge of a building/no edge, 3. contact point between buildings/no contact point. Post-processing to subtract the edges and contact points can allow “cleanup” of apposed objects, improving instance segmentation.
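A minimal sketch of this post-processing, using SciPy’s connected-component labeling (the channel layout and threshold here are illustrative assumptions, not any competitor’s exact code):

```python
import numpy as np
from scipy import ndimage

def extract_instances(pred, thresh=0.5):
    """pred: (H, W, 3) array of per-pixel scores for
    (building interior, building edge, inter-building contact).

    Subtracting the edge and contact scores before thresholding breaks the
    bridges between touching buildings, so connected-component labeling
    assigns one label per building instance."""
    interior = pred[..., 0] - pred[..., 1] - pred[..., 2]
    labels, n_instances = ndimage.label(interior > thresh)
    return labels, n_instances

# Two touching "buildings" merged into one blob in the interior channel,
# with the contact channel marking the seam between them:
pred = np.zeros((5, 8, 3))
pred[1:4, 1:7, 0] = 1.0  # merged interior blob
pred[1:4, 3:5, 2] = 1.0  # contact strip down the middle
labels, n = extract_instances(pred)  # n == 2; naive labeling would give 1
```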
Training and test time varied
The competition rules required that competitors’ algorithms could train in 7 days on 4 Titan Xp GPUs, and complete inference in no more than 1 day. The table above breaks down training and testing time. It’s noteworthy that many of these solutions are likely too slow to deploy in a product environment that requires constant, timely updates. Interestingly, individual models from the large ensembles could perhaps be used on their own without substantial degradation in performance (and with a dramatic increase in speed) — for example, cannab noted in his solution description that his best individual model scored nearly as well as the prize-winning ensemble.
Strengths and weaknesses of algorithms for off-nadir imagery analysis
We asked a few questions about the SpaceNet Off-Nadir Challenge winning algorithms:
- What fraction of each building did winning algorithms identify? I.e., how precise were the footprints?
- How did each algorithm perform across different look angles?
- How similar were the predictions from the different algorithms?
- Did building size influence the likelihood that a building would be identified?
These questions are explored in more detail at the CosmiQ Works blog, The DownlinQ. A summary of interesting points is below.
How precise were the footprints?
When we ran the SpaceNet Off-Nadir Challenge, we set an IoU threshold of 0.5 for building detection — meaning that of all of the pixels present between a ground truth footprint and a prediction, >50% had to overlap to be counted as a success. Depending upon the use case, this threshold may be higher or lower than is actually necessary. A low IoU threshold means that you don’t care how much of a building is labeled correctly, only that some part of it is identified. This works for counting objects, but doesn’t work if you need precise outlines (for example, to localize damage after a disaster). It’s important to consider this threshold when evaluating computer vision algorithms for product deployment: how precisely must objects be labeled for the use case?
We asked what would have happened to algorithms’ building recall — the fraction of ground truth buildings they identified — if we had changed this threshold. The results were striking:
There is little change in competitor performance if the threshold is set below 0.3 or so — of the buildings competitors found, most were localized at least that well. However, performance begins to drop beyond that point, and once the threshold reaches ~0.75, scores have dropped by 50%. This stark decline highlights another area where computer vision algorithms could be improved: instance-level segmentation accuracy for small objects.
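The threshold sweep behind that analysis is easy to sketch: given the best IoU each ground-truth building achieved against any prediction, recall at a threshold is just the fraction of buildings meeting it (the IoU values below are toy numbers, not competition data):

```python
import numpy as np

def recall_at_thresholds(best_ious, thresholds):
    """best_ious: the best IoU each ground-truth building achieved against
    any predicted footprint (0 when nothing overlapped it). Recall at
    threshold t is the fraction of buildings whose best IoU is >= t."""
    best_ious = np.asarray(best_ious)
    return np.array([(best_ious >= t).mean() for t in thresholds])

# Toy IoU distribution: most matched buildings sit around IoU 0.6-0.9.
ious = [0.0, 0.2, 0.55, 0.6, 0.7, 0.75, 0.8, 0.9]
recalls = recall_at_thresholds(ious, [0.25, 0.5, 0.75])
```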
Performance by look angle
Next, let’s examine how each competitor’s algorithm performed at every different look angle. We’ll look at three performance metrics: recall (the fraction of actual buildings identified), precision (the fraction of predicted buildings that corresponded to real buildings, not false positives), and F1 score, the competition metric that combines both of these features:
Unsurprisingly, the competitors had very similar performance in these graphs, consistent with their tight packing at the top of the leaderboard. Most notable is where separation did arise: the competitors were very tightly packed in the “nadir” range (0–25 degrees), and the only look angles with substantial separation between the top two (cannab and selim_sef) were those >45 degrees. cannab seems to have won on his algorithm’s performance on very off-nadir imagery!
One final note from these graphs: there are some odd spiking patterns in the middle look angle ranges. The angles with lower scores correspond to images taken facing South, where shadows obscure many features, whereas North-facing images had brighter sunlight reflections off of buildings (below figure reproduced from earlier as a reminder):
This pattern was even stronger in our baseline model. Look angle isn’t all that matters — look direction is also important!
Similarity between winning algorithms
We examined each building in the imagery and asked how many competitors successfully identified it. The results were striking:
Over 80% of buildings were identified by either zero or all five competitors in the nadir and off-nadir bins! This means that the algorithms only differed in their ability to identify about 20% of the buildings. Given the substantial difference in neural network architecture (and computing time needed to train and generate predictions from the different algorithms), we found this notable.
Performance vs. building size
The size of building footprints in this dataset varied dramatically. We scored competitors on their ability to identify everything larger than 20 square meters in extent, but did competitors perform equally well through the whole range? The graph below answers that question.
Even the best algorithm performed relatively poorly on small buildings. cannab identified only about 20% of buildings smaller than 40 square meters, even in images with look angle under 25 degrees off-nadir. This algorithm achieved its peak performance on buildings over 105 square meters in extent, but this only corresponded to about half of the objects in the dataset. It is notable, though, that this algorithm correctly identified about 90% of buildings with footprints larger than 105 square meters in nadir imagery.
Conclusion
The top five competitors solved this challenge very well, achieving excellent recall and relatively low false positive predictions. Though their neural net architectures varied, their solutions generated strikingly similar predictions, emphasizing that advancements in neural net architectures have diminishing returns for building footprint extraction and similar tasks — developing better loss functions, pre- and post-processing techniques, and optimizing solutions to specific challenges may provide more value. Object size can be a significant limitation for segmentation in overhead imagery, and look angle and direction dramatically alter performance. Finally, much more can be learned from examining the winning competitors’ code on GitHub and their descriptions of their solutions, and we encourage you to explore their solutions more!
What’s next?
We hope you enjoyed learning about off-nadir building footprint extraction from this challenge, and we hope you will explore the dataset for yourselves! There will be more SpaceNet Challenges coming soon — follow us for updates, and thank you for reading.
Model references:
- https://arxiv.org/abs/1505.04597
- https://arxiv.org/abs/1709.01507
- https://arxiv.org/abs/1707.01629
- https://arxiv.org/abs/1409.0575
- https://arxiv.org/abs/1512.03385
- https://arxiv.org/abs/1608.06993
- https://arxiv.org/abs/1602.07261
- https://arxiv.org/abs/1612.03144
- https://arxiv.org/abs/1703.06870
- https://arxiv.org/abs/1405.0312
- https://www.crowdai.org/challenges/mapping-challenge
- https://arxiv.org/abs/1409.1556
- https://arxiv.org/abs/1801.05746