Towards accelerating disaster response with automated analysis of overhead imagery

A review of the SpaceNet Challenge for off-nadir building footprint extraction

Nick Weir
Towards Data Science

SpaceNet’s mission is to accelerate geospatial machine learning; the initiative is supported by the SpaceNet member organizations. To learn more, visit https://spacenet.ai.

A series of images taken by DigitalGlobe’s WorldView-2 satellite, from the SpaceNet Multi-View Overhead Imagery (MVOI) dataset described below, illustrating how look angle changes image appearance.

In this post, we’ll describe the challenges associated with automated mapping from overhead imagery. The key takeaway: though algorithms are fantastic at mapping from “ideal” imagery taken directly overhead, the types of imagery that exist in urgent collection environments — such as after natural disasters — pose a currently unsolved problem for state-of-the-art computer vision algorithms.

The current state of affairs

As computer vision methods improve and overhead imagery becomes more accessible, scientists are exploring ways to unite these domains for many applications: among them, monitoring deforestation and tracking population dynamics in refugee situations. Rapid disaster response could also be aided by automated overhead image analysis: new maps are often essential after disasters, as infrastructure (e.g. roads) needed to coordinate response efforts may be disrupted. At present, this mapping is done manually by teams in government, the private sector, or volunteer organizations like the Humanitarian OpenStreetMap Team (HOT-OSM), which created a base map (roads and buildings) of Puerto Rico after Hurricane Maria at the request of the USA’s Federal Emergency Management Agency (FEMA). But manual labeling is time-consuming and labor-intensive: even with 5,300 mappers working on the project, the first base map of Puerto Rico was not delivered until over a month after the hurricane hit, and the project was not officially concluded for another month. This is by no means a criticism of the HOT-OSM team or their fantastic community of labelers — they had 950,000 buildings and 30,000 km of roads to label! Even a preliminary automated labeling step, corrected manually afterward, could improve map delivery time.

Computer vision-based map creation from overhead imagery has come a long way as deep learning models have grown from the nascency of AlexNet, implemented before TensorFlow even existed, to today’s advanced architectures such as Squeeze-and-Excitation Networks and Dual Path Networks, alongside advanced model training methods implemented in easy-to-use packages like Keras. With these developments, automated mapping challenges have seen steadily improving performance, as evidenced by the SpaceNet competition series: building extraction scores on imagery taken directly overhead improved almost three-fold from the first challenge in 2016 to the most recent challenge (discussed here) at the end of 2018.

Why don’t we automatically map after natural disasters?

A major barrier still exists to automated analysis of overhead imagery in disaster response scenarios: look angle.

Schematic representation of look angle.

A satellite’s location — and therefore the area its imagery can cover — is constrained by its orbit. In urgent collection situations where there isn’t time for the satellite to get directly overhead, this means taking an image at an angle — sometimes a substantial one. The first cloud-free, publicly available collection over San Juan, Puerto Rico after Hurricane Maria, taken 2 days after the hurricane, was acquired at a 52-degree off-nadir angle according to DigitalGlobe’s Discover platform. For comparison, the most off-nadir image in the animation at the top of this story was taken at 54 degrees.

Several features of so-called “off-nadir” imagery pose challenges for automated analysis:

Displacement and distortion

In off-nadir imagery, the tops of tall objects (trees, buildings, etc.) are displaced from their ground footprint. This makes segmentation challenging!

Buildings are perfectly outlined in the nadir looks, but the footprints are offset in the distorted off-nadir looks. Competitors’ solutions needed to account for this. Imagery courtesy of DigitalGlobe.

Shadows and varied lighting

In overhead imagery, views of an area can vary dramatically in appearance depending on shadows. See the example below, where buildings are obvious when sunlight reflects back at the satellite, but much less so when the image is collected from the shadowed side of buildings.

Two looks at the same buildings at nearly the same look angle, but from different sides of the city. It’s visually much harder to see buildings in the South-facing imagery due to shadows. Imagery courtesy of DigitalGlobe.

Occlusion of objects

It’s hard to identify an object if you can’t even see it in the image! In off-nadir looks, tall objects block out other structures:

Occlusion can make it impossible to see some objects in off-nadir imagery. A building whose roof is visible (though cloaked in shadow) in nadir imagery (left, red arrow) is obscured by the skyscraper in off-nadir imagery (right).

Resolution

Images taken at a greater angle cover more ground but still contain the same number of pixels, reducing ground resolution. In the SpaceNet Multi-View Overhead Imagery (MVOI) dataset described below, images taken “at nadir” had a resolution of 0.51 m/pixel, whereas resolution in the most off-nadir images dropped to 1.67 m/pixel. This is obvious in the animation at the top of this story — as the image gets more off-nadir, resolution gets worse (a rough approximation of the effect is sketched below). To find out how much these factors influence model performance, we need a well-labeled dataset that controls for all variables except look angle.
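As an aside, a simple flat-earth approximation captures most of the resolution penalty: the slant range to the ground grows with the look angle, stretching each pixel’s ground footprint. This back-of-the-envelope sketch is not the processing applied to the dataset, and it ignores Earth curvature (which makes real imagery degrade even faster):

```python
import math

def approx_gsd(nadir_gsd_m, look_angle_deg):
    """Rough flat-earth estimate of ground sample distance (GSD) at a given
    look angle: cross-track GSD grows ~1/cos(theta), and GSD along the look
    direction grows ~1/cos(theta)^2 due to the longer slant range plus
    foreshortening. Ignores Earth curvature and sensor specifics."""
    theta = math.radians(look_angle_deg)
    return nadir_gsd_m / math.cos(theta), nadir_gsd_m / math.cos(theta) ** 2

# approx_gsd(0.51, 54) -> (~0.87, ~1.48) m/pixel; the real very off-nadir
# MVOI imagery is ~1.67 m/pixel, since Earth curvature makes the true
# ground incidence angle larger than the satellite's off-nadir angle.
```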

The SpaceNet Multi-View Overhead Imagery (MVOI) Dataset

To explore how much look angle influences model performance, we released the SpaceNet MVOI Dataset, which is open source and freely available on AWS S3 — all you need is an AWS account! Download instructions and additional metadata are available here.

The imagery

The dataset is derived from a unique series of collections from a DigitalGlobe WorldView-2 satellite during a single pass over Atlanta, GA USA. The satellite took 27 images, ranging from 7 degrees to 54 degrees off-nadir, including both North- and South-facing looks:

The locations from which each collect was taken to generate the SpaceNet 4 Off-Nadir dataset. This not-to-scale representation is simplified: in reality, the satellite did not pass directly over Atlanta, but nearby. See this paper and the dataset metadata for additional details.

Each of these collects covers the same 665 square kilometers in and around Atlanta. Because they were all acquired within 5 minutes of one another, changes to lighting, land usage, and other temporal factors are minimized. Really, the only thing that varies from collect to collect is look angle.

Sample images from the SpaceNet Multi-View Overhead Imagery (MVOI) dataset. The look angle is shown on the left, along with a bin ID: Nadir, ≤25 degrees off-nadir; Off-nadir, 26–40 degrees; Very off-nadir, >40 degrees off-nadir. Negative numbers correspond to South-facing looks, positive numbers to North-facing looks. Imagery courtesy of DigitalGlobe.

The exact same geography looks very different depending upon the look direction and angle. Understanding the effect this has upon algorithm performance has applications not only in remote sensing analysis, but also in the general computer vision domain.
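These look angle bins (Nadir/Off-nadir/Very off-nadir) recur throughout the analysis below. Assigning a collect to a bin is just a threshold on the magnitude of its look angle, since the sign only encodes look direction; a minimal helper (the function name is my own) might look like this:

```python
def look_angle_bin(look_angle_deg):
    """Map a signed look angle (negative = South-facing) to its MVOI bin."""
    angle = abs(look_angle_deg)
    if angle <= 25:
        return "Nadir"
    elif angle <= 40:
        return "Off-nadir"
    return "Very off-nadir"
```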

The labels

Radiant Solutions, a SpaceNet partner, undertook a rigorous labeling campaign to draw polygon footprints around 126,747 buildings in the dataset. We used footprints instead of bounding boxes because bounding boxes are often insufficient for foundational mapping — for example, if a labeled building appears to overlap a road, it’s important to know that the building actually does so and that the overlap isn’t an artifact of a loose bounding box. We performed careful quality control to ensure that all buildings were labeled, as high-quality labels are critical for training computer vision models (after all, how can an algorithm learn to find buildings if only half of them are marked as such in the dataset?). The building footprints range in size from 20 square meters to over 2,000 square meters, and density varies dramatically across different geographies within the dataset:

A histogram showing the number of buildings in each unique geography in the dataset. X axis: the number of buildings apparent in the image, which ranges from 0 (mostly forested areas) to 297 (a very dense suburb). Y axis: the number of unique geographies with that number of buildings (multiply by 27 collects for the total number of images).

This variability poses a challenge of its own, as the algorithm can’t learn that roughly the same number of objects should be present in each image. This is particularly difficult for object detection (bounding box prediction) algorithms, which often require an estimate of the number of objects per image as a hyperparameter.

After labeling was complete, we re-sampled the images so that they all cover the same ground area per pixel, then broke the images and labels into 900-by-900 pixel chips to make model input creation easier (a rough sketch of this tiling step appears after this paragraph). Once processing was complete, we split the dataset into three parts: a large training set and two smaller test sets, one for public testing and one for final competition validation. We released the imagery and labels for the training set, released the imagery alone for the public testing set, and held back everything from the final test set.
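For readers who want to reproduce something similar on their own imagery, here is a rough sketch of the chipping step using rasterio. This is not the tooling SpaceNet actually used; it assumes the raster has already been resampled to a common resolution, and it skips partial edge tiles and georeferencing of the outputs:

```python
import rasterio
from rasterio.windows import Window

def tile_image(src_path, tile_size=900):
    """Yield tile_size x tile_size pixel chips (and their windows) from a GeoTIFF."""
    with rasterio.open(src_path) as src:
        for row_off in range(0, src.height - tile_size + 1, tile_size):
            for col_off in range(0, src.width - tile_size + 1, tile_size):
                window = Window(col_off, row_off, tile_size, tile_size)
                # read all bands within the window: array shape (bands, 900, 900)
                yield src.read(window=window), window
```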

The SpaceNet 4: Off-Nadir Building Footprint Extraction Challenge

Competition summary

Once the dataset was complete, we put algorithms to the test through the public SpaceNet Off-Nadir Building Footprint Extraction Challenge hosted by TopCoder. There, competitors trained algorithms to identify building footprints in images at every different look angle, competing for $50,000 of prize money.

Scoring

A common assessment tool for segmentation tasks is pixel-by-pixel scoring, but we asked competitors for more: actual polygon footprints of the buildings. These are much more useful in a deployment context, where one often needs to know where individual buildings are. To score their polygon predictions, we used an Intersection over Union (IoU) metric:

We scored algorithms by an instance IoU metric. Algorithms generated building proposals for each image, and a prediction was scored as a success if the intersection between a ground truth footprint and the predicted footprint was greater than 50% of the union of those footprints. All other predictions were scored as failures (false positives).

We set an IoU threshold of 0.5 for positive predictions, and after scoring, calculated the number of true positives, false positives, and false negatives in each look angle group (Nadir/Off-nadir/Very off-nadir). We calculated F1 score for each, as that metric penalizes both false positive and false negative predictions. We then averaged those three F1 scores for the final competition score.
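To make the procedure concrete, here is a simplified sketch of the matching and scoring logic using shapely polygons. It is not the challenge’s official scoring code: among other simplifications, it matches predictions greedily in the order given rather than by confidence, and it scores a single image rather than aggregating counts across a whole look angle bin before computing F1:

```python
from shapely.geometry import box

def iou(poly_a, poly_b):
    """Intersection over union of two shapely polygons."""
    union = poly_a.union(poly_b).area
    return poly_a.intersection(poly_b).area / union if union > 0 else 0.0

def score_image(ground_truth, predictions, threshold=0.5):
    """Greedy one-to-one matching: each ground truth footprint can be
    claimed by at most one prediction with IoU >= threshold."""
    matched, tp = set(), 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            score = iou(pred, gt)
            if score > best_iou:
                best_iou, best_idx = score, i
        if best_iou >= threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# tiny usage example with axis-aligned squares standing in for real footprints
print(score_image(ground_truth=[box(0, 0, 10, 10)], predictions=[box(1, 1, 11, 11)]))
# IoU of the pair is ~0.68, so it counts as a true positive
```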

Nearly 250 competitors registered for the competition, producing over 500 unique submissions. The results are below!

The winning algorithms

Score summary

First, let’s take a look at the scores from the top five competitors:

Competitors’ scores in the SpaceNet 4: Off-Nadir Building Detection Challenge compared to the baseline model. Each score represents the SpaceNet Metric for the entire image set (Overall) or subsets of the imagery with specific look angles: Nadir, 7–25 degrees; Off-nadir, 26–40 degrees; Very off-nadir, >40 degrees.

There are a few key takeaways from these results that are worth highlighting:

  1. Building footprint extraction scores have improved dramatically since the first SpaceNet building identification competition two years ago, where the winning algorithm achieved an IoU-F1 of ~0.3. Geospatial computer vision algorithms have come a long way, and fast!
  2. Building footprint extraction from off-nadir imagery (26–40 degree look angles) can be achieved nearly as well as from nadir imagery. This type of knowledge can inform imagery acquisition decisions, since we now know that perfect looks aren’t required for this task. That said,
  3. Building footprint extraction is still challenging in very off-nadir imagery (>40 degree look angles). Performance was about 25% lower on those images. We’ll come back to why that might be shortly.

Summary of the winning algorithms

The top competitors provided Dockerized versions of their algorithms, including full solution code and written summaries, which have been open sourced here. Along with those summaries, we’ve examined the algorithms each competitor used and found a few interesting details to highlight.

A summary of the models used by the top 5 competitors in the SpaceNet Off-Nadir Building Footprint Extraction Challenge. See the end of this post for numbered reference links.

Let’s break down a few key details here:

A variety of deep learning models (and very little classical machine learning):

We were struck by how many different deep learning model architectures were used, and how similar the scores were among them. Some competitors used substantial ensembles — over 20 independently trained models — to generate predictions of which pixels corresponded to buildings, and then averaged those predictions for their final results. Another competitor (number13) trained a different set of weights for each individual collect, then generated predictions for each image using the corresponding look’s model weights.

We weren’t terribly surprised that every winning algorithm used deep learning. This is consistent with general trends in computer vision: almost all high-performance segmentation algorithms utilize deep learning. The only “classical” machine learning algorithms used were Gradient Boosted Trees that the top two competitors — cannab and selim_sef — used to filter out “bad” building footprint predictions from the neural nets.

Model tailoring to geospatial-specific (and related) problems

These algorithms taught us a lot about tailoring models to overhead imagery. Segmentation algorithms are trained to classify individual pixels as belonging to an object (here, a building) or to background, and building pixels account for only 9.5% of the entire training dataset. That means an algorithm can achieve high pixel-wise accuracy by predicting “non-building” for everything! This causes algorithms trained with “standard” loss functions — such as binary cross-entropy — to collapse to predicting zeros (background) everywhere. Competitors overcame this through two approaches: 1. using the relatively new Focal Loss, a cross-entropy variant that down-weights easy, confidently correct pixels so that hard, misclassified pixels dominate the loss, and 2. combining that loss with an IoU-based loss such as the Jaccard Index or Dice Coefficient, which guards against the “all-zero valley” by strongly penalizing under-prediction.
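Each competitor’s exact loss formulation and weighting differed, but the core idea combines a focal term with a soft Dice (or Jaccard) term. Below is a minimal PyTorch sketch of that combination; the function names and default weights are illustrative, not any competitor’s actual code:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss: down-weights easy, confidently correct pixels so that the
    rare building pixels and hard examples dominate the gradient."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def soft_dice_loss(logits, targets, smooth=1.0):
    """Differentiable Dice loss: directly penalizes poor overlap, so predicting
    all zeros (no buildings anywhere) is strongly punished."""
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum()
    return 1.0 - (2.0 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)

def combined_loss(logits, targets, focal_weight=1.0, dice_weight=1.0):
    return (focal_weight * binary_focal_loss(logits, targets)
            + dice_weight * soft_dice_loss(logits, targets))
```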

An additional challenge with overhead imagery (and related problems like instance segmentation of small, densely packed objects) is object merging. The semantic segmentation approach described above does nothing to separate individual objects (a task called “instance segmentation”, which is what competitors were asked to do in this challenge). Instances are usually extracted from semantic masks by labeling contiguous objects as a single instance; however, semantic segmentation can produce pixel masks where nearby objects are erroneously connected to one another (see the example below). This can cause problems:

Building instance segmentation from attached (left) vs. separated (right) semantic segmentation output masks. Red arrows: poor predictions connect very closely apposed buildings.

This is a problem if the use case requires an understanding of how many objects exist in an image or where their precise boundaries are. Several competitors addressed this challenge by creating multi-channel learning objective masks, like the one below:

A sample pixel mask taken from cannab’s solution description. Black is background, blue is the first channel (building footprints), pink is the second channel (building boundaries), and green is the third channel (points very close to two or more different buildings). cannab’s algorithm learned to predict outputs in this shape, and in post-processing he subtracted the boundaries and contact points from the predicted footprints to separate instances more effectively.

Rather than just predicting building/no building for each pixel, the algorithm is now effectively predicting three things: 1. building/no building, 2. edge of a building/no edge, 3. contact point between buildings/no contact point. Post-processing to subtract the edges and contact points can allow “cleanup” of apposed objects, improving instance segmentation.
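As a simplified illustration of this post-processing idea (the winning solutions also used watershed-style steps, area filters, and careful thresholds, none of which are shown here), one can subtract the predicted boundary and contact channels from the footprint channel and then label connected components as separate buildings:

```python
from scipy import ndimage

def masks_to_instances(footprint, boundary, contact, threshold=0.5):
    """footprint, boundary, contact: per-pixel probability maps (same shape).
    Returns a labeled array where each building interior gets its own integer ID."""
    interiors = (footprint > threshold) & ~(boundary > threshold) & ~(contact > threshold)
    labels, n_buildings = ndimage.label(interiors)
    return labels, n_buildings
```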

Training and test time varied

The competition rules required that competitors’ algorithms train in 7 days or less on 4 Titan Xp GPUs and complete inference in no more than 1 day. The table above breaks down training and testing time. It’s noteworthy that many of these solutions are likely too slow to deploy in a production environment that requires constant, timely updates. Interestingly, individual models from the large ensembles could perhaps be used on their own without substantial degradation in performance (and with a dramatic increase in speed) — for example, cannab noted in his solution description that his best individual model scored nearly as well as the prize-winning ensemble.

Strengths and weaknesses of algorithms for off-nadir imagery analysis

We asked a few questions about the SpaceNet Off-Nadir Challenge winning algorithms:

  1. What fraction of each building did winning algorithms identify? I.e., how precise were the footprints?
  2. How did each algorithm perform across different look angles?
  3. How similar were the predictions from the different algorithms?
  4. Did building size influence the likelihood that a building would be identified?

These questions are explored in more detail at the CosmiQ Works blog, The DownlinQ. A summary of interesting points is below.

How precise were the footprints?

When we ran the SpaceNet Off-Nadir Challenge, we set an IoU threshold of 0.5 for building detection — meaning the overlap between a ground truth footprint and a prediction had to exceed 50% of their combined (union) area to count as a success. Depending upon the use case, this threshold may be higher or lower than is actually necessary. A low IoU threshold means that you don’t care how much of a building is labeled correctly, only that some part of it is identified. This works for counting objects, but doesn’t work if you need precise outlines (for example, to localize damage after a disaster). It’s important to consider this threshold when evaluating computer vision algorithms for deployment: how precisely must objects be labeled for the use case?

We asked what would have happened to algorithms’ building recall — the fraction of ground truth buildings they identified — if we had changed this threshold. The results were striking:

Recall, or the fraction of actual buildings identified by algorithms, depends on the IoU threshold. Some algorithms identified part of many buildings, but not enough of them to be counted as successful identifications at our threshold of 0.5. The inset shows the range of IoU thresholds where XD_XD’s algorithm (orange) went from being one of the best in the top five to one of the worst among the prize-winners.

There is little change in competitor performance if the threshold is set below 0.3 or so — of the buildings competitors found, most achieved at least that IoU. Beyond that point, however, performance begins to drop, and once the threshold reaches ~0.75, scores have fallen by 50%. This stark decline highlights another area where computer vision algorithms could be improved: instance-level segmentation accuracy for small objects.
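This kind of sensitivity analysis is easy to run once you have, for every ground truth building, the IoU of its best-overlapping prediction. A minimal sketch, assuming those per-building IoUs have already been computed (with 0.0 for buildings no prediction touched):

```python
import numpy as np

def recall_vs_iou_threshold(best_match_ious, thresholds=np.arange(0.1, 0.95, 0.05)):
    """Recall at each IoU threshold, given each ground truth building's best-match IoU."""
    ious = np.asarray(best_match_ious, dtype=float)
    return {round(float(t), 2): float((ious >= t).mean()) for t in thresholds}
```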

Performance by look angle

Next, let’s examine how each competitor’s algorithm performed at every different look angle. We’ll look at three performance metrics: recall (the fraction of actual buildings identified), precision (the fraction of predicted buildings that corresponded to real buildings, not false positives), and F1 score, the competition metric that combines both of these features:

F1 score, recall, and precision for the top five competitors stratified by look angle. Though F1 scores and recall are relatively tightly packed except in the most off-nadir look angles, precision varied dramatically among competitors.

Unsurprisingly, the competitors had very similar performance in these graphs, consistent with their tight packing at the top of the leaderboard. Most notable is where separation did arise: the competitors were very tightly packed in the “nadir” range (7–25 degrees), and the only look angles with substantial separation between the top two (cannab and selim_sef) were those >45 degrees. cannab seems to have won on his algorithm’s performance on very off-nadir imagery!

One final note from these graphs: there are some odd spiking patterns in the middle look angle ranges. The angles with lower scores correspond to images taken facing South, where shadows obscure many features, whereas North-facing images had brighter sunlight reflections off of buildings (the figure below is reproduced from earlier as a reminder):

Two looks at the same buildings at nearly the same look angle, but from different sides of the city. It’s visually much harder to see buildings in the South-facing imagery, and apparently the same is true for neural nets! Imagery courtesy of DigitalGlobe.

This pattern was even stronger in our baseline model. Look angle isn’t all that matters — look direction is also important!

Similarity between winning algorithms

We examined each building in the imagery and asked how many competitors successfully identified it. The results were striking:

Histograms showing how many competitors identified each building in the dataset, stratified by look angle subset. The vast majority of buildings were identified by all or none of the top five algorithms — very few were identified by only some of the top five.

Over 80% of buildings were identified by either zero or all five competitors in the nadir and off-nadir bins! This means that the algorithms only differed in their ability to identify about 20% of the buildings. Given the substantial differences in neural network architectures (and in the computing time needed to train and generate predictions with the different algorithms), we found this notable.
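Computing this agreement analysis is straightforward once each competitor’s matched buildings are known. A minimal sketch, assuming you have the set of ground truth building IDs each competitor matched at IoU ≥ 0.5 (the data structures here are my own, not the analysis code we actually used):

```python
from collections import Counter

def agreement_histogram(ground_truth_ids, matches_by_competitor):
    """matches_by_competitor: competitor name -> set of ground truth building IDs
    that competitor identified. Returns counts[n] = number of buildings found by
    exactly n competitors (counts[0] = found by nobody)."""
    counts = Counter()
    for building_id in ground_truth_ids:
        n_found = sum(building_id in found for found in matches_by_competitor.values())
        counts[n_found] += 1
    return counts
```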

Performance vs. building size

The size of building footprints in this dataset varied dramatically. We scored competitors on their ability to identify everything larger than 20 square meters in extent, but did competitors perform equally well through the whole range? The graph below answers that question.

Building recall (y axis) stratified by building footprint size (x axis). The blue, orange, and green lines show the fraction of building footprints of a given size that were identified. The red line denotes the number of building footprints of that size in the dataset (right y axis).

Even the best algorithm performed relatively poorly on small buildings. cannab identified only about 20% of buildings smaller than 40 square meters, even in images with look angle under 25 degrees off-nadir. This algorithm achieved its peak performance on buildings over 105 square meters in extent, but this only corresponded to about half of the objects in the dataset. It is notable, though, that this algorithm correctly identified about 90% of buildings with footprints larger than 105 square meters in nadir imagery.
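Stratifying recall by footprint area is another analysis that is simple to reproduce. A sketch, assuming per-building areas and a boolean “was this building matched” flag are available; the bin edges are illustrative, loosely chosen from the sizes discussed above:

```python
import numpy as np

def recall_by_size(areas_m2, detected, bin_edges=(20, 40, 105, 500, 2000, np.inf)):
    """Recall within each footprint-area bin (square meters)."""
    areas = np.asarray(areas_m2, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    recall = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (areas >= lo) & (areas < hi)
        recall[(lo, hi)] = float(detected[in_bin].mean()) if in_bin.any() else float("nan")
    return recall
```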

Conclusion

The top five competitors solved this challenge very well, achieving excellent recall with relatively few false positive predictions. Though their neural net architectures varied, their solutions generated strikingly similar predictions, emphasizing that advancements in neural net architectures have diminishing returns for building footprint extraction and similar tasks — developing better loss functions, pre- and post-processing techniques, and solutions optimized for a specific challenge may provide more value. Object size can be a significant limitation for segmentation in overhead imagery, and look angle and direction dramatically alter performance. Finally, much more can be learned from examining the winning competitors’ code on GitHub and their solution descriptions, and we encourage you to explore them!

What’s next?

We hope you enjoyed learning about off-nadir building footprint extraction from this challenge, and we hope you will explore the dataset for yourselves! There will be more SpaceNet Challenges coming soon — follow us for updates, and thank you for reading.

Model references:

  1. https://arxiv.org/abs/1505.04597
  2. https://arxiv.org/abs/1709.01507
  3. https://arxiv.org/abs/1707.01629
  4. https://arxiv.org/abs/1409.0575
  5. https://arxiv.org/abs/1512.03385
  6. https://arxiv.org/abs/1608.06993
  7. https://arxiv.org/abs/1602.07261
  8. https://arxiv.org/abs/1612.03144
  9. https://arxiv.org/abs/1703.06870
  10. https://arxiv.org/abs/1405.0312
  11. https://www.crowdai.org/challenges/mapping-challenge
  12. https://arxiv.org/abs/1409.1556
  13. https://arxiv.org/abs/1801.05746

Data Scientist at CosmiQ Works and SpaceNet 4 Challenge Director at SpaceNet LLC. Advancing computer vision and ML analysis of geospatial imagery.