Identifying gaps in OpenStreetMap coverage through machine learning

Published in Towards Data Science, Nov 6, 2019

This post is by Ran Goldblatt, New Light Technologies, and Nicholas Jones, GFDRR Labs/World Bank.

OpenStreetMap (OSM) is one of the wonders of the digital age: founded in 2004, it has allowed volunteers to map some 42 million buildings and 1 million km of road — creating a valuable resource that often surpasses official maps for completeness and ease-of-use.

Few circumstances make complete maps more crucial than disaster preparedness and response. Hurricane Dorian in the Bahamas is just the latest example, with relief agencies relying on OSM data to locate affected houses, schools and clinics.

But how can we evaluate the completeness of OSM coverage in a given area? Read on for the key steps we used, in the case of Haiti, to distinguish fully mapped areas from those where additional mapping (for example through crowd-sourcing campaigns) would pay the greatest dividends.

Assessing OSM completeness

A recent study from McGill University finds that OSM captures about 83% of the world’s streets and roads, but with greater coverage gaps in countries with weak governance and less internet access. In this post, we will focus specifically on building footprints.

Already, you can compare OSM building footprints with the Global Human Settlement Layer using Humanitarian OpenStreetMap Team’s Gap Detection map. With a machine learning workflow, we can go a little further. We will model gaps in OSM coverage at higher resolution, utilizing free and open-source satellite data, evaluate model accuracy, and predict under-mapped areas.

1. Acquire and inspect OSM data

For exploratory visualization of OSM data in a region, the Python library OSMnx is an invaluable resource: it facilitates precise calls to the Overpass API, pulling the segments of data that you need into a notebook environment.

In the plot below, we show building footprints for three densely populated sections of Port-au-Prince. The coordinates for these square-mile plots were chosen simply by dropping a pin into Google Maps where satellite imagery shows dense building coverage.

Clearly, Port-au-Prince has seen significant mapping efforts in some districts but not yet in others. For a full analysis, we acquire the building footprint data for Haiti as a whole (for example, from a download server such as Geofabrik).

Python code: Download OSM building footprints for selected locations in Haiti
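
A minimal sketch of such a download is shown here, assuming OSMnx 1.6 or later (where the relevant call is `features_from_point`) and network access to the Overpass API. The three pin locations are hypothetical placeholders, since the post does not list the coordinates actually used, and the helper names are illustrative.

```python
# Hypothetical pin locations (lat, lon) in Port-au-Prince. The post does not
# give the original coordinates, so these are placeholders only.
SAMPLE_POINTS = {
    "site_a": (18.5470, -72.3390),
    "site_b": (18.5330, -72.3290),
    "site_c": (18.5790, -72.2950),
}

METERS_PER_MILE = 1609.344


def half_side_m(side_miles: float = 1.0) -> float:
    """Distance from the centre of a square plot to its edge, in metres."""
    return side_miles * METERS_PER_MILE / 2.0


def download_buildings(lat: float, lon: float, side_miles: float = 1.0):
    """Fetch OSM building footprints in a square plot centred on (lat, lon).

    Needs network access and OSMnx >= 1.6 (older releases name this call
    geometries_from_point). Returns a GeoDataFrame of features tagged
    building=*.
    """
    import osmnx as ox  # imported lazily: only the download needs it

    # dist is measured from the centre point to each side of the bounding box
    return ox.features_from_point(
        (lat, lon), tags={"building": True}, dist=half_side_m(side_miles)
    )
```

Calling `download_buildings(*SAMPLE_POINTS["site_a"])` would return one GeoDataFrame per plot, which can then be drawn with its `.plot()` method to compare coverage between districts.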

Building footprints: Three square mile plots for dense areas of Port-au-Prince (note differential coverage)

2. Build a set of predictive features

A number of geospatial layers may prove predictive of building density, particularly those derived from free, openly accessible and up-to-date satellite imagery. Existing products like the Global Human Settlement Layer (GHSL) do a great job at delineating urban areas, but they don’t capture any change since their release date — bringing live satellite data into our workflow remedies this.

We assessed several remote sensing measures as potential predictors of building footprint coverage, including:

• Intensity of light emitted at night (VIIRS);

• Vegetation and built-up area spectral indices derived from Sentinel-2 imagery (e.g. NDVI, NDBI, SAVI);

• Surface texture (based on Synthetic Aperture Radar data from Sentinel-1);

• Elevation and slope;

• Other OSM-derived layers, including density of road junctions.
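
The spectral indices in the list above are simple band ratios. The sketch below uses their standard definitions on surface reflectance values for a single pixel. Mapping the inputs to Sentinel-2 bands B8 (near infrared), B4 (red) and B11 (shortwave infrared) is the usual convention, though the post does not say which bands were used, and the soil factor of 0.5 for SAVI is the common default rather than a value taken from the post.

```python
def normalized_difference(a: float, b: float) -> float:
    """(a - b) / (a + b), returning 0.0 when both bands are zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: high over dense vegetation."""
    return normalized_difference(nir, red)


def ndbi(swir: float, nir: float) -> float:
    """Normalized Difference Built-up Index: high over built-up surfaces."""
    return normalized_difference(swir, nir)


def savi(nir: float, red: float, soil_factor: float = 0.5) -> float:
    """Soil-Adjusted Vegetation Index: NDVI with a soil brightness correction."""
    return (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)
```

For a pixel with reflectances of 0.4 (near infrared) and 0.1 (red), `ndvi(0.4, 0.1)` is 0.6 and `savi(0.4, 0.1)` is 0.45.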

As is generally the case in machine learning, better-quality features mean better prediction accuracy and less noise. We evaluated sixteen predictor features, making use of Google Earth Engine (GEE) to efficiently create and aggregate the layers derived from remote sensing. The snippet below highlights the ease and power afforded by GEE: here we pull in the last three years of VIIRS imagery, a high-resolution dataset available since 2012 that offers great benefits as a proxy for localized economic growth, and map the median night-light intensity across Haiti.

Google Earth Engine code (Javascript): Median night-light intensity over Haiti, 2015–2018
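
The original snippet is JavaScript. As a rough equivalent, here is a sketch using the Earth Engine Python API instead. It assumes the `earthengine-api` package is installed and authenticated, and that the monthly VIIRS composites and simplified country boundaries are the catalog assets named in the comments; the post does not state which assets were used. Reading the caption's date range as 1 January 2015 up to (not including) 1 January 2018 is also an assumption.

```python
def date_window(start_year: int = 2015, end_year: int = 2018) -> tuple[str, str]:
    """ISO dates for filterDate, whose end date is exclusive.

    With the defaults this covers the three full years 2015, 2016 and 2017.
    """
    return f"{start_year}-01-01", f"{end_year}-01-01"


def median_night_lights(start_year: int = 2015, end_year: int = 2018):
    """Median VIIRS radiance over Haiti as an ee.Image (needs GEE access)."""
    import ee  # imported lazily: requires earthengine-api and authentication

    ee.Initialize()  # newer client versions may also need a project argument
    start, end = date_window(start_year, end_year)

    # Assumed asset IDs from the public Earth Engine catalog
    haiti = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017").filter(
        ee.Filter.eq("country_na", "Haiti")
    )
    viirs = (
        ee.ImageCollection("NOAA/VIIRS/DNB/MONTHLY_V1/VCMCFG")
        .filterDate(start, end)
        .select("avg_rad")  # average radiance band
    )
    # Per-pixel median across all monthly composites, clipped to the country
    return viirs.median().clip(haiti.geometry())
```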

3. Create training and test data

We divide the territory into a fishnet of cells (QGIS can do this). We set the cell size to 500 m × 500 m, but any other size works. Our goal is to predict the coverage of OSM building footprints (the total footprint area) in each cell from the predictors.

To create training data, we manually tag grid cells in which we judge at least 75% of the buildings to be mapped in OSM (we rely on a high-resolution satellite image as a base layer for this estimate). About 1,600 cells were tagged as fully mapped.
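
If you prefer to script the grid rather than build it in QGIS, a fishnet can be generated in a few lines. The sketch below is one possible approach, not the workflow used in the post. It assumes the bounding box is given in a projected coordinate system with metre units (for example a UTM zone), and the `Cell` type and `fishnet` function are illustrative names.

```python
import math
from typing import Iterator, NamedTuple


class Cell(NamedTuple):
    col: int
    row: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def fishnet(
    xmin: float, ymin: float, xmax: float, ymax: float, size: float = 500.0
) -> Iterator[Cell]:
    """Yield square cells of side `size` covering the bounding box, row by row.

    Cells on the upper and right edges may extend past the bounding box so
    that the whole area is covered.
    """
    n_cols = math.ceil((xmax - xmin) / size)
    n_rows = math.ceil((ymax - ymin) / size)
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = xmin + col * size
            y0 = ymin + row * size
            yield Cell(col, row, x0, y0, x0 + size, y0 + size)
```

A 1,200 m by 900 m box, for instance, yields a grid of 3 columns and 2 rows, with the last column and row overhanging the box.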

We then take 70% of these cells as training data and keep the remainder as test data, which remains unseen during model training.
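
The split itself can be sketched as follows, assuming each tagged cell has an identifier. The fixed random seed is an illustrative choice to make the split reproducible; the post does not describe how cells were assigned.

```python
import random


def split_cells(cell_ids, train_fraction: float = 0.7, seed: int = 42):
    """Shuffle cell ids reproducibly and split them into (train, test) lists."""
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# With about 1,600 tagged cells this gives 1,120 training and 480 test cells
train_ids, test_ids = split_cells(range(1600))
```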

Left: Complete cells are tagged to create training and test data; Right: Remote sensing layers such as night-lights intensity (pictured) serve as predictive features.

4. Build and evaluate model

Our training data comprises approximately 1,100 cells across the territory (70% of the roughly 1,600 tagged cells) that we judged to be close to fully mapped in OSM (i.e. at least 75% of buildings appear in OSM according to visual inspection).

A multiple linear regression showed that nine of the variables, taken together, explain up to 82% of the variation in OSM building footprint area per cell. Together these variables predict far better than any single variable: for example, World Settlement Footprint alone explains only 62% of the variation.
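
The "share of variation explained" is presumably the coefficient of determination, R²; that reading is an assumption, as the post does not name the metric. Under that assumption, a figure of 82% corresponds to an R² of 0.82, which can be computed from actual and predicted footprint areas like this:

```python
def r_squared(actual, predicted) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up footprint areas (square metres) for four cells, for illustration
actual_m2 = [1000, 2000, 3000, 4000]
predicted_m2 = [1100, 1900, 3200, 3800]
score = r_squared(actual_m2, predicted_m2)
```

For these illustrative numbers the residual sum of squares is 100,000 against a total of 5,000,000, so `score` is 0.98.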

Model evaluation: actual vs. predicted building footprint area

Using a Random Forest algorithm, we see higher prediction accuracy: the model explains 89% of the variation in OSM building footprint area per cell. The predictors with the highest importance in the model are Global Urban Footprint and World Settlement Footprint, followed by NDBI, number of road junctions, and VIIRS night-time lights.
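
The post does not say which implementation was used. As one possibility, the sketch below fits a random forest with scikit-learn's `RandomForestRegressor` and ranks predictors by importance. The number of trees, the seed and the function names are illustrative assumptions, and the inputs are expected to be arrays with one row per grid cell and one column per predictor.

```python
def rank_importances(feature_names, importances):
    """Pair feature names with importance scores, highest first."""
    return sorted(
        zip(feature_names, importances), key=lambda pair: pair[1], reverse=True
    )


def fit_forest(X_train, y_train, X_test, y_test, feature_names, seed: int = 0):
    """Fit a random forest and return (test R^2, ranked feature importances)."""
    # Imported lazily so rank_importances works without scikit-learn installed
    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    test_r2 = model.score(X_test, y_test)  # R^2 on the held-out cells
    return test_r2, rank_importances(feature_names, model.feature_importances_)
```

Scoring on the held-out test cells, rather than on the training cells, is what makes the reported accuracy a fair estimate of how the model behaves on cells it has not seen.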

Output: identified OSM gaps

Having found our model to work sufficiently well, we apply it to predict OSM building footprint area for all of Haiti, and flag cells where the predicted footprint area (what we would expect if the cell were fully mapped) is well above what OSM actually contains.
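
The flagging step can be sketched as a comparison between predicted and mapped footprint area per cell. Both thresholds below are hypothetical values chosen for illustration, not figures from the post: cells where the model expects very little built-up area are skipped, and the rest are flagged when OSM holds only a small share of the predicted area.

```python
def flag_gaps(cells, min_predicted_m2: float = 5000.0, max_mapped_share: float = 0.25):
    """Return ids of cells that look under-mapped.

    `cells` is an iterable of (cell_id, predicted_m2, osm_m2) tuples.
    """
    flagged = []
    for cell_id, predicted_m2, osm_m2 in cells:
        if predicted_m2 < min_predicted_m2:
            continue  # the model expects few buildings here, so nothing to flag
        if osm_m2 / predicted_m2 <= max_mapped_share:
            flagged.append(cell_id)
    return flagged


# Made-up cells: (id, predicted footprint area, footprint area mapped in OSM)
example_cells = [
    ("a", 20000, 18000),  # well mapped
    ("b", 20000, 1000),   # dense area, barely mapped
    ("c", 1000, 0),       # little expected, ignored
    ("d", 8000, 0),       # built-up area with no OSM buildings
]
gaps = flag_gaps(example_cells)
```

Here cells "b" and "d" are flagged as candidates for a mapping campaign.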

Many areas of Haiti lack a full mapping of their buildings. Inspecting the predictions for Port-au-Prince, we see these mapping black-spots coexist alongside other cells where OSM coverage is dense, detailed and relatively complete — perhaps unsurprising given the episodic nature of community mapping in many developing countries, where emergency situations such as the 2010 earthquake have spurred bursts of effort.

When areas populated with homes, schools, health clinics and other critical infrastructure are not mapped, planning for and responding to extreme events is tough. But a simple machine-learning workflow like the one above can help inform crowd-sourcing efforts by pointing out those regions where an additional mapping campaign offers the greatest benefits.

Acknowledgments: Thanks to Jenny Mannix and Brad Bottoms at New Light Technologies who contributed to this project.