OpenStreetMap: From Browser Querying to Python+R Manipulation

A succinct guide for Data Scientists

Michelle Audirac
Towards Data Science

--

There are multiple users of OpenStreetMap (OSM), for example website app developers can integrate OSM maps to their online tools and services, Data Scientists for the most part use OSM data to create spatial analytics and build models. This guide is for Data Scientists who wish to quickly start working with OSM data; to this end, it provides a a jumpstart to three basic skills:

  1. Querying OSM data directly in a web browser,
  2. Making requests of OSM data through an API with python, and
  3. Exploring the request results with R.

OSM data basics

OpenStreetMap is a set of open-source tools that store and allow access to geo-spatial data that is tagged via crowd-sourcing. The tool set includes:

database and API There are several options to access the OpenStreetMap database. This walk-through focuses in Overpass API which provides read-only API access.

query language To query data in Overpass API, one must become familiar with Overpass QL. This is a query language specially created for the OpenStreetMap data model.

interactive access Overpass turbo is a web-based interface that allows exploring the OpenStreetMap database with Overpass QL. The query results are shown on an interactive map.

Querying OSM data with Overpass turbo

OpensStreetMap data model is comprised of nodes and ways.

  • An OSM node represents a single point in space and requires two tags, its latitude and longitude.
  • An OSM way represents lines (roads, rivers) which are ordered lists of nodes. An area is a “closed” way and should be tagged with isarea=true.

Another OSM’s data model element is a relation.

  • OSM relations are used to assign boundaries of countries, counties, named multi-polygons (landmarks), and routes.

Let’s start getting familiar with the OSM elements by opening https://overpass-turbo.eu/?lat=30.30&lon=-97.75&zoom=11, this Overpass view is centered in Austin, TX.

On the left, the main view contains a text box where Overpass QL queries are fed. On the right, the results are shown on the map.

query 1

The following code returns restaurants in the Austin area. This area is a box defined by two points: the south-west corner (30.20, -97.85) and the north-east corner (30.40, -97.65).

node[amenity=restaurant]
(30.20, -97.85, 30.40, -97.65);
out;

These are the results of the query on a map.

The output is in xml format, with tags for each node representing a restaurant.

<node id="358532482" lat="30.3221769" lon="-97.6932173">
<tag k="amenity" v="restaurant"/>
<tag k="created_by" v="Potlatch 0.10f"/>
<tag k="name" v="La Palapa"/>
</node>
<node id="358669465" lat="30.3231224" lon="-97.7037688">
<tag k="amenity" v="restaurant"/>
<tag k="created_by" v="Potlatch 0.10f"/>
<tag k="name" v="China Star Buffet"/>
</node>
<node id="358693667" lat="30.3253747" lon="-97.7048159">
<tag k="amenity" v="restaurant"/>
<tag k="created_by" v="Potlatch 0.10f"/>
<tag k="name" v="Pappadeaux"/>
</node>
<node id="359344991" lat="30.3269188" lon="-97.7048810">
<tag k="amenity" v="restaurant"/>
<tag k="created_by" v="Potlatch 0.10f"/>
<tag k="name" v="Pappasitos"/>

Overpass QL, statements write their results into sets, which can then be the input for a subsequent statement.

query 2

[out:json];
(
node[amenity=restaurant]({{bbox}});
node[amenity=school]({{bbox}});
);
out;

This query returns the location of two types of amenities, restaurants and schools. The statement contains:

  • Semicolons ; that separate and end each part of the statement.
  • A union block, that consists of statements inside a pair of parenthesis ().
  • [out:<format>] a pair of square brackets [] and an out statement that determine the format of the output (in this example, a json format).
  • The bounding box, where the search is done. ({{bbox}}) means filtering results to the current map view.

So far the queries in the examples filter the results to bounded boxes that are defined either by coordinates or the current map view. Requesting results within a city or a county is a common use-case. To do this the id for an administrative bounded area is required. An example, the id of the OSM relation for the city of Austin, TX is 113314.

Nominatim is the tool that allows searching for OpenStreetMap data by name, it can be accessed at https://nominatim.openstreetmap.org/ui/search.html. The next image is a screenshot of Nominatim after searching for `Austin`. The search results are elements in the OpenStreetMap database whose tags relate to the search key words.

query 3

The next query performs a search within the city of Austin, Texas. Note that the area id is determined by the sum of the number 3600000000 and the OSM relation id (in this case 3600113314 = 3600000000 + 113314). If the area of interest is an OSM way, add the number 2400000000 to its OSM id [1].

area(3600113314);
node[amenity=restaurant](area);
out;

Making requests directly with Python

Making requests to Overpass API can be done by adding an Overpass QL query after http://overpass-api.de/api/interpreter?data=. The following Python code will return the same results as the query 3 example from the previous section.

# %pythonimport requests
import json
overpass_url = "http://overpass-api.de/api/interpreter"
overpass_query = "[out:json];area(3600113314);node[amenity=restaurant](area);out;"
response = requests.get(overpass_url, params={'data': overpass_query})
response = response.json()

The API request returns a json, that is a combination of lists and dictionaries. The elements item contains the list of all extracted nodes,. These are the first five entries:

>>> response[‘elements’][1:5][{'type': 'node', 'id': 358669465, 'lat': 30.3231224, 'lon': -97.7037688, 'tags': {'amenity': 'restaurant', 'created_by': 'Potlatch 0.10f', 'name': 'China Star Buffet'}}, {'type': 'node', 'id': 358693667, 'lat': 30.3253747, 'lon': -97.7048159, 'tags': {'amenity': 'restaurant', 'created_by': 'Potlatch 0.10f', 'name': 'Pappadeaux'}}, {'type': 'node', 'id': 359344991, 'lat': 30.3269188, 'lon': -97.704881, 'tags': {'amenity': 'restaurant', 'created_by': 'Potlatch 0.10f', 'name': 'Pappasitos'}}, {'type': 'node', 'id': 359354860, 'lat': 30.3288301, 'lon': -97.7049325, 'tags': {'amenity': 'restaurant', 'created_by': 'Potlatch 0.10f', 'name': 'Japon Sushi'}}]

The (meta) values that define a node are type, node, id, lat and lon. The tags element is a nested dictionary. An example of how to flatten a nested dictionaries using pd.json_normalize is shown below [2]. In this case the amenity, name, and cuisine tags are selected.

# %pythonimport pandas as pd
import numpy as np
restaurants = (
pd.json_normalize(
response,
record_path = 'elements'
)
)[['type', 'id', 'lat', 'lon', 'tags.amenity', 'tags.name', 'tags.cuisine']]
# since NaNs are not defined for characters in R (only for numeric)
# we need to replace NaNs for temporary 'NA' strings in python and
# then convert them to actual NAs in R
restaurants = restaurants.replace(np.nan, "NA")

The resulting restaurants dataframe can be easily exported to R.

Exploring OSM data with R

The `leaflet` library in R can be used to explore the dataframe we obtained previously with Python. In addition to the spatial visualization of the OSM elements with `leaflet`, the `sf` library provides access to functions that allow processing GIS data.

Before moving to R, the census block groups of Travis County (City of Austin) will be extracted with Python from the Tiger/Line ftp. In this case, the downloaded shapefile is stored in the /tmp local folder.

# %pythonimport wget
from zipfile import ZipFile
import os
cwd = os.getcwd()
# census block groups
url = 'https://www2.census.gov/geo/tiger/TIGER2020PL/STATE/48_TEXAS/48453/tl_2020_48453_bg20.zip'
wget.download(url, '/tmp')
file_name = '/tmp/tl_2020_48453_bg20.zip'
ZipFile(file_name, 'r').extractall('./tmp')

With the restaurants dataframe loaded in R and the census block groups shapefile at hand it is possible to obtain a map with two layers:

  • A layer that contains restaurants represented as blue points.
  • A layer with polygons representing the census block groups (cbg).
# %rlibrary(leaflet)
library(sf)
cbg_sf = st_read("/tmp/tl_2020_48453_bg20.shp")
cbg_sp <- as(cbg_sf, "Spatial")
leaflet() %>%
addProviderTiles(providers$OpenStreetMap) %>%
addPolygons(data = cbg_sp, color = "black", weight = 0.5,
) %>%
addCircles(data = restaurants,
~lon, ~lat, popup = ~tags.name)

To finish this guide, the next code chunk contains commands that count the number of restaurants in each cbg [3].

# %rlibrary(tidyverse)restaurants_sf = st_as_sf(restaurants, coords = c("lon","lat"), crs = "NAD83")st_join(restaurants_sf, cbg_sf, join = st_within) %>% 
as.data.frame() %>%
select(GEOID20) %>%
group_by(GEOID20) %>%
summarise(n = n()) %>%
arrange(desc(n))

The cbg restaurant counts can then be used as a covariate for a supervised model or to estimate a spatial distribution.

Additional learning material on how to work with geospatial data in R is found elsewhere:

While there exist packages in Python and R with built-in functions for OSM data extraction (like OSMnx or osmdata), this guide provides the underlying concepts behind those functions. Understanding the fundamentals of OSM data, python API requests and R spatial manipulation ultimately allows Data Scientists to easily adapt their work-stream according to the needs of the applied problem at hand (and applied problems where OSM data can be used are limitless!).

[1] Nikolai Janakiev, Loading Data from OpenStreetMap with Python and the Overpass API (2018), Personal Blog

[2] Ankit Goel, How to parse JSON data with Python Pandas? (2020), Towards Data Science

[3] Matt Herman, Point-in-polygon with sf (2018), Personal Blog

--

--

Data Scientist / Research Assistant / working for the Department of Statistics and Data Science at The University of Texas at Austin