
Comparison of Geocoding Services Applied to Stroke-Care Facilities in Vietnam with Python

With a close look at match rate and spatial accuracy

Python Hands-on Tutorial

This work was co-authored with Kai Kaiser and Mahdi Fayazbakhsh. All errors and omissions are those of the author(s).

Photo by Tamas Tuzes-Katai on Unsplash

Geocoding is the process of converting addresses in text format into geographic coordinates such as latitude and longitude. An address is the information needed to locate a building, a plot of land, or a structure; it generally follows a specific format and contains elements such as political boundaries, street names, building numbers, organization names, and postal codes. When only addresses are available, geocoding is the first step toward location validation (e.g., by satellite overlay) and analytics (e.g., access analysis or climate exposure). Any spatially based study requires geocoding if the solution depends on knowing where points of interest (e.g., health facilities, population, schools, roads) are located.

In a health context, this task is often a fundamental first step performed before all other operations in a spatially based health study. The quality of the geocoding system is therefore of paramount concern both to the agencies that produce the data and to the researchers or policy-makers who consume it. However, geocoding systems are continually evolving, with new products regularly coming on the market. Agencies must develop and apply criteria across a number of axes when deciding whether to build, buy, or maintain a particular geocoding system.

When existing geocoding tools are applied in developing countries, analysts need to pay close attention to both the process and the pitfalls of this conversion. In settings with weak addressing in particular, the conversion may leave gaps: some addresses are not matched at all, and others are matched to the wrong location.

We demonstrate this using a data set of addresses of stroke care facilities in Vietnam that was manually geocoded and validated by the World Bank team in Vietnam.

In this blog, we will geocode the addresses of stroke care facilities in Vietnam with multiple services available through OpenStreetMap, Mapbox, and Google in Python and demonstrate a method to calculate the quality of geocoding. There are various metrics reported in the literature for measuring the quality of geocoding.

Geocoding system quality metrics (Source - An evaluation framework for comparing geocoding systems)

In this blog, we will report on the match rate (percentage of all records capable of being geocoded), and the spatial accuracy (frequency distribution of the distances between matchable geocodes and ground truth locations).
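As a minimal sketch of the first metric, the match rate can be computed from a list holding one entry per input record. The function name `match_rate` and the convention of using `None` for an unmatched record are assumptions made for illustration, not something taken from the original notebook.

```python
def match_rate(results):
    """Return the percentage of records for which a geocode was returned.

    `results` holds one entry per input address: a (lat, lon) tuple for a
    match, or None when the service returned nothing (assumed convention).
    """
    if not results:
        raise ValueError("results must contain at least one record")
    matched = sum(1 for r in results if r is not None)
    return 100.0 * matched / len(results)


# Three of four hypothetical records were matched.
sample = [(21.0, 105.8), None, (10.8, 106.7), (16.1, 108.2)]
print(f"Match rate: {match_rate(sample):.0f}%")
```

Note that the match rate says nothing about whether a returned geocode is in the right place; that is what the spatial accuracy metric is for.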

The goal is not to recommend the use of one over the other but to showcase how such an evaluation can be easily done in a Jupyter Notebook environment (JPNE) in Python for different country contexts.

The dataset that we are using for this exercise is the list of 106 stroke care facilities in Vietnam for which the complete address is known. Our team has also manually compiled the ground truth data of these locations and the latitude and longitude are verified.

Dataframe containing addresses and the ground truth geographic co-ordinates of 106 stroke care facilities in Vietnam (Image by Author)

1. Geocoding using OpenStreetMap (OSM) Data through Geopy

Geopy is a Python client that lets developers locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources. Nominatim is a tool to search OpenStreetMap data by name and address (geocoding) and to generate synthetic addresses of OSM points (reverse geocoding). Through the geopy client, Nominatim can be used from Python to query OSM data and geocode addresses.
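The sketch below shows one way such a query could look, assuming geopy is installed. The helper names `geocode_all` and `make_nominatim_geocoder` and the `user_agent` string are hypothetical; `Nominatim`, `RateLimiter`, and the `latitude`/`longitude` attributes are geopy's own. The one-second delay reflects the usage policy of the public Nominatim server.

```python
def geocode_all(addresses, geocode):
    """Geocode each address with the supplied callable.

    `geocode` takes an address string and returns an object exposing
    `.latitude` and `.longitude` (as geopy's Location does), or None.
    Returns a list of (lat, lon) tuples, with None for unmatched addresses.
    """
    results = []
    for address in addresses:
        location = geocode(address)
        if location is None:
            results.append(None)
        else:
            results.append((location.latitude, location.longitude))
    return results


def make_nominatim_geocoder(user_agent="geocoding-comparison-demo"):
    """Build a rate-limited Nominatim geocode function (requires geopy)."""
    # Imported lazily so the helper above can be used without geopy.
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent=user_agent)
    # The public Nominatim server allows at most one request per second.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

With a list of address strings in `addresses`, calling `geocode_all(addresses, make_nominatim_geocoder())` would return one entry per address. Passing the geocoder in as an argument keeps the loop independent of any single service, so the same helper can be reused for the other providers.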

The code took 55.4 seconds to execute and had a match rate of only 16%, which means out of the 106 facilities, Nominatim was able to return the geocodes of only 17 addresses.

Wall time and Match Rate for Geocoding with Geopy (Image by Author)

For calculating the spatial accuracy, the haversine distance (the great-circle distance between two points on the surface of a sphere) between the geocode returned by Geopy and the ground truth coordinates is calculated. We then plot the frequency distribution of the haversine distance to visualize the spatial accuracy.
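The haversine formula is short enough to write out directly. The sketch below assumes a spherical Earth with a mean radius of 6371.0088 km; the function names and the `None`-for-unmatched convention are illustrative rather than taken from the original notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical-Earth assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against floating-point values marginally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def matched_distances_km(geocoded, ground_truth):
    """Distances for matched records only; unmatched (None) entries are skipped."""
    return [
        haversine_km(g[0], g[1], t[0], t[1])
        for g, t in zip(geocoded, ground_truth)
        if g is not None
    ]
```

The list returned by `matched_distances_km` is what would be passed to a histogram to visualise the spatial accuracy.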

Spatial Accuracy for Geocoding with Geopy (Image by Author)

"Geocoder" Library in Python

Many online geocoding service providers (especially proprietary platforms such as Google, HERE, and Mapbox) do not provide a Python library. To fetch data from these platforms in Python, we would need to send HTTP requests ourselves and parse JSON responses that differ from provider to provider, with no standardised query or response schema. The Geocoder library was developed to overcome this challenge by providing a single access library for these platforms. For Mapbox and Google, we show how these services can be queried through the Geocoder library with a few lines of code.
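A rough sketch of what this uniform interface allows is given below, assuming the `geocoder` package is installed and that an API key is available for each service. The helper names `to_latlng` and `geocode_with` are hypothetical; `geocoder.mapbox`, `geocoder.google`, the `key` argument, and the `ok` and `latlng` attributes come from the library.

```python
def to_latlng(result):
    """Extract (lat, lon) from a geocoder result, or None if there is no match."""
    if result.ok and result.latlng:
        return tuple(result.latlng)
    return None


def geocode_with(provider_func, address, api_key):
    """Query one provider through the Geocoder library's common interface.

    `provider_func` is a provider entry point such as `geocoder.mapbox`
    or `geocoder.google`; both accept the address and a `key` argument.
    """
    return to_latlng(provider_func(address, key=api_key))


# Usage (needs `pip install geocoder` and valid keys, so left commented out):
# import geocoder
# geocode_with(geocoder.mapbox, "an address in Vietnam", MAPBOX_KEY)
# geocode_with(geocoder.google, "an address in Vietnam", GOOGLE_KEY)
```

Because both providers return objects with the same attributes, switching services amounts to passing a different function.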

2. Geocoding using Mapbox through Geocoder

The code took 23 seconds to execute and had a Match Rate of 73% (77 addresses geocoded out of the 106 stroke facilities).

Stats of the Haversine distance between geocoded locations with Mapbox and the actual location (Image by Author)

As there are a few outliers with extreme values, we binned all the outliers (greater than 30 km) into one bin in the histogram.
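The binning described above can be sketched in plain Python as follows. The 30 km cap comes from the text; the 5 km bin width and the function name are assumptions made for illustration, and the cap is assumed to be a multiple of the bin width.

```python
def histogram_with_overflow(distances_km, bin_width_km=5.0, cap_km=30.0):
    """Count distances in fixed-width bins, with one extra bin for values above the cap.

    Returns a list of (label, count) pairs; the last pair is the overflow bin.
    """
    n_bins = int(cap_km // bin_width_km)
    counts = [0] * (n_bins + 1)  # final slot collects everything above cap_km
    for d in distances_km:
        if d > cap_km:
            counts[-1] += 1
        else:
            # A distance exactly equal to the cap falls in the last regular bin.
            counts[min(int(d // bin_width_km), n_bins - 1)] += 1
    labels = [
        f"{i * bin_width_km:g}-{(i + 1) * bin_width_km:g} km" for i in range(n_bins)
    ]
    labels.append(f">{cap_km:g} km")
    return list(zip(labels, counts))


# Hypothetical distances, including two extreme outliers.
for label, count in histogram_with_overflow([0.4, 2.1, 7.5, 29.0, 1500.0, 8200.0]):
    print(f"{label:>10}: {count}")
```

Collapsing the outliers into a single bin keeps the bulk of the distribution readable while still showing how many extreme values there are.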

Mapbox returned some results in other countries, such as Taiwan and Libya. Once these outliers are removed, the haversine distance distribution within Vietnam can be seen clearly.

3. Geocoding using Google through Geocoder

The code took 1 minute and 12 seconds to execute and had a Match Rate of 100% (all the 106 stroke facilities geocoded).

The distribution of haversine distance between the geocoded locations with Google and the actual location (Image by Author)

Google geocodes all 106 facilities to locations within Vietnam, a match rate of 100%. Only one geocoded location lies more than 4 km from its actual location.

Conclusion

In this blog, we explored the Geopy, Google, and Mapbox geocoding services through Python and validated the results against a manually geocoded dataset of stroke care facility locations in Vietnam. For each of the three services, we calculated and reported the match rate and spatial accuracy. As the costs of these services are also very relevant for applications in low- and middle-income country contexts, the table below summarises the metrics along with the costs associated with each platform.

Summary of results comparing Geopy, Mapbox, and Google for Geocoding (Image by Author)

We would like to reiterate that the goal is not to recommend the use of one over the other but to showcase how such an evaluation can be easily done in a Jupyter Notebook environment (JPNE) in Python for different country contexts.


We are building a track record of experience in deploying cloud-based JPNE processes, including in the World Bank’s Big Data Observatory (BDO) for COVID-19 and Beyond and the Vietnam Disruptive Technologies for Public Asset Governance (DT4PAG) programs. Through an ongoing series of Medium blog contributions (e.g., Visualising Global Population Datasets with Python, and Local climate analytics: Health Facility Rain Exposure in Vietnam), we will continue to share our experiences, particularly how they can help advance data-driven insights towards achieving the SDGs and encourage applied technology learning by doing and skills building, with a focus on the public sector.


The full code for this tutorial can be found in the GitHub repo. Even if you are not a Python programmer, we hope this contribution gives you an intuitive sense of the possibilities and processes for leveraging this type of data for a new generation of decision support.

