Python Hands-on Tutorial
This work was co-authored with Kai Kaiser and Mahdi Fayazbakhsh. All errors and omissions are those of the author(s).

Geocoding is the process of converting addresses in text form – the information needed to locate a building, plot of land, or structure, typically including political boundaries, street names, building numbers, organization names, and postal codes – into geographic coordinates such as latitude and longitude. When only addresses are available, geocoding is the first step towards location validation (e.g., by satellite overlay) and analytics: access analysis, climate exposure assessment, and any spatially based study require geocoding whenever the solution depends on knowing where points of interest (e.g., health facilities, population, schools, roads) are located.
When applying existing geocoding tools in developing countries, analysts need to pay close attention to both the process and the pitfalls of this conversion. Especially in settings with weak addressing systems, it can result in gaps in translation.
We demonstrate this using a dataset of stroke care facility addresses in Vietnam, manually geocoded and validated by the World Bank team in Vietnam.
In this blog, we will geocode the addresses of stroke care facilities in Vietnam using services available through OpenStreetMap, Mapbox, and Google in Python, and demonstrate a method for measuring the quality of the geocoding. Various metrics for geocoding quality are reported in the literature.

In this blog, we will report on the match rate (percentage of all records capable of being geocoded), and the spatial accuracy (frequency distribution of the distances between matchable geocodes and ground truth locations).
The goal is not to recommend the use of one over the other but to showcase how such an evaluation can be easily done in a Jupyter Notebook environment (JPNE) in Python for different country contexts.
The dataset we are using for this exercise is the list of 106 stroke care facilities in Vietnam for which the complete address is known. Our team has also manually compiled the ground truth for these locations, with verified latitudes and longitudes.

1. Geocoding using OpenStreetMaps (OSM) Data through Geopy
Geopy is a Python client that makes it possible for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources. Nominatim is a tool to search OpenStreetMap data by name and address (geocoding) and to generate synthetic addresses for OSM points (reverse geocoding). Through the geopy client in Python, we can use Nominatim to query OSM data for geocoding addresses, as demonstrated in the code below –
The code took 55.4 seconds to execute and had a match rate of only 16%: out of the 106 facilities, Nominatim was able to return geocodes for only 17 addresses.

For calculating the spatial accuracy, the haversine distance (the great-circle distance between two points on the surface of a sphere) between the geocode returned by Geopy and the ground truth coordinates is calculated. We then plot the frequency distribution of the haversine distance to visualize the spatial accuracy.
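A minimal implementation of the haversine distance, assuming coordinates in decimal degrees (the function name and constant are illustrative):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points.

    Accepts scalars or NumPy arrays, so it can be applied to whole
    columns of geocoded and ground-truth coordinates at once.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# The frequency distribution can then be plotted, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.hist(distances_km, bins=20)
# plt.xlabel("Haversine distance from ground truth (km)")
```

Treating the Earth as a sphere introduces an error of well under 1% relative to ellipsoidal formulas, which is negligible at the accuracy scale we are measuring here.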

"Geocoder" Library in Python
Many online geocoding service providers (especially proprietary platforms such as Google, HERE, and Mapbox) do not ship an official Python library. To fetch data from these platforms in Python, we would otherwise have to issue raw requests and handle each platform's own JSON response, with no standardised query and response schema. The Geocoder library was developed to overcome this challenge and to provide a single access point to these platforms. For geocoding using Mapbox and Google, we will demonstrate how, with a few lines of code, we can query these services through the Geocoder library.
2. Geocoding using Mapbox through Geocoder
The code took 23 seconds to execute and had a match rate of 73% (77 addresses geocoded out of the 106 stroke facilities).

As there are a few outliers with extreme values, we binned all the outliers (greater than 30 km) into one bin in the histogram.
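The binning step can be sketched as follows, assuming a sequence of haversine distances in kilometres (the helper name is illustrative):

```python
import numpy as np

def bin_with_overflow(distances_km, cap_km=30):
    """Cap extreme errors at `cap_km` so all outliers land in one overflow bin."""
    return np.minimum(np.asarray(distances_km, dtype=float), cap_km)

# The capped values can then be histogrammed as before, e.g.:
# import matplotlib.pyplot as plt
# plt.hist(bin_with_overflow(distances_km), bins=31)
# plt.xlabel("Haversine distance (km, 30+ binned together)")
```

Capping rather than dropping the outliers keeps the match rate and the histogram consistent: every geocoded record still appears in the plot.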
Mapbox returns some results in entirely different countries, such as Taiwan and Libya. Removing these outliers lets us see the haversine distance distribution within Vietnam clearly.
3. Geocoding using Google through Geocoder
The code took 1 minute and 12 seconds to execute and had a match rate of 100% (all 106 stroke facilities geocoded).

Google geocodes all locations within Vietnam, with a match rate of 100%. Only one location has an error of more than 4 km from its actual position.
Conclusion
In this blog, we explored the use of Geopy, Google, and Mapbox geocoding services through Python and validated the results with a manually geocoded dataset of stroke care facility locations in Vietnam. For the three services, the match rate and spatial accuracy metrics are calculated and reported. As the costs associated with these services are also very relevant for applications in low- and middle-income country contexts, we summarise the metrics, along with the costs associated with the platforms, in the table below –

We would like to reiterate that the goal is not to recommend the use of one over the other but to showcase how such an evaluation can be easily done in a Jupyter Notebook environment (JPNE) in Python for different country contexts.
We are building up a growing track record of experiences with the deployment of cloud-based JPNE processes, including in the World Bank’s Big Data Observatory (BDO) for COVID-19 and Beyond and Vietnam Disruptive Technologies for Public Asset Governance (DT4PAG) programs. Through an ongoing series of Medium blog contributions (e.g., Visualising Global Population Datasets with Python, and Local climate analytics: Health Facility Rain Exposure in Vietnam), we will continue to share our experiences, particularly in how they can help advance data-driven insights towards the achievement of the SDGs and nudge progress towards applied technology learning-by-doing and skills building with a focus on the public sector.

The full code for this tutorial can be found in the GitHub repo. Even if you are not a Python programmer, we hope this contribution gives you an intuitive sense of the possibilities and processes for leveraging this type of data for a new generation of decision support.