The world’s leading publication for data science, AI, and ML professionals.

Nearest Neighbor Analysis for Large Datasets in Helsinki Region

BallTree function for efficient distance analysis: Bus stops and restaurants

Image by the Author. Closest Bus stop (pink) to each fetched restaurant (yellow) from OSM. Demo
Image by the Author. Closest Bus stop (pink) to each fetched restaurant (yellow) from OSM. Demo

In the last few months, I have been part of the well-known _Automating GIS_ course at the University of Helsinki as a Research Assistant. My experience has been remarkable while giving my tips for automating GIS processes to students during their tasks. I am glad to see how geographers are taking over the GIS automation with python language programming and I am still amazed about the good quality of the material. At the beginning of the course, the lessons start giving advice for the usage of python and basic concepts of common objects used in GIS. Then the classes turn out into deep and advanced analysis that is really useful in a vast range of uses.

One of the lessons that will remain with me is the nearest neighbor analysis. It is an algorithm coming from the shapely library that can retrieve the closest point from a selected point. During the process, it uses the Unary Union function and this consumes memory and slows down the operation. So, instead of Unary Union, we are going to use in this tutorial the BallTree function that makes the process faster and consumes less memory.

Final web map: Demo Repository with code: Repo

For this example, I will use Helsinki as a study case and find all the closest bus stops to the restaurants in the city. The restaurants are fetched during the process from the Python library OSMnx developed by Geoff Boeing and the bus stops from the GTFS dataset for Helsinki Region (latest dataset) that contains vast information about public transportation. Let’s start

Datasets

A brief description of the license of the datasets used in this article and the material used.

  • GTFS dataset for Helsinki Region. Licensed under Creative Commons CC0. This dataset is collected through OpenMobilitydata, a free service for transit data, app developers, web developers, and transit agencies.
  • Open Street Map data. Licensed under Open Data Commons Open Database License (ODbl) or attribution license. Users are free to copy, distribute, transmit and adapt the data as long as it is attributed to the author like © OpenStreetMap contributors.
  • Automating GIS instructional material. Licensed under Creative Commons Attribution-ShareAlike 4.0 International license. Users are free to share and adapt under attribution. The course is shared openly and after using the material it is recommended to share the material openly as well.

Coding practice

Starting from the most important. Import the needed libraries.

import geopandas as gpd
import pandas as pd
import osmnx as ox
import matplotlib.pyplot as plt
from shapely.geometry import Point, LineString

Let’s fetch the area of interest with OSMnx. Helsinki

# Specify the name that is used to seach for the data
place_name = "Helsinki, Finland"
# Fetch OSM street network from the location
graph = ox.graph_from_place(place_name)
# Plot the streets
fig, ax = ox.plot_graph(graph)
# Get place boundary related to the place name as a geodataframe
area = ox.geocode_to_gdf(place_name)
Image by the author. A plot of the Graph object
Image by the author. A plot of the Graph object

Now, let’s fetch and arrange the restaurants from OSMnx

# List key-value pairs for tags such as restaurants
tags = {'amenity': 'restaurant'}
# Retrieve restaurants
restaurants = ox.geometries_from_place(place_name, tags)
# getting centroids from restaurants to avoid polygon geometric objects
restaurants['geometry'] = [geom.centroid for geom in restaurants['geometry']]
# reset index
restaurants = restaurants.reset_index(drop=True)

Now that the restaurants are in a formated GeoDataFrame, we are going to read the stops file from the GTFS dataset for Helsinki Region and create another GeoDataFrame

# read the stops
stops = pd.read_csv(r'gtfs/stops.txt', sep=',')
# creating geodataframe of stops
stops['geometry'] = None
stops_gdf = gpd.GeoDataFrame(stops, geometry = 'geometry', crs=4326)
stops_gdf['geometry'] = [Point(lon, lat) for lon, lat in zip(stops_gdf['stop_lon'], stops_gdf['stop_lat'])]
# Clip stops from the area of interest
stops_mask = gpd.clip(stops_gdf, area)

Let’s make a plot to see how it looks so far

Image by the Author. Bus stops (blue) and restaurants (red)
Image by the Author. Bus stops (blue) and restaurants (red)

Now that we have what we need we are going to define a couple of functions that will help to find the nearest neighbor

from sklearn.neighbors import BallTree
import numpy as np

First function.

def get_nearest(src_points, candidates, k_neighbors=1):
 '''
Find nearest neighbors for all source points from a set of candidate points
'''
# Create tree from the candidate points
 tree = BallTree(candidates, leaf_size=15, metric='haversine')
# Find closest points and distances
 distances, indices = tree.query(src_points, k=k_neighbors)
# Transpose to get distances and indices into arrays
 distances = distances.transpose()
 indices = indices.transpose()
# Get closest indices and distances (i.e. array at index 0)
 # note: for the second closest points, you would take index 1, etc.
 closest = indices[0]
 closest_dist = distances[0]
# Return indices and distances
 return (closest, closest_dist)

Second function.

def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
'''
 For each point in left_gdf, find closest point in right GeoDataFrame and return them.

 NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
'''

 left_geom_col = left_gdf.geometry.name
 right_geom_col = right_gdf.geometry.name

 # Ensure that index in right gdf is formed of sequential numbers
 right = right_gdf.copy().reset_index(drop=True)

 # Parse coordinates from points and insert them into a numpy array as RADIANS
 # Notice: should be in Lat/Lon format 
 left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
 right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())

 # Find the nearest points
 # - - - - - - - - - - - -
 # closest ==> index in right_gdf that corresponds to the closest point
 # dist ==> distance between the nearest neighbors (in meters)

 closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)
# Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
 closest_points = right.loc[closest]

 # Ensure that the index corresponds the one in left_gdf
 closest_points = closest_points.reset_index(drop=True)

 # Add distance if requested 
 if return_dist:
 # Convert to meters from radians
 earth_radius = 6371000 # meters
 closest_points['distance'] = dist * earth_radius

 return closest_points

Let’s apply the functions

# Find closest public transport stop for each building and get also the distance based on haversine distance 
# Note: haversine distance which is implemented here is a bit slower than using e.g. 'euclidean' metric 
# but useful as we get the distance between points in meters
closest_stops = nearest_neighbor(restaurants, stops_mask, return_dist=True)

At this moment, the same amount of restaurants is going to be the same amount of closest_stops. Here we have in total 1.111 linked.

Now, we are going to merge the restaurants and bus stops. With this, we will have in each row the restaurants with each closest bus stop and we can create a LineString as the link.

# Rename the geometry of closest stops gdf so that we can easily identify it
closest_stops_geom = closest_stops.rename(columns={'geometry': 'closest_stop_geom'})
# Merge the datasets by index (for this, it is good to use '.join()' -function)
restaurants = restaurants.join(closest_stops_geom)

We can apply descriptive statistics in this GeoDataFrame with distance attribute and we can see that the average distance from a restaurant to a bus stop is 114m, the max distance registered is 4813m. You will notice in the final map that some restaurants are located in the Islands 🙂

Let’s continue creating the links.

# Create a link (LineString) between building and stop points
restaurants['link'] = restaurants.apply(lambda row: LineString([row['geometry'], row['closest_stop_geom']]), axis=1)
# Set link as the active geometry
restaurants_links = restaurants.copy()
restaurants_links = restaurants_links.set_geometry('link')

Finally, if we make a plot of our elements. It will look like this:

# Plot the connecting links between buildings and stops and color them based on distance
ax = restaurants_links.plot(column='distance', cmap='Greens', scheme='quantiles', k=4, alpha=0.8, lw=0.7, figsize=(13, 10))
ax = restaurants.plot(ax=ax, color='yellow', markersize=4, alpha=0.7)
ax = closest_stops.plot(ax=ax, markersize=8, marker='o', color='red', alpha=0.7, zorder=3)
# Zoom closer
ax.set_xlim([24.924, 24.95])
ax.set_ylim([60.16, 60.1740])
# Set map background color to black, which helps with contrast
ax.set_facecolor('black')
Image by the Author. Links (green) between the Restaurants (yellow) and the closest Bus stop (red)
Image by the Author. Links (green) between the Restaurants (yellow) and the closest Bus stop (red)

Now, it is done. You can download the data if you want and formated it with your own taste as I did.

Image by the Author. Final map.
Image by the Author. Final map.

Conclusion

BallTree is an efficient function that helps to accelerate the process of distance analysis between geometric elements at scale. Test it by yourself and you will notice how it works fast avoiding extra charge over the computer memory. Also, by studying distances with Nearest Neighbor it is possible to create reports of Mobility. In this example, the average distance between a bus stop and a restaurant is 114m, and it is less than 5 minutes walking. In the case, you are analyzing other elements than restaurants, and the average is too high and gives a higher reaching time it will be evident that exist an accessibility issue.

That is all. Good coding practice and good analysis for the future. You can support my work in my web maps or catch me on my LinkedIn.


Related Articles