
Modeling Bike-Share Services in Tirana, Albania

Using unsupervised learning to model the optimal urban places for bike-share services

Photo by Edin Murati on Unsplash

In this story, I want to look at bike-sharing services from a data science viewpoint. Specifically, this article uses clustering to figure out the best places to put bike stations in Tirana, Albania. Tirana is my hometown, and I have recently been inspired by its new initiatives promoting biking for commuting and health, especially the addition of protected bike lanes, special "no car" days and general road safety measures. However, the city does not have a major bike-share platform, so I wanted to address this gap using machine learning.

First, some setup. When thinking about optimal bike station locations, it makes sense to consider factors such as population clusters, proximity to other public transport, and clusters of business areas that people usually commute to. That is because a successful bike-share service needs to be accessible to as many people as possible, who can feasibly and reliably use it daily to get to work or other places. The mayor’s office provides information on each administrative area’s population as well as the routes and stations of public transport. We are going to build on these two datasets first, using K-Means clustering to generate predictions and Folium to visualize them.

The Data

Here is the dataset of population counts for all 11 administrative areas of urban Tirana (taken from this website: https://opendata.tirana.al/?q=popullsia-e-tiran%C3%ABs-2020). To make it clearer, I have also added a map using GeoPandas' built-in plot function:

DataFrame including population counts and polygon geometries (Image by Author)
Density of Tirana (Image by Author)

The map above actually plots population density rather than raw counts, to account for the different sizes of the areas. I chose a red sequential palette to make the distinctions between values clear.
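As a minimal sketch of how such a density column could be computed and plotted with GeoPandas: the two rectangular "areas" and their population figures below are made-up placeholders, not Tirana's real administrative boundaries, and the choice of EPSG:32634 (UTM zone 34N) as the metric projection for measuring area is my assumption.

```python
import geopandas as gpd
from shapely.geometry import box

# Placeholder data: two hypothetical areas with invented populations.
areas = gpd.GeoDataFrame(
    {"name": ["Area A", "Area B"], "population": [60_000, 30_000]},
    geometry=[
        box(19.80, 41.32, 19.82, 41.34),  # (min lon, min lat, max lon, max lat)
        box(19.82, 41.32, 19.86, 41.34),
    ],
    crs="EPSG:4326",
)

# Areas measured in degrees are not meaningful, so project to a metric CRS
# (assumed here: UTM zone 34N) before computing square kilometres.
areas["area_km2"] = areas.to_crs(epsg=32634).area / 1e6
areas["density"] = areas["population"] / areas["area_km2"]


def plot_density(gdf):
    """Choropleth of population density with a red sequential palette."""
    return gdf.plot(column="density", cmap="Reds", legend=True, edgecolor="black")


print(areas[["name", "population", "area_km2", "density"]])
```

Calling `plot_density(areas)` draws the map (this requires matplotlib). Dividing by area is what keeps a large, sparsely populated district from looking as important as a small, crowded one.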

In addition, we have access to bus station data, which represents the major bus lines and stopping points (taken from this website: https://opendata.tirana.al/sites/default/files/linjat_e_autobus_ve_publik__dhe_stacionet.geojson). Tirana does not have a subway system, so most public transport is carried out by buses. Below, you can find the data frame and a map of the bus stations:

Bus Stations in Tirana (Image by Author)

Cool! Now let’s get to modeling. First, let’s fit a K-Means clustering model to our data. The model takes the x and y coordinates of the bus station points as input features, and assigns each station to one of a chosen number of clusters. Here is the fitting and the predictions:
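A minimal sketch of such a fit, assuming scikit-learn's `KMeans`, might look like the following. The station coordinates are synthetic stand-ins generated around three arbitrary points, not the real bus stops, and the choice of three clusters is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for bus station coordinates: columns are (x=lon, y=lat).
rng = np.random.default_rng(0)
centers = np.array([[19.80, 41.32], [19.82, 41.34], [19.84, 41.32]])
stations = np.vstack(
    [c + rng.normal(scale=0.002, size=(30, 2)) for c in centers]
)

# Fit K-Means on the coordinates; n_clusters=3 is an arbitrary choice here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(stations)  # cluster index for every station
centroids = kmeans.cluster_centers_    # one (lon, lat) centre per cluster

print(centroids)
```

The `labels` array can be used to color the stations on a map, while `centroids` gives one representative point per cluster.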

Map of bus stations with the clusters overlaid (Image by Author)

Now that we have some predictions, let’s visualize them in context, that is, in the geographical space of Tirana. To do so, we can use Folium, a Python package for map visualizations that allows us to add pop-ups like so:

Folium Map of the Clusters (Image by Author)

However, is this model good enough? One thing to keep in mind is that our current model does not account for the size of the population served by these stations, so it might not represent the places with the most demand for bike-share services. Luckily, K-Means clustering can take per-sample weights (in scikit-learn, for example, through the sample_weight argument of the fit method), so that the population size is considered when clustering the data as well:
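To illustrate the effect of the weights, here is a small sketch assuming scikit-learn's `KMeans.fit(X, sample_weight=...)`. The four stations and the population attached to each are invented, and a single cluster is used only because it makes the shift easy to see; the same argument works for any number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented stations (lon, lat) and the population each one is assumed to serve.
stations = np.array(
    [[19.80, 41.32], [19.81, 41.32], [19.83, 41.33], [19.84, 41.33]]
)
population = np.array([10_000, 10_000, 40_000, 40_000])

# Same model, fitted without and with population weights.
unweighted = KMeans(n_clusters=1, n_init=10, random_state=0).fit(stations)
weighted = KMeans(n_clusters=1, n_init=10, random_state=0).fit(
    stations, sample_weight=population
)

print("unweighted centre:", unweighted.cluster_centers_[0])
print("weighted centre:  ", weighted.cluster_centers_[0])
```

With weights, each cluster center becomes the population-weighted mean of its stations, so here it moves east toward the two stations that serve more people.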

Let’s now map the two clustering results together to see how the predictions changed:

Folium Map of the Clusters (Image by Author)

Refining the Model

From this point on, we are going to continue with the weighted K-Means model and try to improve upon it. Before we do that, let’s look at how to score the models we are producing. We are going to use the Silhouette score and the Calinski-Harabasz score. The former takes values from -1 (worst) to 1 (best). For each sample, it is calculated from the mean intra-cluster distance a and the mean nearest-cluster distance b as (b - a) / max(a, b), and the overall score is the mean over all samples. The latter has no upper bound; it is defined as the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion over all clusters, with higher values indicating denser, better-separated clusters.
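Both scores are available in scikit-learn as `silhouette_score` and `calinski_harabasz_score`. The sketch below shows how the two models might be compared; it runs on synthetic coordinates with randomly drawn placeholder populations, so the numbers it prints say nothing about the Tirana results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stations (lon, lat) and placeholder population weights.
rng = np.random.default_rng(0)
centers = np.array([[19.80, 41.32], [19.82, 41.34], [19.84, 41.32]])
stations = np.vstack(
    [c + rng.normal(scale=0.002, size=(30, 2)) for c in centers]
)
population = rng.integers(5_000, 50_000, size=len(stations))


def score(labels):
    """Return (silhouette, Calinski-Harabasz) for a labelling of the stations."""
    return (
        silhouette_score(stations, labels),
        calinski_harabasz_score(stations, labels),
    )


plain = KMeans(n_clusters=3, n_init=10, random_state=0).fit(stations)
weighted = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    stations, sample_weight=population
)

plain_scores = score(plain.labels_)
weighted_scores = score(weighted.labels_)
print("unweighted:", plain_scores)
print("weighted:  ", weighted_scores)
```

Both functions take only the coordinates and the cluster labels, so they judge the geometric quality of the clustering and do not see the population weights.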

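For the Tirana data, both the Silhouette score and the Calinski-Harabasz score actually go down in the second, weighted model. One plausible reason is that both scores, as usually computed, look only at the station coordinates and cluster labels, while the weighted model trades some geometric compactness for closeness to populated areas.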

How many clusters to use?

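Another aspect of building the model is deciding how many clusters to separate the points into, since this determines the overall structure of the clusters and how many points fall into each. A common heuristic is the elbow method: fit the model for a range of cluster counts, plot a measure of fit (typically the within-cluster sum of squared distances) against the number of clusters, and pick the point where the curve bends. Here is how to do that in code and visualize it in a plot: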
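One way the elbow computation could be sketched, assuming scikit-learn's `KMeans` and its `inertia_` attribute (the within-cluster sum of squared distances): the data is again synthetic, and `inertia_curve` is a hypothetical helper name introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stations (lon, lat) drawn around three arbitrary points.
rng = np.random.default_rng(0)
centers = np.array([[19.80, 41.32], [19.82, 41.34], [19.84, 41.32]])
stations = np.vstack(
    [c + rng.normal(scale=0.002, size=(30, 2)) for c in centers]
)


def inertia_curve(X, ks, sample_weight=None):
    """Fit K-Means for each k and collect the resulting inertia."""
    inertias = []
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=0)
        model.fit(X, sample_weight=sample_weight)
        inertias.append(model.inertia_)
    return inertias


ks = list(range(1, 9))
inertias = inertia_curve(stations, ks)
for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.5f}")
```

Plotting `ks` against `inertias` (for example with matplotlib's `plt.plot(ks, inertias, marker="o")`) gives the elbow curve. On this synthetic data the bend falls at three clusters because the data was generated that way; passing the population as `sample_weight` would produce the curve for the weighted model.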

The Elbow Method (Image by Author)

Let’s apply this optimal number of clusters to our model and update our map to show the resulting clusters, added in green:

Folium Map of the Clusters (Image by Author)

In this short story, we looked at how to build a model that predicts the best places to put bike stations in a metropolis, in this case Tirana, and we explored and refined the parameters of our model. Some next steps after laying this groundwork would be to fit other unsupervised models to the data, such as Self-Organising Maps, and to supplement the dataset with more information, such as traffic conditions, the number of people who work in a location, and perhaps also pollution data. These could inform a more sophisticated model, which could eventually be implemented in the real world.

Thank you for reading!

