What Dublin Bikes data can tell us about the city and its people

Dublin cyclists are predictable and (perhaps thankfully) aren't fazed by cold or rain, but they avoid cycling home after a night out!

Photo by Picography on Pixabay

My favorite way to practice and build my data science skills is to hunt for publicly available datasets online and analyze them to uncover interesting insights into human behavior from what can initially seem like administrative, even boring, records. Public authorities are sharing more and more data online through easy-to-use APIs or CSV repositories. My general impression is that state authorities have more data than data scientists, and many datasets lack any accompanying analysis, let alone aggregation with other data to discover deeper insights and trends. This offers some exciting whitespace for the curious.

I came across the Dublin Bikes station occupancy data on the Smart Dublin open data site and felt an immediate interest. Anyone who has lived in Dublin sees these bikes criss-crossing the city dozens of times a day, and there seems to be a station every few blocks. I wondered whether the data could uncover new information about how Dubliners interact with each other and the city they live in, proving or challenging the assumptions I had about people’s behavior.

This article describes the surprisingly deep and vivid insights this simple bike station occupancy dataset reveals about Dublin and Dubliners: clearly distinct zones of activity and the times when each is at its busiest, the change in people’s cycling movements (or lack thereof!) on cold and rainy days, and station occupancy rates predictable enough to support better bike redistribution or occupancy forecasts for users.

The Data

The dataset is straightforward: station occupancy data for all of Dublin Bikes’ 115 stations, collected every 10 minutes and downloadable as quarterly CSV files. I used data from July 2019 to March 2020 to look at behavior patterns before the COVID-19 lockdown. I then merged in hourly weather data downloaded from the website of Met Éireann, Ireland’s meteorological service, to enrich the data.
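As a rough illustration of that merge step, the sketch below joins 10-minute station snapshots to hourly weather readings with pandas. The column names and values are made up for the example and are not taken from the real CSV files.

```python
import pandas as pd

# Hypothetical bike snapshots (column names and values are assumptions).
bikes = pd.DataFrame({
    "station_id": [1, 1, 1, 2],
    "time": pd.to_datetime([
        "2019-07-01 08:00", "2019-07-01 08:10",
        "2019-07-01 09:00", "2019-07-01 08:50",
    ]),
    "bike_stands": [30, 30, 30, 20],
    "available_bikes": [12, 10, 3, 18],
})

# Hypothetical hourly weather observations.
weather = pd.DataFrame({
    "hour": pd.to_datetime(["2019-07-01 08:00", "2019-07-01 09:00"]),
    "rain_mm": [0.0, 1.2],
    "temp_c": [14.1, 13.8],
})

# Floor each snapshot to the start of its hour so it lines up with the hourly reading.
bikes["hour"] = bikes["time"].dt.floor("60min")

# A left join keeps every bike snapshot, even where a weather reading is missing.
merged = bikes.merge(weather, on="hour", how="left")
print(merged[["station_id", "time", "available_bikes", "rain_mm", "temp_c"]])
```

Flooring to the hour assumes each weather reading describes the hour that starts at its timestamp; if the readings describe the preceding hour, the join key would need shifting.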

Sample of the raw data showing structure and features

Bike journeys segment Dublin into clear functional zones

I started by thinking about how to identify users’ movement patterns, i.e. what journeys were being taken at different times and on different days of the week. Having bad memories of walking in the Dublin weather to a station only to find no bikes available, I was particularly interested in understanding the blockages that arise in the system when stations are either empty or full.

The dataset is structured around station information, so I made the assumption that stations with high occupancy % (the share of a station’s bike stands occupied by a bike) were in areas that people were travelling to, i.e. busy areas at that time, while stations with low occupancy were in areas that people were travelling from.
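For concreteness, occupancy % as defined here can be computed per snapshot as available bikes divided by total stands. The column names below are hypothetical.

```python
import pandas as pd

# Hypothetical snapshot rows; column names are assumptions.
snapshots = pd.DataFrame({
    "station_id": [1, 2, 3],
    "bike_stands": [30, 20, 40],
    "available_bikes": [27, 2, 20],
})

# Occupancy % = share of a station's stands that currently hold a bike.
snapshots["occupancy_pct"] = 100 * snapshots["available_bikes"] / snapshots["bike_stands"]
print(snapshots)
```

In this toy example station 1 is nearly full (a likely destination at that moment), station 2 is nearly empty (a likely origin), and station 3 sits in between.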

To find areas with similar movement behavior, I implemented K-Means clustering. K-Means was the simplest choice: it gives direct control over the number of clusters, and each iteration runs in time linear in the number of data points. I ran the ‘Elbow Method’ test and judged the optimum number of clusters (K) to be 5. While 3 would arguably be sufficient, I chose 5 because the extra clusters still added meaningful information, and the purpose of the analysis is to learn as much as possible about user behavior from the data.
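The Elbow Method can be sketched as follows: fit K-Means for a range of K values and look for the point where the within-cluster sum of squares (scikit-learn's `inertia_`) stops dropping sharply. The data here is synthetic, built with three obvious groups purely for illustration, so it will not reproduce the article's result.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the station feature matrix: 60 "stations" with 15 features,
# drawn around three artificial centers (an assumption for the demo).
centers = np.array([[20.0] * 15, [50.0] * 15, [80.0] * 15])
X = np.vstack([c + rng.normal(0, 3, size=(20, 15)) for c in centers])

# Fit K-Means for K = 1..8 and record the inertia for each K.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# The "elbow" is the K after which further increases yield only small reductions.
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On real data the elbow is often less clear-cut, which is why choosing between, say, 3 and 5 clusters remains a judgment call.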

The dataset has entries every 10 minutes, so it was necessary to group the time aspect of the data into broader bands to reduce the number of features and make the output easier to understand. I grouped the occupancy % for each station by ‘day type’ (Weekday, Saturday, or Sunday) and by ‘time type’, which had five values: ‘6am-10am’, the to-work commuting hours; ‘11am-3pm’, the working hours and lunch period; ‘4pm-7pm’, the from-work commuting hours; ‘8pm-11pm’, capturing evening activities; and ‘overnight’. Combining the three day types with the five time types resulted in 15 features per station. I used the K-Means implementation in the scikit-learn Python library for the clustering and plotted the output on a map using the Folium library.
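A minimal sketch of this feature engineering and clustering step is shown below. It assumes the mean is used to aggregate occupancy within each band and that hours 0 to 5 count as ‘overnight’; the data is random and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in data: 10 hypothetical stations, one reading per hour for one week
# starting Monday 1 July 2019 (the real data has a reading every 10 minutes).
times = pd.date_range("2019-07-01", periods=7 * 24, freq="60min")
df = pd.DataFrame(
    [(station, t) for station in range(1, 11) for t in times],
    columns=["station_id", "time"],
)
df["occupancy_pct"] = rng.uniform(0, 100, size=len(df))

# Day type: pandas numbers Monday as 0 and Sunday as 6.
dow = df["time"].dt.dayofweek
df["day_type"] = np.where(dow == 5, "Saturday", np.where(dow == 6, "Sunday", "Weekday"))

# Time type: hour bands from the text; hours 0-5 are assumed to be 'overnight'.
df["time_type"] = pd.cut(
    df["time"].dt.hour,
    bins=[-1, 5, 10, 15, 19, 23],
    labels=["overnight", "6am-10am", "11am-3pm", "4pm-7pm", "8pm-11pm"],
).astype(str)

# One row per station, one column per (day type, time type) pair: 3 x 5 = 15 features.
features = df.pivot_table(
    index="station_id",
    columns=["day_type", "time_type"],
    values="occupancy_pct",
    aggfunc="mean",
)
print(features.shape)  # (10, 15)

# Cluster the stations on their 15 occupancy features.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = pd.Series(kmeans.fit_predict(features.to_numpy()),
                   index=features.index, name="cluster")
print(labels)
```

Because every feature is a percentage on the same 0 to 100 scale, the sketch skips feature scaling; with mixed units that step would matter for K-Means.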

Dublin Bikes' Stations colored by clusters of similar occupancy trends

The clustering algorithm does not take station location or the distance between stations as an input, but the output showed clear geographical patterns in usage behavior. Stations belonging to one cluster generally sat together on the map. Even before examining the descriptive statistics of the clusters, any Dublin local will see some logical patterns emerging. The "Blue Cluster" captures the very center of Dublin City, the commercial and recreational hub. The "Purple Cluster" is more dispersed across the city but still has a clear geographical pattern, formed around the transport hubs: Heuston Train Station (just south of the River Liffey, on the west of the city), and Connolly Station and Busáras (located north of the Liffey on the edge of the City Centre).

All the "Orange Cluster" dots sit together in the south east of the city, along and below the River Liffey. This is the office center of the city, known locally as "Silicon Docks" because the majority of technology, banking, audit and legal offices are located in this area, including Google, Facebook, Amazon and Goldman Sachs. The "Red Cluster" and "Green Cluster" groups both formed at the very outer part of the city, to the north, south and west. The "Red Cluster" in particular tended to be the stations at the very edge of the Dublin Bikes network, the most residential areas with bike stations present. The "Green Cluster" was mostly focused around the Grangegorman (new university campus) and Mater Hospital area in the north of the city, plus a small number of scattered stations without any obvious common geographic attribute.

Average Station Occupancy per cluster per hour, showing different weekday vs weekend trends

When the data behind each cluster is analyzed, the user behavior and occupancy trends match the geographic inspection. There are significant differences in the hourly and weekday/weekend occupancy trends by cluster. The "Orange Cluster (Office Zone)" stands out for its high weekday morning occupancy rate: scheme users come from other zones in the morning to work and leave their bikes here (overnight occupancy is the lowest in this area). This area has very low weekend occupancy rates, evidence of the common and undesirable urban planning problem of Central Business Districts that are deserted at weekends. The "Blue Cluster (City Centre)" has the highest Saturday and Sunday occupancy rate and a consistent mid-range occupancy rate during the week, peaking just after work when people perhaps go into the city center for shopping, socializing etc.

Both the "Purple Cluster (Transport Hubs)" and "Red Cluster (Suburbs)" have low occupancy during the to-work commuting hours and office hours, indicating the lack of 9–5 employment in these areas. At the weekend, occupancy at "Purple Cluster" stations near transport hubs like Heuston Station is much higher, while occupancy at "Red Cluster" stations falls over the course of the weekend. The "Red Cluster (Suburbs)" has the highest overnight occupancy and the lowest occupancy between 6am and 10am on weekday mornings. This confirms the visual impression of the "Red Cluster" as residential areas. In contrast, the "Purple Cluster" has the most consistent usage over the course of a week, which is not surprising given the stations’ proximity to transport hubs.

The "Green Cluster" (primarily around the Grangegorman and Mater Hospital area in the north of the city) has the lowest overall occupancy on average, and its occupancy tends to be quite flat throughout weekdays and weekends, indicating low usage of the scheme here overall. The "Green Cluster" also had by far the lowest average number of both arrivals and departures per day, at 22 and 25 respectively, compared to an average of 60 for the remaining four clusters. This suggests these stations are the most likely to be in suboptimal locations. The underused "Green Cluster" stations in the south west of the city are close to Heuston Station, one of the busiest Dublin Bikes zones, but appear to be "hidden" at the back of the station or one or two streets away, perhaps an indication that better awareness of these stations would increase their traffic and reduce pressure on the other Heuston sites.
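The dataset records occupancy rather than individual trips, so arrivals and departures have to be inferred. One possible approach, shown here as an assumption rather than the method actually used, is to treat each increase in available bikes between consecutive snapshots as arrivals and each decrease as departures. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical snapshots for one station (column names are assumptions).
snap = pd.DataFrame({
    "station_id": [1, 1, 1, 1, 1],
    "time": pd.to_datetime([
        "2019-07-01 08:00", "2019-07-01 08:10", "2019-07-01 08:20",
        "2019-07-01 08:30", "2019-07-01 08:40",
    ]),
    "available_bikes": [10, 12, 9, 9, 11],
})

# Change in available bikes between consecutive snapshots of the same station.
snap = snap.sort_values(["station_id", "time"])
change = snap.groupby("station_id")["available_bikes"].diff()

# Positive changes are counted as arrivals, negative changes as departures.
snap["arrivals"] = change.clip(lower=0).fillna(0)
snap["departures"] = (-change).clip(lower=0).fillna(0)

# Total per station per day.
daily = snap.groupby(["station_id", snap["time"].dt.date])[["arrivals", "departures"]].sum()
print(daily)
```

This only captures net movement per 10-minute interval: a bike returned and another rented within the same interval cancel out, and bikes moved by rebalancing trucks are counted too, so the totals are approximate.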

One amusing but perhaps unsurprising point noted was the behavior of users on a Saturday evening. Contrary to other days of the week, users headed in towards the centre of the city on Saturday evenings, but then left the majority of bikes in the city centre, presumably taking another method of transport home. This phenomenon has a knock-on impact on Sunday with city centre stations becoming too full in the early afternoon from an already high occupancy starting point after the night before.

Average Station Occupancy for City Centre Stations showing unusual trend of cycling into City Centre & not cycling outwards on Saturday nights

A bit of cold or rain doesn’t faze Dublin’s cyclists

I decided to dive deeper into the data to understand the impact of cold or wet weather conditions on how people used Dublin Bikes. To do this, I merged hourly temperature and rainfall data from Met Éireann, Ireland’s meteorological service, onto the existing bike station data. I then grouped the average station occupancy % of each cluster and time period by weather type.
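A sketch of that grouping is shown below. It assumes a simple rule that any recorded rainfall in the hour counts as "wet"; the threshold, column names and values are all illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical merged rows: one per station snapshot, already labeled with
# cluster and time band (names and values are assumptions).
df = pd.DataFrame({
    "cluster": ["Office", "Office", "Office", "Office", "Suburbs", "Suburbs", "Suburbs"],
    "time_type": ["6am-10am"] * 7,
    "rain_mm": [0.0, 0.0, 0.4, 1.1, 0.0, 0.0, 0.6],
    "occupancy_pct": [80.0, 70.0, 72.0, 74.0, 20.0, 30.0, 24.0],
})

# Assumed rule: any rainfall in the hour makes it a "wet" observation.
df["weather"] = np.where(df["rain_mm"] > 0, "wet", "dry")

# Average occupancy per cluster and time band, with wet and dry side by side.
summary = df.pivot_table(
    index=["cluster", "time_type"],
    columns="weather",
    values="occupancy_pct",
    aggfunc="mean",
)
summary["wet_minus_dry"] = summary["wet"] - summary["dry"]
print(summary)
```

A difference close to zero in the last column means the weather made little difference to average occupancy for that cluster and time band.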

Scatterplot showing that per station there is very little difference in the average number of bike rentals/returns on wet versus dry days

The stations and times with the biggest differences between wet and dry weather tended to be the stations along the southside canal, where activity increased on dry Saturday and Sunday afternoons and evenings. With many popular bars and restaurants in this area, this insight won’t come as a surprise to most locals!

Average number of bike rentals/returns showing the top 5 stations/periods with the biggest increase in activity between Wet and Dry weather

How these insights can improve Bike Scheme design (Dublin and beyond!)

Finally, given that the clustering produced such a clear output, I wondered whether Dublin Bikes users, or administrators of the system, could anticipate when a station is likely to be either too full (no room to leave a bike) or too empty (no bikes available to rent) based on the time of day. This could help users avoid a problematic station at the relevant times and help Dublin Bikes plan the rebalancing schedule for the trucks that move bikes between stations to increase availability.

To do this, I created three output classes in the data: 0, low occupancy, i.e. 10% occupancy or less; 1, normal occupancy, i.e. between 10% and 90% occupancy; and 2, high occupancy, i.e. 90% occupancy or more. To predict these, I used station, hour of the day, day of the week, month, and binary (yes/no) flags for whether it was dry or wet as features. I split the data randomly into a training and a test set and implemented a Random Forest classifier, again using the scikit-learn Python library.
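The classification setup can be sketched as below. The data is synthetic, the column names are hypothetical, the station is passed in as a plain integer ID, and a single `is_wet` flag stands in for the dry/wet indicators; none of these details are confirmed by the original analysis, and the scores printed here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000

# Synthetic observations (all column names are assumptions).
df = pd.DataFrame({
    "station_id": rng.integers(1, 11, n),
    "hour": rng.integers(0, 24, n),
    "day_of_week": rng.integers(0, 7, n),
    "month": rng.integers(1, 13, n),
    "is_wet": rng.integers(0, 2, n),
})

# Fake occupancy that depends on station and hour, plus noise, so there is a pattern to learn.
base = 50 + 45 * np.sin(2 * np.pi * (df["hour"] + df["station_id"]) / 24)
df["occupancy_pct"] = np.clip(base + rng.normal(0, 5, n), 0, 100)

# Three classes: 0 = low (<= 10%), 2 = high (>= 90%), 1 = everything in between.
df["occupancy_class"] = np.select(
    [df["occupancy_pct"] <= 10, df["occupancy_pct"] >= 90], [0, 2], default=1
)

feature_cols = ["station_id", "hour", "day_of_week", "month", "is_wet"]
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["occupancy_class"], test_size=0.25, random_state=0
)

# Fit the classifier on the training split and score it on the held-out split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print("accuracy:", round(accuracy, 3))

# Relative importance of each feature according to the fitted forest.
importances = pd.Series(clf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```

Since most observations typically fall into the "normal" class, comparing the accuracy against a baseline that always predicts class 1 is a useful sanity check when adapting this to real data.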

The model predicted whether a station would be too full, too empty or OK with an accuracy score of 79%. The feature importance scores showed that Station ID was by far the most important predictor, followed by hour of day and then both month and day of week. Weather attributes were almost inconsequential in predicting the occupancy. 79% is a high level of predictability: a user relying on this method to predict whether a bike or bike stand would be available would be correct roughly 4 times out of 5 on average. This offers potential for using a simple forecasting method in planning the bike redistribution done by maintenance trucks and/or an online application that shows not only current availability but also forecast availability for the future time when a user wants to rent or deposit a bike.

Conclusion

A beginner-level Python data science analysis of what appeared to be a simple dataset unveiled fascinating behaviors of Dubliners, brought to life with map visualization. It confirms what most Irish people already know, that cold or rain won’t interfere with our plans, but it also brought new insights: people cycle to but not from nights out, areas have clear activity purposes, and problematic full or empty stations can be forecast in advance, offering possible straightforward improvements to system operation and user planning guidance.

If you want to explore the analysis methodology, check out the git repository here. If you want to examine bike scheme data in other cities, there are plenty of similar datasets available online, such as those for Paris, New York, and London. My next analysis will look at COVID-19 lockdown-driven changes in Dubliners’ behaviors and what they might tell us about how Dublin’s city streets will ebb and flow in the coming months and years.

