Smart City Sensors to Predict Footfall

Utilizing Spatial Data Analysis to Understand and Forecast Pedestrian Traffic in Melbourne, Australia.

Jewel Britton
Towards Data Science


Coming from a background in architectural design and city planning, I have always been intrigued by the application of data to improve our cities and built environments. Melbourne, like many cities around the globe, aims to become a smart city in the next few decades, and data is driving this goal. Data analysis is widely seen as the future of smart cities, so many substantial datasets are made freely available to the public. For my first deep dive into city data, I worked with several geospatial datasets provided by the City of Melbourne to understand pedestrian footfall patterns.

Project Goals

For businesses deciding where they would like to set up shop, foot traffic and pedestrian activity are vital factors to consider. Shops and restaurants in busy areas will likely garner more customers and attention. With this clientele in mind, the main dataset I worked with was an hourly count of how many people pass over different sensors under the sidewalks across the city.

By looking at the sensor counts, it was possible to observe broad trends in foot traffic. However, I also wanted to understand what was bringing more or fewer people to specific areas. Could pedestrian foot traffic be predicted based on nearby buildings or city features? From a city planning perspective, knowing which elements of the city increase or decrease pedestrian traffic would be helpful for predicting the impact a new development would have on the nearby area.

In order to determine if certain elements impact the volume of pedestrians, I incorporated several other geospatial datasets in my analysis. This article will walk through the main steps in this project, but the full code can be found on my Github Repository.

Data Collection & Cleaning

As mentioned, many valuable and clean data sets are provided by the City of Melbourne. It was fortunate that there was minimal data cleaning to do, but below I outline the data sets I used and how I organized the data before modeling.

Hourly Sidewalk Sensor Counts

My main dataset was the sidewalk sensor counts, which had the following features: location coordinates for each sensor, a sensor ID number, the date and time of each hourly record, and hourly count values. Below is a Tableau map visualization of all of the sensors' locations across the city.

All sidewalk sensors in Melbourne, mapped by location coordinates on Tableau.

Data is available from 2009 to the present and is updated monthly. When collection began in 2009 there were only 18 sensors; there are now 65, so many sensors have been added over the past ten years. During data cleaning, I found that many sensors had faulty records, with counts suddenly dropping to zero for weeks or months at a time. New sensors were also added partway through the year. For consistency, I created a function to produce a yearly list of sensors that had full records for at least 12 months at a time. Since the records were not very stable for the first two years (2009 and 2010) and the records for 2020 do not yet cover a full 12 months, the scope of the analysis was limited to 2011–2019.
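The filtering step above can be sketched with a small pandas function. This is a minimal illustration, not the project's actual code: the column names (`sensor_id`, `datetime`) and the tiny example frame are assumptions, and "full records" is simplified here to "counts recorded in all 12 months of the year."

```python
import pandas as pd

def sensors_with_full_year(df, year):
    """Return sensor IDs that have counts recorded in all 12 months of `year`."""
    records = df[df["datetime"].dt.year == year]
    months_per_sensor = records.groupby("sensor_id")["datetime"].apply(
        lambda s: s.dt.month.nunique()
    )
    return sorted(months_per_sensor[months_per_sensor == 12].index)

# Tiny illustrative frame: sensor 1 covers all of 2015, sensor 2 stops in June.
times = pd.date_range("2015-01-01", "2015-12-31", freq="MS")
df = pd.DataFrame({
    "sensor_id": [1] * 12 + [2] * 6,
    "datetime": list(times) + list(times[:6]),
    "hourly_count": 100,
})
print(sensors_with_full_year(df, 2015))  # → [1]
```

Running this check per year produces the yearly list of reliable sensors described above, which is how faulty or newly added sensors get excluded from a given year's analysis.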

Nearby City Features

In order to gain a better understanding of the city features near different sensors, I included several geospatial datasets from the city. These were included to test if there was a correlation between features near the sensors and how much footfall the area received. These data sets included Yearly Building Data, Bike Dock Locations & Capacities, Landmarks & Points of Interest, and City Street Lighting. Each of these individual data sets included location coordinates and a bit of information about the feature.

One of the city features data sets included in this analysis — Street Furniture and Infrastructure in Melbourne.

Sensor Trends

To gain an overview of the trends of the sensors, I created a heat map visual to understand the days and months that were busy for each sensor. Here are the heat maps for two sensors which had consistent data from 2011–2018.

Heat map for Sensor 2, which looks to be busiest on Fridays in December. There is not as much foot traffic during the week, so this sensor is most likely in an entertainment or leisure district.
Heat map for sensor 9, which is almost exclusively busy on weekdays, indicating it may be in a business district. There is also a noticeable increase in foot traffic from 2016 to 2017. Note the difference in scale between the two sensors: even at its busiest, sensor 9 does not see as much pedestrian traffic as sensor 2.

Mapping Location Features

Creating Sensor Bubbles

To combine all of the location based datasets with the sensor they were closest to, I utilized the GeoPandas Python library to create a 100 meter radius around each sensor. The radius was kept small so it only included elements that directly impacted footfall on the sensor. Below is the map I created in Folium that has the sensors plotted with each of their surrounding radii. For more information on developing maps in Folium check out the source code here.
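A minimal sketch of the buffering step follows. The sensor coordinates here are made up, and one detail worth hedging: since the raw coordinates are in latitude/longitude, buffering by "100" would mean 100 degrees, so the points are first projected to a metric CRS (EPSG:28355, GDA94 / MGA zone 55, which covers Melbourne is an assumption on my part) before buffering.

```python
import geopandas as gpd
from shapely.geometry import Point

# Two illustrative sensor locations near Melbourne's CBD (coordinates are made up).
sensors = gpd.GeoDataFrame(
    {"sensor_id": [2, 9]},
    geometry=[Point(144.9631, -37.8136), Point(144.9668, -37.8102)],
    crs="EPSG:4326",
)

# Project to a metric CRS so buffer(100) means 100 meters, buffer each point,
# then project the circles back to lat/lon for plotting in Folium.
projected = sensors.to_crs(epsg=28355)
sensors["radius"] = projected.geometry.buffer(100).to_crs(epsg=4326)

print(sensors["radius"].iloc[0].geom_type)  # → Polygon
```

Each resulting polygon is the 100 meter "bubble" that the later feature-matching step tests points against.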

Folium Map of each sensor with a 100 meter radius around it.

Mapping Nearby Features

Then, using GeoPandas again, I developed a function to check each item in the location datasets (buildings, bikes, lighting, etc.) to see if the coordinate was inside any of the sensors’ radii. If the point was inside a sensor’s radius, the coordinate and information on the feature was added to the sensor’s list. Once all of the features inside the 100 meter radius were found across all of the location datasets, I created a function to plot this information in Folium.
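The point-in-radius check described above can be expressed compactly as a GeoPandas spatial join. This is a simplified sketch with hypothetical feature names and unit-circle radii standing in for the real 100 meter buffers:

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical sensor radii (unit circles here; the real project uses 100 m buffers).
radii = gpd.GeoDataFrame(
    {"sensor_id": [2, 9]},
    geometry=[Point(0, 0).buffer(1), Point(10, 10).buffer(1)],
)

# Hypothetical city features: a lamp near sensor 2, a bench near sensor 9,
# and a landmark near neither.
features = gpd.GeoDataFrame(
    {"feature": ["lamp", "bench", "landmark"]},
    geometry=[Point(0.5, 0), Point(10, 10.5), Point(50, 50)],
)

# The spatial join keeps only features whose point falls inside a sensor's radius.
matched = gpd.sjoin(features, radii, predicate="within")
print(matched[["feature", "sensor_id"]])
```

Counting the matched rows per sensor and per feature type gives exactly the "number of each nearby feature" columns used later for modeling.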

Below are the maps for sensor 2 and 9 with all of their nearby features. With this visual, we can see all of the city elements within a 100 meter radius from the sensor. Street lighting is coral, landmarks are purple, buildings are light blue, and street infrastructure is gray. The sensor can be seen in the center of the circle in dark blue.

Sensor 2 with all location features within a 100 meter radius of the sensor.
Sensor 9 with all location features within a 100 meter radius of the sensor.

Regression to Predict Daily Footfall Based on Location Features

Now that we could see which features are within 100 meters of each sensor, modeling could begin. For the first stage, I wanted to run a relatively simple Linear Regression to see if it was possible to predict daily footfall for each sensor based only on the nearby location features. The model's inputs were the count of each type of feature within the radius plus the sensor ID, and the target was average daily footfall.

A snippet of what the data frame looked like. In total there were over 60 columns detailing the number of each nearby feature in the sensor's radius.

Since the number of location features did not vary within the year, I did not expect the model to be accurate in predicting the daily variation and trends. However, I thought it would be interesting to see which features increased or decreased footfall. The training and test data was taken as a random 80/20 split, with all years mixed into each group.
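The setup above can be sketched in a few lines of scikit-learn. The data here is synthetic (three made-up feature-count columns with a linear target), so only the shape of the workflow matches the project, not the numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real matrix: counts of nearby features
# (e.g. seats, lights, retail) per sensor, with a noisy linear daily-count target.
X = rng.integers(0, 20, size=(500, 3)).astype(float)
y = 50 * X[:, 0] + 20 * X[:, 1] - 30 * X[:, 2] + rng.normal(0, 25, 500)

# Random 80/20 split with all years mixed together, as in the first model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```

Note that because the split is random, rows from every year land in both train and test, which is exactly why this first model looks stronger than it would on a genuinely held-out future year.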

Model Results: Linear Regression

-Training Score: 0.7751

-Test Score: 0.7749

-CV Score: 0.7747

-Root MSE: 224.592

With all scores above 0.77, the model appeared to perform relatively well for a first attempt, and analyzing the feature importances reveals which features increased or decreased footfall. However, the model does not perform well when asked to predict genuinely unseen data. For instance, if the training data only included 2011–2017 and the test data was exclusively 2018, the model's scores dropped dramatically.

Feature Importance

Features that decrease footfall

Rather surprisingly, a greater number of retail locations in a sensor's radius actually had a negative impact on footfall. Other interesting features that decreased footfall were a greater number of places of worship, a greater average number of building floors, and more parking locations.

Features that increase footfall

A greater number of seats and hospitals had a positive impact on the amount of footfall a sensor received, as did several pieces of city infrastructure such as floral boxes and tree guards. These were perhaps the most surprising, since they are features we would not necessarily notice when walking around the city.

Actual vs. Predicted

The main drawback of this model is that it predicts the same value everyday for each sensor. Since the features don’t change within the year, the model has no way of taking daily trends into account. This prevents the model from performing well when forecasting future footfall.

This graph shows the difference between what the Linear Regression model predicted and the actual daily footfall. The vertical lines indicate the model is predicting the same counts for each sensor everyday, and is not adjusting for daily variation and trends.

After tuning this model with the same features, the CV score improved to 0.828 by using a Decision Tree Regressor with GridSearchCV. However, the problem illustrated by the Actual vs. Predicted graph remains: a model using only location features can only become so accurate, since it does not account for daily variation and is unable to forecast future traffic.
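The tuning step mentioned above looks roughly like this in scikit-learn. The grid values and the synthetic data are illustrative assumptions; the real search would run over the 60+ feature columns:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.integers(0, 20, size=(400, 3)).astype(float)
y = 50 * X[:, 0] - 30 * X[:, 2] + rng.normal(0, 25, 400)

# Search a small grid of tree depths and leaf sizes with 5-fold cross-validation;
# GridSearchCV scores regressors by R^2 by default.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 20]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_score_` here corresponds to the CV score reported above, and `search.best_estimator_` is the tuned tree used for prediction.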

Forecasting Future Footfall

In order to develop a model that could predict future footfall, I combined the location features used in the previous regression model with more date based information. Along with the number of different location features in the radius, information on the month, day, year, and day of week were included as features. Additionally, I included the number of counts from that day one year prior, or two years prior if there was not valid data from the previous year. With these features, the train and test data was split based on year, so the model was trained on data from 2011–2017 and the test data was from 2018.
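The feature engineering and time-based split described above can be sketched as follows. The column names and the linear dummy counts are assumptions, and the one-year lag is approximated here with a simple 365-day shift rather than the fallback-to-two-years logic the project uses:

```python
import pandas as pd

# Hypothetical daily counts for one sensor across two years.
idx = pd.date_range("2017-01-01", "2018-12-31", freq="D")
daily = pd.DataFrame({"date": idx, "count": range(len(idx))})

# Date-based features.
daily["year"] = daily["date"].dt.year
daily["month"] = daily["date"].dt.month
daily["day_of_week"] = daily["date"].dt.dayofweek

# Count from (approximately) the same day one year prior.
daily["count_prev_year"] = daily["count"].shift(365)

# Time-based split rather than a random one: train on the earlier year,
# test on the later one, dropping rows with no prior-year lag available.
train = daily[daily["year"] == 2017]
test = daily[daily["year"] == 2018].dropna(subset=["count_prev_year"])
print(len(train), len(test))  # → 365 365
```

Splitting by year like this is what makes the reported test score meaningful as a forecast: the model never sees any 2018 rows during training.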

After tuning the parameters and testing several different model types, this model produced the highest scores so far, and was able to predict 2018 daily footfall for each sensor with a test score of 0.90.

Model Results: Ada Boost Regressor

-Training Score: 0.9994

-Test Score: 0.9032

-CV Score: 0.8744

-Root MSE: 158.7632

Feature Importance

The features that were most influential for this model’s predictions were the date based information and past year data. This shows that there is a yearly trend in each sensor that is useful for forecasting. There were also several location based factors that were relatively important for predicting, such as the number of basketball hoops, bollards, average number of floors, and number of bicycle rails.

Features that influence predictions the most

Actual vs. Predicted

This model was more nuanced in its predictions, and was able to predict different daily counts for each sensor. This model is therefore much more accurate, and there is a much stronger linear trend between predicted and actual. There are a couple of strong outliers where the model predicted much higher or lower than the actual, and these points could be investigated further to better understand what influenced these variations.

Actual vs. predicted daily counts generated from the model

Conclusion

So what impacts where people walk around Melbourne? From these models we have learned which features of the city draw more pedestrians, which draw fewer visitors, and which help predict future trends.

The first model showed that an area with more seats, flower beds, hospitals, and community use buildings will have more foot traffic. Areas with more retail stores, places of worship, and barbeques will decrease the number of people that walk there on an average day. If the goal of a business or city planner is to predict future trends they will want to use the second model. This model indicated that past year data and date specific information will be the strongest predictors for estimating future footfall. However, several city infrastructure elements such as the number of basketball hoops, bollards, and average number of floors of surrounding buildings can also provide more accurate predictions.

This project was a great learning experience, and allowed me to better understand geospatial data science — a topic I will continue exploring. Overall I am happy with the results that I found so far, and am eager to continue working on this project in the future. Some ideas I want to explore further to improve my model are:

  • return to analyzing hourly pedestrian count data rather than daily averages to get more precise trends
  • include more location features of the city
  • include seasonal information such as city events, public holidays, sporting events, etc.
  • cluster sensors based on location features so it is possible to predict footfall for new sensors that don’t have previous year data

Thank you for reading, and please feel free to contact me with any suggestions/ideas via LinkedIn :)
