The Battle of the Neighborhoods — Open a Movie Theater in Montreal

A capstone project using Folium and the Foursquare APIs for the IBM Data Science Professional Certificate

Tony Xu
Towards Data Science

I took the IBM Data Science Professional Certificate on Coursera, which is composed of 9 courses. The courses come with pretty good examples.

The final assignment is the Capstone project, which asks you to fetch data through the Foursquare APIs and to visualize your analysis with the folium map library. It's a good opportunity to practice data science methodology and tooling.

In this project, we will cover all phases of the data science life cycle to solve a problem, and we will dive into the following data science tools and libraries:

  • Folium library, including choropleth map, heatmap in map view
  • Honeycomb Grid for Folium map
  • Google Geocoding APIs
  • Foursquare APIs
  • K-Means Clustering Algorithm
  • Horizontal Bar Chart
  • Pandas, Numpy, Shapely

Ok, let’s get started.

Introduction: Business Problem

In this project, we are going to look for an optimal location to open a movie theater. Specifically, this report can provide a reference for stakeholders who are interested in opening a movie theater in Montreal, Quebec, Canada.

Montreal is the second-largest city in Canada and the largest city in the province of Quebec, located along the Saint Lawrence River at its junction with the Ottawa River. It sits on an island, and in this report we will focus on all areas of Montreal island. There are many movie theaters on the island, so we will first determine where the existing ones are. Then we will use a clustering model to find similar areas on the island, taking into account the demographic data of each borough and region. The preferred area should be distant from existing movie theaters.

We will use data science tools to fetch the raw data, visualize it, and then generate a few of the most promising areas based on the above criteria. We will also explain the advantages and traits of the candidates, so that stakeholders can make the final decision based on the analysis.

Data

Based on the definition of our problem, factors that may impact our decision are:

  • Demographic information, e.g. population, density, education, age, income
  • Number of existing shopping malls in the neighborhood and nearby
  • Number of existing movie theaters in the neighborhood and nearby

We decided to use a regularly spaced grid of locations covering the whole of Montreal island to define our neighborhoods. Concretely, we will use the popular hexagonal honeycomb grid.

In this project, we will fetch or extract data from the following data sources:

  • Montreal census information from 2016
  • Centers of the hexagon neighborhoods will be generated algorithmically, and approximate addresses of those centers will be obtained using the Google Geocoding API
  • Shopping mall and movie theater data for every neighborhood will be obtained using the Foursquare API
  • The coordinates of the Montreal center will be obtained by geocoding a well-known Montreal location with the Google Geocoding API
  • Montreal borough shapefile is obtained from Carto

Montreal Island Shape File

To show the Montreal island boundary in the folium map, we need a geojson definition file for Montreal island. We downloaded this shapefile from the Carto website.

The file is in JSON format, containing boundary definition for every borough or municipality in Montreal island. We will visualize this geojson definition file with a folium map in the next step.

Folium

folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the leaflet.js library. Manipulate your data in Python, then visualize it in on a Leaflet map via folium.¹

Folium is not difficult to use: it takes just a few lines of code to show Montreal island with its boundary data.

Show Montreal boundary in the Folium map with geojson definition.
Montreal island in Folium map.

Next, we want to generate candidate cells on the map, more specifically only within Montreal island. A honeycomb hexagon grid is a popular choice for map-related problems. Unlike circles, hexagons tile the plane with no gaps between them, so no area is missed. Furthermore, the distance between the centers of any two adjacent hexagons is the same.

Unfortunately, Folium doesn’t provide native support for drawing hexagons in the map view, so we have to write some code to support this feature.

We write a function that calculates the coordinates of a hexagon’s vertices from its centroid coordinates and the length of its side.

Get hexagon’s vertices.
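As a rough illustration of the vertex calculation, here is a minimal sketch. It assumes planar coordinates in meters (such as the UTM X/Y values used later) and a "pointy-top" orientation; the function name `hexagon_vertices` and the orientation are assumptions for this example, not necessarily what the original notebook uses.

```python
import math


def hexagon_vertices(center_x, center_y, side):
    """Return the six (x, y) vertices of a regular hexagon.

    Assumes planar coordinates (e.g. UTM meters), not latitude/longitude.
    In a regular hexagon the distance from the center to each vertex
    equals the side length.
    """
    vertices = []
    for i in range(6):
        # A 30 degree offset puts a vertex at the top ("pointy-top" layout).
        angle = math.radians(60 * i + 30)
        vertices.append((center_x + side * math.cos(angle),
                         center_y + side * math.sin(angle)))
    return vertices
```

To draw the hexagon in Folium, the vertices would still need to be converted back to latitude/longitude before being passed to a polygon layer.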

After that, we generate a honeycomb hexagon grid throughout the island.

Generate honeycomb hexagon grid in Montreal.
Show honeycomb hexagon grid in the Montreal map.
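One possible way to lay out such a grid is sketched below. It assumes planar coordinates in meters, pointy-top hexagons, and a boundary given as a simple list of (x, y) points; the helper names and the hand-written ray-casting test are illustrative only (a library such as Shapely could do the containment check instead).

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def honeycomb_centers(boundary, side):
    """Return hexagon centers of a honeycomb grid that fall inside boundary.

    For pointy-top hexagons with side length `side`, centers in a row are
    sqrt(3) * side apart, rows are 1.5 * side apart, and every other row
    is shifted by half a column.
    """
    xs = [p[0] for p in boundary]
    ys = [p[1] for p in boundary]
    dx = math.sqrt(3) * side
    dy = 1.5 * side
    centers = []
    row = 0
    y = min(ys)
    while y <= max(ys):
        x = min(xs) + (dx / 2 if row % 2 else 0.0)
        while x <= max(xs):
            if point_in_polygon(x, y, boundary):
                centers.append((x, y))
            x += dx
        y += dy
        row += 1
    return centers
```

With this layout every hexagon center is exactly `sqrt(3) * side` away from each of its neighbors, which is the equal-distance property mentioned above.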

Looks great! 😄

So far we have created a honeycomb grid on the island and generated the center coordinates of each hexagon. We will use the Google Geocoding API to reverse-geocode each center to an address.

Google Geocoding API

The Google Geocoding API is a service that provides geocoding and reverse geocoding of addresses.²

Using this set of APIs requires a Google API key, which can be obtained from the Google Developer Console.

Google Geocoding APIs.
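A reverse geocoding request can be sketched as follows. The example only builds the request URL and reads an already-parsed JSON response, so it runs without network access or a real key; the helper names are made up for illustration, and the sample response in the usage note is hand-written, not real API output.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def reverse_geocode_url(lat, lng, api_key):
    """Build the request URL for reverse geocoding one coordinate pair."""
    params = {"latlng": f"{lat},{lng}", "key": api_key}
    return GEOCODE_URL + "?" + urlencode(params)


def first_address(response):
    """Return the first formatted address in a parsed response, or None."""
    results = response.get("results", [])
    return results[0]["formatted_address"] if results else None
```

In practice the URL would be fetched with an HTTP client (for example `requests.get(url).json()`), and `first_address` applied to the result. For a hand-written response such as `{"status": "OK", "results": [{"formatted_address": "Example Street, Montreal"}]}` it returns `"Example Street, Montreal"`.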

Let’s put all the data in a Pandas Dataframe, and show the first 10 items.

Each row contains the address of a hexagon’s center and its latitude and longitude in degrees, which are in the WGS84 spherical coordinate system. The X/Y columns are in the UTM Cartesian coordinate system, which uses metric units (meters).

Dataframe of candidate hexagons.

Foursquare API

The Foursquare Places API offers real-time access to Foursquare’s global database of rich venue data and user content to power your location-based experiences in your app or website.³

Now that we have generated all the candidate neighborhoods on Montreal island, we will get information on all movie theaters using the Foursquare API.

In the Foursquare API documentation, we can find the movie theater category under Venue Categories. The ID of the Movie Theater category is 4bf58dd8d48988d17f941735, and it sits under the Arts & Entertainment main category. It contains several sub-categories:

  • Drive-in Theater, id: 56aa371be4b08b9a8d5734de
  • Indie Movie Theater, id: 4bf58dd8d48988d17e941735
  • Multiplex, id: 4bf58dd8d48988d180941735

Unlike coffee shops and restaurants, which are everywhere, there aren’t many movie theaters in the region. This makes sense, since we don’t expect a movie theater in every neighborhood.

Let’s fetch all the movie theaters on Montreal island first. To do so, we will fetch movie theaters data in each borough and municipality.

Fetch nearby movie theaters by Foursquare APIs.
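The sketch below shows one way such a query could be assembled and its results merged. It assumes the legacy Foursquare v2 `venues/search` endpoint and its `response.venues` layout; the article does not say which endpoint was used, so treat the URL, the parameter set, and the helper names as assumptions. No request is sent here.

```python
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"  # assumed endpoint
MOVIE_THEATER_CATEGORY = "4bf58dd8d48988d17f941735"  # ID quoted in the article


def search_params(lat, lng, radius_m, category_id,
                  client_id, client_secret, version="20200101", limit=50):
    """Build query parameters for one venue search around (lat, lng)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",
        "radius": radius_m,
        "categoryId": category_id,
        "limit": limit,
    }


def collect_venues(responses):
    """Merge venues from several parsed responses, dropping duplicates by id.

    Searches around neighboring points overlap, so the same theater can
    appear in more than one response.
    """
    venues = {}
    for resp in responses:
        for venue in resp.get("response", {}).get("venues", []):
            loc = venue["location"]
            venues[venue["id"]] = (venue["name"], loc["lat"], loc["lng"])
    return venues
```

Each parameter dictionary would be passed to an HTTP client, for example `requests.get(SEARCH_URL, params=params).json()`, once per borough or municipality, and the parsed responses handed to `collect_venues`.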

According to the Foursquare API responses, there are a total of 44 movie theaters on Montreal island. Let’s plot them in a map view.

Movie theaters on Montreal island.

Let’s show them as a heatmap using the positron style.

Heatmap of movie theater distribution on the island.

From the heatmap, we can see that the movie theaters are mainly concentrated in the downtown area and the center of the island. There are usually a lot of shopping malls near movie theaters, so let’s pull the shopping center data for Montreal island using the Foursquare APIs as well.

From Foursquare API documentation, there are several categories related to shopping malls or shopping centers.

We will fetch all shopping malls data in the above categories and show them on the map with movie theaters data.

Shopping malls in blue, movie theaters in red.

From the map view, we can see that movie theaters are located near shopping malls in most cases.

Our target area shall have more shopping malls and fewer movie theaters nearby.

Before selecting a target area, we need to cluster all the candidate hexagons based on their characteristics. In this project, we use census data as the major features for clustering.

Montreal Census information

Now we will fetch the census information for each borough or municipality on Montreal island. The latest data was collected in 2016, and we can get it from the official Montreal city website.

It’s a pretty big Excel file containing a lot of data. I modified some sheets a bit to make the data easier to load into a Pandas DataFrame.

We focus on only a few basic census features: Population, Density, Age, Education and Income.

Load excel into Pandas Dataframe.
Census Dataframe after pre-processing

Next, we will show census data distribution on a choropleth map.

A Choropleth Map is a map composed of colored polygons. It is used to represent spatial variations of a quantity.⁴

We also show shopping centers and movie theaters’ locations on the same map.

Show Census info on a choropleth map.
Population distribution by boroughs.
Density distribution by boroughs.
Education distribution by boroughs.
Age distribution by boroughs.
Income distribution by boroughs.

From the above choropleth maps, we can see that movie theaters are mostly located in areas with a higher population, and the same holds for shopping centers. Moreover, most movie theaters are located in areas with lower income, while regions with higher income have fewer shopping centers and movie theaters.

So far, we have retrieved all the raw data we need and visualized it. In the following steps, we will manipulate these datasets, extract data, and generate new features for the machine learning algorithm. Finally, we will find the most suitable place to open a movie theater on Montreal island.

Methodology

The business purpose of this project is to find a suitable place on Montreal island to open a movie theater.

So far we have retrieved the following data:

  1. All movie theaters data on Montreal island
  2. All shopping centers data on Montreal island
  3. 2016 Montreal census data for each borough, concretely, Population, Density, Age, Education and Income data for each borough or municipality within Montreal island
  4. Boundary data of each borough and municipality on Montreal island

We also generated a honeycomb hexagons grid throughout the whole Montreal Island.

Based on the above raw data, we will generate new features, e.g. census information for each candidate cell, and the number of movie theaters and shopping malls inside the cell and nearby.

In the final step, we will focus on the most promising areas with more shopping malls and fewer movie theaters. And we will also present the candidate hexagon cells in the map view for stakeholders to make the final decision.

Analysis

We have the basic census information for each borough and municipality. We want the census information for each candidate hexagon cell, so we calculate it from the boroughs and municipalities that intersect the cell.

If a hexagon lies completely inside one borough, we use that borough’s census information as the hexagon’s. This means all hexagons inside one borough are treated the same as far as the census features are concerned.

Accordingly, if a hexagon overlaps two boroughs with 50% of its area in each, its census data is a 50/50 weighted combination of the two boroughs’ values.

Based on this rule, we can calculate the census for all hexagons.

Generate census Info for each hexagon.
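The weighting rule can be written down in a few lines. This sketch assumes the overlap fractions have already been computed (for example from polygon intersection areas with Shapely); the function name, the input layout, and the renormalization of fractions that do not sum to 1 are assumptions for illustration.

```python
def hexagon_census(overlap_fractions, borough_census):
    """Area-weighted census features for one hexagon.

    overlap_fractions: {borough: fraction of the hexagon's area inside it}
    borough_census:    {borough: {feature_name: value}}
    Fractions are renormalized so they sum to 1 (an assumption made here
    for hexagons that partly cover water or fall outside every borough).
    """
    total = sum(overlap_fractions.values())
    if total <= 0:
        raise ValueError("hexagon does not overlap any borough")
    features = {}
    for borough, fraction in overlap_fractions.items():
        weight = fraction / total
        for name, value in borough_census[borough].items():
            features[name] = features.get(name, 0.0) + weight * value
    return features
```

With made-up numbers, a hexagon split evenly between a borough with income 40000 and one with income 60000 gets an income of 50000, while a hexagon entirely inside one borough simply inherits that borough’s values.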

Let’s merge this data frame with the previous location data frame and generate a new one: candidates_df which contains basic information on each hexagon. We print several rows of this data frame.

candidates_df.iloc[200:206]
Census info of hexagons

Looking good. Now we have census information in each hexagon area.

Then we will calculate the shopping mall and movie theater information for each hexagon area.

We will calculate the following features for shopping malls and movie theaters:

  1. The number of shopping malls and movie theaters within the current hexagon cell.
  2. The number of shopping malls and movie theaters within 1 km of the center of the hexagon cell.
  3. The number of shopping malls and movie theaters within 3 km of the center of the hexagon cell.

Generate the number of movie theaters for each hexagon.
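The two distance-based features can be sketched as below, assuming both the hexagon center and the venues are given as UTM X/Y coordinates in meters so that plain Euclidean distance applies. The function and key names are illustrative; counting venues inside the cell itself would additionally need a point-in-hexagon test.

```python
import math


def count_within(center, venues, radius_m):
    """Count venues whose distance to center is at most radius_m meters."""
    cx, cy = center
    return sum(1 for vx, vy in venues
               if math.hypot(vx - cx, vy - cy) <= radius_m)


def nearby_counts(center, venues):
    """Return the 1 km and 3 km venue counts for one hexagon center."""
    return {
        "within_1km": count_within(center, venues, 1000.0),
        "within_3km": count_within(center, venues, 3000.0),
    }
```

Running `nearby_counts` once with the shopping mall coordinates and once with the movie theater coordinates gives the distance-based features for each hexagon.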

Now that we have prepared all the data we need, we can use the K-Means clustering algorithm to group similar candidate hexagon areas into clusters.

K-Means Clustering

We pick the census features, the number of shopping malls, and the number of movie theaters as input features.

Selected features as input parameters for the K-Means Clustering Algorithm

We will run an evaluation step first to select the best K, which is the number of clusters in the algorithm.

We use two methods, the Sum of Squared Distance and the Silhouette Score, to evaluate the K-Means algorithm for different values of K.

Sum of Squared Distance measures error between data points and their assigned clusters’ centroids. Smaller means better.

The Silhouette Score also rewards compact clusters, but in addition it rewards separation between clusters: for each point, it compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest other cluster. It ranges from -1 to 1, and the bigger the value is, the better K is.
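To make the two metrics concrete, here is a small pure-Python sketch that computes both for a given cluster assignment. It is only meant to show the definitions; with scikit-learn the same quantities are available as `KMeans.inertia_` and `sklearn.metrics.silhouette_score`. The function names below are made up for this example, and the silhouette needs at least two clusters.

```python
import math
from collections import defaultdict


def sum_squared_distance(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Mean silhouette value over all points (requires >= 2 clusters)."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for label, members in clusters.items() if label != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

For four points forming two tight pairs far apart, grouping each pair together gives a small Sum of Squared Distance and a silhouette close to 1, while mixing the pairs gives a large Sum of Squared Distance and a negative silhouette.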

K Selection for K-Means Clustering

From the figure, we can see the Sum of Squared Distance going down as K becomes bigger. When K is 2 or 3, the Silhouette Score is higher, but the Sum of Squared Distance is still high. We therefore choose K=10 for this project as a balance between the Sum of Squared Distance and the Silhouette Score. Let’s run the K-Means algorithm again with K=10.

K-Means clustering algorithm with k=10.

Let’s visualize the clustering results in the map view, with a different color for each cluster.

10 clusters of candidate hexagons

Let’s put everything together on one map view:

  1. Clusters in colors for hexagons
  2. Shopping malls as blue points
  3. Movie theaters as red points with a yellow ring

Clusters with Shopping malls and Movie theaters

From the cluster plot in the above map view, we can see one cluster in light blue composed of 4 hexagons in downtown. This cluster is full of movie theaters and shopping malls.

The purple cluster contains areas with a lot of shopping malls. The light green cluster contains more shopping malls and movie theaters than any other cluster except the downtown one.

Let’s assign weights to the three movie theater features and combine them into one feature, and do the same for the shopping mall features. This makes sorting easier.

We will calculate a weighted Mall Score and a weighted Cinema Score, then generate a new Score feature for sorting.

The higher the final score, the more shopping malls and the fewer movie theaters there are.

Generate a score for the cluster.
Sort clusters by the score.
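A scoring scheme of this kind could look like the sketch below. The article does not state the weights or the exact formula, so the weights (closer venues count more) and the "mall score minus cinema score" combination are purely hypothetical placeholders.

```python
# Hypothetical weights: the article does not give its actual values.
WEIGHTS = {"local": 3.0, "within_1km": 2.0, "within_3km": 1.0}


def weighted_score(counts):
    """Combine the three counts (local, 1 km, 3 km) into one number."""
    return sum(WEIGHTS[name] * counts[name] for name in WEIGHTS)


def final_score(mall_counts, cinema_counts):
    """Higher means more shopping malls and fewer movie theaters."""
    return weighted_score(mall_counts) - weighted_score(cinema_counts)


def rank_by_score(candidates):
    """Sort {name: (mall_counts, cinema_counts)} by score, best first."""
    scored = {name: final_score(malls, cinemas)
              for name, (malls, cinemas) in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

The same functions work for whole clusters if the counts passed in are per-cluster averages, and for individual hexagons when ranking cells inside the chosen cluster.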

Cluster 7 has the highest score: it has more shopping malls and fewer movie theaters. Let’s explore more characteristics of Cluster 7.

Statistics for cluster 7

There are 40 hexagons in Cluster 7, with an average of 0.77 malls and 0.0 cinemas inside each cell. Let’s plot all clusters in a bar chart using the matplotlib.pyplot library to compare each feature. We highlight Cluster 7, which is our target cluster.

Draw horizontal bar charts for clusters.
Feature comparison of clusters

From the bar chart, we can see that Cluster 7 has the largest population and the highest density among all the clusters. Furthermore, it has a fairly high number of shopping centers in the hexagon area or nearby and relatively few movie theaters.

Next, we sort all hexagons in Cluster 7 by Score in descending order and pick the first 5 hexagons. They will be our first-choice locations for opening a movie theater.

As the above statistics show, these hexagons have 1 to 3 shopping malls inside the cell and more shopping malls nearby, but no movie theater within 1 km. These look like quite good selections.

Let’s plot Cluster 7 hexagons in the map view, gray out the other clusters and highlight our 5 choices as well.

5 most promising candidate areas

This concludes our analysis. We have found the 5 most promising zones, with more shopping malls nearby and fewer movie theaters around the area. Each zone is a regular hexagon, a shape that is popular in map views. The zones in this cluster have the largest population and the highest density compared with the other clusters.

Results and Discussion

We generated hexagon areas all over Montreal island and grouped them into 10 clusters according to census data, including population, density, age, education, and income. Shopping center information and existing movie theater information were also considered when running the clustering algorithm.

From data analysis and visualization, we can see that movie theaters are usually located near shopping malls, which inspired us to look for areas with more shopping malls and fewer movie theaters.

After running the K-Means clustering algorithm, we picked the cluster with the most shopping malls nearby and fewer movie theaters on average. We also examined the other characteristics of the cluster. It has the largest population and the highest density, which suggests the highest traffic among all the clusters.

There are 40 hexagon areas in this cluster. We sorted them in descending order by a score built from the shopping mall and movie theater information, which favors more shopping malls and fewer movie theaters in the local cell or nearby.

We draw our conclusion with the 5 most promising hexagon areas satisfying all our conditions. These recommended zones shall be a good starting point for further analysis. There are also other factors which could be taken into account, e.g. real traffic data and the revenue of every movie theater, parking lots nearby. They will be helpful to find more accurate results.

Conclusion

The purpose of this project is to find an area on Montreal island to open a movie theater.

After fetching data from several data sources, processing it into a clean data frame, and applying the K-Means clustering algorithm, we picked the cluster with more shopping malls and fewer movie theaters on average. By sorting all candidate areas in the cluster, we obtained the 5 most promising zones, which can be used as starting points for final exploration by stakeholders.

The final decision on the optimal movie theater location will be made by stakeholders based on the specific characteristics of the neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the parking available at each location, the traffic of existing movie theaters in the cluster, and their current revenue.

Code

https://github.com/kyokin78/Coursera_Capstone/blob/project/CapstoneProject_OpenCinemaInMontreal.ipynb

References

Capstone Project — Open an Italian Restaurant in Berlin, Germany

  1. https://python-visualization.github.io/folium/
  2. https://developers.google.com/maps/documentation/geocoding/start
  3. https://developer.foursquare.com/docs/places-api/
  4. https://plotly.com/python/choropleth-maps/

Certified IBM Data Scientist, Senior Android Developer, Mobile Designer, Embracing AI, Machine Learning…