The world’s leading publication for data science, AI, and ML professionals.

The Battle of Neighborhoods: Starting a Coffee Shop Business

Build models for segmenting the neighborhoods to find the most conducive locations for starting a Toronto City cafe business.

Create a clustering model for segmenting Toronto neighborhoods to find the best locations for starting a business

Photo by Rachael Annabelle on Unsplash

A. Introduction

A. 1. Background and Business Problem

Toronto is Canada’s largest city, with a population of more than 2.7 million and a density of 4,334.4 people per square kilometer. The city is renowned as one of the most multicultural in the world thanks to its large population of immigrants from all over the globe. This has made Toronto a leader among metropolitan and cosmopolitan cities in many sectors, including business.

Designed by the author using Tailor Brands

Now, imagine that you own a coffee shop called Kopiasli (fictitious) that has been doing business successfully in New York. This year, your team plans to expand the business and decides to look for a city that shares New York’s traits. One such city is Toronto.

To ensure this project’s success, the team requires insights into **the demographics, neighboring businesses, and crime rates**. For each neighborhood, we can ask:

  • How many cafes exist?
  • What are the most popular venues?
  • Can we get information about the vehicle and foot traffic?
  • What is the neighborhood’s crime rate? And so on.

Thus, the project goal is to figure out the best locations for opening up a new coffee shop in Toronto City.

Entrepreneurs who are passionate about opening a coffee shop in a metropolitan city would be very interested in this project. The project is also for business owners and stakeholders who want to expand their businesses and wonder how data science could be applied to the questions at hand.

A.2. Data Description

The following data sources are used in this project:

  • 1st Data: The most recent record of traffic signal vehicle and pedestrian volumes in Toronto City. The data is typically collected between 7:30 a.m. and 6:00 p.m. at intersections with traffic signals. [1]

  • 2nd Data: The most updated record of crime incidents reported in Toronto City provided by Toronto Police Services. [2]

  • 3rd Data: The list of Toronto neighborhoods represented by postal codes and their boroughs. We will use the Geocoder Python package to retrieve each postal code’s coordinates. [3]

  • 4th Data: The popular or most common venues of a given neighborhood in Toronto. This information is stored in Foursquare Location Data, and we will use the Foursquare API to access it. [4]

To sum up, we will use the 1st and 2nd data to analyze the pedestrian/vehicle volume and crime rates. Then, we load the 3rd data to obtain the exact coordinates for each neighborhood based on the postal code, allowing us to explore and map the city. Finally, we will use the coordinates and Foursquare credentials to access the 4th data source through its API and retrieve the popular venues along with their details, especially for coffee shops. The venue frequency in each neighborhood will be the features of the clustering model.

B. Methodology

B.1. Analytic Approach

We approach the problem with a clustering technique, namely k-Means. This approach lets the audience see how similar the neighborhoods are with respect to their demographics. We can then examine each cluster and determine the discriminating venue categories that distinguish it. We will also display any statistics needed to answer questions concerning crime incidents and vehicle and foot traffic records.

B.2. Exploratory Data Analysis

B.2.1. Vehicle and Foot Traffic

We begin by analyzing the pedestrian and vehicle volume data. The column Main contains the main street name, which appears several times because each street has multiple intersections. We can group by street name and aggregate the volumes by either summing or averaging them. We choose to average for simplicity. This returns 248 main roads.
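The grouping step can be sketched with pandas. This is a minimal illustration, not the project's actual code: only the column `Main` is named in the text, so the volume column names (`pedestrian_volume`, `vehicle_volume`) and the toy rows are assumptions.

```python
import pandas as pd

# Toy intersection counts; street rows and values are invented for illustration.
traffic = pd.DataFrame({
    "Main": ["YONGE ST", "YONGE ST", "BLOOR ST W", "BLOOR ST W", "KING ST E"],
    "pedestrian_volume": [3000, 1000, 1500, 500, 800],   # assumed column name
    "vehicle_volume": [20000, 10000, 12000, 8000, 6000],  # assumed column name
})

# One row per main street: average the volumes over all of its intersections.
roads = (
    traffic.groupby("Main", as_index=False)[["pedestrian_volume", "vehicle_volume"]]
    .mean()
)
print(roads)
```

Replacing `.mean()` with `.sum()` gives the summed alternative mentioned above.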

The statistics summary of pedestrian and vehicle volumes during peak hour.

We want our candidate neighborhoods to be lively, so we filter the roads. In this example, we keep only the roads with an average pedestrian volume above 1200 or an average vehicle volume above 12000 during peak hour (above ~70%). This gives us 139 main roads.
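A filter of this kind might look like the sketch below. The thresholds come from the text, while the column names and the four toy roads are assumptions made for illustration.

```python
import pandas as pd

# Toy per-road averages; names and values are invented for illustration.
roads = pd.DataFrame({
    "Main": ["YONGE ST", "BLOOR ST W", "KING ST E", "SMALL RD"],
    "pedestrian_volume": [2000.0, 1000.0, 1300.0, 200.0],  # assumed column name
    "vehicle_volume": [15000.0, 13000.0, 6000.0, 3000.0],  # assumed column name
})

PED_MIN = 1200   # pedestrian threshold quoted in the text
VEH_MIN = 12000  # vehicle threshold quoted in the text

# Keep a road if EITHER volume exceeds its threshold.
is_busy = (roads["pedestrian_volume"] > PED_MIN) | (roads["vehicle_volume"] > VEH_MIN)
busy_roads = roads[is_busy].reset_index(drop=True)
print(busy_roads)
```

Because the condition is an "or", a road with heavy vehicle traffic but few pedestrians still passes the filter. Switching `|` to `&` would keep only roads that are busy on both measures.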

Finally, we can visualize the roads using the Folium Python module from the given coordinates. The map shows a glimpse of the city’s busiest roads, where many are located around downtown, which is not surprising 🤣.

In the next section, this visualization helps us filter the candidate areas and neighborhoods we need to focus on.

B.2.2. Crime Statistics

Next, we analyze the crime statistics from 2014 to 2019. This gives us 206,435 crime incidents segmented by police divisional boundaries, neighborhoods, and Major Crime Indicators (MCI). Toronto Police Service divides major crimes into 5 categories, spread across 17 divisions and 140 neighborhood IDs.

The first five rows of major crime data. The red box is the police service division code we are interested in.
Toronto Police Service Divisional Boundaries

We will group the data based on division (Division), not neighborhood (Hood_ID). This will give us insight into the safest boroughs and their neighborhoods.
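As a rough sketch of this grouping, the incidents can be cross-tabulated by division and crime category. The column `Division` is named in the text, but the column name `MCI` and the toy incident rows are assumptions.

```python
import pandas as pd

# Toy incident records; rows are invented for illustration.
crime = pd.DataFrame({
    "Division": ["D51", "D51", "D51", "D53", "D53", "D13"],
    "MCI": ["Assault", "Assault", "Robbery", "Assault", "Break and Enter", "Assault"],  # assumed column name
})

# Rows: divisions. Columns: crime categories. Cells: incident counts.
by_division = pd.crosstab(crime["Division"], crime["MCI"])
by_division["Total"] = by_division.sum(axis=1)
by_division = by_division.sort_values("Total", ascending=False)
print(by_division)
```

Sorting by the total puts the divisions with the most incidents first, which makes it easier to split them into high, middle, and low groups by eye.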

Left - MCI from 2014 to 2019. Right - MCI in 2019 only.

Among the 5 MCIs, assault has been the most frequent for 6 consecutive years. Over the same period, several divisions show consistent crime rates. We can segment the divisions into three groups:

  1. High Crime Rates (D51, D43, D41, D32, D31, D14)
  2. Middle Crime Rates (D52, D42, D23, D22)
  3. Low Crime Rates (D55, D54, D53, D33, D13, D12, D11)

Since we expect our candidate neighborhoods to be:

  • safe – having low crime rates
  • lively – crowded by people, vehicles, and easy to access
  • close to downtown,

the qualifying divisions are D55, D54, D53, and D13. According to the Toronto Police Service Wikipedia page [5], these divisions cover:

  • Central Toronto (D53)
  • East York (D53, D54, D55)
  • York (D13)

In the next section, we will explore the neighborhoods inside Central Toronto, East York, and York as the selected boroughs.

B.2.3. Neighborhood Analysis

Lastly, we have built a neighborhood data frame that contains 103 postal codes, 10 boroughs, the neighborhood names inside each borough, and their coordinates. Since we are interested only in neighborhoods inside Central Toronto, East York, and York, we filter the data frame. This leaves 3 boroughs and 19 neighborhoods.

The first 5 neighborhoods of the selected boroughs
Left - the map of city neighborhood distribution. Right - the neighborhood distribution for Central Toronto, East York, and York.

Given the coordinates, we can use the Foursquare API to access the 4th data source, explore the neighborhoods, and get up to 100 top venues within a radius of 1 km for each. This returns 905 venues with 172 unique venue categories.
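The request for one neighborhood could be assembled roughly as follows. This is a sketch under assumptions: the article does not say which endpoint it calls, so the legacy v2 `venues/explore` URL, the helper name `build_explore_url`, the version date, and the placeholder coordinates and credentials are all illustrative, and that endpoint may no longer be available. No network call is made here.

```python
from urllib.parse import urlencode

# Assumed legacy endpoint; the article does not name the one it uses.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=1000, limit=100, version="20200101"):
    """Build (but do not send) a venue-explore request URL.

    radius is in meters (1 km, as in the text); limit caps the venues returned.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, hypothetical value
        "ll": f"{lat},{lng}",    # latitude,longitude of the neighborhood
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder coordinates and credentials, for illustration only.
url = build_explore_url(43.7063, -79.3094, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

Looping this over the 19 neighborhood coordinates and collecting the responses into one data frame would yield a venue table like the one described above.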

The first 5 venues returned for Parkview Hill Neighborhood

Some neighborhoods return more than 50 venues, such as Davisville and Davisville North (100 venues). However, many return fewer than 50, such as Thorncliffe Park (38 venues) and Parkview Hill (19 venues). For each neighborhood, we can list the top 10 venue categories by number of occurrences as follows.
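One way to derive such a ranking is to one-hot encode the venue categories, average them per neighborhood, and take the largest values. The sketch below assumes column names `Neighborhood` and `Venue Category` and uses invented rows, with a top 2 instead of a top 10 to keep the toy example small.

```python
import pandas as pd

# Toy venue list; neighborhoods and rows are invented for illustration.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park",
                       "Pharmacy", "Park", "Pharmacy"],
})

# One column per category, 1.0 where a venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the one-hot columns = share of each category in a neighborhood.
freq = onehot.groupby("Neighborhood").mean()

TOP_N = 2  # the article uses 10
top_venues = {
    hood: row.nlargest(TOP_N).index.tolist()
    for hood, row in freq.iterrows()
}
print(top_venues)
```

The `freq` table doubles as a feature matrix: each row describes a neighborhood by the share of each venue category it contains.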

The first five rows of the neighborhood's top 10 venues.

The data frame above shows that the same venue categories are returned for different neighborhoods. We can use this idea to cluster the neighborhoods based on their venues, which represent services and amenities.

B.3. Clustering the Neighborhoods

We will run the k-Means algorithm to build clustering models with different numbers of clusters (k). The features are the mean frequency of occurrence of each venue category in a neighborhood. By plotting the silhouette score against k, elbow-style, we can measure and compare the clustering performance.
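The model selection loop can be sketched with scikit-learn as below. The feature matrix here is synthetic (three well-separated blobs standing in for venue frequencies), so the best k it finds says nothing about the Toronto data. The range of k and the `n_init` and `random_state` values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the neighborhood-by-category frequency matrix:
# 3 tight, well-separated blobs of 10 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Plotting `scores` against k gives the curve used to pick the number of clusters. Note that the silhouette score is only defined for k between 2 and the number of samples minus 1, which matters with as few as 19 neighborhoods.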

The plot shows that the best k value for this task is 4. Hence, we end up with 4 neighborhood clusters.

The result of KMeans with k = 4. Now, the table has a cluster label for each neighborhood.

C. Results

Finally, let’s visualize the resulting clusters!

As a result, we can examine venues listed inside each cluster and define the discriminating venue categories that distinguish them.

The list of the Top 5 Venues in Cluster 0, Cluster 2, and Cluster 3
  1. Cluster 0: "Gas Station Venues". The first cluster contains only 1 neighborhood, with the gas station as the most common venue.
  2. Cluster 1: "Coffee Shop and Restaurant Venues". The second cluster holds 16 neighborhoods, where coffee shops, restaurants, and cafes are the most common venues.
  3. Cluster 2: "Pharmacy Venues". The third cluster includes 1 neighborhood, with pharmacy as the most frequent venue category.
  4. Cluster 3: "Park and Store Venues". The fourth cluster has 1 neighborhood, with parks, convenience stores, and grocery stores as the majority venues.
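A per-cluster summary like the one above can be produced with a grouped aggregation. The sketch below uses an invented five-row table. The column names `Cluster` and `1st Most Common Venue` are assumptions about how the labeled data frame is laid out.

```python
import pandas as pd

# Toy labeled table; neighborhoods, labels, and venues are invented for illustration.
clustered = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "Cluster": [1, 1, 1, 0, 2],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop", "Restaurant",
                              "Gas Station", "Pharmacy"],
})

# For each cluster: how many neighborhoods it holds, and which
# top-ranked venue category appears most often among them.
summary = clustered.groupby("Cluster").agg(
    neighborhoods=("Neighborhood", "count"),
    top_venue=("1st Most Common Venue", lambda s: s.mode().iloc[0]),
)
print(summary)
```

Clusters with a single neighborhood, as three of the four are here, simply echo that neighborhood's own top venue, so their labels should be read with that in mind.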

D. Discussion

The project’s main goal is to determine the best location for opening a coffee shop business in Toronto. Discussing what locations can be considered "the best" may vary, but we can equate it as the most conducive ones by considering the following criteria:

1. Safety

  • A conducive location should be safe, so we analyzed the crime statistics for all divisions of Toronto Police Service. We conclude that divisions D55, D54, D53, D33, D13, D12, and D11 have the lowest crime rates. These cover Central Toronto, West Toronto, York, and East York.

2. Demographics and Accessibility

  • Vehicle and foot traffic are important when we choose a location for the new coffee shop. We have shown the busiest main roads in the city, many of which are located around downtown. We therefore focus on Central Toronto, York, and East York first. However, this traffic would go to waste if those people are not our target demographic. Hence, we need to understand the target market and discuss it further with the team.
  • Accessibility is another factor to consider. Once we have picked a few candidate locations, knowing how and why customers will get there is crucial. This includes street visibility, parking, and location convenience. Again, further discussion with the team is needed.

3. Neighboring businesses

  • Neighboring businesses can affect the profitability both positively and negatively.
  • Cluster 1 has the most coffee shops and restaurants in its neighborhoods. Although these businesses fall into different categories, they could all compete with the products you serve. Therefore, cluster 1 is not recommended.
  • Clusters 0, 2, and 3 are recommended neighborhoods to inspect further. However, it is also wise to consider other businesses or amenities in the surrounding area that complement your offerings. For example, if we target people who spend their morning or afternoon outside, cluster 3 might be a good choice since it has "park" as the most common venue.

E. Conclusion

Finding the best location to start a business can be challenging and quite frustrating due to many uncertainties. However, we can quickly gain meaningful insights into the city and its neighborhoods with data available today. This helps everyone, including entrepreneurs, business owners, and stakeholders, to make solid decisions based on facts.

Using the coffee shop and Toronto as an example, I hope this project gives you a basic idea of how to deal with a similar case in the future. What other things need to be considered? Let’s discuss below!

Thank you,

Diardano Raihan LinkedIn Profile

Note: _Everything you have seen is documented in my GitHub repository. For those who are curious about the full code, please do have a visit 👍 ._


F. References

