Be deliberate in the problem-solving process
The culmination of Coursera’s IBM Data Science Professional Course is a Capstone Project in which participants identify a business problem that calls for location data and neighborhood clustering. The ability to analyze a business problem, cut through the noise, and identify the actual issue to be addressed is an important skill to have and to keep honing. If the right questions are not asked, the model results lose much of their value or become meaningless. While working on this project, I therefore also applied several of Boston Consulting Group’s problem-solving approaches, which I had recently been exposed to, to sharpen the framing of the problem statement.
This article provides an overview of the capstone, from the initially proposed problem statement to recommendations.
CONTENTS
- Introduction/ Business Problem
- Data Collection & Preparation
- Data Visualization & Troubleshooting
- Exploratory Data Analysis
- Silhouette Method
- Evaluation & Recommendations
- Other Tools
- Afterword
Introduction/ Business Problem
Context: A client has approached the consulting firm for advice on the business strategy and execution roadmap for setting up restaurants in Kyoto. The initial business question is "Should the Client set up a restaurant chain in Kyoto, and where?"

Rather than diving into the problem and applying machine learning straight away, I deliberated on the business problem and outlined the following considerations, which may not be immediately apparent:
a) The Client aims to establish a restaurant presence in Kyoto.
b) The Client is uncertain about both the market saturation and the potential locations in Kyoto to act on.
The problem statement is therefore reframed as "What can be done to determine the Client’s business positioning and potential restaurant locations in Kyoto?" With the reframed problem statement, several top-level business drivers, namely Business Strategies, Operations, and Profitability, are brought into focus as illustrated below.

Regarding Business Strategies, K-Means Clustering will be applied to the relevant restaurants’ geospatial data to cluster restaurants and uncover insights such as restaurant themes and suitable locations.
Data Collection & Preparation
Two data sources were identified for use. These are:
1) List of Kyoto wards and their respective geographic coordinates. The list of wards can be retrieved from the following webpage, and the corresponding coordinates can be retrieved using the geopy library.
2) Restaurants in each neighborhood of Kyoto. The data can be retrieved using the Foursquare API and specifying the particular category of interest.
The list of Kyoto wards was scraped using the pandas read_html() method.

The pandas get_level_values() method is then used to collapse the multi-level headers into a single header row.
The data is then enriched with the geographical coordinates retrieved using the geopy library. A user-agent is specified as required by Nominatim’s terms of service; doing so also helps avoid having one’s IP address blocked from the service.
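As a minimal sketch of the header-collapsing step, assuming the scraped table comes back with two-level (MultiIndex) columns: the header labels below are hypothetical stand-ins, not the actual headers of the scraped page, and only two wards are shown.

```python
import pandas as pd

# Hypothetical two-level header, similar to what read_html() can return
# when a table has a header cell spanning several sub-columns.
columns = pd.MultiIndex.from_tuples([("Name", "Rōmaji"), ("Name", "Kanji")])
wards = pd.DataFrame(
    [["Kita-ku", "北区"], ["Minami-ku", "南区"]],
    columns=columns,
)

# Keep only the innermost header level as flat column names.
wards.columns = wards.columns.get_level_values(1)
print(list(wards.columns))
```

After this step the columns can be addressed by a single name, such as wards["Rōmaji"], which simplifies the later merge with the coordinates.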
Data Visualization & Troubleshooting
Visualization of the Kyoto wards is done using the folium library.

It was found that ‘Kita-Ku’ and ‘Minami-Ku’ of Kyoto are not plotted on the Kyoto map. One hypothesis is that their coordinates refer to wards of the same name in other cities. To test this, I first check the returned addresses, then find the correct coordinates, and lastly replace the incorrect coordinates in the data frame.
The returned addresses confirm the hypothesis that the retrieved coordinates are off. Adding the city name "Kyoto" to the ward names enables the correct coordinates to be acquired. The code snippet for this task is reproduced below.
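A minimal sketch of this fix, assuming geopy's Nominatim geocoder: the helper names, the "Japan" suffix, and the user-agent string are hypothetical choices for illustration. The geocoder is passed in as a callable so the query logic can be checked without network access.

```python
def build_query(ward, city="Kyoto", country="Japan"):
    """Qualify a ward name so it is not matched to a same-named ward elsewhere."""
    return f"{ward}, {city}, {country}"


def geocode_wards(wards, geocode):
    """Return {ward: (latitude, longitude, address)} using any geocode callable."""
    results = {}
    for ward in wards:
        location = geocode(build_query(ward))
        if location is None:  # no match found for this query
            results[ward] = None
            continue
        # Keeping the address makes it easy to verify that the city is right.
        results[ward] = (location.latitude, location.longitude, location.address)
    return results


def make_nominatim_geocode(user_agent="kyoto-capstone-example"):
    """Build a Nominatim geocode function (needs geopy and network access)."""
    from geopy.geocoders import Nominatim  # third-party; imported lazily

    return Nominatim(user_agent=user_agent).geocode
```

A call such as geocode_wards(["Kita-ku", "Minami-ku"], make_nominatim_geocode()) would then return coordinates together with the matched address, which can be inspected to confirm that each result lies in Kyoto.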

The updated map verifies the corrected coordinates with all eleven wards plotted on the map.

Exploratory Data Analysis
The Foursquare API is then used to retrieve data on the restaurants in each of these wards. The request is set to return up to 100 restaurants within a radius of 500 meters of each ward's coordinates, and the result is visualized below. Unsurprisingly, Ramen and Japanese restaurants are the most common types of restaurants.
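The request described above might be assembled as follows. This is only a sketch under assumptions: it targets the legacy Foursquare v2 venues/explore endpoint with its section parameter, which may differ from the endpoint and category filter actually used here and from the current Foursquare API. The credentials, version date, and coordinates are placeholders.

```python
from urllib.parse import urlencode

# Assumption: legacy v2 explore endpoint; check the current Foursquare docs.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20210101", radius=500, limit=100, section="food"):
    """Build a request URL for venues near (lat, lng)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",            # centre of the search
        "radius": radius,                # metres
        "limit": limit,                  # maximum venues returned
        "section": section,              # assumed filter for food venues
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder coordinates, not a real ward centre.
url = build_explore_url(35.0, 135.75, "MY_ID", "MY_SECRET")
print(url)
```

The URL would then be fetched once per ward (for example with the requests library) and the returned venue categories collected into one data frame.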


Compared with the other wards, Higashiyama-Ku, Ukyō-Ku, Nishikyō-Ku & Sakyō-Ku have a higher density of restaurants.
Silhouette Method
Noting the characteristics (i.e. restaurant types by proportion for each ward), K-Means clustering will be applied to cluster these entities and uncover potential insights such as viable restaurant themes and suitable restaurant locations. These insights could enhance the formulation of business strategies.
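The "restaurant types by proportion for each ward" profile can be sketched as below. The ward names and venue lists are made up for illustration, and the helper name is hypothetical; with pandas, the same result is typically obtained by one-hot encoding the categories and taking the mean per ward.

```python
from collections import Counter

# Made-up venue categories per ward, for illustration only.
venues = {
    "Ward A": ["Ramen", "Ramen", "Sushi", "Udon"],
    "Ward B": ["Japanese", "Japanese", "Ramen"],
}


def category_proportions(venues_by_ward):
    """Return {ward: {category: share of that ward's venues}}."""
    categories = sorted({c for cats in venues_by_ward.values() for c in cats})
    table = {}
    for ward, cats in venues_by_ward.items():
        counts = Counter(cats)
        total = len(cats)
        # Every ward gets the same set of categories so rows are comparable.
        table[ward] = {
            c: (counts[c] / total if total else 0.0) for c in categories
        }
    return table


profile = category_proportions(venues)
print(profile["Ward A"])
```

Each ward's row sums to 1, so wards with different numbers of restaurants are clustered on their mix of cuisines rather than on their size.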
For K-Means clustering, the optimal value of k (the number of clusters) first needs to be established. This is a step that could easily be overlooked. Several methods are available, such as the Elbow method and the Silhouette method. These methods are briefly discussed below.
Elbow Method. The Within-Cluster Sum of Squared Errors (WSS) is calculated for different values of k. WSS always falls as k increases, so the k value to select is the one at which the decrease in WSS begins to level off (i.e. the ‘elbow’ of the curve). However, if the dataset is not well clustered (i.e. has overlapping clusters), the elbow may not be distinct.
Silhouette Method. The Silhouette method measures how similar a point is to its own cluster compared to other clusters. The Silhouette coefficient ranges between -1 and +1. A positive coefficient close to +1 indicates that the point is well matched to its assigned cluster and is far from the neighboring clusters. A coefficient of zero indicates that the point is on or very close to the decision boundary between two neighboring clusters. A negative coefficient indicates that the point may have been assigned to the wrong cluster.
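To make the WSS quantity concrete, here is a small pure-Python sketch on four made-up points; the function name and the data are illustrative only.

```python
def wss(points, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


# Two visually obvious groups, one near x=0 and one near x=10.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(wss(points, [0, 0, 0, 0]))  # k=1 -> 104.0
print(wss(points, [0, 0, 1, 1]))  # k=2 -> 4.0
print(wss(points, [0, 1, 2, 3]))  # k=4 -> 0.0
```

The drop from 104 to 4 when going from one cluster to two is large, while further increases in k gain very little; that bend is the elbow.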
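For a single point, the coefficient is commonly defined as s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. A pure-Python sketch on made-up points follows; it assumes at least two clusters and Euclidean distance, and the function name is hypothetical.

```python
from math import dist


def silhouette(points, labels):
    """Per-point silhouette coefficients s = (b - a) / max(a, b)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(label, []).append(dist(p, q))
        if own not in by_cluster:  # singleton cluster: score taken as 0
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = silhouette(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette(points, [0, 1, 0, 1])   # each point paired with a far one
print(good, bad)
```

With the sensible grouping every coefficient is clearly positive, whereas the mismatched grouping yields negative coefficients, matching the interpretation given above.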
The Silhouette method is selected for determining the optimum k value.

The K-Means clustering is then implemented with init='k-means++' and random_state=42 for reproducibility of results. The resulting clusters are then plotted.
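The selection of k and the final fit could look like the sketch below, assuming scikit-learn. The data are made-up ward profiles, not the capstone's data, and the candidate range for k and n_init=10 are assumptions added for the example; only init and random_state come from the text above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up ward profiles (rows) over three restaurant categories (columns).
X = np.array([
    [0.70, 0.20, 0.10], [0.65, 0.25, 0.10], [0.75, 0.15, 0.10],
    [0.10, 0.70, 0.20], [0.15, 0.65, 0.20], [0.10, 0.75, 0.15],
    [0.20, 0.10, 0.70], [0.20, 0.15, 0.65], [0.15, 0.10, 0.75],
])

# Mean silhouette coefficient for each candidate k.
scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    scores[k] = silhouette_score(X, model.fit_predict(X))

# Pick the k with the highest mean silhouette, then fit the final model.
best_k = max(scores, key=scores.get)
final = KMeans(n_clusters=best_k, init="k-means++", n_init=10, random_state=42)
labels = final.fit_predict(X)
print(best_k, labels)
```

The cluster label of each ward can then be joined back to the ward table and used to colour the markers on the map.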

Evaluation & Recommendations
Examining each cluster based on venue categories, the following observations are derived.

Ramen restaurants are the most prevalent in cluster 1, closely followed by restaurants offering other Asian cuisines such as Chinese, Yoshoku, or Sushi.

Japanese restaurants are the most prevalent in cluster 2, closely followed by a mix of Chinese, Ramen, or Donburi restaurants.

Udon restaurants are the most prevalent in cluster 3, followed by Donburi restaurants.
Across the wards of Kyoto, Higashiyama-Ku, Ukyō-Ku, Nishikyō-Ku and Sakyō-Ku have a higher density of restaurants (more than 10 in each area). The higher restaurant density could imply that these areas are more popular with visitors and have more tourist attractions in their vicinity.
For example, the ward of Higashiyama-Ku features many historical sights such as the entertainment district of Gion in front of Yasaka Shrine, Ninenzaka, Sannenzaka, and Kiyomizu Temple (designated as a World Heritage site). Ukyō-Ku is also home to many famous sites such as Tenryū-Ji and Arashiyama, a hill famed for its maple leaves.
The preliminary recommended locations for market entry are Higashiyama-Ku and Ukyō-Ku. Higashiyama-Ku is assigned to cluster 2, so a restaurant offering Japanese cuisine could have a higher chance of success with visitors there. Ukyō-Ku is assigned to cluster 1, so a restaurant offering Ramen could have a higher chance of success with visitors there. Regardless of these recommendations, the other fundamentals of F&B service, such as quality food and service and strict hygiene practices, are not to be overlooked.
Other Tools
For this capstone project, I also tried out readme.so for creating and editing the readme file. The tool provides several predefined templates. Users only need to select the desired templates, make the necessary edits, and upload the generated readme file to GitHub.
Afterword
I would say the capstone project was an enriching experience, as it introduced the use of an API for data collection and the chance to work with geospatial data. While it would have been easy to simply state a business problem and apply a machine-learning technique, the capstone presented a good opportunity to apply BCG’s business problem-solving approaches, which helped sharpen the framing of the problem statement. Without these approaches, the capstone would have been more difficult, lacking a clear direction and purpose for the machine-learning application.
The code for the capstone can be accessed here.