
As part of the final IBM Capstone Project, we get a taste of what data scientists go through in real life. The objectives of the final assignment were to define a business problem, look for data on the web, and use Foursquare location data to compare different districts within the wards (municipalities) of Tokyo (the choice of city is up to the student) to figure out which neighborhood is suitable for starting a restaurant business (the ‘idea’ is also up to the individual student). As prepared for the assignment, I go through the problem design, data preparation and final analysis step by step. Detailed code and images are available on GitHub, and the link can be found at the end of the post.
1. Discussion and Background of the Business Problem:
Problem Statement: Prospects of a Lunch Restaurant, Close to Office Areas in Tokyo, Japan.
Tokyo, where I am currently staying, is the most populous metropolitan area in the world. Currently ranked 3rd in the global economic power index, Tokyo is definitely one of the best places to start up a new business.
During the daytime, especially in the morning and at lunch hours, office areas provide huge opportunities for restaurants. Reasonably priced shops (about $8 for a lunch meal) are usually full during lunch hours (11 am – 2 pm). Given this scenario, we will go through the benefits and pitfalls of opening a breakfast-cum-lunch restaurant in densely packed office areas. The profit margin for a decent restaurant usually lies within the 15−20% range, but it can go as high as 35%, as discussed here. The core of Tokyo is made up of 23 wards (municipalities), but I will later concentrate on the 5 busiest business wards of Tokyo, namely Chiyoda (千代田区), Chuo (中央区), Shinjuku (新宿区), Shibuya (渋谷区) and Shinagawa (品川区), to target daily office workers.
We will go through each step of this project and address them separately. I first outline the initial data preparation and describe future steps to start the battle of neighborhoods in Tokyo.
Target Audience
What type of clients or a group of people would be interested in this project?
- Business people who want to invest in or open a restaurant. This analysis will be a comprehensive guide to starting or expanding restaurants that target the large pool of office workers in Tokyo during lunch hours.
- Freelancers who would love to have their own restaurant as a side business. This analysis will give an idea of how beneficial it is to open a restaurant and what the pros and cons of this business are.
- New graduates, who want to find a reasonably priced lunch or breakfast place close to the office.
- Budding Data Scientists, who want to implement some of the most used Exploratory Data Analysis techniques to obtain necessary data, analyze it, and, finally be able to tell a story out of it.
2. Data Preparation:
2.1. Scraping Tokyo Wards Table from Wikipedia
I first use the Special Wards of Tokyo page on Wikipedia and scrape its table to create a data-frame. For this, I’ve used the requests and BeautifulSoup4 libraries to build a data-frame containing the name, area, population and 1st major district of the 23 wards of Tokyo. We start as below –
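A minimal sketch of this scraping step could look like the following. It assumes the wards table is the first `table` element with the class `wikitable` on the page; the function name `wikitable_to_frame` and the tiny HTML sample are hypothetical stand-ins, not the code from the project.

```python
import pandas as pd
from bs4 import BeautifulSoup


def wikitable_to_frame(html: str) -> pd.DataFrame:
    """Turn the first 'wikitable' found in an HTML page into a data-frame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        raise ValueError("no wikitable found in the page")

    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th> and data cells are <td>; keep both, in order.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

    # The first row is treated as the header, the rest as data.
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


# Tiny stand-in for the real page; the values are placeholders, not real data.
sample_html = """
<table class="wikitable">
  <tr><th>Ward</th><th>Area</th><th>Major District</th></tr>
  <tr><td>Ward A</td><td>1.0</td><td>District A</td></tr>
  <tr><td>Ward B</td><td>2.0</td><td>District B</td></tr>
</table>
"""
wards = wikitable_to_frame(sample_html)
print(wards)
```

For the real page, the HTML string would come from something like `requests.get(url).text`, and the column names would be whatever the Wikipedia table uses.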

After a little manipulation, the data-frame is obtained as below –

2.2. Getting Coordinates of Major Districts : Geopy Client
The next objective is to get the coordinates of these 23 major districts using the geocoder class of the Geopy client, with the code snippet below –
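As a rough sketch of the geocoding loop, assuming the Nominatim geocoder from Geopy is the one being used: the helper names `geocode_districts` and `make_nominatim_geocoder`, the `user_agent` string and the `", Tokyo, Japan"` suffix are my own illustrative choices.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Optional[Tuple[float, float]]


def geocode_districts(
    districts: Iterable[str],
    geocode: Callable[[str], object],
    suffix: str = ", Tokyo, Japan",
) -> Dict[str, Coords]:
    """Look up (latitude, longitude) for each district name.

    `geocode` is any callable returning an object with `.latitude` and
    `.longitude`, or None when nothing is found (geopy geocoders do this).
    """
    coords: Dict[str, Coords] = {}
    for name in districts:
        location = geocode(name + suffix)
        # Keep None for failed look-ups so they can be fixed by hand later.
        coords[name] = (
            (location.latitude, location.longitude) if location else None
        )
    return coords


def make_nominatim_geocoder(user_agent: str = "tokyo_explorer") -> Callable:
    """Build a geopy Nominatim geocoder (needs geopy and network access)."""
    from geopy.geocoders import Nominatim  # optional dependency, loaded lazily

    return Nominatim(user_agent=user_agent).geocode
```

A call such as `geocode_districts(["Nagatacho", "Shinjuku"], make_nominatim_geocoder())` would then return a dictionary of coordinates. Passing the geocoder in as an argument keeps the loop easy to test without network access.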

As you can see, 4 coordinates are completely wrong (Bunkyo, Koto, Ota, Edogawa). This is because the names of these districts are written slightly differently from the way they appear in this data-frame (e.g. Hongō vs. Hongo), so I had to replace these coordinates with values acquired from a Google search. After a little more playing around with pandas, I got one well-arranged data-frame as below –

2.3. Average Land Price in Major Wards of Tokyo: Web Scraping
Another factor that can later guide the decision on which district would be best for opening a restaurant is the average land price of the 23 wards. I get this information by scraping the ‘land market value area in Tokyo’ web-page, similarly to the Wiki page before. As I want to consider the 5 busiest business municipalities of Tokyo mentioned in section 1, the data-frame looks as below –

2.4. Using Foursquare Location Data:
Foursquare data is very comprehensive and powers location data for Apple, Uber etc. For this business problem I have used, as a part of the assignment, the Foursquare API to retrieve information about the popular spots around these 5 major districts of Tokyo. The popular spots returned depend on the highest foot traffic, and thus on the time when the call is made, so we may get different popular venues at different times of the day. The call returns a JSON file, which we need to turn into a data-frame. Here I’ve chosen 100 popular spots for each major district within a radius of 1 km. Below is the data-frame obtained from the JSON file that was returned by Foursquare –
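Turning the JSON into a data-frame can be sketched roughly as follows. I am assuming a response shaped like the legacy v2 `venues/explore` endpoint (`response` → `groups` → `items` → `venue`). The function name `venues_to_frame`, the sample payload and all column names except 'District' and 'Venue_Category' are hypothetical.

```python
import pandas as pd


def venues_to_frame(payload: dict, district: str) -> pd.DataFrame:
    """Flatten an explore-style JSON payload into one row per venue."""
    items = payload["response"]["groups"][0]["items"]
    records = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        records.append(
            {
                "District": district,
                "Venue": venue["name"],
                "Venue_Lat": venue["location"]["lat"],
                "Venue_Long": venue["location"]["lng"],
                # Some venues may carry no category at all.
                "Venue_Category": categories[0]["name"] if categories else None,
            }
        )
    return pd.DataFrame(records)


# Hand-written payload in the assumed shape; not real Foursquare output.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Venue 1",
                   "location": {"lat": 35.0, "lng": 139.0},
                   "categories": [{"name": "Ramen Restaurant"}]}},
        {"venue": {"name": "Venue 2",
                   "location": {"lat": 35.1, "lng": 139.1},
                   "categories": []}},
    ]}]}
}
venues = venues_to_frame(sample_payload, "District A")
print(venues)
```

Calling this once per district and combining the results with `pd.concat` would give a single data-frame of venues for all 5 districts.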

3. Visualization and Data Exploration:
3.1. Folium Library and Leaflet Map:
Folium is a Python library that can create interactive Leaflet maps from coordinate data. Since I am interested in restaurants as popular spots, I first create a data-frame of the rows where the 'Venue_Category' column in the previous data-frame contains the word ‘Restaurant’. I used the following snippet of code –
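One way to write this filter with pandas is sketched below. The rows are made up for illustration; only the 'Venue_Category' and 'District' column names come from this post.

```python
import pandas as pd

# Made-up rows standing in for the venues data-frame.
venues = pd.DataFrame(
    {
        "District": ["A", "A", "B", "B"],
        "Venue_Category": ["Ramen Restaurant", "Café", "Sushi Restaurant", None],
    }
)

# na=False treats rows with a missing category as non-matches.
is_restaurant = venues["Venue_Category"].str.contains("Restaurant", na=False)
restaurants = venues[is_restaurant].reset_index(drop=True)
print(restaurants)
```

Note that the match is case-sensitive, so a category spelled 'restaurant' in lower case would be missed unless `case=False` is passed.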

The next step is to use this data-frame to create a Leaflet map with Folium to see the distribution of the most visited restaurants in the 5 major districts.
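A map of this kind can be sketched as below, with one circle marker per restaurant. The column names `Venue`, `Venue_Lat` and `Venue_Long`, the helper names and the sample coordinates are assumptions made for illustration.

```python
import pandas as pd


def map_centre(points: pd.DataFrame) -> list:
    """Average latitude/longitude, used as the initial map centre."""
    return [float(points["Venue_Lat"].mean()), float(points["Venue_Long"].mean())]


def restaurant_map(restaurants: pd.DataFrame, zoom_start: int = 12):
    """Draw one circle marker per restaurant on a Leaflet map."""
    import folium  # optional dependency, only needed when a map is drawn

    fmap = folium.Map(location=map_centre(restaurants), zoom_start=zoom_start)
    for row in restaurants.itertuples(index=False):
        folium.CircleMarker(
            location=[row.Venue_Lat, row.Venue_Long],
            radius=4,
            popup=f"{row.Venue} ({row.District})",
            color="blue",
            fill=True,
            fill_opacity=0.6,
        ).add_to(fmap)
    return fmap


# Made-up coordinates, for illustration only.
sample = pd.DataFrame(
    {
        "District": ["District A", "District B"],
        "Venue": ["Venue 1", "Venue 2"],
        "Venue_Lat": [35.0, 36.0],
        "Venue_Long": [139.0, 140.0],
    }
)
print(map_centre(sample))
```

With Folium installed, `restaurant_map(sample).save("restaurants.html")` writes an interactive map that can be opened in a browser.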

With the code snippet above, the Leaflet map looks as below –

3.2. Exploratory Data Analysis:
There are 134 unique venue categories and Ramen Restaurants top the charts as we can see in the plot below –
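The numbers behind such a plot can be obtained with two pandas calls, sketched here on made-up rows (the counts below are not the real results):

```python
import pandas as pd

# Made-up categories standing in for the 'Venue_Category' column.
venues = pd.DataFrame(
    {
        "Venue_Category": [
            "Ramen Restaurant", "Café", "Ramen Restaurant",
            "Bar", "Ramen Restaurant", "Café",
        ]
    }
)

# Number of distinct categories, and the most frequent ones first.
n_categories = venues["Venue_Category"].nunique()
top_categories = venues["Venue_Category"].value_counts().head(10)

print(n_categories)
print(top_categories)
```

Calling `top_categories.plot(kind="bar")` would then draw the bar chart, assuming matplotlib is installed.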

Now that this reminds me of ramen, it is definitely time to take a break.

After some delicious ramen, let’s get back to exploring the data a little more. To find the top 5 venues of each district, we proceed as follows:
- Create a data-frame with pandas one hot encoding for the venue categories.
- Use pandas groupby on the District column and obtain the mean of the one-hot encoded venue categories.
- Transpose the data-frame from step 2 and sort each district's frequencies in descending order.
Let’s see the code snippet below –
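The three steps above can be sketched as follows. The function name `top_venues`, the `Top_1`, `Top_2`, … column labels and the sample rows are hypothetical. Instead of an explicit transpose, the sketch walks over the rows of the grouped data-frame, which yields the same per-district series.

```python
import pandas as pd


def top_venues(venues: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n most frequent venue categories for each district."""
    # Step 1: one-hot encode the venue categories.
    onehot = pd.get_dummies(venues["Venue_Category"], dtype=float)

    # Step 2: the mean of each indicator column per district is the
    # relative frequency of that category in the district.
    freq = onehot.groupby(venues["District"]).mean()

    # Step 3: per district, sort frequencies in descending order, keep top n.
    rows = []
    for district, series in freq.iterrows():
        top = series.sort_values(ascending=False).head(n)
        rows.append([district] + list(top.index))

    columns = ["District"] + [f"Top_{i + 1}" for i in range(len(rows[0]) - 1)]
    return pd.DataFrame(rows, columns=columns)


# Made-up rows, for illustration only.
sample = pd.DataFrame(
    {
        "District": ["A", "A", "A", "A", "B", "B", "B"],
        "Venue_Category": [
            "Ramen Restaurant", "Ramen Restaurant", "Ramen Restaurant", "Café",
            "Bar", "Bar", "Café",
        ],
    }
)
top2 = top_venues(sample, n=2)
print(top2)
```

Ties between categories with equal frequency are broken arbitrarily here, so the order of equally common venues should not be read as meaningful.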

The above code outputs top 5 venues of each district –

Using one of the several data-frames I created for exploratory data analysis, I’ve plotted how many restaurants appear among the most frequently visited places in each district. Nagatacho of Chiyoda ward comes out on top with 56 restaurants.

We can also look at violin plots, which show how a numeric variable is distributed across categories. I used the seaborn library to show the distribution of 4 major types of restaurants in the different districts –

Now that we have a fairly broad overview of the different types of venues, and especially restaurants, around the 5 major districts of Tokyo, it is time to cluster the districts using K-Means.
4. Clustering the Districts
Finally, we try to cluster these 5 districts based on their venue categories using K-Means clustering. The expectation is that districts with similar venue categories will fall into the same cluster. I have used the code snippet below –
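A minimal sketch of this step with scikit-learn, using 3 clusters, is shown below. The frequency table is made up (placeholder district names and numbers, not the real results), and `n_init=10` and `random_state=0` are my own choices to keep the run reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up category frequencies per district; not real results.
freq = pd.DataFrame(
    {
        "Restaurant": [0.60, 0.55, 0.10, 0.15, 0.05],
        "Bar": [0.10, 0.15, 0.60, 0.55, 0.05],
        "Convenience Store": [0.30, 0.30, 0.30, 0.30, 0.90],
    },
    index=["District A", "District B", "District C", "District D", "District E"],
)

# Fit K-Means on the frequency vectors and read off one label per district.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(freq.values)

clusters = pd.Series(labels, index=freq.index, name="Cluster")
print(clusters)
```

The cluster numbers themselves are arbitrary; what matters is which districts share a label. With only 5 points, the choice of 3 clusters strongly shapes the outcome.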

We can represent these 3 clusters in a leaflet map using Folium library as below –

5. Results and Discussion:
We have reached the end of the analysis, where we got a sneak peek of the 5 major wards of Tokyo. As the business problem started with the benefits and drawbacks of opening a lunch restaurant in one of the busiest districts, the data exploration concentrated mostly on restaurants. I have used data from web resources like Wikipedia, Python libraries like Geopy, and the Foursquare API to set up a very realistic data-analysis scenario. We have found out that –
- Ramen restaurants top the charts of most common venues in the 5 districts.
- Nagatacho district in Chiyoda ward and Nihombashi in Chuo ward are dominated by restaurants as the most common venue, whereas the Shibuya and Shinjuku areas are dominated by bars, pubs, and cafes as the most common venues.
- Nagatacho has the maximum number of restaurants as the most common venue, whereas the Shibuya area has the least.
- Since the clustering was based only on the most common venues of each district, Shinjuku and Shibuya fall under the same cluster, and Nagatacho and Nihombashi fall under another cluster. Shinagawa is separated from both of these clusters, as convenience stores stand out as the most common venue (with a very high frequency).
According to this analysis, the Shinagawa area will provide the least competition for an upcoming lunch restaurant, as convenience stores are the most common venue in this area and the frequency of restaurants as a common venue is very low compared to the remaining districts. The web-scraped data also shows that the average land price in and around Shinagawa is much cheaper than in the districts close to central Tokyo. So this region could potentially be a target for starting quality restaurants.
This analysis has some drawbacks. The clustering is based entirely on the most common venues obtained from Foursquare data. Land price, distance of the venues from the closest stations, the number of potential customers, and the benefits and drawbacks of Shinagawa being a port region could all play a major role, so this analysis is far from conclusive. However, it certainly gives us some very important preliminary information on the possibilities of opening restaurants around the major districts of Tokyo. Another pitfall is that only one major district of each ward was considered; taking into account all the areas under the 5 major wards would give us an even more realistic picture. Furthermore, the results could also vary if we used another clustering technique such as DBSCAN. I wrote a separate post on the detailed theory of DBSCAN and how we can cluster spatial databases using it.
6. Conclusion
Finally, to conclude this project, we have got a small glimpse of what real-life data-science projects look like. I’ve made use of some frequently used Python libraries to scrape web data, used the Foursquare API to explore the major districts of Tokyo, and looked at the segmentation of the districts on a Folium Leaflet map. The potential for this kind of analysis in a real-life business problem is discussed in great detail. Some of the drawbacks, and opportunities for improvement that would give an even more realistic picture, are also mentioned. Finally, since my analysis concentrated mostly on the possibilities of opening a restaurant targeting the huge pool of office workers, some of the results obtained are, surprisingly, exactly what I expected after staying 5 years in Tokyo. In particular, cafes, bars and pubs as the most frequent venues around the Shinjuku and Shibuya areas, and Japanese restaurants around the Nihombashi and Nagatacho areas, are what I see! Hopefully, this kind of analysis will give you initial guidance for taking on more real-life challenges using data science.
Stay strong and cheers!!
Find the code on GitHub.
Find me on LinkedIn.