Background image by SeanPavonePhoto | Credit: Getty Images/iStockphoto/Canva

Battle of the Neighborhoods in Austin, TX — Where to Open a Chinese Restaurant?

Published in Towards Data Science, Nov 20, 2020

This project aims to utilize data science concepts and machine learning tools learned in the IBM Data Science Professional Certificate Course to solve a popular problem for entrepreneurs or business owners: where is the best neighborhood to open a restaurant? In this project, I will go through problem definition and data preparation, and then use machine learning to drive the decision-making process.

Introduction

Austin is the capital of Texas in the United States and is one of the fastest growing cities in America. It was voted the №1 place to live in America (U.S. News & World Report, 2019) for the third year in a row and was ranked №4 of the best large cities to start a business (WalletHub, 2019). According to the Austin City Government, the City of Austin has crossed the threshold of becoming a Majority-Minority city, meaning that no demographic group exists as a majority of the City’s population. One notable trend is the growing number of Latino and Asian households. Read more here.

Being a fast-growing city with diverse ethnicities, Austin is a great place for entrepreneurs to start and grow their businesses. The city is also well known for its outstanding food and great live music venues.

Having lived in Austin for almost four years, I have always wondered why there aren’t many authentic Chinese restaurants in the Austin area and would love to see more. The objective of this project is to segment and cluster the neighborhoods of Austin using different data sources including Foursquare location data to find the ‘best’ neighborhood to open a Chinese restaurant based on the venues in the area. I will:

1) Collect neighborhood data from Austin City Government

2) Use Google Geocoding API to find the approximate coordinates of the neighborhoods

3) Use Foursquare API to find the first 100 venues in the neighborhoods within a radius of 1500 meters

4) Cluster the neighborhoods using Scikit-learn’s K-Means Clustering algorithm

5) Compare cluster analysis with demographic data

and finally I will discuss data-driven decision making for a new Chinese restaurant business in the city of Austin.

I am excited to use my newly learned skills to explore Austin with data. Let’s begin!

Target Audience

The target audience of this project would be anyone that is interested in opening or growing a Chinese restaurant in Austin, TX. The cluster analysis of Austin neighborhoods and demographic data will help entrepreneurs or restaurant owners make an informed decision about which neighborhoods to aim for.

Data Acquisition and Cleaning

1. Data Sources

To begin with, I gathered data on the reporting neighborhoods in the city of Austin. The neighborhood data I found is from the Housing and Planning Department of the Austin City Government. This dataset includes the names and geometric information of the different neighborhoods, their sizes, and their shapes.

According to the shape of the dataset, there are 102 neighborhoods in Austin. Since this dataset doesn’t include the coordinates of the neighborhoods, I will be using Google Geocoding API to get the latitudes and longitudes.
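The geocoding step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the helper names, the ", Austin, TX" suffix added to each query, and the `api_key` value are assumptions, while the endpoint and the `results[0].geometry.location` response layout follow the public Geocoding API JSON format.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(neighborhood, api_key):
    """Build a Geocoding API request URL for one neighborhood name."""
    # Appending the city to the query is an assumption made for this sketch.
    query = {"address": f"{neighborhood}, Austin, TX", "key": api_key}
    return f"{GEOCODE_URL}?{urllib.parse.urlencode(query)}"


def parse_geocode_response(payload):
    """Return (lat, lng) from a decoded JSON response, or None if no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(neighborhood, api_key):
    """Call the API over HTTP and return (lat, lng), or None if not found."""
    url = build_geocode_url(neighborhood, api_key)
    with urllib.request.urlopen(url) as response:
        return parse_geocode_response(json.load(response))
```

Because the lookup is by name only, the returned point is an approximate location for the neighborhood rather than a computed centroid of its boundary.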

After cleaning and adding the coordinates of the neighborhoods, here is what the dataframe looks like:

2. Explore Neighborhoods

After finding the latitudes and longitudes of the neighborhoods, we can then use Folium to map out all these neighborhoods (click here for web map):

Now it’s time to find the venues around the center of these neighborhoods. I will request this data from Foursquare API and collect information for the first 100 venues in the neighborhoods within a radius of 1500 meters.
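A rough sketch of this request and of flattening its response is shown here. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `response.groups[0].items[].venue` layout as used in the IBM course materials; the helper names, the output column names, and the `version` date are hypothetical choices for illustration, and newer Foursquare APIs differ.

```python
import urllib.parse

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20201120", radius=1500, limit=100):
    """Build a request for up to `limit` venues within `radius` meters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date (assumed value)
        "ll": f"{lat},{lng}",  # center of the search
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urllib.parse.urlencode(params)}"


def extract_venues(neighborhood, payload):
    """Flatten one decoded explore response into a list of row dicts."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            # Some venues may carry no category; keep them with None.
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows
```

Collecting the rows from every neighborhood into one table gives one row per (neighborhood, venue) pair, which is the shape the later steps work with.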

Here are the results:

3. Explore Chinese Restaurants in Austin

Before we get into clustering, I created a new data frame with all the Chinese restaurant data returned by the Foursquare API. Some of these venues were counted more than once, most likely because the 1500-meter search areas of nearby neighborhoods overlap, so I dropped the duplicate rows before mapping the restaurants. Chinese restaurant locations are marked red in the following map. (click here for web map)
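As a small sketch of this filtering and de-duplication step with pandas (the venue names, coordinates, and column names are made up for illustration and are not taken from the real data):

```python
import pandas as pd

# Toy venue table: the same restaurant was returned for two neighborhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "B", "B"],
    "Venue": ["Golden Wok", "Golden Wok", "Noodle House"],
    "Venue Latitude": [30.40, 30.40, 30.20],
    "Venue Longitude": [-97.70, -97.70, -97.80],
    "Venue Category": ["Chinese Restaurant"] * 3,
})

# Keep only Chinese restaurants, then keep one row per physical venue.
chinese = venues[venues["Venue Category"] == "Chinese Restaurant"]
unique_chinese = chinese.drop_duplicates(
    subset=["Venue", "Venue Latitude", "Venue Longitude"]
).reset_index(drop=True)
```

De-duplicating on name plus coordinates, rather than on name alone, keeps separate branches of the same restaurant as distinct markers on the map.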

From this map, we can tell that a lot of the Chinese restaurants in Austin are located in the northern and southern parts of Austin with some in the central area, but not so many in the West Lake Hills area or the eastern neighborhoods. According to the Foursquare data, there are approximately 34 Chinese restaurants in Austin.

4. Data Preparation — One Hot Encoding

Previously, we collected data on venues in Austin with their names and coordinates. However, to run machine learning algorithms on the data, we need numerical data about the existence of these venues. One hot encoding helps us do that by creating new (binary) columns to indicate the presence of each possible value from the data. This means that each venue in each neighborhood is labeled 1 in the column for its own category and 0 in every other category column. After that, we can group the data frame by neighborhood and take the mean of each column, which gives the frequency of occurrence of each venue category in that neighborhood.

Since we are analyzing Chinese restaurants, I kept only the ‘Chinese Restaurant’ column. This tells us the frequency of occurrence of Chinese restaurants in each neighborhood, that is, the share of the venues returned for that neighborhood that are Chinese restaurants.
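The encoding and grouping described above can be illustrated with a toy example. This is only a sketch with invented rows and assumed column names; `pd.get_dummies` and `groupby(...).mean()` are the standard pandas calls for this pattern.

```python
import pandas as pd

# Toy data: neighborhood A has 4 venues (1 Chinese restaurant), B has 2.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Chinese Restaurant", "Coffee Shop", "Park", "Bar",
                       "Coffee Shop", "Park"],
})

# One binary column per venue category, one row per venue.
one_hot = pd.get_dummies(venues["Venue Category"], dtype=int)
one_hot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column per neighborhood is that category's frequency.
frequencies = one_hot.groupby("Neighborhood").mean().reset_index()

# Keep only the category of interest.
chinese_freq = frequencies[["Neighborhood", "Chinese Restaurant"]]
```

In this toy table, neighborhood A ends up with a frequency of 0.25 (one Chinese restaurant out of four venues) and neighborhood B with 0.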

Machine Learning — Cluster Neighborhoods

1. What is clustering?

After cleaning and preparing the data, we are finally ready to get into the fun part! For this project, I am using k-means clustering.

To begin with, what is clustering? A cluster is a collection of data points aggregated together based on their similarities. Using machine learning algorithms, we can group the neighborhoods based on their similarities with each other. The k-means algorithm, in particular, first picks k initial centroids and then assigns every data point to the cluster whose centroid is closest to it. It then recomputes each centroid as the mean of its assigned points and repeats these two steps until the centroids stabilize and the clusters are formed. I am using this method because it is an unsupervised learning method, meaning that the algorithm finds the similarities between the data points for us even though we don’t know them to begin with.
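To make the assign-and-update loop concrete, here is a minimal pure-Python sketch of k-means on one-dimensional data. It is an illustration of the idea, not scikit-learn's implementation: the function name is hypothetical, the starting centroids are passed in by hand (scikit-learn chooses them itself), and the sample frequencies are made up.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster simply keeps its previous centroid.
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c]
            )
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


# Made-up Chinese restaurant frequencies for six neighborhoods.
freqs = [0.0, 0.0, 0.01, 0.02, 0.06, 0.07]
labels, centroids = kmeans_1d(freqs, centroids=[0.0, 0.07])
print(labels)  # [0, 0, 0, 0, 1, 1]
```

With these starting centroids the loop settles after two passes: the four low-frequency neighborhoods form one cluster and the two high-frequency ones form the other.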

The following plot shows the average frequency of Chinese restaurants of all neighborhoods before they are clustered.

2. Find best K

One limitation of k-means clustering is that the algorithm does not decide on its own how many clusters to form, so we need to choose a suitable k ourselves. The Elbow Method is one of the most popular ways to determine this value. We try each value of k from 1 to 9 and calculate the distortion and inertia for each one.

Distortion is the average of the squared distances from each sample to the center of its assigned cluster, and inertia is the sum of the squared distances of samples to their closest cluster center. (reference)

Below I ran the clustering algorithms and visualized the results of using different k values to determine the ‘Elbow’ point.

To determine the optimal k, we select the value at the “elbow” of the plots, the point after which the distortion/inertia decreases only slowly and roughly linearly. Given these plots, we conclude that 4–5 clusters would work best for our data. Ultimately, I decided to go with 5.
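A sketch of the elbow computation with scikit-learn might look like the following. The frequency values are invented stand-ins for the real per-neighborhood data, `n_init=10` and `random_state=0` are arbitrary choices for reproducibility, and distortion is computed here as inertia divided by the number of samples, following the definition given in this section.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up frequencies standing in for the real per-neighborhood values.
freq = np.array([0.0, 0.0, 0.0, 0.005, 0.01, 0.012, 0.02,
                 0.022, 0.025, 0.038, 0.04, 0.042, 0.06, 0.062])
X = freq.reshape(-1, 1)  # scikit-learn expects a 2-D array

inertias, distortions = {}, {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia: sum of squared distances to the closest cluster center.
    inertias[k] = model.inertia_
    # Distortion: the same quantity averaged over the samples.
    distortions[k] = model.inertia_ / len(X)

for k in inertias:
    print(k, inertias[k], distortions[k])
```

Plotting either dictionary against k gives the curve whose bend is read as the elbow. Since distortion as defined here is just inertia rescaled, the two curves bend at the same k.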

3. Run K-Means to Cluster Neighborhoods

Using the scikit-learn package for K-means, I ran k-means clustering on the neighborhood data. Here are the clusters that the algorithm found for us labeled in different colors.
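As a minimal sketch of this step, assuming a table with one frequency column per neighborhood (the neighborhood names, values, column names, and `random_state` are illustrative, not the real data or settings):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up frequencies for ten hypothetical neighborhoods.
neighborhoods = pd.DataFrame({
    "Neighborhood": [f"Neighborhood {i}" for i in range(10)],
    "Chinese Restaurant": [0.0, 0.0, 0.0, 0.01, 0.012,
                           0.024, 0.023, 0.04, 0.038, 0.061],
})

# Fit k-means with 5 clusters on the single frequency feature.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
neighborhoods["Cluster Label"] = kmeans.fit_predict(
    neighborhoods[["Chinese Restaurant"]]
)
```

The label numbers themselves carry no order: which group is called Cluster 0 or Cluster 2 depends on the initialization, so the clusters have to be interpreted by looking at their values.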

4. Examine Clusters

We created 5 clusters (cluster 0–4) using k-means. Now let’s look at each cluster more closely.

Cluster 0 has a long list of 70 neighborhoods, so I won’t show all of it here. But as we can tell from the dataframe below, Cluster 0 has an average Chinese restaurant frequency of 0. This means that Foursquare returned no Chinese restaurants for these neighborhoods.

Cluster 1 (see below) seems to have a higher frequency of Chinese restaurants than Cluster 0.

So far, Cluster 2 has the highest frequency of Chinese restaurants.

Cluster 3

Cluster 4

In order to look at these clusters better, I calculated the mean frequency of Chinese restaurants of each cluster and here are the results:

Cluster 0 ~ 0.0000

Cluster 1 ~ 0.0237

Cluster 2 ~ 0.0613

Cluster 3 ~ 0.0108

Cluster 4 ~ 0.0393

After looking at each cluster, we can conclude that Cluster 2 has the highest frequency of Chinese restaurants while Cluster 0 has the lowest.
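The per-cluster averages can be computed with a single pandas `groupby`. The sketch below uses a tiny invented table with three clusters and assumed column names, purely to show the shape of the calculation:

```python
import pandas as pd

# Toy clustered table; labels and frequencies are made up.
clustered = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E", "F"],
    "Chinese Restaurant": [0.0, 0.0, 0.02, 0.03, 0.06, 0.07],
    "Cluster Label": [0, 0, 1, 1, 2, 2],
})

# Mean frequency and number of neighborhoods in each cluster.
summary = (
    clustered.groupby("Cluster Label")["Chinese Restaurant"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "Mean Frequency", "count": "Neighborhoods"})
)

highest = summary["Mean Frequency"].idxmax()
lowest = summary["Mean Frequency"].idxmin()
```

Keeping the neighborhood count next to the mean is useful because a cluster with many neighborhoods can hold many restaurants in total while still having a modest average frequency.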

Data Visualization

Let’s create a map to examine these clusters further. Using Folium, I mapped out the neighborhoods with each cluster labeled a different color.

Cluster 0 = Orange, Cluster 1 = Red, Cluster 2 = Purple, Cluster 3 = Blue, Cluster 4 = Mint green. (click here for web map)

As we can see from the map, most Chinese restaurants are located in Clusters 1–4, which include neighborhoods in the northern and southern parts of Austin. Some of the neighborhoods in Cluster 3 (blue) are in the central area but Cluster 3 has the second lowest frequency of Chinese restaurants. Overall, Chinese restaurants are concentrated in north and south Austin. This is interesting information so far!

In the following graph we can see the approximate number of Chinese restaurants in each cluster. In Cluster 1 there are 15 of them. This is interesting, because Cluster 1 doesn’t have the highest average frequency of Chinese restaurants. This might be because Cluster 1 contains more neighborhoods than the other clusters, and because other venue types are common in those neighborhoods, which lowers the frequency of Chinese restaurants.

In this next graph we can see that there are 70 neighborhoods that don’t have a Chinese restaurant within the radius of 1500 meters. This is surprising given the growing Asian population in Austin! Maybe the Chinese population is still small compared to other Asian ethnicities. We should also note that the Foursquare data only returned the first 100 venues within the 1500-meter radius. There might be Chinese restaurants beyond the 100-venue limit and the radius.

A Look at Demographics in Austin Neighborhoods

1. Demographic Data — Population Percentage by Ethnicity

Of course there are other things to consider when trying to open a Chinese restaurant… It is a big decision to make! Another factor I would imagine a business owner taking into account is the demographic data based on ethnicity. I found a related dataset on austintexas.gov that shows population, race and ethnicity, housing, and density information grouped by neighborhood reporting areas in Austin (based on the 2010 Census data). These neighborhoods are the same as the ones we did the cluster analysis on. Thus, we can merge the dataframes together to see which clusters the neighborhoods with different Asian population percentages fall into! Here’s a preview of the demographic data:

After I sorted and merged the cluster data with the population data, we can now see which clusters the neighborhoods with a high Asian population percentage are in. As we previously discovered, Cluster 0 has the lowest frequency of Chinese restaurants, while Cluster 2 has the highest, followed by Cluster 4, Cluster 1, and Cluster 3.
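The merge and sort can be sketched with pandas like this. The tables are toy data and the column names are assumptions, since the real dataset's column names are not shown here:

```python
import pandas as pd

# Toy cluster assignments and toy demographic counts.
clusters = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Cluster Label": [0, 3, 2],
})
demographics = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Total Population": [1000, 4000, 2000],
    "Asian Population": [250, 400, 100],
})

# Join on the shared neighborhood name.
merged = clusters.merge(demographics, on="Neighborhood", how="inner")

# Asian share of each neighborhood's population, in percent.
merged["Asian Pct"] = 100 * merged["Asian Population"] / merged["Total Population"]

# Highest Asian population percentage first.
ranked = merged.sort_values("Asian Pct", ascending=False).reset_index(drop=True)
```

An inner join silently drops any neighborhood whose name is spelled differently in the two tables, so it is worth comparing the row counts before and after the merge.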

From the results we can see that the top 9 neighborhoods with the highest Asian population percentage actually have some of the lowest frequencies of Chinese restaurants. This is surprising! Although I couldn’t find demographic data on Chinese ethnicity specifically, the data suggests that there are not many Chinese restaurants in the UT/West University areas, where there are a lot of Chinese international students.

2. Data Visualization

This graph shows the sorted Asian population percentage by Austin neighborhoods.

I also created a heatmap to show which neighborhoods have a higher density of Asian population. (click here for web map)

Discussion

1. So where should the Chinese restaurant be?

During our cluster analysis, we found that Cluster 0 has the lowest average frequency of Chinese restaurants while Cluster 2 has the highest. However, Cluster 1 has the highest number of Chinese restaurants despite a lower average frequency. This might be because Cluster 1 contains more neighborhoods than the other clusters, and because other venue types are common in those neighborhoods, which lowers the frequency of Chinese restaurants. Overall, Chinese restaurants are less common than other venues in the Cluster 1 neighborhoods.

Based on the demographic data I found, the top 9 neighborhoods with the highest Asian population percentage don’t have a high frequency of Chinese restaurants. These neighborhoods include UT, Lakeline, Gateway, West University, Anderson Mill and so on. For some of them this makes sense, since they have a very small total population. However, Chinese restaurants are still not common in highly populated college neighborhoods like UT and West University, which have many Chinese students. I might be biased but this needs to be changed (because Chinese food is awesome!). Other highly populated neighborhoods like Anderson Mill (with a total population of 28473!) don’t have a high average frequency of Chinese restaurants either. Does this mean that someone should open a Chinese restaurant there?

2. Limitations

After discussing (possibly) the best neighborhood to open a Chinese restaurant, we should also note some limitations to this analysis. To start with, the coordinates of neighborhoods are not 100% accurate. They were taken from Google Geocoding API based on the neighborhood names. Thus, we could only approximate the location of each neighborhood.

Furthermore, I set a limit to the Foursquare API to return only the first 100 venues within the radius of 1500 meters. But the neighborhoods are very different in shapes and sizes. Some neighborhoods are much larger but less populated while others are more densely populated with a smaller area. Thus, the Foursquare API might not have been able to capture all the Chinese restaurants in each neighborhood. However, we calculated the occurrence of Chinese restaurants within the 1500 meter radius for each neighborhood, which could still reflect the average frequency of Chinese restaurants within that neighborhood.

Lastly, because we don’t have specific population data on the Chinese ethnicity, it is hard to tell how much of the Asian population is Chinese. Thus, before opening a restaurant, it might be better to do some research on that specific neighborhood, for example, on its commercial pricing, consumers, competitors, etc. and take other factors into consideration.

Conclusion

At the start of this project, we defined a business problem: where to open a Chinese restaurant in Austin, TX? After we collected the neighborhood zoning information, we used the Google Geocoding API to find the approximate coordinates for those neighborhoods. We then used the Foursquare API to discover the first 100 venues within a radius of 1500 meters in each neighborhood and took the average frequency of Chinese restaurants in comparison to other venues. Using k-means clustering, we grouped the neighborhoods into 5 clusters, with Cluster 0 having the lowest average frequency of Chinese restaurants and Cluster 2 the highest. Finally, we compared the clusters to the neighborhood demographic data provided by the government and found that the neighborhoods with the highest percentages of Asian population don’t have a very high frequency of Chinese restaurants. Based on our cluster analysis and the demographic data, we concluded that Anderson Mill is the best neighborhood to open a Chinese restaurant.

Overall, this project was great practice in utilizing data science concepts and machine learning algorithms. In addition to using Scikit-learn’s k-means clustering algorithm, we also used Folium and Seaborn to create meaningful maps and graphs to present our findings. I believe that this project would give our target audience a starting point to consider the possibility and value of a new Chinese restaurant in a given neighborhood in Austin, TX. Our analysis of Austin demographics also gives an idea of where the Asian population is concentrated in Austin, which could be one of the factors to consider when opening an Asian or Chinese restaurant. Note that the population data is from the 2010 Census. It is likely that the Asian population has gone up since then, which makes opening a Chinese restaurant even more promising. Plus, everyone loves Chinese food, right?

Thanks for reading about this data science project! I look forward to connecting with you!

You can find the datasets and Jupyter notebook here.

You can also find me on LinkedIn.

Learn more about the IBM Data Science Professional Certificate here.