All analysis in this article uses SafeGraph CBG data and Patterns data. Please cite SafeGraph for any reference to this article.
This article is the first in a two-part series on clustering Census Block Group (CBG) data. The purpose of this project is to use CBG clusters to predict future Chipotle site locations. Site selection is a major point of interest for large franchises like Chipotle that are looking to expand into new markets. Between 2020 and 2021, Chipotle opened over 100 new locations in the United States, making it an ideal candidate for testing this site selection algorithm. This article focuses on using US Census data and SafeGraph’s Neighborhood Patterns data to first create clusters and then match those clusters with Chipotle locations.
This article primarily uses scikit-learn’s K-Means clustering algorithm to produce clusters and attribute them to Chipotle locations across America. Both the Chipotle Places data and the US Census Block Group data can be accessed through SafeGraph, a data provider that offers POI data for hundreds of businesses and categories and makes its data freely available to academics. For this project, I have chosen to use SafeGraph Places data for my analysis. The schema for the Places data can be found here: Schema Info
CBG Data Selection: Which Demographics to Choose?

The following snippet allows us to access the US Census data’s field descriptions from the SafeGraph package:
!pip install -q --upgrade git+https://github.com/SafeGraphInc/safegraph_py
from safegraph_py_functions import safegraph_py_functions as sgpy
from safegraph_py_functions import cbg_functions as sgpy_cbg

# Pull the field descriptions table and inspect the record types it contains
df = sgpy_cbg.get_cbg_field_descriptions()
print(df['field_level_1'].value_counts())

# Keep only the 'Estimate' records and save a copy for reference
df = df[df['field_level_1'] == 'Estimate']
df.to_csv('census.csv')
The field descriptions table lists the features available in the actual census dataset. Since that dataset offers over 7,500 features, having this table makes selecting the features we need for our analysis much more straightforward. Note that for this analysis we limit the field descriptions table to just the ‘Estimate’ records rather than both the ‘Estimate’ and confidence records. The ‘Estimate’ records provide the numerical values associated with the count of a given demographic in a census block group, which is exactly the information we want to use. Here is what the field descriptions table looks like:

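With over 7,500 rows, scrolling through this table by hand is impractical, so a substring search over the descriptions is a quick way to shortlist candidate features. Here is a minimal sketch, assuming the table exposes a human-readable ‘field_full_name’ column (an assumption on my part; check df.columns in your version of the package):
# Shortlist income-related fields by searching the description text
# ('field_full_name' is assumed here; verify against df.columns)
income_fields = df[df['field_full_name'].str.contains('income', case=False, na=False)]
print(income_fields.head())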
While this table is a little cryptic and hard to decipher, each record in it essentially corresponds to a unique column in the actual census data. Given this, choosing the right demographics is essential to the accuracy of the clusters generated for this project. For the purposes of this project, we chose to consider features that correlate with Age, Gender, and Income, as well as other factors such as homeownership and work hours. In total, we looked at about 70 features for the clustering algorithm, and the code to select them looks like the following:
# Age/sex (B01001), race (B02001), household relationship (B09019), and
# household income (B19001) columns, pulled for the 2019 vintage
census_cols = sgpy_cbg.get_census_columns(['B01001e1',
'B01001e10','B01001e11','B01001e12','B01001e13','B01001e14',
'B01001e15','B01001e16','B01001e17','B01001e18','B01001e19',
'B01001e20','B01001e21','B01001e22','B01001e23','B01001e24',
'B01001e25','B01001e27','B01001e28','B01001e29','B01001e30',
'B01001e31','B01001e32','B01001e33','B01001e34','B01001e35',
'B01001e36','B01001e37','B01001e38','B01001e39','B01001e40',
'B01001e41','B01001e42','B02001e1','B02001e10','B02001e2',
'B09019e2','B09019e20','B09019e21','B09019e22','B09019e23',
'B09019e24','B09019e25','B09019e26','B09019e3','B09019e4',
'B09019e5','B09019e6','B09019e7','B09019e8','B09019e9',
'B19001e1','B19001e10','B19001e11','B19001e12','B19001e13',
'B19001e14','B19001e15','B19001e16','B19001e17','B19001e2',
'B19001e3','B19001e4','B19001e5','B19001e6','B19001e7',
'B19001e8','B19001e9'], 2019)
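If you have SafeGraph’s Open Census Data CSVs downloaded locally, one way to pull these columns is with plain pandas. The following is a sketch under assumptions: file names such as cbg_b01.csv follow the Open Census Data layout, where each table prefix lives in its own CSV, and the paths should be adjusted to match your download:
import pandas as pd

# Sketch under assumptions: each census table prefix (b01, b02, ...) lives
# in its own CSV in the Open Census Data download, keyed by census_block_group
frames = []
for prefix in ['b01', 'b02', 'b09', 'b19']:
    cols = [c for c in census_cols if c.lower().startswith(prefix)]
    part = pd.read_csv(f'safegraph_open_census_data/data/cbg_{prefix}.csv',  # assumed path
                       usecols=['census_block_group'] + cols,
                       dtype={'census_block_group': str})
    frames.append(part.set_index('census_block_group'))
df = pd.concat(frames, axis=1).reset_index()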
The dataframe of the selected features looks like this:

After cleaning up the data slightly (converting all numeric values to floats, as sketched below), we can append this data to the SafeGraph Neighborhood Patterns data. Neighborhood Patterns gives us columns for the raw visitor counts to a particular census block group, the distance those visitors traveled to reach the CBG, and the median time spent there. Many more columns in Neighborhood Patterns may be useful, but for simplicity we use just these.
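The float conversion mentioned above can be as simple as the following sketch, assuming the census features are the only columns besides the census_block_group key:
# Cast every census feature column to float; keep the
# census_block_group identifier as the join key
feature_cols = [c for c in df.columns if c != 'census_block_group']
df[feature_cols] = df[feature_cols].astype(float)
With the features cast to floats, we select the relevant Neighborhood Patterns columns and merge: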
# Keep only the visit, distance, and dwell-time columns, then rename the
# 'area' key to match the census dataframe's join key
patterns = patterns[['area', 'raw_stop_counts', 'raw_device_counts', 'distance_from_home', 'distance_from_primary_daytime_location', 'median_dwell']]
patterns = patterns.rename(columns={'area': 'census_block_group'})
df = df.merge(patterns, on='census_block_group')
This results in the two dataframes being merged together:

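One practical caveat with this merge: pandas will refuse to merge (or return no matches) when the key dtypes differ, and ‘area’ can load as an integer while the census key is a string. If the merge fails or comes back empty, casting both keys to strings first is a safe fix:
# Defensive cast in case the join keys loaded with different dtypes
patterns['census_block_group'] = patterns['census_block_group'].astype(str)
df['census_block_group'] = df['census_block_group'].astype(str)
df = df.merge(patterns, on='census_block_group')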
CBG Clustering: The Final Step

The first step of this process is to run K-means clustering on the assembled dataset, grouping records by the provided features. For the purposes of this experiment, we chose to create 50 clusters. The code snippet below performs this operation. Note that clustering this many records with 1,000 random restarts (n_init=1000) leads to a long runtime:
from sklearn.cluster import KMeans

# Fit on the feature columns only; the CBG identifier is not a feature
features = df.drop(['census_block_group'], axis=1)

model = KMeans(init="random",
               n_clusters=50,
               n_init=1000,
               max_iter=400,
               random_state=42)
model.fit(features)
pred = model.predict(features)
To see the distribution of records across cluster IDs the following snippet can be used:
freq = {}
for item in pred:
    if item in freq:
        freq[item] += 1
    else:
        freq[item] = 1
To visualize this distribution we can plot it using the following snippet of code:
import matplotlib.pyplot as plt

plt.bar(freq.keys(), freq.values())

From this distribution we can see that the records are not evenly distributed between clusters, a sign that the variation in each record’s features drives the k-means algorithm’s cluster assignments.
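To put a number on that unevenness, the cluster sizes can be summarized directly from the predictions:
import numpy as np

# Summarize cluster sizes; a wide min-to-max gap confirms the
# uneven assignment visible in the bar chart
sizes = np.bincount(pred, minlength=50)
print(sizes.min(), sizes.max(), round(sizes.std(), 1))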
Conclusion:
From this article, we can see the versatility of the Census Block Group data, the Neighborhood Patterns data, and the Monthly Patterns data that SafeGraph provides. Using these three datasets, we derived a set of over 70 features that we determined to be essential to the site selection process for future Chipotle locations. These demographics covered Age, Gender, and Income, along with other factors such as homeownership and work hours. We then used the Neighborhood Patterns data to add a few more features: the raw visitor counts to a particular census block group, the distance those visitors traveled to reach the CBG, and the median time spent there. Using this data, we joined the CBG information to the Monthly Patterns data for Chipotle for 2020. This complete dataset was then clustered into 50 unique clusters. For the next step of this project and the next article, we will use the 2021 Chipotle data as the ground truth and see which clusters receive the new locations.
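As a rough sketch of that join step, assume a hypothetical chipotle_patterns dataframe loaded from the Chipotle Monthly Patterns download; its poi_cbg column (per the Patterns schema) gives each store’s census block group:
# Hypothetical sketch: tag each CBG with its cluster ID, then attach
# the cluster to each Chipotle location via the store's home CBG
df['cluster'] = pred
chipotle = chipotle_patterns.merge(
    df[['census_block_group', 'cluster']],
    left_on='poi_cbg', right_on='census_block_group', how='left')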
Questions?
I invite you to ask them in the #help channel of the SafeGraph Community, a free Slack community for data enthusiasts. Receive support, share your work, or connect with others in the GIS community. Through the SafeGraph Community, academics have free access to data on over 7 million businesses in the USA, UK, and Canada.