Healthiest Cities in the US: Part 1, Food

How healthy are US cities?

Published in Towards Data Science · 7 min read · Jun 16, 2021

I have been thinking about applying data science and statistics to urban analysis, and in particular to the question of how healthy US cities are. In this multi-part series, we are going to analyse that question with data. For now, I plan to split the series into three parts: food, recreational spaces, and work-life balance. This first part focuses on food: how common fast food chains are across American cities, and how accessible grocery store food is to the American people. Let’s get started:

Data:

Fast Food Data: https://www.kaggle.com/datafiniti/fast-food-restaurants

Tools:

Jupyter Notebooks
Pandas (a data frame manipulation package)
Stats (Python statistical analysis library)
Seaborn and Matplotlib (visualization libraries)

1. Fast Foods

Let’s first gather some basic facts about the data:

How many unique cities and towns are there?

Result: 2764 unique cities and towns

What columns does the dataset contain for each fast food entry?

Result: Index(['id', 'dateAdded', 'dateUpdated', 'address', 'categories', 'city',
'country', 'keys', 'latitude', 'longitude', 'name', 'postalCode',
'province', 'sourceURLs', 'websites'],
dtype='object')

Which are the most common brands?

Result: McDonald’s (1898 restaurants), Taco Bell (1032), Burger King (833), Subway (776), Arby’s (663).

How many fast food restaurants are in the dataset?

Result: 10,000
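The questions above map onto a handful of pandas calls. Here is a minimal sketch on a tiny made-up data frame; the rows are invented for illustration, and only the column names `name`, `city` and `province` come from the dataset's column list. On the real data, `fast_food` would instead be loaded from the downloaded Kaggle file with `pd.read_csv`.

```python
import pandas as pd

# Toy stand-in for the Kaggle data; rows are invented for illustration.
fast_food = pd.DataFrame({
    "name": ["McDonald's", "Taco Bell", "McDonald's", "Subway", "McDonald's"],
    "city": ["Houston", "Houston", "Las Vegas", "Encino", "Encino"],
    "province": ["TX", "TX", "NV", "CA", "CA"],
})

n_cities = fast_food["city"].nunique()         # number of unique city/town names
columns = fast_food.columns                    # fields available for each entry
top_brands = fast_food["name"].value_counts()  # restaurants per brand, most common first
n_rows = len(fast_food)                        # total number of entries

print(n_cities, list(columns), top_brands.head(), n_rows, sep="\n")
```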

One major question in figuring out the healthiest US cities is how many fast food businesses there are per city. To answer this, we can group the data by city name using pandas and then sort the data frame by the number of fast food restaurants in each city, like so:
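The grouping and sorting step could look roughly like the sketch below. The data frame is a toy example, and the column name `fast_food_count` is my own choice for illustration, not necessarily the one used in the original notebook.

```python
import pandas as pd

# Toy data: one row per restaurant, as in the Kaggle file.
fast_food = pd.DataFrame({
    "name": ["McDonald's", "Subway", "Taco Bell", "Arby's", "Subway", "McDonald's"],
    "city": ["Houston", "Houston", "Houston", "Las Vegas", "Las Vegas", "Encino"],
})

# Count rows per city, then sort so the cities with the most restaurants come first.
city_counts = (
    fast_food.groupby("city")
    .size()
    .reset_index(name="fast_food_count")
    .sort_values("fast_food_count", ascending=False)
    .reset_index(drop=True)
)
print(city_counts)
```

Note that grouping on `city` alone merges towns that share a name across states; grouping on `["city", "province"]` would keep them apart if that matters for the analysis.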

Unsurprisingly, large cities such as Houston and Las Vegas have the highest number of fast food restaurants because of their sheer size. Therefore, simply using the restaurant count will not be enough: what we need is a variable that scales this count by population. To do so, let’s devise a new metric, the fast food count divided by each city’s population. This variable takes a high value when a city has many fast food restaurants but a small population, and a low value in the opposite case. The code for adding this metric to the data frame is as follows:
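As a rough sketch of the metric: the Kaggle file has no population column, so the population table below is a hypothetical stand-in for whatever source is joined in, and the city names, numbers and the column name `fast_food_per_capita` are all invented for illustration.

```python
import pandas as pd

# Toy restaurant counts per city (made-up values).
city_counts = pd.DataFrame({
    "city": ["Big City", "Mid City", "Small Town"],
    "fast_food_count": [200, 30, 5],
})

# Hypothetical population table; assumed to come from a separate source.
population = pd.DataFrame({
    "city": ["Big City", "Mid City", "Small Town"],
    "population": [2_000_000, 150_000, 2_500],
})

# Join the two tables and divide the count by the population.
merged = city_counts.merge(population, on="city", how="inner")
merged["fast_food_per_capita"] = merged["fast_food_count"] / merged["population"]

# Highest values first: many restaurants relative to the number of residents.
top = merged.sort_values("fast_food_per_capita", ascending=False).reset_index(drop=True)
print(top)
```

In this toy example the small town ends up on top even though it has by far the fewest restaurants, which is exactly the effect the metric is meant to capture.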

And here is a look at the top 20 cities and towns with the highest value of the metric we devised:

You can see towns such as Encino (California), Guilford (North Carolina), and Kingdom City (Missouri) at the top of the list. Interestingly, all of these cities and towns are small in both size and population, which suggests that the very largest cities have fewer fast food restaurants serving more customers, making them healthier by our metric. Let’s check whether this is actually the case. First, here is a scatter plot of the population of every city in the data set against its fast food count:

While the relationship between population and number of fast food restaurants seems linear for towns of up to 250,000 people, it becomes less so as population increases. Here is a scatter plot of population versus the metric we devised:

In this case, the relationship actually seems slightly negative: as population increases, there are fewer fast food restaurants available per person.

To see whether there is an actual difference between cities with smaller and larger populations in terms of fast food restaurants, we can run a statistical test on the two groups. For this story, we define two city sizes: a population of 25,000 or fewer, and a population above 25,000. Before testing for differences between these two groups, we need to decide which test to use. For this, we must first check whether the data in each group are normally distributed. This can be done with both a qqplot and the Shapiro-Wilk test for normality:

qqplot

The qqplot is a scatter plot of two sets of quantiles against one another, here the quantiles of our sample against those of a normal distribution. If both sets of quantiles came from the same distribution, the points should form a roughly straight line. The code below splits the data set into our two pre-defined city sizes.
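Here is a minimal sketch of the split and of the quantile comparison behind a qqplot. It uses synthetic, deliberately skewed data rather than the real data set, and it assumes the quantity being compared is the per-capita metric (the column name `fast_food_per_capita` is hypothetical). I use `scipy.stats.probplot` here, which may differ from the plotting function used in the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the city-level data frame (not the article's data).
cities = pd.DataFrame({
    "population": rng.integers(500, 500_000, size=200),
    "fast_food_per_capita": rng.lognormal(mean=-7, sigma=1, size=200),
})

# Split at the 25,000 threshold defined in the text.
small = cities[cities["population"] <= 25_000]
large = cities[cities["population"] > 25_000]

# probplot pairs theoretical normal quantiles with the ordered sample values
# and fits a line through them; r close to 1 means the points are nearly straight.
(theo_q, sample_q), (slope, intercept, r) = stats.probplot(
    large["fast_food_per_capita"], dist="norm"
)
print(f"large cities: n={len(large)}, Q-Q correlation r={r:.3f}")
```

To draw the chart itself, `probplot` also accepts a `plot` argument, for example `plot=plt` after `import matplotlib.pyplot as plt`.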

And here is the qqplot for the larger (more than 25,000 people) cities:

Here is the qqplot for the smaller towns:

As you can see, neither group fits the straight line well, which suggests that the data are not normally distributed.

Shapiro-Wilk

The Shapiro-Wilk test for normality works by setting up two hypotheses:

Null Hypothesis: The data is normally distributed.
Alternative Hypothesis: The data is not normally distributed.

Here is the result of the Shapiro-Wilk test for the larger towns:

And here is the result of the Shapiro-Wilk test for the smaller towns:
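For readers who want to try the test themselves, here is a minimal sketch with `scipy.stats.shapiro`. The two samples are synthetic (one skewed, one normal by construction) and are not the article's data; the 0.05 significance level is a common convention that I am assuming here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0, sigma=1, size=300)  # clearly non-normal sample
normal = rng.normal(loc=0, scale=1, size=300)      # normal by construction

alpha = 0.05  # assumed significance level
for label, sample in [("skewed", skewed), ("normal", normal)]:
    # shapiro returns the W statistic and the p-value of the test.
    w_stat, p_value = stats.shapiro(sample)
    verdict = "reject" if p_value < alpha else "fail to reject"
    print(f"{label}: W={w_stat:.3f}, p={p_value:.3g} -> {verdict} the null hypothesis")
```

A p-value below the chosen level means we reject the null hypothesis that the data are normally distributed.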

Mann-Whitney test

Now that we have decided the data is not normally distributed, a suitable test for differences between our two groups is the Mann-Whitney test, which does not assume normality. For this test, we should also state a null and an alternative hypothesis:

Null Hypothesis: The two groups come from the same underlying population.
Alternative Hypothesis: The two groups do not come from the same underlying population.

The code for this test is as follows:
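A test along these lines can be sketched with `scipy.stats.mannwhitneyu`. The two samples below are synthetic and built to differ, so they only illustrate the call; the variable names and the 0.05 level are my own assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic per-capita values for the two city-size groups (not the real data).
small_towns = rng.lognormal(mean=-6.0, sigma=0.5, size=150)
large_cities = rng.lognormal(mean=-7.0, sigma=0.5, size=150)

# Two-sided test: are the two samples drawn from the same distribution?
u_stat, p_value = stats.mannwhitneyu(
    small_towns, large_cities, alternative="two-sided"
)

alpha = 0.05  # assumed significance level
print(f"U={u_stat:.1f}, p={p_value:.3g}")
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")
```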

The Mann-Whitney test tells us that there is indeed a difference between smaller and larger towns in terms of fast food restaurants, which is interesting for understanding the type of food available to Americans depending on the city they live in.

2. Grocery Food Access

For the second part of this story, we are going to be looking at which US counties have the most access to grocery food. The data is taken from the Food Environment Atlas of the Economic Research Service (https://www.ers.usda.gov/data-products/food-environment-atlas/).

As you can see, the data set contains county-level information on various metrics. The one we are going to use is “LACCESS_POP10”, which is the number of people per county who had low access to grocery food in 2010. Let’s first filter the data set and look at the mean value of this metric for each state:
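The filter-then-average step could look like the sketch below. It assumes the atlas is in long format with `State`, `County`, `Variable_Code` and `Value` columns; those column names are my assumption about the download, and the rows are invented.

```python
import pandas as pd

# Toy long-format table; column names are assumed, values are made up.
atlas = pd.DataFrame({
    "State": ["TX", "TX", "DE", "DE", "TX"],
    "County": ["A", "B", "C", "D", "A"],
    "Variable_Code": ["LACCESS_POP10"] * 4 + ["OTHER_VAR"],
    "Value": [1000.0, 3000.0, 5000.0, 7000.0, 42.0],
})

# Keep only the low-access rows, then average the county values within each state.
low_access = atlas[atlas["Variable_Code"] == "LACCESS_POP10"]
state_means = (
    low_access.groupby("State")["Value"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
print(state_means)
```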

The “Value” column contains the average number of people per county with low access to grocery food. The highest value belongs to Massachusetts, followed by Connecticut and New Jersey, which might be slightly surprising given that they are all states with large urban areas. As we did with the fast food data, we should scale these numbers by each state’s population and express them as a ratio, so let’s use the following variable: the number of people who had low access to groceries divided by the state’s population. Here is the code:
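One way to compute such a ratio is sketched below. The state populations are a hypothetical extra table, the state codes and numbers are invented, and the column name `low_access_ratio` is my own. The sketch sums the county values per state before dividing, which is one reasonable reading of the metric and may differ from the original notebook.

```python
import pandas as pd

# Toy county-level low-access counts (made-up values).
low_access = pd.DataFrame({
    "State": ["AA", "AA", "AA", "BB"],
    "Value": [100.0, 200.0, 300.0, 400.0],
})

# Hypothetical state populations; assumed to come from a separate source.
state_pop = pd.DataFrame({"State": ["AA", "BB"], "population": [6000, 1000]})

# Total low-access population per state, divided by the state's population.
state_totals = low_access.groupby("State", as_index=False)["Value"].sum()
ratios = state_totals.merge(state_pop, on="State")
ratios["low_access_ratio"] = ratios["Value"] / ratios["population"]

ranked = ratios.sort_values("low_access_ratio", ascending=False).reset_index(drop=True)
print(ranked.head(10))  # states with the highest share of low-access residents
print(ranked.tail(10))  # states with the lowest share
```

Summing rather than averaging keeps the ratio independent of how many counties a state has; dividing a per-county mean by the state population would tend to push states with few counties toward the top of the list.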

To interpret the results: large values of the variable mean a higher proportion of people with low access in a state, and small values mean a lower proportion. Here are the top 10 states, sorted in descending order by the variable:

And the bottom 10 states:

Texas is the state with the smallest proportion of people living with low access to grocery food, whereas Delaware has the largest, followed by Rhode Island and Hawaii. Do these results seem surprising to you? They are a little surprising to me, as I would have expected the more rural states in the center of the US to dominate the list, given that grocery shops might be sparse there. Interestingly, however, this does not seem to be the case.

I hope you liked this short story, and if you want to look at the full code, you can find it here: https://github.com/DeaBardhoshi/Data-Science-Projects/blob/main/_Urban%20Health%20analysis%20Part%201%20-%20Food.ipynb

Thanks for reading!