
Hurricane Henri: A Data Story

Analyzing visitor data fluctuations due to the impact of Hurricane Henri across the East Coast

Hands-on Tutorials

Photo by NASA on Unsplash

Hurricane Henri was a Category 1 hurricane that caused heavy destruction across the East Coast in August 2021. The storm formed over the Atlantic on August 16th, made landfall in the state of Rhode Island on August 22nd, and dissipated by August 24th. Henri reached a peak intensity of 70 mph and led to destruction ranging from massive property damage to heavy rainfall and flooding in states such as New York, New Jersey, and Pennsylvania. This article analyzes the impact of the hurricane on visitor metrics for Points of Interest (POIs) in New York, New Jersey, and Pennsylvania.

To perform this analysis I will be using the SafeGraph weekly patterns data for POIs located in New York, New Jersey, and Pennsylvania. SafeGraph is a data provider that offers POI data across hundreds of business categories and provides data for free to academics. The weekly patterns data is similar in structure to the monthly patterns data, but is reported at a weekly and daily granularity rather than a monthly one. The documentation for the weekly patterns dataset can be found [here](https://docs.safegraph.com/docs/weekly-patterns).


Analysis:

The first step of the analysis is to load the data and perform some basic preprocessing to extract what we need. This involves expanding several columns compressed as JSON strings and arrays, such as the bucketed median dwell times and the daily visitor counts; to perform the JSON expansion we can use Spark's from_json function. The analysis covers August 02, 2021, through August 30, 2021, with the data split into weekly intervals stored in separate files. Since the preprocessing is identical across all four weeks, we can use the week of 08/02 as a baseline and assume the same steps are applied to the other files. The first step of this process is to load the data:

import pandas as pd

# Load the weekly patterns file and keep only POIs in NY, PA, and NJ
path0802 = '/content/drive/MyDrive/UpWork/safeGraph/Data Projects/Project 6/patterns0802.csv'
df0802 = pd.read_csv(path0802)
df0802 = df0802[df0802['region'].isin(['NY', 'PA', 'NJ'])]

# Drop columns not needed for this analysis
df0802 = df0802.drop(['parent_placekey', 'iso_country_code', 'safegraph_brand_ids',
                      'visitor_country_of_origin', 'related_same_day_brand',
                      'related_same_week_brand', 'device_type', 'brands',
                      'visitor_daytime_cbgs', 'visitor_home_aggregation',
                      'visitor_home_cbgs'], axis=1)

# Coerce the numeric columns, turning unparseable values into NaN
for col in ['postal_code', 'raw_visit_counts', 'raw_visitor_counts',
            'poi_cbg', 'distance_from_home', 'median_dwell']:
    df0802[col] = pd.to_numeric(df0802[col], errors='coerce')

# Move into Spark for the JSON expansion step (assumes an active SparkSession named spark)
df0802 = spark.createDataFrame(df0802)
df0802.show(2)
Image by Author

From here we can expand the JSON string columns:

# Horizontal expansion of the JSON-string column using PySpark
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, IntegerType

# Schema for the bucketed_dwell_times JSON string
bucketedDT_schema = StructType([
    StructField('<5', IntegerType(), True),
    StructField('5-10', IntegerType(), True),
    StructField('11-20', IntegerType(), True),
    StructField('21-60', IntegerType(), True),
    StructField('61-120', IntegerType(), True),
    StructField('121-240', IntegerType(), True),
    StructField('>240', IntegerType(), True)
])

# Parse the JSON string and flatten its fields into top-level columns
df0802 = (df0802
          .withColumn('bucketed_dwell_times',
                      from_json('bucketed_dwell_times', bucketedDT_schema))
          .select('placekey', 'location_name', 'street_address', 'city', 'region',
                  'postal_code', 'date_range_start', 'date_range_end',
                  'raw_visit_counts', 'raw_visitor_counts', 'visits_by_day',
                  'distance_from_home', 'median_dwell', 'bucketed_dwell_times.*'))

# Back to pandas for the remaining steps
df0802 = df0802.toPandas()
df0802.head(3)
Image by Author

The next step is to expand the array columns:

from ast import literal_eval

# visits_by_day is stored as a string like "[10,12,9,...]"; parse it into a list
df0802['visits_by_day'] = df0802['visits_by_day'].transform(literal_eval)

# Spread the 7 daily values into columns visit_1 ... visit_7
pops = ['visit_' + str(i) for i in range(1, 8)]
df0802[pops] = pd.DataFrame(df0802.visits_by_day.to_list(), index=df0802.index)
df0802 = df0802.drop(['visits_by_day'], axis=1)
df0802.head(3)
Image by Author

Now that the columns have been expanded in the dataset we can perform some basic analysis:

Trend analysis on visit data for 08/02:

import plotly.express as px

# Bar chart of raw visitor counts for every POI, colored by state
px.bar(df0802, x='location_name', y='raw_visitor_counts', color='region',
       width=10000000, height=1000, barmode='group')

Since this bar chart spans all POIs in New York, New Jersey, and Pennsylvania, its visualization is too expansive to show in full. Instead, we can focus on a single POI:
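As a sketch of that filtering step (assuming the chain is stored under the exact string "Wendy's" in location_name):

# Focus on a single chain before plotting; the exact location_name
# string is an assumption about how the data labels Wendy's locations
wendys = df0802[df0802['location_name'] == "Wendy's"]
px.bar(wendys, x='location_name', y='raw_visitor_counts',
       color='region', barmode='group')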

Image by Author

In this particular graph the blue bars represent Wendy's locations in New York, the red bars Wendy's locations in Pennsylvania, and the teal bars Wendy's locations in New Jersey. The separations among the individual bars show the various locations within each state, with each location's visitor data grouped together.

Taking these POIs into consideration, we can further the analysis by looking at the relationship between visitor counts and median dwell time and trying to identify a pattern:

px.scatter(df0802, x = 'raw_visitor_counts', y = 'median_dwell', color= 'region')
Image by Author

From analyzing the first week we can determine that there seems to be very little correlation between median dwell time and visitor counts for a given POI. The same trend repeats across all weeks of the analysis. You can see these visualizations in the notebook here if you are looking to dive deeper into this analysis.
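As a quick sanity check on that claim (a minimal sketch, assuming df0802 is the expanded pandas frame from above), we can compute the correlation directly:

# Pearson correlation between visitor counts and median dwell time;
# pandas ignores rows where either value is missing
corr = df0802['raw_visitor_counts'].corr(df0802['median_dwell'])
print(f'Correlation between visitor counts and median dwell: {corr:.3f}')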

One factor that becomes evident from this analysis is the sheer number of POIs in the data. If we want to examine the effects of Hurricane Henri across all four weeks of August, doing so with every POI would be cumbersome and make the analysis more complicated than necessary. To simplify the process and substantially reduce the number of POIs under consideration, we can randomly sample POIs from the first week and track those same POIs across all four weeks to see the hurricane's effect on their daily visitor metrics.

import numpy as np

# Randomly sample 10 POIs from the first week and tag them as Week 0
df0802_sample = df0802.sample(10)
df0802_sample['id'] = 'Week 0'
df0802_sample.head(2)

# Pull the same 10 placekeys out of the remaining three weeks
placekey_arr = np.array(df0802_sample['placekey'])
df0809_sample = df0809[df0809['placekey'].isin(placekey_arr)].copy()
df0809_sample['id'] = 'Week 1'
df0816_sample = df0816[df0816['placekey'].isin(placekey_arr)].copy()
df0816_sample['id'] = 'Week 2'
df0823_sample = df0823[df0823['placekey'].isin(placekey_arr)].copy()
df0823_sample['id'] = 'Week 3'
df0823_sample.head(2)

# Stack the four weekly samples into one frame
frames = [df0802_sample, df0809_sample, df0816_sample, df0823_sample]
sample = pd.concat(frames)
sample.head(50)

The snippet above takes the random sample of POIs from the first week of data and uses their placekey values to filter the remaining three weeks, ensuring that the same POIs are tracked across all four weeks. Plotting the visitor data for these weeks gives the following results:
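The plotting call itself is not shown above; a minimal sketch, reusing the id column we just added, might look like:

# Visitor counts for the sampled POIs, one color per week
px.bar(sample, x='location_name', y='raw_visitor_counts',
       color='id', barmode='group')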

Image by Author

From this plot we can see that the randomly selected POIs show some evident trends. To understand the importance of these trends and the impact of Hurricane Henri on these POIs, we first need to note that the hurricane affected these regions in the third week of August. With that in mind, we can see that some locations saw a spike in visitors in the third week (MTA Cortelyou Road Station). These could be locations where people went during the hurricane to seek shelter or access goods; in this case, MTA Cortelyou Road Station is a New York City subway station, and the influx of visitors could be attributed to the heavy rainfall in the area causing traffic problems. In many cases there is an increase in visitors to a POI in the fourth week after a decline in the third week (Franklin Towne Chs, Kathleen's Closet, etc.); these could be examples of visitors returning after the passing of the storm.

In some cases, however, there is no discernible trend in the visitor metrics that can be attributed to the hurricane. This lack of fluctuation could stem from two causes. First, these particular POIs may simply not have been affected by the storm: the data covers all POIs in the three selected states, and locations further inland were potentially less affected. Second, we may be unable to see the fluctuations because the weekly visitor numbers are a sum over the individual days of the week, so lowered visitor numbers on the day of the hurricane may be offset by increased visits in the days before it, as people stock up on supplies ahead of the storm. To address the first of these issues, the following steps are taken:

# Cities most affected by Henri in each state (per Wikipedia)
NJ_arr = ['Cranbury', 'Jamesburg', 'Concordia']
NY_arr = ['New York']
PA_arr = ['Albrightsville', 'Gouldsboro', 'Jim Thorpe']
cities = NJ_arr + NY_arr + PA_arr

# Keep only POIs in those cities and tag each frame with its week
df0802_sel = df0802[df0802['city'].isin(cities)].copy()
df0802_sel['id'] = 'Week 0'
df0809_sel = df0809[df0809['city'].isin(cities)].copy()
df0809_sel['id'] = 'Week 1'
df0816_sel = df0816[df0816['city'].isin(cities)].copy()
df0816_sel['id'] = 'Week 2'
df0823_sel = df0823[df0823['city'].isin(cities)].copy()
df0823_sel['id'] = 'Week 3'
df0823_sel.head(2)

# Stack the four weekly selections into one frame
frames = [df0802_sel, df0809_sel, df0816_sel, df0823_sel]
select = pd.concat(frames)
select.head()
Image by Author

To combat the issue of POIs not being close to the hurricane's path, we restrict the data to the locations most affected in each state according to Wikipedia. Visualizing this dataset, we see the following:
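A minimal sketch of that plotting call (the original call is not shown) could be:

# Visitor counts for POIs in the most affected cities, colored by week
px.bar(select, x='location_name', y='raw_visitor_counts',
       color='id', barmode='group')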

Image by Author

The graph above shows a much clearer and more expected pattern in the visitor data across each week. Hurricane Henri formed over the Atlantic Ocean on August 16 and dissipated over the New England area around August 24. Hence we would expect the visitor patterns of the first two weeks of August to essentially serve as a control group for a given POI, the third week to show some decrease in visitors, and the fourth week a slight recovery. This pattern is visible throughout many POIs in the selected towns most heavily affected by Hurricane Henri: visitor counts were stagnant or slightly fluctuating for the first two weeks, declined in the third week as the hurricane began to affect the region, and increased again as expected once the hurricane dissipated at the end of the third week.

To address the second problem, that summing visitors across the week can offset fluctuations on the day of the hurricane itself, we performed the following analysis:

# Re-load the raw weekly files (visits_by_day is still a JSON string here)
df0802_1 = pd.read_csv(path0802)
df0809_1 = pd.read_csv(path0809)
df0816_1 = pd.read_csv(path0816)
df0823_1 = pd.read_csv(path0823)

# For each week: isolate Battery Park City, explode the daily visit counts
# into one row per day, and suffix each row's location_name with its day
# index (0-6) so the days appear as distinct bars in the plot
raw_frames = {'week 0': df0802_1, 'week 1': df0809_1,
              'week 2': df0816_1, 'week 3': df0823_1}
bpc_frames = []
for week_id, df in raw_frames.items():
    bpc = df[df['location_name'] == 'Battery Park City'].copy()
    bpc['visits_by_day'] = bpc['visits_by_day'].transform(literal_eval)
    bpc = bpc.explode('visits_by_day')
    bpc['location_name'] = ['Battery Park City' + str(i) for i in range(len(bpc))]
    bpc['id'] = week_id
    bpc_frames.append(bpc)

BPC = pd.concat(bpc_frames)
BPC['visits_by_day'] = pd.to_numeric(BPC['visits_by_day'])
BPC.head()
Image by Author

Using the previous data from the most affected areas in New York, New Jersey, and Pennsylvania, we split the records by the day of the week they correspond to in order to remove the potential bias caused by the weekly summation. We picked three locations: Battery Park City, Walgreens, and the San Carlos Hotel. Plotting the results for Battery Park City gives the following visualization:
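A sketch of the plotting call for the exploded daily data (the suffixed location_name values give one bar per day):

# One bar per day of the week, grouped and colored by week
px.bar(BPC, x='location_name', y='visits_by_day',
       color='id', barmode='group')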

Image by Author

Looking at the day-to-day data for Battery Park City across these four weeks provides a kind of baseline for several reasons. First, the location is simply an upscale apartment complex, so there is no reason for visitors to arrive in large numbers before or after a storm for temporary shelter, food, or other supplies; this lets the location serve as a sort of ground truth for how the hurricane affects visitor data. From the visualization above we can see a noticeable drop in visitors in the middle of the third week, which could be linked to the heavy rainfall the area saw during this time, and a spike in visitors at the end of that week as the storm subsided and people could return to the POI. Another noticeable trend is the increase in visitors toward the end of the second week, which could be attributed to people coming into the location to be secure before the storm's arrival. To clarify this visualization, we plotted the week-over-week differences in visitor metrics, which conveys the same analysis in a more understandable way:

# Week-over-week differences in daily visits for Battery Park City
week0 = BPC[BPC['id'] == 'week 0']
week1 = BPC[BPC['id'] == 'week 1']
week2 = BPC[BPC['id'] == 'week 2']
week3 = BPC[BPC['id'] == 'week 3']
diff1 = list(week1['visits_by_day'].to_numpy() - week0['visits_by_day'].to_numpy())
diff2 = list(week2['visits_by_day'].to_numpy() - week1['visits_by_day'].to_numpy())
diff3 = list(week3['visits_by_day'].to_numpy() - week2['visits_by_day'].to_numpy())
combineddiff = diff1 + diff2 + diff3
location = np.array(BPC['location_name'])
data = {
    'Name': list(location[0:21]),  # 3 comparisons x 7 days of labels
    'visitor diff': combineddiff,
    'week': ['Week 1 - Week 0'] * 7 + ['Week 2 - Week 1'] * 7 + ['Week 3 - Week 2'] * 7
}
diffdf = pd.DataFrame(data=data)
diffdf.head()
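The difference chart itself can then be produced with a call along these lines (a sketch; the original plotting call is not shown):

# Week-over-week change in daily visits, one color per comparison
px.bar(diffdf, x='Name', y='visitor diff', color='week', barmode='group')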
Image by Author

The next step is to perform the same analysis on Walgreens, which produces the following visualizations:
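Rather than copying the Battery Park City block verbatim, the per-location logic can be wrapped in a small helper. This is a sketch under the assumption that the four raw weekly frames (df0802_1 through df0823_1) are still in memory and that the POI appears under the exact string passed in:

def daily_visits(location_name):
    # Explode visits_by_day into one row per day for a single POI,
    # across all four weekly files
    raw = {'week 0': df0802_1, 'week 1': df0809_1,
           'week 2': df0816_1, 'week 3': df0823_1}
    out = []
    for week_id, df in raw.items():
        poi = df[df['location_name'] == location_name].copy()
        poi['visits_by_day'] = poi['visits_by_day'].transform(literal_eval)
        poi = poi.explode('visits_by_day')
        # Suffix each row with its day index so days plot as distinct bars
        poi['location_name'] = [location_name + str(i) for i in range(len(poi))]
        poi['id'] = week_id
        out.append(poi)
    combined = pd.concat(out)
    combined['visits_by_day'] = pd.to_numeric(combined['visits_by_day'])
    return combined

walgreens = daily_visits('Walgreens')  # exact name string is an assumption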

Image by Author
Image by Author

For the second analysis in this section, I chose a convenience store, Walgreens, to see how the presence of the storm affects visits. Though the data for this particular store shows several days with no visitor data, we can see two trends. First, there is a large spike in visitors the day before the storm arrives in the city, with the subsequent days of that week showing a steep decline in visitor metrics due to the storm. Second, after the storm passes the store sees another large spike in visitors as people go out to find supplies.

The third part of this analysis was done on the San Carlos Hotel, with results seen in the following visualizations:

Image by Author
Image by Author

As a final example of a type of POI that can be explored in this project, we can look at the San Carlos Hotel in NYC. From this POI we can see how the need for temporary shelter at a hotel can drive up visitor metrics in the weeks before the storm, and how those metrics are affected in the weeks after. The visualization above shows that the number of visitors to the hotel peaked in the first week of August and then declined significantly in the following weeks as the storm approached the city. In week 3, when the storm actually reached the city, visitor metrics dropped significantly, suggesting that people sheltered at the hotel before the storm and few new visitors arrived during the heavy rainfall. In the week after the storm subsided, the number of visitors was still lower than in the first weeks, indicating the storm's negative impact on visits to the POI.

Conclusion:

From this analysis, we were able to see the ways in which visitor metrics fluctuated with the advent of Hurricane Henri and show the potential that the SafeGraph Weekly Patterns data has in assisting with this kind of analysis. For the next part of this series, we will again use the Weekly Patterns data in order to explore the impact of Hurricane Ida in heavily affected states such as Louisiana.

Questions?

I invite you to ask them in the #help channel of the SafeGraph Community, a free Slack community for data enthusiasts. Receive support, share your work, or connect with others in the GIS community. Through the SafeGraph Community, academics have free access to data on over 7 million businesses in the USA, UK, and Canada.

