
Liquor Consumption: a Data Story

Liquor Store Activity in the City of Denver: Overall Trend and Seasonal Analysis

Photo by Colin Lloyd on Unsplash

Humans have long turned to inebriation as a form of recreation and an escape from the burdens of daily life. Since its discovery, alcohol has served as a tool for rest and relaxation, and its consumption has become a form of social bonding. Recreational marijuana use, by contrast, was long looked down on, though that attitude is now shifting toward a more positive view of the substance. In 2012 both of these major recreational substances came into play in the city of Denver, for that was the year marijuana was legalized in Colorado. With two widely used recreational drugs in the city and over a hundred dispensaries opening since the legalization of cannabis, it seems only logical to analyze the consumption trends of these substances in Denver. This article, the first in a series analyzing liquor and marijuana popularity in Denver, addresses only the Denver liquor stores; the next article in the series will look into the dispensaries of Denver.

For our analysis, we will be using SafeGraph Patterns data as well as data from the City of Denver. SafeGraph is a data provider that offers POI data for thousands of businesses and categories, free of charge to academics. In this particular project, the visit and popularity metrics from the Patterns dataset will prove very helpful for our analysis. The schema for the Patterns data can be found here: Schema Info

_Note: If you want to follow along with the code presented in this article, you can check out the link to the notebook. It provides all of the code snippets for this project and gives a better idea of how each snippet of code is interconnected._

Photo by Fabrizio Conti on Unsplash

Upon downloading the Liquor data from the City of Denver’s archives, we can start to visualize the first half of the dataset that we will use for analysis of liquor stores:

This is the schema for the Denver Liquor data:

  1. BFN – BFN identifier
  2. BUS_PROF_NAME – Business name
  3. FULL_ADDRESS – Address of the business
  4. LICENSES – License type
  5. LIC_STATUS – Status of the license
  6. ISSUE_DATE – Date of license issue
  7. END_DATE – License expiration date
  8. ADDRESS_ID – Unique address identifier
  9. ADDRESS_LINE1 – Primary address associated with the establishment
  10. ADDRESS_LINE2 – Secondary address associated with the establishment
  11. CITY – City of establishment
  12. STATE – State of establishment
  13. ZIP – Postal code of establishment
  14. COUNCIL_DIST – Council district identifier
  15. POLICE_DIST – Police district identifier
  16. CENSUS_TRACT – Census tract identifier
  17. NEIGHBORHOOD – Neighborhood of the liquor store
  18. ZONE_DISTRICT – Zone district identifier
  19. X_COORD – Longitude of establishment
  20. Y_COORD – Latitude of establishment

An interesting column that stands out in this data is the licenses column. Let us see what the unique licenses in this dataset are:

#bar chart of unique types of liquor licenses
license_counts = liquor_df['LICENSES'].value_counts()
plt.bar(license_counts.index, license_counts.values)
license_counts

From this data we will keep only the records that correspond to locations with active liquor licenses:

liquor_df = liquor_df.where(liquor_df['LIC_STATUS'] == 'LICENSE ISSUED - ACTIVE').dropna()

The next steps are a few data cleaning tasks that are done to ensure that the data is exactly what we are looking to analyze. In this particular analysis we will only be looking at data from liquor stores and no other sources of data:

# Dropping irrelevant columns
liquor_df = liquor_df.drop(['ADDRESS_ID','ADDRESS_LINE1','ADDRESS_LINE2'], axis = 1)
#Dropping records that aren't from Denver, CO
liquor_df = liquor_df.where(liquor_df['CITY'] == 'Denver').dropna()
liquor_df = liquor_df.where(liquor_df['LICENSES'] == 'LIQUOR - STORE').dropna()
liquor_df['index'] = range(1, len(liquor_df) + 1)

Now we can load the patterns data for the liquor stores of Denver from SafeGraph. The output will look like this:
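The loading step itself is not shown here; as a rough sketch, reading a Patterns export into patterns18_df looks something like the following, where the two rows and three columns are stand-ins that only illustrate the expected shape of the data, not the real SafeGraph schema:

```python
import io
import pandas as pd

# A minimal sketch: in practice, read_csv would point at your SafeGraph
# Patterns download. The rows below are invented stand-ins.
sample_csv = io.StringIO(
    "placekey,raw_visit_counts,visits_by_day\n"
    "223-222@5qw-abc-xyz,120,\"[3,4,5]\"\n"
    "224-223@5qw-def-xyz,40,\"[1,2,1]\"\n"
)
patterns18_df = pd.read_csv(sample_csv)
patterns18_df.head()
```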

So now we have two separate datasets: one from the official City of Denver with information such as location coordinates and liquor license type, and one from SafeGraph with information regarding visit and popularity metrics. The question becomes: how do we join these two datasets? The answer can be found in the concept of Placekey generation.

Placekey solves the many issues that come with address matching by generating a unique identifier for every point of interest and using the generated value as a join column. The process for creating a Placekey is straightforward, and an excellent step-by-step guide to this process can be found here.

placekey_api_key = 'YOUR_API_KEY'  # request a free API key from Placekey
pk_api = PlacekeyAPI(placekey_api_key)
def get_df_for_api(df,
                   column_map = {"index": "query_id",
                                 "BUS_PROF_NAME": "location_name",
                                 "FULL_ADDRESS": "street_address",
                                 "CITY": "city",
                                 "region": "region",
                                 "ZIP": "postal_code"}):
    df_for_api = df.rename(columns=column_map)
    cols = list(column_map.values())
    df_for_api = df_for_api[cols]
    df_for_api['iso_country_code'] = 'US'
    return(df_for_api)
liquor_df['index'] = liquor_df['index'].astype(str)
liquor_df['region'] = liquor_df['STATE']
df_for_api = get_df_for_api(liquor_df)
df_for_api.head(3)
data_jsoned = json.loads(df_for_api.to_json(orient="records"))
print("number of records: ", len(data_jsoned))
print("example record:")
data_jsoned[0]
responses = pk_api.lookup_placekeys(data_jsoned, verbose=True)
df_placekeys = pd.read_json(json.dumps(responses), dtype={'query_id':str})
df_placekeys.head(7)

This snippet merges the generated Placekeys into the Denver City liquor data that we cleaned up earlier:

def merge_and_format(loc_df, placekeys_df):
    lr_placekey = pd.merge(loc_df, placekeys_df, left_on="index", right_on="query_id", how='left')
    lr_placekey = lr_placekey.drop('error', axis=1)
    # use the placekeys_df parameter, not the global df_placekeys
    lr_placekey['address_placekey'] = placekeys_df.placekey.str[:3] + placekeys_df.placekey.str[-12:]
    lr_placekey = lr_placekey[['placekey', 'address_placekey'] + list(loc_df.columns)]
    return(lr_placekey)
loc_placekey = merge_and_format(liquor_df, df_placekeys)
loc_placekey.head(3)

This snippet now takes the Denver City data and joins it with the SafeGraph Patterns data using the generated Placekeys:

def merge_with_patterns(patterns_df, loc_res_placekey):
    patterns_df['address_placekey'] = patterns_df.placekey.str[:3] + patterns_df.placekey.str[-12:]
    df = loc_res_placekey.merge(patterns_df.drop('placekey', axis=1), how='inner',on='address_placekey')
    df = df.reset_index().drop('index',axis=1)
    return(df)
df = merge_with_patterns(patterns18_df, loc_placekey)
cols = list(df.columns)
cols.pop(cols.index('address_placekey'))
df = df[['address_placekey'] + cols]
print(df.shape)
liquor_GS_df = df
df.head(3)

From the shape of the data (791, 48), we can see that the initial columns of the Denver liquor data have now been appended with the additional columns from the Patterns data.

Now that all of our data is ready to go we can start on the analysis and see what story our data has to tell us. Let’s start our analysis with the liquor data.

Photo by Silas Baisch on Unsplash

Liquor Stores in Denver and the Mystery of SKYE LIQUOR

First, let’s analyze the average monthly visits for the entire dataset. To do this, we can take just the needed columns to avoid clutter and store them in a separate dataframe:

l_avg_df = liquor_GS_df[['BUS_PROF_NAME','FULL_ADDRESS','LICENSES','ZIP','NEIGHBORHOOD','date_range_start','date_range_end','raw_visit_counts','raw_visitor_counts','visits_by_day']]

From here we can add a monthly average visits column to the dataframe, but to do so we must first convert the visits column from string to array:

from ast import literal_eval
l_avg_df['visits_by_day'] = l_avg_df['visits_by_day'].transform(lambda x: literal_eval(x))
l_avg_df['Monthly_avg_visits'] = l_avg_df['visits_by_day'].transform(lambda x: sum(x)/len(x))

Now we can see which stores had the most and least average monthly visits in the year 2018:

max_avg = l_avg_df[l_avg_df['Monthly_avg_visits'] == max(l_avg_df['Monthly_avg_visits'])]
min_avg = l_avg_df[l_avg_df['Monthly_avg_visits'] == min(l_avg_df['Monthly_avg_visits'])]
print(max_avg)
print(min_avg)

From this, we can see that SKYE LIQUOR was visited very often in the year 2018. But how can we be sure that this seemingly large number of visits isn’t just a skew in the data, an artifact of comparing the value against all of the least visited stores in Denver? A quick solution is to create a boxplot.
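The boxplot can be sketched as below; the visit averages here are made-up stand-ins for the real Monthly_avg_visits column, with one extreme value playing the role of SKYE LIQUOR:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in averages: a cluster of typical stores plus one extreme value.
l_avg_df = pd.DataFrame({"Monthly_avg_visits": [1.5, 2.0, 2.2, 2.5, 3.0, 80.0]})

fig, ax = plt.subplots()
ax.boxplot(l_avg_df["Monthly_avg_visits"])
ax.set_ylabel("Average monthly visits")

# the median stays near the cluster even though the outlier is huge
median = l_avg_df["Monthly_avg_visits"].median()
print(median)
```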

From this, we can see that while there are many values close to that of SKYE LIQUOR, all of these values are outliers and not representative of the true mean of the data, which is closer to 2.5. Thus the popularity of this store is dramatically higher than that of its fellow liquor stores in the city, an interesting finding…

When plotting the average visits per month of all records in the data as a bar graph, we see an even more striking trend. The green bars show the values for SKYE LIQUOR, whose visits range from 20 to 140: a range strikingly higher than the visit counts for all other records. What could be so special about this store?
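A sketch of that bar chart, with invented store names and averages, drawing the SKYE LIQUOR bar in green as in the article's figure:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Invented stand-in values -- only the highlighting technique matters here.
stores = ["STORE A", "STORE B", "SKYE LIQUOR", "STORE C"]
avg_visits = [2.1, 1.8, 80.0, 2.6]
colors = ["tab:green" if s == "SKYE LIQUOR" else "tab:blue" for s in stores]

fig, ax = plt.subplots()
ax.bar(stores, avg_visits, color=colors)
ax.set_ylabel("Average monthly visits")
```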

Let us analyze the hourly popularity of these records and see if the results of that analysis also show this trend.

l_pop_df = liquor_GS_df[['BUS_PROF_NAME','FULL_ADDRESS','LICENSES','ZIP','NEIGHBORHOOD','date_range_start','date_range_end','popularity_by_hour','popularity_by_day']]

Like before we must create a monthly average popularity column for our analysis:

l_pop_df['popularity_by_hour'] = l_pop_df['popularity_by_hour'].transform(lambda x: literal_eval(x))
l_pop_df['Monthly_avg_pop'] = l_pop_df['popularity_by_hour'].transform(lambda x: sum(x)/len(x))

Using our newly created column, let us see which liquor stores were least and most popular in the city of Denver in 2018:

max_avg_pop = l_pop_df[l_pop_df['Monthly_avg_pop'] == max(l_pop_df['Monthly_avg_pop'])]
min_avg_pop = l_pop_df[l_pop_df['Monthly_avg_pop'] == min(l_pop_df['Monthly_avg_pop'])]
print(max_avg_pop)
print(min_avg_pop)

The difference between those values is astounding! What could be the root cause of this store having such amazing popularity and visitor metrics in comparison to its surrounding competitors? To make sure this isn’t a case of skewed data, we should first look at the boxplot.

We can see that, like before, the popularity of SKYE LIQUOR is a vast outlier in comparison to the remainder of the data and the average popularity, which appears to be below 10. The difference in popularity between this store and its surroundings is truly intriguing. What is so special about this location?

l_pop_df[l_pop_df['BUS_PROF_NAME'] == 'SKYE LIQUOR'].head()

A quick Google search shows that the location itself has only mediocre ratings (3/5 on Google, 1.5/5 on Yelp, and 1/5 on Facebook). How do other stores in the same area fare in terms of popularity?

l_pop_df[l_pop_df['ZIP'] == '80202'].head()

While the remaining two stores in this particular area sit much closer to the mean popularity of the overall dataset, this store's popularity is far higher than the others'. What is bringing SKYE LIQUOR this kind of foot traffic?

Perhaps the popularity of SKYE LIQUOR has something to do with the locations that surround it. This information is best retrieved through the related_same_day_brand column:

liquor_GS_df[liquor_GS_df['BUS_PROF_NAME'] == 'SKYE LIQUOR']['related_same_day_brand'].tolist()

It seems that SKYE LIQUOR may have some of its popularity attributed to its surroundings. The location is placed next to a very popular rooftop diner in Denver and receives large crowds from the nearby Costco and Pure Barre gym. In addition to these locations, the store is right next to a Smashburger which will attract lots of customers as well. Thus the answer to this store’s success could in some part be attributed to its location.

Now that we have performed a basic dive into this data and the intricacies behind the success of one particular liquor store, it’s time to take a step back and look at the overall trend of liquor consumption for the year 2018.

Liquor Consumption: a Monthly Analysis

_Note: The code for performing the same analysis for every month of a year can be a bit repetitive as one can imagine. Thus to compress the size of this article and reduce repetition, I will be showing the code for just the analysis done in January and displaying only results for all subsequent months. If you wish to check out the code for each month’s data, don’t hesitate to check out the notebook for this article_

The first step is to separate the data into months:

liquor_GS_df['date_range_start'] = pd.to_datetime(liquor_GS_df['date_range_start'], utc=True)
liquor_GS_df['date_range_end'] = pd.to_datetime(liquor_GS_df['date_range_end'], utc=True)
liquor_GS_df['month_start'] = liquor_GS_df['date_range_start'].dt.month
liquor_GS_df['month_end'] = liquor_GS_df['date_range_end'].dt.month

Now that there is a month value for us to use for the separation process, the task is as simple as:

L_GS_df_1 = liquor_GS_df[liquor_GS_df['month_start'] == 1]
L_GS_df_2 = liquor_GS_df[liquor_GS_df['month_start'] == 2]
L_GS_df_3 = liquor_GS_df[liquor_GS_df['month_start'] == 3]
L_GS_df_4 = liquor_GS_df[liquor_GS_df['month_start'] == 4]
L_GS_df_5 = liquor_GS_df[liquor_GS_df['month_start'] == 5]
L_GS_df_6 = liquor_GS_df[liquor_GS_df['month_start'] == 6]
L_GS_df_7 = liquor_GS_df[liquor_GS_df['month_start'] == 7]
L_GS_df_8 = liquor_GS_df[liquor_GS_df['month_start'] == 8]
L_GS_df_9 = liquor_GS_df[liquor_GS_df['month_start'] == 9]
L_GS_df_10 = liquor_GS_df[liquor_GS_df['month_start'] == 10]
L_GS_df_11 = liquor_GS_df[liquor_GS_df['month_start'] == 11]
L_GS_df_12 = liquor_GS_df[liquor_GS_df['month_start'] == 12]
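As an aside, the twelve assignments above can be written more compactly as a dictionary comprehension keyed by month; the tiny frame below is a stand-in for liquor_GS_df:

```python
import pandas as pd

# Stand-in frame with the same month_start column the article builds.
liquor_GS_df = pd.DataFrame({"month_start": [1, 1, 2, 12],
                             "raw_visit_counts": [3, 4, 5, 6]})

# One frame per month, all in a single dict.
monthly = {m: liquor_GS_df[liquor_GS_df["month_start"] == m]
           for m in range(1, 13)}

L_GS_df_1 = monthly[1]  # equivalent to the article's January frame
print(len(L_GS_df_1))
```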

Now that the data has been properly formatted, let’s use the same visitor and popularity metrics to see which stores were popular in January:

l_avg_df_1 = L_GS_df_1[['BUS_PROF_NAME','FULL_ADDRESS','LICENSES','ZIP','NEIGHBORHOOD','date_range_start','date_range_end','raw_visit_counts','raw_visitor_counts','visits_by_day']]
l_avg_df_1['visits_by_day'] = l_avg_df_1['visits_by_day'].transform(lambda x: literal_eval(x))
l_avg_df_1['Monthly_avg_visits'] = l_avg_df_1['visits_by_day'].transform(lambda x: sum(x)/len(x))
max_avg = l_avg_df_1[l_avg_df_1['Monthly_avg_visits'] == max(l_avg_df_1['Monthly_avg_visits'])]
min_avg = l_avg_df_1[l_avg_df_1['Monthly_avg_visits'] == min(l_avg_df_1['Monthly_avg_visits'])]
print(max_avg)
print(min_avg)

An interesting point to note here is that the average numbers of visitors to the most and least popular liquor stores are lower when viewing just the month of January. This could be an indicator that the city of Denver consumes less liquor in January than in the rest of the year. Let’s analyze the data further.

We can see that the average number of visits to liquor stores in Denver is lower in the month of January than it is for the entire year’s data (~1.5 vs. ~5)
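The comparison boils down to taking the mean of the January slice against the mean over the full year's records; with stand-in numbers in place of the real averages:

```python
import pandas as pd

# Stand-in averages: the full year vs. only the January records.
full_year = pd.Series([5.0, 4.0, 6.0, 1.0, 2.0])
january = pd.Series([1.0, 2.0])

print(january.mean(), full_year.mean())
```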

Now let’s perform the same analysis using the popularity metrics and see if the same trend follows:

l_pop_df_1 = L_GS_df_1[['BUS_PROF_NAME','FULL_ADDRESS','LICENSES','ZIP','NEIGHBORHOOD','date_range_start','date_range_end','popularity_by_hour','popularity_by_day']]
l_pop_df_1['popularity_by_hour'] = l_pop_df_1['popularity_by_hour'].transform(lambda x: literal_eval(x))
l_pop_df_1['Monthly_avg_pop'] = l_pop_df_1['popularity_by_hour'].transform(lambda x: sum(x)/len(x))
max_avg_pop = l_pop_df_1[l_pop_df_1['Monthly_avg_pop'] == max(l_pop_df_1['Monthly_avg_pop'])]
min_avg_pop = l_pop_df_1[l_pop_df_1['Monthly_avg_pop'] == min(l_pop_df_1['Monthly_avg_pop'])]
print(max_avg_pop)
print(min_avg_pop)

It seems that the values for the popularity metrics have decreased in the month of January as well. This further corroborates our claim that the liquor stores of Denver are not visited as often in January as during the remainder of the year.

Using this analysis as an example, here are the outputs for the remaining months:

February:

March:

April:

May:

June:

July:

August:

September:

October:

November:

December:

The monthly data seems to show a very interesting pattern. As the year heads toward the warmer seasons (spring and summer), the visit and popularity metrics of the least and most popular liquor stores in Denver show an increasing trend. Could the overall popularity of liquor stores be attributed to the weather? This would seem to contradict the fact that many holidays celebrated with alcohol (Christmas, New Year’s, Thanksgiving, etc.) fall in the colder seasons. Let us see if the correlation between season and liquor store popularity truly stands with a seasonal analysis of the data.

Seasonal Analysis of Liquor Store Data: are Denver Liquor Stores More Popular in the Warmer Seasons?

To simplify this problem statement a little, let’s work with just the values from SKYE LIQUOR, since the rest of the data shifted in the same direction as our most popular liquor store as the months progressed. First, let’s compile all of the data from the monthly analysis into a single dataframe:

Season_Df = pd.DataFrame(data = {
    'Monthly_avg_visits': [17.483871, 21.464286, 19.387097, 24.666667, 24.354839, 23.066667, 21.258065, 19.225806, 21.2, 18.225806, 17.933333, 20.580645],
    'Monthly_avg_pop': [94.541667, 118.208333, 77.958333, 110.0, 101.875, 102.541667, 89.541667, 77.25, 96.416667, 83.041667, 84.041667, 100.0],
    'Season': ['Winter','Winter','Spring','Spring','Spring','Summer','Summer','Summer','Fall','Fall','Fall','Winter']})
Season_Df.head()

Now that the data is compiled, let’s visualize it, first using a scatter plot of popularity against visits for each of the seasons:
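A sketch of that seasonal scatter plot, using a few stand-in rows in place of the full Season_Df and one color per season:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Three stand-in rows; the real Season_Df has one row per month.
Season_Df = pd.DataFrame({
    "Monthly_avg_visits": [17.5, 24.7, 23.1],
    "Monthly_avg_pop": [94.5, 110.0, 102.5],
    "Season": ["Winter", "Spring", "Summer"],
})
colors = {"Winter": "tab:blue", "Spring": "tab:green",
          "Summer": "tab:red", "Fall": "tab:orange"}

fig, ax = plt.subplots()
for season, grp in Season_Df.groupby("Season"):
    ax.scatter(grp["Monthly_avg_visits"], grp["Monthly_avg_pop"],
               color=colors[season], label=season)
ax.set_xlabel("Average monthly visits")
ax.set_ylabel("Average monthly popularity")
ax.legend()
```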

We can see a general trend here: as visits increase, the popularity of the store increases as well. Looking deeper, we can see that the winter and fall months show the lowest visit numbers overall, yet one particular winter month holds the largest popularity value. That month maps back to January. Could this be a spike caused by alcohol consumption around New Year’s? Let us further analyze the data through a bar chart.

Now we can clearly see that the summer and spring months are when SKYE LIQUOR receives most of its visits. However, this may be too small a sample size to determine that this is the case for all Denver liquor stores. Let’s attempt the same analysis on all of the Denver liquor data:

def season_func(month):
    if(month == 1 or month ==2 or month == 12):
        return 'Winter'
    elif(month == 3 or month == 4 or month == 5):
        return 'Spring'
    elif(month == 6 or month == 7 or month == 8):
        return 'Summer'
    elif(month == 9 or month == 10 or month == 11):
        return 'Fall'
liquor_GS_df['Season'] = liquor_GS_df['month_start'].transform(lambda x: season_func(x))
liquor_GS_df['popularity_by_hour'] = liquor_GS_df['popularity_by_hour'].transform(lambda x: literal_eval(x))
liquor_GS_df['Avg_pop'] = liquor_GS_df['popularity_by_hour'].transform(lambda x: sum(x)/len(x))
liquor_GS_df['visits_by_day'] = liquor_GS_df['visits_by_day'].transform(lambda x: literal_eval(x))
liquor_GS_df['Avg_visits'] = liquor_GS_df['visits_by_day'].transform(lambda x: sum(x)/len(x))

Now let’s visualize the data using a scatterplot of popularity against visits:

The trend is now even more obvious: popularity rises with visits in a clear linear trend. The records associated with the spring and summer months tend to sit higher in both popularity and visits than those from winter and fall, corroborating our theory that the city of Denver visits liquor stores more often in the warmer months of the year.

Liquor Data Conclusion

From this analysis, we are able to see several trends. In terms of visitors, SKYE LIQUOR has the most visitors in the months of April, May, June, February, and July. This is understandable, since these months contain major holidays such as Memorial Day, Father’s Day, Valentine’s Day, and the Fourth of July. It is, however, a little surprising and not what I expected. The major holidays most people would associate with alcohol consumption are Saint Patrick’s Day (March), Thanksgiving (November), and Christmas (December), so not seeing any of those months in the top five is very surprising. I believe this is because those holidays were not enough to offset the overall average. The trend instead shows that as the year progresses into the spring and summer months, alcohol consumption increases.

When looking at the popularity of the Denver liquor store data, SKYE LIQUOR tends to be the most popular store for every month except December, when the most popular liquor store was Station Liquor. This change may have been due to holiday-season visitors to Station Liquor overtaking SKYE LIQUOR’s popularity at that time. The trend in liquor store popularity seems to follow the trend in average monthly visits: as the year progresses and the spring and summer months arrive, the overall popularity of liquor stores in Denver increases.

We will address the Denver dispensary data using these same methods in the next article. The results of that analysis are quite different from this one, and the juxtaposition between the results of the two studies is actually quite interesting.

Questions?

I invite you to ask them in the #safegraphdata channel of the SafeGraph Community, a free Slack community for data enthusiasts. Receive support, share your work, or connect with others in the GIS community. Through the SafeGraph Community, academics have free access to data on over 7 million businesses in the USA, UK, and Canada.

