Mapping Avocado Prices in Python

Using GeoPandas, GeoPy and Matplotlib to produce a beautiful choropleth map.

Craig Dickson
Towards Data Science


Photo by Peter de Vink from Pexels

I love avocados. They’re tasty and healthy and I enjoy eating them. I know that makes me a basic, clichéd millennial, but the heart wants what the heart wants! When I came across the Avocado Prices dataset posted by Justin Kiggins on Kaggle, I immediately wanted to have a look at it and play with it. The dataset contains information about avocado sales in the United States between 2015 and 2018, compiled by the Hass Avocado Board.

There’s a lot of good information contained in this dataset, and I could have chosen a few different things to explore (and perhaps I still will later on). I’d seen some cool choropleth maps on twitter and wanted to try my hand at that. I saw that each price observation was tied to a particular region, and I decided to see if I could get that information from text form into map form somehow.

Spoiler alert: I could!

In this post I’ll take you step by step through how I did it, and you can find my complete Jupyter Notebook containing the code here. This was my first attempt at making this kind of map using Python, so I welcome all feedback and ideas on how I could have made this process easier on myself!

Importing and Streamlining the Data

Let’s get into it! The first thing I did was import my libraries and import the data itself. Using pd.read_csv() creates a pandas DataFrame directly from our CSV file.

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
data = pd.read_csv('avocado.csv', index_col = 'Date')

This gives us a nice DataFrame.

But that’s too much information for us just now. I want to make a map including price and region data (and perhaps later do something with volume and year), so let’s drop a lot of these columns.

columns_to_drop = ['Unnamed: 0', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type']
avo_df = data.drop(columns_to_drop, axis=1)
display(avo_df.head())

Nice! That’s more streamlined. Doing some exploratory data analysis gave me an idea of the dataset I was looking at: the maximum price for an avocado in the dataset was $3.25(!), while the cheapest was $0.44, barely a seventh of the maximum. That’s some variation! The median price for a single avocado in our dataset was $1.37, the mean was $1.41 and the standard deviation was $0.40. This gives an idea of the distribution of prices in the data.
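Those summary statistics come straight from pandas. Here’s a minimal sketch using describe() on a made-up handful of prices (illustrative values only, not the real dataset):

```python
import pandas as pd

# A made-up handful of prices standing in for the real AveragePrice
# column; these values are illustrative, not the actual Kaggle data
prices = pd.Series([0.44, 1.10, 1.37, 1.55, 3.25], name='AveragePrice')

# describe() returns count, mean, std, min, the quartiles and max in one go
stats = prices.describe()
print(stats['min'], stats['50%'], stats['max'])  # -> 0.44 1.37 3.25
```

With the real data, `avo_df['AveragePrice'].describe()` gives you all of these numbers in one call.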

It’s always helpful to do some EDA visually. Here is a histogram of the AveragePrice column to give us a clearer sense of the overall distribution.

It doesn’t have to be beautiful yet, don’t worry.
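If you want to reproduce that kind of plot, a histogram is a one-liner on the column. The sketch below uses synthetic prices so it runs standalone; with the real data you’d call `avo_df['AveragePrice'].hist()` directly:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for avo_df['AveragePrice']
prices = pd.Series([0.44, 0.98, 1.10, 1.20, 1.37, 1.40, 1.55, 1.80, 2.20, 3.25])

counts, bins, _ = plt.hist(prices, bins=5)
plt.xlabel('AveragePrice ($)')
plt.ylabel('Number of observations')
plt.title('Distribution of avocado prices')
plt.savefig('price_hist.png')
```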

To make the map that I want to make (a choropleth showing the price of an avocado in each region, to clearly show regional variations in price), I want to collapse the many different price measurements into one value for each region. It could be interesting to explore differences year-on-year, or by volume (who buys the most avocados?), but to answer the question I’ve set myself, I want to squash this all down.

There are a few options here, but I think the clearest and most honest way to do this is to take a simple mean of all AveragePrice data, grouped by region. How do we do that?

First, we need to identify all the regions.

regions = avo_df.region.unique()

This gives us a list of all the regions contained in the dataset:

['Albany' 'Atlanta' 'BaltimoreWashington' 'Boise' 'Boston'
'BuffaloRochester' 'California' 'Charlotte' 'Chicago' 'CincinnatiDayton' 'Columbus' 'DallasFtWorth' 'Denver' 'Detroit' 'GrandRapids' 'GreatLakes' 'HarrisburgScranton' 'HartfordSpringfield' 'Houston' 'Indianapolis' 'Jacksonville' 'LasVegas' 'LosAngeles' 'Louisville' 'MiamiFtLauderdale' 'Midsouth' 'Nashville' 'NewOrleansMobile' 'NewYork' 'Northeast' 'NorthernNewEngland' 'Orlando' 'Philadelphia' 'PhoenixTucson' 'Pittsburgh' 'Plains' 'Portland' 'RaleighGreensboro' 'RichmondNorfolk' 'Roanoke' 'Sacramento' 'SanDiego' 'SanFrancisco' 'Seattle' 'SouthCarolina' 'SouthCentral' 'Southeast' 'Spokane' 'StLouis' 'Syracuse' 'Tampa' 'TotalUS' 'West' 'WestTexNewMexico']

Now we want to group our DataFrame to get down to one AveragePrice value per region:

group_by_region = avo_df.groupby(by=['region'])
avo_df_avg = group_by_region.mean()
avo_df_avg = avo_df_avg.drop(['year'], axis=1)

That should do the trick.

Geocoding with GeoPy

Now we have one price value per region, but each region is stored as text, e.g. ‘Albany’, ‘SanDiego’, ‘Roanoke’ etc. Matplotlib is smart, but not yet smart enough to plot those on a map just based on the text. We need to find a way to get geographical data (specifically, co-ordinates) from those region names.

Well, we are not the first people to want to do this! We can use the geocode feature of GeoPy to do exactly that. After playing with this for a while I went with Bing Maps (it was free to get an API Key, and I could get it to work!), but there are other geocoding services available, so feel free to try it with those.

from geopy.geocoders import Bing
from geopy.extra.rate_limiter import RateLimiter
geolocator = Bing(api_key='GetYourOwnAPIKeyYouScallywag', timeout=30)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=2)
regions_dict = {i: geocode(i) for i in regions}  # use the rate-limited wrapper, not geolocator.geocode directly

I found using the RateLimiter useful to avoid overloading the API with too many requests, but it significantly increases the time the requests take to run and might not be necessary for you, so bear that in mind.

This returns a dictionary full of information, from which we just want the co-ordinates. We can convert this to a DataFrame ready to join to our existing DataFrame using the following steps:

regions_df = pd.DataFrame(regions_dict)
# row 0 of this frame holds the address string, row 1 the (latitude, longitude) pair
regions_df_melted = regions_df.iloc[1:2, :].melt()
regions_df_melted.columns = ['region', 'co-ordinates']

Here is the head of the DataFrame which results.

Now we have the co-ordinates for each region in our dataset, we want to join this to our existing DataFrame so we have one DataFrame which contains the average price and the co-ordinates for each region. Happily, pandas makes this extremely easy using its merge function. Thanks, pandas!

df = pd.merge(avo_df_avg, regions_df_melted, left_on='region', right_on='region')

Our next step is to bring this all together and create a GeoPandas GeoDataFrame which contains our average price for each region, plus the geographic information we need to plot each area on our map. To do that we first need to create a column for latitude and one for longitude (GeoPandas, very reasonably, wants to have that information to create its ‘geometry’ column, which is its way of storing the geodata).

df[['latitude', 'longitude']] = pd.DataFrame(df['co-ordinates'].tolist(), index=df.index)
avo_gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.longitude, df.latitude))

Great! Now we’ve got a GeoDataFrame with all our data. Let’s take a quick look to see how it looks!

avo_gdf.plot()
Hmmmm, I’m not sure about this one yet.

OK — I’ll admit this doesn’t look too inspiring yet, but I promise you, we’re getting there! If you squint you can see that the majority of the data does kind of look like it could be in the United States. We have one outlier way down in the bottom right, which will be explained and dealt with shortly.

But for now, onwards!

Using Shapely to display a map, as a basis for presenting our data of interest

As we saw above, just getting our GeoDataFrame did not lead to the immediate production of the beautiful choropleth of our dreams. We need to give Python (and Matplotlib) instructions on how we want to present that information. After all, we’re the Data Analysts, we’re making the decisions on how to present our data and there’s a huge range of ways we might choose to do so. This flexibility means we still have some work to do!

So, we want to superimpose our GeoDataFrame of avocado prices onto a map of the continental United States. How can we do that? There is a very useful Python library called shapely that can help us here. mattymecks has a super helpful post about all of this stuff here — I strongly recommend giving the whole post a read if you’re trying to get your head around this.

A shapefile stores a shape as a polygon, which is simply a series of co-ordinates, each connected to the next. So a triangle would be stored as three points identified by their co-ordinates, a square as four, and more complex shapes by thousands of points.
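You can see this point-list idea directly in Shapely (the same library we import in a moment); a quick sketch:

```python
from shapely.geometry import Polygon

# A triangle is just three co-ordinate pairs joined in order
triangle = Polygon([(0, 0), (4, 0), (0, 3)])

print(triangle.area)  # 6.0 (base 4 * height 3 / 2)
# Shapely repeats the first point at the end to close the ring
print(len(list(triangle.exterior.coords)))  # 4
```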

Anyway, for our purposes you just need to know that shapefiles exist and can provide us the map we are seeking. But where to find the best ones? Of course the best way is with some Furious Googling™, but I can perhaps save you some pain and point you to the TIGER/Line Shapefiles offered by the US Census Bureau. If those guys don’t know what the shapes of the US States, Territories and Dominions look like, then I don’t know who would!

Download those files, unzip them and keep all of them in your active folder. Then import Point and Polygon from Shapely and you’re ready to rock!

from shapely.geometry import Point, Polygon

usa = gpd.read_file('tl_2017_us_state.shp')

OK, let’s take a look at that bad boy.

usa.plot()
Alright — not perfect, but we’re getting there.

We can see here the foundations of what we need. Now why does this map have so much white space? I thought the USA was just in the Western Hemisphere? Well, without getting into too much geopolitical detail here, let’s just say there is some territory which belongs to the United States of America that is in other geographical locations than would first come to mind.

Let’s take a closer look at our Shape data here.

print(usa.shape)
print(usa.NAME)

(56, 15)
0 West Virginia
1 Florida
2 Illinois
3 Minnesota
4 Maryland
5 Rhode Island
6 Idaho
7 New Hampshire
8 North Carolina
9 Vermont
10 Connecticut
11 Delaware
12 New Mexico
13 California
14 New Jersey
15 Wisconsin
16 Oregon
17 Nebraska
18 Pennsylvania
19 Washington
20 Louisiana
21 Georgia
22 Alabama
23 Utah
24 Ohio
25 Texas
26 Colorado
27 South Carolina
28 Oklahoma
29 Tennessee
30 Wyoming
31 Hawaii
32 North Dakota
33 Kentucky
34 United States Virgin Islands
35 Commonwealth of the Northern Mariana Islands
36 Guam
37 Maine
38 New York
39 Nevada
40 Alaska
41 American Samoa
42 Michigan
43 Arkansas
44 Mississippi
45 Missouri
46 Montana
47 Kansas
48 Indiana
49 Puerto Rico
50 South Dakota
51 Massachusetts
52 Virginia
53 District of Columbia
54 Iowa
55 Arizona
Name: NAME, dtype: object

We already confirmed that our avocado price data is confined to the continental United States (except for that one weird outlier, which I promise we’re getting to), so we can get rid of some of these other entries.

to_drop = ['Commonwealth of the Northern Mariana Islands', 'United States Virgin Islands', 'Hawaii', 'Alaska', 'Guam', 'Puerto Rico', 'American Samoa']

for index, row in usa.iterrows():
    if row['NAME'] in to_drop:
        usa.drop(index, inplace=True)

usa.plot()
That’s more like it!

Now we have the shapefiles for the US ready to form the basis of our map, and we have our price data geographically located and ready to go. But there were a few weird things going on, specifically with that outlier on our map a few steps back. Now it’s time to investigate that and do some sanity checks on our data.

Cleaning our data — final steps

If we display our whole DataFrame (it’s too big to do here, but if you’re following along at home, I’d encourage you to do it), we can see that some of our co-ordinates are NaNs (e.g. entry 25, Midsouth), some are suspiciously round (e.g. entry 29, Northeast), and some are way off (e.g. entry 52, West, which was our outlier on the map!). What’s happening here is that our original CSV file from the Hass Avocado Board (by way of Justin Kiggins) contains summary data for some over-arching meta-regions (such as Midsouth, Northeast and West).

When we used our geolocator (in our case Bing Maps, but this problem would have happened whichever service we used), the service tried its best to provide an accurate location. In some cases (like Midsouth) it wasn’t able to, so we ended up with NaN. In others (like Northeast) it found something very general and went with that. In yet others (such as West) it was able to locate somewhere with that name, but not the place we were looking for (in this case it returned the co-ordinates for Western, Zambia). This is why it’s important to check your data where possible by plotting it in different ways and investigating things that look suspicious.
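A quick way to catch these cases programmatically is to flag missing co-ordinates and anything outside a rough continental-US bounding box (the box limits here are my own ballpark figures, and the frame below is a tiny hand-made stand-in, not the geocoder’s actual output):

```python
import pandas as pd

# Hand-made stand-in for the merged frame; co-ordinates are (lat, lon)
# and these values are illustrative, not real geocoder output
df = pd.DataFrame({
    'region': ['Albany', 'Midsouth', 'West'],
    'co-ordinates': [(42.65, -73.75), None, (-14.65, 24.40)],
})

# Geocoder failures show up as missing values
missing = df[df['co-ordinates'].isna()]['region'].tolist()

# Anything outside a rough continental-US bounding box is suspect
def outside_conus(coords):
    if coords is None:
        return False
    lat, lon = coords
    return not (24 <= lat <= 50 and -125 <= lon <= -66)

suspect = df[df['co-ordinates'].apply(outside_conus)]['region'].tolist()
print(missing, suspect)  # ['Midsouth'] ['West']
```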

For our purposes, we don’t need this meta-region data, so we can just get rid of it en masse.

to_drop = ['Midsouth', 'Northeast', 'Plains', 'SouthCentral', 'Southeast', 'TotalUS', 'West']

for index, row in df.iterrows():
    if row['region'] in to_drop:
        df.drop(index, inplace=True)

Bringing it all together with Matplotlib

Now that we’ve improved df, our DataFrame, and we have our shapefile saved as usa, let’s put these together using the same CRS — Co-ordinate Reference System. This will ensure that our GeoDataFrames use the same reference system, so as long as the co-ordinates for each of our regions are accurate, they will be accurately represented on the US map. Here I used epsg:4326, as that’s what the Shapefile was already using.

crs = 'EPSG:4326'  # the modern string form; the older {'init': 'epsg:4326'} dict is deprecated
avo_gdf = gpd.GeoDataFrame(df, crs=crs)
usa_gdf = gpd.GeoDataFrame(usa, crs=crs)

This is it! We’re ready to go!

With the help of a few nice features from Matplotlib we can make this very beautiful indeed.

from mpl_toolkits.axes_grid1 import make_axes_locatable 
# This is a function to allow us to make the legend pretty
fig, ax = plt.subplots(figsize = (20,16))
plt.title('Avocado Prices by region in the United States', fontsize=26, fontfamily='serif')
#this part makes the legend the same size as our map, for prettiness
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1)
usa.boundary.plot(ax=ax, color='DarkSlateBlue')
# using the boundary method here to get the outlines of the states, an aesthetic decision
avo_gdf.plot(cmap='Greens', column='AveragePrice', legend=True, ax=ax, markersize=2000, alpha=0.7, cax=cax)
# this saves a copy of the viz as a jpg so we can easily share it with our friends on twitter!
plt.savefig('Avocado Prices.jpg', format='jpg')
Et voilà!

Now we have made real the beautiful choropleth that until now existed only in our dreams! Congratulations to us! It looks like we want to avoid the built-up Northeastern USA and San Francisco, and instead head down to Ohio or Texas to live that cheap avocado dream.

I hope this helped you to see how we can use some powerful tools in GeoPandas, GeoPy and Matplotlib to take data in text and number form, use an API to find the co-ordinates, and then create an appealing and clear visualisation to show people at a glance the information we want to highlight.

Thanks very much for taking the time to come with me on this journey. I always welcome your feedback — please get in touch with me on Twitter at @craigdoesdata and let me know how I could have done this more effectively! I am very much still learning and I want to learn your secrets too, so please share them with me.

Again, the whole Jupyter Notebook and associated files are available on my Github page, and I encourage you to play around with them.

Until next time!

More like this? visit craigdoesdata.de
