
Recently, in my data science program, we were given a project where we needed to create a model to predict house prices. The training data set we received included property data such as square footage, bathroom count, and construction grade. The data I was most excited to begin my analysis with was the location data. We were given columns containing latitude and longitude coordinates, and the first thing I wanted to do was create a heat map showing the distribution of property prices.
Let’s take a peek at our data:
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon
df = pd.read_csv('kc_house_data_train.csv')
df = df[['id', 'price', 'sqft_living', 'waterfront', 'zipcode', 'long', 'lat']]
df.head()

Shapefiles
Before we can plot these coordinates, we need a ‘shapefile’ to plot them on top of. If you’re like me and new to plotting geospatial data, shapefiles may be a very foreign topic. Luckily for you, I found this great description:
Basically, coding languages like Python, using GeoPandas, can read shapefiles and turn them into maps that you can plot on. Much to my surprise, shapefiles are readily available from many open data sources online. The main file has the ‘.shp’ extension (the shape format/geometry file), but it also depends on the associated ‘.shx’ (shape index format) and ‘.dbf’ (attribute format) files to function properly. So, in order to use the ‘.shp’ file in GeoPandas, you must also save the other two mandatory files in the same directory. Let’s take a look at our shapefile in Python.
kings_county_map = gpd.read_file('kc_tract_10.shp')
kings_county_map.plot()
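As a side note on the companion files mentioned above, here is a minimal sketch of a check you could run before calling `gpd.read_file`. The helper name `missing_sidecars` is hypothetical (it is not part of GeoPandas), and the check is case-sensitive, so it assumes lowercase extensions.

```python
from pathlib import Path

# Companion extensions that must sit next to the .shp file.
REQUIRED_SIDECARS = (".shx", ".dbf")


def missing_sidecars(shp_path):
    """Return the names of required companion files that are absent."""
    shp = Path(shp_path)
    return [
        shp.with_suffix(ext).name
        for ext in REQUIRED_SIDECARS
        if not shp.with_suffix(ext).exists()
    ]


# 'kc_tract_10.shp' is the file name used in this post.
missing = missing_sidecars("kc_tract_10.shp")
if missing:
    print("Missing companion files:", ", ".join(missing))
else:
    print("All required companion files are present.")
```

If the list comes back non-empty, copying the named files into the same folder as the ‘.shp’ file is usually the fix.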

Now we have the outline of our dataset’s location (King County, WA). You might notice that our axes are not showing latitude and longitude coordinates. This can be adjusted by changing the Coordinate Reference System, or CRS. The most commonly used CRS is WGS84 (EPSG:4326), which expresses locations as latitude and longitude.
kings_county_map.to_crs(epsg=4326).plot()
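To make the effect of `to_crs` more concrete, here is a small sketch on a single made-up point rather than the real shapefile. The coordinates are illustrative, and Web Mercator (EPSG:3857) is used only as an example of a projected CRS; I am not assuming it is the CRS the King County file ships with.

```python
import geopandas as gpd
from shapely.geometry import Point

# One made-up point near downtown Seattle, given as (longitude, latitude).
pts = gpd.GeoDataFrame(
    {"name": ["sample"]},
    geometry=[Point(-122.33, 47.61)],
    crs="EPSG:4326",
)

# Reproject to Web Mercator; the coordinates become large values in metres.
pts_mercator = pts.to_crs(epsg=3857)

# Reproject back; to_crs returns a new object and leaves `pts` unchanged.
pts_back = pts_mercator.to_crs(epsg=4326)

print(pts.crs.to_epsg(), pts_mercator.crs.to_epsg())
print(pts_mercator.geometry.iloc[0])
print(pts_back.geometry.iloc[0])
```

Checking `.crs` on both the map and the point data before plotting is a quick way to confirm they will line up on the same axes.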

GeoPandas Dataframe
Now that we have our shapefile mapped to the proper coordinates, we can start to format our real estate data. In order to see where the properties in our dataframe are located, we first need to reformat our data into a ‘GeoPandas Dataframe.’
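# The older {'init': 'EPSG:4326'} dict form is deprecated in recent GeoPandas/pyproj releases
crs = 'EPSG:4326'
# Point takes (x, y), so longitude comes first and latitude second
geometry = [Point(xy) for xy in zip(df['long'], df['lat'])]
geo_df = gpd.GeoDataFrame(df,
                          crs = crs,
                          geometry = geometry)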
To match the shapefile, we need to specify the same CRS. Next, we use Shapely to turn our latitude and longitude data into geometric points. Finally, passing our original dataframe with our ‘crs’ and ‘geometry’ variables into the GeoDataFrame constructor creates ‘geo_df’, a copy of our original dataframe with a newly created ‘geometry’ column.
geo_df.head()
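As an alternative to building the list of Shapely points by hand, GeoPandas also provides `gpd.points_from_xy`. The sketch below uses two made-up rows with the same column names as our dataframe; the values are purely illustrative.

```python
import pandas as pd
import geopandas as gpd

# Two made-up rows using the same column names as the post's dataframe.
sample = pd.DataFrame({
    "id": [1, 2],
    "price": [450000, 1200000],
    "long": [-122.33, -122.20],
    "lat": [47.61, 47.62],
})

# points_from_xy takes x (longitude) first, then y (latitude).
sample_geo = gpd.GeoDataFrame(
    sample,
    geometry=gpd.points_from_xy(sample["long"], sample["lat"]),
    crs="EPSG:4326",
)

print(sample_geo[["id", "geometry"]])
```

Either approach should give the same ‘geometry’ column; the main thing to watch is that longitude goes in as x and latitude as y.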

Visualization
Now that both our shapefile and GeoPandas dataframe are properly formatted we can begin the fun part of visualizing our real estate data! Let’s plot the location of our property data on top of our shapefile map. By utilizing Matplotlib’s subplots we can plot both of our data sets on the same axis and see where the properties are located.
fig, ax = plt.subplots(figsize = (10,10))
kings_county_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
geo_df.plot(ax=ax)
ax.set_title('King County Real Estate')

We can see that the majority of King County properties are located in the western part of the county. However, it’s hard to tell whether there are more houses in the north or the south because the points are plotted so densely.
fig, ax = plt.subplots(figsize = (10,10))
kings_county_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
geo_df.plot(ax=ax, alpha = .1 )
ax.set_title('King County Real Estate')
plt.savefig('Property Map')

By reducing the alpha in our GeoPandas plot, we can now see that there are more houses in the northwestern part of the county, as the blue points are a lot denser there than in the south.
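If you would rather back up the visual impression with numbers, one option (not what we did for the project) is to count points in a coarse grid with NumPy’s `histogram2d`. The coordinates below are randomly generated stand-ins, not the real King County data.

```python
import numpy as np

# Made-up coordinates: 300 points clustered in the "north", 100 in the "south".
rng = np.random.default_rng(0)
lat = np.concatenate([rng.normal(47.65, 0.03, 300),
                      rng.normal(47.35, 0.03, 100)])
lon = np.concatenate([rng.normal(-122.3, 0.05, 300),
                      rng.normal(-122.3, 0.05, 100)])

# Count points in a 2 x 2 grid. Rows of `counts` follow the latitude bins
# (lowest first) and columns follow the longitude bins.
counts, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=2)

south_total = counts[0].sum()  # lower latitude bin
north_total = counts[1].sum()  # upper latitude bin
print("north:", north_total, "south:", south_total)
```

With the real data you would pass `df['lat']` and `df['long']` instead, and could raise `bins` for a finer grid.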
Now this is great and all: we can see the locations of the properties, but what is the top factor when choosing the right house? Price. We can use this same plot but specify the price column in our dataframe to create a heatmap! (Note: we log-transformed our price data to reduce how much outliers were skewing our data.)
geo_df['price_log'] = np.log(geo_df['price'])
fig, ax = plt.subplots(figsize = (10,10))
kings_county_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
geo_df.plot(column = 'price_log', ax=ax, cmap = 'rainbow',
            legend = True, legend_kwds={'shrink': 0.3},
            markersize = 10)
ax.set_title('King County Price Heatmap')
plt.savefig('Heat Map')
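As an aside on the log transform used for this plot, here is a small sketch of why it helps the color scale. The prices are made up, and `typical_span_share` is a hypothetical helper that measures how much of the full value range (and therefore the colormap) is left for everything except the single largest value.

```python
import numpy as np

# Made-up prices: three typical homes and one very expensive outlier.
prices = np.array([300_000, 450_000, 600_000, 7_500_000], dtype=float)

# np.log needs strictly positive values, which sale prices are.
log_prices = np.log(prices)


def typical_span_share(values):
    """Fraction of the min-to-max range covered by all but the largest value."""
    ordered = np.sort(values)
    return (ordered[-2] - ordered[0]) / (ordered[-1] - ordered[0])


print(round(typical_span_share(prices), 2))
print(round(typical_span_share(log_prices), 2))
```

On the raw scale the three typical homes are squeezed into a few percent of the range, so they would all get nearly the same color; after the log transform they spread over roughly a fifth of it.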

Finally, we can easily see where the higher-priced homes are located in King County! This makes sense, as the circle of red in the northwest encompasses Seattle and Medina (home to both Bill Gates and Jeff Bezos). If we wanted to see the same plot for the square footage of the properties, we can:
geo_df['sqft_log'] = np.log(geo_df['sqft_living'])
fig, ax = plt.subplots(figsize = (10,10))
kings_county_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
geo_df.plot(column = 'sqft_log', ax=ax, cmap = 'winter',
            legend = True, legend_kwds={'shrink': 0.3},
            alpha = .5)
ax.set_title('Sqft Heatmap')

We can now see that houses tend to have more square feet east of longitude -122.3, as we move farther away from Seattle.
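A reading like this one can be sanity-checked with a quick group-by. The sketch below uses six made-up rows with our column names, so the numbers are illustrative only; the -122.3 cutoff comes from the observation above.

```python
import pandas as pd

# Made-up rows using the post's column names.
sample = pd.DataFrame({
    "long": [-122.38, -122.35, -122.31, -122.25, -122.10, -121.95],
    "sqft_living": [1100, 1400, 1250, 2300, 2600, 2900],
})

# Label each home by which side of longitude -122.3 it falls on.
side = sample["long"].gt(-122.3).map({True: "east", False: "west"})

# Compare the median living area on each side of the cutoff.
median_sqft = sample.groupby(side)["sqft_living"].median()
print(median_sqft)
```

Running the same comparison on `geo_df` would show whether the pattern in the map holds up in the actual numbers.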
Conclusion
I hope this quick tutorial was helpful in showing how to use GeoPandas and why it is such a powerful tool. Using GeoPandas is a great way to visualize geospatial data during any exploratory data analysis. It provides a way to make inferences about your data set and, most importantly, it brings data to life for everyone to see.
Sources
- https://geopandas.org/index.html
- https://en.wikipedia.org/wiki/Shapefile
- https://www.kingcounty.gov/services/gis/GISData.aspx
- https://onlinestatbook.com/2/transformations/log.html
- https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
- https://www.cnbc.com/2019/06/25/medina-wash-home-to-jeff-bezos-and-bill-gates-running-out-of-money.html