Creating Choropleth Maps with Python’s Folium Library

How to make choropleths with different data structures in Python

Alex Mitrani
Towards Data Science


A choropleth of available rental apartments in NYC, April 2019 (GitHub)

Choropleth maps are used to show the variations in data over geographic regions (Population Education). I’ve used choropleths to show the number of available rental apartments across ZIP Codes in New York City and to show the number of mortgage transactions per ZIP Code over a given period. Python’s Folium library enables users to build multiple kinds of custom maps, including choropleths, which you can share as .html files with external users who do not know how to code.

Loading and Reviewing Geographic Data

U.S. government websites often have the geographic data files necessary to create maps. NYC’s OpenData site and the U.S. Census Bureau’s site have geographic boundary files available in multiple datatypes. Python allows you to load multiple filetypes, including GeoJSON (.geojson) files and shapefiles (.shp). These files contain the spatial boundaries of a given location.

Folium’s documentation for the folium.Choropleth() method states that the geo_data parameter accepts a “URL, file path, or data (json, dict, geopandas, etc) to your GeoJSON geometries” (Folium documentation). However we load the file, the geometry data must end up in a form this method can read. The key_on parameter of this method binds each location’s geometry (the GeoJSON data) to the values we want to plot for that location (e.g. population).

GeoJSON

GeoJSON files store geometric shapes, in this case the boundaries of locations, along with their associated attributes. For instance, the code to load the GeoJSON file with the boundaries of NYC ZIP Codes (referenced above) is as follows:

# Code to open a .geojson file and store its contents in a variable
import json

with open('nyczipcodetabulationareas.geojson', 'r') as jsonFile:
    nycmapdata = json.load(jsonFile)

The variable nycmapdata contains a dictionary with at least two keys. One of these keys is called features, and it holds a list of dictionaries in which each dictionary represents a location. An excerpt of the main GeoJSON structure with the first location is below:

{'type': 'FeatureCollection',
'features': [{'type': 'Feature',
'properties': {'OBJECTID': 1,
'postalCode': '11372',
'PO_NAME': 'Jackson Heights',
'STATE': 'NY',
'borough': 'Queens',
'ST_FIPS': '36',
'CTY_FIPS': '081',
'BLDGpostal': 0,
'@id': 'http://nyc.pediacities.com/Resource/PostalCode/11372',
'longitude': -73.883573184,
'latitude': 40.751662187},
'geometry': {'type': 'Polygon',
'coordinates': [[[-73.86942457284177, 40.74915687096788],
[-73.89143129977276, 40.74684466041932],
[-73.89507143240859, 40.746465470812154],
[-73.8961873786782, 40.74850942518088],
[-73.8958395418514, 40.74854687570604],
[-73.89525242774397, 40.748306609450246],
[-73.89654041085562, 40.75054199814359],
[-73.89579868613829, 40.75061972133262],
[-73.89652230661434, 40.75438879610903],
[-73.88164812188481, 40.75595161704187],
[-73.87221855882478, 40.75694324806748],
[-73.87167992356792, 40.75398717439604],
[-73.8720704651389, 40.753862007052064],
[-73.86942457284177, 40.74915687096788]]]}}, ... ]}

The key_on parameter of the folium.Choropleth() method requires users to reference the unique index key in the location dictionaries within the GeoJSON file as a string:

key_on (string, default None) — Variable in the geo_data GeoJSON file to bind the data to. Must start with ‘feature’ and be in JavaScript objection notation. Ex: ‘feature.id’ or ‘feature.properties.statename’.

In the above case the index key is the ZIP Code, so the data associated with each location must also have a ZIP Code index key or column. The key_on parameter for the above example would be the following string:

'feature.properties.postalCode'

Note: The first portion of the string must always be the singular word feature. It is not plural like the features key of the parent dictionary, which holds the list of individual location dictionaries.

The key_on parameter accesses the properties key of each specific location. The properties key itself holds a dictionary with eleven keys. In this case the postalCode key is the index value that will link the geometric shape to whatever value we wish to plot.
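One way to picture how a key_on string is read is as a dotted path walked through a single feature dictionary. The sketch below is only an illustration of that idea: resolve_key_on is a hypothetical helper written for this article, not part of Folium, and the feature is a trimmed copy of the excerpt above with the coordinates left out.

```python
from functools import reduce

# Trimmed copy of the first feature from the excerpt (coordinates omitted).
feature = {
    "type": "Feature",
    "properties": {"postalCode": "11372", "borough": "Queens"},
    "geometry": {"type": "Polygon", "coordinates": []},
}


def resolve_key_on(feature, key_on):
    """Hypothetical helper: follow a dotted key_on path into one feature."""
    parts = key_on.split(".")
    if parts[0] != "feature":
        raise ValueError("key_on must start with 'feature'")
    # Walk the remaining keys one dictionary level at a time.
    return reduce(lambda node, key: node[key], parts[1:], feature)


zip_code = resolve_key_on(feature, "feature.properties.postalCode")
print(zip_code)
```

The value this path reaches is what has to match the index column in the data being plotted, both in content and in data type.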

GeoPandas

Another way to load geographic data is to use Python’s GeoPandas library (link). This library is useful when loading shapefiles, which are provided on the U.S. Census’ website (Cartographic Boundary Files — Shapefile). GeoPandas works similarly to Pandas, only it can store and perform functions on geometric data. For instance, the code to load the shapefile with the boundaries of all U.S. states is as follows:

# Using GeoPandas
import geopandas as gpd

usmap_gdf = gpd.read_file('cb_2018_us_state_500k/cb_2018_us_state_500k.shp')

The head of the usmap_gdf dataframe

If you were to call the first row’s (Mississippi) geometry column in Jupyter Notebook you would see the following:

usmap_gdf["geometry"].iloc[0]

When a specific geometry value is called you see a geometric image instead of the string representing the boundary of the shape. The above is the geometry value for the first row (Mississippi).

Unlike the contents of the GeoJSON dictionary, there is no features key with inner dictionaries to access and there is no properties column. The key_on parameter of the folium.Choropleth() method still requires the first portion of the string to be feature. However, instead of referencing a GeoJSON’s location dictionaries, this method will be referencing columns in a GeoPandas dataframe. In this case the key_on parameter will equal “feature.properties.GEOID”, where GEOID is the column that contains the unique state codes that will bind our data to the geographic boundary. The GEOID column has leading zeros; for example, the California GEOID is 06. You may also use the STATEFP column as an index. Whichever column you choose, make sure the columns, formats, and data types are consistent between the geographic data and the data to plot.

Reviewing Population Data For A Choropleth

Geographic data and the associated data to plot can be stored as two separate variables or all together. It is important to keep track of the data types of the columns and to make sure the index (key_on) column is the same for the geographic data and the associated data for the location.

I accessed the U.S. Census API’s American Community Survey (link) and Population Estimates and Projections (link) tables to obtain population and demographic data from 2019 to 2021. The head of the dataframe is as follows:

The head of the U.S. Census dataframe

I saved the data as a .csv file. In some cases this will change the datatypes of the columns once the file is loaded again; for instance, strings could become numerical values. The datatypes when .info() is called are as follows:

The data types for the census data before and after saving and loading the data frame as a CSV file

Another important thing to note is that the leading zeros in the state column are lost after loading the data frame. This will have to be corrected; the id values must match and be the same data type (i.e. they cannot be integers in one data frame and strings in another).
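The lost leading zeros can be reproduced, and avoided, with a few rows of data. This is a minimal sketch rather than the code used for this article: the dataframe below is a made-up stand-in for the census data, and the CSV is written to an in-memory string instead of a file.

```python
import io

import pandas as pd

# Made-up stand-in for the census data; "state" holds two-digit state codes.
original_df = pd.DataFrame(
    {"state": ["01", "06", "28"], "NAME": ["Alabama", "California", "Mississippi"]}
)
csv_text = original_df.to_csv(index=False)

# Default load: pandas infers integers, so "06" comes back as 6.
inferred_df = pd.read_csv(io.StringIO(csv_text))

# Option 1: keep the column as text while loading.
kept_df = pd.read_csv(io.StringIO(csv_text), dtype={"state": str})

# Option 2: repair a column that was already loaded as integers.
repaired = inferred_df["state"].astype(str).str.zfill(2)

print(inferred_df["state"].tolist())
print(kept_df["state"].tolist())
print(repaired.tolist())
```

Passing dtype when reading avoids the problem at the source, while zfill is useful when the file has already been loaded.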

Basic Choropleth Maps Five Different Ways

As discussed above, Folium allows you to create maps using geographic datatypes, including GeoJSON and GeoPandas. These datatypes need to be formatted for use with the Folium library and it isn’t always intuitive (to me, at least) why certain errors occur. The following examples describe how to prepare both the geographic data (in this case U.S. state boundaries) and associated plotting data (the population of the states) for use with the folium.Choropleth() method.

Method 1: With Pandas and GeoJSON, without Specifying an ID Column

This method most closely resembles the documentation’s example for choropleth maps. The method uses a GeoJSON file which contains the state boundaries data and a Pandas dataframe to create the map.

As I started with a GeoPandas file I will need to convert it to a GeoJSON file using GeoPandas’ to_json() method. As a reminder the usmap_gdf GeoPandas dataframe looks like:

The head of the usmap_gdf dataframe

I then apply the .to_json() method and specify that we are dropping the id from the dataframe, if it exists:

usmap_json_no_id = usmap_gdf.to_json(drop_id=True)

Note: usmap_json_no_id is the variable holding the json string in this scenario

This method returns a string. I formatted it so it would be easier to read, and the excerpt below shows it up to the first set of coordinates:

'{"type": "FeatureCollection",
"features": [{"type": "Feature",
"properties": {"AFFGEOID": "0400000US28",
"ALAND": 121533519481,
"AWATER": 3926919758,
"GEOID": 28,
"LSAD": "00",
"NAME": "Mississippi",
"STATEFP": "28",
"STATENS": "01779790",
"STUSPS": "MS"},
"geometry": {"type": "MultiPolygon",
"coordinates": [[[[-88.502966, 30.215235]'

Note: The “properties” dictionary has no key called “id”

Now we are ready to connect the newly created JSON variable with the US Census dataframe obtained in a previous section, the head of which is below:

The head of the U.S. Census dataframe, called all_states_census_df below

Using folium’s Choropleth() method, we create the map object:

The code to create a Choropleth with a GeoJSON variable which does not specify an id

The geo_data parameter is set to the newly created usmap_json_no_id variable and the data parameter is set to the all_states_census_df dataframe. As no id was specified when creating the GeoJSON variable, the key_on parameter must reference a specific key from the geodata, navigating it like a nested dictionary (‘GEOID’ is a key inside the ‘properties’ dictionary). In this case the GEOID key holds the state code which connects the state geometric boundary data to the corresponding US Census data in the all_states_census_df dataframe. The choropleth is below:

The resulting choropleth from the above method

Method 2: With Pandas and GeoJSON, and Specifying an ID Column

This process is almost exactly the same as above except an index will be used prior to calling the .to_json() method.

The usmap_gdf dataframe did not have an index in the above example. To correct this I will set the index to the GEOID column and then immediately call the .to_json() method:

usmap_json_with_id = usmap_gdf.set_index(keys = "GEOID").to_json()

The resulting string, up until the first pair of coordinates for the first state’s data, is below:

'{"type": "FeatureCollection",
"features": [{"id": "28",
"type": "Feature",
"properties": {"AFFGEOID": "0400000US28",
"ALAND": 121533519481,
"AWATER": 3926919758,
"LSAD": "00",
"NAME": "Mississippi",
"STATEFP": "28",
"STATENS": "01779790",
"STUSPS": "MS"},
"geometry": {"type": "MultiPolygon",
"coordinates": [[[[-88.502966, 30.215235],'

The “properties” dictionary no longer has the GEOID key because it is now stored as a new key called id in the outer dictionary. You should also note that the id value is now a string instead of an integer. As mentioned previously, you will have to make sure that the data types of the connecting data are consistent. This can become tedious if leading and trailing zeroes are involved. To fix this issue I create a new column called state_str from the state column in the all_states_census_df:

all_states_census_df["state_str"] = all_states_census_df["state"].astype("str")

Now we can create the choropleth:

The code to create a choropleth with a GeoJSON variable which specifies an id

The difference between this code and the code used previously is that the key_on parameter references id and not properties.GEOID. The resulting map is exactly the same as in method 1:

The resulting Choropleth from the above method
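Because mismatched keys tend to fail quietly, it can help to compare the id values in the GeoJSON string against the key column before plotting. The sketch below is one possible check using only the standard library. The GeoJSON string and the list of state codes are made-up stand-ins (geometry omitted) chosen to show a leading-zero mismatch, and the variable names are hypothetical.

```python
import json

# Made-up stand-in for a GeoJSON string whose features carry an "id".
demo_json_with_id = json.dumps(
    {
        "type": "FeatureCollection",
        "features": [
            {"id": "28", "type": "Feature", "properties": {"NAME": "Mississippi"}, "geometry": None},
            {"id": "06", "type": "Feature", "properties": {"NAME": "California"}, "geometry": None},
        ],
    }
)

# Made-up stand-in for a key column converted with a plain astype("str").
state_str_values = ["28", "6"]

geo_ids = {feature["id"] for feature in json.loads(demo_json_with_id)["features"]}
data_ids = set(state_str_values)

missing_from_data = sorted(geo_ids - data_ids)  # shapes with no matching data row
missing_from_geo = sorted(data_ids - geo_ids)   # data rows with no matching shape

print(missing_from_data)
print(missing_from_geo)
```

Two empty lists mean every shape has a data row and every data row has a shape; anything else points to a formatting or data type difference to fix first.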

Method 3: With Pandas and GeoPandas’ Python Feature Collection

This method creates a GeoJSON-like object (a Python feature collection) from the original GeoPandas dataframe with the __geo_interface__ property.

I set the index of the usmap_gdf dataframe (US geographic data) to the STATEFP column, which stores the state ids, with leading zeroes, as a string:

usmap_gdf.set_index("STATEFP", inplace = True)

I then created a matching column in the all_states_census_df dataframe (US Census data) by left-padding the state codes with zeros to two digits:

all_states_census_df["state_str"] = all_states_census_df["state"].astype("str").apply(lambda x: x.zfill(2))

Next, I used the __geo_interface__ property of the usmap_gdf GeoPandas dataframe to get a Python feature collection of geometric state boundaries, stored as a dictionary, similar to the ones from the first two methods:

us_geo_json = gpd.GeoSeries(data = usmap_gdf["geometry"]).__geo_interface__

An excerpt of the us_geo_json variable is below:

{'type': 'FeatureCollection',
'features': [{'id': '28',
'type': 'Feature',
'properties': {},
'geometry': {'type': 'MultiPolygon',
'coordinates': [(((-88.502966, 30.215235), ...))]

Finally, we create the choropleth:

The code to create a choropleth with GeoPandas' __geo_interface__ property

The map looks the same as the ones from above, so I excluded it.

Method 4: With GeoPandas’ Geometry Type Column

Here we stick to GeoPandas. I created a GeoPandas dataframe called us_data_gdf which combines the geometric data and the census data in one variable:

import pandas as pd

# Left merge keeps every state boundary, even those without census data
us_data_gdf = pd.merge(left = usmap_gdf,
                       right = all_states_census_df,
                       how = "left",
                       left_on = ["GEOID", "NAME"],
                       right_on = ["state", "NAME"]
                       )

Note: all_states_census_df is a pandas dataframe of US Census data and usmap_gdf is a GeoPandas dataframe storing state geometric boundary data.

The code to create a choropleth with a GeoPandas dataframe is below:

The code to create a choropleth using a GeoPandas dataframe

In the above example the geo_data parameter and the data parameter both reference the same GeoPandas dataframe, as the information is stored in one place. As I did not set an index, the key_on parameter equals “feature.properties.GEOID”. Even with GeoPandas, folium requires the key_on parameter to act as if it is referencing a dictionary-like object.

As before, the map looks the same as the ones from above, so I excluded it.

Method 5: With GeoPandas Geometry Type and Branca

Here we create a more stylish map using the Branca library and folium’s examples with it. The first step with Branca, aside from installing it, is to create a ColorMap object:

import branca.colormap

# Scale the colors between the lowest and highest population values
colormap = branca.colormap.LinearColormap(
    vmin=us_data_gdf["Total_Pop_2021"].quantile(0.0),
    vmax=us_data_gdf["Total_Pop_2021"].quantile(1),
    colors=["red", "orange", "lightblue", "green", "darkgreen"],
    caption="Total Population By State",
)

In the above code we use the branca.colormap.LinearColormap class, which lets us set the colors and the values that anchor the color scale. For this choropleth I want the colors to scale proportionally between the lowest and highest population values in the US Census data, so I set the vmin and vmax parameters as above. If I neglect to do this, the areas with no values will be considered in the color scale. The results without these parameters set are below:

A Branca choropleth without the vmin and vmax parameters set

Once the ColorMap object is created we can create a choropleth (the full code is below):

Creating a choropleth with a GeoPandas dataframe and the Branca library

I adapted the examples on folium’s site to use the us_data_gdf GeoPandas dataframe. The example lets us exclude the portions of the geographic data which do not have associated census data by making them appear transparent (if the population for a state was null, its color on the choropleth would be black unless it was excluded). The resulting choropleth is below:

A choropleth made with Branca and GeoPandas

Branca is customizable but the explanations of how to use it are few and far between. The ReadMe for its repository states:

There’s no documentation, but you can browse the examples gallery.

You have to practice using it to make the kind of map you want.

Summary

Folium can be used to make informative maps, like choropleths, for those with and without coding knowledge. Government websites often have the geographic data necessary to create location boundaries, and the data to plot on them can often be obtained from government sites as well. It is important to understand your datatypes and filetypes, as mismatches between them can lead to unnecessary frustration. These maps are highly customizable; for instance, you can add tooltips to annotate your map. It takes practice to make use of this library’s full potential.

My repository for this article can be found here. Happy coding.


Data scientist with a passion for using technology to make informed decisions. Experience in Real Estate and Finance.