Using the US Census API for Data Analysis (a beginner’s guide)

Using the US Census API and PUMS for Data Analysis

A beginner’s guide, including a mini-tutorial in Python.

Photo by Markus Winkler on Unsplash

It’s project week here at the Flatiron School data science bootcamp and we’re pulling US Census data!

There are about a gajillion (give or take) websites, links, resources, webinars, and raw data files about the census, and it would take you days to go through them all. Lucky for you, I spent that time so you don’t have to, and I’ve compiled my favorite resources to get you started! My goal is to make the process as simple as possible so you can get to the good stuff: playing with some data!

As a newbie (literally new-born) Python programmer and data science student, I was intimidated by the thought of using an API to create and clean my own home-grown dataset, so I scoured the internet for days looking for one without needing to use an API. I knew what I wanted to focus on (marginalized and/or underrepresented communities, thank you for asking), but every data set with the variables I was looking for was too small, too old, too sparse, or WAY too convoluted (read: complicated) for my newborn coding brain.

Honestly, I’m glad I couldn’t find the perfect one, because it led me to grow past my fear and teach myself about APIs. Spoiler: they’re not actually that difficult a concept. Once I had gone through a couple of tutorials and felt like I had some base knowledge, I found this amazing tool by watching a census webinar recorded in March 2020. A thousand thank-yous to Amanda Klimek, Survey Statistician at the ACS, for this golden nugget. She talks in depth about how to use census tools and data, but for you I’ve extracted and condensed the resources and tricks I found most useful.

Long story short, the US Census Bureau has rolled out an amazing tool (still in beta at the time of this publication in 2021, yet infinitely useful) to help you build API requests for EXACTLY the information you want.

I have 4 things planned for you in this census crash course:

  • Some very basic overall knowledge regarding the Census Bureau,
  • How to customize your own dataset before even stepping a foot into your code,
  • A quick Python code along to get you started on some analysis, and
  • Several clickable links for you to deepen your knowledge where you see fit

Side Note: all links cited within this article are also located at the bottom of the page in order to provide quick and easy reference.


Let’s talk about the Census Bureau

The US Census Bureau is a government agency that conducts ongoing surveys to gather data about the people and economy of the United States. A few things I love about this organization:

  • They provide unlimited data access to the public (just use an API key if you need to exceed 500 queries per day)
  • Datasets are amazingly thorough and complete (no sparse matrices here!)
  • Access is absolutely free to anyone and everyone.

You may be asking: "What do you mean ongoing? I thought the census only happened once every 10 years?" Well… kind of. There are 2 surveys we can talk about when we talk about the Census Bureau: the American Community Survey, and the Decennial Census. They are both surveys about the US population, but with a few differences:

The ACS (American Community Survey) is "an ongoing nationwide survey regarding social, economic, housing, and demographic data. This survey has an annual sample size of 3.5 million addresses, with survey information collected nearly every day of the year." That sample size covers about 1% of the population.

The Decennial Census is a shorter and more basic survey that is sent to everyone in the US every 10 years. The 2020 Census had only 9 questions, whereas the ongoing ACS has closer to 70. In contrast to the ACS, which collects data throughout the year, the decennial census counts everyone in the US as of a single point in time (April 1st of a census year). For reference, the 2020 census had approximately 60% participation from the US population.


What is PUMS?

The in-house statisticians at the Census Bureau provide a ton of aggregated data and useful tables to the public. However, what I really want to talk to you about is PUMS (Public Use Microdata Sample). As the word "microdata" suggests, the data found in PUMS is based on individual responses rather than aggregated data with predetermined parameters. This allows data lovers, control freaks, and Data Science bootcamp students to perform their own analysis in order to understand relationships between the variables of their own choosing.
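To make the "microdata" idea concrete, here is a minimal sketch of building your own aggregate from individual records. The records below are made up for illustration and are not real PUMS data. The column names follow PUMS naming (SEX, SCHL, and PWGTP, the person weight that says how many people each record represents), and I am assuming pandas is installed.

```python
import pandas as pd

# Made-up person-level records for illustration only (not real PUMS data).
# PWGTP is the person weight: how many people each record stands for.
people = pd.DataFrame(
    {
        "SEX": ["1", "2", "2", "1", "2"],
        "SCHL": ["21", "21", "16", "16", "21"],
        "PWGTP": [120, 80, 95, 110, 60],
    }
)

# Build our own table: estimated number of people for each
# SEX x SCHL combination, by summing the weights in each group.
estimates = people.groupby(["SEX", "SCHL"])["PWGTP"].sum().unstack(fill_value=0)
print(estimates)
```

The point is that you pick the grouping variables, rather than relying on a table someone else has already aggregated.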

Get the data you want!

Anyone can access PUMS files on the File Transfer Protocol (FTP) website, which provides data in CSV and SAS formats, or through an API call built on data.census.gov. (Use this link to access both the FTP site and data.census.gov; details below.)

Access on FTP site – get the whole dang set

This is a great option for checking out an entire dataset without worrying about parameters. Pick your file using the naming syntax below:

  • csv: indicates file type
  • p: indicates data on the population (as opposed to ‘h’ indicating information on housing)
  • The next two characters indicate a state’s code (for example, "tx" for Texas), or "us" for data on the entire US.

So easy… too easy?

The downside of using these files? A 542 MB zip file unpacks to 2.28 GB, which is not ideal if you’re low on storage space or computing power.
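If you would rather build a file name in code than hunt for it, the naming syntax above can be sketched as a tiny helper. The function names are my own, and the base URL and the year/period folder layout are assumptions about how the FTP site is organized, so double-check them against the site before relying on them.

```python
# Assumed location of the PUMS files; verify against the FTP site.
BASE_URL = "https://www2.census.gov/programs-surveys/acs/data/pums"


def pums_zip_name(record_type: str, area: str) -> str:
    """Build a CSV zip file name such as 'csv_ptx.zip'.

    record_type: 'p' for population records, 'h' for housing records.
    area: two-letter state code (e.g. 'tx') or 'us' for the entire US.
    """
    if record_type not in ("p", "h"):
        raise ValueError("record_type must be 'p' or 'h'")
    return f"csv_{record_type}{area.lower()}.zip"


def pums_zip_url(year: int, period: str, record_type: str, area: str) -> str:
    """Join the assumed folder layout with the file name."""
    return f"{BASE_URL}/{year}/{period}/{pums_zip_name(record_type, area)}"


print(pums_zip_name("p", "tx"))
print(pums_zip_url(2019, "1-Year", "h", "us"))
```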

Access on data.census.gov – choose your variables

A third grader could use this site (and by the way, that’s a good thing). Peruse the 510 variables in the 2019 ACS, or choose an option from the drop-down menu for bite-sized variable listings grouped by topic (pictured below, I’m searching for variables in "Education"). I just love the friendly little teal-colored boxes I can click to choose my variables, and the drop-down menu of values for each under the details tab.

Isn't it beautiful?

At the top of the window, click on "SELECT GEOGRAPHIES" to choose a large geographical region, or drill down all the way to the county level. For my analysis, I’ve chosen to leave this tab selection blank because I would like to look at data for the entire US.

Side Note: If you are looking for a specific county that has a population smaller than 65,000, then when selecting a dataset you’ll need to choose "ACS 5-Year Estimates," which represent data collected over 5 years for these smaller counties, rather than 1-Year Estimates. More info here.

Play around with your selected variables in the "Data Cart," or skip right over to the magical "Download" tab (pictured below on the top right). Today, I’m going straight to the "Copy API Get Query" button to paste the URL directly into my project notebook.
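If you are curious what that copied URL is made of, here is a rough sketch that assembles a similar one by hand. The helper name and the variable list (SEX, AGEP, SCHL, PWGTP) are my own illustrative choices, and I am assuming the PUMS endpoint follows the api.census.gov/data/{year}/acs/{acs1 or acs5}/pums pattern. The URL you copy from the site is the one to trust.

```python
from urllib.parse import urlencode


def build_pums_query(variables, year=2019, five_year=False, api_key=None):
    """Assemble a PUMS API URL for the given variable names.

    five_year=True switches from 1-Year to 5-Year estimates, which is
    what you need for areas with fewer than 65,000 people.
    """
    dataset = "acs5" if five_year else "acs1"
    params = {"get": ",".join(variables)}
    if api_key:
        # Only needed if you expect to go past the free query limit.
        params["key"] = api_key
    # safe="," keeps the commas between variable names readable.
    query = urlencode(params, safe=",")
    return f"https://api.census.gov/data/{year}/acs/{dataset}/pums?{query}"


url = build_pums_query(["SEX", "AGEP", "SCHL", "PWGTP"])
print(url)
```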

Ready for that Python Code Along I promised?

Now that you have copied your link from data.census.gov, we can use the requests library’s .get() method to request a response. You’ll also want to check response.status_code to confirm that it is 200 (Success), which indicates your request went through with no errors. Additional response codes are listed below.

A list of possible response codes and what they indicate:

  • 1xx (Informational): Communicates transfer protocol-level information
  • 2xx (Success): Client’s request was accepted successfully (fist pump)
  • 3xx (Redirection): Client must take some additional action in order to complete their request
  • 4xx (Client Error): Points the finger at the client (check your code!)
  • 5xx (Server Error): Server takes responsibility for error
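Here is a minimal sketch of the request step together with the status code classes listed above. The function names are my own, and I am assuming the third-party requests library is installed. Pass in the URL you copied from data.census.gov.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to its class, per the list above."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(code // 100, "Unknown")


def fetch_rows(url: str) -> list:
    """Request the URL and return the parsed JSON rows."""
    # Imported here so status_class can be used without requests installed.
    import requests

    response = requests.get(url, timeout=30)
    if response.status_code != 200:
        label = status_class(response.status_code)
        raise RuntimeError(f"Request failed: {response.status_code} ({label})")
    return response.json()


print(status_class(200))
print(status_class(404))
```

Calling fetch_rows(url) with your copied link should hand back a list of rows, assuming the request succeeds.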

Next, load and format your data and take a look!
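A sketch of the loading step, assuming pandas is installed. The Census API returns JSON as a list of rows with the column names in the first row, so I am using a small made-up list as a stand-in for response.json(). The values are illustrative only.

```python
import pandas as pd

# Stand-in for response.json(): a list of rows, header row first.
# These values are made up for illustration.
rows = [
    ["SEX", "AGEP", "SCHL"],
    ["1", "34", "21"],
    ["2", "52", "16"],
    ["2", "19", "19"],
]

df = pd.DataFrame(rows)
print(df.head())
```

Notice that the column labels come out as 0, 1, 2 and the real names sit in the first row of data.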

Hmm… that’s nice, but the header needs to be reset, and it would be great to change the variable names to something a little more intuitive. Because the real column names arrive as the first row of data, we can drop that row with df[1:] and rename the columns by assigning a list of names to df.columns. Let’s do it!
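The header reset and renaming can be sketched like this, again on a small made-up stand-in for the API response. The friendlier column names are my own choices.

```python
import pandas as pd

# Made-up stand-in for the API response, header row first.
rows = [
    ["SEX", "AGEP", "SCHL"],
    ["1", "34", "21"],
    ["2", "52", "16"],
    ["2", "19", "19"],
]
df = pd.DataFrame(rows)

# Drop the first row (the original header) and renumber from 0.
df = df[1:].reset_index(drop=True)

# Rename the columns to something more intuitive.
df.columns = ["sex", "age", "education"]
print(df)
```

Another option is to pass the header in up front with pd.DataFrame(rows[1:], columns=rows[0]) and rename afterward.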

A good practice before you get too far is to run df.info() to check the datatype of each column. Census data is loaded as objects (strings), and I want to convert several of my variables to integers so I can perform statistical analysis. Real quick, let’s create a list of the variables I want to convert and run a for loop.
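A sketch of the conversion loop, using made-up values and my own column names:

```python
import pandas as pd

# Made-up values, stored as strings the way the API delivers them.
df = pd.DataFrame(
    {
        "sex": ["1", "2", "2"],
        "age": ["34", "52", "19"],
        "education": ["21", "16", "19"],
    }
)
df.info()  # every column starts out as text

# Columns I want as integers for statistical analysis.
int_columns = ["age", "education"]
for column in int_columns:
    df[column] = df[column].astype(int)

print(df.dtypes)
print(df["age"].mean())
```

If a column contains blanks or non-numeric codes, astype(int) will raise an error, so it is worth checking the data dictionary for each variable first.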

Excellent. Now you’re all set up to begin playing with your census data! I hope this helps beginners realize just how easy it can be to use the census API to run your own statistical analysis.


Resources:

In order to make this process as easy-peasy for you as possible, I’ve listed all links mentioned and a few extra favorites below.

Data.census.gov

Census Resources for Developers

  • From this page you can also request an API key in case you wish to make more than 500 queries per day
  • Join their Slack community – super friendly people and very helpful. Reading past discussions and clicking through posted links really helped me get a feel for how to use this amazing data set. The people there are also super responsive! While there may be other ways to get in touch with the experts, it looks like most Slack questions and concerns are responded to fairly quickly.

Accessing PUMS Data

More information on ACS 5-Year Data

PUMS Technical Documentation – see "PUMS Data Dictionary" for a list of variables on the ACS. (Choose .txt or .pdf to take a peek without needing to download.)

Glossary of Census Terms

ACS PUMS Handbook: Understanding and Using the American Community Survey Public Use Microdata Sample Files: What Data Users Need to Know

Webinar (2020): Introduction to the American Community Survey Public Use Microdata Sample PUMS Files

