Accessing public datasets through DataCommons, Google’s API for stats
A new paradigm for querying public datasets
Ever wondered how Google is able to give such accurate responses to questions like "What is the median income in San Francisco?"

The statistical queries on Google Search are powered by DataCommons, an open-source repository of publicly-available datasets.
DataCommons provides a central API to work with disparate public datasets from the World Bank, the US Bureau of Labor Statistics, the Centers for Disease Control and Prevention, and more. This tooling means you don’t have to spend most of your time finding, cleaning and joining together this data (the 70% of data analysis work that we don’t like to talk about). It’s a developer-friendly open standard, meaning anyone can access it from the tools they love, and it can be really useful in the following domains:
- Scientific Research
- Journalism
- Kaggle Competitions
- Market Research
- Learning
Let’s find out what it can do when combined with our favourite Python visualization libraries!
A warmup example
Imagine we’re a journalist writing a feature comparing what it’s like to grow old in different US states. To start off, we’d want to know how age is correlated with other variables like income, health, air pollution, and population density. We can get started straight away by looking at data from the 500 largest cities in the US:
The most important thing to remember about DataCommons is that everything is a node in a graph. In this case, we start by querying `CDC500_City`, a special group node, and asking for its members, i.e. the other nodes that link to it. This gives us the DataCommons IDs (`dcid`s) for the cities in the CDC500 dataset. Google has done some pretty sophisticated entity resolution so that these locations are standardized across the different data sources. Next we pass these IDs plus a list of statistical variables into a `dc.build_multivariate_dataframe` query and get a Pandas DataFrame with data from the CDC, the Bureau of Labor Statistics and the Census, all magically cleaned and grouped by city.
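Here’s a minimal sketch of that flow, using the `datacommons` and `datacommons_pandas` Python packages. The statistical variable names below are illustrative assumptions; the Statistical Variable Explorer lists the exact dcids available for each place type.

```python
import datacommons as dc
import datacommons_pandas as dcp

# Ask the CDC500_City group node for its members: this returns the
# dcids of the 500 cities in the CDC dataset
city_dcids = dc.get_property_values(["CDC500_City"], "member", limit=500)["CDC500_City"]

# Stat var names are illustrative; browse the Statistical Variable
# Explorer for the exact ones you need
stat_vars = ["Median_Age_Person", "Median_Income_Person", "Count_Person"]

# One call pulls CDC, BLS and Census data into a single DataFrame,
# indexed by city dcid with one column per statistical variable
df = dcp.build_multivariate_dataframe(city_dcids, stat_vars)
print(df.head())
```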
I cannot emphasize how cool this is! Imagine how long it would take if you were to do this from scratch: crawling through different websites with data in incompatible formats, cleaning and standardizing the data, building a list of common names for entities, and merging it all together.
Running that code, plus the visualization below, gives us the following result:
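As a quick sketch, the chart could be drawn with Plotly Express, assuming the hypothetical column names from the dataframe above:

```python
import plotly.express as px

# Scatter of median income against median age for the CDC 500 cities
fig = px.scatter(
    df,
    x="Median_Age_Person",
    y="Median_Income_Person",
    labels={
        "Median_Age_Person": "Median age",
        "Median_Income_Person": "Median income (USD)",
    },
    title="Income vs. age in the CDC 500 cities",
)
fig.show()
```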
It’s a great start, and it seems like there is a weak positive correlation between median age and income, but we don’t know the actual names of the cities. To fix that, we need to enrich the data.
Enriching the data
We’d like to get the city and state name for each location in our dataset. Usually, this would take hours, as we’d have to find another data source, standardize the names, do a merge, and then spend ages debugging slightly mismatched spellings. In DataCommons, all we need to do is get some more properties from the graph using the API:
It’s easy to get additional information about each node and its relationships using `dc.get_property_values`. We pass in the property `containedInPlace` to find the parent place, and `name` to find the human-readable name.
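A sketch of that enrichment step, assuming every city resolves to exactly one state:

```python
# Human-readable name of each city node
city_names = dc.get_property_values(city_dcids, "name")

# containedInPlace links a city to its parent places; value_type="State"
# keeps only the state-level parents (assuming one state per city)
city_states = dc.get_property_values(
    city_dcids, "containedInPlace", value_type="State"
)
state_names = dc.get_property_values(
    sorted({s for states in city_states.values() for s in states}), "name"
)

df["City"] = [city_names[d][0] for d in df.index]
df["State"] = [state_names[city_states[d][0]][0] for d in df.index]
```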
The data looks as follows:
Great, some much clearer trends are appearing! We can see that most of the richest cities are in California, and most of the oldest cities are in Florida. Sizing the points by population also makes it easier to see trends for the bigger cities like New York and Los Angeles.
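That chart might look something like this sketch, reusing the hypothetical columns from the snippets above:

```python
# Same scatter as before, now labelled by city, coloured by state,
# and sized by population
fig = px.scatter(
    df,
    x="Median_Age_Person",
    y="Median_Income_Person",
    color="State",
    size="Count_Person",
    hover_name="City",
    title="Income vs. age in the CDC 500 cities",
)
fig.show()
```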
Showing on a map
Finally, let’s say we want to help our readers visualize this data on a map. DataCommons has a handy built-in map explorer tool where we can select a statistical variable and a geographical level. If we want to visualize something more complex, e.g. what percentage of households headed by someone aged 65 or over have incomes of $200k or more per year, we can easily refetch the data and plug it into a geoviz tool like Folium:
In this code we refetch the data by state, download a GeoJSON object, then join on the state name using `key_on="feature.properties.name"`. Add a little syntactic sugar to display labels on hover, and we’re done. With data that’s clean, you’ll be serene!
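Here’s a rough sketch of the whole step. The statistical variable dcids and the GeoJSON URL below are placeholder assumptions for illustration, not necessarily the exact ones used here:

```python
import folium

# Refetch the two counts at state level. These stat var dcids are
# hypothetical placeholders; look up the real ones in the Statistical
# Variable Explorer.
state_dcids = dc.get_places_in(["country/USA"], "State")["country/USA"]
state_df = dcp.build_multivariate_dataframe(
    state_dcids,
    [
        "Count_Household_Householder65OrMoreYears_IncomeOf200000OrMoreUSDollar",
        "Count_Household_Householder65OrMoreYears",
    ],
)
state_df["pct_200k"] = 100 * (
    state_df["Count_Household_Householder65OrMoreYears_IncomeOf200000OrMoreUSDollar"]
    / state_df["Count_Household_Householder65OrMoreYears"]
)
names = dc.get_property_values(state_dcids, "name")
state_df["name"] = [names[d][0] for d in state_df.index]

# GeoJSON of US state boundaries with names under feature.properties.name
# (this URL is an assumption; any equivalent file works)
geo_url = (
    "https://raw.githubusercontent.com/python-visualization/"
    "folium/main/examples/data/us-states.json"
)

m = folium.Map(location=[48, -102], zoom_start=3)
choropleth = folium.Choropleth(
    geo_data=geo_url,
    data=state_df,
    columns=["name", "pct_200k"],
    key_on="feature.properties.name",
    fill_color="YlGn",
    legend_name="% of 65+ households with income of $200k or more",
).add_to(m)

# The promised syntactic sugar: show the state name on hover
choropleth.geojson.add_child(folium.features.GeoJsonTooltip(fields=["name"]))
m
```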
It took me a little while to realize that the most flush retirees are actually in Hawaii, where almost 10% of households headed by someone 65 or over have an annual income of $200k or more 😱 .
Summary
In about 100 lines of code I’ve pulled data from several different sources, run analyses and built several visualizations in different plotting libraries. With DataCommons, most of your energy can go into actually analysing and visualizing the data, rather than scraping, cleaning and munging it.
From here on you can explore some of the thousands of statistical variables gathered in DataCommons, or try their Graph Explorer. There’s no limit to the questions you can answer!
See the code for this tutorial on GitHub, and the final dataset below: