Creating an Enhanced NYT COVID-19 Data Set

Adding Demographic, Electoral, and Mobility Information in BigQuery

Kelly Joyner
Towards Data Science

--

Covid infections in the US by population density (1% of cumulative population buckets) over time. Interactive version.

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

I’ve been working at Google Cloud for ten months, but haven’t had any experience with our product offerings, so when I read that Google was adding COVID data, including the New York Times county-level data, to their public BigQuery data sets, it seemed like a good opportunity to learn how the tools work. (Disclaimer: This is a project I did on my own time, and is not associated with Google.)

I’ve been very anxious during the pandemic, and my main coping mechanism for dealing with anxiety is to try to understand the thing that is making me anxious. When I started this project in mid-March, there were a lot fewer sources of deep analytics regarding the virus, and I wanted to be able to ask questions of the raw data, to pose and test hypotheses, and to better understand its spread and the implications for public health policy and my own actions.

I also wanted to learn some new stuff.

Don’t Use This Data Set

If you are looking for a reliable, tested, high-quality demographic data set for COVID-19 in the US, this is not that data set. I think it’s pretty great, but I did it myself, in my spare time, as a learning exercise and for my own use. You’d be much better off using a data set built by a team of people, with actual training, automated correctness tests, better-vetted metrics, and wider adoption.

Why I Wrote This Article Anyway

I thought sharing my neophyte experiences building a reporting schema like this might be useful or interesting to others in the same way that the many articles I read in the course of doing so were helpful to me.

This project required learning to work with GIS and SQL geometry primitives, how Census Bureau data was put together, and some strategies for dealing with complex subsets of that data, such as city boundaries and parts of counties. I hope these strategies and tools might be useful to others who are learning to manipulate similar data sets.

How To Find This Data Set You Should Not Use

I’ve checked in some files on GitHub that can be used to recreate the reporting schema, if you’re into that sort of thing. I purposefully didn’t put in any integrity tests or installation automation. It’s not that I’m lazy, it’s for some other reason.

If you are interested in the data set itself, it is publicly available through BigQuery. I materialize a couple of tables every six hours or so, but you’re better off querying my views and materializing your own table if you want the most up-to-date data.

NYT County Data

There are several datasets of raw numbers reported by counties and states out there. At first, I quickly built a BigQuery importer for COVID Tracking Project data, which started out as a project of The Atlantic. It’s a widely-used data set at a state level and tracks hospitalizations and tests as well as infections and deaths.

It seemed to me, however, that the variation within states was more important than the variation between states. This is a disease that, in the beginning at least, impacts dense urban areas much more than less dense rural areas, and that seems likely to continue due to the factors that increase its spread.

In addition to the density issue, 50 or so is a small number of samples. It represents an aggregation of a much larger number of county reports, and information is lost in that aggregation.

I wanted the finest-grained data I could find, and in the U.S., that means looking at reporting data for each county separately.

I wanted to be able to look at county-level data with regard to geographical location, as well as demographic and geographic data, such as total population and land area, and I wanted to join that data to MIT election data and to Google’s COVID-19 Mobility Report so that I could look for relationships between mobility and the spread of the disease, as well as partisan relationships with mobility data.

I chose the New York Times data for a few reasons:

  • It was already available in BigQuery.
  • It’s based on some level of on-the-ground reporting, and I have found the Times to be a generally trustworthy information source over the years.
  • Most dashboards at the time were using data from Johns Hopkins University, which has more metric series, but they had no county-level data in BigQuery at the time, and also I am a bit of a contrarian.

One of the things I wanted to be able to do was compare numbers from different data sets to see how much they agree, so my plan was (is) to work with all of them eventually. I decided to work first with what was easy and available.

NYT Data Characteristics

If you intend to work with the NYT data, you should go to the source on GitHub, carefully read, and fully understand the authors’ description of the data set. But in brief:

The NYT data set is very straightforward.

There are three dimension fields, date, state_name and county, that identify the U.S. county that the data applies to. There is another field, county_fips_code, that gives the county’s FIPS code, a widely used standard identifier for counties.

The data set covers all fifty states, plus U.S. territories.

There are also rows where the county name is “Unknown”. This is used to record deaths that are reported at the state or territory level that the Times was unable to assign to a county. There is no FIPS code for these rows, and in my data set they are not populated with any enhancement data from other sources.

There are only two data values for each row: deaths and confirmed_cases. These give the cumulative numbers reported by the county. The data set does not supply new cases or deaths, which must be calculated by comparing to the day before.

Occasionally, especially in the early days, a county would revise its estimates of deaths or confirmed cases downward from one day to the next. This causes the “new cases” for a day to be negative. In my analyses, I generally filter out these anomalies or set the floor for new cases to 0. It’s not a major issue if you are doing your analysis on daily cases. For comparing to other data sets that report things differently, it may require some remediation.

NYT Geographic Exceptions

The NYT data has a couple of geographic reporting exceptions that made for some of the most challenging and fun parts of this project, and gave me excellent learning opportunities to work with a lot of basic GIS operations and tools:

  • All cases for the five boroughs of New York City (New York, Kings, Queens, Bronx and Richmond counties) are assigned to a single area called New York City, and
  • Four counties (Cass, Clay, Jackson and Platte) overlap the municipality of Kansas City, Mo. The cases and deaths that we show for these four counties are only for the portions exclusive of Kansas City. Cases and deaths for Kansas City are reported as their own line.

FIPS Codes

FIPS codes are, technically, a deprecated federal standard for identifying counties, but they are still widely used outside the government and in older data sources. Today, the Census Bureau uses a newer system, the GEOID, to identify counties and other types of places. Mappings from GEOID to FIPS codes are easy to find.

There are two types of FIPS code: state and county.

A state FIPS code is a two-digit number. California is 06. Texas is 48.

A county FIPS code is a three-digit number, unique within the state. Many data sources concatenate these two codes to form a five-digit code that is unique in the nation. Travis County, in Texas, for instance, has a combined FIPS code of 48453, because it is in Texas (48) and has a county FIPS code of 453.

In many web pages and data sources, this combined number is referred to as a “county FIPS code”, so you will have to check whether the value is a three-digit code with a separate state field, or a five-digit code. In addition, some sources will store these values as integers, others as strings. It is a good idea to write helper functions or have ready expressions for transforming between these representations of the same data.
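For example, here is a minimal sketch of such conversion expressions in BigQuery Standard SQL; the literals are just the Travis County values from above.

    WITH example AS (
      SELECT 48 AS state_fips_int, 453 AS county_fips_int, '48453' AS combined_fips_str
    )
    SELECT
      -- separate integer codes -> zero-padded five-digit combined string
      LPAD(CAST(state_fips_int AS STRING), 2, '0')
        || LPAD(CAST(county_fips_int AS STRING), 3, '0') AS combined_fips,
      -- combined string -> separate state and county parts
      SUBSTR(combined_fips_str, 1, 2) AS state_fips,
      SUBSTR(combined_fips_str, 3, 3) AS county_fips
    FROM example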

Census Bureau American Community Survey

Every year, the Census Bureau surveys about 3.5 million households. It collects information about ancestry, citizenship, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics, so it provides a very rich data source.

ACS Data Characteristics

Structure and Content

Like all Census Bureau data, ACS survey data is divided into several successively smaller levels. Each state is divided into a number of counties (or county-like divisions, such as parishes). Counties are broken up into census tracts, which are divided into block groups, which are composed of the smallest division, blocks. A block is an arbitrary division, but in cities it often corresponds to an actual city block.

For each block, the bureau records the actual number of whatever they are surveying (people, children, people of color, people within an age group, people within income ranges, males, females, etc.). They also record some aggregate values, such as the median income, median rent, et cetera, that cannot be reconstructed simply by adding up the values of smaller areas.

The structure of the data is important, because in order to address geographic exceptions in the NYT data, I will need to add some counties together to build a new county, and subtract parts from other counties in order to eliminate overlap.

Missing Data

One other important thing to note is that because the ACS data is a survey, not a census, sampling error becomes a bigger factor at lower levels of the geographic hierarchy. When the error bars get too big, some fields are not included. The columns simply aren’t there. You cannot, for instance, at the block group level, get breakdowns of men by ethnicity and age. For some of the counties in the Kansas City area, I had to set these values to NULL because they were not present at the blockgroup level.

ACS geographical arithmetic

Census Bureau divisions have two very important properties: they are disjoint, meaning they don’t overlap, and they are comprehensive, meaning that all the land in a state is in one of the counties, all of the land in a county is in a census tract, and so on.

Together, these properties mean that when you add up, say, the total population of all the blockgroups in the county, you get the total population of the county.

Importantly for our purposes, this also means that if you take only some of the blockgroups in a county (ones that are part of a city), and you subtract their population from the total population for the county, then what is left is the population of all the parts of the county that are not in the selected blockgroups.
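To make the subtraction concrete, here is a minimal sketch in BigQuery Standard SQL, assuming the public ACS blockgroup table below (the exact table name and ACS vintage are from memory, so check the current schema). A blockgroup’s geo_id begins with its five-digit combined county FIPS code, so the county totals fall out of a GROUP BY, and subtracting a hand-picked subset of blockgroups gives the remainder of each county.

    WITH by_county AS (
      SELECT SUBSTR(geo_id, 1, 5) AS county_fips,   -- state (2) + county (3) prefix
             SUM(total_pop) AS county_pop
      FROM `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr`
      GROUP BY county_fips
    ),
    city_part AS (
      SELECT SUBSTR(geo_id, 1, 5) AS county_fips,
             SUM(total_pop) AS city_pop
      FROM `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr`
      -- an ARRAY<STRING> query parameter holding the hand-picked blockgroup list
      WHERE geo_id IN UNNEST(@city_blockgroup_ids)
      GROUP BY county_fips
    )
    SELECT c.county_fips,
           c.county_pop - IFNULL(p.city_pop, 0) AS pop_outside_city
    FROM by_county AS c
    LEFT JOIN city_part AS p USING (county_fips)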

Aggregating computed values (means, medians, and quartiles)

With median values, it is a bit more complex. You can’t subtract the median income of a city from the median income of a county to find the median income of the rest of the county: the result would be meaningless, and could even be negative!

Likewise, the sum of the median values for blockgroups in a county (or counties in a state, etc.) does not give you the median value for the county.

In these cases, in order to estimate aggregate values of medians, etc., I took the population-weighted mean of the median values. For example, I used SUM(median_income*total_pop)/SUM(total_pop) over New York, Kings, Queens, Bronx and Richmond counties in New York state to estimate the median income for “New York City” as reported in the NYT data.
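In query form, that estimate is a single aggregation. A sketch, assuming the public county-level ACS table (the geo_ids are the combined FIPS codes of the five boroughs):

    SELECT
      SUM(total_pop) AS total_pop,
      SUM(median_income * total_pop) / SUM(total_pop) AS est_median_income
    FROM `bigquery-public-data.census_bureau_acs.county_2018_5yr`
    WHERE geo_id IN ('36005', '36047', '36061', '36081', '36085')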

Because incomes in the US are distributed so unevenly, this population-weighted mean is not as accurate as using a “Pareto distribution”-based estimate, but they appear to be fairly close in most cases, based on what randos on the Internet say, and I had more interesting things to do than learn how to write such an estimation function. If anybody would like to do so, you are welcome to!

US County Geographic Metrics and Shape Data

BigQuery has public datasets that contain two attributes for counties and blockgroups that I wanted for my data: land area in square meters, and the geographic boundaries of the area. Each row contains a geo_id field, a county_fips_code field, and a state_fips_code field that identify the county or blockgroup, and are essential for mapping from the geo_id field in ACS data to the FIPS codes used by other data sets.

The geographic data is in the form of polygons where each point is mapped to an exact latitude and longitude on a spheroid model of the Earth. Such data has several representations. By default, if you SELECT a geographic field, it will show you the WKT representation:

POLYGON((-85.852932  32.475392, -85.853041 32.475265, -85.853711 32.475087, -85.853876  32.475043, -85.854102 32.474983, -85.854492 32.474879, -85.854615  32.474846, -85.854879 32.474887, -85.855306 32.474953, -85.855697  32.475028, -85.855797 32.475053, -85.856138 32.475139, -85.856702  32.475297, -85.856804 32.475326, -85.856873 32.47535, -85.857654  32.475617, -85.860029 32.476349, -85.861159 32.476446, -85.862489  32.476404, -85.862771 32.476263, -85.863495 32.475749, -85.86442  32.475496, -85.865332 32.475246, -85.866949 32.47506, -85.86719  32.475012, -85.870686 32.474311, -85.872695 32.473869, -85.874495  32.472924, -85.874783 32.472225, -85.874744 32.471279, -85.874726  32.470837, -85.874521 32.469859, -85.874504 32.469366, -85.874518  32.469039, -85.874609 32.468884, -85.874787 32.468792, -85.874887  32.468809, -85.874957 32.46828, -85.875196 32.468105, -85.875958  32.467548, -85.87736 32.466318, -85.878192 32.465182, -85.878512  32.464157, -85.878597 32.462615, -85.878571 32.460794, -85.87857  32.460749, -85.878545…

Google has a demonstration tool, BigQueryGeoViz, that will render query result shapes on a map and allow you to color and shade them using other attributes you can specify. This tool is incredibly handy, if a little basic, and is the only GIS visualization tool I used while building my data set.

BigQuery GIS Functions

I won’t spend much time on GIS functionality in BigQuery. I suggest consulting the documentation and Getting Started guide. They were enough to…get me started.

However, I will briefly mention the functions that BigQuery provides that were essential in building this data set:

  • ST_UNION and its aggregate function, ST_UNION_AGG, take a list of shapes and return a new shape that contains every point in all of them. I use this function to add blockgroups together to get an approximate version of Kansas City, Missouri, and to get the shape of New York City by taking the union of all the counties that make it up.
  • ST_DIFFERENCE does the opposite of ST_UNION: it takes two shapes and returns the first shape with the second “subtracted from” it, so that any overlapping areas are removed. I use this to “trim” parts of Kansas City from the counties they are contained in.
  • ST_INTERSECTS takes two shapes (or a shape and a point) and returns TRUE if they share any point, for example if the second argument lies partly or completely inside the first. This allows you to take a latitude and longitude and run a query that will tell you which U.S. county it is in (there is a small sketch of this after the list)! It’s very simple in theory, but magical in practice.
  • ST_COVERS returns TRUE if the second shape it is given is entirely inside the first shape it is given. I used this to determine which blockgroups were entirely within the boundaries of Kansas City, Missouri.
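For example, here is a sketch of that point-in-county lookup, assuming the public county boundaries table and a county_geom geometry column (column names from memory, so verify them against the schema); note that ST_GEOGPOINT takes longitude first, then latitude.

    SELECT county_fips_code, county_name
    FROM `bigquery-public-data.geo_us_boundaries.counties`
    WHERE ST_INTERSECTS(county_geom,
                        ST_GEOGPOINT(-97.7431, 30.2672))  -- roughly downtown Austin, TX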

New York City

New York is a fairly easy case. NYT reports on a county called “New York City” that covers five actual New York counties — New York, Kings, Queens, Bronx, and Richmond.

In order to fix this, I just need to aggregate the census data and the geometry data for these five counties.

New York, Kings, Queens, Bronx, and Richmond counties in New York
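The geometry and land-area side of that aggregation is a single query over the county boundaries table (the geometry column name is an assumption); the ACS fields are combined the same way as in the weighted-mean sketch above.

    SELECT
      ST_UNION_AGG(county_geom) AS nyc_geom,
      SUM(area_land_meters) AS nyc_area_land_meters
    FROM `bigquery-public-data.geo_us_boundaries.counties`
    WHERE geo_id IN ('36005', '36047', '36061', '36081', '36085')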

Kansas City, Missouri

Fixing Kansas City is harder. The New York Times reports on the municipality of Kansas City, Missouri, as if it were a county, and does not report deaths and infections that occur there as occurring in the four counties (!) that make up the municipality.

The Kansas City metro area sprawls across two states:

Kansas City Metro Area and the four Missouri counties it overlaps.

The Kansas side of Kansas City reports its deaths as part of the county it is in, so I only need to correct the Missouri side.

Instead of just aggregating multiple real-world counties, I need to build an “imaginary” county that covers just the Missouri portion of Kansas City:

Then I need to “subtract out” the portions of each county that are also part of Kansas City:

When I’m done, I’ll have a set of five counties: four “real” counties that have Kansas City trimmed out, and a fifth county, “Kansas City”.

Building an Estimated Kansas City

The Census Bureau’s data does not line up with city boundaries, because those boundaries change all the time. But it does have census tracts and block groups that we can use to do an approximate alignment.

I obtained an outline for the Missouri part of Kansas City by taking the ST_INTERSECTION of the bigquery-public-data:utility_us.us_cities_area row for Kansas City and the state of Missouri from bigquery-public-data:utility_us.us_states_area.
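In outline, that clipping query looks something like the sketch below; the filter columns and geometry column names are my guesses at the utility_us schemas, so treat them as placeholders.

    SELECT ST_INTERSECTION(city.city_geom, st.state_geom) AS kc_missouri_geom
    FROM `bigquery-public-data.utility_us.us_cities_area` AS city
    CROSS JOIN `bigquery-public-data.utility_us.us_states_area` AS st
    WHERE city.name = 'Kansas City'       -- placeholder filter column
      AND st.state_name = 'Missouri'      -- placeholder filter column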

It’s difficult to completely automatically find a set of blockgroups that approximate the area of Kansas City. If I use ST_INTERSECTS between the blockgroups and the city outline, I get a large number of blockgroups that are outside the city:

ST_INTERSECTS results, with the boundaries of Kansas City shown in black.

On the other hand, if I use ST_COVERS, I get an undercount:

Result of ST_COVERS with Kansas City boundary in black.

In order to get a better approximation, I started with the ST_INTERSECTS set in BigQueryGeoViz; I then clicked on each blockgroup I wanted to remove, and pasted its identifier into an exclude list in the query. It took a little while, but the end result was pretty close to what I was looking for:

The nice thing about this method is that it’s easy to reload the query and check my work as I go.
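The selection query ended up shaped roughly like the sketch below: keep every blockgroup that intersects the city outline, then drop the ones on an exclude list built by clicking around in BigQueryGeoViz. The blockgroup boundaries table, its geometry column, and the geo_ids in the exclude list are placeholders, not the real values.

    WITH kc_outline AS (
      -- the Missouri part of Kansas City, from the ST_INTERSECTION sketch above
      SELECT ST_INTERSECTION(city.city_geom, st.state_geom) AS geom
      FROM `bigquery-public-data.utility_us.us_cities_area` AS city
      CROSS JOIN `bigquery-public-data.utility_us.us_states_area` AS st
      WHERE city.name = 'Kansas City' AND st.state_name = 'Missouri'
    )
    SELECT bg.geo_id, bg.blockgroup_geom
    -- placeholder table: use whichever blockgroup boundary table you have available
    FROM `bigquery-public-data.geo_census_blockgroups.us_blockgroups_national` AS bg
    CROSS JOIN kc_outline
    WHERE ST_INTERSECTS(bg.blockgroup_geom, kc_outline.geom)
      AND bg.geo_id NOT IN (
        '290950059001',  -- illustrative geo_ids only; the real exclude list was
        '290950059002'   -- copied out of BigQueryGeoViz one blockgroup at a time
      )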

Once I found a set of blockgroups that comprised my virtual “county” of Kansas City, Missouri, I estimated the ACS data using the arithmetic method above. I created a GIS boundary for it by calling ST_UNION_AGG to put all the blockgroup shapes together.

Trimming the Missouri Counties

Once I have a set of blockgroups that approximate Kansas City, I use it to adjust the four counties that make it up by subtracting out the portions that are also part of Kansas City.

To do so, I first join the ACS blockgroup data to Census Bureau data for blockgroups, which contains the county FIPS code for the county. I group the blockgroups by FIPS code and then aggregate them as specified above. This gives me one row for each county, where I have the total population for all the blockgroups in that county. I use ST_UNION_AGG to compute the shapes for these blockgroup aggregates.

The blockgroups that make up Kansas City, grouped by county.
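Here is a sketch of that grouping step, with one simplification: rather than joining to the boundaries table for the county FIPS code, it takes the code from the geo_id prefix. kc_blockgroups stands in for wherever the blockgroup selection from the previous step was saved; my_dataset is a placeholder.

    SELECT
      SUBSTR(bg.geo_id, 1, 5) AS county_fips,        -- state + county prefix of the geo_id
      SUM(acs.total_pop) AS city_pop_in_county,
      ST_UNION_AGG(bg.blockgroup_geom) AS city_geom_in_county
    FROM my_dataset.kc_blockgroups AS bg
    JOIN `bigquery-public-data.census_bureau_acs.blockgroup_2018_5yr` AS acs
      ON acs.geo_id = bg.geo_id
    GROUP BY county_fips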

I then use the techniques detailed above to subtract out the Kansas City portions of the county from the values for the county as a whole, and I am basically done!

Joining the Data

Now that I have synthesized the county data I need, I need to join it to the NYT data. Because the real counties are joined on FIPS code, and that field is blank for the synthetic counties, assigning synthetic FIPS codes was the natural choice.

I want to avoid FIPS code conflicts between real counties and my synthetic counties. This is pretty easy: “real” FIPS codes are never more than five digits long. As long as the lower five digits of my synthetic FIPS codes are ‘0’, they could not possibly be confused with a real FIPS code, because “00” is not a valid state FIPS code, and “000” is not a valid county FIPS code.

So I assigned “1000000” to “New York City”, “2000000” to Kansas City, and added “000000” to the FIPS codes for each of my trimmed counties — “29095000000” for the trimmed Jackson County, Missouri, for instance.

I can then fix up the blank FIPS codes for these counties by creating a view on top of the NYT data. Counties in the NYT data that have a county FIPS code use that code, but if it is blank and the state is “New York” and the county is “New York City”, then the FIPS code is “1000000”, and so on.
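That fix-up view looks roughly like the sketch below. The public NYT table name is as I recall it, my_dataset is a placeholder, and the synthetic codes follow the scheme described above.

    CREATE OR REPLACE VIEW my_dataset.nyt_with_fips AS
    SELECT
      * EXCEPT (county_fips_code),
      CASE
        -- keep real codes; treat blank or NULL as missing
        WHEN county_fips_code IS NOT NULL AND county_fips_code != ''
          THEN county_fips_code
        WHEN state_name = 'New York' AND county = 'New York City' THEN '1000000'
        WHEN state_name = 'Missouri' AND county = 'Kansas City' THEN '2000000'
        ELSE county_fips_code
      END AS county_fips_code
    FROM `bigquery-public-data.covid19_nyt.us_counties`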

Each step in the construction and remediation of the data sets creates a new view that builds on top of the old views. Originally I was materializing tables for intermediate steps, but that unnecessarily consumes space and buys little performance in an environment with so few queries and such infrequent updates to the data. The NYT data is updated daily, the ACS data does not change, and the county geographic data is updated once a year.

Instead, I materialize one large reporting table daily, at the end of the process for use with tools like Google’s Data Studio and BigQuery GeoViz. Mostly, I don’t actually use this table, but more on that later.
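The materialization step itself is small. A sketch, with placeholder names for my dataset and top-level view:

    CREATE OR REPLACE TABLE my_dataset.covid_county_report AS
    SELECT * FROM my_dataset.covid_county_report_view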

Adding Derived Metrics

The raw NYT data includes only the cumulative deaths and cases reported by each county each day. The first thing I wanted to do was compute some basic deltas and acceleration rates:

  1. Compute the daily incremental deaths and confirmed cases.
  2. Compute the 7-day trailing number of new deaths and cases.
  3. Compute the 7-to-14 day trailing number of new deaths and cases.
  4. Divide 2 by 3 to get the week-on-week acceleration.

I first compute (1) and (2) using LAG OVER PARTITION, and then in a separate view divide these values to get the acceleration.
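Here is a sketch of steps 1 through 3 collapsed into one view, building on the FIPS fix-up view from earlier (placeholder names again). GREATEST(..., 0) is the floor for the occasional negative day, and the acceleration (step 4) is just the ratio of the two trailing windows, computed in a separate view.

    CREATE OR REPLACE VIEW my_dataset.nyt_daily_metrics AS
    WITH daily AS (
      SELECT
        county_fips_code, state_name, county, date,
        confirmed_cases, deaths,
        -- (1) daily increments, floored at zero to absorb downward revisions
        GREATEST(confirmed_cases - IFNULL(LAG(confirmed_cases) OVER w, 0), 0) AS new_cases,
        GREATEST(deaths - IFNULL(LAG(deaths) OVER w, 0), 0) AS new_deaths
      FROM my_dataset.nyt_with_fips
      WINDOW w AS (PARTITION BY state_name, county ORDER BY date)
    )
    SELECT
      *,
      -- (2) trailing 7 days and (3) the 7 days before that;
      -- acceleration = cases_last_7d / cases_prior_7d in a later view
      SUM(new_cases) OVER w7 AS cases_last_7d,
      SUM(new_cases) OVER w7_14 AS cases_prior_7d
    FROM daily
    WINDOW
      w7    AS (PARTITION BY state_name, county ORDER BY UNIX_DATE(date)
                RANGE BETWEEN 6 PRECEDING AND CURRENT ROW),
      w7_14 AS (PARTITION BY state_name, county ORDER BY UNIX_DATE(date)
                RANGE BETWEEN 13 PRECEDING AND 7 PRECEDING)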

I did these same steps (2–4) several times, time-shifted by a week, so that I have new cases 4 weeks back, and 3 bi-weekly values for case acceleration.

Sunburst chart of new cases in the last 7 days by state, as of 2020/06/18. The size of a state’s wedge is its fraction of the total number of cases in the US in the last 7 days. A state’s color reflects the week-over-week acceleration of cases in the state. Interactive version.

Using the Data

At first, I was using this data set in Google’s Data Studio, and in Tableau, to do basic data exploration and verify that I could see in the data what I expected to see. That was useful as far as it went, and built my confidence in the data set, but ultimately those are tools that allow you to build visualizations quickly without learning to code.

The thing is, though, that I already know how to code, and knowing how to code lets you save little bits of functionality and numeric processing and come back to them later and combine them in interesting ways. It also allows you to build much more complex visualizations.

These days I’m using BigQuery’s IPython magics to pull this data set periodically into a Jupyter notebook. From there I use Pandas to derive new metrics and to transform data.

I have been using Pandas’ built-in Matplotlib integration to do quick-and-dirty visualizations, and Plotly to create interactive JavaScript plots.

Using the GIS boundaries for the counties, I can create choropleth maps in Plotly to show various metrics for counties geographically.

I am not a big believer in geographic maps for this purpose. They are maps of land, and the virus attacks people. People are very unevenly distributed across the land in the U.S. — 20% of the counties contain 80% of the people — and map charts often don’t give a good sense of the problem.

Choropleth map of the US showing the ratio of 0–7 day trailing cases over the previous week. Many counties do not have enough data for this and are blank.
0–14 day case acceleration in estimated Kansas City and area.
0–14 day case acceleration in the synthetic county of New York City.

I was also gratified to see that my geometry for Kansas City and the surrounding area fairly closely matches choropleth maps from the Times:

New York Times’ geography for the Kansas City area.

I can also use ACS data to identify counties with higher percentages of people of color, who are at higher risk of death from the disease, as in this tree map of total cases from 5/1:

So far, however, the big wins from joining in this data have come from two fields: total_pop from the ACS, and area_land_meters from the GIS data. With the first you can get per-capita infection and death rates, and with both of them together you can get population density.

Other Data Sets

I also have created mapping tables that can be used to join this data to Google’s COVID-19 Mobility Report data, and to an MIT county-level data set for election data. Relating the NYT data to mobility data has so far not been very fruitful, but joining each separately to the county geography, election, and ACS data has given me a bunch of useful insights!

Conclusion

I hope this has been helpful to someone out there. If so, or if you have a project to which I might usefully contribute, let me know!


I am a software craftsman and leader from Texas, now living in the Bay Area.