
A Non-Epidemiologist’s Statistical Examination of US State-Level Open Data Sources on COVID-19…


An attempt to synthesize a complex network of datasets

Image by: Nick Fewings

On March 29, 2021, I received an e-mail invitation from the Indiana State Department of Health to take the COVID-19 vaccine. I had not been paying close attention to the vaccine rollouts, so my objective here is to better understand the vaccination campaigns and see what can be learned from them. At the time of writing, about 25% of the US population has been vaccinated. I am trying to get some background and separate fact from falsehood in the world of COVID-19 misinformation. From my early examinations of the COVID datasets, I am also hoping to see an evolution in data practices. This has been a well-funded, best-shot effort at tracking a very important real-world phenomenon, so it is a good example of data structures and "best practices". As a data monkey, I find it hard to ignore. All the code is public, found here:

Data Sources

  1. NYTimes Covid-19 State-level Cases Tracker (cvd) maintains daily cumulative counts of cases and deaths at the state level (data definitions)
  2. Oxford Covid-19 Government Response Tracker (OxCGRT aka oxf) describes the timelines of government policies in less granular detail (data definitions)
  3. OWID (vax) maintains daily counts of vaccinated and fully vaccinated people, combining multiple sources (data definitions)
  4. US Census State-level Population (pop) provides the most comprehensive population estimates for the United States, based on the US Census efforts (data definitions)

Joining these data sources required cleaning and transformation. This blog summarizes the code for these transforms without going into much detail; the full code is in the public repo.

Data Quality Report

NYTimes Covid-19

Starts: 2020–01–21 Ends: 2021–03–31 Today: 2021–04–01 Full State Record on Vaccination Response variable:

  • start: 2020–03–13
  • end: 2021–03–31
  • not full: [<=2020–03–12]
  • subset of states always reporting: [Washington]

Conclusions:

  1. We remove the data prior to 2020–03–13.

OxCGRT

Starts: 2020–01–01 Ends: 2021–03–30 Today: 2021–04–01 Full State Record on Vaccination Response variable:

  • start: 2020–09–14
  • end: 2021–03–29
  • not full: [2020–12–08, >= 2021–02–12]

Conclusions:

  1. We assume that prior to 2020–09–14, there were no known vaccinations, thus those NaNs are actually 0s

OWID Vaccination Data (State-level Data)

Starts: 2020–12–20 Ends: 2021–03–31 Today: 2021–04–01 Full State Record on Vaccination Response variable:

  • start: 2020–09–14
  • end: 2021–03–29
  • not full: [2020–12–08, >= 2021–02–12]

Conclusions:

  • This dataset may prove useful for some questions besides when each state started providing vaccinations.
  • Can’t be used to determine early timeline of vaccination campaigns, namely start dates.
  • Reflects a change in how the CDC attributes vaccinations to states, implemented starting Feb 19 2021. This will be glossed over in this blog, but the strategy required to account for this nightmare is worthy of a second article. Code is available on GitHub.

Non-Exhaustive Summary of Cleaning Tasks

  1. The key indexes on which we join the datasets are date and state name, so we conform dates and state names across datasets.
  2. Encoding key variables as categoricals is required in the Oxford dataset, which describes the rough timeline of government policy implementations, including vaccination campaigns, at the US state level.
  3. Since the vaccination datasets start after the COVID case tracking datasets, we fill NaNs with 0s for dates before the vaccination datasets are available.
  4. Anomaly removal. In the OWID dataset, the starting observations are anomalous. I think this is because the data doesn’t actually cover the true start dates, so the first observations make an unjustified jump from zero.
  5. Population adjustment. Each of the source datasets may or may not already provide population-adjusted estimates. We use the non-adjusted source observations of vaccinations, new cases and deaths, and take all population estimates from the independently sourced 2019 US Census.
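
As a rough illustration of tasks 1 and 3, here is a minimal sketch on toy data. The frames, values, and date formats below are made up for the example and are not the repo’s actual cleaning code; only the column names `state`, `location` and `people_vaccinated` echo the real datasets.

```python
import pandas as pd

# Toy stand-ins for two sources; values and formats are illustrative only.
cases = pd.DataFrame({
    "date": ["2020-12-13", "2020-12-14", "2020-12-15"],
    "state": ["Indiana", "Indiana", "Indiana"],
    "cases": [100, 110, 125],
})
vax = pd.DataFrame({
    "date": ["12/15/2020"],
    "location": ["indiana "],
    "people_vaccinated": [40.0],
})

# Task 1: conform the join keys (dates to Timestamps, names to one spelling).
cases["date"] = pd.to_datetime(cases["date"])
vax["date"] = pd.to_datetime(vax["date"], format="%m/%d/%Y")
vax["state"] = vax["location"].str.strip().str.title()

# Task 3: left-join onto the longer case series, then treat the days before
# the vaccination data begins as zero vaccinations.
merged = cases.merge(
    vax[["date", "state", "people_vaccinated"]],
    on=["date", "state"], how="left",
)
merged["people_vaccinated"] = merged["people_vaccinated"].fillna(0.0)
print(merged)
```

The left join keeps every row of the case series, which is why the zero-fill is needed at all.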

Investigative Objectives

The questions are defined at the outset to focus the work and avoid getting lost in the sea of data. There are likely thousands of data sources by now, forming a complex web of interdependency. Measuring that interdependency in some meaningful way would be an interesting follow-up task. The 4 datasets were chosen to answer our questions, but they also include many other variables, so focus is key.

The investigation will center on 3 main questions:

  1. When did each state begin vaccinations, who was first, last, average? Statistically significant?
  2. Which states were fastest, slowest, average at vaccinating and fully vaccinating people? Statistically significant?
  3. Which states saw the biggest, smallest, average impact of vaccination on reducing new cases? Statistically significant?

The investigation will attempt to account for the following major confounds:

  1. What percentage of population had tested positive for coronavirus at the time that vaccinations start? (e.g. what was the size of the eligible population?)
  2. How much inter-personal high risk activity was happening in the population?
  3. How much mis-diagnosis was happening? How much noise in the dataset? Which states were best at reporting cases?
  4. Which variant of the vaccination was administered?
  5. To what extent did behavior change impact the reduction in cases/deaths vs the vaccine?

On what date did each state begin vaccinations, who was first, last, average? Statistically significant?

Procurement for the coronavirus vaccine occurred on a state level, and not all states started at the same time. Here we take each state’s start date to be the first date on which its vaccination policy is no longer "0-No Availability". The measure of statistical significance is based on the distribution of these dates, and we will describe anomalies and statistically significant anomalies. To chart the density, we count the number of states that started vaccinating on each day, so we first need to compute the date on which vaccinations became available in each state.

Please feel free to leave comments if you know more about how this process was run, and why some states were slower than others to administer their first vaccines. The author also leaves open the possibility that the Oxford dataset is wrong about the specific dates.

Early, Normal and Late States In Vaccination Access

  • First state: Oklahoma (2020–12–11)
  • Last state: Arizona (2021–01–05)
  • Average date: 2020–12–15
  • Median date: 2020–12–15
  • Standard deviation: 3.74 days

Anomalies and Statistical Significance

The typical state started vaccinating its population on Dec 15, 2020. There were 3 states that started late, starting with the latest: Arizona, Nebraska and Missouri (in that order). Arizona and Nebraska could be considered statistically significant anomalies as they were more than 3 standard deviations from the mean.

import pandas as pd

# For each state, take the earliest date with a non-zero vaccination policy code
vax_start_dates_oxf = []
for state, data in oxf.df.groupby('state_name'):
    a = data[data['vaccination_policy'] != 0.0]
    b = a.sort_values('date').iloc[0]
    c = b.to_dict()
    vax_start_dates_oxf.append(c)
vax_start_dates_oxf = pd.DataFrame(vax_start_dates_oxf)
# Histogram of start dates, measured in days after the first state to start
first_vax = vax_start_dates_oxf.date.min()
vax_start_dates_oxf['days_from_first'] = vax_start_dates_oxf['date'].apply(lambda x: (x-first_vax).days)
vax_start_dates_oxf['days_from_first'].hist(
    bins=vax_start_dates_oxf['days_from_first'].max(),
    figsize=(16,3), grid=True,
)
Z-score for state vaccination start date (Image by author)
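
The z-score rule described above (flagging states more than 3 standard deviations from the mean start date) could be sketched as follows. The state names and day offsets are invented for illustration; only the first date, 2020–12–11, comes from the text, and the sample standard deviation is an assumption about how the spread was measured.

```python
import pandas as pd

# Hypothetical offsets (days after the first state started); not real data.
offsets = [0, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 25]
start = pd.DataFrame({
    "state_name": [f"State {i:02d}" for i in range(len(offsets))],
    "date": pd.Timestamp("2020-12-11") + pd.to_timedelta(offsets, unit="D"),
})

# Z-score of each start date relative to the distribution across states.
days = (start["date"] - start["date"].min()).dt.days
start["z_score"] = (days - days.mean()) / days.std()

# Flag states more than 3 standard deviations from the mean start date.
start["anomaly"] = start["z_score"].abs() > 3
print(start.loc[start["anomaly"], ["state_name", "date", "z_score"]])
```

Note that a single extreme value inflates the standard deviation it is measured against, so with few states this rule is conservative.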

Vaccination Pace: which states were fastest, slowest, average at vaccinating and fully vaccinating people? Statistically significant?

Two of the loaded datasets contain information about state-wide rollout pace. Each state’s objective is to vaccinate as many of its citizens as possible, as soon as possible, so for every day we will examine a "pace leaderboard": a function of the date that returns the list of states ordered by the number of vaccinated subjects and fully vaccinated subjects. For this, we will use the OWID dataset, which contains counts of vaccinated people. We will add two new values to the dataset:

  1. Vaccinations administered by the 2019 census population of each state
  2. Days from start

First, we look at vaccinated subjects, then discuss the distributions that underlie the rank ordering in detail to examine significance.

The variables we investigate are derived from the underlying data found in OWID.

adj_vax['adj_total_vaccinations'] = adj_vax.apply(
    lambda x: x['total_vaccinations'] / float(state_to_pop[x['location']]),
    axis=1
)
adj_vax['people_vaccinated_%pop'] = adj_vax.apply(
    lambda x: x['people_vaccinated'] / float(state_to_pop[x['location']]),
    axis=1
)
adj_vax['people_fully_vaccinated_%pop'] = adj_vax.apply(
    lambda x: x['people_fully_vaccinated'] / float(state_to_pop[x['location']]),
    axis=1
)
adj_vax['days_from_start'] = adj_vax.apply(
    lambda x: (x['date'] - state_to_start[x['location']]).days,
    axis=1
)
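
Given population-adjusted columns like those above, the daily "pace leaderboard" might be sketched as below. The function name `leaderboard` and the toy values are hypothetical and only illustrate the ordering step.

```python
import pandas as pd

# Toy population-adjusted data for three made-up states on two dates.
adj = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01"] * 3 + ["2021-03-02"] * 3),
    "location": ["A", "B", "C"] * 2,
    "people_vaccinated_%pop": [0.10, 0.12, 0.08, 0.11, 0.12, 0.13],
})

def leaderboard(df, on_date, col="people_vaccinated_%pop"):
    """Return the states ordered from fastest to slowest on one date."""
    day = df[df["date"] == pd.Timestamp(on_date)]
    return day.sort_values(col, ascending=False)["location"].tolist()

print(leaderboard(adj, "2021-03-01"))
print(leaderboard(adj, "2021-03-02"))
```

Passing `col="people_fully_vaccinated_%pop"` would give the corresponding ranking for full vaccination, assuming that column is present.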

Pace Leads and Statistical Significance

To determine whether a lead is statistically significant, we use the distribution of the "pace lead", defined as the % difference between the (k-1)-th and k-th observed values of the pace measure in each day’s ranking. That gives about 50 observations per day, so N=40*50=2000 observations, which borders on a significant sample size for this question. Because the pace lead is calculated from the ordered list of daily ranks, we are modeling via order statistics, and we examine the assumption that "a lead is a lead" by comparing the distributions of the j-th and k-th pace leads where j != k.
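
A minimal sketch of the pace-lead calculation on toy data follows. The helper name `pace_leads` is hypothetical, and expressing the gap relative to the trailing (k-th) state is an assumption, since the text does not say which value is the denominator.

```python
import pandas as pd

def pace_leads(df, col="people_vaccinated_%pop"):
    """Relative gap between each ranked state and the state one place ahead."""
    out = []
    for date, day in df.groupby("date"):
        ranked = day.sort_values(col, ascending=False).reset_index(drop=True)
        ranked["rank"] = ranked.index + 1
        # Lead of rank k-1 over rank k, relative to the rank-k value (assumption).
        ranked["pace_lead"] = (ranked[col].shift(1) - ranked[col]) / ranked[col]
        out.append(ranked.iloc[1:])  # rank 1 has nobody ahead of it
    return pd.concat(out, ignore_index=True)

# Made-up states and values for a single day.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01"] * 3),
    "location": ["A", "B", "C"],
    "people_vaccinated_%pop": [0.10, 0.12, 0.08],
})
leads = pace_leads(toy)
print(leads[["date", "location", "rank", "pace_lead"]])
```

Grouping the result by `rank` would then give one distribution of leads per rank position, which is what the "a lead is a lead" comparison needs.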

Data Quality Issue Arising from CDC policy change

(Bloomberg, 2021–01–21) describes the issue we observe, where the cumulative sums of people_vaccinated and people_fully_vaccinated are decreasing. The effective date of the CDC policy change in the OWID dataset is not the same for every state, which complicates comparisons between state vaccine regimes. I originally detected the effect of this policy change by looking for instances where the change-over was substantial enough to cause people_vaccinated to be less than it was the day before. For the states where the series becomes non-monotonic, that day was almost certainly the start, and it makes sense that D.C. would experience a large change given that it has the largest concentration of federal employees relative to population. However, that is most likely not the full extent of the effect; other states may have had more subtle but still significant impacts. Has anyone developed methodologies for accounting for and quantifying the impact of the CDC policy change?

Some of the challenges found in dealing with the CDC policy change were:

  • States began the new CDC counting procedure on different days
  • States experienced different effect sizes, some of which are hard to detect. Washington D.C. experienced a huge effect because the CDC began counting federally administered vaccines differently, while other states may have had very little impact. So how do we measure this?

As a result, the countermeasure centers on a rule that identifies the effective start date of the CDC policy for each state, so we can cohort the states by whether their observations fall before or after the policy. We examine anomalies in dod_%diff_state_people_vaccinated (the day-over-day % difference in people_vaccinated) by calculating its standard deviation after the date the CDC policy was announced to take effect, and flagging values more than 2 standard deviations from its average. The date that worked best was 2021–02–17.
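
The detection rule could be sketched roughly as below for a single state’s cumulative series. The function name `flag_policy_breaks` and the toy series are hypothetical; the repo’s actual rule may differ in its details.

```python
import pandas as pd

def flag_policy_breaks(series, policy_date, n_std=2.0):
    """Flag day-over-day % changes far from the post-policy average.

    series: cumulative people_vaccinated for one state, indexed by date.
    """
    dod = series / series.shift(1) - 1.0            # day-over-day % difference
    after = dod[dod.index >= pd.Timestamp(policy_date)]
    mu, sigma = after.mean(), after.std()
    return after[(after - mu).abs() > n_std * sigma]

# Toy series: steady 2% daily growth with one artificial 10% drop.
dates = pd.date_range("2021-02-10", "2021-02-28", freq="D")
level, values = 1000.0, []
for d in dates:
    level *= 0.90 if d == pd.Timestamp("2021-02-20") else 1.02
    values.append(level)
toy = pd.Series(values, index=dates)

breaks = flag_policy_breaks(toy, "2021-02-17")
print(breaks)
```

A drop in a cumulative series is the unambiguous signal here; subtler effects that merely slow growth would need a tighter threshold or a per-state baseline.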

Current Standings

Naturally, the vaccine rollout is working better in some places than in others. The big question is why some states are so much faster than others, and whether some bottlenecks were based on political standing.

Fastest vaccine rollouts

Vaccinated at least once

The fastest to vaccination at least once as a percentage of the 2019 US census population. (as of: 2021–04–07) (Image by author)

Fully Vaccinated

The fastest to full vaccination as a percentage of the 2019 US census population. (as of: 2021–04–07) (Image by author)

Slowest vaccine rollouts

Vaccinated at least once

The slowest to vaccination at least once as a percentage of the 2019 US census population. (as of: 2021–04–07) (Image by author)

Fully Vaccinated

The slowest to full vaccination as a percentage of the 2019 US census population. (as of: 2021–04–07) (Image by author)

The leaders in vaccination change when we look at the percent of population fully vaccinated vs percent of population vaccinated at least once. While New Hampshire and Connecticut have been very good at vaccinating people at least once, they have not done as well vaccinating people to completion. Especially in the case of New Hampshire which has been 2nd best at vaccinating at least once, but is 22nd at vaccinating to completion. The laggards also change. For example, Mississippi is 3rd slowest in the country to vaccinate its population at least once, but it is only 8th slowest to vaccinate to completion. Other notable shifts include Washington D.C. which is not so bad, only 15th worst at vaccinating at least once, but is 5th worst at vaccinating to completion. For sure, the most extreme contrast is with New Hampshire, which is excellent at vaccinating once, but just ok at vaccinating to completion.

Which states saw the biggest, smallest, average impact of full vaccination on reducing new cases? Statistically significant?

The claim is that the Moderna and Pfizer vaccines are around 90% effective in preventing the virus. So we will chart the percent of the population that has been fully vaccinated against the number of new cases, hoping to see an immediate and obvious trend that reinforces these claims. We could look at the states that have been most effective at vaccinating to completion: New Mexico, South Dakota and Alaska. However, all of these top 3 have a low population density. Connecticut and New Jersey may be better examples because they have done well at fully vaccinating their populations and are relatively densely populated and active.

Solution Design:

  1. Join the cvd dataset from NYTimes with the adj_vax dataset. We want a left join with cvd on the left, so we keep the index from cvd and add in the observations from adj_vax
  2. Look at two columns: cases from NYTimes and people_fully_vaccinated_%pop from OWID

Accounting for existing immunity, herd and vaccination

Two main sources of immunity significantly reduce the subset of individuals who are vulnerable to contracting COVID-19. Herd immunity, as used here, arises after people have contracted the virus, when they are immune for a period of time. According to Susan Hopkins of the SIREN study at Public Health England in London, as summarized in Nature (14 Jan 2021), having had COVID-19 makes it virtually impossible to get it again for several months. Vaccination immunity is when people are less likely to contract the virus as a result of being jabbed.

Taking the disease timeline into account led to the implementation of 3 quantities, which enable an estimate of vulnerable people per state. The analysis is complicated by the aggregation of the dataset because key information is missing: we don’t know how many patients diagnosed at time t died versus recovered. The outcome of the disease is not known for weeks after it is first identified, and even in cases where "someone should know" whether a patient who tested positive recovered or died, we don’t have the information. Maybe it is available somewhere, but at the time of publishing this quantity was not known.

We use 3 assumptions to control the immunized population, as follows:

  • vax_imm_lag = 14 : days after full vaccination before a person is counted as immune
  • case_incub_pd = 10 : days between contracting covid-19 and becoming symptomatic
  • case_imty_pd = 4 : months a person is assumed immune after having covid, after which they can get it again
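
To make the three assumptions concrete, here is a minimal sketch on a toy series that keeps everything in head counts. It is not the article’s exact implementation: the function name `estimate_immune` is hypothetical, it works from cumulative columns, and it ignores deaths and the overlap between recovered and vaccinated people.

```python
import pandas as pd

vax_imm_lag = 14    # days after full vaccination before immunity counts
case_incub_pd = 10  # days between contracting covid-19 and symptoms
case_imty_pd = 4    # months of assumed immunity after infection

def estimate_immune(df):
    """Estimate immune head count per day for one state.

    df needs cumulative 'cases' and 'people_fully_vaccinated' columns,
    one row per day, sorted by date.
    """
    # Cases old enough to count, minus cases whose immunity has lapsed.
    recovered = (
        df["cases"].shift(case_incub_pd, fill_value=0)
        - df["cases"].shift(case_imty_pd * 30, fill_value=0)
    )
    vaccinated = df["people_fully_vaccinated"].shift(vax_imm_lag, fill_value=0)
    return recovered + vaccinated

# Toy state: 10 new cases a day, 5 full vaccinations a day from day 100 on.
n = 130
toy = pd.DataFrame({
    "cases": [10 * (i + 1) for i in range(n)],
    "people_fully_vaccinated": [max(0, 5 * (i - 99)) for i in range(n)],
})
toy["people_immd"] = estimate_immune(toy)
print(toy["people_immd"].iloc[-1])
```

Subtracting this estimate from the census population would give the vulnerable population used as a denominator later on.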

We implement logic leveraging these quantities below:

cvd_and_vax = pd.merge(
    left=cvd.df, right=adj_vax, 
    how='left', left_on=['date', 'state'], 
    right_on=['date', 'location']
).copy()
cvd_and_vax[['total_vaccinations', 'total_distributed', 'people_vaccinated',
       'people_fully_vaccinated_per_hundred', 'total_vaccinations_per_hundred',
       'people_fully_vaccinated', 'people_vaccinated_per_hundred',
       'distributed_per_hundred', 'daily_vaccinations_raw',
       'daily_vaccinations', 'daily_vaccinations_per_million',
       'share_doses_used']] = cvd_and_vax[['total_vaccinations', 'total_distributed', 'people_vaccinated',
       'people_fully_vaccinated_per_hundred', 'total_vaccinations_per_hundred',
       'people_fully_vaccinated', 'people_vaccinated_per_hundred',
       'distributed_per_hundred', 'daily_vaccinations_raw',
       'daily_vaccinations', 'daily_vaccinations_per_million',
       'share_doses_used']].fillna(0.,)
cvd_and_vax = cvd_and_vax.dropna(subset=['state'])
del cvd_and_vax['location']
cvd_and_vax['cases_%pop'] = cvd_and_vax.apply(
    lambda x: compute_fn(
        a=x['cases'],
        b=apply_dic(state_to_pop, x['state']),
        fn=lambda a,b: a/b),
    axis=1
)
cvd_and_vax['deaths_%pop'] = cvd_and_vax.apply(
    lambda x: compute_fn(
        a=x['deaths'],
        b=apply_dic(state_to_pop, x['state']),
        fn=lambda a,b: a/b),
    axis=1
)
cvd_and_vax.index = cvd_and_vax['date']
### state-level calcs ###
# an adjusted population measurement accounting for people who have been immunized and 
# we have a challenge adjusting the population because if we adjust by deaths and cases, we are double counting
# this is a cost of having aggregated non-subject level data sets.  It's impossible to know how to account for 
# the cases and deaths in a harmonious way
# - increase people_immd by people_fully_immunized on (t-vax_imm_lag) days ago
# - increase people_immd by cases on (t-case_incub_pd) days ago
# - decrease people_immd by cases that happened more than case_imty_pd*30 days ago
# - don't do anything with deaths because not that many people have died in the grand scheme
vax_imm_lag = 14 # vaccinated people are immune starting X days after the fully vaccinated
case_incub_pd = 10 # days between people contract covid-19 and  start being symptomatic
case_imty_pd = 4 # months after a person has covid, they can get it again
state_level_cvd_and_vax = []
for state, data in cvd_and_vax.groupby('state'):
    tmp = data.copy()
    tmp['new_daily_cases'] = tmp['cases'] - tmp['cases'].shift(1)
    tmp['new_daily_deaths'] = tmp['deaths'] - tmp['deaths'].shift(1)
    tmp['people_fully_vaccinated_%pop_immLagAgo'] = tmp['people_fully_vaccinated_%pop'].shift(vax_imm_lag)
    tmp['new_daily_people_fully_vaccinated'] = tmp['people_fully_vaccinated'] - tmp['people_fully_vaccinated'].shift(1)
    # the case terms are daily counts; the vaccinated term is a population fraction
    tmp['people_immd'] = (
        tmp['new_daily_cases'].shift(case_incub_pd)
        + tmp['people_fully_vaccinated_%pop_immLagAgo']
        - tmp['new_daily_cases'].shift(case_imty_pd*30)
    )
    tmp['people_immd_%pop'] = tmp.apply(
        lambda x: compute_fn(
            a=x['people_immd'],
            b=apply_dic(state_to_pop, x['state']),
            fn=lambda a,b: a/b),
        axis=1
    )

    state_level_cvd_and_vax.append(tmp)
state_level_cvd_and_vax = pd.concat(state_level_cvd_and_vax, axis=0)
state_level_cvd_and_vax['new_daily_cases_%pop'] = state_level_cvd_and_vax.apply(
    lambda x: compute_fn(
        a=x['new_daily_cases'],
        b=apply_dic(state_to_pop, x['state']),
        fn=lambda a,b: a/b),
    axis=1
)
state_level_cvd_and_vax['new_daily_deaths_%pop'] = state_level_cvd_and_vax.apply(
    lambda x: compute_fn(
        a=x['new_daily_deaths'],
        b=apply_dic(state_to_pop, x['state']),
        fn=lambda a,b: a/b),
    axis=1
)
# compute vulnerable population which is the (census "POP" - people_immd)
state_level_cvd_and_vax['vpop'] = state_level_cvd_and_vax.apply(
    lambda x: compute_fn(
        a=apply_dic(state_to_pop, x['state']),
        b=x['people_immd'],
        fn=lambda a,b: a-b),
    axis=1
)
state_level_cvd_and_vax['new_daily_cases_%vpop'] = state_level_cvd_and_vax.apply(
    lambda x: compute_fn(
        a=x['new_daily_cases'],
        b=x['vpop'],
        fn=lambda a,b: a/b),
    axis=1
)
state_level_cvd_and_vax['new_daily_deaths_%vpop'] = state_level_cvd_and_vax.apply(
    lambda x: compute_fn(
        a=x['new_daily_deaths'],
        b=x['vpop'],
        fn=lambda a,b: a/b),
    axis=1
)

Essentially we’re looking at the percent of the population that was immune to COVID-19 14 days ago against the new daily cases and the new daily deaths. However, instead of measuring new cases and new deaths by the total population, we’re subtracting the people who are fully vaccinated and the people who had covid within the last 4 months. The comparison across all states is expressed in the following charts:

import math
import matplotlib.pyplot as plt
from scipy.stats import linregress

def do_linregress(df, xcol, ycol):
    # fit y = slope*x + intercept on rows where both columns are present
    expdata = df.dropna(subset=[xcol, ycol], how='any').copy()
    return linregress(
        x=expdata[xcol],
        y=expdata[ycol]
    )

ax1 = state_level_cvd_and_vax.plot.scatter(
    x='people_fully_vaccinated_%pop_immLagAgo', y='new_daily_cases_%vpop', figsize=(16,7), s=2)
cases_reg = do_linregress(
    df=state_level_cvd_and_vax, 
    xcol='people_fully_vaccinated_%pop_immLagAgo', 
    ycol='new_daily_cases_%vpop'
)
plt.plot(
    state_level_cvd_and_vax['people_fully_vaccinated_%pop_immLagAgo'], 
    state_level_cvd_and_vax['people_fully_vaccinated_%pop_immLagAgo']*cases_reg.slope+cases_reg.intercept,
    'g--', label='linear fit: slope={}, intercept={}, r^2={}'.format(cases_reg.slope,cases_reg.intercept,math.pow(cases_reg.rvalue,2)))
ax1.legend(loc="upper right")
ax1.set_ylim(0.,.00125)
print('Assumption: immLag = {}: days after second vaccination for immunity'.format(vax_imm_lag))
The percent of the population fully vaccinated as of 14 days ago versus the prevalence of new daily cases in the non-immune population. (Image by author)

We see a clear downward slope, indicating that as states get more vaccinated they experience a lower prevalence of COVID-19. This is great news, consistent with what we hope: that the vaccine is working. One concern with this approach is that while the overall trend is negative, individual states see a different reality. Let’s drill down and compare some of the best vaccinators with the worst.

High population density states: Leaders, New Jersey (8) and Connecticut (5) vs Laggards, District of Columbia (-5) and Georgia (-1)

Denser populations should see more concentrated effects because people come into contact with each other more (Coskun, Yildirim and Gunduz, 2021). In that study, 94% of the spread is explained by population density and wind. New Jersey and Connecticut are dominated by the New York City metropolitan area, while D.C. is D.C. and Georgia is Atlanta.

Connecticut was the 5th fastest, having fully vaccinated 24.1% of its population. New Jersey was the 8th fastest, having fully vaccinated 22.6% of its population. We see that Connecticut has a slightly negative effect from the vaccine on new cases, and we notice that the explained variance of the trend is very low. New Jersey is actually seeing a positive trend in new cases as vaccination increases.

Connecticut: the percent of the population fully vaccinated as of 14 days ago versus the prevalence of new daily cases in the non-immune population. (Image by author)
New Jersey: the percent of the population fully vaccinated as of 14 days ago versus the prevalence of new daily cases in the non-immune population. (Image by author)

Georgia was the slowest state in the nation, having fully vaccinated 14.1% of its population. The District of Columbia was the 5th slowest, having fully vaccinated 16.0% of its population.

Georgia: the percent of the population fully vaccinated as of 14 days ago versus the prevalence of new daily cases in the non-immune population. (Image by author)
District of Columbia: the percent of the population fully vaccinated as of 14 days ago versus the prevalence of new daily cases in the non-immune population. (Image by author)

The strongest effect was in Georgia which has been the slowest full vaccinator in the country. This suggests that much of the explanation lies in other factors aside from the vaccine.
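
One way to run this comparison across all states at once is to fit the same linear trend per state and rank the slopes. This is a sketch on made-up numbers; `per_state_fits` is a hypothetical helper, and the column names simply mirror those used earlier.

```python
import pandas as pd
from scipy.stats import linregress

def per_state_fits(df, xcol, ycol, min_points=3):
    """Fit ycol ~ xcol separately for each state; most negative slope first."""
    rows = []
    for state, data in df.groupby("state"):
        d = data.dropna(subset=[xcol, ycol])
        if len(d) < min_points or d[xcol].nunique() < 2:
            continue  # too little variation to fit a line
        fit = linregress(d[xcol], d[ycol])
        rows.append({"state": state, "slope": fit.slope,
                     "r2": fit.rvalue ** 2, "pvalue": fit.pvalue})
    return pd.DataFrame(rows).sort_values("slope").reset_index(drop=True)

# Two invented states: X trends down with vaccination, Y trends slightly up.
toy = pd.DataFrame({
    "state": ["X"] * 4 + ["Y"] * 4,
    "people_fully_vaccinated_%pop_immLagAgo": [0.0, 0.1, 0.2, 0.3] * 2,
    "new_daily_cases_%vpop": [0.004, 0.003, 0.002, 0.001,
                              0.001, 0.002, 0.001, 0.002],
})
fits = per_state_fits(toy, "people_fully_vaccinated_%pop_immLagAgo",
                      "new_daily_cases_%vpop")
print(fits)
```

Reporting `r2` next to each slope matters here, since a steep slope with very low explained variance says little about the vaccine’s effect.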

Conclusions

After examining all the states in this way, the evidence for the vaccine’s ability to mitigate new cases appears tenuous at best in this data. There is no clear and obvious effect to be seen, and the comparison between some of the best and worst vaccinators in the country has yielded a confusing result. The overall trend of new cases is negative; however, if that is not because of the vaccine, it may be just a temporary downtick resulting from the myriad government policies in place. As those policies expire, we may see a resurgence of the virus. We accounted for some of the big confounds, but many more exist. I can see how examining and keeping track of the various data collection, data cleaning and modeling concerns here could be a full-time job.

