
CovidSimNet: Introducing a Method to Determine “Similar” COVID States

Using data to determine a COVID state similarity matrix to estimate policy effects on similar states and "high risk" states

Data has always played an integral part in analyzing and testing hypotheses. With the advent of highly optimized and easy-to-use frameworks, data is collected almost every second. The world’s most valuable resource is no longer oil, but data (The Economist). It is estimated that by 2025, about 463 exabytes of data (one exabyte holds 50,000 years of DVD-quality video) will be generated each day.

In 1854, John Snow, an English physician and one of the founders of modern epidemiology, discovered that the source of a cholera outbreak was a contaminated public water pump. He did so by plotting cases of cholera on a map and observing that the majority of cases occurred among people living close to the pump.

Photo by Mika Baumeister on Unsplash

Introduction

In this article, we present a plausible method that can be used to identify similar COVID-hit states by creating a state-by-state similarity matrix. This matrix can further be used for estimating policy effects for many states at once and also for determining high-risk states.

The Data

We will be using COVID data for each US state, comprising confirmed cases, deaths, and population. We retrieve the dataset from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.

For each state and date, we have the confirmed cases (similarly for deaths) (Image by Author)

To extract the relevant information from these Pandas DataFrames, we first aggregate the confirmed and death numbers for each date and then merge the result with the population of each state. After performing all the preprocessing and aggregations, we obtain the final DataFrame.

Our final DataFrame (Image by Author)

Now we derive normalized quantities for the given columns by scaling the values between 0 and 1. We use the following min-max formula and the accompanying code to compute the required values.

Image by Author
# Min-max scale each column to the [0, 1] range
def min_max(col):
    return (col - col.min()) / (col.max() - col.min())

final['death_normalized'] = min_max(final['total_deaths'])
final['confirmed_normalized'] = min_max(final['total_confirmed'])
final['population_normalized'] = min_max(final['Population'])
The normalized DataFrame (Image by Author)
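As a quick sanity check, the min-max scaling above can be verified on a toy Series (the values here are invented for illustration):

```python
import pandas as pd

# Toy illustration of min-max scaling: the smallest value maps to 0,
# the largest to 1, and everything else falls in between
s = pd.Series([10.0, 20.0, 30.0])
scaled = (s - s.min()) / (s.max() - s.min())
print(scaled.tolist())  # → [0.0, 0.5, 1.0]
```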

Note: To use both the death term and the confirmed term, we add the two quantities to obtain a single term that carries information from both columns. This is a subjective decision and can be tailored to the objective.

We further get the minimum between 1 and the total to limit the maximum value to 1 (Image by Author)
def get_min(row):
    # Cap the combined total at 1
    return min(1, row['total'])
final['combined'] = final.apply(get_min, axis=1)
Computed DataFrame (Image by Author)
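The combine-and-cap steps can be sketched end to end on toy data (the column names follow the article; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'death_normalized':     [0.2, 0.9],
    'confirmed_normalized': [0.3, 0.8],
})
# Sum the two normalized columns...
df['total'] = df['death_normalized'] + df['confirmed_normalized']
# ...and cap the result at 1, as get_min does
df['combined'] = df['total'].clip(upper=1)
print(df['combined'].tolist())  # → [0.5, 1.0]
```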

Formulation of the Metric

We define the metric by dividing the capped sum of the normalized columns (the "combined" value from above) by the normalized population. For each state, the resulting "risk factor" denotes how "risky" the state is. We use the following formula and code to compute the metric.

Risk Factor Formula (Image by Author)
import numpy as np
from scipy.special import softmax

final['risk_factor'] = final['combined'] / final['population_normalized']
final = final.replace(np.inf, 0)  # guard against division by zero
final['risk_factor_softmax'] = softmax(final['risk_factor'])

Note that this is a subjective metric and can be tweaked according to the objective. Alternatives include dividing the population by the total, or using the maximum population.

Introduction to Softmax

Upon computing the "risk factor", we can see that the values are not scaled between 0 and 1. We want them as probabilities so that we can build a co-occurrence matrix of states that represents similarity.

Risk Factor values are not between 0 and 1 (Image by Author)

The softmax function normalizes a given vector into a probability distribution: each component lies between 0 and 1, and the components add up to 1, so they can be interpreted as probabilities. Larger input components correspond to larger probabilities.

Softmax Function (Image by Author)
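The description above can be sketched as a minimal NumPy implementation (a stand-in for whichever softmax routine the code uses):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# Components sum to 1, and larger inputs map to larger probabilities
```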

Now, to convert our values into probabilities and ultimately create a co-occurrence matrix of states, we apply softmax to the vector.

Softmax Applied (Image by Author)
Final DataFrame (Image by Author)

Defining the "High Risk Area"

A high-risk area is defined as a state that exceeds the tolerance factor: if a state's value is greater than or equal to the "tolerance factor", we classify it as a "high risk area". For our current task, we set the "tolerance factor" to 0.02, but this can vary.

Note: The Tolerance is based on the values and can change for different data (For example, if instead of the US, we are analyzing India’s COVID data).

A detailed algorithmic expression for determining high risk areas (Image by Author)
tolerance_factor = 0.02

def is_risk_area(row):
    # A state is high risk when its softmax risk factor meets the tolerance
    return row['risk_factor_softmax'] >= tolerance_factor

final['is_risk'] = final.apply(is_risk_area, axis=1)
A plot representing the counts of "high risk" areas (Image by Author)
DataFrame consisting of the transformed columns (Image by Author)

Building the co-occurrence matrix a.k.a similarity matrix

To calculate how similar two states are in terms of "risk factor", we define a term called the "difference factor": the maximum difference allowed between two states' risk factors for them to be considered similar. For our current task, the difference factor was set to 0.01.

Note: The difference factor is based on the values and can change for different data (for example, if instead of the US we are analyzing India's COVID data). The metrics defined are subjective and can be tweaked according to different preferences.

A detailed algorithmic expression for determining similarity (Image by Author)

Tip: To extract a more informative similarity score, we can return the complement of the difference between the two probabilities, giving values between 0 and 1 that represent the strength of the similarity between two states. We can also include other factors.
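Under the difference-factor rule and the complement tip above, the pairwise score can be sketched as follows (the function name and signature are illustrative, not taken from the article's code):

```python
def similarity(risk_a, risk_b, difference_factor=0.01):
    # Two states count as similar when their softmax risk factors differ
    # by no more than the difference factor; the complement of the
    # difference then serves as a similarity strength in (0, 1]
    diff = abs(risk_a - risk_b)
    return 1 - diff if diff <= difference_factor else 0.0
```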

Computed DataFrame (Image by Author)

Deriving the co-occurrence matrix

To derive the co-occurrence matrix from the above DataFrame, we create a coo_matrix using the scipy.sparse module. This creates a matrix of dimensions (number of states, number of states), i.e., (58, 58) in our case. For each state, we have a score corresponding to every other state.

from scipy import sparse

# Map each state name to an integer code and build the sparse matrix
rows = cmat.state1.astype('category').cat.codes
cols = cmat.state2.astype('category').cat.codes
user_items = sparse.coo_matrix((cmat.score.astype(float), (rows, cols)))
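As a quick sanity check of this construction, here is a toy version with invented state pairs and scores (not the article's data); note that category codes are assigned per column:

```python
import pandas as pd
from scipy import sparse

# Toy pairwise-score table; names and scores are made up for illustration
cmat = pd.DataFrame({
    'state1': ['Alabama', 'Alabama', 'Alaska'],
    'state2': ['Alaska', 'Arizona', 'Arizona'],
    'score':  [0.99, 0.0, 0.97],
})
m = sparse.coo_matrix((cmat.score.astype(float),
                       (cmat.state1.astype('category').cat.codes,
                        cmat.state2.astype('category').cat.codes)))
dense = m.toarray()  # 2 row codes x 2 column codes here
```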
Sample of the final DataFrame (Image by Author)
A heatmap representing how similar each state is to another state based on the confirmed and death numbers (Image by Author)
An example of how we see if two states are similar (Image by Author)
Demo output for two similar states (based on the risk factor) (Image by Author)

Conclusion and Further Applications

We demonstrated a plausible method to determine "high risk" states and build a map of similar states. The metrics defined are subjective and can be tweaked on the basis of the objective. The method has many applications; for example, it can be used for:

  • Estimating policy effects for states (for example, determining the effect of closing restaurants on weekends in one state and approximating the effect on a similar state using the co-occurrence similarity matrix)
  • Detecting high-risk states or cities

and a lot more….

