Tracking COVID-19 Transmission in India

Examining a dynamic rate of transmission to track the spread of COVID-19 throughout the subcontinent

Published in Towards Data Science · 5 min read · May 8, 2020

Kevin Systrom recently published an article titled “The One Metric We Need to Track for COVID-19”. That metric is known as the variable rate of transmission, designated as R_t. Using public American COVID-19 data, Systrom calculated R_t for every state and explored how effective R_t is at tracking the spread of the virus.

Along with the article, Systrom published a Python notebook so that others could plug and play with his implementation. Combine that with Kaggle’s recent data set of India COVID-19 cases by state and territory, and there was a whole world to explore.

Bettencourt & Ribeiro

The specific algorithm that Systrom implemented was originally published in 2008 by Bettencourt and Ribeiro, a team of epidemiologists. They proposed a Bayesian approach to calculating a dynamic R_t, which takes into account the previous R_t and new observations (the number of new cases per day) and adjusts accordingly. If you’re interested in the math and the full derivation behind the algorithm, I recommend looking through the Python notebook linked earlier or even Bettencourt and Ribeiro’s original paper.

The long and short of it is that the probability of the current R_t, given the observed new cases k, is proportional to the product of the likelihoods of the daily observations:

P(R_t | k) ∝ ∏ L(R_t | k)
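As a rough illustration of that product of likelihoods, here is a minimal sketch that evaluates it on a grid of candidate R_t values. It assumes the Poisson likelihood with expected count k_prev · exp(γ(R_t − 1)) that Systrom's notebook describes, a γ of 1/7, and a uniform prior. The function names, the grid, and the case counts are made up for illustration and are not taken from the notebook.

```python
import math

GAMMA = 1 / 7  # assumption: reciprocal of a 7-day serial interval


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing k events at rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def posterior_over_rt(new_cases, r_grid, gamma=GAMMA):
    """Normalized posterior over r_grid, built from the product of daily likelihoods.

    new_cases is a list of daily new case counts. A uniform prior is assumed.
    """
    log_post = [0.0] * len(r_grid)
    for k_prev, k in zip(new_cases[:-1], new_cases[1:]):
        if k_prev <= 0:
            continue  # the assumed rate is undefined when yesterday had no cases
        for i, r in enumerate(r_grid):
            lam = k_prev * math.exp(gamma * (r - 1))
            log_post[i] += poisson_log_pmf(k, lam)  # sum of logs = log of product
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # subtract peak for stability
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical daily new case counts, growing by roughly 7% per day.
r_grid = [i / 100 for i in range(601)]  # candidate R_t values from 0.00 to 6.00
posterior = posterior_over_rt([100, 107, 115, 123, 132], r_grid)
most_likely_rt = r_grid[posterior.index(max(posterior))]
```

Working in log space and subtracting the peak before exponentiating keeps the product from underflowing when many days are multiplied together.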

Systrom made a modification. The original model incorporated every observation since day 0, and that had some long-term effects: the model on day 30 wouldn’t forget the high value of R_t on day 1. As a result, 1 acted like an asymptote, and every state’s R_t drifted down toward 1 and never went lower.

To adjust for this, Systrom had his model use a moving window of 7 days, so the value of R_t on any given day is based only on the previous 7 days. The estimate is bounded by a highest density interval, which reflects how thorough the data is. Given more samples (a more complete data set), we’ll get a smaller and more concentrated highest density interval.
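To make the windowing and the highest density interval concrete, here is a small sketch. It assumes the posterior has already been discretized over a grid of R_t values, and it treats the highest density interval as the narrowest contiguous stretch of the grid holding at least a chosen share of the probability. The function names, the 0.9 default, and the example numbers are illustrative assumptions, not the notebook's code.

```python
def recent_window(daily_new_cases, days=7):
    """Keep only the most recent `days` observations."""
    return daily_new_cases[-days:]


def highest_density_interval(grid, probs, mass=0.9):
    """Narrowest contiguous (low, high) span of grid whose probabilities sum to >= mass."""
    cumulative = [0.0]
    for p in probs:
        cumulative.append(cumulative[-1] + p)

    best = (grid[0], grid[-1])
    best_width = grid[-1] - grid[0]
    hi = 0
    for lo in range(len(grid)):
        hi = max(hi, lo)
        # Extend the right edge until the span [lo, hi] holds enough probability.
        while hi < len(grid) and cumulative[hi + 1] - cumulative[lo] < mass:
            hi += 1
        if hi == len(grid):
            break  # no span starting at or after lo can reach the target mass
        width = grid[hi] - grid[lo]
        if width < best_width:
            best, best_width = (grid[lo], grid[hi]), width
    return best


# Hypothetical discretized posterior over five candidate R_t values.
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
probs = [0.01, 0.04, 0.90, 0.04, 0.01]
low, high = highest_density_interval(grid, probs, mass=0.97)
```

A posterior that piles most of its probability onto a few grid points yields a narrow interval, which is what a more complete data set tends to produce.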

Mapping Systrom’s Algorithm to India

An open source effort in India has resulted in an open data set of COVID-19 cases on a state-by-state basis. The notebook’s modular code made it easy to clean the India data and plug it in. The two minor adjustments to make were:

Dealing with a different date format
Changing the daily new case count to a cumulative case count.
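The two adjustments above can be sketched roughly as follows, using only the standard library. The day/month/two-digit-year date format, the row layout, and the function name are assumptions made for illustration; the actual data set's columns and format may differ.

```python
from datetime import datetime
from itertools import accumulate


def clean_state_rows(rows, date_format="%d/%m/%y"):
    """Turn (date_string, new_cases) rows into (date, cumulative_cases) pairs.

    date_format is an assumed layout such as "01/05/20" for May 1, 2020.
    """
    # Parse the date strings and put the rows in chronological order.
    parsed = sorted((datetime.strptime(d, date_format).date(), n) for d, n in rows)
    dates = [d for d, _ in parsed]
    # A running sum converts daily new cases into a cumulative count.
    totals = accumulate(n for _, n in parsed)
    return list(zip(dates, totals))


# Hypothetical rows for one state, deliberately out of order.
rows = [("02/05/20", 5), ("01/05/20", 3), ("03/05/20", 2)]
cleaned = clean_state_rows(rows)
```

Sorting before accumulating matters here, since a running total over out-of-order dates would give a misleading cumulative curve.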

From there it was straightforward, and I was able to produce a running R_t graph for all Indian states and territories:

A current look at the R_t for each state in India. States with fewer than 5 days’ worth of data were excluded. Unassigned cases are those COVID-19 diagnoses which don’t have a state attached.

How Effective is R_t?

India has categorized districts into red zones, orange zones, and green zones in its latest list, released on May 1, 2020. Red zones are districts identified with a high growth rate of the virus and demarcated as hotspots. Green zones are districts where there are no new COVID-19 cases. Orange zones are districts that have reported cases, but fewer than red zones. You can find that list here. Using those demarcations, I’ve organized the states into three categories:

Red States: those that have more red zones than orange or green zones
Orange States: those that have more orange zones than red or green zones
Green States: those that have more green zones than red or orange zones.
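The three categories above amount to picking the zone color with the most districts in a state. A minimal sketch of that rule is below. The function name and the example counts are hypothetical, and returning None on a tie is my own assumption, since the categories above don't say how ties are handled.

```python
def classify_state(red, orange, green):
    """Label a state by whichever zone color covers the most of its districts.

    Returns "Red", "Orange", or "Green", or None when no single color leads
    (tie handling is an assumption, not part of the original categories).
    """
    counts = {"Red": red, "Orange": orange, "Green": green}
    top = max(counts.values())
    leaders = [color for color, n in counts.items() if n == top]
    return leaders[0] if len(leaders) == 1 else None


# Hypothetical district counts for a single state.
label = classify_state(red=3, orange=12, green=7)
```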

R_t is a measure of how many people each person diagnosed with COVID-19 will spread the virus to. A higher R_t means a state is more likely to be orange or red, since the number of cases continues to grow.
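A quick bit of arithmetic shows why values above and below 1 matter so much. The sketch below is a deliberate simplification that assumes R_t stays constant and that every case passes the virus on once per "generation" of infections; the function name and the numbers are illustrative only.

```python
def cases_after_generations(initial_cases, r_t, generations):
    """New cases in a later generation if every case infects r_t others.

    Simplifying assumption: r_t is held constant across generations.
    """
    return initial_cases * r_t ** generations


growing = cases_after_generations(100, 2, 3)      # R_t above 1: cases multiply
shrinking = cases_after_generations(100, 0.5, 3)  # R_t below 1: cases die out
```

With R_t at 2, 100 cases become 800 three generations later, while with R_t at 0.5 they fall to 12.5.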

A bar/whisker plot of each state’s R_t. States with a highest density interval greater than 1 were excluded.

Looking at this figure, we can see that states with mostly red and orange zones tend to have higher values of R_t. The green states Chandigarh and Chhattisgarh both have large highest density intervals. This is a result of having less data, and in particular a lack of daily data. As for orange states like Punjab and Tamil Nadu, their red districts are high population density areas where the virus spreads more easily from person to person, which the high R_t reflects.

Looking at R_t at the state level paints a moderately accurate picture. States with higher R_t values are likely to have also been designated by the Indian government as COVID-19 hotspots.

Keeping Context in Mind

This is a valuable look into how a virus permeates a populace and how preventative measures like the lockdown affect that spread. There are still some important factors to keep in mind.

This model doesn’t take testing into account. It takes the k new cases at face value, assuming that number is accurate. In reality, India’s testing setup is not universal, and only a percentage of the population has been tested. There are undoubtedly undiagnosed cases, which would have a significant impact on R_t.

This data set is tough to track. India is a massive country with more than a billion people. Millions of people in India were stranded by the lockdown, some of whom walked hundreds of miles to return to their homes from major urban areas. This data set is a phenomenal open source effort, but it’s hard to guarantee that it’s 100% accurate, and we should keep that in mind when tracking derived metrics like R_t.

Future Work

Something I’ll be looking at in the future is how to extend this model to account for the proportion of the population that’s been tested. Taking that into account would give us an even more accurate idea of COVID-19’s transmission rate.

Thanks for reading! If you liked it, feel free to give me a follow! You can find me on Twitter at @EswarVinnakota. If you’d like to talk, feel free to reach out through DMs or connect with me on LinkedIn.