
Data for Change
Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the Coronavirus pandemic, you can click here.
We are living through unprecedented times that we never imagined we would see in our lifetimes. Our lives have long departed from what used to be normal and, unfortunately, the world will never be the same again. Instead of stressing about what has changed, it would be wise to accept the new normal and get up and running again.
The last two centuries have seen three powerful things that changed the world forever: Industrialization, the Internet, and Infectious diseases!
Alright, let’s address the elephant in the room: the coronavirus, which originated in China, has infected nearly 22 million people worldwide and claimed 770K lives. Here is the current situation in the world as of August 17th, 2020.

Here is the agenda for this article. You may jump to any of the following sections:
- COVID-19 Data Visualizations
- Exponential growth, Logistic growth? what the hell are they talking anyway?
- Understanding the Basic SIR model and building blocks of a pandemic
- How various governments come up with different strategies?
All the code used in this article can be downloaded from this GitHub repository
COVID-19 Data Visualizations
Data Visualization #1: Total cases by country
The below data visualization runs through the timeline from Feb 2020 until the end of July 2020 and shows how the top 20 countries shifted places in terms of the total number of reported cases.
Data Visualization #2: Total deaths by country
Another data visualization shows how the top 20 countries shifted places in terms of the total number of reported deaths due to Covid-19.
Data Visualization #3: World Heatmap showing the number of days until the first reported case
The following data visualization shows how quickly the virus spread across the entire world.

From the above interactive data visualization, we can see that:
- Thailand was the first to be hit (within the first 13 days), followed by Japan (by 15 days) and South Korea (by 20 days).
- Within a month, the virus spread to India, Southeast Asian countries, Australia, the United States, and Canada.
- Most of Western Europe, except Portugal and Ireland, contracted the virus within the first month, whereas Eastern Europe stayed free of reported cases for around 2 months.
- Russia reported its first case within roughly the first month as well.
- Among the Scandinavian countries, Norway was the last to be infected, whereas Sweden and Finland got it within the first month.
- It took relatively longer for the virus to reach the South American and African continents.
- North Korea – No data, as usual, Don’t ask me why? 😛
Data Visualization #4: World Heatmap showing the number of days until the first 1000 cases
Now, let us see how long it took to reach the first 1000 cases within each country:

From the above interactive data visualization, we can see that to reach the first thousand cases:
- Though Thailand and Japan were the first ones to be infected, they took nearly 2 months to reach the first thousand cases milestone.
- Turkey, Iran, and Tajikistan took just 2 weeks.
- South Korea took a little over a month.
- Most of Europe took nearly a month.
- United States, India, and Australia took nearly 2 months.
- Brazil and most of the South American countries took less than a month.
- South Africa took less than a month.
Though the virus spread relatively slowly in some countries, it spread very fast in others. The rate of growth in different regions can be captured by an exponential growth model.
In the next sections, we will cover the different growth models used to model the virus and discuss how governments come up with various mitigation strategies.
Exponential growth, Logistic growth? what the hell are they talking anyway?
In the past couple of months, I bet you have heard these terms a zillion times on the news channels. They are used to explain how the virus is spreading. But you might be wondering what they are actually talking about.
Alright, to make sense of what exponential growth really is, let’s deal with a hypothetical situation. Let’s say I offer to let you stay at an Airbnb that I own for a full month at an amazing deal: you pay 1 cent on the first day, but on every following day until the month is over you pay me triple what you paid the previous day. Is this a deal too hard to resist? Do I look like an idiot? It’s fine, you can have your silent laughing moment, but if you come back and sign up, I bet you badly need a math class.
What I used to trick you is so-called exponential growth. With tripling, my daily rate goes from $0.01 on the first day to more than $2 trillion after 30 daily increases. If we amend our deal so you only double what you pay me each day, your final daily bill drops to about $10.7 million, and if we amend the deal again so you only increase what you pay me by 50% each day, your final daily bill is a much more reasonable $1,917.51.
The following table summarizes these results:

Here is the simple exponential equation which we used to calculate the numbers in the above table:
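Written out, a form consistent with the definitions and the dollar figures in this section is the one below. Reading r as a fraction (so a 50% increase is r = 0.5 and tripling is r = 2) is an assumption made here so that the numbers work out.

```latex
x(t) = x_0 \,(1 + r)^{t}
```

With x₀ = 0.01, r = 0.5, and t = 30 this gives about 1,917.51, which matches the 50% scenario quoted earlier.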

In the above formula, x(t) is what you pay me on day t, x₀ is the initial value (what you pay me on the first day), and r is the daily percentage increase (tripling corresponds to r = 200%).
Gotcha?
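If you want to check the Airbnb numbers yourself, here is a minimal sketch in Python. The function name `daily_payment` is made up for illustration, and the growth rate is assumed to be expressed as a fraction (2.0 for tripling, 1.0 for doubling, 0.5 for a 50% increase).

```python
def daily_payment(first_payment, growth_rate, increases):
    """Payment after `increases` daily increases: x0 * (1 + r) ** t."""
    return first_payment * (1 + growth_rate) ** increases


# Tripling, doubling, and a 50% increase, written as fractional increases
scenarios = {"triple": 2.0, "double": 1.0, "+50%": 0.5}

for name, rate in scenarios.items():
    final = daily_payment(0.01, rate, 30)
    print(f"{name:>6}: ${final:,.2f}")
```

The three printed values should land on roughly $2 trillion, $10.7 million, and $1,917.51 respectively.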
Scientists and Epidemiologists use this exponential growth in the context of COVID-19 to understand how the virus is spreading. In other words, they try to fit the data points from the daily recorded cases into this model and try to estimate the growth factor and thus predict the future. More specifically we can find out how fast the virus is spreading in one country versus another.
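To make "estimating the growth factor" concrete, here is a small sketch. It is not the approach used later in this article (which relies on SciPy's curve fitting); it swaps in a simpler technique, a straight-line least-squares fit on the logarithm of the case counts, and it runs on made-up numbers rather than real COVID-19 data.

```python
import math


def fit_exponential(days, cases):
    """Fit cases ~ x0 * (1 + r) ** day via least squares on log(cases)."""
    logs = [math.log(c) for c in cases]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(logs) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
             / sum((t - mean_t) ** 2 for t in days))
    intercept = mean_y - slope * mean_t
    # exp(intercept) is the starting value, exp(slope) the daily growth factor
    return math.exp(intercept), math.exp(slope) - 1


# Synthetic data (made up for illustration): 100 cases growing 8% per day
days = list(range(20))
cases = [100 * 1.08 ** t for t in days]

x0, r = fit_exponential(days, cases)
print(f"x0 = {x0:.1f}, daily growth r = {r:.3f}")
```

Running the same fit on two countries and comparing the two values of r is one simple way to say the virus is spreading faster in one than in the other.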
But there is a problem with exponential growth! If you haven’t noticed yet, the growth doesn’t stop. Whatever value you put in for the parameter t, the output keeps getting bigger. But that is not how previous virus outbreaks behaved: we know they all stopped at some point.
Here comes our next guest, the Logistic growth!
Unlike exponential growth, logistic growth doesn’t increase indefinitely.
It follows three phases: it grows very slowly in the first phase, very rapidly in the middle phase, and slowly again in the third phase before finally flattening out. This is exactly the kind of pattern that any virus outbreak follows.
Here is the simple formula that explains logistic growth.

Scientists also study something called Gompertz growth, another sigmoid (S-shaped) curve closely related to logistic growth. Unlike the logistic curve, which is symmetrical about its midpoint, the Gompertz curve is asymmetrical: it rises quickly at first and then approaches its upper limit much more gradually. The below formula represents Gompertz growth.

In both the logistic and Gompertz growth formulas, the parameter c represents the asymptote of the curve. In other words, the value of c is the total number of cases the model predicts by the end of the epidemic.
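To get a feel for how the three curves differ, here is a minimal sketch of all three as Python functions. The exact parameterizations are assumptions (common textbook forms); only the role of c as the upper asymptote is taken from the text, and the meanings of a and b may differ from the formulas this article was fitted with.

```python
import math


def exponential(t, x0, r):
    """Unbounded growth: x0 * (1 + r) ** t."""
    return x0 * (1 + r) ** t


def logistic(t, a, b, c):
    """Assumed form c / (1 + a * exp(-b * t)); c is the upper asymptote."""
    return c / (1 + a * math.exp(-b * t))


def gompertz(t, a, b, c):
    """Assumed form c * exp(-a * exp(-b * t)); c is the upper asymptote."""
    return c * math.exp(-a * math.exp(-b * t))


# Illustrative parameters only: all three start near 100 cases
for t in (0, 25, 50, 100, 200):
    print(t,
          round(exponential(t, 100, 0.1)),
          round(logistic(t, 99, 0.1, 10_000)),
          round(gompertz(t, 4.6, 0.1, 10_000)))
```

The exponential column keeps climbing, while the logistic and Gompertz columns both level off at c = 10,000.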
The below chart gives you a sense of how the growth rate varies in all three kinds of models.

Scientists use both exponential and logistic functions to study which phase of an epidemic we are currently in. The first phase of an epidemic is better represented by the exponential function, whereas the second phase is better represented by the logistic function.
Using the COVID-19 reporting data compiled by Johns Hopkins University, we can do curve fitting to the exponential and logistic models.
Here is what we will be doing: we will fit the model to all the data except for the last day (29th July 2020) and then try to predict that day. We will use [scipy.optimize.curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html), a SciPy function that uses non-linear least squares to fit a function to the data for a selected country. We can use the R2 score, a common metric for goodness of fit, to measure how well the model reproduces the data points of the curve. A value very close to 1 means the model fits the data very well, while a value close to zero, or even negative, means the model is a very poor fit.
Here is the code for the same:
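The article's own code lives in the GitHub repository. Separately from that code, the sketch below only illustrates how the two scores reported in the results behave, using made-up numbers. The helper names `r2_score` and `msle` are written from the standard definitions and are assumptions about how the reported metrics were computed.

```python
import math


def r2_score(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def msle(actual, predicted):
    """Mean squared error between log(1 + actual) and log(1 + predicted)."""
    return sum((math.log1p(a) - math.log1p(p)) ** 2
               for a, p in zip(actual, predicted)) / len(actual)


# Made-up cumulative counts that level off, as in a logistic phase
actual = [100, 400, 1500, 4000, 7000, 8800, 9500, 9800, 9900, 9950]
good_fit = [110, 390, 1450, 4100, 7100, 8700, 9450, 9820, 9910, 9940]
runaway = [100 * 2 ** t for t in range(10)]  # a curve that keeps doubling

print("good fit:", r2_score(actual, good_fit), msle(actual, good_fit))
print("runaway :", r2_score(actual, runaway), msle(actual, runaway))
```

A curve that tracks the data scores close to 1, while a curve that keeps growing after the data has flattened does worse than simply predicting the average, which is what a negative R2 means.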
Here are the results for some selected countries:
China
From the below result, China has a negative R2 score for the exponential model and a positive R2 score for the logistic model. This indicates that China has crossed the exponential phase and entered the logistic phase. An R2 score very close to 1 indicates a very good fit for the logistic model. In fact, the flattening of the curve happened on 13th March 2020.
**** Prediction for China-CHN as of 2020-07-29 ****
Initial no. of cases on 2020-01-19 -> 216.0
No. of cases on 2020-07-28 -> 86783.0
~~ Exponential Model ~~
['p=0.035', 'sigma_p=0.001']
MSLE: 45.617, Exp of RMSLE: 9.670, R2 score: -9.730
~~ Logistic Model ~~
['a=10.000', 'b=0.113', 'c=84166.683'] ['sigma_a=1.115', 'sigma_b=0.005', 'sigma_c=390.112']
MSLE: 0.418, Exp of RMSLE: 1.909, R2 score: 0.963
Day of flattening of the infection curve -> 13, Mar 2020
South Korea
The same is the case with South Korea. The flattening of the curve happened on 1st May 2020.
**** Prediction for South Korea-KOR as of 2020-07-29 ****
Initial no. of cases on 2020-02-21 -> 155.0
No. of cases on 2020-07-28 -> 14203.0
~~ Exponential Model ~~
['p=0.032', 'sigma_p=0.001']
MSLE: 31.421, Exp of RMSLE: 9.670, R2 score: -11.591
~~ Logistic Model ~~
['a=3.858', 'b=0.074', 'c=12089.302'] ['sigma_a=0.468', 'sigma_b=0.005', 'sigma_c=113.041']
MSLE: 0.132, Exp of RMSLE: 1.439, R2 score: 0.881
Day of flattening of the infection curve -> 01, May 2020
France & Italy
For both France and Italy, the logistic model fits well (R2 above 0.96). The flattening of the curve happened in June for both countries.
**** Prediction for France-FRA as of 2020-07-29 ****
Initial no. of cases on 2020-03-01 -> 100.0
No. of cases on 2020-07-28 -> 183079.0
~~ Exponential Model ~~
['p=0.056', 'sigma_p=0.001']
MSLE: 37.960, Exp of RMSLE: 9.670, R2 score: -3.601
~~ Logistic Model ~~
['a=10.000', 'b=0.055', 'c=166804.289'] ['sigma_a=1.059', 'sigma_b=0.003', 'sigma_c=1840.738']
MSLE: 1.355, Exp of RMSLE: 3.203, R2 score: 0.961
Day of flattening of the infection curve -> 21, Jun 2020
**** Prediction for Italy-ITA as of 2020-07-29 ****
Initial no. of cases on 2020-02-24 -> 132.0
No. of cases on 2020-07-28 -> 246286.0
~~ Exponential Model ~~
['p=0.054', 'sigma_p=0.001']
MSLE: 43.168, Exp of RMSLE: 9.670, R2 score: -3.856
~~ Logistic Model ~~
['a=10.000', 'b=0.056', 'c=243030.882'] ['sigma_a=0.895', 'sigma_b=0.002', 'sigma_c=2068.458']
MSLE: 1.156, Exp of RMSLE: 2.931, R2 score: 0.973
Day of flattening of the infection curve -> 13, Jun 2020
New Zealand
The New Zealand government’s response was much talked about as one of the best. For New Zealand, flattening happened by 18th April 2020.
**** Prediction for New Zealand-NZL as of 2020-07-29 ****
Initial no. of cases on 2020-03-23 -> 102.0
No. of cases on 2020-07-28 -> 1207.0
~~ Exponential Model ~~
['p=0.024', 'sigma_p=0.001']
MSLE: 18.770, Exp of RMSLE: 9.670, R2 score: -22.147
~~ Logistic Model ~~
['a=5.567', 'b=0.213', 'c=1159.461'] ['sigma_a=0.323', 'sigma_b=0.006', 'sigma_c=2.736']
MSLE: 0.005, Exp of RMSLE: 1.072, R2 score: 0.985
Day of flattening of the infection curve -> 18, Apr 2020
United States, India, and Brazil
While the rest of the world started to see some light at the end of the tunnel, the United States, India, and Brazil are still stuck in the middle of the pandemic. The new cases keep rising.
While the logistic model for the United States indicates a good fit (R2 of about 0.9), the models for India and Brazil fit less well (R2 of 0.65 and 0.74) but are still considered to be in the logistic phase because R2 > 0. The predictions below indicate that all three countries are expected to flatten the curve by the end of November 2020.
**** Prediction for Brazil-BRA as of 2020-07-29 ****
Initial no. of cases on 2020-03-15 -> 121.0
No. of cases on 2020-07-28 -> 2442375.0
~~ Exponential Model ~~
['p=0.080', 'sigma_p=0.000']
MSLE: 32.941, Exp of RMSLE: 9.670, R2 score: -0.713
~~ Logistic Model ~~
['a=9.998', 'b=0.025', 'c=2159468.826'] ['sigma_a=2.364', 'sigma_b=0.006', 'sigma_c=637699.122']
MSLE: 5.820, Exp of RMSLE: 11.161, R2 score: 0.735
Day of flattening of the infection curve -> 19, Nov 2020
**** Prediction for India-IND as of 2020-07-29 ****
Initial no. of cases on 2020-03-17 -> 125.0
No. of cases on 2020-07-28 -> 1483156.0
~~ Exponential Model ~~
['p=0.076', 'sigma_p=0.000']
MSLE: 27.423, Exp of RMSLE: 9.670, R2 score: -0.560
~~ Logistic Model ~~
['a=9.994', 'b=0.024', 'c=1066388.221'] ['sigma_a=3.159', 'sigma_b=0.008', 'sigma_c=441540.841']
MSLE: 6.032, Exp of RMSLE: 11.657, R2 score: 0.650
Day of flattening of the infection curve -> 27, Nov 2020
**** Prediction for United States-USA as of 2020-07-29 ****
Initial no. of cases on 2020-03-03 -> 103.0
No. of cases on 2020-07-28 -> 4290263.0
~~ Exponential Model ~~
['p=0.079', 'sigma_p=0.001']
MSLE: 45.859, Exp of RMSLE: 9.670, R2 score: -1.501
~~ Logistic Model ~~
['a=10.000', 'b=0.024', 'c=4262579.365'] ['sigma_a=1.038', 'sigma_b=0.003', 'sigma_c=503190.213']
MSLE: 5.149, Exp of RMSLE: 9.670, R2 score: 0.903
Day of flattening of the infection curve -> 17, Nov 2020
Understanding the Basic SIR model and building blocks of a pandemic
Exponential and logistic models help us understand what stage of an epidemic we are in, but they have certain limitations:
- They do not take population size into consideration unless we create some bounds for these functions.
- They only model infections but do not consider recoveries and deaths.
- They do not take the infection rate or recovery rate into consideration.
- They do not consider the effect of improved treatment for faster recovery.
- They do not consider the effect of vaccination if there is one during the course of time.
To analyze such complex scenarios and simulations, scientists and epidemiologists use another model called the SIR (Susceptible-Infected-Recovered) model.
This model was first published by Kermack and McKendrick in 1927. In this model, the entire population in a region is divided into three compartments called Susceptible (S), Infected (I), Recovered/Removed (R).
The Susceptible (S) compartment includes everyone who is vulnerable to infection. At any point in time, this is the entire population minus those who are infected, recovered, or dead.
The Infected (I) compartment includes the people who are currently infected with the virus and are capable of spreading the disease. Technically, we need at least 1 infected individual to start an epidemic.
The Recovered/Removed (R) compartment includes all the people who are no longer infected and cannot infect anyone. This includes people who have recovered and developed immunity, as well as people who have died. Hence this group is sometimes called removed.
In general, when a person recovers, their body develops antibodies and gains immunity against the disease, so they are not infected again. Unless the virus is capable of mutating, this applies to most infectious diseases.
Next, we need to understand how a virus spreads.
The first building block to understanding how a virus spreads is knowing the Reproductive Number (R0) of the virus.
The Reproductive Number is the average number of individuals that one infected person will infect during the course of the illness.
If R0=2.0, that means every infected person spreads the disease to two more people on average. For any virus outbreak to grow into a pandemic, it needs an R0 greater than 1, and the higher the R0, the faster it spreads.
R0 is always published as a range because it keeps changing depending on various factors. Below is the R0 for some of the common virus outbreaks:

It can be observed from the above table that scientists have estimated the R0 for COVID-19 to be between 1.5 and 3.5. It is not as high as that of some deadly epidemics such as measles and smallpox, but the gigantic scale could still be due to the nature of transmission.
For any epidemic to slow its spread and gradually disappear, it needs to maintain an R0 ≤ 1.
The next building block is the Recovery Time of a virus.
The Recovery Time is the expected number of days it takes for an individual to recover from the virus.
Decreasing the recovery time would also slow the spread of the virus.
Using these two building blocks we can calculate the following parameters:
- Gamma (γ): This represents the daily recovery rate, the fraction of infected people who recover each day. If a person takes 14 days to recover, then γ = 1/14.
- Beta/Infection rate (β): This is the expected number of people an infected person infects per day, β = R0 × γ. If R0=4 and recovery takes 14 days, then β=4/14.
Below is the illustration of a basic SIR model with the transition rates from each state of the population.

As time goes on, people transition from Susceptible to Infected to Removed. The following happens on each given day:
- Susceptible population is reduced by the amount of newly infected.
- Infected population increases by the amount of newly infected and decreases by the amount of newly recovered.
- Recovered/Removed population increases by the amount of newly recovered.
These rates of change can be mathematically expressed by the following differential equations:
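In the usual textbook (Kermack-McKendrick) form, and assuming N = S + I + R is the fixed total population, the three daily changes listed above read as follows. The symbols β and γ are the infection and recovery rates defined earlier; the exact notation is an assumption.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta S I}{N} \\
\frac{dI}{dt} &= \frac{\beta S I}{N} - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{aligned}
```

The three right-hand sides add up to zero, which is just another way of saying that people only move between compartments and the total population stays constant.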

Using the code below we can simulate a simple SIR model for a small population of 1000 and with 10 initial infections and 0 recoveries.
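As a rough, stand-alone illustration (a sketch, not necessarily the code from the repository), the simulation can be written in a few lines of plain Python. The values R0 = 4 and a 14-day recovery time are assumptions borrowed from the earlier example, the function name `simulate_sir` is made up, and the equations are integrated with simple small Euler steps.

```python
def simulate_sir(population, infected0, recovered0, r0, recovery_days,
                 days, steps_per_day=10):
    """Integrate the basic SIR model; returns one (S, I, R) tuple per day."""
    gamma = 1 / recovery_days          # daily recovery rate
    beta = r0 * gamma                  # daily infection rate
    dt = 1 / steps_per_day
    s = population - infected0 - recovered0
    i, r = infected0, recovered0
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infected = beta * s * i / population * dt
            new_recovered = gamma * i * dt
            s -= new_infected
            i += new_infected - new_recovered
            r += new_recovered
        history.append((s, i, r))
    return history


# Population of 1000 with 10 initial infections and 0 recoveries
history = simulate_sir(1000, 10, 0, r0=4, recovery_days=14, days=150)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"peak: {history[peak_day][1]:.0f} infected on day {peak_day}")
print("final S, I, R:", [round(x) for x in history[-1]])
```

Plotting the three columns of `history` against the day index gives the familiar S, I, and R curves.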
Here is the result of the simple simulation.

It can be seen that as time passes, the following happens:
- At the start of an epidemic, the percentage of Susceptible individuals in the population is very high. This increases the rate of infection, as more of the interactions of the Infected are with Susceptible individuals. At this point, the number of Recovered individuals is very low, as not many have been exposed.
- In the middle of an epidemic, the percentage of Susceptible individuals in the population starts to get lower. Soon enough, the number of newly Recovered individuals each day becomes greater than the number of newly Infected ones, and the number of Infected starts to decrease.
- At the end of an epidemic, the number of Recovered (R) starts to level off near the Population Size (N), as nearly everyone has had the virus and recovered. The number of Infected (I) and the number of Susceptible (S) start to level off near 0, as hardly anyone is left to infect.
How various governments come up with different strategies?
In the previous section, we saw a simple simulation of a basic SIR model for a small population. Now let us see the same on a big scale.
Let us consider a hypothetical situation in which a big metropolis with a population of 8 million is struck by a pandemic with R0=5, and the local government is under pressure to get things under control.
Here is the quick snapshot of the disease:
- Reproduction number, R0=5.0
- Normal Recovery time for infected is 14 days.
- No. of initial infected people is 1000
- No. of initial recovered people is 0
The first thing that the government wants to see is how big a deal this is. The best scientists in town run some quick simulations (see the chart below) and show that about 3.8 million people (47.8% of the city’s population) would be infected at the peak, around day 37 of the outbreak. In other words, the cost of doing nothing is huge!
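For readers who want to reproduce the order of magnitude, here is a minimal sketch of the "do nothing" scenario using the snapshot values above (8 million people, 1000 initially infected, R0 = 5, 14-day recovery). The function name `sir_peak` and the small-step Euler integration are choices made for this illustration.

```python
def sir_peak(population, infected0, r0, recovery_days,
             days=365, steps_per_day=10):
    """Return (peak infected share, day of the peak) for a basic SIR run."""
    gamma = 1 / recovery_days
    beta = r0 * gamma
    dt = 1 / steps_per_day
    s, i = population - infected0, infected0
    peak_infected, peak_day = i, 0
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infected = beta * s * i / population * dt
            new_recovered = gamma * i * dt
            s -= new_infected
            i += new_infected - new_recovered
        if i > peak_infected:
            peak_infected, peak_day = i, day
    return peak_infected / population, peak_day


share, day = sir_peak(8_000_000, 1000, r0=5.0, recovery_days=14)
print(f"peak: {share:.1%} of the city infected around day {day}")
```

The result should come out close to the figures quoted above: a bit under half of the city infected at once, a little more than five weeks into the outbreak.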

So, the government has a sense of what’s coming. The next thing on their mind is: how can we get this under control?
Scientists run through some additional simulations to show the impact of R0 and they emphasize the fact that R0 needs to be brought down to 1 or less to slow the spread of the virus and to eventually stop the pandemic.
After all, it’s not rocket science. Looking at the chart below, the government understands that decreasing R0 makes a huge difference: with every 1-unit decrease in R0, the peak of infection gets smaller and is also delayed by a few weeks, so the government can buy some time to improve medical facilities, implement a strategy, or develop a vaccine if that is feasible.
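The effect of R0 on the height of the peak can also be checked without running a simulation. For the basic SIR model there is a known closed-form result, valid under the assumption that almost everyone is susceptible at the start: the peak infected share is 1 - (1 + ln R0) / R0. The function name below is made up for this sketch.

```python
import math


def peak_infected_share(r0):
    """Peak infected fraction in the basic SIR model, assuming the outbreak
    starts with (almost) the entire population susceptible."""
    if r0 <= 1:
        return 0.0  # the outbreak never takes off
    return 1 - (1 + math.log(r0)) / r0


for r0 in (5, 4, 3, 2.5, 2, 1):
    print(f"R0 = {r0}: peak share = {peak_infected_share(r0):.1%}")
```

This gives about 47.8% for R0 = 5 and about 23% for R0 = 2.5. The formula says nothing about when the peak happens, so the delay still has to come from a simulation.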

The government doesn’t want the economy to shut down, so they want to see whether some kind of handwashing or public contact policy would help.
The scientists run through some simulations and make the following comments:
- The nature of the virus is that it spreads through successful interactions between the infected and the susceptible population.
- Handwashing and public contact policy changes only reduce the likelihood of the spread of the virus by limiting the number of successful interactions. Hence R0 is only reduced to some number, let’s say 2.5.
- Even in this scenario, about 23% of the population is infected at the peak, around day 88. So this time, there is some delay before infections reach the peak.
So, the government understands that this is not going to work and rejects the idea.

The next thing that the government can think of is the lockdown but they need to know the following:
- How soon they need to implement lockdown – within 1 week or 2 weeks or 3 weeks?
- How long they should hold this lockdown – 30 days or 60 days?
- How strict should the lockdown be – moderate or strict?
In addition to finding answers to the above questions, the government also needs to optimize the following:
- Minimize the peak of infection – see to it that as small a percentage of people as possible gets infected overall.
- Select an optimal lockdown option so as not to hurt the economy too much, because a badly hurt economy would come back and hit us equally hard later.
- Buy some time wherever possible to ramp up medical facilities, and perhaps also give scientists time to develop a vaccine.
Again, scientists help the government by running simulations for various scenarios and come up with the following results:

How should you read the above chart?
- Each row represents the day the lockdown starts. For example, all the charts in the first row correspond to a lockdown started on the 7th day after the outbreak (within 1 week), the second row covers lockdowns started in the 2nd week, and so on.
- Within each row – there is a 30 day vs 60 days, moderate vs strict options for simulations.
- Each chart displays the R0, the day of peak infection, and the share of the population infected on that day.
- Of course, we want the lowest share of infections, but at the same time we also need to see how soon the peak infection day arrives.
We can make the following observations from the above chart:
- In each of the lockdown options, the peak of the infection is only delayed but does not disappear completely.
- All the strict lockdown options buy us more time than the corresponding moderate options. For example, for a 30-day lockdown starting on the 14th day, the strict lockdown results in 47% infections in 75 days, whereas the moderate lockdown results in 45% infections within 60 days (relatively sooner). Buying some time is critical.
No matter how bad the tide is, we would rather be hit later than sooner, because that way we can make some safety preparations or map out an escape route.
- In all moderate lockdowns, right after the lockdown is lifted, the peak of infections rises.
- Waiting until 35 days to implement a lockdown looks too late because the peak of infection rises – so this option is ruled out completely.
By taking all the optimizing factors discussed previously into consideration – the government comes up with a 30-day strict lockdown plan to be implemented from day 14.
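A lockdown can be sketched in the same SIR framework by lowering R0 only while the lockdown is in force. The R0 values used below for a "strict" (0.8) and a "moderate" (2.0) lockdown are assumptions chosen for illustration; the values behind the charts above are not stated in the text, so the output will only roughly resemble them.

```python
def peak_with_lockdown(population, infected0, recovery_days, base_r0,
                       lockdown_r0, lockdown_start, lockdown_days,
                       days=365, steps_per_day=10):
    """Return (peak infected share, peak day) for SIR with one lockdown window."""
    gamma = 1 / recovery_days
    dt = 1 / steps_per_day
    s, i = population - infected0, infected0
    peak_infected, peak_day = i, 0
    for day in range(1, days + 1):
        in_lockdown = lockdown_start <= day < lockdown_start + lockdown_days
        # R0 drops to the lockdown value only inside the window
        beta = (lockdown_r0 if in_lockdown else base_r0) * gamma
        for _ in range(steps_per_day):
            new_infected = beta * s * i / population * dt
            new_recovered = gamma * i * dt
            s -= new_infected
            i += new_infected - new_recovered
        if i > peak_infected:
            peak_infected, peak_day = i, day
    return peak_infected / population, peak_day


CITY = dict(population=8_000_000, infected0=1000, recovery_days=14, base_r0=5.0)
no_lockdown = peak_with_lockdown(**CITY, lockdown_r0=5.0,
                                 lockdown_start=0, lockdown_days=0)
moderate_30 = peak_with_lockdown(**CITY, lockdown_r0=2.0,   # assumed value
                                 lockdown_start=14, lockdown_days=30)
strict_30 = peak_with_lockdown(**CITY, lockdown_r0=0.8,     # assumed value
                               lockdown_start=14, lockdown_days=30)

for name, (share, day) in [("none", no_lockdown), ("moderate", moderate_30),
                           ("strict", strict_30)]:
    print(f"{name:>8}: peak {share:.0%} on day {day}")
```

Under these assumptions the peak stays almost as high in every case but arrives later the stricter the lockdown is, which is the "buying time" effect described above.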
We are now in the middle of a 30-day strict lockdown that started on the 14th day. Using the time bought, the government quickly ramps up the medical facilities within 2 weeks and starts deploying improved treatment options from day 28.
Here is the comparison of the normal lockdown scenario vs the one with improved treatment option:

Here are the observations we can make from the above chart:
- The improved treatment option drastically reduced the recovery time by 50% – which is great.
- The peak of infection is reduced by 50% which means now only 22% of people get infected – which is great.
- The peak of infection is delayed meaning we can again buy a few more days – which is great.
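One way to see why faster recovery lowers the peak: in the SIR model, halving the recovery time doubles γ, and if the infection rate β stays the same, the effective reproduction number is halved. The sketch below makes that explicit for a 30-day strict lockdown from day 14 with improved treatment from day 28. Keeping β fixed and using a lockdown R0 of 0.8 are both assumptions made for this illustration.

```python
def peak_share(treatment_day=None, days=365, steps_per_day=10):
    """SIR run for the 8-million city with a strict lockdown on days 14-43
    and, optionally, recovery time halved from `treatment_day` onward."""
    population, infected0 = 8_000_000, 1000
    base_gamma = 1 / 14                  # 14-day recovery without treatment
    normal_beta = 5.0 * base_gamma       # R0 = 5 outside lockdown
    lockdown_beta = 0.8 * base_gamma     # assumed strict-lockdown contact rate
    dt = 1 / steps_per_day
    s, i = population - infected0, infected0
    peak_infected, peak_day = i, 0
    for day in range(1, days + 1):
        beta = lockdown_beta if 14 <= day < 44 else normal_beta
        treated = treatment_day is not None and day >= treatment_day
        gamma = 2 * base_gamma if treated else base_gamma  # recovery in 7 days
        for _ in range(steps_per_day):
            new_infected = beta * s * i / population * dt
            new_recovered = gamma * i * dt
            s -= new_infected
            i += new_infected - new_recovered
        if i > peak_infected:
            peak_infected, peak_day = i, day
    return peak_infected / population, peak_day


without_treatment = peak_share()
with_treatment = peak_share(treatment_day=28)
print(f"lockdown only     : peak {without_treatment[0]:.0%} on day {without_treatment[1]}")
print(f"lockdown+treatment: peak {with_treatment[0]:.0%} on day {with_treatment[1]}")
```

Under these assumptions the peak drops to roughly half and moves a little later, in line with the observations listed above.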
With the improved treatment option in place, the government wants to know if we now extend the lockdown to 60 days, would there be any impact on the peak of infections?
The below chart shows the same improved treatment option but in the context of a 60-day strict lockdown.

How is a 30-day lockdown different from the 60-day lockdown with the improved treatment option?
Even though the same percentage of the population is infected in the end, with a 60-day lockdown we buy more time (135 days). This is critical for mapping out the next steps.
So, the government extends the lockdown to 60 days.
The government now realizes that there is no point in delaying the tide forever, so it works in parallel with a team of scientists by providing some funding for vaccine development.
What if there is a vaccine? The next question is what daily vaccination capacity is needed to effectively control the infections.
Another simulation in the context of 60-day lockdown is shown below:

The numbers look promising.
It appears that for the scale of a big metropolis (8 million population) that we are dealing with, we need 30K to 50K daily vaccinations to keep the infection rate low.
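Vaccination can be added to the SIR sketch by moving a fixed number of people per day straight from Susceptible to Removed. To stay short, the example below drops the lockdown and the improved treatment and vaccinates from day 0, so its numbers are not comparable to the chart above; it only shows the direction of the effect. The function name and the start day are assumptions.

```python
def peak_share_with_vaccination(daily_vaccinations, population=8_000_000,
                                infected0=1000, r0=5.0, recovery_days=14,
                                start_day=0, days=365, steps_per_day=10):
    """Peak infected share when `daily_vaccinations` susceptible people
    per day are moved directly to the removed compartment."""
    gamma = 1 / recovery_days
    beta = r0 * gamma
    dt = 1 / steps_per_day
    s, i = population - infected0, infected0
    peak_infected = i
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infected = beta * s * i / population * dt
            new_recovered = gamma * i * dt
            s -= new_infected
            i += new_infected - new_recovered
            if day > start_day:
                # cannot vaccinate more people than are still susceptible
                s -= min(daily_vaccinations * dt, s)
        peak_infected = max(peak_infected, i)
    return peak_infected / population


for rate in (0, 30_000, 50_000):
    print(f"{rate:>6} vaccinations/day: peak {peak_share_with_vaccination(rate):.1%}")
```

The higher the daily capacity, the fewer susceptible people are left when the wave arrives, and the lower the peak.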
So, now the government can work towards making necessary infrastructure available hoping for the vaccine to be out sooner.
In this article, for the sake of simplicity, we discussed only the basic SIR model. However, this model doesn’t take into consideration the following scenarios:
- Vital dynamics – new births and new deaths happening every day.
- Immigration happening on a daily basis.
- Most viruses have an incubation period – this can be modeled by the SEIR model, a variant of SIR that introduces another compartment called ‘Exposed’.
- Some people are asymptomatic and are not recorded as infected, etc.
- Hospitalizations, ICU, ventilators, etc.
In a real-world scenario, scientists take the most accurate data available, such as the number of newly recorded births or deaths, the number of deaths from other causes, the number of people in ICU, etc., and thus make more accurate predictions.
So, in this article, we have seen how to fit the COVID-19 data to exponential and logistic models, make predictions, and deduce when a given country flattened the curve or, if it has not yet, when the flattening is expected to happen.
Also, we have seen how scientists use the SIR model to simulate various scenarios to help the government formulate different virus prevention strategies.
All the code used in this article can be downloaded from this GitHub repository