Forecasting COVID-19 cases in India

How many cases are going to get detected by 7th April 2020 in India?

Published in

Towards Data Science

8 min readMar 29, 2020

COVID-19 has taken the world by storm after first being reported in Wuhan, China in December-2019. Since then, there has been an exponential growth in the number of such cases around the globe. As of 28th March 2020, the total reported cases reached 668,352 out of which 31,027 have died. The below figure shows the outbreak of the virus in various countries

Credits: Raphaël Dunant, Link: https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic#/media/File:COVID-19_Outbreak_World_Map_per_Capita.svg

It can very well be observed that most countries have reported cases of new Coronavirus. As of now, it has affected 192 countries and 1 international conveyance (the Diamond Princess cruise ship harbored in Yokohama, Japan).

Considering India, the cases emerged towards the end of January in the state of Kerala, when three students returned from Wuhan, China. However, things escalated in March, after several cases were reported all over the country, most of whom had a travel history to other countries. The below figure shows the number of cases detected between 30th January and 28th March.

Total Confirmed Cases (India) till 28th March

After first emerging in late January, the number remained constant until the beginning of March, post which, it grew exponentially. As of 28nd March, the number of total cases has reached 987 cases with 24 deaths. Given the current rate of growth, where can the cases expect to reach in the next 10 days, if no specific precaution is taken?

We can do a time series analysis to create a model that helps in the forecast. The dataset is available on Kaggle. We use R-Programming for our analysis.

We load the necessary packages and attach the dataset

#COVID India — Analyzing/Forecasting Total Cases#Packages
library(dplyr)
library(ggplot2)
library(hrbrthemes)
library(tseries)
library(forecast)attach(Data)> head(Data)
# A tibble: 6 x 4
  Date                `Total Confirmed Cases` Cured Deaths
  <dttm>                                <dbl> <dbl>  <dbl>
1 2020-01-30 00:00:00                       1     0      0
2 2020-01-31 00:00:00                       1     0      0
3 2020-02-01 00:00:00                       2     0      0
4 2020-02-02 00:00:00                       3     0      0
5 2020-02-03 00:00:00                       3     0      0
6 2020-02-04 00:00:00                       3     0      0

Let’s have a look at the first 6 observations. The columns include:

Date: The date on the observations are recorded.

Total Confirmed Cases: Number of confirmed cases as of the given date.

Cured: Patients recovered as of the given date

Deaths: Patients died as of the given date

Since, we are to analyze only the Total Confirmed Cases, we create a new data frame including only that.

#Creating a new data frame with Confirmed Cases
df <- Data[2]

The next step is to convert the data frame to a time series object.

#Converting it to Time Series Object
tsdata <- ts(df$`Total Confirmed Cases`)
ts.plot(tsdata)

We get the time series plot with each observation recorded on a daily basis. The next step is to check if the time series is stationary. Putting it simply, a stationary time series is one whose statistical properties such as mean, variance, auto-correlation are constant over time. It is important for a time series observation to be stationary as it then becomes easy to get accurate predictions. We use the Augmented Dickey Fuller Test to test the stationarity of the time series observations. The null hypothesis (Ho) for the test is that the data is not stationary whereas the alternate hypothesis is that the data is stationary. The Level of Significance (LOS) is taken to be 0.05.

> adf.test(tsdata)Augmented Dickey-Fuller Testdata: tsdata
Dickey-Fuller = 2.2994, Lag order = 3, p-value = 0.99
alternative hypothesis: stationaryWarning message:
In adf.test(tsdata) : p-value greater than printed p-value

The p-value turns out be 0.99. We thus fail to reject our Ho and conclude that the data is not stationary. We now have to work on the stationarity of the data. There are various methods to make our time series stationary depending on its behavior. The most popular is the method of Differencing. Differencing helps stabilize the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality. Mathematically it is given as:

In the above expression, B is the Backshift operator and d represents the differencing order. The code used in R is as follows

> tsdata1 <- diff(tsdata, differences = 2)
> adf.test(tsdata1)Augmented Dickey-Fuller Testdata:  tsdata1
Dickey-Fuller = -4.3764, Lag order = 3, p-value = 0.01
alternative hypothesis: stationaryWarning message:
In adf.test(tsdata1) : p-value smaller than printed p-value

After two rounds of differencing, we again perform the Augmented Dickey Fuller Test (ADF Test). The p-value is smaller than 0.01. We can thus reject our null hypothesis and conclude that the data is stationary. Since the order of differencing is 2, d is 2.

The next step is to plot the ACF and PACF graphs.

Complete Auto-Correlation Function (ACF) gives us the auto correlation of any series with its lagged values. In other words, ACF is a chart of coefficients of correlation between a time-series and its lagged values.

Partial Auto-Correlation Function (PACF) gives us the amount of correlation between two variables which is not explained by their mutual correlations, but with the residuals. Hence, we observe if the residuals can help us generate more information. A more comprehensive guide on the topic can be found here.

#Plotting the ACF plot and PACF
acf(data1)
pacf(data1)

The ACF plot above shows a significant correlation at lag 3(q) while the PACF plot below shows significant correlation till lag 1(p).

Using ARIMA (1, 2, 3)

Based on the ACF and PACF, we choose an ARIMA (1,2,3) model. We take d as 2 since it took two differencing to make our time series stationary. We will now fit the model based on the above chosen parameters

#Fitting ARIMA (1,2,3) model
fit1 <- arima(df$`Total Confirmed Cases`, c(1,2,3))
fit1#Forecasting
forecast1 <- forecast(fit1, h = 10)
forecast1
plot(forecast1)

We get the following results

Point Forecast for the next 10 days (h = 10)

We forecast the Confirmed Cases for the next 10 days, that is, until 7th April 2020. The Point Forecast reaches 2278 on the 69th day, that is, by 7th April 2020, total confirmed cases is likely to reach 2278. Also, it is important to note that the cases double almost every 10 days. The figure plot shows the predicted cases. The blue line represents the forecast and the silver shade around it represents the confidence interval.

Using auto.arima()

R also has at auto.arima() function that automatically selects the optimum (p,d,q) values for ARIMA model. Let us fit using that.

#Fitting auto.arima ()
fit2 <- auto.arima(df$`Total Confirmed Cases`, seasonal = FALSE)
fit2#Forecasting
forecast2 <- forecast(fit2, h = 9)
forecast2
plot(forecast2)

We get the following results

The model using auto.arima(1,2,0) is pretty similar to our previous model, however, the q value is 0. We get an Akaike Information Criterion (AIC) of 486.42 which is less than the AIC for the previous model (491.62). From the point estimates, we can see that the Total Number of Cases reach 2264 on the 10th day, that is, by 7th April 2020, marginally less than 2278 predicted by the previous model. Cases more of less double every 10 days. Plotting the predictions, we get:

To estimate Model Adequacy, we do Residual Analysis. For a model to be adequate, the residuals should be Independent and Identically Distributed (I.I.D.) and should be uncorrelated. To test for correlation, we use the ACF plot and look for significant correlations.

There is no significant correlation observed at any lag except 0. We can deduce that the residuals are not correlated. To test if the residuals are normally distributed, we use Shapiro-Wilk Test of Normality. The results show that the residuals are normally distributed.

Conclusion

We come to the end our Time Series modelling on the Total Number of Cases in India. The number is increasing every passing day, however, the rate at which it grows will slow down in the next few days. The lock down in India implemented on 24th March will prevent community spread of the virus if the steps are taken seriously by the general public. Right now, it’s a major issue troubling every one in every corner of the world. We as a community have the potential to stop it and we can stop it by preventing the transmission of virus.

Thank you for the read. I sincerely hope you found it helpful and as always I am open to constructive feedback.

Drop me a mail at: icy.algorithms@gmail.com

You can find me on LinkedIn.

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Forecasting COVID-19 cases in India

How many cases are going to get detected by 7th April 2020 in India?

Using ARIMA (1, 2, 3)

Using auto.arima()

Conclusion

Written by Ishan Choudhary