The world’s leading publication for data science, AI, and ML professionals.

Forecasting the COVID-19 trend using the SEIR model

Write a Java program to analyze COVID-19 data and forecast a trend.

Image by author. The graph of daily COVID-19 death count and 7-day average from April 1, 2021 to October 2, 2021 in Thailand
Image by author. The graph of daily COVID-19 death count and 7-day average from April 1, 2021 to October 2, 2021 in Thailand

Thailand handled COVID-19 remarkably well in 2020. The numbers of cases and fatalities were relatively low compared to other countries. The situation has changed since the latest wave of the pandemic starting around April 2021. The death count has risen rapidly. Although, in recent months, the number has declined.

We will discuss:

  • The SEIR model
  • Modeling SEIR in code
  • Retrieving public COVID data
  • Searching for the model parameters
  • Forecasting by using the model
  • Visualizing the result
  • Visualizing the variation of the reproduction rate

The SEIR model

The SEIR model is a prevalent model describing various diseases, including COVID-19. Several articles discuss the SEIR model already. Such as [here](https://docs.idmod.org/projects/emod-hiv/en/latest/model-seir.html) and here. We would briefly introduce it to get to the main topic quickly.

The model divides the population into these groups:

  • Susceptible(S). The persons in this group are not infected yet.
  • Exposed(E). The individuals in this group are infected but not yet infectious
  • Infectious(I). The group could infect other people.
  • Recovered(R). The persons in the group have recovered.

Five differential equations describe how the numbers of people in these groups change with time. We only consider the simplified version rather than the rigorous one because we ignore the birth rate and the death rate from other causes.

Image by author. Generated by CODECOGS. These are the SEIR differential equations.
Image by author. Generated by CODECOGS. These are the SEIR differential equations.

The SEIR model parameters are:

  • Alpha(𝛼) is a disease-induced average fatality rate.
  • Beta(𝜷) is the probability of disease transmission per contact times the number of contacts per unit time.
  • Epsilon(𝜺) is the rate of progression from exposure to infectious. It is the reciprocal of the incubation period.
  • Gamma(𝜸) is the recovery rate. It’s the inverse of the infectious period.

To use the model to forecast, we need to know the above parameters as well as the following values at the beginning:

  • Exposed count
  • Infectious count
  • Recovered count
  • Death count

We need to know the initial susceptible count too. However, we could calculate it from the population and the above initial values. Susceptible count equals the population minus exposed count, infectious count, recovered count, and the death count. We transform SEIR differential equations into the difference equations here.

Image by author. Generated by CODECOGS. These are the SEIR difference equations.
Image by author. Generated by CODECOGS. These are the SEIR difference equations.

It’s easier to implement in code this way.

There are two metrics related to COVID-19. The reproduction rate (R0) is the average number of secondary cases generated by an infectious individual. If we ignore the natural birth rate and fatalities from other causes, the following equation describes the reproduction rate:

Image by author. Generated by CODECOGS. This is the Reproduction rate equation.
Image by author. Generated by CODECOGS. This is the Reproduction rate equation.

If R0 is less than 1, the disease stops. If R0 is greater than 1, the disease continues spreading. The other one is infection and case fatality rate (IFR). The following equation describes IFR:

Image by author. Generated by CODECOGS. This is the infection and case fatality rate equation.
Image by author. Generated by CODECOGS. This is the infection and case fatality rate equation.

Modeling the SEIR in code

First, we model SEIR parameters in one class. Then, we represent the differential equations in another.

The ModelParameter class

Here we encapsulate both SEIR parameters and the initial conditions. The member variables are alpha, beta, epsilon, gamma, initial exposed, infectious, recovered, and death count.

We store the population in this class because we use it to calculate susceptibles. The population value is constant. So, in the code, it is static, and we could set it only one time.

The SEIR model class

The SEIR model class has one constructor that takes the model parameter as an argument.

The run method takes one parameter, the number of days we want the model to predict. The run method does the calculation to predict the numbers of individuals in all groups in the future. The run method has a nested loop. The outer loop processes a day. We divide each day into small intervals, ten intervals in our case. The inner for loop would compute increments in the numbers of individuals in each group. We use the SEIR difference equations to do the calculation. We put the result into the SEIRResult object.

Retrieving public COVID data

Our COVID data comes from this site. Thank Dylan Jay and other contributors.

We use the Tablesaw library to do table and column data processing. The library could handle statistics Analysis as well as some data visualizations. More details on Tablesaw are available here.

Population

Thailand population data comes from the excel file here. Although the file offers a lot more information, we only need the total population of the country. The following is the code.

The code is simple. We create a table from the excel file. Then, we grab the first row of the Total column. It is the population value. Finally, we set it to the model parameter object.

Cases Briefing

Daily new information is available in CSV format here. The following is the sample content.

There are a lot of data though we are interested in only these columns:

  • Cases
  • Deaths
  • Recovered

The following is the code:

  • First, we create a table by reading the online CSV file.
  • The table has a lot of columns. We remove all except the Date, Cases, Recovered, and Deaths columns.
  • We compute a 7-day rolling average of the Deaths column and store it in a new column. We will not use the new column in the SEIR model. To visualize only.
  • Next, we filter the table. Take only the data that is on or after a specific date. The date is the beginning of the latest COVID wave in the country.
  • We create a new Infectious column from the Cases column subtracts the Recovered column.
  • Cases, Infectious, Recovered, and Deaths, are daily. For each, we create a cumulative value column.
  • Finally, we plot the graphs.

These are the three graphs:

  • Deaths and 7-day average deaths. This graph is the first image of this article.
  • Cumulative cases, infectious and recovered counts
  • Daily cases, infectious and recovered counts
Image by author. The graph of cumulative cases, infectious and recovered count from April 1, 2021 to October 2, 2021 in Thailand
Image by author. The graph of cumulative cases, infectious and recovered count from April 1, 2021 to October 2, 2021 in Thailand
Image by author. The graph of daily cases, infectious and recovered count from April 1, 2021 to October 2, 2021 in Thailand
Image by author. The graph of daily cases, infectious and recovered count from April 1, 2021 to October 2, 2021 in Thailand

Searching for the model parameter

To predict by using the SEIR model, we need the SEIR parameters as well as initial conditions. Here are what we are going to do:

  • Identify unknown independent model parameters and initial conditions.
  • Identify possible values of each unknown parameter.
  • Implement the generic grid search
  • Implement the SEIR parameter grid search

Identifying unknown independent parameters

These are:

  • Alpha
  • Beta
  • Epsilon
  • Gamma
  • Initial exposed count. It is not known.
  • Initial infectious count. Although the official value is available, it might be too low because the authority may not have enough COVID testing. We shall treat it as unknown.
  • Initial recovered count. For the same reason as above, the value from data might be too low. We will treat it as unknown.

The initial death count is available from the data we retrieve. It should be reliable enough.

In conclusion, we have seven unknown independent parameters to find.

Identifying possible values of each unknown parameter

For each model parameter, we identify or list possible values.

  • Alpha ranges from 0.000005 to 0.0001.
  • Beta ranges from 0.05 to 1.0.
  • Epsilon is an inverse of the incubation period. The average incubation period is 5.2 days. We make it a list of values starting from 1/5.6 to 1/4.8.
  • Gamma is an inverse of the infectious period. It’s believed that the period is a week or less. We make it a list of values starting from 1/9 to 1/4.
  • Initial exposed count. We do not have the exact number. We guess it will start from 2 times official infectious to 20 times official infectious count.
  • Initial infectious count. We estimate it will start from the official initial infectious count to 10 times the value.
  • Initial recovered count. We do have the official number; however, the correct number might be higher. We guess it will start from the official recovered count to 10 times the official recovered count.

The GridSearch class

The GridSearch class has the run method as the only one. Note that it’s a generic class to make it is reusable in the future. The type parameter is the ModelParameter class in our case.

The method parameters are:

  • A parameter grid.
  • A constraint of ModelParameter object.
  • An objective function that we need to minimize. We typically supply an error function between the model output and the actual data.
  • A results set size we want the function to return.

The method would evaluate the objective function with all possible values in the grid. Then, return a sorted set of ModelParameter objects that have the lowest values of the objective function.

We use the producer-consumer design pattern for this. We have a single producer to create a lot of parameters and put them on a bound deque. We have multiple consumers to take the parameters of the deque and calculate the objective function value.

Here is some explanation of the code

  • First, the method calculates the number of consumer tasks to create. It is the number of available processors minus 2.
  • Next, it constructs a blocking deque. Deque is a queue that an element can be added to or removed from either the front or back. The deque is bound. It means we could add up to a certain amount of elements. The producer task will block if it reaches the limit.
  • The method creates and submits a producer task.
  • The producer task tests the constraint for each ModelParameter object in the grid and then puts it to the deque.
  • For each consumer, the producer task puts a poison pill to the deque. Consumers will know when to stop.
  • The run method creates and submits consumer tasks.
  • Each consumer task creates a bound tree set. We sort the tree set by objective function value. Also, it keeps only the lowest designated number of elements. The tree set source code is here.
  • Each consumer takes a ModelParameter object from the deque. If it is not a poison pill, evaluate the objective function and add a pair of ModelParameter and the function value to the set. When a consumer task finds a poison pill, it stops.
  • Once all consumers have finished, the run method combines the result from all and returns it to the callee.

The SEIRParameterGridSearch class

This class uses the data table from cases briefing to search for the model parameter.

  • It uses the first row in the table as starting guesses for initial conditions.
  • Next, we create a parameter grid. It does the cartesian product of all parameters. For each combination, it builds a model parameter object and then calls a consumer.
  • Then, we define an objective function which is a function of a model parameter that we want to minimize. We use an error function to compare model output and observed values. The values we use to compare are the daily death count because it is more reliable than the infectious count or the recovered count.
  • Finally, we call the grid search.

The grid search and the result

The code above does the parameter search. Later, we transform the result into a table. Then, save the table as a CSV file as below.

  • We use 61 days of data: from August 3. 2021 to October 2, 2021.
  • The lowest RMSE value is 29.9. We think the value makes sense, given that the data is noisy.
  • The reproduction rate is 1.6.
  • IFR is 0.032%. It seems too low. But according to this site, the lowest death rate is around 0.07%.

Visualising the result

Let’s see how the model fits the observed data in graphs. The following is the code.

  • The first graph is the daily death count from actual data and our models. We could see the models fit quite well with the data. Also, all model results are very close.
  • The 2nd graph is a similar one except that we look 60 days into the future. We could see that the predicted death count gradually decreases. Still, not quite zero yet.
  • The 3rd graph plots infectious count from the observed data and the SEIR models. Here we see the model values are much higher than the official infectious count.
Image by author. The graph of observed daily death count and model results from April 1, 2021 to October 2, 2021 in Thailand
Image by author. The graph of observed daily death count and model results from April 1, 2021 to October 2, 2021 in Thailand
Image by author. The graph of observed daily death count and model results from April 1, 2021 to October 2, 2021 in Thailand. Also, we predict 60 days in to the future
Image by author. The graph of observed daily death count and model results from April 1, 2021 to October 2, 2021 in Thailand. Also, we predict 60 days in to the future
Image by author. The graph of observed infectious count and model results from April 1, 2021 to October 2, 2021 in Thailand. Also, we predict 60 days in to the future
Image by author. The graph of observed infectious count and model results from April 1, 2021 to October 2, 2021 in Thailand. Also, we predict 60 days in to the future

We could see that the situation seems to be better for the country.

Visualising the variation of the reproduction rate

Let’s see how the reproduction rate varies month to month. We will do these:

  • Divide the case data by month.
  • Run the parameter grid search by using monthly data.
  • Compute the monthly reproduction rate and plot the data.

The following is the code:

The graph is below.

Image by author. The variation of the reproduction rate from April 2021 to September 2021
Image by author. The variation of the reproduction rate from April 2021 to September 2021
  • In April, the reproduction rate is less than 1.
  • Then, R0 keeps increasing until July.
  • In August, R0 declines a bit but is still greater than 1.
  • In September, R0 decreases by a lot. The value now is less than one again.

With this reproduction rate variation, we are more confident that the pandemic situation is getting better.

Conclusion

We analyze and forecast the Covid-19 trend using the SEIR model and the public online daily data. The result looks promising. The model output quite fits the observed data.

There are some things we could do to improve the model:

  • We could have included the natural birth rate and the death rate from other causes.
  • We could have added migrant worker numbers to the equations.

All the source code files are available on GitHub here.

Citation: Dylan Jay. Thailand COVID-19 Data. Retrieved from https://github.com/djay/covidthailand. It is licensed under Creative Commons Attribution 4.0 International License


Related Articles