
Attempting to model COVID-related deaths

(and partially failing)

Like most of the world, confined at home because of COVID-19, I am curious about how the pandemic will evolve. And like many other data scientists, I am trying to use data science to get a glimpse of its future. In this post I describe one such attempt. More specifically, I describe my attempt to predict the number of COVID-related deaths from the number of patients on ventilators. It is a (mostly) failed attempt. Now, you might be wondering why you should spend time on a failure. I can offer two reasons. First, along the way I briefly describe some of the most popular modeling methods and give full code in R. Second, I explain why my models failed and what someone could do to make them work. Hence, the reader of this post will gain some useful insights about data modeling.

As mentioned above, we are going to use the number of COVID-19 patients on ventilators to predict the number of COVID-19-related deaths. As I live in Greece, the data we will use are about this country. I have written a Medium post, which you are welcome to read, about how I used Python to scrape the data from Greece’s National Public Health Organisation (EODY). We are not going to use the number most commonly reported in the news: confirmed COVID cases. This is because the number of confirmed cases depends heavily on several other factors for which we have no data. For example, we do not know how many tests were performed, under what conditions, or of what kind. You can find the dataset that we are going to use here.

Data loading and review

As said before, our dataset was created by scraping the announcements of Greece’s National Public Health Organisation. It contains an index column, the url, title and timestamp of each announcement, and the number of COVID patients on ventilators and of COVID-related deaths for the corresponding date. (You might notice that the first entries of the index column ("X1") are empty. This is because I scraped the data up to a certain date and then added a couple more days manually.)

First ten rows of our dataset

The R code for loading the data, sorting it in ascending order by timestamp, creating a date column, and drawing scatter plots of the number of deaths by date and the number of patients on ventilators by date is displayed below.

It is clear that from the middle of October the situation got worse. By the way, sorting the data by timestamp in ascending order is crucial: the next step needs the data in this order. (I have to confess that I missed this when I first wrote the code, and it messed up my models.)

It is said that there is a lag between COVID-19 infection, hospitalization and death. Hence, we are going to use the number of deaths and of patients on ventilators from each of the previous 20 days. For this reason, we create columns ventilator_1, ventilator_2, …, ventilator_20 with the number of patients on ventilators 1, 2, …, 20 days before, and columns deaths_1, deaths_2, …, deaths_20 with the number of COVID deaths 1, 2, …, 20 days before. This is done with the following code. Note that we are creating the new columns programmatically.
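As a rough illustration of the idea (this is a minimal sketch in plain Python, not the R code referred to above), lagged columns can be built by looking back k rows in the date-sorted data. The helper name add_lag_columns and the four toy rows are made up for the example; only the column naming scheme (deaths_1, ventilator_2, and so on) follows the post.

```python
def add_lag_columns(rows, columns, max_lag):
    """Return copies of rows with extra <column>_<k> keys for k = 1..max_lag.

    rows must be sorted by date in ascending order, one row per day,
    otherwise "k rows back" does not mean "k days before".
    """
    lagged = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        for col in columns:
            for k in range(1, max_lag + 1):
                # Value k days earlier, or None when there is not enough history.
                new_row[f"{col}_{k}"] = rows[i - k][col] if i - k >= 0 else None
        lagged.append(new_row)
    return lagged


# Made-up numbers, for illustration only.
days = [
    {"deaths": 1, "ventilator": 10},
    {"deaths": 2, "ventilator": 12},
    {"deaths": 4, "ventilator": 15},
    {"deaths": 3, "ventilator": 19},
]
lagged = add_lag_columns(days, ["deaths", "ventilator"], max_lag=2)

# Rows without a full history cannot be used for training.
complete = [row for row in lagged if None not in row.values()]
```

With max_lag=20, the first 20 days of the series have incomplete history and would have to be dropped in the same way.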

Next, we drop the index, url, title and timestamp columns and split the data into train and test sets.
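One common way to do this for time-ordered data is to train on the earliest rows and test on the latest ones. The sketch below shows that in plain Python. How the split was actually made is not stated here, so the chronological cut, the 80/20 proportion and the exact column names (other than X1) are assumptions.

```python
def drop_columns(rows, unwanted):
    """Return copies of rows without the unwanted keys."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


def chronological_split(rows, train_fraction=0.8):
    """Split date-sorted rows: earliest part for training, latest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Ten made-up rows, already sorted by date.
rows = [
    {"X1": i, "url": "u", "title": "t", "timestamp": i, "deaths": i, "ventilator": 10 * i}
    for i in range(10)
]
features = drop_columns(rows, {"X1", "url", "title", "timestamp"})
train, test = chronological_split(features, train_fraction=0.8)
```

Keeping the test set strictly later than the training set mimics how the model would be used: to predict days it has not seen yet.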


Models

We are going to use columns ventilator_1, ventilator_2, …, ventilator_20 and deaths_1, deaths_2, …, deaths_20 to predict deaths and ventilator. Hence, we are going to build two models: one predicting deaths and the other predicting ventilator.

Linear Model

We begin with simple linear models. The model for predicting COVID deaths has a mean square error of 63.607, while the model for patients on ventilators has 45.523. The respective R² values are 0.9729 and 0.9983, which is suspiciously high.
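Both metrics are easy to compute by hand. The sketch below, in plain Python rather than R, is only meant to show the formulas; the function names and the toy series are made up and do not come from the dataset. It also hints at one possible reason (an assumption, not something checked on the real data) why R² can look so good on a steadily rising series: even a "model" that simply repeats yesterday's value scores close to 1.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# A made-up series that rises by one every day ...
actual = list(range(2, 102))
# ... and a naive "prediction" that just repeats the previous day's value.
naive = list(range(1, 101))

mse = mean_squared_error(actual, naive)
r2 = r_squared(actual, naive)
```

Here the naive predictor is off by one every single day, yet its R² is above 0.99, because the total variance of a trending series is large compared with the day-to-day error.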

According to the model summary, the most important variables for predicting deaths are ventilator_12 and deaths_1. That is, for predicting the number of deaths today, the most important information is the number of deaths yesterday and the number of patients on ventilators 12 days ago. For predicting patients on ventilators, the most important variables are ventilator_1, ventilator_12, deaths_6 and (curiously) deaths_19.

My interpretation of the results is that:

  • the number of deaths depends on the number of patients on ventilators 12 days ago and the number of deaths the previous day, because a patient stays on a ventilator for several days before either dying or recovering. If a steady proportion of these patients dies after some days, then the model uses this number (ventilator_12) to predict the number of deaths and adjusts it using the number of patients who died yesterday.
  • the same logic applies to the model predicting patients on ventilators. This number depends on how many patients there were yesterday. The values of ventilator_12 and deaths_6 are most probably used to estimate how many patients have been on ventilators for several days. As for deaths_19, I have no clue.

I would love to read your interpretation in the comments.

Based on the important variables that we detected, we can create linear models using only them. This time, the model for predicting COVID deaths has a mean square error of 22.452, while the model for patients on ventilators has 28.641. The respective R² values are 0.9298 and 0.9983. Judging by the lower mean square errors, these models perform better.

Regression tree

Since we do not know whether the dependency between our variables is linear, we could try another type of predictive model. Among the most popular are decision trees. Since we are interested in predicting a numerical variable and not a class, the exact term is regression trees. The theoretical details of building a regression tree are out of scope; the interested reader can read more about them on Wikipedia. The model for predicting COVID deaths has a mean square error of 6.878, while the model for patients on ventilators has 1974.847. The latter is huge in comparison to that of the linear regression model.

One of the advantages of decision trees is that they are easy to interpret. The code above creates a graphic representation of the two models. (Note that you can use fancyRpartPlot from the rattle package.) From the first one, we can see that the tree uses the number of patients on ventilators for the first splits and then the number of COVID-related deaths, while the second model mainly uses the number of patients on ventilators the previous day. It should be noted that the outcome of each tree can only be one of the values at its terminal nodes (leaves). For example, the regression tree for the number of deaths can predict at most 71.5 deaths.
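To see why a tree's output is capped, here is a minimal sketch in plain Python of the smallest possible regression tree: a single split ("stump") chosen to minimise the squared error, with each leaf predicting the mean of its training targets. It is not rpart's algorithm, and the function names and numbers are made up for illustration.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree; return (threshold, left_mean, right_mean).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if each side predicts its own mean.
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    """Return the mean of the leaf that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


# Made-up data: patients on ventilators yesterday vs deaths today.
ventilator_yesterday = [20, 30, 40, 200, 300, 400]
deaths_today = [2, 3, 4, 20, 30, 40]

stump = fit_stump(ventilator_yesterday, deaths_today)
inside_range = predict_stump(stump, 400)
far_outside_range = predict_stump(stump, 1000)
```

Both predictions are the same leaf mean: no matter how large the input becomes, the tree can never answer with more than the value of its largest leaf.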

Regression tree for the prediction of COVID-related deaths
Regression tree for the prediction of COVID patients on ventilators
Regression tree for the prediction of COVID-related deaths (plot using fancyRpartPlot)

Random forest

The basic idea behind random forests is to train many decision trees, each using only a sample of the training set and a subset of the available features/variables. The prediction of the random forest is the class predicted by most of the trees if we are building a classification model, or the average of the trees’ predictions if we are building a regression model. Random forests tend to outperform single decision trees, but they are harder to interpret.

R’s randomForest library provides commands to calculate the importance of each variable, but I recommend using the randomForestExplainer library. A key indicator of a variable’s importance is the depth at which it first appears (i.e. how soon it is used) in each tree. We can see that for predicting deaths the most important variables are the number of patients on ventilators 6, 10 and 7 days before. For the model predicting the number of patients on ventilators, the most important variables are the number of patients on ventilators 1, 3 and 2 days before. randomForestExplainer creates an HTML file as output. You can find the files for both the death prediction and the patient prediction on GitHub.
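The averaging step for regression can be sketched in a few lines of plain Python. This is a toy illustration, not the randomForest implementation: each "tree" is a single-split stump trained on a bootstrap sample, and because the made-up data has only one feature, the random choice of features is left out. All names and numbers are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree; falls back to a constant if no split exists."""
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # constant stump (bootstrap sample had one distinct x)
        return left_mean
    return left_mean if x < threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Regression forest prediction: the average of the trees' predictions."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Made-up data: patients on ventilators yesterday vs deaths today.
xs = [20, 30, 40, 200, 300, 400]
ys = [2, 3, 4, 20, 30, 40]
forest = fit_forest(xs, ys)
prediction = predict_forest(forest, 1000)
```

Since every tree predicts a mean of training targets, the average is also bounded by the smallest and largest target seen in training, even for an input far outside the training range.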

The model for predicting COVID deaths has a mean square error of 22.556, while the model for patients on ventilators has 373.394.

Distribution of minimal depth and mean of minimal depth for the most important variables of the death-predicting model. The smaller the depth, the more important the variable.

XGBoost

Finally, we will use XGBoost to create our models. As stated in its documentation:

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.

Generally speaking, in gradient boosting one creates a sequence of models. The first one makes a prediction, and each model that comes after it tries to improve on the models before it by adding a correction to the prediction that reduces the prediction error. In XGBoost, the models can be either decision trees or linear models.
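The "sequence of corrections" can be made concrete with a small sketch in plain Python. It shows generic gradient boosting for squared error with a one-variable linear model as the base learner; it is not XGBoost's actual algorithm (which adds regularisation and much more), and the function names, the learning rate of 0.5 and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature; return (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return mean_y - slope * mean_x, slope


def boost(xs, ys, n_rounds=5, learning_rate=0.5):
    """Return final training predictions and the training MSE after each round."""
    # Round 0: the first "model" predicts the mean of the targets.
    predictions = [sum(ys) / len(ys)] * len(ys)
    errors = []
    for _ in range(n_rounds):
        # For squared error, the residuals are what is still unexplained.
        residuals = [y - p for y, p in zip(ys, predictions)]
        intercept, slope = fit_line(xs, residuals)
        # Add a damped correction from the new model to the running prediction.
        predictions = [
            p + learning_rate * (intercept + slope * x)
            for p, x in zip(predictions, xs)
        ]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, errors


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 4, 7, 9]
predictions, errors = boost(xs, ys)
```

Each round shrinks the training error a little further. With tree base learners the loop is the same; only fit_line is replaced by a small regression tree.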

We will try both tree models and linear models. With tree models in XGBoost, the model for predicting COVID deaths has a mean square error of 24.123, while the model for patients on ventilators has 79.412. With linear models, the respective numbers are 36.025 and 84.208.

According to the model summary, the most important variables for predicting deaths are ventilator_7, ventilator_9 and deaths_4, while for predicting patients on ventilators they are ventilator_1, ventilator_13 and deaths_1.

Variable importance from XGBoost’s model for COVID death prediction
Variable importance from XGBoost’s model for the prediction of patients on ventilators

Model performance

The table below summarizes the mean square error on the test set for the various models we tested. We can see that the regression tree has the smallest error when predicting deaths and the largest when predicting patients on ventilators.

+-----------------+------------+----------------+
|   Model type    | Deaths MSE | Ventilator MSE |
+-----------------+------------+----------------+
| Linear          |     63.607 |         45.523 |
| Simple Linear   |     22.452 |         28.641 |
| Regression Tree |      6.878 |       1974.847 |
| Random forest   |     22.556 |        373.394 |
| xgboost tree    |     24.123 |         79.412 |
| xgboost linear  |     36.025 |         84.208 |
+-----------------+------------+----------------+

Possible ways to improve the models’ performance would be:

  • to use one type of model to predict deaths and another to predict patients on ventilators,
  • to use a mix of methods by using ensembles,
  • to use more data, like the age distribution of patients on ventilators.

The two plots below display the COVID-19 related death predictions of the models for the next 8 and 30 days.

COVID-19 related deaths predictions of the models for the next 8 days
COVID-19 related deaths predictions of the models for the next 30 days

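We can see that in the long run, the predictions of the models based on decision trees are stable. This has to do with the fact that they have only a finite set of outcomes. This stability is an indicator that the tree-based models are wrong: they cannot predict a value outside the set of leaf values learned from the training data, however much the inputs change.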

As for the linear models? Well, I hope that they prove wrong, because they predict an exponential rise in the number of deaths. Also, at the time of writing, Greece is already in a two-week lockdown and the number of patients on ventilators is starting to level off. Hence, one would expect the predicted number of deaths to stop rising.
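Both behaviours are what one would expect if multi-day forecasts are produced recursively, that is, by feeding each day's prediction back in as the next day's lagged input. That procedure is an assumption here, since it is not spelled out above. The plain Python sketch below uses two hypothetical one-lag models: a linear one with a made-up coefficient of 1.1, and a tree-like one with a made-up split whose largest leaf is the 71.5 mentioned earlier.

```python
def recursive_forecast(model, last_value, horizon):
    """Forecast horizon days ahead, feeding each prediction back as input."""
    forecasts = []
    current = last_value
    for _ in range(horizon):
        current = model(current)
        forecasts.append(current)
    return forecasts


def linear_model(deaths_yesterday):
    # Hypothetical coefficient; anything above 1 compounds day after day.
    return 1.1 * deaths_yesterday


def tree_like_model(deaths_yesterday):
    # Hypothetical split; a tree can only answer with one of its leaf values.
    return 30.0 if deaths_yesterday < 50 else 71.5


linear_path = recursive_forecast(linear_model, last_value=60, horizon=30)
tree_path = recursive_forecast(tree_like_model, last_value=60, horizon=30)
```

The linear path is multiplied by 1.1 every day, which is geometric (exponential) growth, while the tree-like path sits on its largest leaf for the whole horizon.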

Then again, the new death toll was announced two hours ago. It was 102. An unexpectedly high number, both for my countrymen and for my models.

With this, please stay safe and wait patiently until this passes.

