Analysis of COVID-19 using per capita data

The number of cases per million inhabitants is usually not provided by media reports, and this changes everything!

Rubens Mau
Towards Data Science

Every day we are bombarded with tables, charts, and articles on the spread of the COVID-19 virus; the table reproduced below is a typical example.

However, simply showing the total number of infections and fatalities does not provide us with much information, since we are comparing completely different population sizes. China’s population is more than 20 times the population of France.

If we want to compare different countries’ actions, we need to measure the spread of the disease by the number of cases per million inhabitants. This may seem obvious, but it is almost never applied by the media or by official organizations.
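As a minimal sketch of this normalization, the snippet below scales a raw count to a rate per million inhabitants. The two countries, their death counts, and their populations are made up for illustration and are not taken from the article's data; the helper name `per_million` is likewise hypothetical.

```python
def per_million(count, population):
    """Scale a raw count to a rate per one million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 1_000_000 / population


# Hypothetical countries and counts, for illustration only.
countries = {
    "Country A": {"deaths": 5_000, "population": 1_000_000_000},
    "Country B": {"deaths": 2_000, "population": 50_000_000},
}

rates = {
    name: per_million(data["deaths"], data["population"])
    for name, data in countries.items()
}

# Sort from the highest to the lowest rate before printing.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.1f} deaths per million inhabitants")
```

In this invented example, Country A reports more deaths in absolute terms, but Country B has the higher rate once population size is taken into account.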

Moreover, since more than 80% of the deceased are above the age of 65, we need to go even further and check the number of cases/deaths per million inhabitants over the age of 65. This applies both when comparing different countries and when comparing different neighborhoods in a single city.
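One possible way to compute such a rate is sketched below. It divides total deaths by the population over 65, which is an assumption about the calculation; if deaths broken down by age were available, they could be used in the numerator instead. The countries, counts, and age shares are hypothetical.

```python
def deaths_per_million_over_65(deaths, population, share_over_65):
    """Deaths per one million inhabitants aged over 65.

    share_over_65 is the fraction of the population over 65 (0 < share <= 1).
    """
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 < share_over_65 <= 1:
        raise ValueError("share_over_65 must be in (0, 1]")
    population_over_65 = population * share_over_65
    return deaths * 1_000_000 / population_over_65


# Hypothetical countries: same deaths and population, different age structure.
countries = {
    "Country C": {"deaths": 1_000, "population": 10_000_000, "share_over_65": 0.10},
    "Country D": {"deaths": 1_000, "population": 10_000_000, "share_over_65": 0.25},
}

rates_over_65 = {
    name: deaths_per_million_over_65(**data) for name, data in countries.items()
}

for name, rate in rates_over_65.items():
    print(f"{name}: {rate:.0f} deaths per million inhabitants over 65")
```

Both invented countries have the same number of deaths per million inhabitants overall, yet the rate relative to the older population differs considerably.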

It is important to note that there are other problems that make a thorough understanding of the COVID-19 spread difficult. Above all, reliable data is hard to get:

  • Many cases are asymptomatic, so even if we could test all sick people we could not get a reliable figure on actual cases.
  • Many countries are not managing to test all sick people, and there is a lag in getting the results. Therefore, these results should be reported retroactively.
  • Fatalities are not tallied consistently, because many countries, such as the UK, only report deaths in hospitals.

Due to the high variability in the quality of the data between countries, and to further improve our comparison, it is necessary to:

  • Average the data over a period of time to reduce the noise in the data.
  • Plot the data on the same time basis, e.g. with time zero set to the day when 100 cases were detected.
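The two steps above can be sketched in plain Python as follows. The function names, the sample numbers, and the choice to align first and smooth afterwards are assumptions made for illustration; the notebooks may implement these steps differently.

```python
def align_to_threshold(cumulative_cases, series, threshold=100):
    """Drop the days before cumulative cases first reach the threshold."""
    for day, total in enumerate(cumulative_cases):
        if total >= threshold:
            return series[day:]
    return []  # the threshold was never reached


def moving_average(values, window=5):
    """Trailing moving average; the first window - 1 days produce no value."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily series for one country, for illustration only.
cumulative_cases = [10, 40, 90, 150, 300, 500, 800, 1200, 1700, 2300]
daily_deaths = [0, 1, 0, 2, 4, 3, 6, 5, 7, 10]

aligned = align_to_threshold(cumulative_cases, daily_deaths)  # day zero = 100 cases
smoothed = moving_average(aligned, window=5)
print(smoothed)
```

With pandas, the smoothing step could instead be written as `Series.rolling(window=5).mean()`.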

I now describe the results I obtained using some simple data tools. The data was last updated on April 20th, 2020; however, you can run the notebooks mentioned below to get updated charts.

Figure 1: averaged number of deaths per million inhabitants

I was astonished by the chart in Figure 1, which presents the daily number of deaths per million inhabitants, averaged over 5 days. Belgium has the highest death rate, but you don’t read about it even in the local media. You can confirm my results at https://www.statista.com/statistics/1104709/coronavirus-deaths-worldwide-per-million-inhabitants/

Figure 2: averaged number of deaths per million inhabitants over 65

Figure 2 presents the same data as Figure 1, but relative to the population over 65 only. Here we can see that Italy had fewer deaths per million older inhabitants than Spain, although Italy has a larger share of older people in its population, which is how the media was trying to explain Italy's mortality rate a few weeks ago.

Figure 3: accumulated deaths per million inhabitants

In Figure 3, we can see that Belgium is still in bad shape, although its rate of increase is slowing.

Figure 4: Belgium: averaged daily number of deaths/confirmed cases

In Figure 4, we focus on the confirmed cases and deaths in Belgium. We see that the daily number of deaths has stopped increasing. Note that these charts are based on a 5-day moving average.

Figure 5: São Paulo City: number of deaths per 100k inhabitants

In my opinion, the most important analysis is at the local level.

Figure 5 is a map of the neighborhoods of the city of São Paulo, Brazil, displaying the number of deaths per 100 thousand inhabitants. Here we can see that the affected areas vary greatly at the local level. If we correlated the death rate in each neighborhood with other factors, such as access to health facilities, population density, and income level, we could gain a better understanding of why the disease is spreading and how to contain it.
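As a rough sketch of what such a comparison could look like, the snippet below computes a Pearson correlation coefficient between a death rate and one candidate factor. This analysis is not part of the notebooks, and the neighborhood values are invented for illustration; they are not the São Paulo data. A real analysis would also need to account for confounding factors, since correlation alone does not establish a cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical neighborhoods, for illustration only.
deaths_per_100k = [4.0, 9.5, 12.0, 6.5, 15.0]
inhabitants_per_km2 = [3_000, 8_000, 11_000, 5_500, 14_000]

r = pearson(deaths_per_100k, inhabitants_per_km2)
print(f"Correlation between death rate and population density: {r:.2f}")
```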

Unfortunately, the detailed data this chart is based on is not public. I requested it from the Sistema Eletrônico de Informação ao Cidadão (e-SIC), and it was granted to me a single time.

You can access this interactive chart at https://rubensmau.github.io/saopaulo_deaths_neighborhood.html

CONCLUSION

I hope this post helps the media and other organizations understand the need to report cases per million inhabitants, and thus improve the quality of their reports. Moreover, this type of analysis could be of great value in informing future policies at the local and national level. It is very important to provide not only images but also the data these reports are based on.

You can access the two notebooks that generated these charts on Kaggle, where you can run them online (see instructions below) and access the data. No programming skills are required.

https://www.kaggle.com/rubensmau/covid-19-deaths-per-capita

https://www.kaggle.com/rubensmau/sao-paulo-city-data-on-covid-19

HOW TO RUN A KAGGLE NOTEBOOK

  1. Register for free at https://www.kaggle.com/account/login
  2. Click on one of the links above
  3. Click on the “Copy and Edit” button at the top right of the page
  4. Click on “Run All” to run the notebook
  5. You can access the data by clicking on the data tab, at the right of the page.

Charts built using https://matplotlib.org and https://altair-viz.github.io/
