DATA SCIENCE — COVID-19 — VISUALISATION — PROGRAMMING

Comparison of COVID-19-data from different locations: Normalization — Showcase — Programming

The Anh Vuong, Dr.-Ing.

Published in

Towards Data Science

8 min read · Jun 23, 2020

How many infected people are still in our near environment, in our country, and further to neighboring countries?

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

The COVID-19 pandemic looks like a world war, spreading to 213 countries and regions around the world [worldometer.info] and bringing deaths, sickness, fear, sadness, disaster, and chaos. An immense volume of COVID-19 data streams to us daily, like messages from the front of the battle with our invisible enemy, the SARS-CoV-2 virus. Faced with this volume of data, I asked myself a question:

“How many infected people are still in our near environment, in our country, and further to neighboring countries?”

The number of currently infected patients is important; it helps us in daily life, planning, work, and prevention. With this motivation, I am extending my research interest from the visualization of data and the estimation of undiscovered infection cases to the comparison of active infection cases from different locations.

In this article, I want to share my method, “Normalization of accumulated active cases”, for analyzing data from multiple countries.

Because my data source delivers data from many countries, I can work at a “high level” and compare data from countries around the world. However, you can also use my methods and my open-source software package, written in Python, to analyze data from other geographic locations.

I present here some showcases to demonstrate the method I am developing. They are not professional reports (as found at WHO, CDC, RKI), but they could help us understand what is going on in the COVID-19 pandemic, beyond the immense volume of data.

Which COVID-19 data is important to compare?

Before developing the comparison of data, I would like to introduce the different types of COVID-19 data. They are:

The number of daily confirmed infection cases, the “new cases”.
The number of currently infected patients, the “active cases”.
The number of estimated active cases, which includes the undiscovered cases; it can be estimated with my Vuong-algorithm or other methods.
The number of recovered / discharged patients.
The death cases.
etc.

These data are visualized as graphs of:

Daily numbers
Accumulated numbers

Daily numbers y(t) are discrete values, so the function y(t) is a “time series”. Here I use y[x] instead of y(t), where x is the date-time (see the Python documentation: “datetime module”). This makes programming and plotting easier later.

{ y[x] | x = x0, x1, …, xn, …, xN; xn is a date-time }

The accumulated number s[k] is the sum of k values of y[x].

Table 1 shows an example with y[x], the number of daily new cases, and its accumulated number s[k].
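The relation between daily and accumulated numbers can be sketched in a few lines of Python. The daily values below are made up for illustration, and the zero-based indexing (s[k] as the running total up to and including index k) is an assumption of this sketch.

```python
from itertools import accumulate

# Hypothetical daily new cases y[x] for five consecutive days (made-up values)
daily_new_cases = [2, 3, 0, 5, 1]

# s[k]: running total of y up to and including index k
accumulated = list(accumulate(daily_new_cases))

print(accumulated)
```

The last element of the accumulated series always equals the sum of all daily values.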

The accumulated active numbers are the numbers of currently infected patients. This is important information for our planning, working, and preventing infection.

These numbers are very difficult to determine because they depend on many parameters: the number of infections, which is difficult to detect exactly because of asymptomatic patients, and the numbers of recovered cases and deaths, which are also difficult to determine exactly. But we can obtain the numbers of accumulated active cases in several ways:

The numbers can be collected directly from reports of professional institutions, e.g. test centers, health care centers, CDC.
The numbers can be estimated indirectly from infection cases using an SIR model (e.g. Brian Collins, Dynamic Modeling of Covid-19), or using my simulator with the Vuong-algorithm, which estimates infected cases from the numbers of daily confirmed new cases and deaths.
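To illustrate why the active cases depend on confirmed cases, recoveries, and deaths at the same time, here is a minimal sketch of a simple bookkeeping rule: active cases = cumulative confirmed minus cumulative recovered minus cumulative deaths. This rule is an assumption made for illustration; it is neither the Vuong-algorithm nor an SIR model. The function name active_cases and all input values are hypothetical.

```python
from itertools import accumulate

def active_cases(new_cases, new_recovered, new_deaths):
    """Accumulated active cases per day under the simple bookkeeping assumption
    active = cumulative confirmed - cumulative recovered - cumulative deaths."""
    if not (len(new_cases) == len(new_recovered) == len(new_deaths)):
        raise ValueError("all daily series must cover the same days")
    confirmed = accumulate(new_cases)
    recovered = accumulate(new_recovered)
    deaths = accumulate(new_deaths)
    return [c - r - d for c, r, d in zip(confirmed, recovered, deaths)]

# Made-up daily values for illustration only
example = active_cases([10, 20, 30, 10], [0, 5, 10, 20], [0, 1, 1, 2])
print(example)
```

Any error in one of the three input series carries over directly into the active cases, which is why the result is only as reliable as the reported data.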

How to compare the data from different countries?

Let us examine the chart in Fig.1!

Fig. 1: Graphs of active accumulated cases from different countries. Data Source: https://ourworldindata.org/coronavirus-source-data

It shows the graphs of accumulated active cases from different countries. If you want to compare them with each other, you run into the following problems:

It is like comparing an elephant (the world’s numbers) with a bear (the USA’s numbers) or a rabbit (Germany’s numbers). This is the main problem with the comparison.
The amplitude of the signal is not known in advance. The accumulated number of infection cases rose to 1 million on 04-02-2020, 2 million on 04-15-2020, 4 million on 05-09-2020, and 8 million on 06-15-2020!
The generation of infection cases is a non-linear system.
COVID-19 transmission differs from country to country.

Therefore, I propose a method for COVID-19 data analysis: normalizing the data of accumulated active cases.

Normalizing data of accumulated active cases

The concept has the following steps:

Step 1:

The data series must cover the same period for each country, e.g. from Jan 2020 to Jun 30, 2020.

Step 2:

For each country, we do the following:
- We store the accumulated data in the array s[].

- We search for the maximum A of s[k]:
A = max(s[k], k = 0, …, N-1), where N is the number of elements of s[]
- Then we calculate the raw normalized data:
sn[k] = s[k] / A
- We could use the raw values sn[k] directly, but I propose using the percentage values sp[k], which are easier to understand:
sp[k] = 100 * s[k] / A

Table 2 shows an example of normalizing s[k] for two countries, where the maximum of s[k] is 10 for country A and 500 for country B.

Table 2: Calculation of data from two countries
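The normalization step can be sketched as follows. This is a minimal illustration, not the code of the VuongSimulator: the function name normalize_percent is hypothetical, and the series values are made up. Only the two maxima (10 and 500) follow the example described in the text.

```python
def normalize_percent(s):
    """Return sp[k] = 100 * s[k] / A, where A = max(s)."""
    if not s:
        return []
    peak = max(s)
    if peak == 0:
        # No cases at all: avoid dividing by zero, everything stays at 0 %
        return [0.0 for _ in s]
    return [100.0 * value / peak for value in s]

# Made-up accumulated active cases; only the maxima (10 and 500) follow the text
country_a = [2, 5, 10, 8, 4]
country_b = [50, 200, 500, 450, 300]

sp_a = normalize_percent(country_a)
sp_b = normalize_percent(country_b)

print(sp_a)
print(sp_b)
```

After normalization both series peak at 100%, so they can be drawn on the same axis even though their absolute values differ by a factor of 50.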

Data Analysis with COVID19-VuongSimulator

For easier communication, I will call

“the graph of normalized data of active cases” simply “the normalized active cases”, and
“the graph of normalized data of estimated active cases with Vuong-algorithm” simply “the normalized V-active cases”.

To interpret the graphs of normalized data, I have generated two showcases with the VuongSimulator in multi-country mode (see the VuongSimulator commands).

Showcase 1: The normalized V-active cases from one country
Showcase 2: The normalized V-active cases from multiple countries: Germany, Italy, Sweden, the USA, and the world.

Showcase 1: The normalized V-active cases from one country

From showcase 1, “The normalized V-active cases from one country”, you can read the following information (see Fig. 2):

The percentage of active cases relative to their maximum, reached on April 1.
The outbreak date of the infection in a country is the date-time at which the graph reaches its maximum of 100% (see Fig. 2).
At the last date-time, you can read the percentage of currently infected patients still present in the country: 14%.

Fig. 2: Showcase 1: Analysis of COVID-19 data [Germany]. Data source: https://ourworldindata.org/coronavirus-source-data

VuongSimulator command for showcase 1:

./data/vmodel_testlist.csv
Germany

$ python ./covid19-VuongSimulator.py -c 'World' -o test.png -n ./data/new_cases.csv -d ./data/new_deaths.csv -g 0.98 -r 14 -t 7 -s 76

Showcase 2: The normalized V-active cases from multi-country: Germany, Italy, Sweden, the USA, and the World

From showcase 2 (Fig. 3), you can get the following information:

The active cases from different countries are now normalized. Each graph has the same maximum of 100%.
The percentage of active cases of each country is relative to that country’s own maximum (100%).
At any date-time, we can compare the “status” of each country against the others, though only in percentages.
We can find the outbreak date of each country, where its graph reaches the maximum of 100%.
We can see a tendency towards reduction: 14% for Germany and 34% for Italy. The percentages are easier to understand than absolute values.
We can see a plateau: 74% for the USA.
We can see the beginning of the outbreak in Sweden, because its graph reaches its maximum of 100% at the last date-time.
We see that the world’s graph had a local maximum in the middle of April, then increased again and reached 100% on the last date of the chart. We could expect the next maximum, a second wave, in the future.
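The readings listed above (date of the 100% peak, percentage on the last date, still rising or past the peak) can also be extracted programmatically. The following is a minimal sketch with made-up normalized series for two hypothetical countries; the helper peak_and_latest is not part of the VuongSimulator.

```python
from datetime import date, timedelta

def peak_and_latest(dates, sp):
    """Return (date of the 100 % peak, percentage on the last date)."""
    if len(dates) != len(sp) or not sp:
        raise ValueError("dates and sp must be non-empty and equally long")
    # Index of the first occurrence of the maximum value
    peak_index = max(range(len(sp)), key=lambda k: sp[k])
    return dates[peak_index], sp[-1]

# Made-up normalized series for two hypothetical countries, weekly steps
start = date(2020, 3, 1)
dates = [start + timedelta(weeks=k) for k in range(6)]
normalized = {
    "Country A": [5.0, 40.0, 100.0, 70.0, 30.0, 14.0],  # past its peak, declining
    "Country B": [1.0, 10.0, 30.0, 55.0, 80.0, 100.0],  # peak on the last date
}

for name, sp in normalized.items():
    peak_date, latest = peak_and_latest(dates, sp)
    status = "rising" if peak_date == dates[-1] else "past peak"
    print(name, peak_date.isoformat(), latest, status)
```

A series whose peak falls on the last date has not yet shown a maximum at all; its 100% is only the highest value so far.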

Using the VuongSimulator, you can compare data from different countries according to your own countries-list; there are 212 infected countries to compare! (See “Install and start” in this article.)

Fig. 3: Showcase 2: Analysis of COVID-19 data [multi-country]. Data source: https://ourworldindata.org/coronavirus-source-data

VuongSimulator command for showcase 2:

./data/vmodel_testlist.csv
World,Germany,Italy,Sweden,United States

$ python ./covid19-VuongSimulator.py -c 'World' -o test.png -n ./data/new_cases.csv -d ./data/new_deaths.csv -g 0.98 -r 14 -t 7 -s 76

Covid19-VuongSimulator with Normalizing mode for multi-country

I have implemented the normalization of accumulated active cases for multiple countries in covid19-VuongSimulator.py. The simulator is controlled entirely from the command line, which avoids interactive dialogs. The command parameters are described in the wiki of the project.

The VuongSimulator needs a countries-list to generate output charts with multiple graphs. This list is the CSV file ./data/vmodel_testlist.csv

Default setting:

./data/vmodel_testlist.csv
World,Germany,France,Belgium,Sweden,United Kingdom,Russia,Brazil,United States

To analyze data from other countries, edit the countries-list with a plain text editor.
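The countries-list is a single comma-separated line, so it is easy to read or check from your own scripts. The sketch below is only an illustration of the file format shown above; read_country_list is a hypothetical helper, not the parser used inside the VuongSimulator, and the demonstration writes to a temporary file instead of touching ./data/vmodel_testlist.csv.

```python
import csv
import os
import tempfile

def read_country_list(path):
    """Read a comma-separated countries-list and return the names in order."""
    names = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip empty fields and strip stray whitespace around names
            names.extend(name.strip() for name in row if name.strip())
    return names

# Demonstration with a temporary file that mimics the default list
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "vmodel_testlist.csv")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("World,Germany,France,Belgium,Sweden,"
                     "United Kingdom,Russia,Brazil,United States\n")
    countries = read_country_list(demo_path)

print(countries)
```

Names with spaces, such as United Kingdom, stay intact because only the comma acts as a separator.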

Install and start

The VuongSimulator is described in detail in my posts [1, 2] and in the wiki of the project tavuong/covid19-datakit, so here I give only a brief installation guide and a quick start for reproducing the showcases above, which I hope will motivate you to try it yourself.

$ git clone https://github.com/tavuong/covid19-datakit.git
$ pip install numpy
$ pip install matplotlib
$ cd ~/covid19-datakit/
$ python ./covid19-VuongSimulator.py    # on a PC
$ python3 ./covid19-VuongSimulator.py   # on a Raspberry Pi

Quick Start

For a quick start after the installation, you can use the following command. It generates a showcase with normalized accumulated active cases from multiple countries (Fig. 4). To make another showcase, edit the countries-list and run the command again. The CSV files (./data/new_cases.csv and ./data/new_deaths.csv) are in the folder ./data and were downloaded from open sources.

VuongSimulator command for Quick Start:

./data/vmodel_testlist.csv
World,Germany,France,Belgium,Sweden,United Kingdom,Russia,Brazil,United States

$ python ./covid19-VuongSimulator.py -c 'World' -o test.png -n ./data/new_cases.csv -d ./data/new_deaths.csv -g 0.98 -r 14 -t 7 -s 76

Fig. 4: Quick Start: Analysis of COVID-19 data [multi-country]. Data source: https://ourworldindata.org/coronavirus-source-data

Summary

Building on the development of the Vuong-algorithm for the analysis and estimation of COVID-19 data, I have extended my software package with new functions for multi-country analysis. In the process, a method for normalizing data of accumulated active infection cases has been developed. With it, one can compare data from different locations and estimate the geographic spread of the virus. It would be great if my proposed method could raise the interest of users in different professions, or even of professional institutions.

Please don’t hesitate to contact me to discuss the current development, and perhaps to contribute your modules to my open-source, MIT-licensed project tavuong/covid19-datakit on GitHub.

Have fun!

Acknowledgments for review: Prof. Dr. Kien Pham

Acknowledgments for support and coffee cake motivation: my wife Thi Chung Vuong