COVID-19 Cluster Analysis

Clustering World Countries affected by Coronavirus

Published in Towards Data Science · 5 min read · Apr 14, 2020

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Introduction

Over the past two months, major new epidemic foci of coronavirus disease 2019 (COVID-19) have been identified, and the disease is spreading rapidly across the world. By March 16, 2020, the number of COVID-19 cases outside China had increased drastically, and a large number of countries, states and territories were reporting infections to the World Health Organization (WHO).

On March 11, 2020, on the basis of “alarming levels of spread and severity, and by the alarming levels of inaction”, the Director-General of the WHO characterised the COVID-19 situation as a pandemic [1]. This declaration and the rapid spread of the coronavirus have led many countries around the world to impose strict restrictions on their citizens.

Meanwhile, scientists from many fields are trying to provide insights into the pandemic. Many organizations and universities, including the WHO and Johns Hopkins University (JHU), have made data and visualizations on the spread of the coronavirus publicly available [2, 3]. The JHU Center for Systems Science and Engineering and its researchers created a data repository on GitHub that provides time-series data on the coronavirus for countries around the world. These data record, on a daily basis, the total confirmed cases of the virus, the deaths and the people who have recovered from it.

This article attempts to provide insights by applying clustering techniques and visualizations to the data from the JHU repository on GitHub.

Data Analysis

The data consist of three datasets: confirmed cases, deaths and recovered. Each dataset comprises one time-series per country in the set, running from 01/22/2020 to 04/11/2020. Each dataset also contains some extra columns, including the Province/State of each country and its latitude and longitude coordinates. The figure below provides a glimpse of one of the datasets before any preprocessing.

Figure 1: Dataset before any preprocessing.

After some data preprocessing, we end up with clean datasets that are ready to be analysed. The sourced data are cumulative time-series, meaning that each day’s value is the previous day’s value plus that day’s new cases. For this analysis, the time-series are therefore transformed to represent the new cases of each day.
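The transformation from cumulative totals to daily new counts can be sketched as follows. This is a minimal illustration rather than the code used for the article: the table layout (one row per country, one column per date) mirrors the description above, but the country names, the values and the helper `to_daily_new` are hypothetical.

```python
import pandas as pd

# Hypothetical cumulative counts for two countries (illustrative values only).
cumulative = pd.DataFrame(
    {"1/22/20": [0, 1], "1/23/20": [2, 1], "1/24/20": [5, 4], "1/25/20": [9, 4]},
    index=["Country A", "Country B"],
)


def to_daily_new(cum: pd.DataFrame) -> pd.DataFrame:
    """Convert cumulative totals (one row per country) into daily new counts."""
    # Difference between each date column and the one before it.
    daily = cum.diff(axis=1)
    # The first day has no predecessor, so its new count is the cumulative value.
    daily.iloc[:, 0] = cum.iloc[:, 0]
    return daily.astype(int)


daily_new = to_daily_new(cumulative)
print(daily_new)
```

Summing the daily values cumulatively should give back the original table, which is a convenient sanity check on the transformation.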

Next, the trendline coefficient is calculated for each of the transformed time-series by performing linear regression. To keep only statistically significant results at the 95% confidence level, countries whose trend is not statistically significant (i.e. the p-value of the trendline coefficient is greater than the α level of 0.05) are dropped from the analysis.
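A minimal sketch of this step is shown below. It assumes that the trendline coefficient is the slope of an ordinary least-squares line fitted to the daily counts against the day index, and it uses `scipy.stats.linregress` as one way to obtain the slope and its p-value; the article does not say which library was used. The two series and the helper `trend` are hypothetical.

```python
import numpy as np
from scipy.stats import linregress

ALPHA = 0.05  # significance level stated in the text


def trend(daily_counts):
    """Return (slope, p_value) of a least-squares line fitted against the day index."""
    y = np.asarray(daily_counts, dtype=float)
    x = np.arange(len(y))  # assumption: days are equally spaced, indexed 0, 1, 2, ...
    fit = linregress(x, y)
    return fit.slope, fit.pvalue


# Hypothetical daily new-case series (illustrative values only).
series = {
    "Rising": [0, 2, 3, 5, 8, 9, 12, 14],
    "Flat": [2, 3, 2, 3, 2, 3, 2, 3],
}

# Keep only the countries whose slope is significant at the ALPHA level.
significant = {}
for name, counts in series.items():
    slope, p_value = trend(counts)
    if p_value < ALPHA:
        significant[name] = slope

print(significant)
```

Here the steadily rising series is kept, while the series that only oscillates around a constant level is dropped because its slope cannot be distinguished from zero.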

Cluster analysis is then performed on the trendline coefficients of the remaining countries. The K-means algorithm is applied to the standardized coefficients. K-means clusters data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. The number of clusters is determined using the elbow method based on inertia [4] and the average silhouette method [4].
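The clustering step could look roughly like the sketch below, which uses scikit-learn. The choice of library, the range of candidate cluster counts and the coefficient values are all assumptions made for illustration; only the overall recipe (standardize, run K-means, compare inertia and average silhouette across values of k) comes from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical trendline coefficients, one per country (illustrative values only).
slopes = np.array([0.1, 0.2, 0.15, 0.3, 5.0, 5.5, 4.8, 20.0, 21.0, 19.5]).reshape(-1, 1)

# Standardize to zero mean and unit variance before clustering.
X = StandardScaler().fit_transform(slopes)

inertias, silhouettes = {}, {}
for k in range(2, 7):  # assumed candidate range for the number of clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow method)
    silhouettes[k] = silhouette_score(X, model.labels_)  # average silhouette

# Pick the k with the highest average silhouette; the elbow in `inertias`
# is normally judged by eye from a plot of inertia against k.
best_k = max(silhouettes, key=silhouettes.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

print(best_k, labels)
```

In practice the two criteria do not always agree, so the final number of clusters is a judgment call informed by both.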

The clusters are visualized on interactive world maps to provide a better understanding of the situation. Countries left uncoloured on the maps are not part of this analysis, either because their trendline coefficients are not statistically significant or because no time-series data are available for them. In the sections that follow, Figures 1, 3 and 5 display the clustering of countries based on confirmed cases, deaths and people recovered respectively, while Figures 2, 4 and 6 show the average trend coefficient of each cluster, in percentage values, for the same three datasets. Hovering over a country in Figures 1, 3 and 5 shows its name, the cluster it belongs to and its trendline coefficient in percentage. Hovering in Figures 2, 4 and 6 shows the cluster name, the average value of the trend and the countries that belong to the cluster.
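The per-cluster information behind Figures 2, 4 and 6 (average trend and member countries) can be assembled with a simple group-by, as in the sketch below. The table, its column names and its values are hypothetical, and the sketch reports raw average slopes; the article does not spell out how the percentage values in the figures are derived, so that conversion is not reproduced here.

```python
import pandas as pd

# Hypothetical clustering output: one row per country (illustrative values only).
results = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "E"],
        "cluster": [1, 1, 2, 2, 3],
        "trend": [0.2, 0.4, 5.0, 6.0, 20.0],
    }
)

# Average trend and list of member countries for each cluster,
# ordered from the steepest cluster to the flattest.
summary = (
    results.groupby("cluster")
    .agg(avg_trend=("trend", "mean"), countries=("country", list))
    .sort_values("avg_trend", ascending=False)
)

print(summary)
```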

Confirmed Cases

Figure 1: Clustering on Confirmed Cases

Figure 2: Average Trend of each Cluster — Cases

In terms of interpreting the clusters, the United States shows a very large daily increment in confirmed cases. The countries in cluster 3, namely France, Germany, Italy and Spain, exhibit a modest daily increment. Turkey, Iran and the United Kingdom show an increment of about half that of cluster 3. All the other countries in the set show a stable, low increment.

Confirmed Deaths

Figure 3: Clustering on Confirmed Deaths

Figure 4: Average Trend of each Cluster — Deaths

In the deaths dataset, the United States and Italy clearly have a steeper increasing trend than the other countries. They are followed by cluster 5, Spain and France, with a slightly lower trend than the US and Italy. Cluster 3 includes the United Kingdom, while cluster 4 includes Germany, Iran, Belgium and the Netherlands. All the other countries in the set show a minor increment.

Confirmed Recovered

Figure 5: Clustering on Confirmed Recovered

Figure 6: Average Trend of each Cluster — Recovered

The clusters for the trend in people recovered from the coronavirus can be seen above. Cluster 3, which includes Germany and Spain, shows the highest daily increment. Cluster 2, which includes Iran and China, comes second. Third in rank is cluster 5, comprising France, Italy and the US, with a percentage increment of about half that of the leading cluster 3. Cluster 4 includes countries such as Austria, South Korea, Belgium, Canada and Switzerland, and the remaining cluster includes all the other countries in the set, which show a minor increment.

Results

According to the above analysis of trendline coefficients for confirmed cases, deaths and people recovered from the coronavirus, there are identifiable clusters in daily growth. Specifically, confirmed cases worldwide form four distinct clusters, while confirmed deaths and people recovered form five clusters each. Please note that countries with a low number of new cases per day will also have a low number of deaths and recoveries, and thus their increment will also be lower.

Conclusion

To sum up, this article endeavours to provide insights into the COVID-19 pandemic by applying clustering techniques and visualizations to the data provided by the JHU Center for Systems Science and Engineering. The data consist of three datasets: confirmed cases, deaths and people recovered from the virus. Each dataset contains one time-series per country. Clustering the trendline coefficients of these time-series with the K-means algorithm brings to the fore meaningful clusters of countries experiencing broadly similar situations.

References

[1] WHO Director-General’s opening remarks at the media briefing on COVID-19 - 11 March 2020. Retrieved from: https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---11-march-2020

[2] Coronavirus (COVID-19). World Health Organization. Retrieved from: https://who.sprinklr.com/

[3] Coronavirus Resource Center. Johns Hopkins University. Retrieved from: https://coronavirus.jhu.edu/

[4] Yuan, C., & Yang, H. (2019). Research on K-Value Selection Method of K-Means Clustering Algorithm.