Create an interactive COVID-19 report using R, host it for free and automate its update

Engaging Plotly visualisations, a Shiny UI and GitHub Actions to create the perfect report!

This post is part 1 of a series of 4 publications. Refer to part 1 for an overview of the series, part 2 for an explanation of the data sources and minor data cleaning, part 3 for creating the visualisations, building the report and deploying the document to ShinyApps.io, and part 4 (soon to be ready) for automatic data update, compilation and publishing of the report. [Project repo]

Each article is self-contained, meaning you don’t really need to read the other parts to make the most out of it.

Navigation of the interactive report

News outlets and government bodies have been bombarding the population with COVID-19 statistics and graphs since March 2020. This Medium article by Tomas Pueyo, published in early March, was one of the first to explore COVID data in depth, revealing many insights and getting tens of millions of views in just a few days. Johns Hopkins University (JHU) in the US has also provided a very useful dashboard, as well as their data, which gets updated daily. Their data combines multiple sources such as the WHO, the US Centers for Disease Control and Prevention (CDC), and the European Centre for Disease Prevention and Control (ECDC), among others.

Early on, the internet was flooded with reports from universities. I realized that I would not be able to make a world-class dashboard or report, but I wanted at least to make my own and learn how to build one.

Table of Contents

· Teaser · ggplot, Plotly and Shiny · Data sources · Plan · See you soon!

Teaser

Before we get started, I want to show you what the end product will look like. We will end up with an interactive cloud-based HTML report (with a Shiny backend) showing different COVID-19-related plots. You can interact with the final report by following this link → https://lucha6.shinyapps.io/covideda/

We will be making the following (interactive) plots and baking them into a beautiful interactive RMarkdown document that gets updated daily without any extra work on our part.

Sample R (Shiny) code and output displaying a leaflet map where the countries are coloured by the number of confirmed cases

ggplot, Plotly and Shiny

ggplot is, in my opinion, the nicest plotting library out there. It is based on the Grammar of Graphics (GoG) by Leland Wilkinson and is implemented and distributed as an R package (ggplot2). I recommend reading the original 2010 ggplot paper by the almighty Hadley Wickham on the first implementation of ggplot, and a comprehensive but concise guide to the theory behind the GoG by Dipanjan (DJ) Sarkar.

Plotly is a great cross-language library (originally written in JavaScript) for making interactive plots. In R, which is the language we will be using today, the plotly package has a function called ggplotly() that turns most ggplot plots into interactive Plotly plots with a single function call.
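As a minimal sketch of that conversion (not the code used in the report), the snippet below builds a small ggplot line chart and passes it to ggplotly(). The data frame and its column names (day, country, new_cases) are made up for illustration, and the ggplot2 and plotly packages are assumed to be installed.

```r
library(ggplot2)
library(plotly)

# Toy data, invented for illustration (not real COVID figures)
cases <- data.frame(
  day       = rep(1:5, times = 2),
  country   = rep(c("A", "B"), each = 5),
  new_cases = c(1, 3, 6, 10, 15, 2, 2, 5, 7, 8)
)

# A regular, static ggplot object
p <- ggplot(cases, aes(x = day, y = new_cases, colour = country)) +
  geom_line() +
  labs(x = "Day", y = "New cases")

# Convert the static plot into an interactive Plotly htmlwidget
interactive_p <- ggplotly(p)
```

Printing interactive_p in an R session or an RMarkdown chunk renders the interactive version, with hover tooltips and zooming.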

Shiny allows R users to make great-looking dashboards and interactive reports without needing to know any JavaScript or HTML, all with R alone. RStudio also hosts up to 5 free dashboard websites per user, which is a frictionless way to build your first dashboard, web app, or interactive R notebook.

Data sources

When I started this project back in March, the data source I used was from Kaggle: the Novel Corona-Virus dataset uploaded and maintained by the user SKR. As described in the dataset description, this is simply a mirror of the JHU dataset, which is also available on GitHub. Unfortunately, I discovered late in the project that there were several artefacts in this data source. For example, in Spain, -10000 confirmed cases were reported on April 24th and -2000 deaths were recorded on May 25th, as discussed in this GitHub issue (one of the 1300+ open issues in the JHU repository). Other similar problems were present in the data, and the fact that the repository maintainers did not step up to explain them made me doubtful about using this data. Furthermore, the daily COVID case plots from Google or Our World in Data (OWiD) did not show these artefacts.
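To make this kind of artefact concrete, here is a small sketch of how negative daily counts can be spotted. It assumes the raw series holds cumulative totals and that daily counts are obtained by differencing consecutive days; the numbers are invented toy values, not the actual figures for Spain.

```r
# Toy cumulative confirmed cases for one country (invented values)
cumulative <- c(100, 150, 210, 190, 260)

# Daily counts are the differences between consecutive cumulative totals;
# the first day's count is taken to be the first total itself.
daily <- diff(c(0, cumulative))

# A downward revision of the cumulative total shows up as a negative day
negative_days <- which(daily < 0)
```

Here the fourth day comes out negative because the cumulative total drops from 210 to 190, which is the same pattern as a negative number of confirmed cases in a daily plot.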

I decided to switch my data source away from the JHU and use OWiD instead. To my great satisfaction, the OWiD data was very clean and full of pre-computed features and statistics (such as statistics on testing, the population counts, life expectancy per country…).

Plan

This project was too big for a single article, so I decided to write a 4-part series (including this introduction). The next 3 articles will go as follows:

  • We will first cover the data cleaning, which is a necessary step for each of our subsequent plots. We focus mainly on having daily and cumulative counts of deaths and confirmed cases by country. This is not exciting or insightful enough to warrant a Medium article, so I published this step on RPubs for those interested (read here). You can also explore my struggles to clean and wrangle the JHU data in this other RPubs notebook. For learning purposes, my struggles in the 2nd notebook are much more useful. Data cleaning is an extremely important step that is often overlooked because it is not flashy. Despite this, all data scientists should become familiar with it.
  • Next, we will make each of our plots using ggplot2, plotly and leaflet and learn how to make and deploy a Shiny RMarkdown document. We will also see how to store plotly plots as regular files to speed up the loading of our Shiny document.
  • Finally, we will write automation tools (bash scripts and GitHub Actions) so that our report gets updated daily without us having to put in any extra work.
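As a taste of the data-cleaning step in the first bullet, the sketch below derives cumulative counts per country from daily counts using base R only. The data frame and its column names (country, date, new_cases, total_cases) are hypothetical and the values are invented; the notebooks mentioned above may structure things differently.

```r
# Toy daily counts for two countries (invented values, hypothetical columns)
daily <- data.frame(
  country   = c("A", "A", "A", "B", "B", "B"),
  date      = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                        "2020-03-01", "2020-03-02", "2020-03-03")),
  new_cases = c(1, 4, 2, 0, 3, 5)
)

# Sort by country and date so the running sum follows time order
daily <- daily[order(daily$country, daily$date), ]

# ave() applies cumsum() separately within each country
daily$total_cases <- ave(daily$new_cases, daily$country, FUN = cumsum)
```

Sorting before taking the running sum matters: cumsum() simply follows row order, so unsorted dates would give cumulative totals that do not match the calendar.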

See you soon!
