The Covid-19 global crisis has had a strong impact on the daily life of most people on Earth. It has also created an unprecedented situation in which several hundred million people started analyzing plots, data points and epidemiology metrics on a daily basis.
There are always good things coming out of any situation, no matter how bad it is. In addition to great demonstrations of solidarity, the Covid-19 crisis taught many people how to look at and interpret data. And often, they realized it was not as simple as it seemed.
Data collection is not trivial
From an outsider's perspective, collecting data would seem a straightforward task. Databases have been around for decades, so surely there must be a system as simple as sending an email to update a central one. We all wish.
In practice, it isn't that easy. There are many database systems out there, many ways to store data, many ways to encode information and many ways to categorize your values. And when you want to create a central database collecting all data about, let's say, Covid-19 cases and deaths, you have to make sure all the entities you collaborate with use the same tools, software versions, definitions, category names, file formats, encodings, and so on.
Due to the extremely rare nature of a pandemic, most countries or states had no such systems in place and had to improvise something in a few weeks, generally relying on employees doing their best but who were by no means data warehouse architects.
And, often, the quickest and most natural solution seems to be Excel. It is widely used and is seen as a standard format. Different regions, states, hospitals or nursing homes then send their daily data as an Excel file (which is still fine compared to those sending PDFs) to a central entity whose unfortunate task is to compile all this data together.
![You got the message. [Via this generator]](https://towardsdatascience.com/wp-content/uploads/2020/11/1Xlzhsu-QIeFFJICkSebzjw.png)
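To make the compilation problem concrete, here is a toy sketch of what such a central entity faces (all column names, date formats and values are hypothetical): every sender encodes the same information slightly differently, so everything has to be mapped onto one schema before anything can be aggregated.

```python
import pandas as pd

# Two senders report the same thing with different column names and date formats.
region_a = pd.DataFrame({"date": ["2020-11-01"], "new_cases": [120]})
region_b = pd.DataFrame({"Datum": ["01/11/2020"], "Cases (new)": [80]})

# A manual mapping onto a single schema -- one of many such mappings to maintain.
COLUMN_MAP = {"Datum": "date", "Cases (new)": "new_cases"}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns=COLUMN_MAP)
    df["date"] = pd.to_datetime(df["date"], dayfirst=True)  # date formats differ too
    return df

combined = pd.concat([harmonize(region_a), harmonize(region_b)], ignore_index=True)
print(combined)
```

And this is the easy case: in practice the mappings themselves drift as senders change tools or staff.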
Excel can be a great tool for many tasks, but it is not a database. It notoriously messes around with data types (such a common issue that some human genes had to be renamed because of Excel) and can only handle a limited number of rows. It was recently reported that an outdated version of Excel, which could not handle more than 65,536 rows, is the reason more than 16,000 Covid-19 cases went unreported and untracked in the UK. Using inappropriate software and data collection practices had a direct, major impact on public health.
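A minimal sketch of that failure mode, with a hypothetical record count: the legacy .xls format caps a worksheet at 65,536 rows, so a pipeline that exports to it without checking row counts silently drops everything beyond the cap.

```python
# The legacy .xls worksheet limit; modern .xlsx raises it to 1,048,576 rows.
XLS_MAX_ROWS = 65_536

def rows_lost_on_xls_export(n_records: int) -> int:
    """How many records a naive .xls export would silently drop."""
    return max(0, n_records - XLS_MAX_ROWS)

# Hypothetical batch of case records exceeding the cap:
print(rows_lost_on_xls_export(81_536))  # -> 16000 records lost, never reported
```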
Definitions matter
Let's now assume we got the data collection right. That is only the beginning of the road. Getting the definitions right is another task that seems trivial at first but isn't in practice. Let's start, for example, with this simple question:
What counts as a Covid-19 death?
It seems easy, right? Something like "someone who died after being infected by the virus" would sound like an appropriate answer. In practice, things haven't been that smooth. The plot below shows the number of excess deaths in the US in 2020, presumably attributable to Covid-19. In practice, around a third of those excess deaths are not attributed to Covid-19.
![Attributed and unattributed excess deaths in US. Plot extracted from Weinberger et al. [link]](https://towardsdatascience.com/wp-content/uploads/2020/11/1tYl3wkWtE6moF3M8jAzvMQ.png)
Most of these deaths are likely linked to Covid-19 (although potentially indirectly); otherwise, we would have a major issue with another, unidentified cause of excess mortality out there.
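The arithmetic behind the plot is simple enough to sketch with made-up numbers: excess deaths are observed deaths minus an expected baseline, and the attribution gap is whatever the official Covid-19 count fails to cover.

```python
# All numbers are illustrative, not taken from the study above.
observed_deaths = 260_000    # total deaths observed over some period
expected_baseline = 200_000  # e.g. the average of the same weeks in prior years
attributed_covid = 40_000    # deaths officially coded as Covid-19

excess = observed_deaths - expected_baseline   # 60,000 excess deaths
unattributed = excess - attributed_covid       # 20,000 with no official Covid-19 label
print(f"unattributed share of excess: {unattributed / excess:.0%}")  # -> 33%
```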
The reasons for this discrepancy are diverse and depend on the country or state. In some cases, deaths at nursing homes or outside hospitals were not counted, or were communicated only days later. Deaths involving non-pneumonia symptoms might have gone unreported. Social isolation also sometimes makes it difficult to identify the cause of a death, due to the lack of close witnesses. And that's without mentioning politicians who might be tempted to under-report deaths.
This example shows us that getting simple definitions right is not easy and requires more planning and verification than expected. Also, looking at the data from different perspectives (e.g. not only reported Covid-19 deaths, but global deaths too) helps build a clearer picture of reality.
Choosing the right metric isn’t straightforward
Getting the definition right is still not the end goal. You also have to make sure you are looking at the right metric to assess a situation and make the right decisions.
During the first wave of the pandemic, people started to closely track multiple indicators to better grasp the gravity of the situation. Some looked at the number of active or cumulative cases, the number of deaths or, more rarely, the occupancy rate of Intensive Care Units (ICUs).
All those metrics show a different facet of the pandemic, but they all rely on different assumptions and methodological biases we have to consider, for example:
- How do we count deaths? (see above)
- What is a positive case? (symptoms only? is a test result needed? if so, which test?)
To better understand how to interpret a metric, we have to understand how it is generated. The example below shows the evolution of Covid-19 cases (red) and deaths (black) in Spain. If you look at the red curve without context, you will think the situation is even worse now than it was in March and April, when Spain was already one of the most impacted countries.
![Cumulative Number of Confirmed Covid-19 Cases in Spain. [Source]](https://towardsdatascience.com/wp-content/uploads/2020/11/1OjO9d7wAJ-innRjmsdtlPQ.png)
But looking in more detail, the reality is different.
First, we can see that most deaths accumulated early and that, during the second wave of contagions, far fewer people are dying (but some still are).
![Cumulative Number of Covid-19 deaths in Spain. The shape is quite different from the Cumulative Number of Cases (see above). [Source]](https://towardsdatascience.com/wp-content/uploads/2020/11/1AJgpGlnlIsGaO5CBXkTqtA.png)
Second, the curve does not reflect how detection methods have evolved. At the beginning of the pandemic, there were few resources to detect cases and processes were not yet properly defined. Now, PCR tests (the most sensitive ones) are performed massively on a daily basis. Even asymptomatic people are tested (around 40% of the positive cases in Spain during the second wave), which was not the case in March and April.
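One simple way to put the two waves on a more comparable footing is to normalize cases by testing volume. A sketch with hypothetical numbers, not Spain's actual figures:

```python
# Hypothetical daily figures; positivity (cases per test) corrects for the
# much larger testing volume in the second wave.
waves = {
    "first wave (spring)":  {"tests": 20_000,  "positives": 8_000},
    "second wave (autumn)": {"tests": 200_000, "positives": 20_000},
}
for name, w in waves.items():
    positivity = w["positives"] / w["tests"]
    print(f"{name}: {w['positives']:>6} cases, positivity {positivity:.0%}")
# The second wave reports 2.5x more cases, yet positivity falls from 40% to 10%.
```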
This might be one of the major lessons from this pandemic. Many people no longer look blindly at a metric: they question how it was generated and instinctively consult several indicators to understand the current situation.
Sampling matters too
The Covid-19 outbreak generated a deluge of data anyone could access, interpret or debate. It then becomes tempting to compare all those data points to form our own understanding of the situation and, who knows, outsmart the experts.
But, in practice, those comparisons are often tricky, largely due to differences in sampling methods. Sampling is one of the first concepts taught in statistics classes because it plays a central role in how we can or cannot interpret the data afterwards, and Covid-19 data has given us many examples of comparisons or analyses made difficult by how the data was generated.
Example 1: We cannot compare testing data across time periods
As the pandemic progressed, so did countries' capacity to mass-test the population. Back in April, very few people were tested, and they were mostly people with symptoms compatible with Covid-19. In recent weeks, the number of tests performed has been 5 to 10 times higher than it was initially.
![Number of daily tests per thousand people in France, UK and Spain - [Source]](https://towardsdatascience.com/wp-content/uploads/2020/11/1D-1JeOZ6CwY5vMw5ge4sGw.png)
This means the criteria for testing someone have changed, as we can now afford to test more people. While we initially tested mostly people highly likely to be infected, it is now common to test people simply because they were in contact with infected patients, or purely at random for mass screening campaigns.
This means we cannot directly compare data like the percentage of positive tests or the number of asymptomatic cases, as the sampling procedure has changed drastically over time. It is normal for values to differ over time because 1) the prevalence of the virus evolves and 2) we sample different populations chosen with different criteria. After all, the aim was not to design a neat experimental process but to offer the best possible response to a pandemic.
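A toy simulation makes the point (all probabilities below are assumed, purely for illustration): with a constant true prevalence, merely switching the sampling rule from "test symptomatic people only" to "test at random" changes the positivity rate dramatically.

```python
import random

random.seed(0)
TRUE_PREVALENCE = 0.05          # assumed: 5% of the population is infected
P_SYMPTOMS_IF_INFECTED = 0.60   # assumed symptom rates
P_SYMPTOMS_IF_HEALTHY = 0.05

def positivity_rate(symptomatic_only: bool, n: int = 100_000) -> float:
    """Positivity rate observed under a given testing policy."""
    positives = tested = 0
    for _ in range(n):
        infected = random.random() < TRUE_PREVALENCE
        p_sym = P_SYMPTOMS_IF_INFECTED if infected else P_SYMPTOMS_IF_HEALTHY
        symptomatic = random.random() < p_sym
        if symptomatic_only and not symptomatic:
            continue            # early-pandemic policy: only symptomatic people tested
        tested += 1
        positives += infected
    return positives / tested

print(f"symptomatic-only testing: {positivity_rate(True):.1%}")   # roughly 39%
print(f"random mass testing:      {positivity_rate(False):.1%}")  # roughly 5%
```

Same virus, same prevalence: the huge gap in positivity comes entirely from who gets sampled.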
Example 2: We cannot (directly) compare mortality risk between countries
As the outbreak became global, data from different countries was compared in order to better understand which strategies were working and which were not. And it quickly became obvious that some metrics, such as mortality risk, differed widely between countries, with Spain, Italy or Sweden displaying much higher rates.
![Evolution of Mortality Risk in different countries - [source]](https://towardsdatascience.com/wp-content/uploads/2020/11/1tn_8U-xI2sLOzbVHvWfItg.png)
However, it also became clear we could not draw firm conclusions from these data, as too many parameters varied from one country to another, such as:
- Population age structure (e.g. Italy has one of the highest average population ages in the world, and Covid-19 carries a higher mortality risk for older people)
- Social interaction levels
- Genetic diversity within the country (some genetic variants make people more or less susceptible to diseases, Covid-19 included)
- Medical infrastructures (e.g. number of ICU beds)
- Response to the pandemic (e.g. Italy and Sweden both have high mortality rates even though they took very different approaches to fighting the pandemic)
- Quality of data (as seen before, not all countries collect or define data in the same manner)
While you can correct for some of these factors (e.g. population age structure) using appropriate statistical modelling, it becomes more complex when different data collection methods or population behaviors are involved.
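For age structure specifically, the classic correction is direct standardization. A minimal sketch with hypothetical numbers: two countries with identical age-specific mortality but different demographics show different crude rates, and re-weighting by a common reference population removes the gap.

```python
# All numbers are hypothetical. Both countries share the same age-specific
# mortality; only their age structures differ.
rates_per_1000 = {"0-64": 0.5, "65+": 30.0}

age_structure = {
    "country_A": {"0-64": 0.70, "65+": 0.30},   # older population
    "country_B": {"0-64": 0.85, "65+": 0.15},   # younger population
}
reference = {"0-64": 0.80, "65+": 0.20}         # shared standard population

for country, shares in age_structure.items():
    crude = sum(rates_per_1000[age] * share for age, share in shares.items())
    standardized = sum(rates_per_1000[age] * share for age, share in reference.items())
    print(f"{country}: crude {crude:.2f} vs standardized {standardized:.2f} per 1,000")
# country_A crude ~9.35, country_B crude ~4.92; both standardize to 6.40
```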
In general, people became more aware that there are many cases where "we just can't compare", and that's probably a good thing, as this is a key skill when approaching any data.
Let’s not forget
Thankfully, as things get better, we will soon no longer have to worry about daily numbers of new Covid-19 cases or similar data points. But there will always be data to interpret to better understand our world and to face other challenges, such as climate change. So let's not forget what we learned from this crisis, and let's always ask ourselves when interpreting data:
- How was the data obtained (samples) and collected?
- What exactly does each definition mean?
- Which metric would make more sense to look at?
And hopefully, this will help us all make better decisions.