Thoughts on Probability and Statistics | The 2020 Novel Coronavirus Outbreak

Why The Early Reported Data On 2019-nCoV Is Misleading

The uncertainty of an estimate doesn’t always go both ways. For 2019-nCoV, one-sided uncertainty certainly indicates underestimation.

Andy Chen
Towards Data Science
8 min read · Jan 25, 2020

The Wuhan coronavirus (2019-nCoV) is currently spreading throughout China and the world. As of Jan 25, 12am EST, there are 1,354 confirmed cases and 41 confirmed deaths worldwide. While some question whether these statistics are accurate, a more basic point is being overlooked: these statistics are unlike most statistics people encounter, because of the way the data is collected.

Let me demonstrate the difference with two examples.

(1) Statistics with Two-Sided, Symmetric Uncertainty

Consider an election, which works very differently from a disease outbreak. Suppose you want to know what proportion of voters support a certain presidential candidate in a national election. You cannot ask the entire country, so you ask a sample of people instead. From the results you calculate a statistic (here, the proportion of people in the sample who support the candidate) and report it as a point estimate. This is far easier than asking the entire population, but the estimate may not match the true value, because your sample is not the same as the wider population. To account for this uncertainty, you incorporate other information, such as the sample size, to produce a margin of error. This turns the point estimate into an interval estimate of the true proportion of supporters.

For example, take the latest Emerson Poll for the 2020 Democratic Primary (24 Jan 2020). It states that Andrew Yang has the support of 8% of voters. However, this figure comes not from the voting-age population of approximately 250 million but from a sample of only 497 people. The poll accounts for this uncertainty with a margin of error of ±4.1 percentage points, so the pollsters estimate that Andrew Yang has the support of between 3.9% and 12.1% of the population.

8% is not a bad estimate. It is in the right ballpark, and it is sufficiently informative for ordinary people.
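The interval arithmetic can be checked with a short sketch. The helper names are hypothetical, and the second function uses the textbook simple-random-sampling formula, which is an assumption: pollsters may compute their margin differently, so it gives roughly 4.4 points for a sample of 497 rather than exactly the reported 4.1.

```python
import math


def interval_from_margin(point_pct, margin_pct):
    """Return (low, high) in percentage points for a symmetric margin of error."""
    return point_pct - margin_pct, point_pct + margin_pct


def worst_case_margin(n, z=1.96):
    """Textbook 95% margin of error for a proportion, evaluated at p = 0.5.

    Assumes simple random sampling; real polls often use other methods.
    """
    return z * math.sqrt(0.25 / n)


# Figures quoted in the article: 8% support, +/-4.1 points, 497 respondents.
low, high = interval_from_margin(8.0, 4.1)
print(f"Interval: {low:.1f}% to {high:.1f}%")
print(f"Textbook worst-case margin for n=497: {100 * worst_case_margin(497):.1f} points")
```

The key property is that the interval extends the same distance on both sides of the point estimate.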

(2) Statistics with One-Sided Uncertainty, Potentially (but not always) Extreme

Now consider the outbreak of Novel Coronavirus (2019-nCoV) in Wuhan. As of Jan 25, 12am EST, 1,354 cases of infection have been confirmed.

You might trust this figure and conclude that 1,354 people have been infected. This would be wildly wrong, not because of bad statistics or bad reporting, but because of an asymmetry in how infections are detected.

The probability of someone being labelled ‘confirmed’ without actually having the coronavirus is near 0. Therefore, 1,354 is effectively the minimum number of infected people.

Now consider the other side. There are people who are not labelled ‘confirmed’ but do have the coronavirus. The National Health Commission counts:

  • 1,965 suspected cases (24 Jan).
  • 15,197 close contacts with infected patients, with 13,967 of these people under quarantine (24 Jan).

A decent proportion of these cases are going to turn out to be coronavirus. Taking these numbers as they stand, suppose that:

  • 90% of suspected cases turn out to be coronavirus.
  • 20% of the 13,967 people being medically monitored under quarantine turn out to have caught it as well.
  • These are added to 1,287, the number of confirmed infections at that time.

This gives 5,849 true cases of infection. Of course, this is a wildly speculative figure and likely to be inaccurate.
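The arithmetic behind this figure can be reproduced in a few lines. The variable names are illustrative, and the sketch assumes the 20% rate applies to the 13,967 people under quarantine (rather than all 15,197 close contacts), since that is the reading that reproduces the quoted total.

```python
# Figures quoted in the article (National Health Commission, 24 Jan).
confirmed = 1_287     # confirmed infections at that time
suspected = 1_965     # suspected cases
quarantined = 13_967  # close contacts under quarantine

# Speculative conversion rates assumed in the article.
p_suspected = 0.90
p_quarantined = 0.20

# Confirmed cases count in full; the other groups count at their assumed rates.
estimate = confirmed + p_suspected * suspected + p_quarantined * quarantined
print(round(estimate))
```

Changing either rate shifts the total, but the result can never fall below the confirmed count, because both added terms are non-negative.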

There are still additional cases that are not counted in this estimate:

  • Cases which are suspected or being monitored may end up infecting more people. These newly infected people may go on to infect more people. This is how diseases work — they spread across social networks, which are very difficult to keep track of.
  • It is very difficult to detect asymptomatic carriers (people who are contagious but do not show symptoms) and very difficult to account for people who have not reported their symptoms to the authorities.

So even if the assumed 90% and 20% are correct, the estimate of 5,849 is still only a minimum for the number of true infections.

While it is nearly impossible to understand the severity of the problem from official figures, it is possible to make estimates based on other (not necessarily numerical) information.

  • Without human-to-human transmission, an outbreak generally stops after intervention. But in this outbreak, human-to-human transmission has been confirmed by the National Health Commission, which greatly increases the probability of further undetected infections. R0, the basic reproduction number (the expected number of additional infections caused by a single contagious person), has so far been estimated at around 1.4–2.5.
  • There are videos of hospitals where waiting lounges are packed and waiting times are reported to run to hours.
  • Inferences can also be made from migration patterns. For example, it would be unlikely for 100 people to be infected in Wuhan and 200 people to be infected across North America, since people from Wuhan do not travel to North America that often. Researchers at Imperial College London used this sort of reasoning to estimate a total of 4,000 symptomatic cases of 2019-nCoV by 18 Jan 2020, when there were only 121 confirmed cases. They based the estimate on the number of outbound travellers from Wuhan International Airport and the 7 confirmed cases overseas.
  • There are numerous people who have already passed away with an unconfirmed cause of death. These people are unlikely to be tested for coronavirus, since testing them would use up scarce resources, given the length of hospital queues.
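The travel-based reasoning in the list above can be sketched as a back-of-the-envelope calculation. This is one simple way to turn overseas detections into a total, and it may not match the researchers' exact method. Only the 7 overseas cases come from the article; the traveller volume, catchment population and detection window are illustrative assumptions, and the function name is hypothetical.

```python
def estimate_total_cases(cases_abroad, daily_travellers,
                         catchment_population, detection_window_days):
    """Scale cases detected abroad up to an estimated total in the source city.

    The chance that any one case is detected abroad is approximated by the
    daily probability of travelling abroad times the number of days a case
    can travel before being detected.
    """
    p_detected_abroad = (daily_travellers / catchment_population
                         * detection_window_days)
    return cases_abroad / p_detected_abroad


# 7 overseas cases is from the article; the other inputs are assumptions.
total = estimate_total_cases(
    cases_abroad=7,
    daily_travellers=3_300,           # assumed outbound international travellers per day
    catchment_population=19_000_000,  # assumed population served by the airport
    detection_window_days=10,         # assumed days between infection and detection
)
print(f"Estimated total cases: {total:,.0f}")
```

With these inputs the estimate lands on the order of 4,000, far above the confirmed count at the time. The result is very sensitive to the assumed inputs, which is why such figures carry wide uncertainty.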

So the reported figure is not correct, and the estimated figure is also not correct. The true figure is likely to be higher than both.

Edit (31/1/2020): To clarify, the reported number is a minimum. The unknown true number can be modeled with a distribution that depends on your current information, and this distribution updates as you learn more. For example, if you assume no human-to-human transmission and then later find out this assumption is wrong, it changes your idea of where the true figure could be, and the distribution widens to include larger values. If you have a number of suspected cases, you can guess the true number of cases, but if you later find out that the hospital had a limited number of testing kits and prioritized the most severe cases, then you might suspect that the remaining untested cases will turn out to be disproportionately negative, and revise your guess to a lower figure.

The Difference Between The Two Types of Statistics

(1) The first type of statistic: symmetric error

Students are often taught that statistics are estimates of population parameters.

In one type of scenario, the statistic does approximately match the population parameter, with only a small two-sided error in the estimate.

Population mean is estimated by the sample mean.
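Written out, this relationship might look like the following. The symbols are illustrative rather than taken from the article, and the zero-mean error assumes an unbiased sample.

```latex
\bar{x} = \mu + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0
```

Here $\mu$ is the population mean, $\bar{x}$ is the sample mean, and $\varepsilon$ is a sampling error that can fall on either side of zero.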

In this case, if the error is small enough, a good statistic can be treated like the population parameter.

(2) The second type of statistic: asymmetric error

But in other types of scenarios, the statistic does not, and is not expected to, match the population parameter. The statistic serves not as a point estimate, but as a lower or upper bound on the parameter. In the coronavirus case, the statistic is a lower bound.

Number of true cases is equal to confirmed cases plus undetected cases, which is a positive error term.
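As a formula, with symbols chosen here for illustration rather than taken from the article:

```latex
N_{\text{true}} = N_{\text{confirmed}} + \varepsilon, \qquad \varepsilon \ge 0
```

The error term $\varepsilon$ stands for the undetected cases. Because it cannot be negative, the confirmed count can only understate the true count.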

The error is asymmetric and can be extremely large. It has its own distribution, which can be estimated from existing information. And as the authorities gather more information, resolve more uncertainty, and update the official figures, the reported number only grows, in a single direction.

In this case, the statistic cannot be treated like the true population parameter.

Top: Symmetric Error in Parameter Estimate. Bottom: Asymmetric Error.
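The contrast between the two error types can also be sketched with a small simulation. Everything here is illustrative: the population mean, the spread, and especially the exponential shape and 4,000 mean of the undetected cases are assumptions made for the sketch, not estimates. Only the 1,354 confirmed cases come from the article.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 50.0   # hypothetical population mean
CONFIRMED = 1_354  # confirmed cases quoted in the article

# Symmetric error: sample means scatter on both sides of the true mean.
sample_means = [
    statistics.fmean([random.gauss(TRUE_MEAN, 10.0) for _ in range(100)])
    for _ in range(2_000)
]

# Asymmetric error: true cases = confirmed + undetected, with undetected >= 0.
# The exponential distribution is an arbitrary stand-in for the error term.
true_cases = [CONFIRMED + random.expovariate(1 / 4_000) for _ in range(2_000)]

share_below = sum(m < TRUE_MEAN for m in sample_means) / len(sample_means)
print(f"Sample means below the true mean: {share_below:.0%}")
print(f"Smallest simulated true count: {min(true_cases):.0f}")
```

Roughly half of the sample means fall below the true mean, while none of the simulated true counts falls below the confirmed count.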

Origin of the Asymmetric Error: Problems in Detection

Asymmetric statistics can show up anytime there is a problem in detection. Two examples:

  • Death counts: If there is a disaster, authorities will publish the number of confirmed deaths, instead of reporting a large estimate without confirmation (in order to avoid unnecessary grief). People counted as dead do not turn out to be alive, but there will be deaths that are still uncounted.
  • Social research on sensitive, personal issues: If doing X is embarrassing, some people who do X will not admit to it, but people who don’t do X will not claim that they do. You have many false negatives but no false positives.

The Effect on Behavior and Real-World Decision Making

When people say that the official figure is an underestimate, they forget this: the official figure isn’t an estimate of the true number of cases. It isn’t even trying to estimate the true number of cases, because any such estimate would be wrong anyway. It only reports the number of confirmed cases. The numbers of suspected and quarantined cases are also reported, and these are the figures to focus on to get a better idea of the situation. Even though these figures could still be misleading, people do not seem to be misled by them. People act fearfully and take precautions that seem unnecessary, but they do so because they know intuitively that the reported numbers are not the true numbers. This is why face masks are selling out in China right now: people know the true figures are worse and are preparing for it.

It is important to be forward-thinking, literally. Actions must be calibrated not to the reported figure, and possibly not even to the estimated true level, however accurate the estimate. Given the uncertainty about the true figure, it may be necessary to calibrate action to a level above the estimated true level: a “safe” level. By doing this, we create a margin of safety for the whole system.

Precautions in China

Special hospital in Wuhan with capacity for 1,000.

Medical teams are being sent from other provinces to aid Wuhan hospitals.

Outbound transportation has been indefinitely suspended to slow down the spread of coronavirus.

Mathematics, statistics, science, data. Currently working in a real environment.