5 Wrong Ways to Do Covid-19 Data Smoothing

Much Covid-19 data analysis is based on flawed smoothing techniques

Steve McConnell
Towards Data Science
7 min read · Oct 1, 2020


You might think that raw data is more accurate than smoothed data. In the case of the Covid-19 pandemic, however, smoothed data reduces reporting anomalies and represents the timing of events more accurately than the raw data does, but only if the smoothing is done correctly.

Wrong Way #1: Not Using Smoothed Data to Expose Trends

Raw state-level data is noisy, and it’s difficult to see trends in raw data. The example below shows the current raw data report from Hawaii. The light blue lines represent positive tests, and the red lines represent deaths.

Are tests going up or down? It’s virtually impossible to tell from this depiction of the data.

In contrast, what does the figure below tell you about whether positive tests are currently up or down? Visually, it’s clear that positive tests have been flat to slightly increasing for about a week.

Wrong Way #2: Not Using Smoothed Data to Reduce the Effects of States’ Data Corrections

Most states have made corrections to their data over the course of the pandemic, and, in many cases, states dump weeks' or months' worth of corrections into the data pool all on one day.

New York’s data (the figure below) includes a correction they made to death data in early May (the tall red line). If you take this data literally, 1000 people died in one day. But 1000 people didn’t really die in one day; New York just reported a correction of that size in one day. This sort of spike significantly undermines analysis for the period that includes the spike.

The smoothed data (below) is still affected by this correction — you can see the hump in May from the 1000-death correction — but the smoothed data is less affected by the spike.
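
To see the arithmetic of why, consider a minimal sketch in Python (hypothetical numbers, not New York's actual data): a one-day correction of 1,000 deaths raises a centered 7-day average by only about 143 deaths per day across the window, rather than appearing as a single 1,000-death spike.

```python
import pandas as pd

# Illustrative only: a steady trend of 200 deaths/day, plus a one-day
# correction of 1,000 extra deaths dumped into the report on day 10.
deaths = pd.Series([200.0] * 21)
deaths[10] += 1000

# Centered 7-day moving average.
smoothed = deaths.rolling(window=7, center=True).mean()

print(deaths[10])    # 1200.0 -- the full spike lands on a single day
print(smoothed[10])  # ~342.9 -- the spike is spread across 7 days
```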

Wrong Way #3: Using Smoothing Periods Other Than 7 Days

Many states do not report results daily. Many states show weekly cycles of under-reporting on certain days and over-reporting on other days. For example, my state (Washington) underreports on Sundays and then overreports in the days that follow. You can see the weekly rhythm of underreporting and overreporting for Washington in the figure below.

Smoothing periods shorter than 7 days or longer than 7 days risk disproportionately weighting the days on which results are under-reported or over-reported. Here are the typical deviations from the trend by weekday at the national level:

As one example of why a 7-day period is needed, suppose smoothing of deaths data were done on a 3-day basis for the period Saturday through Monday. In that case, the 3-day period would be 71% of trend, because of typical underreporting on Sunday and Monday.

Similarly, if smoothing of deaths was performed for the 3 days of Tuesday through Thursday, that 3-day period would be 124% of trend. A full 7 days needs to be included to obtain an accurate picture of the data for the week.
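
The arithmetic behind those percentages can be sketched with weekday deviation factors. The factors below are hypothetical, invented only to roughly match the percentages above; they are not the actual national deviations from the figure.

```python
# Hypothetical weekday factors (fraction of trend), invented to roughly
# match the percentages in the text; not the actual national deviations.
weekday_factor = {
    "Mon": 0.75, "Tue": 1.20, "Wed": 1.25, "Thu": 1.27,
    "Fri": 1.13, "Sat": 0.85, "Sun": 0.55,
}

def window_vs_trend(days):
    """Average factor over a smoothing window covering the given weekdays."""
    return sum(weekday_factor[d] for d in days) / len(days)

print(window_vs_trend(["Sat", "Sun", "Mon"]))  # ~0.72 -- about 71% of trend
print(window_vs_trend(["Tue", "Wed", "Thu"]))  # 1.24 -- 124% of trend
print(window_vs_trend(list(weekday_factor)))   # 1.00 -- a full week is unbiased
```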

Visually, this shows up as the smoothed data not looking very smooth. Here’s recent US data smoothed on a 7-day basis, which looks pretty smooth:

Here’s the same data smoothed on a 5-day basis, which doesn’t look very smooth:

You might assume that the 7-day smoothing is smoother than the 5-day smoothing simply because it's a longer period. That is not correct. Periods longer than 7 days have the same problem as periods shorter than 7 days: they weight some days of the week more heavily than others, and therefore reduce accuracy. Here's the same data as before with 9-day smoothing:

The 9-day smoothing is smoother than the 5-day smoothing but rougher than the 7-day smoothing. The issue is the weekly cycle, not the sheer number of days.

Smoothing periods that are multiples of 7 days do not have this problem. Smoothing periods of 7, 14, and 21 days can all be accurate.
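
One way to convince yourself of this is to smooth a synthetic series that has a flat trend and a pure weekly reporting cycle: any window that is a multiple of 7 days cancels the cycle exactly, while 5- and 9-day windows leave a residual wobble. A sketch, assuming pandas and made-up cycle factors:

```python
import numpy as np
import pandas as pd

# Synthetic series: a flat trend of 100/day times a made-up weekly
# reporting cycle (illustrative only), repeated for 6 weeks.
cycle = np.tile([0.55, 0.75, 1.20, 1.25, 1.27, 1.13, 0.85], 6)
raw = pd.Series(100 * cycle)

smooth7 = raw.rolling(7, center=True).mean()  # cycle cancels: flat line
smooth5 = raw.rolling(5, center=True).mean()  # residual weekly wobble
smooth9 = raw.rolling(9, center=True).mean()  # less wobble than 5-day, more than 7-day

# The 7-day average is essentially constant; the others still oscillate.
print(smooth7.std(), smooth5.std(), smooth9.std())
```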

Wrong Way #4: Using a Lagging Smoothing Period

Some smoothing techniques use a 7-day smoothing period, but they calculate the 7-day average on a lagging basis. In other words, they calculate the value for day number n by averaging days n, n-1, n-2, n-3, n-4, n-5, and n-6.

Smoothing on a backward-looking basis means the average is centered 3 days earlier than the day it is attached to. The data that purportedly shows day n is actually centered on day n-3.

It’s easy to spot this phenomenon on graphs that show both raw and smoothed data, such as this one:

If you study the graph, you can see that the smoothed line lags the raw data. The peaks and valleys are offset by 3–4 days. If the smoothing is done properly, the smoothed line will sit right on top of the raw data, as shown here:

7-day smoothing needs to be based on the 3 days prior to the date of record, the date of record itself, and the 3 days after the date of record, i.e., days n-3, n-2, n-1, n, n+1, n+2, and n+3.
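
In pandas terms (a sketch; `daily` here is a stand-in for a series of raw daily counts), the difference between the two approaches is just the `center` flag:

```python
import pandas as pd

# Stand-in for a series of raw daily counts indexed by date.
daily = pd.Series(range(1, 29),
                  index=pd.date_range("2020-09-01", periods=28))

# Lagging: day n averages days n-6 .. n, so the value is really
# centered on day n-3 and the smoothed curve trails the raw data.
lagging = daily.rolling(window=7).mean()

# Centered: day n averages days n-3 .. n+3, so peaks and valleys
# line up with the raw data.
centered = daily.rolling(window=7, center=True).mean()
```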

Wrong Way #5: Not Addressing the Most Recent 3 Days Consciously

If the date of record is today, that means that days n+1, n+2, and n+3 haven’t happened yet. We’re missing 3 days of look-ahead data.

The same issue applies to yesterday, which is missing 2 days of look-ahead data, and the day before yesterday, which is missing 1 day of look-ahead data.

So we need a plan for smoothing the most recent 3 days, for which only partial smoothing data is available. A few options are available:

  • Project today’s data 3 days into the future, and smooth based on the projections.
  • Switch to a backward-looking basis as you run out of forward-looking days. Today uses the most recent 7 days. Yesterday uses today plus the most recent 6 days. The day before yesterday uses today, yesterday, and the most recent 5 days. Days prior to that use normal 7-day smoothing.
  • Smooth based on partial periods rather than full 7-day periods for the most recent days (sketched in code below). Today is smoothed based on today plus the preceding 3 days, for a total of 4 days. Yesterday is smoothed based on 5 days. The day before yesterday is smoothed based on 6 days. All the days before that can be smoothed using the normal 7 days.
  • Don’t provide smoothed data at all for the most recent 3 days.

The last approach is the most correct, but it limits the ability to make use of the most recent days.

The first three approaches have the potential to introduce error into the smoothing for the most recent days. However, those errors are temporary, and they will be corrected over the next 3 days as full data becomes available.

The failure mode in this area is not consciously choosing the approach that’s best for the situation. Have a plan, and think through the implications of shifting to projecting forward, looking backward, or using incomplete data as you run out of look-ahead days.
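
As one example, the partial-period option is a one-line change in pandas: a centered window with `min_periods` set lets the most recent 3 days fall back to 4-, 5-, and 6-day averages instead of producing no value. A sketch, using stand-in data:

```python
import pandas as pd

# Stand-in for a series of raw daily counts indexed by date.
daily = pd.Series(range(1, 29),
                  index=pd.date_range("2020-09-01", periods=28))

# Centered 7-day smoothing. min_periods=4 means the most recent day is
# averaged over 4 days (n-3 .. n), yesterday over 5 days, and so on,
# instead of returning NaN. Note it applies the same partial-window
# treatment to the first 3 days of the series as well.
smoothed = daily.rolling(window=7, center=True, min_periods=4).mean()

print(smoothed.tail(3))  # the last 3 values come from partial windows
```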

Summary

Smoothing done well enhances the accuracy and usability of Covid-19 data. Smoothing done poorly introduces error into the data.

Incorrect smoothing techniques can be a blind spot in Covid-19 data analysis. Fortunately, this particular blind spot is an easy one to correct.

More Details on the Covid-19 Information Website

I lead the team that contributes CovidComplete forecasts to the CDC's Ensemble model. For updates to these graphs, more graphs, forecasts at the US and state levels, and forecast evaluations, check out my Covid-19 Information website.

My Background

For the past 20 years, I have focused on understanding the data analytics of software development, including quality, productivity, and estimation. The techniques I've learned from working with noisy data, bad data, uncertainty, and forecasting all apply to Covid-19.


Author of Code Complete, Dog Walker, Motorcyclist, Cinephile, DIYer, Rotarian. See stevemcconnell.com.