
The problem
When looking at time-series data, decisions can be **** influenced by random, short-term fluctuations (price of a cryptocurrency, number of Covid-19 cases reported).
This is why using a Moving Average (also called Running Average or Rolling Average) helps mostly to smooth out short-term fluctuations and highlight longer-term trends.
For example, Covid-19 test-stations and laboratories may report cases in bulks through the week (or not report them when they are closed on weekends), cryptocurrencies stock market prices can rapidly oscillate between values between days.
To illustrate, you may have seen on Google Search the following graph when looking at Covid 19 new cases:

You will notice in the legend the light blue line showing a 7-day average. This is exactly what we will learn to do in this article π
You can see it below displayed in Data Studio:

The solution
In this article, we will be computing a 7-day moving average based on the Google Covid-19 Open data.
The choice of 7 days, in this case, is important (because we could compute a moving average for 14 days, 30 days, etc…).
Here, we pick 7 days because it is quite a short-term period and we want to keep a weekly view (eg. Monday – Sunday) that would represent better the pace of Corona test stations or test centers (reporting maybe by end of the week, or at a specific interval of a week).
But for a different use case, we might pick a different time frame. Our example is about 7 days period but it could be computed on a 12 weeks period. Depending on the case, the frame can also be days, weeks, quarters, or years.
Additionally, you could apply a moving average to any numeric values like revenue, a number of users, etc…)
Our solution will be using analytic functions. They are functions that allow us to compute or get a value over a group of rows and return a single result for each row. It’s like traveling or browsing over a single table without performing joins or grouping over dimensions.
To know more about aggregate analytic functions, you can have a look here ππΌ
For this experiment, we are using Covid-19 Public data from the public Bigquery data set.
You can get more information and access it by following this link ππΌ
To find the full script which we will explain step by step, you can directly go to step #3, and run it in BigQuery.
Step #1: Transforming and Grouping
First, we would like to get our specific variable (new_confirmed
) aggregated by the desired time period (in our case on a daily basis).
The above query returns a table with the new confirmed cases every day.

By using the "EXPLORE Data" button between the query editor and the result table, we can display nicely on Data Studio the daily new confirmed cases as follow:

That’s only the first step, but it’s a good start, hang on π
Step #2: How to compute our moving average?
We previously computed the number of daily new confirmed cases, where the output is a table for each row a day and the number of new cases. We named the column new_cases
.
We will use an aggregate analytic function to get our moving average. This function will look like this:
AVG(new_cases) OVER(ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW)
We input that we want an average of new cases _AVG(newcases) and that we would like that this average includes the current row value CURRENT ROW and the preceding 6 rows (and that for each row of our table).
You would then have the following output:

Let’s make the calculation manually, let’s start from 2022β02β06, the 7-day average would return:
(155439 + 214542 + 241049 + 716294 + 831347 + 970542 + 929533)/7
This gives 579820,85 that we round to the value you see in the seven_days_ma
column for 2022β02β06.
But, there is a subtility that we need to address. You could also write it:
AVG(new_cases) OVER(ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
So what is the difference between using RANGE or ROWS?
The ROWS will look at how many rows before the current one to include them into our average. So, in our case, the function will look at the current row and then look at the previous "physical" 6 rows and compute an average. But what if there are missing dates?
That’s where RANGE comes in handy, if the table was missing dates for some reason, then ROWS would go back more than 6 days. RANGE will produce a window based on the date values themselves.
As an example with missing dates:

You can see here, the ma_rows
is computing a 7-days moving average using ROWS and ma_range
is computing our moving average using RANGE, they are not computing exactly the same value because of the missing date (2022β02β03).
In general, using a RANGE is recommended as it’s not based on physical rows but logical values. The only difference is that the date field needs to be converted using the UNIX_DATE() function.
Step #3: Applying the formula in our query
Now we can bring it all together, this is the full script.
The first WITH clause computes all the new confirmed cases per date. Under the " – Main Query" comment, we apply our moving average function and we also ROUND it in another query on top (because we can’t use a ROUND function using an OVER() clause).
This gives us the following output:

That’s how we get our 7-days moving average! π» It is then ready to be displayed on Data studio
And now, it’s time to shine β¨
We can then use Datastudio to display our results, which gives us a nice overview of the covid-19 cases and the 7 days moving average line in light blue.

This method can be used on any kind of time series and variable. The interesting part is the use of aggregated analytic functions. They make this exercise a lot easier even if they are not that easy to understand at first.
I hope you’ll like this method and the article will help you save some time. But more importantly, I hope this will help you get more at ease using Sql and BigQuery! It’s a good use case to practice your skills!
Time to dive into some data! π€