The world’s leading publication for data science, AI, and ML professionals.

How to Compute a Moving Average in BigQuery Using SQL

Smooth out variations, spot trends, and visualize them in Data Studio

Stock market station (Credits: https://unsplash.com/@behy_studio)
Stock market station (Credits: https://unsplash.com/@behy_studio)

The problem

When looking at time-series data, decisions can be **** influenced by random, short-term fluctuations (price of a cryptocurrency, number of Covid-19 cases reported).

This is why using a Moving Average (also called Running Average or Rolling Average) helps mostly to smooth out short-term fluctuations and highlight longer-term trends.

For example, Covid-19 test-stations and laboratories may report cases in bulks through the week (or not report them when they are closed on weekends), cryptocurrencies stock market prices can rapidly oscillate between values between days.

To illustrate, you may have seen on Google Search the following graph when looking at Covid 19 new cases:

Covid 19 new cases over time from Google Search (Image by Author)
Covid 19 new cases over time from Google Search (Image by Author)

You will notice in the legend the light blue line showing a 7-day average. This is exactly what we will learn to do in this article 😎

You can see it below displayed in Data Studio:

Covid 19 new cases over time displayed in Data Studio (Image by Author)
Covid 19 new cases over time displayed in Data Studio (Image by Author)

The solution

In this article, we will be computing a 7-day moving average based on the Google Covid-19 Open data.

The choice of 7 days, in this case, is important (because we could compute a moving average for 14 days, 30 days, etc…).

Here, we pick 7 days because it is quite a short-term period and we want to keep a weekly view (eg. Monday – Sunday) that would represent better the pace of Corona test stations or test centers (reporting maybe by end of the week, or at a specific interval of a week).

But for a different use case, we might pick a different time frame. Our example is about 7 days period but it could be computed on a 12 weeks period. Depending on the case, the frame can also be days, weeks, quarters, or years.

Additionally, you could apply a moving average to any numeric values like revenue, a number of users, etc…)

Our solution will be using analytic functions. They are functions that allow us to compute or get a value over a group of rows and return a single result for each row. It’s like traveling or browsing over a single table without performing joins or grouping over dimensions.

To know more about aggregate analytic functions, you can have a look here πŸ‘‡πŸΌ

Aggregate analytic functions | BigQuery | Google Cloud

For this experiment, we are using Covid-19 Public data from the public Bigquery data set.

You can get more information and access it by following this link πŸ‘‡πŸΌ

Free public datasets for COVID-19 | Google Cloud Blog

To find the full script which we will explain step by step, you can directly go to step #3, and run it in BigQuery.


Step #1: Transforming and Grouping

First, we would like to get our specific variable (new_confirmed) aggregated by the desired time period (in our case on a daily basis).

The above query returns a table with the new confirmed cases every day.

New confirmed cases by day
New confirmed cases by day

By using the "EXPLORE Data" button between the query editor and the result table, we can display nicely on Data Studio the daily new confirmed cases as follow:

New confirmed cases over time by day in Data Studio (Image by Author)
New confirmed cases over time by day in Data Studio (Image by Author)

That’s only the first step, but it’s a good start, hang on πŸ˜…

Step #2: How to compute our moving average?

We previously computed the number of daily new confirmed cases, where the output is a table for each row a day and the number of new cases. We named the column new_cases.

We will use an aggregate analytic function to get our moving average. This function will look like this:

AVG(new_cases) OVER(ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW)

We input that we want an average of new cases _AVG(newcases) and that we would like that this average includes the current row value CURRENT ROW and the preceding 6 rows (and that for each row of our table).

You would then have the following output:

Output for each row with Moving average
Output for each row with Moving average

Let’s make the calculation manually, let’s start from 2022–02–06, the 7-day average would return:

(155439 + 214542 + 241049 + 716294 + 831347 + 970542 + 929533)/7

This gives 579820,85 that we round to the value you see in the seven_days_macolumn for 2022–02–06.

But, there is a subtility that we need to address. You could also write it:

AVG(new_cases) OVER(ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

So what is the difference between using RANGE or ROWS?

The ROWS will look at how many rows before the current one to include them into our average. So, in our case, the function will look at the current row and then look at the previous "physical" 6 rows and compute an average. But what if there are missing dates?

That’s where RANGE comes in handy, if the table was missing dates for some reason, then ROWS would go back more than 6 days. RANGE will produce a window based on the date values themselves.

As an example with missing dates:

ma_rows is using the ROWS frame, ma_range is using the RANGE frame
ma_rows is using the ROWS frame, ma_range is using the RANGE frame

You can see here, the ma_rows is computing a 7-days moving average using ROWS and ma_rangeis computing our moving average using RANGE, they are not computing exactly the same value because of the missing date (2022–02–03).

In general, using a RANGE is recommended as it’s not based on physical rows but logical values. The only difference is that the date field needs to be converted using the UNIX_DATE() function.

Step #3: Applying the formula in our query

Now we can bring it all together, this is the full script.

The first WITH clause computes all the new confirmed cases per date. Under the " – Main Query" comment, we apply our moving average function and we also ROUND it in another query on top (because we can’t use a ROUND function using an OVER() clause).

This gives us the following output:

Results of our moving average query as a table
Results of our moving average query as a table

That’s how we get our 7-days moving average! 🍻 It is then ready to be displayed on Data studio


And now, it’s time to shine ✨

We can then use Datastudio to display our results, which gives us a nice overview of the covid-19 cases and the 7 days moving average line in light blue.

Covid 19 new cases over time displayed in Data Studio (Image by Author)
Covid 19 new cases over time displayed in Data Studio (Image by Author)

This method can be used on any kind of time series and variable. The interesting part is the use of aggregated analytic functions. They make this exercise a lot easier even if they are not that easy to understand at first.

I hope you’ll like this method and the article will help you save some time. But more importantly, I hope this will help you get more at ease using Sql and BigQuery! It’s a good use case to practice your skills!

Time to dive into some data! πŸ€“


Related Articles