
I’m starting 2021 with one of the essential New Year’s resolutions: Practice more complex SQL queries. If you’re on the same boat, join me to explore 3 useful SQL features with Google Bigquery.
You’ll learn how to:
- Create tables with date partitions
- Aggregate over a group of rows with (analytic) window functions
- Break down complex queries using WITH clause
Depending on your background, these might seem like basic features that appeared in other relational databases or they may appear exotic. Either way, I have included detailed examples and approach on how I tackled each query. So let’s hop into something comfy and dive right in!
Before you begin
In Google BigQuery, you will need to be aware of the following hierarchy: Project – Dataset – Tables/ Views
- A project organises all resources (e.g. data, storage, compute engine, etc.) to be used. A project can include one or multiple datasets.
- A dataset is a container to organise and control access to tables and views. A dataset can include one or many tables and/or views.
Before analysing any data, we have to create a project and a dataset. Here is how to do it in less than 3 minutes.
Create a project
Click on Select a project and select New Project

Upon seeing the below screen, specify the project name of your choice and click Create.

It will take around 10–20 seconds for project creation to complete. Notification regarding the status of completion can be viewed by clicking on the Bell icon located at the top right corner of the screen.
Create a dataset
Next, you will create a dataset to store your new tables.
Check the project name shown at the top menu next to Google Cloud Platform. If that is not newly created project, click Select a project and switch to the correct one.

Select the project name, then click Create Dataset

Upon seeing the new dataset under the new project (as shown below), you are all set and ready to get started.

Create tables with date partitions
Goal
Create a new table including all daily average levels of PM2.5 particles in the air from 2010 onwards.
Also, in the new table, include a new column to indicate the air quality category based on PM2.5 levels.

PM2.5 particles are a common air pollutant usually found in smoke. They are small enough for human to breathe deeply into our lungs or enter our bloodstream. People who are sensitive to air pollution might experience chest tightness, difficulty breathing, aggravated asthma or irregular heartbeat when PM2.5 levels are high.
Approach
- Create a new permanent table to isolate PM2.5 levels from 2010 onwards for subsequent queries, thus skipping all historical records before 2020 (assuming we are only interested in 2010 onwards).
- Only select necessary columns from the original dataset to avoid wasting time and cost on scanning through irrelevant columns.
- Bind the year of date_local column as a partition to divide the new table into smaller partitions to improve query performance and control costs by reducing the number of bytes read by a query.
- Leverage CASE statement to create a new column to map air quality category against arithmetic_mean
Query

Ten-second takeaway
If you only care about records for a specific period (e.g. last year, within the last 7 days), creating a date-partitioned table will allow us to completely ignore scanning records in certain partitions if they are irrelevant to our query, thus saving query time and costs.
Aggregate over a group of rows with window functions
Goal
For each state, to identify which county, the exact location and date that had the highest daily level of PM2.5 particles in 2019
Approach
- Look at all 2019 readings, rank all arithmetic_mean over groups of state records in descending order
- Include the state_name, county_name, latitude, longitude, date_local together with those max levels per state identified above
- Only select records with the highest arithmetic_mean (i.e. rank = 1)
Query

Remember those Sql tricky interview questions to find the second highest, third highest and so on? This query also works well since you can easily substitute the WHERE clause for rank = 1 with 2, 3 and so on.
Ten-second takeaway
To evaluate aggregate values over a group of rows (i.e. highest value by state/ month/ year), opt for (analytic) window functions instead of using expensive self-JOINs.
Break down complex queries using WITH clause
Example 1
Goal
In 2019, which counties have at least 5 days with poor air quality (i.e. average daily quantity of PM2.5 reached 25 micrograms or above)? For each county, how many days of poor air quality in total?
Approach
- Calculate the daily average of PM2.5 level for each county in 2019
- For each county, count the number of days where PM2.5 reached 25 micrograms or above
- Include only counties with at least 5 days, together with the number of days where PM2.5 reached 25 micrograms or above. Sort by number of days in descending order to highlight county and state having the most days with poor air quality
Query

Ten-second takeaway
To solve a complex query, use the WITH clause (a.k.a. Common Expression Table) to break apart the complex question into many smaller steps and tables instead of trying to write one massive combined SQL statement.
Bonus tip: If the table within the WITH clause can be reused across different queries, consider creating a permanent table to store the query result. In doing so, you can avoid running the same query inside the WITH clause multiple times.
Example 2
Goal
For each state and county, what is the difference between the annual average level of PM2.5 in 2018 and 2019?
Approach
- For each state and county, calculate the annual average level of PM2.5 in 2018
- For each state and county, calculate the annual average level of PM2.5 in 2019
- Combine these 2 results into 1 table and calculate the difference between PM2.5 level between 2018 and 2019. Round the difference to 2 decimal places for easy comparison.
Query

Ten-second takeaway
To make a complex query more readable, use WITH clause to create multiple table expressions, then join the resulting tables.
Despite showing you long queries and a brief explanation of my approach, this article covered only a small part of what can be done with Google BigQuery. However, I hope this has provided you with a good starting point for all the insightful queries that you will be writing. Inevitably you will hit a roadblock or get stuck with a difficult SQL question. In such a case, remember to take a deep breath, grab something to drink and start breaking down the big question into smaller chunks like how I did in my approach. You’re almost there and you will conquer it!
With that, thank you for reading and do let me know if you have any feedback. Have a good one!
Originally published at http://thedigitalskye.com on January 18, 2021.