The world’s leading publication for data science, AI, and ML professionals.

Save Time and Money in BigQuery by Using Arrays and Structs

Learn to embrace and understand the nested schema

.

Photo by Towfiqu barbhuiya on Unsplash
Photo by Towfiqu barbhuiya on Unsplash

Although Arrays and Structs can add complexity to your queries, once you understand how they work you can reap the benefits of a nested data structure.

To give you an idea of how beneficial this change can be, below shows the savings I’ve gained by converting one of Google Clouds public datasets from a typical denormalised format, into a nested format.

I've been able to reduce the number of rows by 93.1% and the size of the table by 35.5%
I’ve been able to reduce the number of rows by 93.1% and the size of the table by 35.5%

Here’s what we’ll cover in this article:

  • What is a STRUCT and how does it work?
  • What is an ARRAY and how does it work?
  • How can you use a combination of both ARRAY and STRUCT data types to reduce query and table storage costs?
  • Finally, we’ll run through a few test exercises to demonstrate how to query nested data, while also gaining insights into the performance difference vs the original denormalised table

Photo by Maksym Kaharlytskyi on Unsplash
Photo by Maksym Kaharlytskyi on Unsplash

STRUCT

STRUCT is short for STRUCTURE, and in Bigquery you can use the STRUCT data type to bring multiple fields together under one name. What helped me understand this concept was to imagine you’re assigning a category to a group of columns.

Let’s run through an example. Imagine we have a table which contains student data.

SELECT 
193348009 as Student_ID,
"Toby Lerone" AS Student_Name,
"Year 11" as Year_Group,
"Group 5A" as Class_Group,
"Mr Brightside" as Year_Teacher,
"Maths" AS Subject,
73 AS Grade,
DATE("2022-11-01") AS Exam_Date

To create a STRUCT, we first need to decide what fields should be included in each STRUCT. Let’s try two STRUCTs for this example; Details and Results.

Creating the STRUCT is straightforward. Simply list what fields you want to be included in the STRUCT by adding them inside the parenthesis (each one must be split by commas).

STRUCT(FIELD1, FIELD2, FIELD3) AS NEW_STRUCT

Let’s create two new STRUCTs, one for Details and another for Results.

SELECT
193348009 as Student_ID,

STRUCT
  (
    "Toby Lerone" AS Student_Name,
    "Year 11" as Year_Group,
    "Group 5A" as Class_Group,
    "Mr Brightside" as Year_Teacher
  ) AS Details,

STRUCT
  (
    "Maths" AS Subject,
    73 AS Grade,
    DATE("2022-11-01") AS Exam_Date
  ) AS Results

Note* – I’ve indented my STRUCTs above, but this isn’t a syntax requirement and I did this just for readability. You can have the entire STRUCT in one line if you wish.

The results are not that different from what we saw earlier, except we now have a prefix added to each of the field headers.

This tells us that the Student_Name, Year_Group, Class_Group and Year_Teacher are in the Details STRUCT. While the Subject, Grade and exam date fields ** are contained within the Results STRUCT**.

If we save these results as a table, our schema will look like the one below. Notice we have a new data type called ‘RECORD, which now appears against the two STRUCTs we defined earlier.

Querying a STRUCT is as simple as creating one. There is only one change you need to make compared to how you would normally select fields – You must add the prefix to the field name.

SELECT 
Student_ID,
Details.Year_Teacher,
Results.Subject,
Results.Grade,
Results.Exam_Date
FROM Student_Data

The official documentation can be found here for the STRUCT data type.


Photo by Kelli McClintock on Unsplash
Photo by Kelli McClintock on Unsplash

ARRAY

An ARRAY is a list of values which exist in the same cell, which are also all of the same data type. This is an important distinction between STRUCT and ARRAY. A STRUCT can contain various data types whereas an ARRAY can only contain one.

There are a few ways you can create an ARRAY in BigQuery, I will cover two of these methods in this article (helpful documentation here if you want to explore further).


Creating an ARRAY with square brackets

Creating an ARRAY is as simple as creating a STRUCT. You must include your values inside square brackets, and each value must be separated by a comma.

SELECT ["Chris Packet","Liz Anya","Hugh Dunnit"] as Attendees

As you can see above, we have three attendees inside our ARRAY, but what’s important to note is that these three values are all contained in one single row.

This is the key part that leads to the storage and query cost savings mentioned at the beginning of this article, as even in this simple example we’ve reduced our table size from 3 rows to 1.


Creating an ARRAY using ARRAY_AGG()

The ARRAY_AGG() function brings a set of values together and concatenates them into an ARRAY. We’ll be using this function on the STRUCT fields we created earlier which will provide us with what’s known as a nested table (an array of structs).

I’ve expanded our student data from earlier by including a few more rows. Note that the STRUCTs we defined earlier are still in place, and we now have two students in our dataset, with 3 subjects each.

SELECT
Student_ID,
ARRAY_AGG(Details) as Details,
ARRAY_AGG(Results) as Results
FROM STUDENT_DATA
GROUP BY Student_ID

As the name implies, ARRAY_AGG is an aggregate function, so we must include the group by clause here which will group our STRUCTs by the Student_ID field.

In the ARRAY_AGG function, we only need to specify the name of our STRUCT. We don’t need to list each of the fields separately, as they’re already contained in our STRUCTs.

Running this query shows us familiar results. Notice we only have 2 rows of data despite 6 on show. This is thanks to the array_agg function, which pulled the fields within the STRUCT into an array for each unique Student_ID.

Exporting the results into a spreadsheet will give you some insight into how BigQuery stores this data behind the scenes.


Photo by Waldemar Brandt on Unsplash
Photo by Waldemar Brandt on Unsplash

Applying these techniques at scale

With the theory and practice covered for STRUCTs and ARRAYs, let’s come back to the key subject of this article. How can we use these two methods to save money on storage and query costs?

First, let’s take a look at the original dataset. What I used for this example is a public dataset in BigQuery called Chicago taxi trips.

FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`

Below is a screenshot of the schema. This contains 23 columns, and 203 million rows, taking up 74GB of Google Clouds storage space.

Earlier we walked through STRUCTs and how your fields can be broken down into categories. Let’s do the same again, this time breaking the dataset down into 3 STRUCTs: Details, Payment and Geographic.

When you have a table with 15+ fields, it can become tiring (and error-prone) to work with if you need to flick back and forth to check what fields you need to use. Having clearly defined STRUCTs in your table can help add context to your fields, which is especially useful if any fields happen to have similar names.

BigQuery has a handy assist feature which works wonders when working with STRUCTs, allowing you to see all the fields that exist in the STRUCT without needing to check the table schema itself.

Let’s convert the original table into a nested version, using a combination of ARRAYs and STRUCTs.

The query below will work for you too as the data is public. Just amend the table path at the top of the query, but make sure your dataset location is set to US to match the original dataset

create or replace table 
`spreadsheep-20220603.Dashboard_Datasets.taxi_trips_new` 
as (
SELECT 
company, 
taxi_id,
extract(date from trip_start_timestamp) as pickup_date,
#####STRUCT ONE#####
array_agg
(
  struct
  (
    unique_key,
    trip_start_timestamp as start_timestamp,
    trip_end_timestamp as end_timestamp,
    trip_seconds as seconds,
    trip_miles as miles,
    pickup_census_tract, 
    dropoff_census_tract, 
    pickup_community_area, 
    dropoff_community_area
  )
) as details,
#####STRUCT TWO#####
array_agg
(
  struct
  (
    fare,
    tips,
    tolls,
    extras,
    trip_total,
    payment_type
  )
) as payment,
#####STRUCT THREE#####
array_agg
(
  struct
  (
pickup_latitude, 
pickup_longitude, 
pickup_location, 
dropoff_latitude, 
dropoff_longitude,
dropoff_location
  )
) as geographic

FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips` group by 1,2,3)

As you can see above, we have three STRUCTs being created as shown in the diagram earlier, and each one is contained in an ARRAY_AGG function.

Below is the new schema. We have the same fields before in terms of name and data types, but now they’re just organised a little differently under the RECORD type fields (these are the STRUCTs we defined).

Notice also the Mode of the RECORD field shows as REPEATED. This tells us this STRUCT exists within an ARRAY.


Results

This brings us back to the summary screenshot shown at the beginning of this article. We’ve reduced the number of rows in our table by 93.1% and the size of the table by 35.5%!


Testing

Ok, so we can see this approach saves on storage costs, but what about query performance? Let’s run a few queries to answer these three questions:

  1. What % of taxi trips in May 2022 gave a tip (by the company and overall, ordered by total trips desc)
  2. What is the average trip duration in seconds by month for 2022 YTD (removing the top and bottom 5% of rows)
  3. What was the £ per mile for each company in January 2022 (ordered by total trips desc)

Also, let’s bring in a new table to simulate a normalised dataset for another point of comparison. This third table is based on the same data as the other two, I have split the table into two parts as shown below.

create or replace table 
`spreadsheep-20220603.Dashboard_Datasets.taxi_trips_payment` as 
(
SELECT
unique_key, 
fare, 
tips, 
tolls, 
extras, 
trip_total, 
payment_type
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
);

create or replace table 
`spreadsheep-20220603.Dashboard_Datasets.taxi_trips_main` as 
(
SELECT 
* except 
(
  fare,
  tips, 
  tolls, 
  extras, 
  trip_total, 
  payment_type
)
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
);

What % of taxi trips in May 2022 gave a tip?

/*##################################################################################################*/
--What % of taxi trips in May 2022 gave a tip (by company ordered by total trips by each company total desc)--
/*##################################################################################################*/

--NESTED TABLE--
SELECT
COMPANY,
ROUND(SAFE_DIVIDE(TIPS_MADE,TOTAL_TRIPS_BY_COMPANY)*100,1)||"%" AS PERC_TIPPING,
TIPS_MADE,
TOTAL_TRIPS_BY_COMPANY
FROM
(
  SELECT
  LOWER(COMPANY) as COMPANY,
  COUNT(IF(tips > 0,UNIQUE_KEY,NULL)) as TIPS_MADE,
  COUNT(UNIQUE_KEY) AS TOTAL_TRIPS_BY_COMPANY
  FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_new, 
  UNNEST(details) WITH OFFSET pos1, 
  UNNEST(payment) WITH OFFSET pos2
  WHERE DATE_TRUNC(start_timestamp,MONTH) = "2022-05-01" 
  AND pos1 = pos2
  GROUP BY COMPANY
)
ORDER BY TOTAL_TRIPS_BY_COMPANY DESC;

--ORIGINAL TABLE--
SELECT
COMPANY,
ROUND(SAFE_DIVIDE(TIPS_MADE,TOTAL_TRIPS_BY_COMPANY)*100,1)||"%" AS PERC_TIPPING,
TIPS_MADE,
TOTAL_TRIPS_BY_COMPANY
FROM
(
  SELECT
  LOWER(COMPANY) as COMPANY,
  COUNT(IF(tips > 0,UNIQUE_KEY,NULL)) as TIPS_MADE,
  COUNT(UNIQUE_KEY) AS TOTAL_TRIPS_BY_COMPANY
  FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_original 
  WHERE DATE_TRUNC(trip_start_timestamp,MONTH) = "2022-05-01"
  GROUP BY COMPANY
)
  ORDER BY TOTAL_TRIPS_BY_COMPANY DESC;

--SPLIT TABLE--
SELECT
COMPANY,
ROUND(SAFE_DIVIDE(TIPS_MADE,TOTAL_TRIPS_BY_COMPANY)*100,1)||"%" AS PERC_TIPPING,
TIPS_MADE,
TOTAL_TRIPS_BY_COMPANY
FROM
(
  SELECT
  LOWER(COMPANY) as COMPANY,
  COUNT(IF(tips > 0,MAIN.UNIQUE_KEY,NULL)) as TIPS_MADE,
  COUNT(MAIN.UNIQUE_KEY) AS TOTAL_TRIPS_BY_COMPANY
  FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_main AS MAIN
  INNER JOIN spreadsheep-20220603.Dashboard_Datasets.taxi_trips_payment AS PAYMENT
  ON MAIN.UNIQUE_KEY = PAYMENT.UNIQUE_KEY
  WHERE DATE_TRUNC(trip_start_timestamp,MONTH) = "2022-05-01"
  GROUP BY COMPANY
)
  ORDER BY TOTAL_TRIPS_BY_COMPANY DESC;

Our nested table comes out on top in this exercise. The query duration is on par with the original table while using less computational power and processes less data.

For those who are unfamiliar with ARRAYs, pay close attention to how I unnested multiple arrays in this example. If you take away the WITH OFFSET part, then you’ll end up with duplicated results because of the double unnest. To prevent this from happening, I set a condition in the WHERE clause that the array offset values from both arrays must equal each other.

More detail on WITH OFFSET can be found here, or drop your question in this article and I’ll get back to you when I can.


What is the average trip duration in seconds by month for 2022 YTD?

/*##################################################################################################*/
--What is the average trip duration in seconds by month for 2022 YTD--
/*##################################################################################################*/

--NESTED TABLE--
SELECT
date_trunc(CAST(start_timestamp AS DATE),month) as month,
AVG(SECONDS) as avg_seconds
FROM
(
  SELECT
  start_timestamp,
  seconds
  FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_new, UNNEST(details)
  WHERE EXTRACT(YEAR FROM start_timestamp) = 2022
  QUALIFY 
  seconds BETWEEN 
    PERCENTILE_CONT(seconds,0.05) over () 
    AND 
    PERCENTILE_CONT(seconds,0.95) over ()
)
GROUP BY MONTH
ORDER BY MONTH DESC;

--ORIGINAL TABLE--
SELECT
date_trunc(CAST(trip_start_timestamp AS DATE),month) as month,
AVG(trip_seconds) as avg_seconds
FROM
(
SELECT
trip_start_timestamp,
trip_seconds
FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_original
WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2022
  QUALIFY 
  trip_seconds BETWEEN 
    PERCENTILE_CONT(trip_seconds,0.05) over () 
    AND 
    PERCENTILE_CONT(trip_seconds,0.95) over ()
)
GROUP BY MONTH
ORDER BY MONTH DESC;

--SPLIT TABLE--
SELECT
date_trunc(CAST(trip_start_timestamp AS DATE),month) as month,
AVG(trip_seconds) as avg_seconds
FROM
(
SELECT
trip_start_timestamp,
trip_seconds
FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_main AS MAIN
INNER JOIN spreadsheep-20220603.Dashboard_Datasets.taxi_trips_payment as PAYMENT
ON MAIN.UNIQUE_KEY = PAYMENT.UNIQUE_KEY
WHERE EXTRACT(YEAR FROM trip_start_timestamp) = 2022
  QUALIFY 
  trip_seconds BETWEEN 
    PERCENTILE_CONT(trip_seconds,0.05) over () 
    AND 
    PERCENTILE_CONT(trip_seconds,0.95) over ()
)
GROUP BY MONTH
ORDER BY MONTH DESC;

I added two window statements into these queries to see if they caused any noticeable differences between these tables. The window statement removes the bottom and top 5% of rows.

Surprisingly, the original and nested table show almost identical performance, although the nested table did take slightly more computation time.

What was the £ per mile for each company in January 2022?

/*##################################################################################################*/
--What was the £ per mile for each company in January 2022 (ordered by total trips desc)--
/*##################################################################################################*/

--NESTED TABLE--
SELECT
COMPANY,
COUNT(unique_key) AS TOTAL_TRIPS,
SUM(TRIP_TOTAL)/SUM(MILES) AS DOLLAR_PER_MILE
FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_new, 
UNNEST(details) WITH OFFSET pos1, 
UNNEST(payment) WITH OFFSET pos2
WHERE DATE_TRUNC(start_timestamp,MONTH) = "2022-01-01" 
AND POS1 = POS2
GROUP BY COMPANY
ORDER BY TOTAL_TRIPS DESC;

--ORIGINAL TABLE--
SELECT
COMPANY,
COUNT(unique_key) AS TOTAL_TRIPS,
SUM(TRIP_TOTAL)/SUM(TRIP_MILES) AS DOLLAR_PER_MILE
FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_original
WHERE DATE_TRUNC(trip_start_timestamp,MONTH) = "2022-01-01"
GROUP BY COMPANY
ORDER BY TOTAL_TRIPS DESC;

--SPLIT TABLE--
SELECT
COMPANY,
COUNT(MAIN.unique_key) AS TOTAL_TRIPS,
SUM(TRIP_TOTAL)/SUM(TRIP_MILES) AS DOLLAR_PER_MILE
FROM spreadsheep-20220603.Dashboard_Datasets.taxi_trips_main AS MAIN
INNER JOIN spreadsheep-20220603.Dashboard_Datasets.taxi_trips_payment AS PAYMENT
on MAIN.UNIQUE_KEY = PAYMENT.UNIQUE_KEY
WHERE DATE_TRUNC(trip_start_timestamp,MONTH) = "2022-01-01"
GROUP BY COMPANY
ORDER BY TOTAL_TRIPS DESC;

Similar performance for the original and nested table, but the nested table does come out on top thanks to the lower amount of data processed.


Photo by Aron Visuals on Unsplash
Photo by Aron Visuals on Unsplash

Key Takeaways

  • Do you work with large event-based tables which contain a vast amount of events for each unique user/id? Then a nested version of your table can easily save you future time and money
  • If you intend to switch from denormalised to nested as I’ve done in this article, be aware that any of your scheduled or saved queries will need to be updated to handle the new nested schema
  • When you’re unnesting multiple arrays, make sure you use WITH OFFSET for each unnest, and set criteria in the WHERE clause to prevent duplicating rows (see 1st and 3rd test for examples)
  • Although your queries on a nested table can be more complex due to the need to unnest arrays, you can expect cheaper (sometimes faster) queries compared to a denormalised table for your efforts
  • If you work with normalised data (i.e joining multiple tables together in your queries) and routinely query the same tables (such as weekly reporting), then I highly recommend investing time into creating a nested version of the tables being used

If your table is on the smaller side, say less than 10GB, then the benefits of a nested table become less significant and the added complexity will not be worth the trade-off.


That brings this article to an end. If you have any suggestions or recommendations, please feel free to leave a comment. I would be happy to see it.

I frequently write articles for BigQuery and/or Data Studio. If you’re interested, consider following me here on medium for more!

All images unless otherwise noted are by the author.

Stay classy folks! Tom


Related Articles