Explore Arrays and Structs for Better Query Performance in Google BigQuery

Stepping up your SQL query game with these 2 advanced SQL concepts is easier than you think. Don’t miss out!

Skye Tran
Towards Data Science

--

Photo by toine G on Unsplash

A good understanding of arrays and structs could be extremely powerful when analyzing big data because we can query faster and more efficiently with pre-joined tables from object-based schemas such as JSON or Avro files. In this blog post, we will explore arrays, structs and how to make use of their full potential in Google BigQuery through lots of examples.

You’ll learn

  1. Why do we need to know arrays and structs?
  2. Arrays and how to work with them
  3. Structs and how to combine them with arrays to create nested records

Enough talking, let’s start!

Why do we need to know arrays and structs?

Photo by author

Take a look at the picture above, what can you see?

The picture represents 3 different schemas to capture the relationship between a restaurant’s name, location and its cuisine labels. To keep it simple, we assume a restaurant only has 1 location but multiple cuisine labels.

All the way to the left is option 1 where we have 2 separate tables tied together by a one-to-many relationship. Normalization helps to avoid storing location redundantly, thus reducing storage costs. But we are also sacrificing query performance because joining 2 tables is required if we want to know where is the Thai restaurant.

Option 2 is the total opposite. Query performance will be much faster because there is no need to join 2 tables, yet storage costs will be higher because the location is stored redundantly throughout the denormalized table.

But what if we could have the best of both worlds? Of course, we can with option 3 implemented in Google BigQuery. Each restaurant’s cuisine labels are nicely nested inside one row, which brings about 2 key benefits.

Better query performance

  • All data is in one table, so you can skip the expensive joins.
  • If we are doing something like a SELECT DISTINCT under location, we only have to iterate over 3 rows instead of 6 rows.

Lower storage costs

  • Name and location don’t have to be stored redundantly

Sounds good! But how can we organize information according to option 3 to reap these awesome benefits? Well, my friend, that’s why we need arrays and structs since they represent nested and repeated fields in BigQuery. In the next sections, we will explore arrays and structs in detail.

Arrays

Definition

An array is an ordered list containing values of the same data type. It is used when we want to store repeated values in a single row such as those below scenarios.

  • An employee has one or multiple skills
  • A student can speak one or multiple languages
  • A restaurant listed on Zomato has one or multiple cuisine labels such as Italian, Pizza, Pasta, Casual
  • A customer’s sales order includes one or multiple items

Working with SQL Arrays

Create an array with square brackets

Let’s begin with something simple. We will attempt to create a permanent table with an array to capture cuisine labels associated with 4 different restaurants in Sydney.

Photo by author

Looking at the query result, here are 3 key points to note.

  1. We can create an array with square brackets [ ] with each element separated by a comma. Values within each pair of square brackets form 1 array.
  2. Since each array represents cuisine labels associated with a restaurant, BigQuery has stored them in repeated fields associated with one single row. In total, we have 4 different rows corresponding to 4 restaurants.
  3. Under each row, cuisine labels are displayed vertically in different lines instead of lumping them together inside square brackets horizontally. We say BigQuery has naturally flattened the output.

Break open an array with UNNEST

What if you want to unpack the array so that it can be exported to a spreadsheet in MS Excel or Google Sheets? Well, fair enough, UNNEST operator will come in handy for such a case.

Photo by author

As you can see, the UNNEST operator transforms the array into a denormalized format, ready to be consumed by MS Excel or Google Sheets. From 1 row nesting 2 cuisine labels for Ho Jiak restaurant at Haymarket, we now end up with 2 separate rows.

Aggregate elements into arrays with ARRAY_AGG

What if we have a denormalized table beforehand and need to aggregate all cuisine labels into array format so that you end up with fewer rows?

Let’s save the result of the above query into a BigQuery table named “restaurant_cuisine_unnested”. Next, we will use the ARRAY_AGG operator with GROUP BY to bring it back to the array format.

Photo by author

Voilà! Here you go, things are nicely re-packaged to normal with 4 rows for 4 restaurants. Do you notice something different regarding the order of cuisine for each row? (Hint: Notice how Casual now appears above Japanese for Chaco Bar)

Yes, since we have opted to use “ORDER BY cuisine” with ARRAY_AGG, all cuisine labels are now sorted in alphabetical order. This order will be preserved if we save it as a permanent table.

Count the elements in an array with ARRAY_LENGTH

It’s time to turn things up a notch and see what else we can do with arrays. Let’s count how many cuisine labels each restaurant is associated with.

Photo by author

Query Array with UNNEST and WHERE IN

What if we want to list all related information about casual restaurants?

To solve this question, we will have to go through each element of the arrays one by one and filter only cuisine containing “Casual”. But an array column behaves differently from a normal column. You can’t go through and filter elements inside an array without first unnesting it. Keeping that in mind, I have included 2 slightly different queries that give you the same result in 2 different formats, depending on what you want.

Option 1. If you want to obtain a denormalized query result without any array (to consume in Excel or Google Sheets)

Photo by author

Option 2. If you want your cuisine column nicely packed into an array (to save storage costs and query time)

Note: The below query has been kindly provided by Michael Entin. I decide to replace it with the original query since it’s way simpler and more concise. Once again, thanks @mentin for enlightening me!

Photo by author

Structs

Definition

Structs are flexible containers of ordered fields each with a type (required) and a name (optional).

- From Google Cloud

Contrasting with arrays, you can store multiple data types in a Struct, even Arrays. In Google BigQuery, a Struct is a parent column representing an object that has multiple child columns. For example,

  • A restaurant has a location represented by different fields such as address, city, state, postal code.
  • An employee has a qualification associated with different fields such as university, degree, start date and end date.

It’s important to keep in mind that when you load nested and repeated data in Google BigQuery, your schema cannot contain more than 15 levels of nested STRUCTs.

Working with SQL Structs

Create a simple struct

The easiest way to understand how Structs work is to create a struct. Let’s combine the existing cuisine_array with other details (e.g. price range, delivery) to form a struct. Here is how to do it with the STRUCT operator.

Photo by author

At its simplest form, 3 columns with the same prefix “basic_info” form a Struct. Different elements in a struct can have different data types. For example, above we have an array, string and Boolean data types, all live in harmony within a single struct.

Since we are here, you might want to take a look at the Schema below to understand the difference between array and struct. If the Mode is REPEATED, that’s an array (a.k.a. repeated field). When the Type is RECORD, it represents a struct (a.k.a. nested field).

Photo by author

Query a STRUCT with DOT NOTATION

Given the above table in BigQuery, it’s time to push ourselves a little bit harder. How about identifying all casual restaurants with delivery service?

Photo by author

Below are 2 crucial takeaways to take note of when querying a STRUCT.

  1. Use dot notation to query a nested column (e.g. WHERE basic_info.has_delivery = true)
  2. Don’t forget to UNNEST an array before query elements in an array, even if that array is nested under a struct

Create nested records from arrays of structs

This is where your SQL knowledge is going to peak with the very advanced concept of nested records. So here we go, everybody!

Let’s take a closer look at the table we obtained from the query. What do you notice?

Photo by author

Earlier on, we see a struct that contains an array. But now, we see the other way round, the first array “Sydney” includes 4 structs while the second array “Wollongong” contains 2 structs. What’s so significant about this?

Photo by author (Only selected child values are shown in the photo)

Here we witness the creation of nested records that denormalizes multiple levels of one-to-many (i.e. parent-and-child) relationship without joins. All data resides within one single table; each record is nested to optimize storage yet expandable (with UNNEST) for high query performance. Therefore, in BigQuery, we say nested records are arrays of structs. Maintaining nested records eliminate the need for repeating data, creating additional subtables or using expensive joins during data analysis. That is the crux of what makes Google BigQuery so powerful as a data warehouse, especially when analyzing data from JSON or Avro files.

Wrapping Up

In this post, we have explored the definition of arrays and structs as well as why nested and repeated fields are so important in BigQuery. Here are 3 key takeaways from this post.

  1. An array is a data type supported in SQL, even outside of Google BigQuery. Within an array, each element must have the same data type and the order of values is preserved.
  2. A struct can contain elements of different data types, including arrays.
  3. In BigQuery, nested records are arrays of structs. Maintaining nested records eliminate the need for repeating data, creating additional subtables or using expensive joins during data analysis, resulting in better query performance and lower storage costs.

Thank you for reading. Have a question or want to chat? Let me know in the comments or find me on Twitter. Have a rock n’ roll day and see you later, alligator!

Originally published at http://thedigitalskye.com on January 20, 2021.

--

--

Love data, problem-solving, and storytelling | Observe the world through technology-driven lens | Cherish order from chaos