
BigQuery UNNEST: How to work with nested data in BigQuery

We use a common example: un-nesting Firebase event data to facilitate data science analysis


We frequently apply machine learning techniques to event-based data to generate high-value insights: from predicting customer churn using people’s recent activity on, say, a subscription-based product, to scoring how engaged a user is with a piece of content as input to a recommendation engine.

Event-based data is a good example of semi-structured data.

What is semi-structured data?

Structured data is data that adheres to a rigid tabular format. This makes it ideal for storing in say, a database table or a spreadsheet.

Semi-structured data, however, has variations in its structure; attributes are not fixed (the equivalent of columns in a database table), and, to further complicate things, the data itself can be nested.

To illustrate this, consider the following use case: We want to consolidate event data generated from 3 different apps that users interact with via mobile devices and laptops/desktop PCs.

Some of the attributes received in the event data will be common across all apps. A timestamp, for example, to indicate when the event occurred. Other attributes will only apply to certain applications or devices. Furthermore, some of these attributes may occur multiple times within a single event, creating a need for the data to be nested.
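As a hypothetical sketch (the field names here are illustrative, not the actual Firebase schema), a single event might look like this in JSON, with a repeated params attribute nested inside it:

```json
{
  "event_name": "level_complete",
  "event_timestamp": 1533131154159108,
  "device": {"category": "mobile", "os": "Android"},
  "params": [
    {"key": "value", "int_value": 17},
    {"key": "board", "string_value": "S"}
  ]
}
```

Notice that params is a list: one event can carry any number of key/value pairs, which is exactly the kind of structure a rigid table struggles to hold.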

How is semi-structured data stored?

Semi-structured data is typically stored as JSON or XML, and is the foundation of NoSQL data stores such as Google’s Firebase, or MongoDB.

Storing semi-structured data in BigQuery

Alongside structured data, Google’s BigQuery also supports the storage of semi-structured data via two nested datatypes: the ARRAY and the STRUCT.
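To sketch how these two datatypes combine in a table definition (a simplified, hypothetical schema, not the full Firebase export schema):

```sql
-- Hypothetical table with a repeated, nested column:
-- event_params is an array of structs, one element per parameter
create table example_events (
  event_name      string,
  event_timestamp int64,
  event_params    array<struct<key string,
                               value struct<string_value string,
                                            int_value    int64>>>
);
```

The ARRAY holds the repetition (many parameters per event) and the STRUCT holds the shape of each element (a key plus a typed value).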

Furthermore, BigQuery makes it easy to ingest JSON and other semi-structured data into its tables, to facilitate further analysis.

We will explore a common example of event-data exported from Firebase and loaded into BigQuery. We will show how to work with the data and explore useful BigQuery functions, including UNNEST.


Before you start

Google BigQuery

We assume you already have a Google Cloud Project with access to BigQuery. If not, you can easily create one (for free) using the link below.

GCP Free Tier – Free Extended Trials and Always Free | Google Cloud

Example Firebase event data

For this example, we use some really good, publicly available Firebase data from here. These are simulated events from an online game.


Load Firebase raw event data

We start by creating a dataset in our BigQuery project to contain our example tables. To do this:

  1. Navigate to your BigQuery project in the Google Cloud Console. Click your project-id in the nav menu on the left.
  2. You should see a CREATE DATASET option appear like this. Click the link.
Creating a dataset to keep things organised
  3. Enter "blog_unnest" as the dataset name, and leave all other defaults. Click Create Dataset.

Now we have a dataset, we can create a table called firebase_raw that contains a sample (1,000 rows) of Firebase event data. To do this, simply run this in the BigQuery UI:

create table blog_unnest.firebase_raw
as
select *
from `firebase-public-project.analytics_153293282.events_20180801`
where event_name = 'level_complete_quickplay'
limit 1000

Tip: Notice that the Firebase-to-BigQuery export generates an events table sharded by event date (the 20180801 suffix in the table name above). In this example, we are extracting data from shard 20180801, which contains all events seen on 1 Aug 2018.

The limit keyword tells BigQuery to return at most 1,000 rows, like the top n feature if you come from an MS SQL background.
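Because the export is sharded by date, you can also query a range of days at once using a table wildcard together with the _TABLE_SUFFIX pseudo-column. A sketch (the date range here is arbitrary):

```sql
-- Count events per name across the first week of August 2018
select event_name,
       count(*) as event_count
from `firebase-public-project.analytics_153293282.events_*`
where _table_suffix between '20180801' and '20180807'
group by event_name
order by event_count desc
```

This keeps the per-day sharding benefits (you only scan the shards you name) while letting one query span multiple days.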

Exploring the Firebase data

With the sample data loaded, we can explore its structure. Clicking on the firebase_raw table, we see the columns listed. Note the event_params is of type RECORD – indicating it is a nested field.

Here is one of many nested fields in the Firebase event data

If we run the simple query below, we can see what the data looks like.

select *
from blog_unnest.firebase_raw
Here's what our nested data looks like in the results window

Event_params (highlighted) is a nested field, with each nested row representing a parameter for that event. We can see there are 5 nested rows for this "level complete quickplay" event, one row per parameter.

In Firebase, each event typically needs its own unique set of supporting parameters, and you can see how storing them in this way makes that possible.

Event_date is a regular column, indicating which day the event occurred.

Event_timestamp is a Unix-style timestamp denoting when the event took place.

Tip: Event_timestamp in Firebase denotes the number of microseconds since 1 Jan 1970. To convert this into a BigQuery timestamp, you can use BigQuery’s timestamp_micros function:

select timestamp_micros(event_timestamp) as event_ts,
       event_timestamp
from blog_unnest.firebase_raw

Which gives 2018-08-01 13:45:54.159108 UTC for event_timestamp 1533131154159108.

BigQuery UNNEST function

The UNNEST function allows us to easily query nested fields, such as the parameters in our event data.

Suppose we want to flatten our event data into rows, and extract:

  1. The event_timestamp
  2. The event_name
  3. The value from the nested params (the number of moves a player made)
  4. The board from the nested params (the size of the board played: S, M, or L)

The first two are easy. But the last two are more difficult. Here is where the UNNEST function comes into play:

select timestamp_micros(event_timestamp) as event_ts,
       event_name,
       event_timestamp,
       param_value.value.int_value       as move_count,
       param_board.value.string_value    as board_size
from blog_unnest.firebase_raw r
  cross join unnest(event_params) param_value
  cross join unnest(event_params) param_board

where param_value.key = 'value'
and param_board.key = 'board'

This produces the output that we want:

We have extracted the move count and board size from the nested data and flattened to a single row per event

Let’s break this down.

cross join unnest(event_params) param_value

UNNEST takes our nested event_params array, unpacks it, and produces a table with one row per element of the array (each element being a struct). In our case, this nested field contained 5 rows per parent event row.

The cross join then applies a Cartesian product of the 5 nested rows with the parent event row.

The total row count returned is therefore 5,000 (5 parameters per 1,000 rows).
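If you want to see this multiplication for yourself, a simple count over the unnested rows makes it visible (assuming, as in our sample, that every event carries exactly 5 parameters):

```sql
-- One output row per (event, parameter) pair:
-- 1,000 events x 5 parameters = 5,000 rows
select count(*) as total_rows
from blog_unnest.firebase_raw
  cross join unnest(event_params) param
```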

In our case, the first attribute we wanted to extract from event_params was the value. So, we alias our cross join unnest function call as param_value, and then add the following where clause to filter on the key:

where param_value.key = 'value'

Remember, the rows in our param_value are themselves structs. So, when we extract the int_value in the select clause, we need to use dot notation to reference the value:

select param_value.value.int_value       as move_count,

Tip: if you want to select all the attributes in a struct, and not just the int_value as we did here, you can just do this:

select  timestamp_micros(event_timestamp) as event_ts,
        event_name,
        event_timestamp,
        param_value
from  blog_unnest.firebase_raw r
cross join unnest(event_params) param_value

This will add each attribute as a column in the result set. Useful if you want to explore structs where you don’t really know what the structure is:

Output of my struct showing all attributes as columns

Using UNNEST in an inline select

There is another way to use the UNNEST function to achieve the same output as our previous example: we can use an inline select instead of the multiple cross joins.

select timestamp_micros(event_timestamp) as event_ts,
       event_name,
       event_timestamp,

       ( 
          select event_params.value.int_value 
          from unnest(event_params) event_params 
          where event_params.key = 'value'
       ) as move_count,

       ( 
          select event_params.value.string_value 
          from unnest(event_params) event_params
          where event_params.key = 'board'
       ) as board_size
from blog_unnest.firebase_raw r

Beware: Handling missing attributes

One advantage of the inline select approach over the cross join is that it handles missing attributes much better.

For example, say we want to bring back a third attribute, "foo", from the event_params. This parameter only applies to certain events, and when it is missing, we still want our other data returned. Note that foo does not exist in our test data.

Modifying the first approach:

select timestamp_micros(event_timestamp) as event_ts,
       event_name,
       event_timestamp,
       param_value.value.int_value       as move_count,
       param_board.value.string_value    as board_size,
       param_foo.value.int_value         as foo

from blog_unnest.firebase_raw r
 cross join unnest(event_params) param_value
 cross join unnest(event_params) param_board
 cross join unnest(event_params) param_foo

where param_value.key = 'value'
and param_board.key = 'board'
and param_foo.key = 'foo'

Running this returns 0 rows. Because no parameter with the key 'foo' exists, the where clause filters out every unnested row, in much the same way that putting a condition on the right-hand table of a left outer join in the where clause (rather than the on clause) discards the rows you were trying to preserve.
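If you prefer to stay with the join style, one option is to switch the problematic join to a left join unnest, moving the key filter into the on clause so that events without foo are kept. A sketch of this approach, untested against the sample data:

```sql
-- left join unnest keeps the event row even when no element
-- matches the on condition; param_foo is then simply null
select timestamp_micros(event_timestamp) as event_ts,
       event_name,
       param_value.value.int_value       as move_count,
       param_board.value.string_value    as board_size,
       param_foo.value.int_value         as foo
from blog_unnest.firebase_raw r
  cross join unnest(event_params) param_value
  cross join unnest(event_params) param_board
  left join unnest(event_params) param_foo
         on param_foo.key = 'foo'
where param_value.key = 'value'
and param_board.key = 'board'
```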

However, if we amend our inline select approach:

select timestamp_micros(event_timestamp) as event_ts,
       event_name,
       event_timestamp,
       ( select event_params.value.int_value 
         from unnest(event_params) event_params 
         where event_params.key = 'value'
       ) as move_count,

       ( select event_params.value.string_value 
         from unnest(event_params) event_params
         where event_params.key = 'board'
       ) as board_size,

       ( select event_params.value.string_value 
         from unnest(event_params) event_params
         where event_params.key = 'foo'
       ) as foo
from blog_unnest.firebase_raw r

This time, we get our expected result; rows returned with the foo column containing nulls.

Success!

In some ways, this approach is also easier to read, and additional attributes are easier to add because only the select clause needs to be modified. In terms of the amount of data scanned, the two approaches are identical.

Tip: as a general rule we try to avoid inline selects, as joins are typically more performant; in this use case, however, the performance is identical and we would typically opt for the inline select approach.

Parsing datatypes in a struct using UNNEST

Lastly, we show you how to convert a datatype (or apply other manipulations) within nested data while maintaining the data in its nested form.

Again, we use the BigQuery UNNEST function to achieve this. The basic approach is to:

  1. Call UNNEST to unpack the nested data
  2. Apply cast function (or other data manipulation)
  3. Reconstruct the structure using the ARRAY and STRUCT datatypes accordingly.

Keeping with the Firebase event data, in the user_properties nested field we find another timestamp attribute, set_timestamp_micros, that is again in the Unix-like format and therefore difficult to read. We want to apply the timestamp_micros() function to this attribute, converting it to a human-readable timestamp.

We can achieve this using the following combination of UNNEST, ARRAY, and STRUCT:

select timestamp_micros(event_timestamp) as event_ts,
       event_name,
       event_timestamp,
       array(
          select struct(
             user_properties.key,
             struct(
                user_properties.value.string_value,
                user_properties.value.int_value,
                user_properties.value.float_value,
                user_properties.value.double_value,
                timestamp_micros
                (  
                    user_properties.value.set_timestamp_micros
                ) as set_timestamp_micros
             ) as value
          )
          from unnest(user_properties) user_properties
      ) as user_properties
from blog_unnest.firebase_raw r

If you take the time to understand the above, you should be well on your way to mastering the UNNEST function and working with Firebase and other semi-structured data.


Conclusions

As we have seen, BigQuery is really flexible in that it can store both structured and semi-structured data via its support for nested datatypes. This lends itself really well to analysing and applying machine learning to event data, which is inherently semi-structured.

The notation in BigQuery for working with nested data does have a bit of a learning curve, and this was partly the motivation for writing this article – to help fellow data scientists and analysts who might be faced with this challenge to see a working example complete with explanations.

We specifically hope those of you who are working with Firebase event data (a very common use case in the field), will find this helpful and should hopefully save you some time!


Next Steps

  1. You can read the official Google Cloud documentation on working with nested data here.
  2. Learn more about Ancoris Data, Analytics & AI
