The world’s leading publication for data science, AI, and ML professionals.

How to build efficient and perfomant Data Structures in BigQuery

Ways of using Denormalization and Nested Data

Photo by James Lee on Unsplash
Photo by James Lee on Unsplash

Bigquery is a fully managed, serverless data warehouse on the Google Cloud Platform infrastructure that provides scalable, cost-effective and fast analytics over petabytes of data. It is service-software that supports queries using standard SQL. In this article, I would like to mention two main techniques to make your BigQuery Data Warehouse become efficient and performant [1].

Some Theory ahead

SQL vs NoSQL: SQL databases are table-based databases, whereas NoSQL databases can be document-based, key-value pairs, and graph databases. SQL databases are vertically scalable, while NoSQL databases are horizontally scalable. SQL databases have a predefined schema, while NoSQL databases use a dynamic schema for unstructured data[2].

Row vs. Column based Databases: A row-structured database stores data belonging to specific table rows in the same physical location. Famous examples are MySQL and MSSQL databases. Column-oriented databases are in contrast to the most common row-oriented databases. In contrast to the row-oriented databases, they do not store the individual rows next to each other, but the columns. This form of storage is particularly useful for analytical processes involving large amounts of data, since aggregation functions often have to be calculated for individual columns. Well-known examples are HBase, MongoDB or Google BigQuery.

Now, we come to the next two important theoretical terms: OLTP and OLAP. Online transaction processing (OLTP) captures, stores, and processes data from transactions in real time. While online analytical processing (OLAP) uses complex queries to analyze aggregated historical data from OLTP systems. Where columnar databases have problems compared to row structured databases:

  • Materializing entire rows is expensive. If you want to do the equivalent of select * from, you have to do a Join across all your column indexes to get the values.
  • While INSERTs are straightforward, UPDATE and DELETE operations are complicated.

For the above reasons, true columnar databases have their main use cases in data warehousing, analytics, and other archive-type data stores, while row-structured databases are generally better suited for OLTP workloads.

So what Now?

For once, it was important for me to present the basic theoretical background. Because you have to know that BigQuery can best be described as a hybrid system. It is definitely a column-based system and therefore more suitable for analytical purposes. This is also important to understand why data should be denormalized – but more on that later. In addition, BigQuery is quite similar to a standard SQL Data Warehouse, since it can be queried with standard SQL, for example, and serves more as a repository for data from OLTP systems – instead of, for example, image data or similar – but on the other hand allows storage of nested data structures. Therefore, BigQuery can truly be called hybrid.

Denormalized Data

The following sections will describe the technical part behind the whole process. Firstly, one must think about how to build their best data schema. Rather than adopting or redeveloping traditional Star or Snowflake schemas, data engineers should look at the opposite, denormalization. As mentioned before, data is often taken from OLTP systems and normalized. In BigQuery data should be denormalized again. Here is a small basic example:

Denormalization - Image by Author
Denormalization – Image by Author

Simply put, you should join tables in your ETL/ELT process that belong together in terms of content and save them as a new table. A Join over a View would be theoretically also possible, but a stored table – which one renews in its transformation step for example once on the day – is however simply more performant. So in the example above, you would join the customer data with the sales data and save it as a new object. The challenge here is to also deal with the business process in the background, so that meaningful new data objects can be found and as few joins and transformations as possible for the subsequent data analysis process arise.

Nested Data

The next step in the denormalization process would be packing the sales if necessary immediately into a nested structure. If you ask yourself what is Nested Data – In short: BigQuery supports loading and querying nested and recurring data from source formats that support object-based schemas for example JSON. These columns can preserve relationships without degrading performance as relational or normalized schemas do [3].

Illustration of nested and recurring data - Image by Author
Illustration of nested and recurring data – Image by Author

The address column contains an array of values. The different addresses in the array are the recurring data. The different fields within each address are the nested data. Read more about it here . So, in our example shown above regarding customers and sales data, the following data structure would result:

Denormalized and nested Data - Image by Author
Denormalized and nested Data – Image by Author

Summary

In order to build the most efficient and performant data structures possible in BigQuery, data engineers should therefore deal with the technical background and implement the two core concepts of denormalization and nested data. In my opinion, it is not so much the technical know-how that is crucial here, but rather the knowledge of the business process. Here, you have to invent meaningful data models together with the business. Once this has been achieved, much less effort is required in further processes, for example in reporting or analytics, and the queries are probably also faster. Another possibility to make the queries more efficient are techniques like partitioning and clustering. These are also known from classic databases. Here, I recommend the good documentation [4] from Google.

Sources and Further Readings

[1] Google, BigQuery (2021)

[2] IONOS, NoSQL – Funktion und Vorteile von NoSQL-Datenbanken (2019)

[3] Google, Specifying nested and repeated columns (2021)

[4] Google, Introduction to clustered tables (2021)


Related Articles