How to Reduce Your Analytics Costs with BigQuery Partitioned Tables

A simple solution with “big” results.

Taylor Brownlow
Towards Data Science

--

The BigQuery Dilemna

BigQuery is one of the most popular data warehouses in the market today, but if you aren’t careful you can easily rack up a debilitating bill. If BigQuery powers your analytics stack then partitioned tables can be a no-brainer solution to keeping costs down.

What are Partitioned Tables?

Partitioned tables are tables that are, erm, partitioned into smaller tables, usually by date.

Example of tennis data partitioned by year of match. (image by author)

Partitioned tables are not unique to BigQuery. They most commonly appear as those notorious tables with long suffixes (e.g. sessions_2017_01_01, sessions_2017_01_02). That kind of partitioning has always been painful for both the data engineer responsible for maintaining these separate tables and the analyst trying to query them.

But because BigQuery does partitioning behind the scenes, it actually becomes a far tidier, and desirable solution for both engineers and analysts.

Ways to Partition:

In BigQuery there are 3 ways to partition your tables:

1. By Ingestion Time

A table will be automatically partitioned when new data arrives. If you’re dumping the last day’s worth of data into your database every day, then you’ll likely use this method.

2. By a Date/Time Column

A table will partition based on a specified date/time column. If the ingestion date isn’t as relevant as another date/time column (e.g. Date of first purchase), you can go with this option.

3. By an Integer Column

Probably the least common method, this will allow you to create a partition based on an Integer column (e.g. Customer number).

Note: For partitioning on categorical columns (e.g. Customer Segments), check out clustered tables.

Why Use Them?

There are two main benefits of using partitioned tables:

1. They’ll solve your ‘blocked table’ problem.

One of the first challenges data engineers have to solve is how to update a table without blocking other users from extracting data at the same time. Partitioned tables are one way of doing that. If you insert data to your table every day, that data gets added as its own partition without blocking people from accessing the other partitions. Score.

2. They’re more efficient (translation: cheaper).

BigQuery charges for the data returned from a query. Meaning, the more efficient your queries, the less you’ll pay.

This is reason enough to give partitioned tables a try if you ask me.

To prove just how much more efficient partitioned tables can be, I ran a few tests with Jeff Sackmann’s Tennis data (thanks Jeff 🙏), which has match stats from nearly every ATP and WTA tennis match from the 1940s to the present day.

Data Preview

The data is 92.15 MB, with 373,436 rows and 51 columns.

I ran a few queries on a non-partitioned table (poetically named matches_not_partitioned), and one that was partitioned by the year of tourney_date (matches_partitioned). Let’s see how they did…

You can see the full results here.

This test was done on relatively small data, but you can see how the savings only get more significant with larger datasets. Like I said, no-brainer.

How to Make the Most of Them: Think Beyond the Data

There are plenty of articles available that go into the nitty-gritty of how to partition your tables in BigQuery, so I won’t replicate that information here.

However, there is one point missing from those tutorials: How are you going to make sure you realize all of those savings?

You will get absolutely no benefit from partitioned tables if people don’t write queries that take advantage of them.

For example, if I wanted to just pull all of Serena William’s matches, I get the following:

This means in order to actually realize the savings available with partitioned tables, you’re going to need to (gasp), work with whoever is writing your queries. Practically, this means 2 things:

1. Understanding the queries your business needs

In the case of partitioned tables, it’s important to know how people are using the data. If the data engineering team decides to partition based on customer acquisition date, but no one uses that as a filter, then it will all be for nothing and you’ll be querying your whole table every time.

Take the time to understand what partitions would add the most value for the business. It might not be what you’d expect just by looking at the data.

2. Having some way to ‘encourage’ efficient queries

Once you have partitioned tables that make some sense, you are just praying that people write queries that actually make use of them. It can be tempting to restrict access here, to limit the people allowed to query the data to a select, trusted, few. Resist this urge!

With the right tools in place, you can still give users access to the data while encouraging more efficient queries (without training them in SQL).

For interactive BI tools, like Count, that means making use of user controls and query parameterization.

In this interactive report I created from my partitioned tennis data, I included data controls at the top of the notebook that encourage the end-user to look at smaller chunks of time. These parameters link directly to the SQL queries generated that then run on the database, resulting in more efficient queries than if I’d given them free rein over the data.

Interactive report with built-in filters that link to partitioned data.

This guard-rails approach still allows the end-user enough flexibility to get the answers they need without facing exorbitant querying costs.

You can check out the whole notebook here:

Who is the Tennis GOAT?

And learn more about Count here.

Next Steps

To start partitioning your tables (and saving some money), I suggest starting with your most commonly used tables and queries. From there you can work out common ways users are already partitioning the data during analysis, and adjust your tables partitions accordingly. This might be:

  • partition by ingestion date for social media, or website traffic analytics
  • partition by customer acquisition date for cohort analysis
  • partition by zip code for regional sales analysis

And crucially once those new partitioned tables are in place, resist the urge to restrict querying access! Put in sensible controls to your existing BI infrastructure to encourage people to query using these new partitioned columns, ensuring your work turns into real, tangible savings.

But this is only the beginning. BigQuery has several other features that can help reduce your bill, and the time to insights. These opportunities are best realized when data engineering is working closely with the business to understand how these tables can be best stored and manipulated to get the answers the business needs.

Helpful Links

--

--