
Use the Partitions, Luke! A Simple and Proven Way to Optimise Your SQL Queries

If you've ever written an SQL query that takes ages to run, this is the article for you

Baby Yoda loves partitions. Do you? Image by Victor Serban on Unsplash

Data Scientists love SQL, but boy do we suck at writing performant queries (maybe because we spent too much time debating whether it’s pronounced "S-Q-L" or "sequel"?).

In this article, I’m going to show you how to use SQL partitions to optimise your queries and write code that’s quicker and cheaper to run. If you’ve mastered the basics of SQL and want to start unlocking higher-level Data Science skills, this will be a great addition to your toolkit.

What’s a partitioned table?

A partitioned table is a table that’s divided into segments/partitions (who saw that coming?).

In a partitioned table, each segment is stored in a different location on the server. This is different from a normal (unpartitioned) SQL table, where the entire table sits in a single location.

Here’s a comparison using dummy data about the daily sales of three of my favourite books:

Image by author

Both the unpartitioned and the partitioned table hold the same data; the only difference is that the partitioned table splits the data into different segments. It’s still a single table (i.e. it’s not three separate tables); it’s just storing the data in a different way.
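To make this concrete, here’s roughly how the partitioned version groups the rows (it’s the same dummy data as the INSERT statement in the next section):

Partition 2023-12-01: Shoe Dog (154), Born to Run (90), Running with the Kenyans (8)
Partition 2023-12-02: Shoe Dog (100), Born to Run (61), Running with the Kenyans (12)
Partition 2023-12-03: Shoe Dog (231), Born to Run (63), Running with the Kenyans (5)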

Why should we care? Well, as we’ll see shortly, we can take advantage of this structure to write more efficient SQL queries.

Creating partitioned tables

Creating a partitioned table is easy as pie.

For example, if we use the following code to create a normal (unpartitioned) table…

CREATE TABLE `myproject.mydataset.unpartitioned` (
  date DATE, 
  book_name STRING,
  total_sales INT64
);

INSERT INTO `myproject.mydataset.unpartitioned`
VALUES
  ('2023-12-01', 'Shoe Dog', 154),
  ('2023-12-01', 'Born to Run', 90),
  ('2023-12-02', 'Running with the Kenyans', 12),
  ('2023-12-02', 'Born to Run', 61),
  ('2023-12-01', 'Running with the Kenyans', 8),
  ('2023-12-03', 'Shoe Dog', 231),
  ('2023-12-03', 'Born to Run', 63),
  ('2023-12-02', 'Shoe Dog', 100),
  ('2023-12-03', 'Running with the Kenyans', 5);

… we can create a partitioned version by adding a single line at the end of the CREATE TABLE statement:

CREATE TABLE `myproject.mydataset.partitioned` (
  date DATE, 
  book_name STRING,
  total_sales INT64
) PARTITION BY date; # New line!

INSERT INTO `myproject.mydataset.partitioned`
VALUES ... # Rest of the code is the same as in the previous snippet

The column we’re partitioning on is known as the partitioning key; in this case, we’re partitioning on the date column, but any column will work as long as it’s either (1) a date/timestamp field or (2) an integer field.

Once we’ve created these two tables, we’ll see that they look identical at first glance (e.g., if you run SELECT * against each table, the results will look the same). If, however, we look at the tables’ details/metadata, we’ll see that the partitioned table includes some extra metadata. Here’s what this looks like in BigQuery (where I’m running my SQL):

Image by author
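If you prefer SQL to clicking through the console, you can also list a table’s partitions with BigQuery’s INFORMATION_SCHEMA.PARTITIONS view. A quick sketch, assuming the dataset and table names used above:

SELECT partition_id, total_rows
FROM `myproject.mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'partitioned';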

Why does this matter? Because using partitions reduces the amount of data your query needs to process.

And this is great news, because it means that partitions can help you write more efficient queries!

When you query a normal (unpartitioned) table, the SQL engine typically has to scan the entire table in order to find the rows you want. On large tables, this can be unnecessarily slow and expensive, as your machine ends up processing data that isn’t needed for the final output.

For example, let’s query the unpartitioned table we created earlier:

SELECT * 
FROM `myproject.mydataset.unpartitioned`
WHERE date > '2023-12-01'
Image by author

In the image above, you can see that all 9 rows of the table were read in order to return the 6 rows with dates greater than '2023-12-01'.

Now let’s run the same query against our partitioned table:

SELECT *
FROM `myproject.mydataset.partitioned`
WHERE date > '2023-12-01'
Image by author

This time, we see that only 6 rows of the partitioned table were read in order to generate the same results. Before executing the main part of the query, BigQuery was able to identify the partitions which contained the relevant rows and pull selectively from them. It simply didn’t need to read the 3 rows in the other partition.

This preliminary step of selecting partitions is known as pruning. It’s what makes queries on partitioned tables so much more efficient: the SQL engine doesn’t have to read every single row in the table; it first fetches the partitions required, and only then executes your query. In fancy SQL lingo, the filter we add on the partitioning column is treated as an access predicate by the SQL engine and is applied before the main query runs.


– Quick interlude –

If you’re enjoying this article, you might like my site the-sql-gym.com, which contains over 100 practice SQL questions. If you want to boost your SQL skills, check it out! And let me know if you have any feedback 🙂

Image by author. Source: the-sql-gym.com

Back to partitioning – Let’s look at a bigger example

When we’re working with small tables like the ones above, partitioning might seem a bit overkill. But when we upscale to larger tables, it can lead to significant performance gains.

First, let’s create two large tables, each with 1 million rows. The first will be unpartitioned, and the second will be partitioned on the id column, with each partition holding 10,000 rows.

##############################
# Create unpartitioned table #
##############################

CREATE OR REPLACE TABLE `myproject.mydataset.unpartitioned_large` AS (
  SELECT id, GENERATE_UUID() AS uuid 
  FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS id
);

############################
# Create partitioned table #
############################

CREATE TABLE `myproject.mydataset.partitioned_large` (id INT64, uuid STRING)

# Partition into 100 segments of 10,000 each
PARTITION BY RANGE_BUCKET(id, GENERATE_ARRAY(1, 1000000, 10000))

AS (
  SELECT *
  FROM `myproject.mydataset.unpartitioned_large`
);

When we run the following query against the unpartitioned table:

SELECT *
FROM `myproject.mydataset.unpartitioned_large` # Unpartitioned
WHERE id BETWEEN 1000 AND 2000;
Image by author

… we see that all 1,000,000 rows of the table are read before the 1,001 matching records are returned. The whole operation took 650ms (Elapsed time) / 503ms (Slot time consumed).

However, when we run the same query against the partitioned table, only 10,000 rows are read (i.e., one single partition).

SELECT *
FROM `myproject.mydataset.partitioned_large` # Partitioned
WHERE id BETWEEN 1000 AND 2000;
Image by author

When querying the partitioned table, the operation’s elapsed time was less than half that of the query on the unpartitioned large table, and the slot time consumed was nearly 95% lower.

Pretty cool, right?

Common mistakes when using partitions

If you’d like to start using partitions to improve your tables/queries, there are a couple of pitfalls to watch out for:

  1. Don’t filter a large partitioned table on a non-partitioning column – If you filter on a column which is NOT the partitioning key, you won’t be able to take advantage of the partitioned structure of the table. If you DO need to filter on a non-partitioning-key column, I’d recommend that (if possible) you also add a filter on the partitioning key, so that the engine can prune the unneeded partitions before applying your second filter.
  2. Don’t apply a function on top of the partitioning key – For example, if your partitioning key is a date column called date, don’t add a filter like WHERE CAST(date AS STRING) = '2023-12-02'. Instead, if you need to include a function in your WHERE clause, apply the function to the constant. In that example, you could rewrite the filter as WHERE date = CAST('2023-12-02' AS DATE) to ensure that you still take advantage of pruning. Both fixes are sketched in the snippet below.
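Here’s a minimal sketch of both fixes, reusing the partitioned book-sales table from earlier:

# Pitfall 1: add a partition-key filter so pruning happens
# before the non-partition-key filter is applied
SELECT *
FROM `myproject.mydataset.partitioned`
WHERE date > '2023-12-01'         # partition-key filter: enables pruning
  AND book_name = 'Born to Run';  # non-partition-key filter

# Pitfall 2: keep the partitioning key bare; apply functions to the constant
# (WHERE CAST(date AS STRING) = '2023-12-02' would disable pruning)
SELECT *
FROM `myproject.mydataset.partitioned`
WHERE date = CAST('2023-12-02' AS DATE);  # pruning still works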

2 quick notes about partitioning in BigQuery specifically

BigQuery has a limit of 4,000 partitions per table. If you’re trying to use a partitioning key that will create more than 4,000 partitions, try using a different resolution. For example, instead of partitioning by date/day:

CREATE TABLE `myproject.mydataset.partitioned` (
  ...
) PARTITION BY date;

… you could partition by week:

CREATE TABLE `myproject.mydataset.partitioned` (
  ...
) PARTITION BY DATE_TRUNC(date, WEEK);

It’s also worth knowing that BigQuery lets you partition a table based on its ingestion time, via a pseudocolumn named _PARTITIONTIME. It’s too niche for this article, but you can read more about it in the BigQuery documentation.
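If you’re curious, a minimal sketch looks something like this (the table name is my own invention; rows are assigned to daily partitions based on when they’re loaded):

CREATE TABLE `myproject.mydataset.ingestion_partitioned` (
  book_name STRING,
  total_sales INT64
) PARTITION BY _PARTITIONDATE; # partition by ingestion date

# Filter on the pseudocolumn to get pruning
SELECT *
FROM `myproject.mydataset.ingestion_partitioned`
WHERE _PARTITIONTIME >= TIMESTAMP('2023-12-01');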

When NOT to use partitions

On smaller tables, partitioning might not lead to a performance boost. Why? Because the process of pruning partitions might take longer than simply looking through all the rows in the table (source).

In these cases, clustering can be a more performant alternative. Stay tuned for more on that in my next article!
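As a teaser, creating a clustered table is another one-line change; here’s a minimal sketch (more on how and when to use it in that article):

CREATE TABLE `myproject.mydataset.clustered` (
  date DATE,
  book_name STRING,
  total_sales INT64
) CLUSTER BY book_name; # rows with the same book_name are stored together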

One more thing –

I’ve started a free newsletter called AI in Five where I share 5 bullet points each week on the latest AI news, coding tips and career stories for Data Scientists/Analysts. There’s no hype, no "data is the new oil" rubbish and no tweets (or should I say ‘x-es’ now?) from Elon – just practical tips and insights to help you develop in your career.

Subscribe here if that sounds up your street! Thanks for reading.

AI in Five | Matt Chapman | Substack

