Make your query run fast like a Ferrari, but cheap like a Honda.

Poorly optimized SQL queries are like cracks in the lining of a pipe – barely keeping the water in. When the water pressure is low, there are minor leaks, but everything still works. The nightmare begins when we amp up the load: the cracks, once negligible, burst wide open, and we bleed resources until the integrity of the whole infrastructure crumbles.
With the proliferation of big data, data pipelines have to deal with massive loads, and it's increasingly easy for costs to spiral out of control. Querying is no longer just about writing syntax that runs. It needs to be cost-effective and fast, too.
After crashing the server one too many times, I thought it was about time I started asking… how?
- #1: LIMIT is a booby trap.
- #2: SELECT as few columns as possible.
- #3: Use EXISTS() instead of COUNT().
- #4: Use Approximate Aggregate Function.
- #5: Replace Self-Join with Window Functions.
- #6: ORDER BY or JOIN on INT64 columns.
- #7: Optimize your anti-joins.
- #8: Trim your data early and often.
- #9: WHERE sequence matters (?)
- #10: Utilize PARTITIONS and/or CLUSTERS.
- #11: Push ORDER BY to the end of query (?)
- #12: Delay resource-intensive operations.
- #13: Use SEARCH().
- #14: Take advantage of Caching.
Note 1: All queries here are written against BigQuery public data, which is accessible to everyone. In other words, you can copy and paste the queries into BigQuery to try them out yourself.
Note 2: Although we are using BigQuery in this article, most of the optimization techniques described here are general SQL best practices that can be applied to other platforms such as Amazon Redshift, MySQL, Snowflake, etc.
1: LIMIT is a booby trap.
Best practice: LIMIT speeds up performance, but doesn't reduce costs. For data exploration, consider using BigQuery's (free) table preview option instead.
It has to be said – most SQL practitioners were once prey to the illusion of safety misrepresented by LIMIT 1000. It's perfectly reasonable to assume that if we only show 1,000 rows of output, there would be less load on the database and hence lower cost.
Unfortunately, it isn't true.
The row restriction of the LIMIT clause is applied after SQL databases scan the full range of data. Here's the kicker – most distributed databases (including BigQuery) charge based on the data scanned, not the output, which is why LIMIT doesn't help save a dime.

However, it's not all doom and gloom. Since LIMIT puts a cap on the output rows, we need to move less data around BigQuery's network. This reduction in bytes shuffled significantly improves query performance.
To demonstrate, I'm using the crypto_ethereum.blocks table from BigQuery's public data repository, which has 15 million rows of data.
# Not Optimized
SELECT
miner
FROM
`bigquery-public-data.crypto_ethereum.blocks`
-----------------------
Elapsed Time : 11s
Slot Time : 162s
Bytes Processed: 617 MB
Bytes Shuffled : 1.7 GB
Bytes Spilled : 0 B
-----------------------
Let's try the query again with LIMIT.
# Optimized (for speed only)
SELECT
miner
FROM
`bigquery-public-data.crypto_ethereum.blocks`
LIMIT
1000
-----------------------
Elapsed Time : 2s
Slot Time : 0.01s
Bytes Processed: 617 MB
Bytes Shuffled : 92 KB
Bytes Spilled : 0 B
-----------------------
Using LIMIT improved speed, but not cost.
- Cost: Bytes processed remain the same at 617 MB.
- Speed: Bytes shuffled dropped from 1.7 GB to merely 92 KB, which explains the huge improvement in slot time (from 162s to 0.01s).
While using LIMIT is better than nothing, there are better alternatives if the intention is purely to explore the table. I strongly recommend using BigQuery's table preview option. This feature allows us to navigate the table page by page, up to 200 rows at a time, and it's completely free.

For cost optimization, limit your use of LIMIT.
2: SELECT as few columns as possible.
Best practice: Avoid using SELECT *. Choose only the relevant columns that you need to avoid unnecessary, costly full table scans. Source.
BigQuery is not a traditional row-based database, but a columnar database. This distinction is meaningful because it reads data differently.
If a table has 100 columns but our query only needs data from 2 specific columns, a row-based database will go through each row – all 100 columns of it – only to extract the 2 columns of interest. In contrast, a columnar database will process only the 2 relevant columns, which makes for a faster read operation and more efficient use of resources.

Here is a typical query that is fast to write, but runs slow.
# Not Optimized
SELECT
*
FROM
`bigquery-public-data.crypto_ethereum.blocks`
-----------------------
Elapsed Time : 23s
Slot Time : 31 min
Bytes Processed: 15 GB
Bytes Shuffled : 42 GB
Bytes Spilled : 0 B
-----------------------
Since a columnar database can skip through columns, we can take advantage of this by querying only the columns we need.
# Optimized
SELECT
timestamp,
number,
transactions_root,
state_root,
receipts_root,
miner,
difficulty,
total_difficulty,
size,
extra_data,
gas_limit,
gas_used,
transaction_count,
base_fee_per_gas
FROM
`bigquery-public-data.crypto_ethereum.blocks`
-----------------------
Elapsed Time : 35s
Slot Time : 12 min
Bytes Processed: 5 GB
Bytes Shuffled : 11 GB
Bytes Spilled : 0 B
-----------------------
In this example, query cost is reduced by 3x because the bytes we needed to process went down from 15 GB to 5 GB. On top of that, we also observe a performance gain with slot time decreasing from 31 minutes to 12 minutes.
The only downside of this approach is that we need to type out the column names, which can be a hassle, especially when our task requires most of the columns except a few. In cases like these, not all is lost – we can use the EXCEPT modifier to exclude the unnecessary columns.
# Optimized
SELECT
*
EXCEPT (
`hash`,
parent_hash,
nonce,
sha3_uncles,
logs_bloom)
FROM
`bigquery-public-data.crypto_ethereum.blocks`
-----------------------
Elapsed Time : 35s
Slot Time : 12 min
Bytes Processed: 5 GB
Bytes Shuffled : 11 GB
Bytes Spilled : 0 B
-----------------------
Avoid SELECT * unless absolutely necessary.
3: Use EXISTS() instead of COUNT().
Best practice: If we don't need the exact count, use EXISTS() because it exits the processing cycle as soon as the first matching row is found. Source.
When exploring a brand-new dataset, sometimes we find ourselves needing to check for the existence of a specific value. We have two choices: either compute the frequency of the value with COUNT(), or check whether the value EXISTS(). If we don't need to know how frequently the value occurs, always use EXISTS() instead.
This is because EXISTS() exits its processing cycle as soon as it locates the first matching row, returning True if the target value is found, or False if the target value doesn't exist in the table.
On the contrary, COUNT() continues to search through the entire table in order to return the exact number of occurrences of the target value, wasting unnecessary computing resources.

Suppose that we want to know whether the value 6857606 exists in the number column, and we use the COUNT() function…
# Not Optimized
SELECT
COUNT(number) AS count
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
timestamp BETWEEN '2018-12-01' AND '2019-12-31'
AND number = 6857606
-----------------------
Elapsed Time : 6s
Slot Time : 16s
Bytes Processed: 37 MB
Bytes Shuffled : 297 B
Bytes Spilled : 0 B
-----------------------
The COUNT() returned 1 because only one row matches the value. Now, let's try with EXISTS() instead.
# Optimized
SELECT EXISTS (
SELECT
number
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
timestamp BETWEEN "2018-12-01" AND "2019-12-31"
AND number = 6857606
)
-----------------------
Elapsed Time : 0.7s
Slot Time : 0.07s
Bytes Processed: 37 MB
Bytes Shuffled : 11 B
Bytes Spilled : 0 B
-----------------------
The query returns True because the value exists in the table. With the EXISTS() function, we don't get information on its frequency, but in return, the query performance improved massively – slot time dropped from 16 seconds to just 0.07 seconds.
Aren't you glad that the EXISTS() function exists?
4: Use Approximate Aggregate Function.
Best practice: When you have a big dataset and you don’t need the exact count, use approximate aggregate functions instead. Source.
A COUNT() scans the entire table to determine the number of occurrences. Since this is done row by row, the operation runs at a time complexity of O(n). Performing it on big data with hundreds of millions of rows quickly becomes unfeasible because it requires massive amounts of computing resources.
To exacerbate the performance issue, COUNT(DISTINCT) needs an ungodly amount of memory to keep track of every unique value it has seen. When that list exceeds memory capacity, the surplus spills to disk, causing performance to take a nosedive.
In cases when data volumes are significant, it may be in our best interest to trade accuracy for performance by using approximate aggregation functions. For example:-
- [APPROX_COUNT_DISTINCT()](https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_count_distinct)
- [APPROX_QUANTILES()](https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_quantiles)
- [APPROX_TOP_COUNT()](https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_top_count)
- [APPROX_TOP_SUM()](https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions#approx_top_sum)
- [HYPERLOGLOG++](https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions#hyperloglog_functions)
Unlike the usual brute-force approach, approximate aggregate functions use statistics to produce an approximate result instead of an exact one. Expect an error rate of roughly 1–2%. Because they don't need to hold every distinct value in memory, approximate aggregate functions are highly scalable in terms of memory usage and time.
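As a quick, hand-rolled illustration of one of these (a sketch of my own, not from the benchmarks below): APPROX_QUANTILES() can estimate a distribution – here, the quartiles of Ethereum block sizes – without an exact sort of every row.
# Sketch: approximate quartiles of block sizes
SELECT
APPROX_QUANTILES(size, 4) AS size_quartiles
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
timestamp BETWEEN '2019-01-01' AND '2020-01-01'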

Suppose that we are interested in the number of unique Ethereum miners across the 2.2 million blocks mined in 2019. We can run the following query…
# Not Optimized
SELECT
COUNT(DISTINCT miner)
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
timestamp BETWEEN '2019-01-01' AND '2020-01-01'
-----------------------
Elapsed Time : 3s
Slot Time : 14s
Bytes Processed: 110 MB
Bytes Shuffled : 939 KB
Bytes Spilled : 0 B
-----------------------
The COUNT(DISTINCT) function returned 573 miners but took 14s of slot time to do it. We can compare that to APPROX_COUNT_DISTINCT().
# Optimized
SELECT
APPROX_COUNT_DISTINCT(miner)
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
timestamp BETWEEN '2019-01-01' AND '2020-01-01'
-----------------------
Elapsed Time : 2s
Slot Time : 7s
Bytes Processed: 110 MB
Bytes Shuffled : 58 KB
Bytes Spilled : 0 B
-----------------------
Much to my delight, APPROX_COUNT_DISTINCT() returned the correct count of 573 miners (luck?) in half the slot time. The difference in performance is clear even with just 2.2 million rows of data, but I'd imagine that difference will widen in our favor as the table gets bigger.
Whenever super-precise calculations are not needed, do consider utilizing approximate aggregate functions for a much higher level of responsiveness.
5: Replace Self-Join with Window Functions.
Best practice: Self-joins are inefficient and should only be used when absolutely necessary. In most cases, we can replace them with a window function. Source.
A self-join is when a table is joined with itself. This is a common join operation when we need a table to reference its own data, usually in a parent-child relationship.

A common use case: an Employee table with a manager_id column holds records for all employees, including managers (who are themselves employees and may have managers of their own). To get a list of all employees and their direct supervisors, we can self-join the table on manager_id = employee_id.
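A minimal sketch of that self-join, assuming a hypothetical database.employee table with employee_id, name, and manager_id columns:
# Hypothetical example: every employee alongside their direct supervisor
SELECT
e.name AS employee,
m.name AS manager
FROM
database.employee e
LEFT JOIN
database.employee m ON e.manager_id = m.employee_id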
This is typically a SQL anti-pattern because it can potentially square the number of output rows, or force a lot of unnecessary reads, which slows our query down dramatically as the table gets bigger.
For example, if we want to know the difference between the number of Ethereum blocks mined today and yesterday by each miner, we could write a self-join, although it is inefficient:-
# Not Optimized
WITH
cte_table AS (
SELECT
DATE(timestamp) AS date,
miner,
COUNT(DISTINCT number) AS block_count
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
DATE(timestamp) BETWEEN "2022-03-01"
AND "2022-03-31"
GROUP BY
1,2
)
SELECT
a.miner,
a.date AS today,
a.block_count AS today_count,
b.date AS tmr,
b.block_count AS tmr_count,
b.block_count - a.block_count AS diff
FROM
cte_table a
LEFT JOIN
cte_table b
ON
DATE_ADD(a.date, INTERVAL 1 DAY) = b.date
AND a.miner = b.miner
ORDER BY
a.miner,
a.date
-----------------------
Elapsed Time : 12s
Slot Time : 36s
Bytes Processed: 12 MB
Bytes Shuffled : 24 MB
Bytes Spilled : 0 B
-----------------------
Rather than performing a self-join, a window function in combination with a navigation function, LEAD(), would be a much better approach.
# Optimized
WITH
cte_table AS (
SELECT
DATE(timestamp) AS date,
miner,
COUNT(DISTINCT number) AS block_count
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
DATE(timestamp) BETWEEN "2022-03-01" AND "2022-03-31"
GROUP BY
1,2
)
SELECT
miner,
date AS today,
block_count AS today_count,
LEAD(date, 1) OVER (PARTITION BY miner ORDER BY date) AS tmr,
LEAD(block_count, 1) OVER (PARTITION BY miner ORDER BY date) AS tmr_count,
LEAD(block_count, 1) OVER (PARTITION BY miner ORDER BY date) - block_count AS diff
FROM
cte_table a
-----------------------
Elapsed Time : 3s
Slot Time : 14s
Bytes Processed: 12 MB
Bytes Shuffled : 12 MB
Bytes Spilled : 0 B
-----------------------
Both the queries gave us the same result, but there is a significant improvement in query speed (from 36 seconds slot time to 14 seconds slot time) with the latter approach.
Other than the LEAD() function, there are plenty of other navigation, numbering, and aggregate analytic functions that can be used in place of self-join operations. Personally, these are the functions that I use frequently in my day-to-day tasks (see the sketch after the list):-
- Navigation functions: LEAD(), LAG()
- Numbering functions: RANK(), ROW_NUMBER()
- Aggregate analytic functions: SUM(), AVG(), MAX(), MIN(), COUNT()
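For instance, here is a quick sketch of a numbering function, reusing the same CTE as above to rank each day's miners by the number of blocks they mined:
# Sketch: rank miners by daily block count with a window function
WITH
cte_table AS (
SELECT
DATE(timestamp) AS date,
miner,
COUNT(DISTINCT number) AS block_count
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
DATE(timestamp) BETWEEN "2022-03-01" AND "2022-03-31"
GROUP BY
1,2
)
SELECT
date,
miner,
block_count,
RANK() OVER (PARTITION BY date ORDER BY block_count DESC) AS daily_rank
FROM
cte_table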
The next time you see a self-join, remind yourself they are just windows of opportunity to flex your mastery of window functions.
6: ORDER BY or JOIN on INT64 columns.
Best practice: When your use case supports it, always prioritize comparing INT64 because it's cheaper to evaluate INT64 data types than strings. Source.
Join operations map one table to another by comparing their join keys. If the join keys belong to certain data types that are difficult to compare, then the query becomes slow and expensive.

The question is – which data types are hard to compare, and why?
One reason is the difference in storage size. Each data type is allocated specific chunks of storage space in our database. The exact storage space is listed on BigQuery’s pricing page.
The documentation tells us that, in BigQuery, an INT64 always takes up 8 bytes of space regardless of its length, but a STRING takes up a varying amount of space depending on its length. We are oversimplifying a bit here, but let's just say that a STRING generally takes up 2 + 4 * number_of_characters bytes of storage – in other words, a lot more than an INT64.

It doesn’t take a data scientist to figure out that the fewer the bytes to scan, the faster our query runs. Therefore, in the storage size department, integers are faster than strings.
Other than that, INT64 has another huge advantage over STRING because of collation. Collation tells our database how to sort and compare strings. For example, when we run an ORDER BY clause, collation dictates whether uppercase and lowercase should be treated the same. It also looks at things like accents, Japanese kana character types, width sensitivity, and variation selector sensitivity.
All of these sorting rules add to the complexity of STRING comparison, which, for all intents and purposes, slows the pineapple out of our query. INT64, on the other hand, doesn't care about any of this because the only comparison we can make is whether one number is smaller or bigger than another.
So, how bad is it to join on STRING?
I came across a brilliant article by Borna Almasi where he compared the impact of column types on join speeds. His experiments found that integers are 1.2x faster than bytes and about 1.4x faster than strings.
However, some of the comparisons had shorter/longer character lengths, which may have contributed to fewer/more bytes to read, hence the difference in speed. Out of curiosity, I decided to use a similar approach, but to compare strings and integers of exactly 10 characters each.
WITH
keys AS (
SELECT
*
FROM
UNNEST(GENERATE_ARRAY(1000000000,1001000000)) AS key
),
keys_hashed AS (
SELECT
key AS key_int,
CAST(key AS STRING) AS key_str,
CAST(CAST(key AS STRING) AS BYTES) AS key_byte
FROM
keys
)
SELECT
*
FROM
keys_hashed a
LEFT JOIN
keys_hashed b
ON a.key_int = b.key_int
-- Change to key_str/key_byte for other experiments
For the experiment, I ran the queries 10 times for each data type and computed the margin of error based on t-statistics. Here are the findings.

While we had a very limited number of samples, it does appear that applying a join operation on INT64 yields better performance than BYTES, followed by STRING, by upwards of 39%.
The next time you are creating a table with DDL, I'd recommend prioritizing INT64. It's a simple maneuver but can pay huge dividends if we think about the performance gain that we get from all future queries against this table.
7: Optimize your anti-joins.
Best practice: Instead of NOT IN, use the NOT EXISTS operator to write anti-joins because it triggers a more resource-friendly query execution plan. Source.
Most SQL practitioners are familiar with the JOIN operator, but few know about anti-joins. It's not that they are complicated or advanced – we just rarely concern ourselves with the naming conventions. In fact, you may have unknowingly written a few anti-joins yourself.
An "anti-join" is, quite literally, a JOIN operator with an exclusion clause (WHERE NOT IN, WHERE NOT EXISTS, etc.) that removes rows if they have a match in the second table.
For example, if we want to know which cars from the "Car" table are accident-free, we can query the list of cars from the "Car" table and then filter out those that appear in the "Accident" table.
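In SQL, that could look something like the sketch below, assuming hypothetical Car and Accident tables that share a car_id column:
# Hypothetical example: cars that never appear in the Accident table
SELECT
c.car_id
FROM
database.car c
WHERE NOT EXISTS
(SELECT 1 FROM database.accident a WHERE a.car_id = c.car_id)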

To really drive the point home, here’s another example that you can try on BigQuery. Suppose that we have been tracking the names of all Ethereum miners for over 2 years and we store the data in two separate tables (year 2019 and year 2020). Our goal here is to figure out which miners from 2019 stopped mining in 2020.
WITH
miner2019 AS (
SELECT DISTINCT
miner
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
DATE(timestamp) BETWEEN '2019-01-01' AND '2019-12-31'
),
miner2020 AS (
SELECT DISTINCT
miner
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
DATE(timestamp) BETWEEN '2020-01-01' AND '2020-12-31'
)
If we take the list of 2019 miners, and then remove their names if they appeared in the 2020 list, then we should get a list of miners who stopped mining. This is one of the many scenarios where anti-join can be applied. For better or worse, there are many ways we can write an anti-join clause.
- LEFT JOIN method
- NOT EXISTS method
- NOT IN method
- EXCEPT DISTINCT method
The syntax is as follows:-
# LEFT JOIN METHOD
SELECT
a.miner
FROM
miner2019 a
LEFT JOIN
miner2020 b ON a.miner = b.miner
WHERE
b.miner IS NULL
# NOT EXISTS METHOD
SELECT
a.miner
FROM
miner2019 a
WHERE NOT EXISTS
(SELECT b.miner FROM miner2020 b WHERE a.miner = b.miner)
# NOT IN METHOD
SELECT
a.miner
FROM
miner2019 a
WHERE
a.miner NOT IN
(SELECT miner FROM miner2020)
# EXCEPT DISTINCT METHOD
SELECT
a.miner
FROM
miner2019 a
EXCEPT DISTINCT
SELECT
b.miner
FROM
miner2020 b
All of these methods will return the same result (~491 miners) because the underlying logic is the same. The only difference between each of these methods is that they trigger a different query plan – some more efficient than others. In the name of science, I ran each method 5 times with cache disabled and recorded the query performance. Here’s what I found:-

Most methods have similar performance with the exception of the NOT IN method, which has almost twice the slot time and bytes shuffled.
Well, what a bummer. In my experience, NOT IN is the most commonly used syntax for an anti-join due to its readability, but unfortunately, it also happens to have the worst performance.
For the curious minds, SQLShack went to great lengths here to discuss the root cause of its poor performance. The TL;DR version is that the NOT IN method triggers a few heavy operators that run nested looping and counting operations, which evidently are very costly.
Moral of the story? Steer clear of NOT IN when writing an anti-join. Personally, I recommend using the NOT EXISTS method by default because it's performant and equally intuitive to read.
8: Trim your data early and often.
Best practice: Apply filtering functions early and often in your query to reduce data shuffling and wasting compute resources on irrelevant data that doesn’t contribute to the final query result.
I'll sound like a broken record, but great advice is worth repeating – trim your data with SELECT DISTINCT, INNER JOIN, WHERE, GROUP BY, or any other filtering function whenever you get the chance. The earlier we do it, the lower the load on every subsequent stage of our query, compounding the performance gain every step of the way.

For instance, if we want to know the popularity of each GitHub repository, we can look at (i) the number of views and (ii) the number of commits. To extract the data, we can JOIN the repos and commits tables, then aggregate the counts with GROUP BY.
# Not Optimized
WITH
cte_repo AS (
SELECT
repo_name,
watch_count
FROM
`bigquery-public-data.github_repos.sample_repos`
),
cte_commit AS (
SELECT
repo_name,
`commit`
FROM
`bigquery-public-data.github_repos.sample_commits`
)
SELECT
r.repo_name,
r.watch_count,
COUNT(c.commit) AS commit_count
FROM
cte_repo r
LEFT JOIN
cte_commit c ON r.repo_name = c.repo_name
GROUP BY
1,2
-----------------------
Elapsed Time : 3s
Slot Time : 8s
Bytes Processed: 50 MB
Bytes Shuffled : 91 MB
Bytes Spilled : 0 B
-----------------------
In this scenario, the GROUP BY clause is performed in the outermost query, so every row of commits is JOINed to its repository first. Since multiple commits can belong to the same repository, this results in a much larger intermediate table that we then need to GROUP BY.
For comparison, we can apply the GROUP BY earlier, inside the commits CTE.
# Optimized
WITH
cte_repo AS (
SELECT
repo_name,
watch_count
FROM
`bigquery-public-data.github_repos.sample_repos`
),
cte_commit AS (
SELECT
repo_name,
COUNT(`commit`) AS commit_count
FROM
`bigquery-public-data.github_repos.sample_commits`
GROUP BY
1
)
SELECT
r.repo_name,
r.watch_count,
c.commit_count
FROM
cte_repo r
LEFT JOIN
cte_commit c ON r.repo_name = c.repo_name
-----------------------
Elapsed Time : 2s
Slot Time : 5s
Bytes Processed: 50 MB
Bytes Shuffled : 26 MB
Bytes Spilled : 0 B
-----------------------
We see a huge improvement in slot time and bytes shuffled when we GROUP BY early. This is because all the commits are condensed from 672,000 records down to 6, so there is less data to move around.
Here's the query plan for comparison. For context, there are 400,000 and 672,000 records in the repos and commits tables respectively.

Always trim your data wherever and whenever possible.
9: WHERE sequence matters (?)
Speculated best practice: BigQuery assumes that the user has provided the best order of expressions in the WHERE clause, and does not attempt to reorder expressions. Expressions in your WHERE clauses should be ordered with the most selective expression first. Source.
This recommendation piqued my interest because, if it were true, it would be the easiest implementation with huge potential for optimization improvement. Google claims not only that using WHERE early in our query (on different tables) matters, but that the sequence of WHERE expressions within the same table also matters.

I’ve decided to test it out myself.
# "Supposedly" Not Optimized
SELECT
miner
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
miner LIKE '%a%'
AND miner LIKE '%b%'
AND miner = '0xc3348b43d3881151224b490e4aa39e03d2b1cdea'
-----------------------
Elapsed Time : 7s
Slot Time : 85s
Bytes Processed: 615 MB
Bytes Shuffled : 986 KB
Bytes Spilled : 0 B
-----------------------
Amongst the three WHERE clauses that we use, the LIKE operators are string-comparison operations that are expensive to run, while the = operator selects a very specific miner, which drastically reduces the number of relevant rows.
In an ideal state, the = operator will be executed before the other two so that the expensive LIKE operations will only be performed on the subset of remaining rows.
If the sequence of WHERE clauses does matter, then the query above should be indisputably worse than a similar query with the = operator first.
# "Supposedly" Optimized
SELECT
miner
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
miner = '0xc3348b43d3881151224b490e4aa39e03d2b1cdea'
AND miner LIKE '%a%'
AND miner LIKE '%b%'
-----------------------
Elapsed Time : 8s
Slot Time : 92s
Bytes Processed: 615 MB
Bytes Shuffled : 986 KB
Bytes Spilled : 0 B
-----------------------
But as it turns out, the slot time and bytes shuffled are comparable for both queries, indicating that BigQuery's SQL optimizer is smart enough to run the most selective WHERE clause first regardless of how we wrote the query. This is also supported by most StackOverflow answers, such as [here](https://stackoverflow.com/questions/11436469/does-the-order-of-where-clauses-matter-in-sql) and [here](https://stackoverflow.com/questions/642784/does-the-order-of-columns-in-a-where-clause-matter).
From what I can gather, the sequence of our WHERE clauses does not matter most of the time, except in extreme edge cases, as pointed out by "Registered User" (yes, that's his username) from StackOverflow:
- If you have a large number of tables in your query (10 or more).
- If you have several EXISTS, IN, NOT EXISTS, or NOT IN statements in your WHERE clause.
- If you are using nested CTEs (common table expressions) or a large number of CTEs.
- If you have a large number of sub-queries in your FROM clause.
While it probably wouldn't impact our query performance, I suppose it wouldn't hurt to sequence our WHERE clauses from most to least selective, just in case.
10: Utilize PARTITIONS and/or CLUSTERS.
Best practice: Use partitions and clusters on tables that are bigger than 1 GB to segment and order the data. Applying filters on partition key or cluster key can significantly reduce data scans. Source.
Querying a huge dataset is a pain because it hogs resources and can be extremely slow. It isn't uncommon to find databases that divide large datasets into multiple smaller ones (e.g., sales_jan2022, sales_feb2022, sales_mar2022, …) for better usability. While this approach circumvents the drawbacks, it comes at the expense of a logistical nightmare: managing all the split tables.
This brings us to BigQuery’s partitioned tables. Functionally, partitions allow us to query a subset of the larger table without having to separate it into separate, smaller tables. We get the performance, but none of the drawbacks.
# Illustrative only: in BigQuery, the partition column must be a DATE,
# TIMESTAMP, or DATETIME column, an integer-range column, or ingestion time
CREATE TABLE database.zoo_partitioned
PARTITION BY zoo_name AS
(SELECT *
FROM database.zoo)

When we run a query against a partitioned table, BigQuery will filter out the irrelevant partitions at storage so that instead of a full table scan, we will only scan the specified partitions.
In the same vein as partitions, we can also use clustering to refine the data into even smaller blocks.
CREATE TABLE database.zoo_clustered
CLUSTER BY animal_name AS
(SELECT *
FROM database.zoo)

A Clustered Table sorts the data into blocks based on the column (or columns) that we choose and then keeps track of the data through a clustered index. During a query, the clustered index points to the blocks that contain the data, therefore allowing BigQuery to skip through irrelevant ones. The process of skipping irrelevant blocks on scanning is known as block pruning.
The concept is akin to a book library – we can easily navigate to the books we want when the shelves are organized based on genres.
An important distinction is that the index doesn't point to the exact row, only to the block. The kicker is that BigQuery doesn't necessarily create one block for every distinct value in the clustered column. In other words, BigQuery won't create 1,000 blocks for 1,000 unique values and save 99% of the byte scan when we search for a specific value. As a rule of thumb, Google recommends clustering tables that are at least 1 GB because the algorithm can group high-cardinality data into better blocks, which ultimately is what makes clustered tables more effective.
At the end of the day, both partitions and clusters help to cut down the number of bytes that BigQuery needs to scan. With fewer bytes to scan, the query becomes cheaper and faster to run.
CREATE TABLE database.zoo_partitioned_and_clustered
PARTITION BY zoo_name
CLUSTER BY animal_name AS
(SELECT *
FROM database.zoo)

Note that partitioning and clustering don't have to be mutually exclusive. For large tables, it makes perfect sense to use both in conjunction because their effects can compound. Consider bigquery-public-data.wikipedia.pageviews_2022, a partitioned and clustered table.

By referring to the Details page of BigQuery's UI, we can see that the table is partitioned by the datehour column and clustered by the wiki and title columns. It looks and feels just like a regular table, but the real magic happens when we filter it.
# Optimized
SELECT
title
FROM
`bigquery-public-data.wikipedia.pageviews_2022`
WHERE
DATE(datehour) = '2022-01-01'
AND title = 'Kinzie_Street_railroad_bridge'
-----------------------
Elapsed Time : 1s
Slot Time : 27s
Bytes Processed: 1.3 GB
Bytes Shuffled : 408 B
Bytes Spilled : 0 B
-----------------------
When I apply a WHERE clause to filter on the datehour partition, the bytes processed drop from 483 GB to merely 4 GB. If I add another filter on the title cluster, they drop further to 1.3 GB. Instead of being billed $2.40, we only have to pay $0.0065. If that's not a bargain, I don't know what is.
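For reference, this is roughly what the intermediate step looks like – filtering only on the datehour partition, which (per the numbers above) already prunes the scan down to about 4 GB before the cluster filter comes into play:
# Partition filter only, no cluster filter on title yet
SELECT
title
FROM
`bigquery-public-data.wikipedia.pageviews_2022`
WHERE
DATE(datehour) = '2022-01-01'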
11: Push ORDER BY to the end of query (?)
Speculated best practice: Use ORDER BY only in the outermost query or within window clauses (analytic functions). Source.
ORDER BY has always been a resource-intensive operation because it needs to compare all rows and organize them in sequential order.
The reason it's recommended to delay ORDER BY until the outermost query is that tables tend to be bigger at the start of the query, given that they have not yet been pruned by any WHERE or GROUP BY clauses. The bigger the table, the more comparisons it needs to make, and therefore the slower the performance.
Besides, if we are using ORDER BY purely to improve the readability of our data, there is no point sorting early since the ordering may get distorted downstream anyway.

For example, the following query has ORDER BY clauses for cte_blocks and cte_contracts, but they serve no real purpose because we are not computing any sequential relationships (previous row vs. next row) here. Not only that, the ORDER BY in the outermost query will overwrite the previous sorting anyway.
# "Supposedly" Not Optimized
WITH
cte_blocks AS (
SELECT
*
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
DATE(timestamp) BETWEEN '2021-03-01' AND '2021-03-31'
ORDER BY
1,2,3,4,5,6
),
cte_contracts AS (
SELECT
*
FROM
`bigquery-public-data.crypto_ethereum.contracts`
WHERE
DATE(block_timestamp) BETWEEN '2021-03-01' AND '2021-03-31'
ORDER BY
1,2,4,5,6,7
)
SELECT
*
FROM
cte_blocks b
LEFT JOIN
cte_contracts c ON c.block_number = b.number
ORDER BY
size,
block_hash
-----------------------
Elapsed Time : 14s
Slot Time : 140s
Bytes Processed: 865 MB
Bytes Shuffled : 5.8 GB
Bytes Spilled : 0 B
-----------------------
For comparison, we removed the pointless ORDER BY clauses from both CTEs and ran the query again.
# "Supposedly" Optimized
WITH
cte_blocks AS (
SELECT
*
FROM
`bigquery-public-data.crypto_ethereum.blocks`
WHERE
DATE(timestamp) BETWEEN '2021-03-01'
AND '2021-03-31'
),
cte_contracts AS (
SELECT
*
FROM
`bigquery-public-data.crypto_ethereum.contracts`
WHERE
DATE(block_timestamp) BETWEEN '2021-03-01' AND '2021-03-31'
)
SELECT
*
FROM
cte_blocks b
LEFT JOIN
cte_contracts c ON c.block_number = b.number
ORDER BY
size,
block_hash
-----------------------
Elapsed Time : 14s
Slot Time : 145s
Bytes Processed: 865 MB
Bytes Shuffled : 5.8 GB
Bytes Spilled : 0 B
-----------------------
Based on the logic we have established so far, the former query should run much slower because it needs to perform additional ORDER BY clauses on multiple columns. But to my surprise, both queries show a negligible difference in performance – 140 seconds vs. 145 seconds of slot time.
This is rather counterintuitive, but further digging into the execution details reveals that both queries only ran the ORDER BY in the outermost query, regardless of how we wrote them.

As it turns out, the almighty BigQuery’s SQL Optimizer is yet again smart enough to figure out the redundant clauses and automatically exclude them from the computation.
Although it was harmless to include the redundant ORDER BY clauses here, we should still remove unnecessary ORDER BY clauses and delay the necessary ones as far down the query as possible – while BigQuery's SQL optimizer proves to be impressive, some other legacy databases may not have the same capability.
12: Delay resource-intensive operations.
Best practice: Push complex operations, such as regular expressions and mathematical functions to the end of the query. Source.
Extending the same philosophy as delaying ORDER BY statements, we want to push complex functions as far down our query as possible to avoid computing data that we will toss out eventually.

This applies to any function, such as LOWER(), TRIM(), or CAST(), but the emphasis here is on regular expressions and mathematical functions, such as REGEXP_SUBSTR() and SUM(), because they tend to consume a lot more resources.
To showcase the impact, I'm going to run REGEXP_REPLACE() early in the query vs. late.
# Not Optimized
WITH
cte_repo AS (
SELECT
REGEXP_REPLACE(repo_name, r"(.*)", r"\1") AS repo_name -- backreference keeps the name intact; the cost is in running the regex
FROM
`bigquery-public-data.github_repos.sample_repos`
),
cte_commit AS (
SELECT
REGEXP_REPLACE(repo_name, r"(.*)", r"\1") AS repo_name
FROM
`bigquery-public-data.github_repos.sample_commits`
)
SELECT
r.repo_name,
c.repo_name
FROM
cte_repo r
INNER JOIN
cte_commit c ON r.repo_name = c.repo_name
-----------------------
Elapsed Time : 2s
Slot Time : 8s
Bytes Processed: 20 MB
Bytes Shuffled : 68 MB
Bytes Spilled : 0 B
-----------------------
Subsequently, we run the same query again, except we only invoke the REGEXP_REPLACE() in the final table after joining the two initial tables together.
# Optimized
WITH
cte_repo AS (
SELECT
repo_name
FROM
`bigquery-public-data.github_repos.sample_repos`
),
cte_commit AS (
SELECT
repo_name
FROM
`bigquery-public-data.github_repos.sample_commits`
)
SELECT
REGEXP_REPLACE(r.repo_name, r"(.*)", r"\1") AS repo_name,
REGEXP_REPLACE(c.repo_name, r"(.*)", r"\1") AS repo_name
FROM
cte_repo r
INNER JOIN
cte_commit c ON r.repo_name = c.repo_name
-----------------------
Elapsed Time : 2s
Slot Time : 3s
Bytes Processed: 20 MB
Bytes Shuffled : 56 MB
Bytes Spilled : 0 B
-----------------------
Just like that, the slot time improved from 8s to 3s while bytes shuffled dropped from 68 MB to 56 MB.
13: Use SEARCH().
Google recently released a preview version of the [SEARCH()](https://cloud.google.com/blog/products/data-analytics/pinpoint-unique-elements-with-bigquery-search-features) function, which tokenizes text data, making it exceptionally easy to find data buried in unstructured text and semi-structured JSON data.

Traditionally, when dealing with nested structures, we need to understand the table schema in advance, then appropriately flatten any nested data with UNNEST() before running a combination of WHERE and REGEXP clauses to search for specific terms. These are all compute-intensive operations.
# Not Optimized
SELECT
`hash`,
size,
outputs
FROM
`bigquery-public-data.crypto_bitcoin.transactions`
CROSS JOIN
UNNEST(outputs)
CROSS JOIN
UNNEST(addresses) AS outputs_address
WHERE
block_timestamp_month BETWEEN "2009-01-01" AND "2010-12-31"
AND REGEXP_CONTAINS(outputs_address, '1LzBzVqEeuQyjD2mRWHes3dgWrT9titxvq')
-----------------------
Elapsed Time : 6s
Slot Time : 24s
Bytes Processed: 282 MB
Bytes Shuffled : 903 B
Bytes Spilled : 0 B
-----------------------
Instead of overcomplicating the syntax, we can simplify it with the SEARCH() function.
# Optimized
SELECT
`hash`,
size,
outputs
FROM
`bigquery-public-data.crypto_bitcoin.transactions`
WHERE
block_timestamp_month BETWEEN "2009-01-01" AND "2010-12-31"
AND SEARCH(outputs, '`1LzBzVqEeuQyjD2mRWHes3dgWrT9titxvq`')
-----------------------
Elapsed Time : 6s
Slot Time : 24s
Bytes Processed: 87 MB
Bytes Shuffled : 903 B
Bytes Spilled : 0 B
-----------------------
We can even create a search index on the column to enable point-lookup text searches.
# To create the search index over existing BQ table
CREATE SEARCH INDEX my_logs_index ON my_table (my_columns);
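As a side note – if you want to confirm that the index has finished building before relying on it, BigQuery exposes a dataset-level INFORMATION_SCHEMA view for search indexes. A minimal sketch, with my_dataset standing in for your own dataset:
# Sketch: list the search indexes in a dataset and their build status
SELECT
table_name,
index_name,
index_status
FROM
my_dataset.INFORMATION_SCHEMA.SEARCH_INDEXES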

As you can see, SEARCH() is an extremely powerful, straightforward, and cost-effective way to do point-lookup text searches. If your use case requires searching for very specific terms in unstructured data (e.g., log analytics), prioritize using SEARCH().
14: Take advantage of Caching.
BigQuery has a cost-free, fully managed caching feature for our queries. When we execute our query, BigQuery automatically caches the query results into a temporary table that lasts for up to 24 hours. We can toggle the feature through Query Settings on the Editor UI.

When a duplicate query is fired, BigQuery returns the cached results instead of re-running the query, saving us that extra expense and compute time.

We can verify whether cached results are used by checking "Job Information" after running the query. The Bytes processed should display "0 B (results cached)".
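If you prefer checking programmatically, the INFORMATION_SCHEMA jobs views record a cache_hit flag for every job. A minimal sketch, assuming your jobs run in the US multi-region (swap the region qualifier otherwise):
# Sketch: recent jobs and whether they were served from the cache
SELECT
job_id,
creation_time,
cache_hit,
total_bytes_processed
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
creation_time DESC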

Before we go crazy firing queries every chance we get, it’s important to know that not all queries will be cached. BigQuery outlined the exceptions here. Some of the more important ones to note are:-
- A query is not cached when it uses non-deterministic functions, such as CURRENT_TIMESTAMP(), because they return a different value depending on when the query is executed.
- When the table referenced by the query receives streaming inserts, because any change to the table invalidates the cached results.
- If you are querying multiple tables using a wildcard.
Concluding Remarks
While this isn’t an exhaustive list of all optimization tips and tricks, I hope it serves as a great start. Please do comment if I missed any important techniques as I intend to keep adding to this list so that we can all have an easy point of reference.
Good luck, and Godspeed.
Enjoyed the article? Consider becoming a Medium member to get full access to every story and support content creators like me.