
Google BigQuery is popular for many reasons. It is incredibly fast, easy to work with, comes with the full GCP suite, takes care of your data, and catches many mistakes early on. On top of that, you can use standard SQL and some very nice built-in functions. In short, it is almost the full package!
However, as with other web services and programming languages, there are a few things you need to know when working with BigQuery to avoid falling into a trap. Over the years, I made a lot of mistakes myself and realized that almost everyone I knew ran into the same issues at some point. I want to call out a handful of them here because I discovered them fairly late in my career and still see very experienced data scientists encountering them.
Therefore, here is my top 5 list of mistakes almost everyone makes in BigQuery at some point, often without even knowing about them. Make sure to avoid these, because each one can have severe consequences, and keep in mind the right attitude when working with data: Always assume bugs and duplicates, always!
1. Be Careful When Using "NOT IN"
It happens so fast. You are in a hurry and want to quickly compare two tables to check whether a certain item in one of them also exists in the other. Why not go with a NOT IN statement, since it sounds so intuitive?
The problem is that NOT IN does not work as intended when there are NULL values in your table. If the subquery returns a NULL, the NOT IN comparison evaluates to NULL instead of TRUE, so the row is filtered out and you will not get the results you desire!
See for yourself and check out this code example in which I am just trying to find the categories from input_2 that are not inside of input_1:
WITH
input_1 AS (
SELECT
category
FROM (
SELECT
["a", "b", CAST(NULL AS STRING), "d"] AS category),
UNNEST(category) category ),
input_2 AS (
SELECT
category
FROM (
SELECT
["a", "b", "c", "d"] AS category),
UNNEST(category) category )
SELECT
*
FROM
input_2
WHERE
input_2.category NOT IN (
SELECT
category
FROM
input_1)
Pay attention to the fact that input_1 does not contain a "c" category, yet the result of this query is "There is no data to display" instead of returning "c".
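To see why, you can boil the problem down to a single comparison. This minimal sketch (just literal values, no tables) shows that NOT IN returns NULL instead of TRUE as soon as the list contains a NULL:
SELECT
  "c" NOT IN ("a", "b", CAST(NULL AS STRING), "d") AS with_null, -- NULL, so a WHERE clause drops the row
  "c" NOT IN ("a", "b", "d") AS without_null -- TRUE, as expected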
To overcome this, either rewrite the query as a LEFT JOIN or remove all NULL values from the subquery, as shown in this slightly adjusted code (pay attention to the very last line):
WITH
input_1 AS (
SELECT
category
FROM (
SELECT
["a", "b", CAST(NULL AS STRING), "d"] AS category),
UNNEST(category) category ),
input_2 AS (
SELECT
category
FROM (
SELECT
["a", "b", "c", "d"] AS category),
UNNEST(category) category )
SELECT
*
FROM
input_2
WHERE
input_2.category NOT IN (
SELECT
category
FROM
input_1
WHERE category IS NOT NULL)
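If you prefer the LEFT JOIN route mentioned above, the same anti-join can be written as in the following sketch (it returns the expected "c"):
WITH
  input_1 AS (
    SELECT category FROM UNNEST(["a", "b", CAST(NULL AS STRING), "d"]) AS category),
  input_2 AS (
    SELECT category FROM UNNEST(["a", "b", "c", "d"]) AS category)
SELECT
  input_2.category
FROM
  input_2
LEFT JOIN
  input_1
ON
  input_2.category = input_1.category
WHERE
  -- no match in input_1 means the category is missing there
  input_1.category IS NULL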
2. You Might Lose Rows When You "UNNEST" Data
When you work with big data, nested columns are a must-have. They save money and keep things organized. Also, nesting your data is sometimes a convenient step to limit your data based on a specific order within a partition.
However, when unnesting your data, be aware that you might lose rows! When you use UNNEST in a CROSS JOIN, you lose all rows for which the nested column contains no data.
Check out this example where two different IDs are defined but only one has data in the nested column:
WITH
input_1 AS (
SELECT 1 as id, ['phone', 'car', 'paper'] as data
UNION ALL
SELECT 2 as id, [] as data
)
SELECT
id,
element
FROM input_1,
UNNEST(data) as element
When running this code, you will see that, due to the CROSS JOIN, ID number 2 is lost. That can have a severe impact on your data, since you are essentially filtering that ID out of your subsequent analysis.
One way to fix this is to use a LEFT JOIN instead of the CROSS JOIN, as shown in the code below (please pay attention to the last line):
WITH
input_1 AS (
SELECT 1 as id, ['phone', 'car', 'paper'] as data
UNION ALL
SELECT 2 as id, [] as data
)
SELECT
id,
element
FROM input_1
LEFT JOIN UNNEST(data) as element
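With the LEFT JOIN, the result should look roughly like this: ID 2 is kept and simply gets a NULL element, so it remains visible in downstream analyses:
id | element
1  | phone
1  | car
1  | paper
2  | NULL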
3. Costs in BigQuery ML are Often Higher Than Estimated
Estimating costs is important. When using BigQuery, you are likely working with big data (because that is where BigQuery really shines). One of the convenient things about BigQuery is that query costs are estimated automatically: before running a query, BigQuery tells you the maximum amount of data it will process based on the estimated partition pruning.
For instance, when the estimate is 1 TB, you can expect the query to cost 6.25 dollars or less (because 6.25 $/TB is the on-demand pricing model; when using clustering, the amount of data actually read can be reduced even further).

However, BigQuery ML is different! There are two things I would like to point out here:
- It has a different pricing model, where, for instance, creating a linear regression model is fairly expensive (312.5 $/TB, i.e. 50x the regular on-demand query price).
- The processing costs depend on the underlying Vertex AI training costs. Unfortunately, BigQuery cannot tell you before running the query how large these passthrough Vertex AI costs are going to be, so you might pay much more than what was estimated.
Let’s take an example to highlight point 2. Here, I try to predict trip length based on the day and the hour using the publicly available Chicago taxi trips dataset:
CREATE MODEL your_model
OPTIONS(
MODEL_TYPE='BOOSTED_TREE_REGRESSOR',
BOOSTER_TYPE = 'GBTREE',
NUM_PARALLEL_TREE = 5,
MAX_ITERATIONS = 5,
TREE_METHOD = 'HIST',
EARLY_STOP = FALSE,
SUBSAMPLE = 0.85,
INPUT_LABEL_COLS = ['trip_miles']
) AS
SELECT
EXTRACT(hour
FROM
trip_start_timestamp) hour,
EXTRACT(dayofweek
FROM
trip_start_timestamp) weekday,
trip_miles
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_miles IS NOT NULL;
When estimating the processing costs, BigQuery estimates 3.12 GB. But pay attention to the "(ML)" in the estimate (see Figure 1), because that is the clue telling you that it might get more expensive. And as a matter of fact, it did: training this model resulted in 51.63 GB billed and was therefore about 17x more expensive than the original estimate.
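If you want to verify after the fact what a CREATE MODEL statement actually billed, one option is to look it up in the INFORMATION_SCHEMA jobs view. This is a minimal sketch, assuming your jobs run in the US region and that you have permission to query the view:
SELECT
  job_id,
  creation_time,
  total_bytes_billed / POW(1024, 3) AS gb_billed -- convert bytes to GB
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  statement_type = 'CREATE_MODEL'
ORDER BY
  creation_time DESC
LIMIT
  10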
4. Avoid MERGE Statements When Parallelising Jobs in Production
Updating an existing table is a common situation when you have production pipelines. You continuously receive new data you want to append, but sometimes there are also historical changes that you want reflected in your table.
That being said, it is important to have a scalable setup for updating tables. One popular choice is MERGE statements. They allow you to update or insert data based on specific conditions. What makes them really powerful is that they prevent you from introducing duplicates, because MERGE statements only allow one matching row to be updated based on the defined conditions.
However, there is a problem: MERGE statements do not parallelize well. Here is an example query that merges data into an existing table. When you run the MERGE statement several times in parallel, you will encounter some problems.
First, we create a dummy table which we want to merge into:
CREATE OR REPLACE TABLE
your_project.your_dataset.merge_test (
local_date date,
value INT64 )
PARTITION BY
local_date;
Then we run this query several times in parallel, which just inserts or updates new dates in the dummy table:
MERGE
your_project.your_dataset.merge_test o
USING
(
SELECT
date local_date,
1 value
FROM
UNNEST(GENERATE_DATE_ARRAY("2022-01-01", "2023-12-31", INTERVAL 1 day)) date
) n
ON
n.local_date = o.local_date
WHEN MATCHED THEN UPDATE SET value = n.value
WHEN NOT MATCHED THEN INSERT VALUES (local_date, value)
When running this in parallel, you will observe two issues:
- You will encounter an error message when an UPDATE statement hits the same partition. The error message will look like this: Could not serialize access to table your_project.your_dataset.merge_test due to concurrent update.
- You will also observe that the queries triggered a bit later stay in a pending state for a long time (several seconds) instead of running.
Even though MERGE statements are very useful and powerful, be careful with them when your production environment is designed to parallelize data ingestion into the same table. The queries will either crash due to partition conflicts in the update or, when more than two tasks run in parallel, sit in the pending state, and you lose the benefits of parallelization.
One way to avoid this is to use DELETE / INSERT statements, or to create temporary tables that are merged together in one batch at a later point, as sketched below.
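As an illustration of the DELETE / INSERT route, a minimal sketch could look like the following. The staging table merge_test_staging is a hypothetical table that holds the freshly ingested rows; wrapping both statements in a multi-statement transaction keeps the target table consistent in between:
BEGIN TRANSACTION;

-- remove the dates that are about to be re-inserted
DELETE FROM
  your_project.your_dataset.merge_test
WHERE
  local_date IN (
  SELECT
    local_date
  FROM
    your_project.your_dataset.merge_test_staging); -- hypothetical staging table

-- insert the new or corrected rows
INSERT INTO
  your_project.your_dataset.merge_test (local_date, value)
SELECT
  local_date,
  value
FROM
  your_project.your_dataset.merge_test_staging;

COMMIT TRANSACTION;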
5. Partition and Cluster Even Temp Tables
Temporary (or TEMP) tables are very useful when you are constructing a complex query with several subqueries. They also come in handy when you are working with big data and want to load the data once and then reuse it in subsequent parts of the query. So when you are running into timeout issues because your query runs too long, they might be what you are looking for.
However, users often only see the benefit of having a table that is materialized while the whole query is running. That is not the full picture: just like any other materialized table, TEMP tables can be partitioned and clustered. This provides additional advantages with which you can reduce your query costs and also improve your runtime (when working with really big data).
The query below is not very useful in itself, but it shows the concept. It creates two TEMP tables based on the publicly available Chicago taxi trips dataset, one using the full dataset and the other only a subset of it. The last query joins the two tables together to filter down to the subset. Since the TEMP tables are partitioned and clustered, the actual query that joins them bills only 1 GB, while without partitioning it would bill 36 GB:
CREATE TEMP TABLE trips
PARTITION BY
trip_start_date
CLUSTER BY
unique_key AS
SELECT
unique_key,
taxi_id,
DATE(trip_start_timestamp) trip_start_date,
trip_total
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`;
CREATE TEMP TABLE trips_filtered
PARTITION BY
trip_start_date
CLUSTER BY
unique_key AS
SELECT
unique_key,
taxi_id,
DATE(trip_start_timestamp) trip_start_date,
trip_total
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
EXTRACT(year
FROM
trip_start_timestamp) = 2022
AND SUBSTR(taxi_id, -1, 1) = "b";
SELECT
DISTINCT taxi_id
FROM
trips
JOIN
trips_filtered
USING
(unique_key,
taxi_id,
trip_start_date)
In addition, when working with big datasets and performing spatial indexing, running the joins on the clustered and partitioned TEMP tables will improve your runtime as well.
Summary
We all make mistakes, and that is a good thing, as long as we learn from them. The 5 issues mentioned in this story are, for me, the most important ones to mention because they are not very well known and beginners rarely encounter them. Few people think about them, but as you get more and more experienced, at some point these 5 issues might impact your data or production environment in a negative way. So pay attention to them, and also remember the right attitude when working with data in general: Always assume bugs and duplicates, always!