
How to Avoid Five Common Mistakes in Google BigQuery / SQL

While working with BigQuery for years, I have observed five mistakes that are made again and again, even by experienced data scientists

Photo by James Harrison on Unsplash

Google BigQuery is popular for many reasons. It is incredibly fast, easy to work with, gives you access to the full GCP suite, takes care of your data, and helps you catch mistakes early on. On top of that, you can use standard SQL and some very handy built-in functions. In short, it is almost the full package!

Always assume bugs and duplicates, always!

However, as with other web services and programming languages, there are a few things you need to know when working with BigQuery to avoid falling into a trap. Over the years, I made plenty of mistakes myself and realized that almost everyone I knew encountered the same issues at some point. I want to call out a handful of them here because I discovered them fairly late in my career and still see very experienced data scientists running into them.

Therefore, here is my list of the top 5 potential mistakes almost everyone makes in BigQuery at some point, often without even knowing it. Make sure to avoid them, because each one can have severe consequences, and keep in mind the right attitude when working with data: Always assume bugs and duplicates, always!

1. Be Careful When Using "NOT IN"

It happens so fast. You are in a hurry and want to quickly compare two tables to check whether an item that appears in one table also exists in the second one. Why not go with a NOT IN statement, since it sounds so intuitive?

The problem is that NOT IN does not work as intended when there are NULL values in the subquery. Since any comparison with NULL evaluates to NULL rather than TRUE or FALSE, the whole NOT IN condition can never become TRUE, and you will not get the results you expect!

See for yourself and check out this code example, in which I simply try to find the categories from input_2 that are not in input_1:

WITH
  input_1 AS (
  SELECT
    category
  FROM (
    SELECT
      ["a", "b", CAST(NULL AS STRING), "d"] AS category),
    UNNEST(category) category ),

  input_2 AS (
  SELECT
    category
  FROM (
    SELECT
      ["a", "b", "c", "d"] AS category),
    UNNEST(category) category )

SELECT
  *
FROM
  input_2
WHERE
  input_2.category NOT IN (
  SELECT
    category
  FROM
    input_1)

Note that input_1 does not contain the category c, so you would expect c to be returned. Yet, the result of this query is "There is no data to display".

To overcome this, use a LEFT JOIN or remove all NULL values, as shown in this slightly adjusted code (pay attention to the very last line):

WITH
  input_1 AS (
  SELECT
    category
  FROM (
    SELECT
      ["a", "b", CAST(NULL AS STRING), "d"] AS category),
    UNNEST(category) category ),

  input_2 AS (
  SELECT
    category
  FROM (
    SELECT
      ["a", "b", "c", "d"] AS category),
    UNNEST(category) category )

SELECT
  *
FROM
  input_2
WHERE
  input_2.category NOT IN (
  SELECT
    category
  FROM
    input_1
  WHERE category IS NOT NULL)
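
Alternatively, here is a minimal sketch of the LEFT JOIN approach (reusing the input_1 and input_2 definitions from above). It keeps the rows from input_2 that have no match in input_1 and is not affected by NULL values:

SELECT
  input_2.category
FROM
  input_2
LEFT JOIN
  input_1
ON
  input_2.category = input_1.category
WHERE
  -- keep only rows without a match in input_1 (an anti-join);
  -- the NULL in input_1 never matches anything, so it cannot break the result
  input_1.category IS NULL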

2. You Might Lose Rows When You "UNNEST" Data

When you work with big data, nested columns are a must-have. They save money and keep things organized. Also, nesting your data is sometimes a convenient way to limit your data based on a specific order within a partition (for example, keeping only the latest row per key), as sketched below.
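
As a minimal sketch of that last pattern (with hypothetical table and column names), you can nest the rows of each id into an ordered, limited array, here keeping only the most recent item per id:

SELECT
  id,
  -- nest each id's rows into an array ordered by recency, keeping only the newest item
  ARRAY_AGG(item ORDER BY created_at DESC LIMIT 1) AS latest_item
FROM
  your_project.your_dataset.events
GROUP BY
  id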

However, be aware that you might lose rows when unnesting your data! When you use UNNEST with a CROSS JOIN (the implicit comma join), you lose all rows for which the nested column is empty or NULL.

Check out this example where we have 2 different IDs defined but only one has data in the nested column:

WITH 
input_1 AS (
  SELECT 1 as id, ['phone', 'car', 'paper'] as data 
UNION ALL
  SELECT 2 as id, [] as data 
)

SELECT 
id,
element
FROM input_1,
UNNEST(data) as element

When running this code, you will see that, due to the CROSS JOIN, ID number 2 is lost. That can have a severe impact on your data, since you are essentially filtering that ID out of your subsequent analysis.

One way to fix this is to use a LEFT JOIN instead of the CROSS JOIN, as shown in the code below (please pay attention to the last line). With the LEFT JOIN, ID number 2 is kept and its element is simply NULL:

WITH 
input_1 AS (
  SELECT 1 as id, ['phone', 'car', 'paper'] as data 
UNION ALL
  SELECT 2 as id, [] as data 
)

SELECT 
id,
element
FROM input_1
LEFT JOIN UNNEST(data) as element

3. Costs in BigQuery ML are Often Higher Than Estimated

Estimating costs is important. When using BigQuery, you are likely working with big data (because that is where BigQuery really shines). One of the convenient things about BigQuery is that query costs are estimated automatically: before you run a query, BigQuery shows you the maximum amount of data that will be processed, based on the estimated partition pruning.

For instance, when the estimate is 1 TB, you can expect the query to cost $6.25 or less (because $6.25 per TB is the on-demand pricing; when using clustering, the amount of data actually scanned can be reduced even further).

Figure 1: Estimate for the model training based on the Chicago taxi trips dataset. When training the model, it will actually bill 51.6 GB. (Image by author)

However, BigQuery ML is different! There are two things I would like to point out here:

  1. It has a different pricing model in which, for instance, linear regression is fairly expensive ($312.50 per TB).
  2. The processing costs depend on the underlying Vertex AI training costs. Unfortunately, BigQuery cannot tell you before running the query how large these passthrough Vertex AI costs will be, so you might pay much more than what was estimated.

Let’s take an example to highlight point 2. Here, I try to predict trip length based on the day and the hour using the publicly available Chicago taxi trips dataset:

CREATE MODEL your_model 
OPTIONS(
    MODEL_TYPE='BOOSTED_TREE_REGRESSOR',
    BOOSTER_TYPE = 'GBTREE',
    NUM_PARALLEL_TREE = 5,
    MAX_ITERATIONS = 5,
    TREE_METHOD = 'HIST',
    EARLY_STOP = FALSE,
    SUBSAMPLE = 0.85,
    INPUT_LABEL_COLS = ['trip_miles']
) AS

SELECT
  EXTRACT(hour
  FROM
    trip_start_timestamp) hour,
  EXTRACT(dayofweek
  FROM
    trip_start_timestamp) weekday,
  trip_miles
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  trip_miles IS NOT NULL;

For this query, BigQuery estimates 3.12 GB of processing. But pay attention to the "(ML)" in the estimate (see Figure 1), because that is the clue telling you that it might get more expensive. And as a matter of fact, it did: training this model resulted in 51.63 GB billed and was, thus, about 17x more expensive than the original estimate.
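
If you want to check afterwards what a training job actually billed, you can query the BigQuery jobs metadata. Here is a minimal sketch (assuming the US region and access to the INFORMATION_SCHEMA jobs view):

SELECT
  job_id,
  creation_time,
  -- bytes processed vs. bytes actually billed, converted to GB
  total_bytes_processed / POW(1024, 3) AS gb_processed,
  total_bytes_billed / POW(1024, 3) AS gb_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND statement_type = 'CREATE_MODEL'
ORDER BY
  creation_time DESC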

4. Avoid MERGE Statements When Parallelising Jobs in Production

Updating an existing table is a common situation when you have production pipelines. That is because you continuously get new data you want to append, but you also sometimes have historical changes that you would like to be reflected in your table.

That being said, it is important to have a scalable setup for updating tables. One popular choice is MERGE statements. They allow you to insert or update data based on specific conditions. What makes them really powerful is that they prevent you from introducing duplicates, because a MERGE statement only allows you to update a single matching row based on the defined conditions.

However, there is a problem: MERGE statements do not lend themselves well to parallelization. Here is an example that merges data into an existing table. When you run the MERGE statement several times in parallel, you will run into trouble.

First, we create a dummy table which we want to merge into:

CREATE OR REPLACE TABLE
  your_project.your_dataset.merge_test ( 
    local_date date,
    value INT64 )
PARTITION BY
  local_date;

Then we run this query several times in parallel which just inserts or updates new dates into the dummy table:

MERGE
  your_project.your_dataset.merge_test o
USING
  (
  SELECT
    date local_date,
    1 value
  FROM
    UNNEST(GENERATE_DATE_ARRAY("2022-01-01", "2023-12-31", INTERVAL 1 day)) date 
  ) n
ON
  n.local_date = o.local_date
  WHEN MATCHED THEN UPDATE SET value = n.value
  WHEN NOT MATCHED THEN INSERT (local_date, value) VALUES (n.local_date, n.value)

When running this in parallel, you will observe two issues:

  1. You will encounter an error message when an UPDATE statement is used on the same partition. The error message will look like this: Could not serialize access to table your_project.your_dataset.merge_test due to concurrent update.
  2. You will also observe that queries triggered a bit later remain in a pending state for a long time (several seconds) instead of running.

Even though MERGE statements are very useful and powerful, keep this in mind when your production environment is designed to parallelize data ingestion into the same table: the queries will either crash due to partition conflicts in the update or, when more than two tasks run in parallel, sit in the pending state, so you lose the benefits of parallelization.

One way to avoid this is to use DELETE / INSERT statements, or to write into temporary tables that are merged together in one batch at a later point.
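
As a minimal sketch of the DELETE / INSERT alternative (reusing the dummy table from above; note that concurrent DELETEs on the same partitions can still conflict, so this works best when parallel jobs touch disjoint date ranges):

-- Remove the affected date range first...
DELETE FROM
  your_project.your_dataset.merge_test
WHERE
  local_date BETWEEN DATE "2022-01-01" AND DATE "2023-12-31";

-- ...then re-insert the fresh data; plain INSERTs do not conflict with each other
INSERT INTO
  your_project.your_dataset.merge_test (local_date, value)
SELECT
  date AS local_date,
  1 AS value
FROM
  UNNEST(GENERATE_DATE_ARRAY("2022-01-01", "2023-12-31", INTERVAL 1 day)) date;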

5. Partition and Cluster Even Temp Tables

Temporary (or TEMP) tables are very useful when you are constructing a complex query with several subqueries. They also come in handy when you are working with big data and want to load the data once and then reuse it several times in subsequent parts of the query. So if you are running into timeout issues because your query runs too long, they might be what you are looking for.

However, users often see the benefit only in having a temporary table that is materialized while the whole query runs. That is not the full picture: just like every other materialized table, TEMP tables can be partitioned and clustered. This provides additional advantages with which you can reduce your query costs and improve your runtime (when working with really big data).

The following query is not very useful in itself, but it shows the concept. It creates two TEMP tables based on the publicly available Chicago taxi trips dataset, one using the full dataset and the other only a subset of it. The last query joins the two tables together to filter down to the subset. Because the TEMP tables are partitioned and clustered, the actual query that joins them bills only 1 GB, while without partitioning it would bill 36 GB:

CREATE TEMP TABLE trips
PARTITION BY
  trip_start_date
CLUSTER BY
  unique_key AS
SELECT
  unique_key,
  taxi_id,
  DATE(trip_start_timestamp) trip_start_date,
  trip_total
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`; 

CREATE TEMP TABLE trips_filtered
PARTITION BY
  trip_start_date
CLUSTER BY
  unique_key AS
SELECT
  unique_key,
  taxi_id,
  DATE(trip_start_timestamp) trip_start_date,
  trip_total
FROM
  `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
  EXTRACT(year
  FROM
    trip_start_timestamp) = 2022
  AND SUBSTR(taxi_id, -1, 1) = "b";

SELECT
  DISTINCT taxi_id
FROM
  trips
JOIN
  trips_filtered
USING
  (unique_key,
    taxi_id,
    trip_start_date)

In addition, when working with big datasets and spatial indexing, performing the joins on the clustered and partitioned TEMP tables will improve your runtime as well.

Summary

We all make mistakes, and that is a good thing as long as we learn from them. The five issues mentioned in this story are, for me, the most important ones to call out because they are not very well known and beginners rarely encounter them. Few people think about them, but as you get more and more experienced, these five issues might at some point impact your data or production environment in a negative way. So pay attention to them, and also remember the general attitude when working with data: Always assume bugs and duplicates, always!

