
Understand BigQuery Schema Auto-detection

And Choose to Manually Define and Control Your BigQuery Table Schema

Photo by Michał Jakubowski on Unsplash

Schema auto-detection is an attractive feature people think of when they need to create a database table. Think about it! Automatically detecting a data schema sounds really great because it saves the time that was traditionally devoted to defining data schemas manually.

It’s no surprise BigQuery offers schema auto-detection for a couple of source formats, including CSV, NEWLINE_DELIMITED_JSON (aka JSONL), and AVRO.

Although I can’t deny the usefulness of the feature for rapid prototyping during the development phase, I’ll argue that it should be used with a lot of care in automated production contexts, especially when you deal with big data. Here is why:

BigQuery infers a table’s schema from a sample of the data rather than scanning the whole dataset, at least for big data.

Let me show you what I mean with two simple examples.

Example 1: Load a STRING Column that Looks Like an Integer Column

Consider the following data. It contains 10,001 codes, each represented with 5 characters. All the codes are composed of 5 numeric characters, except for the last code, which starts with the letter ‘a’.
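The original sample file isn’t included here, but a minimal Python sketch can reproduce data with the same shape (the actual code values, apart from a2153, are assumptions):

```python
# Generate a CSV with 10,001 five-character codes: all numeric
# except the last one, which starts with the letter 'a'.
import csv

with open("test_sample1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["code"])
    for i in range(10_000):
        writer.writerow([f"{i:05d}"])  # zero-padded numeric codes
    writer.writerow(["a2153"])         # the single non-numeric code
```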

Image By Author, A Sample of 5 rows from the Codes Data

To load this data from a CSV file named test_sample1.csv into a BigQuery table named test_sample1, we’d use the following command:
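With schema auto-detection enabled, the load command would look roughly like this (mydataset is a placeholder for your own dataset):

```shell
bq load \
  --source_format=CSV \
  --autodetect \
  mydataset.test_sample1 \
  ./test_sample1.csv
```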

However, we get a nasty error saying that BigQuery is not able to parse the code a2153.

Image By Author

Actually, based on a sample of the data, BigQuery infers that the column code is an integer. It then tries to cast the value a2153 to an integer, which obviously is not feasible.

Well, that’s the first issue we encountered in my company when we tried to automate things by taking advantage of BigQuery schema auto-detection. But the worst is yet to come.

Example 2: Load JSON Records with Optional Fields

Let’s try JSONL-formatted data this time. The data contains the addresses of 10,001 houses, each identified by a code. A house may have a secondary address (field address.address2) in addition to its main address (field address.address1). In our sample data, only one house happens to have a secondary address: house 34512.

Image By Author, A Sample of 5 rows from the Addresses Data
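Again, the sample file itself isn’t shown, but a hypothetical generator sketches its shape (the codes, street names, and the position of house 34512 in the file are assumptions):

```python
# Generate a JSONL file of 10,001 house records; only house 34512
# carries the optional address.address2 field.
import json

codes = [f"{i:05d}" for i in range(10_000)] + ["34512"]

with open("test_sample2.jsonl", "w") as f:
    for code in codes:
        record = {
            "code": code,
            "address": {"address1": f"{int(code)} Main Street"},
        }
        if code == "34512":
            record["address"]["address2"] = "Rear entrance"
        f.write(json.dumps(record) + "\n")
```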

We run a command similar to the previous one to load the data into a BigQuery table.
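That command would look roughly like this, with only the source format changed (mydataset is again a placeholder):

```shell
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  mydataset.test_sample2 \
  ./test_sample2.jsonl
```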

And here also, we get back an error indicating that there is no address.address2 column in the schema auto-detected by BigQuery.

Image By Author

Once again, the auto-detection was based on a sample of rows that did not include the row for house 34512.

Solution: Manually Define and Control Your BigQuery Table Schema

We can solve the two issues described above very easily by defining a schema ourselves and providing that schema to the bq load command.

For the codes data, we would run:
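With an inline schema declaring code as a STRING, the command might look like this. Note that without --autodetect, BigQuery no longer detects the header row, so we skip it explicitly with --skip_leading_rows:

```shell
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.test_sample1 \
  ./test_sample1.csv \
  code:STRING
```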

And for the addresses data, we would first define a schema file, then run the load command.
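A sketch of such a schema file, say addresses_schema.json (the REQUIRED modes on the other fields are assumptions):

```json
[
  {"name": "code", "type": "STRING", "mode": "REQUIRED"},
  {
    "name": "address",
    "type": "RECORD",
    "mode": "REQUIRED",
    "fields": [
      {"name": "address1", "type": "STRING", "mode": "REQUIRED"},
      {"name": "address2", "type": "STRING", "mode": "NULLABLE"}
    ]
  }
]
```

The load command then takes the schema file as its last argument:

```shell
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.test_sample2 \
  ./test_sample2.jsonl \
  ./addresses_schema.json
```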

The key here is to declare the address.address2 column as NULLABLE.

End Notes

Schema auto-detection in BigQuery might give you bad headaches if you are not aware of how it works. To make a long story short, BigQuery samples your data (at least for big data) and computes a schema based on that sample. And it’s often the case that the sampled schema does not match the actual schema of the data. Hence, I would suggest manually defining data schemas based on your knowledge of the data. Please find the sample data and the code snippets used throughout the article here.

Until the next writing, bye bye.
