
Top 12 pieces of advice on BigQuery and SQL for beginners

Start your data warehouse project with BigQuery the right way

Photo by Cathryn Lavery on Unsplash

Introduction

BigQuery is one of many data warehouse solutions, but it stands out as one of the most popular. After some quick research comparing four competing solutions used by different companies, BigQuery comes in second on the podium.

Adoption volume for each solution. Source: https://discovery.hgdata.com (Image by Author)

To analyze data, Google BigQuery, like many other data warehouse systems, uses SQL as its query language.

So, whether you want to study SQL, launch a data warehouse project, or practice on one of the top data warehouse platforms, here is some advice I’d love to share with you!


Advice #1: Use BigQuery for free

You can use BigQuery for free up to a certain point. There are typically two costs to consider: data storage and data analysis (querying).

Each month, the free tier gives you:

  • 10 GB of storage
  • 1 TB of querying

Other operations, such as batch loading data (streaming ingestion is charged), copying data, deleting data, or exporting data, are usually free. So, if your data volume allows it, don’t wait any longer to start!

You can find all the BigQuery pricing details here:

Pricing | BigQuery: Cloud Data Warehouse | Google Cloud

More details about the Free Tier and Trial for all Google Cloud Platform services (BigQuery and others) can be found here:

Free Trial and Free Tier | Google Cloud

Advice #2: Understand Query pricing

The primary use case for BigQuery is to conduct queries on your data in order to extract useful information. There are two main categories of costs:

  • Storage cost
  • Query cost (analysis)

From the previous advice, we saw that there is a free tier; beyond it, with on-demand pricing (the most common model), you’ll be charged $5 per TB of data processed.

To estimate how much data a query will need to scan, look at the query editor in your browser; a notification tells you:

This query will process "10 GB" when run.

Where the "10" will be adjusted for your query and the "GB" as well (this can be 24 MB, 102 KB, 14TB, etc…)

The number of bytes (here 1.2 GB) that will be processed to run this query (Image by Author)

To get an idea of the cost, convert the amount of data your query processes into terabytes, then multiply by $5.

In our case, we have 1.2 GB, which is equivalent to 0.0012 TB; multiplied by $5, that equals $0.006.
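You can even let BigQuery do this arithmetic for you. A toy example, following the decimal conversion used above:

-- 1.2 GB expressed in TB, multiplied by the $5/TB on-demand rate
SELECT 1.2 / 1000 * 5 AS estimated_cost_usd;  -- returns 0.006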

As you can see, we will pay a very modest amount for this query, but I am confident you will run more than one!

In general, controlling the costs of your queries is a good practice. You can learn how by following this link.

Control costs in BigQuery | Google Cloud

Advice #3: Explore BigQuery free datasets

If you don’t have data loaded and you want to practice your skills, BigQuery offers many datasets available to the general public through the Google Cloud Public Dataset Program.

Later on, these datasets can also be useful if you want to merge/enrich your own data. (e.g. With geographic data and more…)

You can browse the dataset marketplace. This is what the marketplace looks like, and when you click on one of the available datasets, you’ll see its details (a description of the data and of each field).

Google BigQuery Dataset Marketplace (Image by Author)

I am personally using the following datasets:

  • Google Analytics Sample
  • Crypto
  • Utility_eu

Cost warning: Google pays for the storage of these datasets, but you will pay for the queries that you execute (cf. Advice #1 – Use BigQuery for free).

You can query them using the project prefix bigquery-public-data in your SQL FROM statement, as shown below:
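-- A query on a public dataset: the most common US first names since 1910
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10;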

Advice #4: Create your datasets in the same region

When creating a dataset, you will have to fill in the following information:

  • Dataset Id: The name you want to give to your dataset
  • Data location: Determines where your data is stored (Europe, US, Asia, etc…)
  • Default table expiry: Whether you want the tables in this dataset to be automatically deleted after a specified number of days, or never
  • Encryption: How you want the data to be encrypted (it is by default encrypted)

The important part is to select the same location across your project, in one of the three available: United States (US), Europe (EU), or Asia Pacific (Asia).
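A minimal sketch in SQL DDL (project and dataset names are hypothetical), pinning the location at creation time:

CREATE SCHEMA IF NOT EXISTS my_project.my_dataset
OPTIONS (
  location = 'EU',
  default_table_expiration_days = 30
);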

The main reason is that if you have a dataset in the EU and you try to join/merge it with a dataset in the US, BigQuery will return an error like the following:

Error running query

Not found: Dataset datastic:crypto was not found in location US

This also applies to other services (Cloud SQL etc…) if you are trying to join data from different locations.

Also keep in mind that once a dataset is created, its location cannot be changed (though you can copy/move the dataset to another location, which requires additional effort).

Regions and locations also come with different pricing and specificities, which you can read more about in the Google documentation.

Dataset locations | BigQuery | Google Cloud

Advice #5: The LIMIT statement doesn’t reduce cost

When running a query, BigQuery will evaluate how many bytes need to be processed (cf. Advice #2 – Understand Query pricing).

Processed bytes per query using LIMIT clause or not (Image by Author)

On the left, we observe that the query will process 1.2 GB of data without a LIMIT clause. On the right, we observe that the query will process the exact same amount of data with the clause.

So what are the benefits of this clause?

The advantage is that it reduces the amount of data displayed in the user interface, so results render faster.

Additionally, when you have a clustered table, the LIMIT clause can reduce the number of bytes scanned, and thus the cost of the query, as sketched below.
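A minimal sketch, assuming a hypothetical mydataset.events table; clustering means BigQuery can stop scanning blocks early, so the LIMIT below may actually lower bytes billed:

-- Create a copy of the table clustered by user_id (names are hypothetical)
CREATE TABLE mydataset.events_clustered
CLUSTER BY user_id
AS SELECT * FROM mydataset.events;

-- On the clustered table, LIMIT can reduce the bytes scanned
SELECT user_id, event_type
FROM mydataset.events_clustered
LIMIT 100;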

Advice #6: Select only the columns you need

BigQuery is a column-oriented database, which, in short, is very efficient at scanning and aggregating data (sums, averages, etc…) over large tables.

It is efficient because it doesn’t have to read entire rows. Imagine the data that would need to be processed if you had a billion rows and 40 columns and had to read all 40 columns for every row, when you only need two or three.

As a good practice, it is recommended not to use the wildcard SELECT * FROM table when possible, but to select only the columns you need, for example:

SELECT col_1,col_2 FROM table;

This will significantly minimize the cost of your queries.
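As a concrete, runnable illustration on one of the public datasets from Advice #3:

-- Reads only two columns instead of the whole table
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = 'hamlet'
ORDER BY word_count DESC
LIMIT 10;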

You can find more details about how the data is stored and represented in BigQuery below.

Overview of BigQuery storage | Google Cloud

Advice #7: ORDER BY at the end and reduce data before using a JOIN

The way you write your queries has an impact on both cost and execution performance. You can keep both in check by avoiding SQL anti-patterns and using proper syntax.

Here are a few optimizations I can share:

  • Order results once, at the end, when you don’t need them sorted earlier
  • Reduce the amount of data before you perform a JOIN (see the sketch after this list)
  • Use WHERE and DISTINCT statements early in your query to reduce the amount of data required
  • Use approximate aggregate functions
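A minimal sketch putting these together (table names are hypothetical): the WHERE clause shrinks the data before the JOIN, and the ORDER BY runs once, at the end:

WITH recent_orders AS (
  SELECT order_id, customer_id, amount
  FROM mydataset.orders
  WHERE order_date >= '2021-01-01'  -- reduce the data before the JOIN
)
SELECT c.name, SUM(o.amount) AS total_amount
FROM recent_orders AS o
JOIN mydataset.customers AS c
  ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY total_amount DESC;  -- order results once, at the end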

Along with the other recommendations in this article (cf. Advice #6 – Select only the columns you need), another excellent practice is to decrease the quantity of data needed at an early stage, reducing the load on each following step of your query.

In general, avoiding SQL anti-patterns is best practice. An SQL anti-pattern is simply a common mistake made when developing SQL code.

Avoiding SQL anti-patterns | BigQuery | Google Cloud

GitHub – boralp/sql-anti-patterns: List of anti patterns in SQL

You can also look at SQL query optimization and performance in further depth.

Introduction to optimizing query performance | BigQuery | Google Cloud

Advice #8: Format your queries

Formatting your queries will make a difference for your readers or when completing a code review, and it’s a habit I encourage developing from the beginning.

For Mac users, you can use the following shortcut in the BigQuery browser editor: cmd+shift+f

It will change the look of your queries from this:
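(An illustrative stand-in for the original screenshot; table and column names are hypothetical.)

select col_1,col_2,count(*) as total from mydataset.table_1 where col_1='a' group by col_1,col_2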

To a more readable and clear format, with highlighted functions, names, clauses, and statements like this:
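(The same illustrative query, formatted.)

SELECT
  col_1,
  col_2,
  COUNT(*) AS total
FROM
  mydataset.table_1
WHERE
  col_1 = 'a'
GROUP BY
  col_1,
  col_2;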

Advice #9: Stay up to date

Google developers are constantly adding new features to their products, and that is the case for BigQuery. These can be new functions, announcements, bug fixes, or simply changes. It is worth staying informed about them.

These notes include changes for BigQuery, but also BigQuery ML, BigQuery Data Transfer Service, and BigQuery Omni.

Release notes | BigQuery | Google Cloud

You can programmatically access release notes in BigQuery, which also gives you the flexibility to sort by date and by type of note (fix, issue, feature, service, breaking change, etc…).
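For example, a sketch assuming the public google_cloud_release_notes dataset that Google exposes for this purpose:

SELECT product_name, release_note_type, description, published_at
FROM `bigquery-public-data.google_cloud_release_notes.release_notes`
WHERE product_name LIKE '%BigQuery%'
ORDER BY published_at DESC
LIMIT 20;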

Advice #10: Add comments to your queries

Commenting code is a good habit regardless of the language (SQL, Python, R, C++, Javascript, etc…).

It explains what the code does, and SQL queries are no exception.

In BigQuery, as in many other systems, you can comment using a double dash (two hyphens), as follows:
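-- This is a comment: BigQuery ignores everything after the double dash
-- (table and column names below are hypothetical)
SELECT
  col_1,  -- comments can also annotate a single line
  col_2
FROM mydataset.table_1;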

For Mac users, the keyboard shortcut cmd+/ comments out each line of the query.

Advice #11: Learn to debug your queries

Debugging and improving your query may take some practice. But, if you find that your query is excessively slow or too expensive, you can read the query execution details tab in the query editor.

Execution details of a query (Image by Author)
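You can also inspect past jobs with SQL. A sketch assuming the INFORMATION_SCHEMA.JOBS_BY_USER view in the us region (adjust the region qualifier to yours):

-- Your ten most data-hungry queries of the past week
SELECT
  job_id,
  total_bytes_processed,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;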

An excellent post discusses the BQvisualiser, a tool that can be used to evaluate where your query might be improved. This tool connects directly to your project.

Visualizing BigQuery query plans

More information on query execution can be found here:

Query plan and timeline | BigQuery | Google Cloud

Advice #12: Community UDFs

Although UDFs (user-defined functions) are a more advanced topic, there are community resources available to help you get started, or at least to reuse existing functions in your own project!

GitHub – GoogleCloudPlatform/bigquery-utils: Useful scripts, udfs, views, and other utilities for…

UDFs let you define your own functions in SQL or JavaScript. As an example, suppose a field in your table contains city names that you want to sanitize: you could write a function using a JavaScript library and apply it to your entire table at once, as sketched below.
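A minimal sketch, assuming a hypothetical mydataset.cities table with a city_name column; the temporary JavaScript UDF trims and title-cases each name:

CREATE TEMP FUNCTION clean_city(name STRING)
RETURNS STRING
LANGUAGE js AS """
  if (name === null) return null;
  return name.trim().toLowerCase()
             .replace(/\b\w/g, function(c) { return c.toUpperCase(); });
""";

SELECT city_name, clean_city(city_name) AS city_clean
FROM mydataset.cities;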


Let’s put it into action. ✨

I hope these twelve recommendations helped you make a good start with BigQuery, or simply explore and become familiar with the technology.

Many data warehouse technologies have similar qualities and would certainly be excellent options for your applications. Still, I would argue it is important to double-check specifications and details against your particular requirements.

Please let me know if you found this article informative and if it did help you in the comments!🤓

