Automate BigQuery Dataset & Table Size Reporting using BigQuery or Google Sheets

Collect table sizes and row counts from your BigQuery projects automatically, and store them in BigQuery and/or Google Sheets.

Nick Young
Towards Data Science



UPDATE JAN 18, 2021: Updated the process and repo code so that you only need to specify your GCP Project IDs, instead of needing to manually enter all your projects AND dataset IDs. So…it’s even easier now!

I wanted to see if it was possible to easily automate the process of tracking how my different BigQuery datasets and tables change over time. At some point, I’ll also add the ability to get email or Google Chat notifications when a threshold has been breached (a table increased or decreased too much between checks), but that’s for another day. Today, we’re just collecting information about our tables for easy reporting and visualization.

There are obviously a bunch of different ways to do this, including scheduled queries in BigQuery itself, but those don’t handle pushing the data into a Google Sheet automatically. Plus… I wanted to see if this was possible in Apps Script, because now I can repurpose the query function for other things in the future too.

Here are all the things we’ll use:

  1. BigQuery Datasets/Tables — to check their size, across multiple projects
  2. Apps Script — to handle the code and schedule the checks to run automatically
  3. BigQuery table — to store what we collected, if we don’t want to use Sheets
  4. Google Sheet — to store what we collected, if we don’t want to use BQ

Let’s get started.

Create an Apps Script project

  • Go to script.google.com and create a new project
  • Because there’s no git integration with Apps Script, you’ll need to copy the file contents manually
  • Copy the contents of the files from the github repository shared here: https://github.com/usaussie/appscript-bigquery-reporter
  • You should end up with 3 files: Jobs.gs, Helpers.gs, and Config.gs (renamed from the sample provided)

Update the project to use your BigQuery / Sheet info

  • Update the Config.gs file to point to your own BQ project(s) and the datasets that you want to track, then provide a BQ table to store the results, and finally a Google Sheet URL.
// BigQuery Project ID - all lowercase
const BQ_PROJECT_ID = 'your-gcp-project-id';
// Create a dataset in the BigQuery UI (https://bigquery.cloud.google.com)
const BQ_DATASET_ID = 'your-gcp-dataset-id';
// Table ID to store the collected stats
const BQ_TABLE_ID = 'your-gcp-table-id';

// for storing the metrics in google sheets too
const SHEET_URL = "https://docs.google.com/spreadsheets/d/your-google-sheet/edit";
const MAIN_SHEET_TAB_NAME = "Sheet1";

// these are the projects & datasets that you want to track
function project_dataset_list() {
  return [
    ['gcp-project-id-a', 'dataset_one'],
    ['gcp-project-id-a', 'dataset_two'],
    ['gcp-project-id-b', 'dataset_one'],
    ['gcp-project-id-c', 'dataset_one']
  ];
}
  • That’s it… that’s all you have to code yourself, and it’s really not even coding… it’s just pointing at IDs you already manage.
  • If you don’t already have a BQ table to store the results, don’t worry, you can do that from this project too.
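The repo’s helpers handle the actual size query for you, but it’s worth knowing what happens under the hood. BigQuery exposes a `__TABLES__` meta-table per dataset, with `table_id`, `size_bytes` and `row_count` columns. The select query the project builds per dataset can be as simple as this sketch (the function name matches the repo, but the body here is my illustrative assumption, not the repo’s exact code):

```javascript
// Illustrative sketch only -- the repo's actual construct_select_query may differ.
// __TABLES__ is a BigQuery meta-table listing every table in a dataset,
// along with its size in bytes and its row count.
function construct_select_query(datasetId) {
  return 'SELECT table_id, size_bytes, row_count ' +
         'FROM `' + datasetId + '.__TABLES__` ' +
         'ORDER BY table_id';
}
```

Running a query like that against each project/dataset pair from project_dataset_list() yields exactly the three columns the job stores for every table.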

Enough setup, let’s do this…

  • First, let’s create our BQ table — open the Jobs.gs file, and run the create_tables_one_time() function and the set_sheet_headers() function once.
/**
 * Only run this once to create the initial tables to hold the collected data.
 */
function create_tables_one_time() {

  var my_tables = tables_to_create();

  for (var i = 0; i < my_tables.length; i++) {

    // generate correct function / table info from detected string
    var tableFunction = new Function('return ' + my_tables[i]);
    var thisTable = tableFunction()();

    var tableJson = constructTableJson(thisTable, BQ_PROJECT_ID, BQ_DATASET_ID);
    createTable(thisTable.tableId, BQ_PROJECT_ID, BQ_DATASET_ID, tableJson);

  }

}
/**
 * Only run this once to set the header row for the Google Sheet.
 */
function set_sheet_headers() {

  var sheet = SpreadsheetApp.openByUrl(SHEET_URL).getSheetByName(MAIN_SHEET_TAB_NAME);
  sheet.appendRow(["timestamp", "project", "dataset", "table", "size_bytes", "row_count"]);

}
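For reference, the BigQuery destination table needs a schema mirroring those sheet headers. I haven’t reproduced the repo’s constructTableJson here; this is a hedged sketch of what such a table definition could look like, using the JSON shape the BigQuery tables.insert API expects (the field names are assumptions based on the header row above):

```javascript
// Sketch of a table-definition builder matching the sheet header row.
// The real constructTableJson in the repo's Helpers.gs may differ in detail.
function constructTableJson(table, projectId, datasetId) {
  return {
    tableReference: {
      projectId: projectId,
      datasetId: datasetId,
      tableId: table.tableId
    },
    schema: {
      fields: [
        { name: 'timestamp',  type: 'TIMESTAMP' },
        { name: 'project',    type: 'STRING' },
        { name: 'dataset',    type: 'STRING' },
        { name: 'table',      type: 'STRING' },
        { name: 'size_bytes', type: 'INTEGER' },
        { name: 'row_count',  type: 'INTEGER' }
      ]
    }
  };
}
```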
  • That should make Google pop up a prompt asking you to allow your new Apps Script project to connect to BigQuery on your behalf, and to write to Google Drive/Sheets. Note, this is your project… so you’re just granting permission to… yourself :-)
Self-permission granted!
  • Run the job_get_bq_stats() function, and wait a few seconds (depending on how many tables you’re checking).
function job_get_bq_stats() {

  var this_timestamp = Utilities.formatDate(new Date(), "UTC", "yyyy-MM-dd'T'HH:mm:ss'Z'");

  var projects = project_dataset_list();

  // loop the outer array
  var stats_array = [];
  for (let i = 0; i < projects.length; i++) {

    var my_query = construct_select_query(projects[i][1]);
    var query_data = runQuery(projects[i][0], my_query);

    // if there are results, add the extra info (timestamp, project etc)
    // ready for storage/insert into our sheet/bq table
    if (query_data.length > 0) {

      for (let q = 0; q < query_data.length; q++) {

        stats_array.push([
          this_timestamp,
          projects[i][0],
          projects[i][1],
          query_data[q][0],
          query_data[q][1],
          query_data[q][2],
        ]);

      }

    }

  }

  // nothing collected? bail out rather than building an empty insert
  if (stats_array.length === 0) return;

  // write to bigquery
  var insert_query = construct_insert_query(stats_array);
  runQuery(BQ_PROJECT_ID, insert_query);

  // write to google sheet now
  // write collected rows to the sheet in one operation (quicker than individual appends)
  var ss = SpreadsheetApp.openByUrl(SHEET_URL).getSheetByName(MAIN_SHEET_TAB_NAME);
  ss.getRange(ss.getLastRow() + 1, 1, stats_array.length, stats_array[0].length).setValues(stats_array);

}
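The one helper worth a closer look is construct_insert_query, which turns the collected rows into a single multi-row INSERT so everything lands in BigQuery in one query. Here is a sketch under the assumption that the destination columns match the sheet headers; the repo’s real helper may build the statement differently:

```javascript
// Config.gs defines these in the real project; placeholders here so the
// sketch is self-contained.
var BQ_PROJECT_ID = 'your-gcp-project-id';
var BQ_DATASET_ID = 'your-gcp-dataset-id';
var BQ_TABLE_ID = 'your-gcp-table-id';

// Sketch of construct_insert_query: one INSERT statement covering every
// collected row. Column names are assumptions mirroring the sheet headers.
function construct_insert_query(rows) {
  var values = rows.map(function (r) {
    // timestamp/project/dataset/table are strings; size_bytes/row_count are numbers
    return "('" + r[0] + "', '" + r[1] + "', '" + r[2] + "', '" + r[3] + "', " +
           r[4] + ", " + r[5] + ")";
  }).join(', ');
  return 'INSERT INTO `' + BQ_PROJECT_ID + '.' + BQ_DATASET_ID + '.' + BQ_TABLE_ID + '` ' +
         '(timestamp, project, dataset, `table`, size_bytes, row_count) VALUES ' + values;
}
```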
  • After the function finishes executing, you can check either your BigQuery table or your Google Sheet to see the results.
Screenshot from Google Sheets with the collected BQ data
  • You should now see rows that contain your project ID, dataset ID, table ID, and then the table size (bytes) and number of rows.
I have the data!

Schedule this to run automatically

  • Click on the Triggers section in your Apps Script project, and follow the prompts to run the job_get_bq_stats() function daily (or hourly, depending on your needs).
Screenshot of setting up the trigger in appscript to run this job automatically
I’m triggered!

That’s it!

Expand on that…

The next thing I’d like to add would be alerting when datasets grow or change “too” much between checks.

Note, you can obviously connect this to a Google Data Studio dashboard to chart your table data over time, and to identify outliers etc.

Finally… here’s another link to the git repo for the code. Feel free to use, enhance, and augment it for your own needs.

https://github.com/usaussie/appscript-bigquery-reporter

I like to share :-)


Cloud stuff, data, analytics; Google, Internet2 Advisory Boards & working groups. Higher Ed IT since 2002. @techupover and @usaussie on twitter