BigQuery, PII, and Cloud Data Loss Prevention (DLP): Take it to the next level with Data Catalog

How to create Data Catalog tags by inspecting all your BigQuery data with Cloud Data Loss Prevention.

Marcelo Costa
Towards Data Science


Background by Kelli Tungay on Unsplash

Do you know what kind of sensitive data your organization holds? Are you keeping track of every change applied across all your tables and columns? Are you confident you could answer the questions an auditor may have about data regulations?

Background by Sandra Grünewald on Unsplash

Having an auditor knocking on your door is not the scariest thing; data breaches can be way scarier! From fines to customer loss and legal ramifications, the consequences can be brutal and long-lasting.

Classifying and managing your sensitive data is a good way to start dealing with this, but doing it manually is not the way to go.

A little food for thought: who wants to go over thousands or millions of data assets and label each one as sensitive or not? And what if tables have confusing column names? I won’t even start on how error-prone that is, but you get the idea…

We need to automate this workload somehow.

In this article, we will look at a fully automated solution to discover sensitive data across all your BigQuery assets by using DLP and Data Catalog (whenever I mention DLP, it stands for Data Loss Prevention). If you are not familiar with DLP and Data Catalog, the first and second parts will interest you.

Otherwise, jump straight to the third part if you want to learn how the DLP and Data Catalog magic works.

Themes covered in this article are:

  1. Cloud Data Loss Prevention overview
  2. Data Catalog overview
  3. DLP to Data Catalog Tags overview
  4. Script execution walkthrough

Without further ado, let’s go!

We will provide a GitHub repository with the working example.

Data Catalog overview

Data Catalog is a managed and scalable metadata management service that empowers organizations to quickly discover, understand, and manage all their data.

Data Catalog offers a central and more secure data catalog across Google Cloud, allowing organizations to have a unified view of all their data assets. The service automatically ingests technical metadata for BigQuery and Cloud Pub/Sub and allows customers to capture business metadata in schematized format via tags, custom APIs, and the UI, offering a simple and efficient way to catalog their data assets.

Features

Data Catalog Features from official docs

In this article, we will take advantage of Tag Templates, Tags, and Data Catalog Search.

For a quick view of the Data Catalog Tags structure:

Tag mental model from gcp-datacatalog-diagrams

We will go into more detail later on, but bear in mind that since we are focusing on BigQuery assets, each BigQuery table represents an Entry in Data Catalog. We will create Tags to classify the sensitive data, and Tag Templates will represent the DLP infoTypes.

DLP to Data Catalog Tags overview

Now that you are familiar with DLP and Data Catalog, let’s continue with the architecture overview:

Reference Architecture from dlp-to-datacatalog-tags

The script is a Java workload that users can execute on their local machine, in Cloud Shell, or on a Compute Engine VM.

If a quick video demonstration of the script execution piques your interest, take a look at:

Demonstration video from dlp-to-datacatalog-tags

Read the BigQuery resources

The workload starts by reading the BigQuery resources from a Google Cloud project; then we split the BigQuery rows into many chunks. We need this because of a usage limit on the DLP content API, where part of the magic happens.
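To make the chunking concrete, here is a minimal sketch of the idea; the helper name and chunk size are illustrative, not the tutorial’s exact code:

import java.util.ArrayList;
import java.util.List;

public class RowChunker {

  // Split the rows read from a BigQuery table into fixed-size chunks, so
  // that each DLP inspectContent call stays under the content API limits.
  static <T> List<List<T>> chunk(List<T> rows, int chunkSize) {
    List<List<T>> chunks = new ArrayList<>();
    for (int i = 0; i < rows.size(); i += chunkSize) {
      chunks.add(new ArrayList<>(rows.subList(i, Math.min(i + chunkSize, rows.size()))));
    }
    return chunks;
  }
}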

Call DLP

The DLP usage limits can be divided into two categories:

  • Request size

First, we need to make sure the request size does not exceed the usage limit; at the time of this writing, we ended up with the following value:

Code showing Request size limit from DlpDataCatalogTagsTutorial.java
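As a rough illustration of that check, the sketch below guards a request against a byte limit; the constant and method names are hypothetical, and the actual limit should be taken from the current DLP quotas page:

import java.nio.charset.StandardCharsets;

public class RequestSizeGuard {

  // Hypothetical constant: the DLP content API caps the request size
  // (about 0.5 MB per request at the time of writing).
  private static final int MAX_REQUEST_BYTES = 500_000;

  // Returns true if adding the next cell value would push the serialized
  // request past the limit, signaling that a new chunk should be started.
  static boolean wouldExceedLimit(int currentRequestBytes, String nextValue) {
    int nextBytes = nextValue.getBytes(StandardCharsets.UTF_8).length;
    return currentRequestBytes + nextBytes > MAX_REQUEST_BYTES;
  }
}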
  • Requests per minute

Second, there’s a Request per Minutes enforcing limit, so we added simple sleep and retry logic:

Code showing sleep logic from DlpDataCatalogTagsTutorial.java
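The idea behind that logic looks roughly like the sketch below; the retry count, the one-minute back-off, and the wrapper name are example values, not the tutorial’s exact configuration:

import com.google.api.gax.rpc.ResourceExhaustedException;
import java.util.function.Supplier;

public class DlpRetry {

  // If DLP rejects a call because the per-minute quota was exceeded,
  // wait for the quota window to reset and try again a few times.
  static <T> T callWithRetry(Supplier<T> dlpCall) throws InterruptedException {
    int maxAttempts = 3;
    for (int attempt = 1; ; attempt++) {
      try {
        return dlpCall.get();
      } catch (ResourceExhaustedException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        Thread.sleep(60_000); // back off for a minute before retrying
      }
    }
  }
}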

DLP Inspect Templates

Next, to call DLP we are going to use DLP Inspect Templates, which are useful for decoupling configuration information. In this use case, we use one to store our infoTypes:

DLP inspect template from dlp-to-datacatalog-tags

Cloud Data Loss Prevention (DLP) uses information types — or infoTypes — to define what it scans for. An infoType is a type of sensitive data, such as a name, email address, telephone number, identification number, or credit card number.

And finally, we create the Inspect request:

Code showing inspect content logic from DlpDataCatalogTagsTutorial.java
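For readers who prefer text to screenshots, a minimal sketch of such a request with the Java client library could look like the following; the column name, sample value, and parent format are placeholders, and the real code inspects whole chunks of rows:

import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.FieldId;
import com.google.privacy.dlp.v2.InspectContentRequest;
import com.google.privacy.dlp.v2.InspectContentResponse;
import com.google.privacy.dlp.v2.Table;
import com.google.privacy.dlp.v2.Value;

public class InspectChunk {

  // Inspect one chunk of BigQuery values using the inspect template
  // created earlier, instead of passing infoTypes inline.
  static InspectContentResponse inspect(String projectId, String inspectTemplateName)
      throws Exception {
    Table table =
        Table.newBuilder()
            .addHeaders(FieldId.newBuilder().setName("email_column").build())
            .addRows(
                Table.Row.newBuilder()
                    .addValues(Value.newBuilder().setStringValue("bob@example.com").build())
                    .build())
            .build();

    InspectContentRequest request =
        InspectContentRequest.newBuilder()
            .setParent("projects/" + projectId)
            // e.g. projects/<PROJECT_ID>/inspectTemplates/dlp-to-datacatalog-template
            .setInspectTemplateName(inspectTemplateName)
            .setItem(ContentItem.newBuilder().setTable(table).build())
            .build();

    try (DlpServiceClient dlp = DlpServiceClient.create()) {
      return dlp.inspectContent(request);
    }
  }
}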

Write Data Catalog Column Tags

Once we get the inspection results from DLP, we need to translate them into Data Catalog Tags. We start by retrieving the column name:

Code showing retrieve column logic from DlpDataCatalogTagsTutorial.java
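Conceptually, the column comes from the finding’s record location, as in this hedged sketch (it assumes table content was inspected and takes only the first content location):

import com.google.privacy.dlp.v2.Finding;

public class FindingColumn {

  // For table content, the column that produced a finding is reported
  // inside the finding's record location.
  static String columnNameOf(Finding finding) {
    return finding
        .getLocation()
        .getContentLocationsList()
        .get(0)
        .getRecordLocation()
        .getFieldId()
        .getName();
  }
}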

Next, we set up the Data Catalog Tag Template:

Code showing create template logic from DlpDataCatalogTagsTutorial.java
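A minimal sketch of creating such a Tag Template with the Data Catalog client library is shown below; the field names, display names, and the us-central1 location are illustrative choices, not the tutorial’s exact template:

import com.google.cloud.datacatalog.v1.CreateTagTemplateRequest;
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.FieldType;
import com.google.cloud.datacatalog.v1.LocationName;
import com.google.cloud.datacatalog.v1.TagTemplate;
import com.google.cloud.datacatalog.v1.TagTemplateField;

public class CreateDlpTagTemplate {

  // Create a Tag Template whose fields will hold the DLP results for a column.
  static TagTemplate create(String projectId, String templateId) throws Exception {
    TagTemplate template =
        TagTemplate.newBuilder()
            .setDisplayName("DLP findings")
            .putFields(
                "info_type",
                TagTemplateField.newBuilder()
                    .setDisplayName("DLP infoType")
                    .setType(FieldType.newBuilder()
                        .setPrimitiveType(FieldType.PrimitiveType.STRING).build())
                    .build())
            .putFields(
                "findings",
                TagTemplateField.newBuilder()
                    .setDisplayName("Number of findings")
                    .setType(FieldType.newBuilder()
                        .setPrimitiveType(FieldType.PrimitiveType.DOUBLE).build())
                    .build())
            .build();

    try (DataCatalogClient client = DataCatalogClient.create()) {
      CreateTagTemplateRequest request =
          CreateTagTemplateRequest.newBuilder()
              .setParent(LocationName.of(projectId, "us-central1").toString())
              .setTagTemplateId(templateId)
              .setTagTemplate(template)
              .build();
      return client.createTagTemplate(request);
    }
  }
}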

Remember the BigQuery Entry we saw in the Tag mental model? We need to build the Entry linkedResource:

Code showing linkedResource logic from DlpDataCatalogTagsTutorial.java
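The lookup itself can be sketched as follows; the linkedResource format is the documented BigQuery one, while the class and method names around it are illustrative:

import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.LookupEntryRequest;

public class LookupBigQueryEntry {

  // Build the linkedResource for a BigQuery table and look up its Data
  // Catalog Entry, which is where the column Tags will be attached.
  static Entry lookup(String projectId, String dataset, String table) throws Exception {
    String linkedResource =
        String.format(
            "//bigquery.googleapis.com/projects/%s/datasets/%s/tables/%s",
            projectId, dataset, table);

    LookupEntryRequest request =
        LookupEntryRequest.newBuilder().setLinkedResource(linkedResource).build();

    try (DataCatalogClient client = DataCatalogClient.create()) {
      return client.lookupEntry(request);
    }
  }
}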

And finally, we create the Tag request:

Code showing tag creation logic from DlpDataCatalogTagsTutorial.java

Wait a minute! datacatalogBatchClient? Beware that the recommended way to access Google Cloud APIs programmatically is by using the client libraries; in our case, we are dealing with a lot of Tags, so to gain performance we are batching those API calls.

For more details on how it works, take a look at DataCatalogBatchClient.
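For comparison, here is what a single (non-batched) Tag creation looks like with the standard client; the field keys and the example infoType value are illustrative, while the real script issues many of these calls through its batch client:

import com.google.cloud.datacatalog.v1.CreateTagRequest;
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Tag;
import com.google.cloud.datacatalog.v1.TagField;

public class CreateColumnTag {

  // Attach a Tag (built from the Tag Template above) to one column of the
  // BigQuery Entry found through lookupEntry.
  static Tag create(String entryName, String templateName, String column, long findings)
      throws Exception {
    Tag tag =
        Tag.newBuilder()
            .setTemplate(templateName) // projects/.../locations/.../tagTemplates/...
            .setColumn(column)
            .putFields("info_type",
                TagField.newBuilder().setStringValue("CREDIT_CARD_NUMBER").build())
            .putFields("findings",
                TagField.newBuilder().setDoubleValue(findings).build())
            .build();

    try (DataCatalogClient client = DataCatalogClient.create()) {
      CreateTagRequest request =
          CreateTagRequest.newBuilder().setParent(entryName).setTag(tag).build();
      return client.createTag(request);
    }
  }
}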

Script execution walkthrough

If you got to this point and are feeling adventurous, you may execute the script by running a Cloud Shell tutorial. I’ve automated every step of it, if you want to check it out:

I will skip the Enable Required APIs and Set up your Service Account steps, since they are common to most use cases. For the full walkthrough, please take a look at the GitHub repository above.

Create BigQuery tables

First, we create some BigQuery resources containing fake PII, so we can run the script against them later on.

# Install bq-fake-pii-table-creator using PIP
pip3 install bq-fake-pii-table-creator --user
# Create the first table
bq-fake-pii-table-creator --project-id $PROJECT_ID --bq-dataset-name dlp_to_datacatalog_tutorial --num-rows 5000
# Create a second table
bq-fake-pii-table-creator --project-id $PROJECT_ID --bq-dataset-name dlp_to_datacatalog_tutorial --num-rows 5000
# Create a third table with obfuscated column names
bq-fake-pii-table-creator --project-id $PROJECT_ID --bq-dataset-name dlp_to_datacatalog_tutorial --num-rows 5000 --obfuscate-col-names true

Create the inspection template

Second, we create the inspection template:

# Generate your OAUTH 2.0 token with gcloud
TOKEN=$(gcloud auth activate-service-account --key-file=$HOME/credentials/dlp-to-datacatalog-tags-sa.json && gcloud auth print-access-token)
# Call API using Curl
curl -X POST https://dlp.googleapis.com/v2/projects/$PROJECT_ID/inspectTemplates \
-H "Authorization: Bearer ${TOKEN}" \
-H 'Content-Type: application/json' \
-d '{
  "inspectTemplate":{
    "displayName":"DLP 2 Datacatalog inspection Template",
    "description":"Scans Sensitive Data on Data Sources",
    "inspectConfig":{
      "infoTypes":[
        {"name":"CREDIT_CARD_NUMBER"},
        {"name":"EMAIL_ADDRESS"},
        {"name":"FIRST_NAME"},
        {"name":"IP_ADDRESS"},
        {"name":"MAC_ADDRESS"},
        {"name":"PHONE_NUMBER"},
        {"name":"PASSPORT"}
      ],
      "minLikelihood":"POSSIBLE",
      "includeQuote":false
    }
  },
  "templateId": "dlp-to-datacatalog-template"
}'

At the time of this writing, gcloud does not support the create inspect template API call, so we are using curl.

We set up a predefined list of DLP infoTypes in the API call; you may change it or add new types, using the built-in infoTypes reference or creating custom ones.

Execute the script

For full details on the CLI args, and how to build the script below, please look at the community tutorial.

java -cp target/dlp-to-datacatalog-tags-0.1-jar-with-dependencies.jar com.example.dlp.DlpDataCatalogTagsTutorial \
-dbType "bigquery" \
-limitMax 1000 \
-dbName dlp_to_datacatalog_tutorial \
-projectId $PROJECT_ID \
-threadPoolSize 5 \
-inspectTemplate "projects/$PROJECT_ID/inspectTemplates/dlp-to-datacatalog-template" \
-minThreshold 100

In this execution, we scan only the dlp_to_datacatalog_tutorial dataset, run with 5 threads, and use the dlp-to-datacatalog-template we created before.

Let’s run it and check the results:

Script execution

After the script finishes, we can go to Data Catalog and verify the sensitive data:

Data Catalog Entry — Schema & column tags

For the first table, it’s clear from the column names what kind of sensitive data we have. Even so, by tagging those columns we are able to easily search for our sensitive data.

Do you remember that angry auditor knocking on your door?

Now, if they come asking whether we store any PII, like credit card numbers… We just run a simple query in Data Catalog! We can even use IAM to give them the Data Catalog Viewer role, so they can look for themselves and save us the visit.

Data catalog search looking for Credit Card Numbers

Jokes aside, the Data Catalog search syntax is really powerful, as it is backed by the same Google search technology that supports Gmail and Drive. Check the documentation to see all the possibilities.
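The same search can also be issued programmatically. The sketch below is a hedged example: the tag:credit_card_number query string is only an illustration of the predicate style, so check the search syntax reference for the exact qualifiers that fit your templates:

import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.SearchCatalogRequest;
import com.google.cloud.datacatalog.v1.SearchCatalogRequest.Scope;
import com.google.cloud.datacatalog.v1.SearchCatalogResult;

public class SearchSensitiveColumns {

  // Search the catalog within a single project and print the BigQuery
  // resources whose Tags match the query.
  static void search(String projectId) throws Exception {
    SearchCatalogRequest request =
        SearchCatalogRequest.newBuilder()
            .setScope(Scope.newBuilder().addIncludeProjectIds(projectId).build())
            .setQuery("tag:credit_card_number") // illustrative predicate
            .build();

    try (DataCatalogClient client = DataCatalogClient.create()) {
      for (SearchCatalogResult result : client.searchCatalog(request).iterateAll()) {
        System.out.println(result.getLinkedResource());
      }
    }
  }
}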

And do you remember that tricky table with the obfuscated column names?

Data Catalog Entry — Schema & column tags

DLP to the rescue! We can see that some columns were tagged, and by expanding the Tag details:

Data Catalog Tag details

Oh my! DLP found 303 CREDIT_CARD_NUMBER findings in that column from a 1,000-row sample. Thanks, DLP!

Closing thoughts

In this article we saw how to combine Data Catalog and DLP to discover sensitive data in your Google Cloud projects.

With GDPR, CCPA, and other regulatory acts becoming a reality, it’s now more important than ever to have a structured data governance process that allows you to know what data your company stores, where it is located, and how the business is impacted by it.

In this field, there is always room for improvement. What if we could extend this script to automatically block access to sensitive data when it is discovered? This is certainly possible by using Data Catalog policy tags, so stay tuned for new posts about it! Cheers!
