In today’s world, data collection and processing are regulated, and organizations have no choice but to comply. As a consequence, companies have started to rethink the way they design their information systems, data stores, and business processes with privacy in mind.
One foundational element of implementing data protection principles is Data Classification.
What is data classification?
Data classification is often defined as the process of organizing data into categories in a way that helps companies use and protect it more efficiently. In other words, it helps us understand the semantics of the data we hold so that we can protect it better.
However, it is often a hard problem to solve: some companies go the manual way, while others use ML to automatically classify their data sets. Either way, tackling this problem is expensive and can be ineffective depending on how and where the data is stored.
Snowflake to the rescue
If Snowflake is part of your data stack, you might want to leverage its native Data Classification feature. After scanning and analyzing the content and metadata of your data warehouse objects (tables, views, etc.), the feature determines the appropriate semantic and privacy categories. It helps you discover and tag PII data and significantly reduces the complexity and cost of governing and protecting your data.
But before you decide on using Snowflake’s native data classification feature, there are a few important things that you should consider:
1. Data Types
Although you can classify semi-structured data (VARIANT type columns with JSON objects), the feature is limited to analyzing a VARIANT column that holds a single data type, for example a varchar or a number. If your tables don’t contain any JSON fields, this shouldn’t be much of a problem. However, if you heavily rely on Snowflake’s ability to store and query semi-structured data, you should remember that it can’t be combined with the data classification feature. You will need to think about a multi-step process where (1) you flatten your columns and make sure each one holds a supported data type, then (2) you run the classification.
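As an illustration, here is a minimal sketch of that two-step approach. The table raw_events, its payload column, and the extracted fields are hypothetical; adapt the names and casts to your own schema.
-- Hypothetical source table: raw_events(payload VARIANT) holding JSON objects
-- Step 1: flatten the JSON into columns that each hold a single, supported data type
CREATE OR REPLACE VIEW raw_events_flat AS
SELECT
    payload:email::VARCHAR AS email,
    payload:phone::VARCHAR AS phone_number,
    payload:age::NUMBER    AS age
FROM raw_events;
-- Step 2: run the classification on the flattened view
SELECT EXTRACT_SEMANTIC_CATEGORIES('raw_events_flat');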
2. Integration
Speaking of processes, the second point is about finding the right step at which to perform the classification of your data. Most likely, you already have established data pipelines that feed many databases across different environments. So, at which point do you concretely classify your data? Right after loading it into the data warehouse, you might be thinking.
If so, is the data quality at this stage good enough to be reliably classified with high confidence? What about the data volume? Maybe it is better if the classification takes place further downstream after the data is cleaned and modelled, right? How will you handle compliance, governance, and security in that case? What about data that will never make it to the business/metrics layer? These are some of the questions that you need to answer thoroughly before even starting to classify your data.
3. Automation and scalability
In their blog, Snowflake describes the native data classification feature as if it removes all manual processes. This can be the case in ideal scenarios with tailored datasets; real-world use cases, however, look much different: data warehouses usually contain multiple environments, databases, and data shares. In fact, Snowflake offers three stored procedures: one to classify all tables in a schema, a second to classify all tables in a database, and a third to apply the classification findings to the classified object columns using tags. A manually triggered (or even scheduled) stored procedure simply doesn’t live up to expectations in terms of automation, scalability, and monitoring, especially because there’s no easy way to classify only new or changed objects.
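For illustration, scheduling one of these procedures with a Snowflake task looks roughly like the sketch below. The procedure name classify_schema, the warehouse, and the schema are placeholders for whatever you have deployed; note that the task still re-classifies every table on each run.
-- A scheduled run of a schema-level classification procedure (names are placeholders)
CREATE OR REPLACE TASK nightly_classification
  WAREHOUSE = classification_wh
  SCHEDULE  = 'USING CRON 0 3 * * * UTC'  -- every day at 03:00 UTC
AS
  CALL classify_schema('MY_DB.RAW');      -- re-classifies all tables in the schema
ALTER TASK nightly_classification RESUME; -- tasks are created in a suspended state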
In contrast with the blog article mentioned above, Snowflake’s documentation suggests a workflow where users can choose to manually review the classification output and modify it as necessary. The problem with this approach is that it’s hard to scale: not only because it involves human attention, but also because there is no user interface that facilitates the review and approval process. You need to build your own tooling to bridge this gap.
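As a starting point for such tooling, you can at least query which classification tags ended up where. The sketch below uses the standard TAG_REFERENCES_ALL_COLUMNS table function; the database, schema, and table names are examples.
-- List the classification tags applied to the columns of a given table
SELECT object_name, column_name, tag_name, tag_value
FROM TABLE(my_db.INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS('my_db.my_schema.customers', 'table'))
WHERE tag_database = 'SNOWFLAKE';  -- keep only the native SEMANTIC/PRIVACY_CATEGORY tags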
4. Performance
Performance assessment is multifaceted, but I will only discuss one aspect: full table scans.
To analyze columns in a table/view you need to run the following function:
EXTRACT_SEMANTIC_CATEGORIES('<object_name>' [,<max_rows_to_scan>])
Besides the object name (e.g. a table name), it takes one optional parameter, max_rows_to_scan, which represents the sample size. If you don’t explicitly set it to a number between 0 and 10000, it defaults to 10000 rows. At first, I thought that the sample size had an important impact on performance (query run time), but soon after experimenting with the feature, I realized that no matter how big or small I set the sample size, Snowflake performs a full table scan every time I call the function. The sample size mostly affects the accuracy of the classification result, not the performance. If you are planning to run the classification process on a frequent schedule, you should evaluate performance. If you find that the classification is slow, you can either throw more compute power at it to speed things up, or use techniques like fraction-based row sampling to bypass the full table scan.
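One way to do the latter, sketched below under the assumption that classifying a sample is acceptable for your use case, is to materialize a fraction-based sample into a transient table and classify that instead of the full table. The table names are placeholders.
-- Materialize a ~1% Bernoulli (row-based) sample to avoid scanning the full table
CREATE OR REPLACE TRANSIENT TABLE customers_sample AS
SELECT * FROM my_db.my_schema.customers SAMPLE (1);
-- Classification now scans only the sample; the trade-off is lower accuracy
SELECT EXTRACT_SEMANTIC_CATEGORIES('customers_sample');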
5. Extensibility
Once the EXTRACT_SEMANTIC_CATEGORIES function has run the classification algorithm, the next step is to apply the generated result to the target object columns as tags.
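In its simplest form, that application step looks roughly like this (the table name is an example):
-- Classify a table and apply the findings as tags in one statement
CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS(
  'my_db.my_schema.customers',
  EXTRACT_SEMANTIC_CATEGORIES('my_db.my_schema.customers')
);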
As of the publishing date of this article, the available classification tags and their allowed values are listed below:
{
  "name": [
    "PRIVACY_CATEGORY",
    "SEMANTIC_CATEGORY"
  ],
  "allowed_values": [
    [
      "IDENTIFIER",
      "QUASI_IDENTIFIER",
      "SENSITIVE",
      "INSENSITIVE"
    ],
    [
      "EMAIL",
      "GENDER",
      "PHONE_NUMBER",
      "IP_ADDRESS",
      "URL",
      "US_STATE_OR_TERRITORY",
      "PAYMENT_CARD",
      "US_SSN",
      "AGE",
      "LAT_LONG",
      "COUNTRY",
      "NAME",
      "US_POSTAL_CODE",
      "US_CITY",
      "US_COUNTY",
      "DATE_OF_BIRTH",
      "YEAR_OF_BIRTH",
      "IBAN",
      "US_PASSPORT",
      "MARITAL_STATUS",
      "LATITUDE",
      "LONGITUDE",
      "US_BANK_ACCOUNT",
      "VIN",
      "OCCUPATION",
      "ETHNICITY",
      "IMEI",
      "SALARY",
      "US_DRIVERS_LICENSE",
      "US_STREET_ADDRESS"
    ]
  ]
}
These tags are already defined for you and are stored in the CORE schema of the read-only shared SNOWFLAKE database. This means that if you want to automatically apply the tags using the ASSOCIATE_SEMANTIC_CATEGORY_TAGS stored procedure, you are limited to this list of available tags. Given that many identifiers and quasi-identifiers are US-focused, you might need to define your own list of tags. But the real challenge is to figure out how this new list will work together with the native one. As a result, you will go through extra steps such as creating and setting the tags yourself:
CREATE [ OR REPLACE ] TAG [ IF NOT EXISTS ] ...
ALTER TABLE ... MODIFY COLUMN ... SET TAG
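For example, a custom tag for categories missing from the native list could be created and applied as sketched below; the tag name, allowed values, and objects are placeholders, and nothing here is provided by Snowflake.
-- Define a company-specific semantic category tag (placeholder names and values)
CREATE TAG IF NOT EXISTS governance.tags.semantic_category_custom
  ALLOWED_VALUES 'EU_VAT_NUMBER', 'EU_PASSPORT', 'NATIONAL_ID'
  COMMENT = 'Semantic categories not covered by SNOWFLAKE.CORE';
-- Apply it (manually or from your own tooling) to a column you identified
ALTER TABLE my_db.my_schema.customers
  MODIFY COLUMN vat_number
  SET TAG governance.tags.semantic_category_custom = 'EU_VAT_NUMBER';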
Final thoughts
To sum up, designing and building a data classification solution is not an easy task. Snowflake provides a good starting point that abstracts away many challenges behind a single function call. However, don’t expect it to automagically scan your whole data warehouse and surface all PII as tags. Data engineers still need to architect the end-to-end process, including (but not limited to) building tooling to facilitate the manual review process and optimizing for data volume, budget, and usage patterns. The five points listed above might not cover every aspect of productizing the PII classification feature in Snowflake, so if you have something to add, or if you think some aspects can be addressed with a better approach, please write a comment and share your thoughts.