Take the right action on your data, based on what the data really represent and not on what you think they are

In a previous article I showed how to use IBM Cloud Pak for Data to create an automatic process that discovers data and ingests them into a catalog while enforcing governance policies. One of the key elements of this process is the ability to recognize what kind of data are ingested. This is what is called Data Classification – not to be confused with classification in the ML context.
In this article I will go deeper into this particular topic and explain the concepts behind the data classification process as implemented in IBM Cloud Pak for Data or the IBM Information Server portfolio.
What is data classification?
If you search for a definition of data classification on the internet, you will probably find many different explanations. However, most sources will define it more or less like this:
Data Classification is the process of categorizing data in order to take more efficient actions on them.
This is quite a generic definition, which is often used to describe a higher level business classification of the data set itself, such as "Confidential Information", "Sensitive Information" or "Personally Identifiable Information". This kind of data classification can be helpful to implement a data protection policy or other data governance rules, but it doesn't tell you in detail what the data really represent.
In this article I will focus on a finer grained notion of data classification, where not only the data set as a whole needs to be categorized, but each field, or even each individual value, in the data set is associated with the most specific possible description of what it represents.
With this in mind, let me restate the definition of what I mean by data classification:
Data Classification is the process of identifying the most specific type of information represented by a field or an individual value in a data set, with regard to the real-world entities that the data relate to.
Well, this doesn't make the definition simpler, but it describes more accurately what this article is about.
Concretely, this means that if you have a data set as a CSV file or a spreadsheet with columns containing values but no real metadata describing their meaning – or if you have a table with cryptic column names – then data classification should allow you to reverse-engineer what type of information each column of the data set contains, just by looking at the data.
Why is data classification necessary?
In a perfect world, where everything is perfectly documented, you wouldn’t have to care about data classification, because you would only have to look at the documentation of a data set to find out what each field represents in the real world.
In reality, however, such documentation is not always available. And if it is, it may be outdated as the data model evolves over time. In the best case the data set may use field names which are self-explanatory, but you never know whether a field originally used for a particular purpose has been misused for a different purpose.
But why classify data at all?
Knowing what data really represent is fundamental in order to decide what you can do or what you have to do with them:
- If you are building a ML model, you need to understand what each field really represents in order to choose the right features to be used by the model.
- If you are applying a ML model on data, you need to be sure that the fields really contain what the model expects.
- If you are measuring the data quality (see my previous article on this topic), knowing the classification of the data defines what kind of expectations you should have of them. For example, knowing that a field contains phone numbers automatically defines how values in that field should be validated.
- If you are migrating data from one system to another, data classification will help you identify where information required by a target field should come from, and ensure that the data are compatible.
- If you are responsible for the data governance and the enforcement of data protection rules (see my previous article on this topic), the data classification will help you ensure that you really protect sensitive data wherever they are, even if they are hidden in undocumented fields having cryptic names.
These are only a few examples…
How does data classification work?
The first step in the data classification process is to find out what kind of information you want/need to detect in the data. Each type of information to be detected is what is called in the IBM portfolio a data class.
Next, you need to figure out how each type of information can be detected. Each data class is associated with a data classifier, which is basically a test or an algorithm applied to the data to determine whether the searched data class has been found or not.
Optionally you can associate the data classes with rules that automatically take some action on the data when a data class is detected. Examples of such actions are automatically tagging the data set or data field with a business term or a business classification (e.g. PII, or Sensitive Information), or applying a specific data quality constraint to the fields where the data class has been detected.
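As a purely illustrative sketch of these building blocks (the names and structure below are my own, not the IBM product API), a data class can be thought of as a detection test plus a follow-up action to run on a positive match:

```java
// Minimal sketch (not the IBM API): a data class bundles a detection test
// (the classifier) with an action to run when the class is detected.
import java.util.function.Predicate;

public class DataClassActionSketch {

    record DataClass(String name, Predicate<String> classifier, String businessTerm) {}

    public static void main(String[] args) {
        // Hypothetical data class: values matching this regex are email addresses.
        DataClass email = new DataClass(
                "Email Address",
                v -> v.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+"),
                "Personally Identifiable Information");

        String value = "jane.doe@example.com";
        if (email.classifier().test(value)) {
            // Example of an automated action: tag the field with a business term.
            System.out.println("Detected " + email.name()
                    + " -> assign business term '" + email.businessTerm() + "'");
        }
    }
}
```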
Ideally you don't have to start this process from scratch but can use a predefined model, like the IBM Industry Data Models, which provide a set of business terms and data classes that are relevant for a given industry.
Even if you don't use a complete industry model, data classification tools usually provide an out-of-the-box list of standard data classes for common concepts. You can then review which data classes are relevant for your use case, which should be modified, and which are missing and should be created.

Once the list of data classes to detect is defined, the automatic data classification process can start. The analysis applies the classifiers of each data class to all the data and reports whenever a classifier returns a positive match. There are of course some optimization techniques to speed up the work, but the standard classification process is essentially a brute-force approach.
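In spirit, this brute-force pass looks something like the following simplified sketch (my own illustration, not the actual product code), where every classifier is applied to every value of every column and positive matches are counted:

```java
// Simplified illustration of the brute-force pass: apply every classifier
// to every value of every column and count the positive matches.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class BruteForceClassification {

    public static void main(String[] args) {
        // Hypothetical data classes with regular-expression classifiers.
        Map<String, Predicate<String>> classifiers = new LinkedHashMap<>();
        classifiers.put("Email Address", v -> v.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+"));
        classifiers.put("US Phone Number", v -> v.matches("\\d{3}-\\d{3}-\\d{4}"));

        // A toy data set: column name -> values.
        Map<String, List<String>> columns = new LinkedHashMap<>();
        columns.put("col_1", List.of("jane.doe@example.com", "john@example.org", "n/a"));
        columns.put("col_2", List.of("555-123-4567", "555-987-6543", "555-000-1111"));

        columns.forEach((column, values) ->
            classifiers.forEach((dataClass, test) -> {
                long matches = values.stream().filter(test).count();
                if (matches > 0) {
                    System.out.printf("%s: %d/%d values match '%s'%n",
                            column, matches, values.size(), dataClass);
                }
            }));
    }
}
```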
Defining a classifier
As you can see, the key to the problem is to define a good classifier which can detect the presence of a certain data class in the data.
The IBM products allow you to define a classifier as:
- a regular expression,
- a lookup to a reference table containing the values that the domain can have,
- a regular expression on the name of the field, to detect data classes based on the field names instead of, or in addition to, the data it contains,
- custom logic written in Java – at the time of this writing, this is only available in Information Server and not yet in Cloud Pak for Data.
Writing a classifier can be simple for domains where there is a clear definition of what a value must look like to belong to the domain. For instance a URL or an email address can be detected with a regular expression. Country codes can be detected with reference tables. Some other domains, like credit card numbers, may require more logic to verify the checksum of the number, but this can be done with a few lines of Java code.
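As an illustration of that last point, here is a minimal sketch (my own code, not the product's built-in classifier) of a credit card number check based on the well-known Luhn checksum:

```java
// Sketch of a credit-card-number classifier using the Luhn checksum.
public class CreditCardClassifier {

    // Returns true if the value is 13-19 digits long and passes the Luhn check.
    static boolean isCreditCardNumber(String value) {
        String digits = value.replaceAll("[ -]", "");
        if (!digits.matches("\\d{13,19}")) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {   // walk right to left
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isCreditCardNumber("4539 1488 0343 6467"));  // true (test number)
        System.out.println(isCreditCardNumber("1234 5678 9012 3456"));  // false
    }
}
```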
For other domains, where there is no such clear specification, detecting the domain can be a science by itself, and you need some creativity to invent a good heuristic which not only successfully finds the searched data classes, but also doesn't create too many false positives on data from other domains.
The notion of scope
Before we go deeper into the data classifiers, we first need to introduce the notion of the scope of a classifier.
The scope of a classifier is the smallest granularity of data on which a data classifier can successfully detect a data class.
This may not sound quite obvious, but some domains may require more contextual information than others in order to be successfully detected.
Scope "value"
As we saw before, it is easy to detect email addresses, URLs or credit card numbers just by looking at a single value, because the format of these values is very specific to the domain, and you don't need to know where a value comes from in order to detect whether it matches one of these data classes. Such classifiers work at the scope "value", which means that they can take a clear decision on every single value of any field of a data set.

Data classes implemented as a regular expression or a list of values are automatically of scope value. For a custom Java implementation, the scope needs to be specified and can be seen in IBM Cloud Pak for Data by looking at the definition of the data class, as shown in the previous screenshot.
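For illustration, a value-scope classifier backed by a list of reference values could look like the following sketch (the tiny hard-coded list of country codes stands in for a real reference table):

```java
// Sketch of a value-scope classifier backed by a list of reference values.
// A real implementation would load the reference table instead of hard-coding it.
import java.util.Set;

public class CountryCodeClassifier {

    // Tiny excerpt of ISO 3166-1 alpha-2 codes, for illustration only.
    private static final Set<String> COUNTRY_CODES =
            Set.of("US", "DE", "FR", "GB", "IT", "ES", "CH", "JP", "CN", "IN");

    // Scope "value": the decision can be made on a single value in isolation.
    static boolean isCountryCode(String value) {
        return COUNTRY_CODES.contains(value.trim().toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(isCountryCode("de"));     // true
        System.out.println(isCountryCode("Texas"));  // false
    }
}
```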
Scope "column"
On the other hand, if you want to detect that a column contains some free text, it may not be enough to look at a single value to conclude that the information is free text. To reach a conclusion you probably need to look at multiple values in that column and see how metrics like the length of the values or the number of tokens are distributed over the column as a whole.
The same is true for domains where there is a high risk of collision with other domains, like names that could be last names as well as company names, professions or city names. None of these domains can be described with a clear boolean test on a single value; they require more context, in the form of more values coming from the same column or additional information about the column containing these values.
For such domains, the scope of the classifiers would be "column", meaning that they cannot take a clear match/no match decision on an individual value, but can take a decision for a column as a whole.
Unless a data class is defined to only look at the column name to take a decision, classifiers with a scope "column" are usually implemented with custom logic. In the IBM products this would be a Java class or a JavaScript function.
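As an illustration of what such custom logic might look like, here is a sketch of a column-scope classifier for free text (a heuristic of my own with invented thresholds, not the product's built-in "Text" classifier):

```java
// Sketch of a column-scope classifier: decides for a whole column whether
// it looks like free text, based on metrics computed over all of its values.
import java.util.List;

public class FreeTextColumnClassifier {

    // Heuristic thresholds chosen for illustration only.
    static boolean looksLikeFreeText(List<String> values) {
        if (values.isEmpty()) return false;
        double avgLength = values.stream().mapToInt(String::length).average().orElse(0);
        double avgTokens = values.stream()
                .mapToInt(v -> v.trim().split("\\s+").length).average().orElse(0);
        long distinct = values.stream().distinct().count();
        double uniqueness = (double) distinct / values.size();
        // Free text tends to be long, multi-token and mostly unique.
        return avgLength > 20 && avgTokens > 3 && uniqueness > 0.8;
    }

    public static void main(String[] args) {
        List<String> comments = List.of(
                "Customer called about a late delivery and asked for a refund",
                "Package was damaged on arrival, replacement shipped",
                "No answer after three attempts, will retry next week");
        List<String> codes = List.of("A1", "B2", "A1");

        System.out.println(looksLikeFreeText(comments)); // true
        System.out.println(looksLikeFreeText(codes));    // false
    }
}
```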
For some domains you may even need more than one column to take a decision. This could be the case when the data class of a column can only be detected accurately if another data class has been detected in one of the neighbour columns.
As a concrete example, imagine that you are looking at a column containing dates. Taken alone you can only conclude that it contains dates. But if you see that the column preceding this column in the data set contains credit card numbers, you may be able to classify it more accurately as a credit card expiration date.
IBM Information Server defines a special scope "dataset columns" for this kind of classifier. They need to be implemented with custom Java or JavaScript logic.
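A purely illustrative sketch of such a classifier could refine a column of dates into a credit card expiration date only when a neighbouring column has already been classified as credit card numbers:

```java
// Sketch of a "dataset columns" classifier: the decision about one column
// depends on the data classes already detected on its neighbours.
import java.util.List;

public class ExpirationDateClassifier {

    // inferredClasses holds the data class detected so far for each column,
    // in column order; columnIndex is the column we are trying to refine.
    static String refineDateColumn(List<String> inferredClasses, int columnIndex) {
        boolean neighbourIsCard =
                (columnIndex > 0
                    && "Credit Card Number".equals(inferredClasses.get(columnIndex - 1)))
             || (columnIndex + 1 < inferredClasses.size()
                    && "Credit Card Number".equals(inferredClasses.get(columnIndex + 1)));
        // Without context the column is just "Date"; with a card column next to it,
        // it is much more likely to be the card's expiration date.
        return neighbourIsCard ? "Credit Card Expiration Date" : "Date";
    }

    public static void main(String[] args) {
        List<String> classes = List.of("Customer Name", "Credit Card Number", "Date");
        System.out.println(refineDateColumn(classes, 2)); // Credit Card Expiration Date
    }
}
```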
Inferred, selected and found data classes
No matter whether data classifiers work at the scope "value" or "column", they are all used to classify a data field as a whole. As mentioned earlier, one of the goals of Data Classification is not only to find individual values of a certain data class within the data, but also to reverse-engineer the missing metadata of the data set and determine the most probable usage of each field.
Let us expand our data classification terminology with a few more terms used in the IBM products:
The inferred data class of a data field is the most probable data class that this field represents from a data model point of view.
A data classifier working at scope "value" checks which data classes are matched by each value of a field. If a large majority of the values match the same data class, then it is very likely that the data field was intended to contain information of that type. The data class with the most matches becomes the inferred data class of the field.
In other words, the inferred data class is what the system thinks – from looking at the data – a data field is supposed to represent.
There may be situations where a human decides that a different data class represents a field better than the inferred data class. That may be because the classifier took a wrong decision, or because, after review, there is another data class without a good classifier which better represents the field, or because the analysis was done only on a subset of the data, which biased the result, and a different, more generic data class would be a better choice.
When a human overrides the inferred data class with another data class, this data class becomes the selected data class of the field. The selected data class is then what is used when validating the data, assigning terms or running some automations based on the data classification.
Even when an inferred data class can be determined for a field, individual values in that field may match a different data class. Those data classes are the found data classes. There may not be enough values matching those data classes for them to be relevant for the inferred data class, but they are findings that need to be reported.
The notion of confidence
The next thing to understand is the notion of confidence:
The confidence of a data classification is the likelihood that a classified asset really contains information of the detected data class.
As I said previously, not all domains can be detected using a boolean match/no match algorithm which always gives a clear decision. For many domains, there is a notion of uncertainty that needs to be captured in order to identify the best data class. A good classifier should not only successfully detect data classes but must also be able to assess the confidence of the data classification.
Classifiers working at the scope "value" – like regular expressions or lookups to lists of reference values – can usually give a clear match/no match decision for every value of a data field. But data may be dirty, and a field may contain values belonging to different data classes. Or the classifier may not be perfect and only recognize part of the values as matching the data class. For this reason the data classification of a field is always associated with a confidence.
In the case of a classifier of scope "value", we can compute the confidence of a data classification for the field as the percentage of non-missing values which match the data class. It is important to note that only the non-missing values should be considered when computing the confidence, because the fact that a value is missing is not an indication that the presumed data class of a field is incorrect. All it indicates is that the information for that field is not mandatory.
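In code, this confidence computation for a value-scope classifier could look like the following sketch (my own illustration of the formula described above):

```java
// Sketch: confidence of a value-scope classification for a field =
// matching values / non-missing values (missing values are ignored).
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class ConfidenceComputation {

    static double confidence(List<String> values, Predicate<String> classifier) {
        List<String> nonMissing = values.stream()
                .filter(v -> v != null && !v.isBlank())
                .toList();
        if (nonMissing.isEmpty()) return 0.0;
        long matches = nonMissing.stream().filter(classifier).count();
        return (double) matches / nonMissing.size();
    }

    public static void main(String[] args) {
        Predicate<String> email = v -> v.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");
        // 3 non-missing values, 2 of them match -> confidence of about 67%.
        List<String> field = Arrays.asList(
                "jane@example.com", "john@example.org", "unknown", null, "");
        System.out.printf("confidence = %.2f%n", confidence(field, email));
    }
}
```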

There is no standard formula for how classifiers of scope "column" should compute their confidence. It is up to the implementer of the classifier to decide how to return a confidence between 0% and 100% representing the likelihood that the classification is correct. Such classifiers will not generate "found data classes", since they do not classify individual values and only consider a field as a whole. But they can generate "data class candidates", representing different possible data classes that could potentially describe the data field, each of them having a different confidence.
Determining the inferred data class of a field
By applying all data classifiers to all the data fields to be analyzed, the analysis identifies for each field a list of data class candidates, each associated with a confidence. In most cases the confidence will be less than 100%. The question is: which data class, if any, should be assigned as the inferred data class of a field?
This is where the notion of the data classification threshold comes into play:
The data classification threshold of a data class is the minimum confidence that a data classification must have to consider its data class as a possible inferred data class of a field.
In the IBM products, there is a default global data classification threshold of 75%. Each data class however can override this default, which may make sense in some cases where you know that the classifier doesn’t cover all possible values of a domain and a lower confidence is therefore expected.

The inferred data class will be chosen among the list of data class candidates which have a confidence higher than their confidence threshold – or higher than the global confidence threshold in case no specific confidence threshold is defined for the data class.
If there is more than one candidate above the threshold, the priority of the data class is used in addition to the confidence.
The priority of a data class is the order of preference to use to determine which data class candidate should become the inferred data class of a field.
The priority can be seen as a relative number used to prefer certain data classes over others, even when they do not have the highest confidence.
You may wonder why a priority is necessary at all and why the inferred data class is not simply the candidate with the highest confidence. The reason is that some data classes may be much more generic than others. If you look back at my definition of data classification at the beginning of this article, the goal is to find the most specific data class representing the data. Often the more generic a data class is, the higher its confidence will be, but the less valuable it is for the data steward. Let me illustrate that with a simple example:
Assume that we have a data class for "Text", representing all kind of data that may be entered in a free form text field. Such a data classifier would check that the values of the analyzed fields are mostly unique, do not have a constant format, have more than one token, etc…
Assume that we have a second data class for "Postal address", recognizing any free text value having certain tokens like "Street" or "Avenue" as being an unstandardized mail address.
Any data field containing unstandardized mail addresses would be classified as "Text" with a confidence of 100%, because a list of mail addresses can be seen as free text information. But at the same time the same field would be classified as "Postal address" with probably a lower confidence – let's say 80% – because the address classifier may miss a few exotic addresses which do not have the most common tokens you would expect in an address.
Without the notion of priority, such a field would always be classified as "Text" and the more useful classification "Postal address" would be missed, because its confidence is lower than the confidence of the more generic class. For this reason, the data class for "Postal address" should use a higher priority, so that it takes precedence over "Text" even if it has a lower confidence.
It is not very difficult to find very generic data classes with a very high confidence. It is more difficult, but also more useful, to find the most specific yet still relevant data class of a field, even if we lose a bit of confidence.
Taking all these notions together, the inferred data class of a data field is determined by the following rules (see the sketch after this list):
- First consider only the data classes whose confidence is above their confidence threshold.
- If more than one data class matches this condition, then the classes with the highest priority in that list are considered.
- If there is still more than one candidate with the same priority, then the one with the highest confidence becomes the inferred data class.
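Put into code, this selection logic could look like the following sketch (my own simplified version, with invented threshold and priority values):

```java
// Sketch of the inferred-data-class selection: filter candidates by their
// threshold, then prefer higher priority, then higher confidence.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class InferredClassSelection {

    record Candidate(String dataClass, double confidence, double threshold, int priority) {}

    static Optional<String> inferredDataClass(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.confidence() >= c.threshold())               // rule 1: above threshold
                .max(Comparator.<Candidate>comparingInt(Candidate::priority) // rule 2: highest priority
                        .thenComparingDouble(Candidate::confidence))         // rule 3: highest confidence
                .map(Candidate::dataClass);
    }

    public static void main(String[] args) {
        // Hypothetical candidates for one field (thresholds and priorities invented).
        List<Candidate> candidates = List.of(
                new Candidate("Text", 1.00, 0.75, 10),
                new Candidate("Postal Address", 0.80, 0.75, 50));
        // "Postal Address" wins despite its lower confidence, thanks to its priority.
        System.out.println(inferredDataClass(candidates).orElse("unclassified"));
    }
}
```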
Summary
Hopefully this introduction has given you a better understanding of the data classification process and all its challenges, and how these challenges are addressed in the IBM products.
Data classification is only as good as the data classifiers used to analyze the data. Different domains may require very different approaches and creative solutions when it comes to implementing a good classifier.
Classification projects usually start with a customization of the list of data classes to search for. A good data classification framework should therefore provide enough flexibility to allow custom data classifiers of very different natures to work together, and it should be able to reconcile their outputs into one consistent result.