The world’s leading publication for data science, AI, and ML professionals.

Catalog external assets for a 360° data lineage

How to customize IBM Watson Knowledge Catalog to support any kind of external assets and improve your data governance.

Photo by DreamQuest on pixabay
Photo by DreamQuest on pixabay

In a previous article I have shown how to automatically catalog a high number of data sets by using IBM Cloudpak for Data and in particular Watson Knowledge Catalog.

A good enterprise catalog is the key stone of Data Governance. It is the place where the existence of all data and governance assets can be documented as well as their relationships with each others. Capturing the relationships between the assets is essential in order to answer data lineage questions, determine the dependency graph of any given asset, or do an impact analysis.

For example, a well maintained catalog should help answer questions like:

Where do the data from this data set come from?

Where are the data from this data set used?

Which asset will be impacted if this asset is modified or deleted?

A business catalog can only answer these questions if all the assets being directly or indirectly related to the assets of interest are properly catalogued. Such assets can be of very different nature like:

  • ETL flows – or data flows in general – moving and transforming data from a source to a target.
  • Reports using directly or indirectly the data from a data set.
  • ML models being built from data derived from a data set.
  • Jupyter notebooks running experiments from data obtained from a data set.
  • etc…

This list can be long and the type of assets playing an important role in the Data Lineage may be very different depending on the technology used by your company.

To complicate the problem, real life scenarios seldom involve only homogeneous solutions from a single vendor. It is very frequent that companies use solutions from different vendors, running on different systems but all playing a role in the way how data assets are produced, transformed or consumed.

No software vendor can provide a business catalog which can support out of the box all kind of assets you may encounter in a real life scenario. Therefore it is critical that the catalog can be customized to support new asset types, load them and captures their relationships with other catalogued assets, no matter what the nature of these new assets is.

The goal of this article is to show how to do this with IBM Watson Knowledge Catalog.


IBM Watson Knowedge Catalog

IBM Watson Knowledge Catalog is the business catalog of IBM Cloudpak for Data and supports out of the box very different asset types such as:

  • Data Connections
  • Data Assets (Tables, files, schemas, databases , etc…)
  • Business Terms
  • Data Classes
  • ML models
  • etc…

The list is not exhaustive and keeps increasing for each new version of Cloudpak for Data.

But like any software, unless your processes only involve assets created with Cloudpak for Data, the chances are high that your data and governance flows involve external assets which are not supported out of the box by the catalog, but still should be catalogued in order to get a complete data lineage. Such external assets could be ETL flows, reports or data models from other vendors, physical devices in an IoT flow, or assets from other IBM products managed outside Cloudpak For Data.

Fortunately IBM Watson Knowledge Catalog provides an API to register new type of assets and load external assets via a REST API. We will see in the rest of this article how to use this API with a simple example.

Catalog IBM Information Analyzer Data Rules

The customization of Watson Knowledge Catalog could be demonstrated with any kind of asset, but for the purpose of this article, I will demonstrate the process by showing as an example how to add support for IBM Information Analyzer data rules running in a classical Information Server outside Cloudpak for Data. You should be able to transpose the procedure to a different type of asset.

Overview of Information Analyzer data rules

Data rules is an asset type created and executed in IBM Information Server, which defines a non trivial data quality constraint that has to be verified by all rows of a data set where the rule is applied. For more details on data quality constraints, see my previous article "How to quantify Data Quality".

Data Rules defined in an IBM Information Analyzer project
Data Rules defined in an IBM Information Analyzer project

Explaining the details of data rules is outside of the scope of this article, but for the purpose of this exercise, the only thing that you need to understand for the moment is that a data rule is an asset with the following characteristics:

  • It is an asset defined in an IBM Information Server installation on a host which may be different than where IBM Cloudpak for Data is running.
  • Like any other asset, a data rule has a name and description and an ID which uniquely references it within the system where it is defined.
  • A data rule is defined in the context of a data quality project – also named workspace.
  • A data rule is related to at least one data set, but it can also be applied to the result of a join of multiple data sets.
Binding details of a data rule using two data sets
Binding details of a data rule using two data sets

Information Analyzer data rules have other properties – like the binding details or output definitions, etc… – but in order to keep the example simple, we will only consider the previously listed properties for the purpose of this exercise, because they are sufficient to capture the dependencies between the data rules and the other assets.

An IBM Information Server installation may have hundreds of data rules defined it it. Each of them may be related to different data assets. In the rest of this article, we will see the different steps necessary to:

  1. Define a new asset type to represent an Information Analyzer data rule in IBM Watson Knowledge Catalog
  2. Extract the relevant metadata of all data rules of an external IBM Information Server.
  3. Load the meta-data of the extracted data rules into the catalog and create the relationships between the new loaded data rules and the already catalogued data assets.

In order to keep this example simple, we will assume that all data assets referenced by the data rules to load are already catalogued in Watson Knowledge Catalog. This could be done by running an automatic discovery of the data sources referenced by the data rules, as described in my previous article.

Defining a new custom asset type in Watson Knowledge Catalog:

In order to simplify the commands that we will use along this article, let us first define in an UNIX shell an environment variable _WKC_HOST_ containing the host name of the Cloudpak for Data system where we want to catalog the data rules.

export WKC_HOST=cpd_cluster_host

(You need to replace cpd_cluster_host with the host name of IP address of the target system)

Next, let us obtain an authentication token by running the following command as described in the CP4D documentation.

curl -k -X POST https://$WKC_HOST/icp4d-api/v1/authorize -H 'cache-control: no-cache' -H 'content-type: application/json' -d '{"username":"username","password":"password"}'

(Replace username and password with the credential details of an authorized user)

The response of that command is a JSON document as follow:

{"_messageCode_":"200","message":"Success","token":"...some_long_string..."}

Let us copy the value of the token property and put it in another environment variable, so that we don’t need to copy that token in each command:

export TOKEN=...some_long_string...

Now that we have an authentication token, we are ready to use the WKC REST API to manipulate the asset types of Watson Knowledge Catalog as described in the documentation. You can use the REST API to explore the default asset types and retrieve the JSON representation of their definition to better understand how they are defined.

In our case, we will create a new custom asset type named data_rule, by using the REST API POST {service_URL}/v2/asset_types?catalog_id={catalog_id} as described in this section of the documentation.

Custom asset types are specific to a catalog. Watson Knowledge Catalog may contain multiple catalogs, so that you need to find out the ID of the catalog where you want to import the new assets. You can easily determine the id of the catalog by opening it in the UI and looking at the URL in your browser. The catalog ID is a long string just after the path element /catalogs/ in the URL, as highlighted in the following example:

https://yourserver.yourcompany.com/data/catalogs/b134fdf3-7d3f-40e0-835b-7ce54a62cc7e?context=icp4data

Here again, let us copy that ID and put it into an environment variable:

export CATALOG_ID=<catalog_id pasted from the URL>

The POST command to create a new asset type requires as a body a JSON document which defines the meta-data of the new asset type to create. At a minimum, the document should contain the name and description of the new asset type and a description of its fields. In our case, we will use the following properties:

  • Name: _datarule
  • description: Information Analyzer Data Rule
  • asset fields:
  • host: host name of the Information Server system containing the rule.
  • projectName: name of the Information Analyzer project defining the rule.
  • rid: unique identifier of the data rule in the source Information Server.

Note that we don’t need to define a property for the data assets related to the data rules, as these will be defined by using asset relationships which WKC allowed us to define between any kind of asset.

Based on this initial minimum list of fields to catalog for a data rule, the payload of the REST API to create the new asset type can be defined as follow:

{
  "description": "Information Analyzer Data Rule",
  "name": "data_rule",
  "fields": [
     {
        "key": "host",
        "type": "string",
        "facet": true,
        "is_array": false,
        "search_path": "host",
        "is_searchable_across_types": true
     },
     {
        "key": "projectName",
        "type": "string",
        "facet": true,
        "is_array": false,
        "search_path": "projectName",
        "is_searchable_across_types": true
     },
     {
        "key": "rid",
        "type": "string",
        "facet": false,
        "is_array": false,
        "search_path": "rid",
        "is_searchable_across_types": true
     }
  ]
}

Save the definition of the new asset type in a file _data_rule_asset_type.json_ and run the following command to create the new asset type in the target catalog.

curl -k -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer $TOKEN" -d @data_rule_asset_type.json "https://$WKC_HOST:443/v2/asset_types?catalog_id=$CATALOG_ID"

If the command is successful, it should return a success code (200) with the JSON definition of the new created asset type.

Let us next try to create a sample asset of the new type by using the API POST {service_URL}/v2/assets?catalog_id={catalog_id} as described in the documentation.

Like for the creation of the new asset type, we need to prepare a payload in JSON format describing the new asset to create in the target catalog.

Create a file dummyDataRule.json with the following content:

{
  "metadata": {
    "asset_type": "data_rule",
    "name": "Dummy Data Rule 1",
    "description": "Some Dummy Data rule",
    "tags": ["tag1", "tag2"],
    "origin_country": "us",
    "rov": {
      "mode": 0
    }
  },
  "entity": {
    "data_rule": {
        "host": "iaHost1",
        "projectName": "iaProject1",
        "rid": "rid1" 
    }
  }
}

This payload contains the minimum information required to create a new asset:

  • _asset_type defines the type of the asset to create. In our case we use the newly created custom asset type data_rule_.
  • name and description are the name and description of the new dummy data rule that we want to create.
  • tags is optional and contains an array of catalog tags to associate to the new created asset. Tags can be used to facilitate the search or to provide additional information on the asset. In this example we associate two dummy tags tag1 and tag2.
  • _origin_country and rov_ define the location and permission of the asset. For the purpose of this article, we will use some default values.
  • The field entity must contain a field of the same name as the asset type – _data_rule_ in our case – which itself contains the field values that we defined when creating the new custom asset type in the previous step. In our case, based on the previous definition of the asset type datarule, we can define 3 properties host, projectName and rid_.

Let’s submit the creation of this new data rule in the catalog by running the following cURL command:

curl -k -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer $TOKEN" -d @dummyDataRule.json "https://$WKC_HOST:443/v2/assets?catalog_id=$CATALOG_ID"

If the command is successful, it should return a JSON description of the new created asset containing its id and you should be able to see the new created dummy data rule in the Watson Knowledge Catalog UI when opening the target catalog.

Dummy data rule loaded in Watson Knowledge Catalog
Dummy data rule loaded in Watson Knowledge Catalog

Create relationships with other assets

Like for any other asset, the new created custom asset can have relationships to other assets. You can define those relationships by using the UI, but in our example we want to do that programmatically, since we want to automatically import a large number of assets.

The API for creating a relationship between assets is documented under this link. In short, the API is PUT /v2/assets/{asset_id}/relationships/{relationship_name}?catalog_id={catalog_id}with a JSON body specifying the asset and catalog ids of the target assets.

In order to test this feature, let’s find out the id of a data set in the same catalog and define a relationship of type "uses/used by" between our dummy data rule and that data set. An easy way to find out the id of a data set is to open the data set in the UI and copying its id from the URL in the browser.

The JSON payload that we have to pass to the PUT command to create the relationship is as follow:

{
  "relationship_targets": [
    {
      "catalog_id": "19f11301-8392-4733-b747-751f5230e566",
      "asset_id": "81fa1a02-4773-4801-a53b-1d374fb83b07"
    }
  ]
}

Replace the values for the _catalog_id and the asset_id with the ids for the data asset – or it can be any other asset type – that you want to link to the data rule and save the payload in a file addRelationship.json. Then create the relationship by using the following cURL command – you will need to replace the placeholder _ with the id of the dummy rule created previously.

curl -k -X PUT --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer $TOKEN" -d @addRelationship.json "https://$WKC_HOST:443/v2/assets/<datarule_id>/relationships/uses?catalog_id=$CATALOG_ID"

After the command has run you should be able to see the new relationship of type uses when opening the data rule in the UI. If you open the data set, you will see the reverse relation ship of type used by pointing to the data rule.

New relationship between the data rule and a data asset
New relationship between the data rule and a data asset

Automating the process

Now that we have successfully defined the new asset type and tested that data rules can be imported into the catalog by using REST APIs, let’s automate the process to extract and catalog all data rules from an external IBM Information Server installation.

The first problem is to extract the meta-data of the rules. Information Server provides several mechanisms to do that. You can either use a REST API, or write SQL queries against the meta-data database (XMETA). In this article we will use SQL.

Explaining the SQL API of IBM Information Server is out the scope of this article. For more details you can refer to the Information Server documentation. For the simplicity of the article, I have prepared a ready-to-use SQL query which allows to retrieve the name and description as well as the path of the related data sets for all data rules of a given Information Analyzer project:

SELECT DISTINCT PROJECT.NAME AS PROJECT, 
       RULE.NAME, 
       RULE.SHORTDESCRIPTION AS DESCRIPTION, 
       CONCAT('/',CONCAT(COL.SCHEMA,CONCAT('/',COL.TABLE))) AS PATH FROM IAVIEWS.IARULE RULE
INNER JOIN IAVIEWS.IAPROJECT PROJECT ON RULE.PROJECTRID=PROJECT.RID
INNER JOIN IAVIEWS.IARULEBINDING BINDING ON BINDING.RULERID=RULE.RID
INNER JOIN IAVIEWS.IAREGISTEREDCOLUMN COL ON BINDING.TARGETRID=COL.RID
WHERE PROJECT.NAME='<ProjectName>'

This query will return a result as shown in the next illustration. If a rule is bound to more than one table, the result will contain one row for each bound table. Therefore in the example shown below, the project _"test_rules" contains 2 data rules, one is named "child_test_BANK_CLIENTS" and is bound to a data set with the path /BANK1/BANK_CLIENTS. The other one is named "credit_card_age_BANK2" and is bound to two data sets with the paths /BANK2/BANK_ACCOUNTS and /BANK2/BANK_CUSTOMERS_.

Result of the SQL query to retrieve the data rules of an Information Analyzer project
Result of the SQL query to retrieve the data rules of an Information Analyzer project

From there, we need to write some script that will:

  • run the previous SQL query to retrieve all data rules of selected Information Analyzer projects from an IBM Information Server installation
  • for each extracted data rule, execute the Watson Knowledge Catalog REST API to create a new data rule asset in a target catalog
  • search the catalog for data sets with the same path as the data sets referenced by the data rules, and create the associations between the new created data rules and the resolved data sets in the catalog.

In order to keep this article simple, I have implemented all these steps in Python in a Jupyter noteboook:

After running all the cells of the notebook – you will need to adjust a few variables with the details of your system, as described in the notebook – , you will see in Watson Knowledge Catalog all the external data rules represented as normal catalog assets.

Data Rules imported as custom assets in IBM Watson Knowledge Catalog
Data Rules imported as custom assets in IBM Watson Knowledge Catalog

When opening the data rules, you can see that the relation to their related data sets have been created – assuming that the data asset were already imported in the catalog when the notebook was executed.

Asset details for the imported data rules, showing the relations to the data assets
Asset details for the imported data rules, showing the relations to the data assets

The same relationship can also be seen from the data asset, which contains a reverse relationship of type "Is used by" pointing to the data rule.

Relationship between the data asset and the data rule, shown from the data asset
Relationship between the data asset and the data rule, shown from the data asset

Customize the asset preview

The new created custom asset are shown like any other asset type natively supported by Watson Knowledge Catalog with the exception of the Asset preview tab which remains by default empty.

By default asset preview is not available for custom asset types
By default asset preview is not available for custom asset types

It is also possible to embed a custom asset preview UI in the catalog as long as it is a Web UI which can be opened in a Web browser.

In our example the Information Analyzer Web UI for looking at the details of a data rule can be opened with the following URL:

https://<iis_hostname>/ibm/iis/igcui/assets/<datarule_rid>

Watson Knowledge Catalog provides a way to register a URL to invoke when the UI for previewing a particular asset type needs to be displayed. This is done by slightly updating the JSON definition of the asset type that we created previously:

{
  "description": "Information Analyzer Data Rule",
  "fields": [
     {
        "key": "host",
        "type": "string",
        "facet": true,
        "is_array": false,
        "search_path": "host",
        "is_searchable_across_types": true
     },
     {
        "key": "projectName",
        "type": "string",
        "facet": true,
        "is_array": false,
        "search_path": "projectName",
        "is_searchable_across_types": true
     },
     {
        "key": "rid",
        "type": "string",
        "facet": false,
        "is_array": false,
        "search_path": "rid",
        "is_searchable_across_types": true
     }
  ],
  "external_asset_preview": {
    "base_client_url": "https://<iis_hostname>/ibm/iis/igcui/assets",
    "url_parameters_from_asset_attributes": [],
    "url_path_from_asset_attributes": "rid"
  }
}

The additional attribute marked in bold specify that, whenever an asset of this type needs to be previewed, WKC should invoke a URL built from appending the _base_client_url and the value of the asset property specified in the attribute url_path_from_asset_attributes_.

In order to test this change, replace _ with the name of your Information Server host name from which the data rules have been imported, and save the modified JSON document in a file named data_rule_with_preview_asset_type.json_. Then update the definition of the asset type by using the following command:

curl -k -X PUT --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer $TOKEN" -d @data_rule_with_preview_asset_type.json "https://$WKC_HOST:443/v2/asset_types?catalog_id=$CATALOG_ID"

This command will update the existing definition of the data_rule asset type. After the update, if you reload the asset preview screen of the catalog, you should see a message – at least on Firefox – indicating that for security reason the page cannot be embedded but you can load it in a different browser tab. This is a security feature from the Information Server UI preventing the original UI to be embedded in another UI, but this will still provide a convenient way to open this UI in a different tab. By deploying a different web application, it would be possible to embed UI panels which do not have this security limitation.

Summary

We have seen in this article how important it is for a catalog to support custom asset types in order to capture the full lineage. We have also seen how this can be easily done in Watson Knowledge Catalog by using its REST API. As a concrete example, we have customized a catalog to support external Information Analyzer data rules as a new asset type, and run a script to import all external data rules to our catalog. Then we have seen, how to create programmatically relationships between the assets. Finally we have explored how the UI itself can be customized with a specialized asset viewer.

The same procedure can be applied to support any kind of asset type for which it is possible to extract metadata information in order to import them to the catalog. We have only explored a few REST APIs. Other API commands would also allow us to define special types of relationships between the assets to better capture the semantic.


Related Articles