
DataHub Hands-On Part I

Introduction

Photo by Alexander Sinn on Unsplash

In preparation for a new lecture, I was searching for an up-to-date open-source data cataloging tool that would let my students practice Data Governance (DG) theory hands-on. Among others, I came across LinkedIn’s "DataHub". Besides its rich feature set, including data discovery, data lineage and data quality, it convinced me with a very simple Docker-based setup routine. Since I wanted to document my test cases, configurations and findings anyway, I decided to share them as well. I am planning to release three parts, which will cover:

  1. Motivation for Data Governance, Background on DataHub, DataHub setup and basic glossary design
  2. Data ingestion, linking data entities and data discovery
  3. Users and roles, data validation, data lineage and summary

So, let’s get started with part 1!

Data Governance

The chicken-and-egg problem I’ve seen in the industry is that the problems DG tackles can’t be solved easily with IT tools alone, yet DG isn’t effective at all without powerful IT tools either. So where to start? A company needs to understand (or better: feel) the pain of bad data across the organization (e.g. bad data slows down business processes), otherwise there is no motivation and support for DG. Then you need to convert that pain into quantitative measures, more specifically Data Quality (DQ) metrics. And once you know your enemy, you can set a goal and devise a plan to achieve it. For example, the completeness of master data in a specific column of a table is easy to measure and can be a starting point. Let’s say you have 60% completeness now and want to reach 90% within a year; then you need to identify and fix the root cause(s) of the missing data (e.g. the field is not marked as mandatory in the GUI). And of course you need to track progress, e.g. to check whether the fix noticeably increases completeness and really gets you to 90% after one year – because if not, someone has to raise their hand and initiate a "plan B".
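To make this a bit less abstract, here is a minimal sketch of how such a completeness metric could be computed, assuming the master data is available as a pandas DataFrame and using a hypothetical "manufacturer" column:

```python
import pandas as pd

def completeness(df: pd.DataFrame, column: str) -> float:
    """Share of rows where the given column is neither null nor an empty string."""
    filled = df[column].notna() & (df[column].astype(str).str.strip() != "")
    return filled.mean()

# Hypothetical equipment master data with gaps in the "manufacturer" column.
equipment = pd.DataFrame({
    "equipment_id": [1, 2, 3, 4, 5],
    "manufacturer": ["ACME", None, "", "Contoso", "ACME"],
})

score = completeness(equipment, "manufacturer")
print(f"Completeness: {score:.0%}")  # 60% today -> goal: 90% within a year
```

In practice, the same figure would of course be computed against the source system and tracked over time, so that the effect of a fix becomes visible.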

The reality shock…

Sounds pretty simple… but… the main problem is that you need someone who is able to see and quantify the bad data in the first place, someone who is responsible for driving the improvement, experts who fill in and maintain the missing data (or someone with a good algorithm), someone who is technically able to fix something in an IT system, someone from the management team who supports necessary changes in business processes, and someone who implements the DQ measurement and keeps an eye on it regularly. And of course all these people need the time for it and have to deprioritize other tasks. That’s the hard part of DG – identifying, motivating and organizing the right people to work in a structured way on improving the data landscape, because in reality it’s not just one small isolated problem but – depending on company size and business complexity – a colorful bouquet of mutually influencing problems.

Unfortunately, a DG tool is not a magic wand that makes all data problems and organizational gaps disappear after installation. But at least it helps to measure and visualize DQ, speed up root cause (and impact) analysis, and establish the connection between business terms and IT entities. These capabilities increase the throughput of the DG activities that still require human effort – which may shrink to a small fraction compared to DG without any tool support.

And that’s what I want to explore in this article: how does DataHub fit into those expectations?

DataHub

The overall architecture of DataHub is depicted in the following diagram:

Source: Datahubproject.io

On the left we can see a number of connectors to different database technologies, covering both SQL and NoSQL. With these connectors it is easy to set up a metadata ingestion process, as the required configuration is guided by a wizard. If a required connector is not available, the ingestion can still be configured manually. The metadata is forwarded to the so-called serving tier, which consists of an internal data store (e.g. MySQL), a search engine (Elasticsearch) and a graph index (Neo4j), and which can be queried via REST and GraphQL APIs.
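As a small illustration of that serving tier, here is a hedged sketch of a GraphQL search request using plain Python. It assumes the GMS endpoint of a local quickstart deployment on localhost:8080 and a personal access token generated in the UI; the exact query fields follow my reading of the documented search API and may need adjustment:

```python
import requests

GRAPHQL_URL = "http://localhost:8080/api/graphql"
TOKEN = "<personal-access-token>"  # generated in the DataHub UI

# Search for the first five datasets in the catalog.
query = """
query {
  search(input: {type: DATASET, query: "*", start: 0, count: 5}) {
    total
    searchResults { entity { urn } }
  }
}
"""

resp = requests.post(
    GRAPHQL_URL,
    json={"query": query},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```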

The end user works with the DataHub frontend, e.g. to search for entities, navigate the catalog, and maintain glossary terms. API and stream integrations are other features that I won’t cover in my test cases.

DataHub Setup

In general, you have the choice between a managed DataHub (SaaS) and a self-hosted DataHub. I chose the second option, which in short means deploying a DataHub instance in a Docker environment. I won’t go into all the details of each step, as they are well documented on the DataHub project page. The basic requirements are Docker Desktop, jq for JSON processing on the command line, Python 3.7 (minimum) and sufficient hardware (2 CPUs, 8 GB RAM, 2 GB swap space and 10 GB disk space). Once everything is ready (and all paths are set correctly 🙂 ), you can install and deploy DataHub with a few command-line commands.

Source: Author

The user front end can then be accessed via the browser. A default user named "datahub" is configured for initial access. Once it works, you’ll see a Google-like start screen with a central search box – however, there is nothing to search for yet as the data catalog is empty. Since I have already done some data ingestion and glossary maintenance, my DataHub instance already shows additional information about available domains, datasets, and platforms.

Source: Author

Glossary Design

Metadata Concepts

In the top right corner, we see a menu bar with a few options. In order to add business knowledge to the data catalog, we are interested in the "Govern" option. Here we see the following sub-options:

Source: Author

An appropriate catalog design requires the right understanding of the available metadata concepts:

  • Domain: a broader business area in which related data assets are organized (e.g. all data assets that belong to the Sales organization).
  • Term group: similar to a folder in an operating system; it may contain terms or even other term groups.
  • Term: a word or phrase with a specific business definition assigned to it.
  • Tag: an informal, loosely controlled label that helps in search & discovery.

There are other DG frameworks and data catalogs on the market with slightly different or additional metadata concepts, e.g. data objects, business processes and policies. Depending on the flexibility and scope a company expects from a DG tool, this can be considered a weakness of DataHub. On the other hand, all of this information has to be gathered from somewhere and maintained, which means considerable effort – so the metadata concept set provided by DataHub is a good starting point and still allows building comprehensive data models.

Metadata Maintenance

The question now is how to use these concepts appropriately. There should be guidelines developed by a central DG office so that each business area (aka domain) uses the concepts in the same way. For example, you could decide that "term groups" should be treated like business processes. Let’s practice this with an example: we want a business process (term group) "Equipment Procurement" that contains all terms related to the procurement of equipment (e.g. equipment name, equipment manufacturer, the equipment itself). These terms belong to the "Manufacturing" domain. This can be done in DataHub as follows:

Source: Author

Then we create our term group and one of its terms, and we associate the term with the corresponding domain.

Source: Author

Once our term is created, we can set an official definition and also add further information such as links:

Source: Author
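For completeness: the same term could also be created programmatically with the acryl-datahub Python SDK instead of clicking through the UI. This is only a sketch based on my reading of the SDK; class and field names may differ between versions, and the URNs shown here are hypothetical (terms created via the UI get auto-generated IDs):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlossaryTermInfoClass

# Assumption: the quickstart GMS endpoint on localhost:8080.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical "Equipment" term inside the "Equipment Procurement" term group.
term_info = GlossaryTermInfoClass(
    definition="A physical asset used and tracked in the manufacturing process.",
    termSource="INTERNAL",  # maintained internally, not imported from an external standard
    parentNode="urn:li:glossaryNode:EquipmentProcurement",
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:glossaryTerm:Equipment",
        aspect=term_info,
    )
)
```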

Semantic Modelling

That’s cool – but we can do even more semantic data modelling in our glossary! Let’s say we want to specify terms that represent detailed information about a piece of equipment, like its name or its manufacturer. We can build such a hierarchy in DataHub using the "contains" relationship, which is similar to an object/attribute relationship:

Source: Author
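If you prefer to script such relationships instead of maintaining them in the UI, the "contains" direction can presumably be expressed via the related-terms aspect of the SDK. Again a hedged sketch: hasRelatedTerms is my reading of the field that backs the "Contains" relationship, and the URNs are hypothetical:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlossaryRelatedTermsClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# "Equipment" contains the attribute-like terms "Equipment Name" and
# "Equipment Manufacturer" (hasRelatedTerms ~ the UI's "Contains" relationship).
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:glossaryTerm:Equipment",
        aspect=GlossaryRelatedTermsClass(
            hasRelatedTerms=[
                "urn:li:glossaryTerm:EquipmentName",
                "urn:li:glossaryTerm:EquipmentManufacturer",
            ],
        ),
    )
)
```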

In addition, we can use inheritance in our glossary. For instance, we may want to differentiate between equipment types like "measurement equipment" and "logistic equipment", which are not fundamentally different from "equipment" but only more specific types of it. In this case, we can connect these terms via the "inherits" relationship:

Source: Author
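The "inherits" direction can be scripted in the same way; in my understanding it maps to the isRelatedTerms field of the same aspect (again only a sketch with hypothetical URNs):

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlossaryRelatedTermsClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# "Measurement Equipment" inherits from the more generic "Equipment" term
# (isRelatedTerms ~ the UI's "Inherits" relationship).
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:glossaryTerm:MeasurementEquipment",
        aspect=GlossaryRelatedTermsClass(
            isRelatedTerms=["urn:li:glossaryTerm:Equipment"],
        ),
    )
)
```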

Conclusion Part 1

Although the semantic modeling offers some cool features, my impression is that there is not always a technical logic behind it in DataHub. For example, inheriting terms don’t actually inherit metadata from the "parent term" (e.g. descriptions, ownership or links), and the related "subterms" (via "contains") don’t really know about this relationship, because it doesn’t appear in their own entry. The structural information is unidirectional only. What I’m really missing is an inheritance of "contains" terms (like attributes in object-oriented programming), since terms that are part of a generic term should also be part of the associated "child" term; otherwise you have to maintain this information redundantly or you end up with inconsistent metadata. Another design option could be to avoid "contains" relationships in inheriting terms altogether, since the "parent term" is linked and clickable, so one can easily find the relevant "contains" relationships in the "parent" entry. Also missing, I think, is a two-way relationship view, perhaps in visual form, so that a data steward can see the big picture at a glance.

What is good about the glossary is the possibility to share entries, e.g. by mail or simply by direct link, which supports the communication process between data stakeholders and can be integrated into other IT tools.

Well done if you made it this far 🙂 In the next part, we will dive into data ingestion and discuss how to connect the glossary (business world) to datasets (IT world).

