Understanding Data Fabric in Less Than 3 Minutes

What is data fabric? Is it as revolutionary as it may sound?

Rendy Dalimunthe
Towards Data Science


The landscape of data management architecture is moving at lightning speed. New technologies and jargon are introduced every few months, sometimes making it hard for professionals to keep up.

Photo by Fatos Bytyqi on Unsplash

Recently, a new concept called “data fabric” entered the market, promising to speed up users’ access to disparate data. But what is data fabric? Is it as revolutionary as it may sound? Or is it just another “old technology” wrapped in a new term?

This story tries to uncover the concept of data fabric with ultra-simple explanations, hoping to give data management professionals at any level a better understanding.

To begin the journey of understanding data fabric, let us take a step back to the traditional architecture of data management, illustrated below.

Traditional data management architecture | Source: Author

As you can see in the diagram above, the traditional architecture mandates that all data from various sources be consolidated into centralized repositories. Those repositories usually consist of data lakes, data warehouses, and data marts, all governed by ETL processes that ensure data is stored in the correct schema.
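To make this concrete, here is a minimal, hypothetical sketch of such a consolidation step in Python, assuming pandas and SQLAlchemy. The connection strings, table names, and schema are made up for illustration and are not tied to any specific product.

```python
# A minimal, hypothetical ETL sketch: extract from an operational source,
# transform to the warehouse schema, and load into a central repository.
import pandas as pd
from sqlalchemy import create_engine

# Assumed connection strings; replace with your own sources and warehouse.
source = create_engine("postgresql://user:pass@crm-db:5432/crm")
warehouse = create_engine("postgresql://user:pass@warehouse:5432/dwh")

# Extract: pull raw customer records from the operational system.
customers = pd.read_sql(
    "SELECT id, full_name, country, created_at FROM customers", source
)

# Transform: enforce the warehouse schema (column names, types).
customers = customers.rename(columns={"full_name": "customer_name"})
customers["created_at"] = pd.to_datetime(customers["created_at"])

# Load: replicate the data into the central repository.
customers.to_sql("dim_customer", warehouse, if_exists="replace", index=False)
```

Every time the source changes, a job like this has to run again to keep the central copy in sync, which is exactly the replication burden discussed next.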

This approach has its merits: data in the central repository is highly structured and therefore easy to consume for analytical and reporting purposes. This traditional approach has stayed relevant for three decades.

However, as the volume of data produced grows geometrically, this approach reveals its flaws. With the traditional approach, data has to be constantly replicated, a mechanism that can be inefficient and time-consuming. Moreover, under the new paradigm of data security, regularly moving sensitive data around is not a wise move. Not to mention that certain types of data are better left in their native locations. So the traditional consolidated approach, while good for some types of organizations, is not suitable for modern data-driven organizations that produce, move, and use data at breakneck speed.

Data management experts then came up with a brilliant solution called data virtualization. This out-of-the-box method aims to give users access to multiple data sources without having to move that data into a central repository. To do this, an abstraction layer is introduced.

Data virtualization architecture | Source: Author

The abstraction layer is a set of middleware providing a virtual view of data from the various sources mapped to it. That mapping is governed by metadata, so all attributes of the mapped data (column names, table structures, tags, etc.) are preserved. A data catalog is also available to help organize data according to business definitions, allowing users to find what they need more quickly. More importantly, the middleware also provides security governance that controls user access rights, so only authorized users can access certain data.
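As a rough sketch of the idea (not any vendor's actual implementation), the toy Python code below maps virtual view names to live sources through a small metadata catalog, checks access rights, and runs queries against the sources in place. Every name, connection string, and role here is an assumption made for illustration.

```python
# A toy virtualization layer: virtual views are mapped to live sources via
# metadata, and queries run against the sources on demand instead of against
# a copied central store. All names here are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# Metadata: virtual view name -> source engine, physical table, allowed roles.
CATALOG = {
    "sales": {
        "engine": create_engine("postgresql://user:pass@erp-db:5432/erp"),
        "table": "public.sales_orders",
        "allowed_roles": {"analyst", "finance"},
    },
    "customers": {
        "engine": create_engine("mysql+pymysql://user:pass@crm-db:3306/crm"),
        "table": "customers",
        "allowed_roles": {"analyst"},
    },
}

def query_view(view_name: str, role: str, where: str = "1=1") -> pd.DataFrame:
    """Resolve a virtual view to its source and run the query in place."""
    entry = CATALOG[view_name]
    if role not in entry["allowed_roles"]:  # security governance check
        raise PermissionError(f"role '{role}' cannot access '{view_name}'")
    sql = f"SELECT * FROM {entry['table']} WHERE {where}"
    return pd.read_sql(sql, entry["engine"])  # data stays at the source

# Example: an analyst reads recent sales without any replication step.
# recent_sales = query_view("sales", role="analyst",
#                           where="order_date >= '2023-01-01'")
```

The point of the sketch is the shape of the architecture: one catalog of metadata, one access-control gate, and no copying of data into a central store.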

Data virtualization is an unprecedented approach because, all of a sudden, we no longer have to continuously consolidate data from multiple sources, saving valuable time, resources, and, obviously, cost. The solution has served many organizations well, from fast-moving startups to mature enterprises. One notable use case is how a pharmaceutical company leveraged data virtualization to speed up data delivery to its researchers, removing “unnecessary” ETL and cutting project development time in half.

Now that we understand the concept of data virtualization, it’s time to look at the illustration below.

Data fabric | Source: Author

Notice how similar the architectural diagram above is to the data virtualization one. There is no central repository, and several components remind us of the previous approach. In addition, new components such as a recommendation engine and a knowledge graph appear.

This is the architectural representation of the data fabric approach.

At this point, we can safely deduce that data fabric is an evolution of data virtualization, with additional built-in capabilities that extend its advantages even further.

One particular game-changing component is the introduction of AI/ML technology, which propels the mapping and cataloging process to the next level. Machine learning does this by identifying subtle patterns among the contents of the data catalog, creating data relationships that might not be established with the earlier virtualization approach. Moreover, AI also powers the recommendation engine and the knowledge graph, so data discovery improves significantly.
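Purely as an illustration of the idea, the toy sketch below uses TF-IDF similarity over column descriptions (via scikit-learn) to propose candidate relationships between catalog entries. The catalog entries and the threshold are made up, and real data fabric products use far more sophisticated techniques than this.

```python
# A hypothetical sketch of how ML might suggest relationships between catalog
# entries: compare textual descriptions of columns and flag similar pairs as
# candidate links. The catalog entries below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = {
    "crm.customers.cust_id": "unique identifier of a customer account",
    "erp.orders.customer_ref": "reference to the customer account that placed the order",
    "hr.employees.badge_no": "badge number assigned to an employee",
}

names = list(catalog)
vectors = TfidfVectorizer().fit_transform(catalog.values())
similarity = cosine_similarity(vectors)

# Propose links between distinct entries whose descriptions look alike.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if similarity[i, j] > 0.3:  # arbitrary threshold for this toy example
            print(f"candidate relationship: {names[i]} <-> {names[j]}")
```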

The simple yet chronological explanation above hopefully makes it clear that there is “nothing fancy” about data fabric. It is indeed an out-of-the-box concept; nevertheless, the building blocks that power it have been around for quite some time.

Take the knowledge graph as an example. It is a comprehensive representation of the interconnectedness of real-world entities such as parties, facts, or events. Simply put, a knowledge graph helps us understand our data comprehensively. And it has been around since the 1970s.

But the emergence of AI/ML automates the process of building a knowledge graph, leaving most of the manual work behind. And with the rise of data fabric, the knowledge graph has finally found its killer use case.
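As a rough mental model, a knowledge graph can be thought of as a set of entity, relationship, entity triples. The minimal sketch below uses made-up entities and is not tied to any particular data fabric product.

```python
# A toy knowledge graph: entities connected by named relationships (triples).
# Entities and relations here are invented purely for illustration.
from collections import defaultdict

triples = [
    ("Customer:ACME", "placed", "Order:1042"),
    ("Order:1042", "contains", "Product:Widget"),
    ("Product:Widget", "manufactured_in", "Plant:Jakarta"),
]

# Index the graph so we can walk outgoing edges from any entity.
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

def describe(entity: str) -> None:
    """Print everything the graph knows about an entity, one hop out."""
    for relation, obj in graph[entity]:
        print(f"{entity} {relation} {obj}")

describe("Customer:ACME")  # prints: Customer:ACME placed Order:1042
```

In a data fabric, ML does the tedious part: extracting triples like these from catalogs and source metadata instead of having people curate them by hand.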

To conclude, data fabric is a combination of several technology enablers that allow everyone to access and use all the data at their organization’s disposal, without having to move that data around. It is not a single piece of software we can implement in one go; rather, it is an incremental journey that starts with enabling data virtualization.

Master data virtualization first, and you will have a smooth voyage in implementing data fabric.
