A Guide to Agile Data Mastering with AI

Master data provides the common vocabulary for transactions and operations.

Sonal Goyal
Towards Data Science

--

What is master data?

Organizations deal with a variety of data: transactional, unstructured, hierarchical, metadata, and entity data. For a customer, for example, organizations generate transactions, account information including demographics, marketing touchpoints, and interactions. All this data is collected as sporadic data points spread across different applications.

The common business entities (people, things, places, and concepts) are the master data. Master data defines the top-level business objects. According to Gartner:

“Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.”

Master data enables the sharing of information across the enterprise. It provides the common vocabulary for transactions and operations. It is the foundation on which all business applications are built.

Data silos

Let us walk through the usual way in which data is stored and managed in an organization.

As a company grows, so do its data systems. A team of people works together to take care of a business line, manage a product, or handle a new territory. This team needs specific tools to save and manipulate their data. They may prefer a homegrown CRM or a spreadsheet over the corporate Salesforce. For their specific product line, they will probably build a few custom applications. Their supply chain systems may differ from those at headquarters. In a few cases, they may also adopt corporate tools and systems. So it is a mix and match.

All this is perfectly OK; they need to get the job done. They need to comply with local regulations. They have to cater to their audience and deliver results. Having their data under their control empowers them to act fast and respond quickly to growing business needs. They can represent entities in the way that makes the most sense to them. Internal IT or contracted engineers, most likely a combination of both, help them run and manage these systems.

Large companies also work with other large companies. They acquire or merge, adding more mix to their data systems cocktail!

In summary

  • Business units and departments build and use specialized applications
  • Data is saved in these standalone applications
  • These silos are essential for operational efficiency

Data silos are lonely warriors.

Photo by Matt Sclarandis on Unsplash

What causes data silos?

Data silos can be a result of multiple factors.

Structural

Department-wise roles and responsibilities, and the way the company is structured, usually lead to data silos.

Geographical

Geographic boundaries and regulations, for example data locality requirements, can also be a factor. Different countries have different ways of doing business. In India, mobile phone information is very prevalent, so a customer record will likely include it; in the Americas, people are more private and do not share their mobile information as easily. Each country also has its own regulatory requirements, so the business and the data get structured around them. Data may also be stored in regional languages.

Technical

Most applications, including ERPs and CRMs, do not play very well with integration, so these specialized systems become a technical reason for data silos. You may also want to partition the data, for example shard it by region to manage it better, which is another technical reason for a data silo.

Examples of data silos

Customer data silos

Here is an example of how customer data can be siloed. The following enterprise applications all hold relevant information about the customer:

  • Web datastore
  • In-store data
  • Product-wise datastore
  • Email Marketing Data
  • Legacy systems
  • Customer data through M&A

In the above example, multiple systems hold bits and pieces of the customer's journey and interactions with the enterprise. And it is usually not just these systems: back in 2017, the average marketing department was using 16 different tech platforms.

Procurement data silos

A typical procurement process involves multiple stakeholders and systems. The vendor is registered in an ERP; specifications are submitted and approved in another system. Inspection of vendor premises and compliance is done by another team with its own process and application database. Contracting and legal are involved, and they bring additional data and insights. The delivery team that approves the vendor's supplies is yet another actor with its own systems.

  • Vendor Registration
  • Requirements Specification
  • Inspection
  • Contract
  • Delivery

Data Mastering

What people realize is that they can derive even more value from their data when they combine these unit data systems, or data silos. Take the case of customer data, for example. We can join and merge customer data across multiple product lines and learn buyer propensity. We can discover opportunities to cross-sell and upsell. We can build personalized communication and targeted channels. Companies can onboard customers faster and more accurately. We can increase sales effectiveness and delight customers by delivering the products and services they want, through the channel they prefer, simply by knowing more about them.

We can also manage risk and compliance: understand how much risk is associated with a customer or a household in the case of, say, insurance, or determine which systems house customer data for a GDPR Subject Access Request (SAR).

Vendor data consolidation across data silos yields terrific insights about vendor outlay across business units, regions, and geographies. Companies can understand pricing consistency across divisions, onboard new suppliers faster, and manage risk associated with vendor location and labor practices.

Similarly, integrating other data assets like supplies can help share them across divisions, figure out the best supplier for a given supply, or drive pricing consistency across business units. Integrating data silos to understand customers, products, vendors, partners, supplies, employees, and other business entities yields tremendous value.

Data mastering is essentially believing that

The Whole is greater than the sum of its parts

Photo by Clay Banks on Unsplash

** Data mastering is also known as master data management (MDM).

Benefits of data mastering

There is tremendous value in integrating data silos across various industries. It is important to understand that we are not breaking these data silos. They exist as they are, and we want them to stay: to remain as functional and as agile as they are. Let the respective teams own and manage them and continue to drive the business while staying performant. Through data mastering, along with the operational data, we also look at the overall picture and augment our focused view with a holistic organizational view. That way, we can supercharge our already smart teams with the insights we gather from the unified and integrated data. The functional teams can consume information from the MDM and operate more effectively, save costs, reduce risks, and stay compliant.

In effect, we are using the collective wisdom in each of the silos and making an informed choice over trusted data.

Key components of a data mastering system

To build a unified view of entities, an MDM system integrates the data silos and brings the data together in a place where it can be processed. It then unifies it. The data mastering system helps the business piece together an entity, say a customer, by defining it as a combination of attributes like first name, last name, email, phone number, organization, address, designation, etc.
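As a minimal illustration, such an entity definition can be thought of as a simple record type. The sketch below is in Python, and the attribute names are purely illustrative; a real MDM model is defined by the business and the source systems being mastered.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Customer:
    """A unified customer entity assembled from multiple silos.

    The attribute names are illustrative; a real model is defined
    by the business and the source systems being mastered.
    """
    first_name: str
    last_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    organization: Optional[str] = None
    address: Optional[str] = None
    designation: Optional[str] = None
```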

Schema Mapping for Data Mastering

Once an entity is defined as a collection of attributes, attributes from different silos are mapped to it. Some mapping is straightforward: you tell the system that last_name from silo 1 maps to lastName, or lName from silo 2 maps to lastName. In some cases, we need to prepare or transform the data a bit, like concatenating first_name and last_name from silo 3 and mapping the result to the customer_name attribute of our unified customer data. We may also need to clean out records that contain erroneous values, like NA, blanks, or 1900 for the birth year, so that our output is not corrupt or wrong.
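Here is a small, hypothetical Python sketch of such mapping and clean-up. The silo names, column names, and bad-value list are assumptions taken from the examples above, not from any particular MDM product.

```python
# Hypothetical column mappings from two silos onto the unified schema.
COLUMN_MAP = {
    "silo1": {"last_name": "lastName"},
    "silo2": {"lName": "lastName"},
}

BAD_VALUES = {None, "", "NA", "N/A"}

def map_record(silo: str, rec: dict) -> dict:
    """Rename columns per COLUMN_MAP, apply per-silo transforms,
    and drop clearly erroneous values."""
    mapped = {COLUMN_MAP.get(silo, {}).get(k, k): v for k, v in rec.items()}
    # Silo 3 keeps the name split up; concatenate into customer_name.
    if silo == "silo3" and "first_name" in mapped and "last_name" in mapped:
        first = mapped.pop("first_name")
        last = mapped.pop("last_name")
        mapped["customer_name"] = f"{first} {last}"
    # Clean out placeholder values such as NA, blanks, or a 1900 birth year.
    return {
        k: v for k, v in mapped.items()
        if v not in BAD_VALUES and not (k == "birth_year" and v == 1900)
    }
```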

Data matching

After schema mapping, the data mastering system groups together the data for an entity from different silos. Humans enter data into their systems in multiple ways, and many records may come from partners or contain customer-supplied values. Hence there are a lot of variations in terms of typos, salutations, abbreviations, languages, etc. The system therefore lets you configure which records to group, or match, to represent one customer, product, or vendor.

In a rule-based system, these rules are handcrafted in a collaboration between IT and business, and regularly tweaked to reduce false matches and missed matches and converge on the desired outcome.
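To give a flavor of such handcrafted rules, here is a toy Python sketch. The chosen attributes and the notion that an identical email implies a match are assumptions for illustration only; real rules are negotiated between IT and business.

```python
def normalize(s: str) -> str:
    """Lowercase and strip everything that is not a letter or digit."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def rule_based_match(a: dict, b: dict) -> bool:
    """Toy rules: an identical email is a match; otherwise both the
    normalized last name and phone number must agree and be non-empty."""
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    last_ok = normalize(a.get("lastName", "")) == normalize(b.get("lastName", "")) != ""
    phone_ok = normalize(a.get("phone", "")) == normalize(b.get("phone", "")) != ""
    return last_ok and phone_ok
```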

Data Mastering — Golden Record

Once the data for an entity is grouped, we want to build a true value: a golden copy, in data management parlance. We define how to pick a representative set of attributes across the source systems that contributed to the group. We could trust one source system, like the CRM, more for email data; we could be confident of the address in the application database only when it has been updated within the last 3 months. If a source system has a lot of blank values in the record that got matched into the group, we may want to omit its values. The golden record can be exposed back to the data silos, which may use it to enrich their source systems and gather more information about their data.
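A minimal sketch of such survivorship logic follows, assuming Python records tagged with a source name and an updated_at date (both names are assumptions); the CRM-for-email and 3-month-address rules mirror the examples above.

```python
from datetime import date, timedelta

def pick_golden(records: list) -> dict:
    """Build a golden record from matched records, each tagged with a
    'source' and an 'updated_at' date (both names are assumptions)."""
    golden = {}
    for rec in records:
        # Trust the CRM more for email data.
        if rec.get("source") == "CRM" and rec.get("email"):
            golden["email"] = rec["email"]
        # Trust the application database's address only if it was
        # updated within the last 3 months.
        if (rec.get("source") == "app_db" and rec.get("address")
                and rec.get("updated_at", date.min) >= date.today() - timedelta(days=90)):
            golden["address"] = rec["address"]
    return golden
```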

Hierarchy

Understanding relationships, both how an entity relates to itself and how it relates to other entities, is critical. Assigning hierarchies or categories to records helps us do that. Some hierarchical data may be organization-specific, like who reports to whom. But in many cases, especially for parts, services, and supplies, there are taxonomies like UNSPSC or eClass that define the categories they belong to. Most large organizations also have their own internal classification schemes, often more than one.

A data mastering system allows us to define custom hierarchies or choose standard hierarchies and assign records to them. That way we can understand the relationships and club together different records that belong to the same group. For example, through hierarchy management, we can say how many supplies we procure under Electrical Equipment or Stationery.
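As a small illustration of that kind of rollup, the sketch below sums spend per category; the records, categories, and amounts are invented.

```python
from collections import defaultdict

# Invented records already assigned to categories during mastering.
records = [
    {"item": "Copper wire", "category": "Electrical Equipment", "spend": 1200.0},
    {"item": "Circuit breaker", "category": "Electrical Equipment", "spend": 800.0},
    {"item": "Printer paper", "category": "Stationery", "spend": 150.0},
]

# Roll up spend per category to answer "how much do we procure under X?".
spend_by_category = defaultdict(float)
for rec in records:
    spend_by_category[rec["category"]] += rec["spend"]

print(dict(spend_by_category))
# {'Electrical Equipment': 2000.0, 'Stationery': 150.0}
```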

Data Mastering Process Flow

Let us try and understand how a typical master data management system works.

Image by Author

Getting data

The master data management system connects to the data silos and gets data to a central location for further processing.

Massaging

This step involves schema mapping, data standardization, and normalization. To integrate the records, we need the input records to have the same set of fields. We also need the fields to represent the information captured within them in the same way. For example, categories like gender should be encoded consistently, as either Male/Female or 1/2, in all records.
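A tiny sketch of such standardization in Python; note that mapping 1 to male and 2 to female is purely an assumption for illustration, since the canonical codes are a design choice.

```python
# Canonical codes are a design choice; mapping 1 to male and 2 to female
# here is purely an assumption for illustration.
GENDER_MAP = {"male": "M", "m": "M", "1": "M", "female": "F", "f": "F", "2": "F"}

def standardize_gender(value):
    """Map any known variant to a canonical code; unknown values become None."""
    return GENDER_MAP.get(str(value).strip().lower())

print(standardize_gender("FEMALE"), standardize_gender(1))  # F M
```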

Matching

Reconciling multiple records to determine which of them belong to the same entity is called matching. Let us look at customer records from 3 different source systems that need to be matched.

Image By Author

During data matching, these 3 records, which have variations across all attributes and no common identifier, would be assigned a single matching identifier, or cluster identifier, indicating that they are linked to each other.
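One common way to turn pairwise match decisions into cluster identifiers is a union-find pass; the sketch below is a minimal illustration of that idea, not any specific product's algorithm.

```python
def assign_clusters(n_records: int, matched_pairs: list) -> list:
    """Turn pairwise match decisions into one cluster id per record
    using a small union-find structure."""
    parent = list(range(n_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    return [find(i) for i in range(n_records)]

# Records 0, 1, and 2 were judged to match pairwise; record 3 stands alone.
print(assign_clusters(4, [(0, 1), (1, 2)]))  # [2, 2, 2, 3]
```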

Merging

Once the records are clustered so that the records belonging to an entity have a unique identifier, we can combine them to build a single clean, consolidated, and trusted view. This is also known as the golden record.

Publishing

Downstream applications like analytics, reporting, compliance, risk, and operations need access to the mastered entities for their respective use cases. Publishing the records gets the silos to refer to the same data attributes and improves their functions.

Challenges in data mastering

Overall, since we are talking about integration, there are many things to consider in an end-to-end data mastering system.

Sponsorship

Sponsorship challenges arise from who sponsors the data mastering project. Is it a business benefit or a purely technical one, where applications share data and in turn get enriched data? Typical mastering benefits are spread across business units, so budgeting becomes a big question.

Ownership

Which department is responsible for building and managing the mastering solution: IT, which services the whole enterprise, or the business unit that will consume the data? As master data touches sales, marketing, operations, support, procurement, compliance, and risk, it is important to establish a clear owner who can balance the individual needs of each team. Typically, the owner is someone high up in the organization with the power and the mandate to drive the master data management program across the company.

Coordination

All stakeholders and systems need to coordinate with each other to ensure data mastering happens seamlessly.

Governance

A lot of data is sensitive, or should not be exposed to every individual or team. So when we unify and master the data, the right access controls are paramount. It is also important to establish the owners of each dataset.

Process

What kind of processes are needed for inter-department coordination across geographies? How does the master data management system impact the work processes existing in the enterprise currently?

Thus it is a combination of ownership, coordination, governance, and processes that has to come together in data mastering.

Technical

Data Formats

The applications to be mastered save entity information in different formats and datastores. A master data system may need to consume data from a mix of relational databases, NoSQL datastores, cloud storage, and local file systems. The formats may vary from proprietary to JSON, XML, CSV, Parquet, Avro, and others.

Schema Variations in Data Sources

Entity representation often varies across systems. The name attribute in one source system may be split into first name, middle name, and last name in another. Even when the attributes are the same, the column naming may differ: firstName, fName, and first_name could all represent the first name. Similarly, categories may not be encoded the same way in each system: male and female records could be marked as M and F in one system, and 1 and 2 in another.

Data Matching

Computers understand equality and comparisons, but not fuzzy matching. Matching records with attribute differences and no single common identifier is very challenging, as there are no built-in operators that can do it. Hence there are 3 main challenges in a data matching system:

Matching definition

We need to be really careful about the matching process and how we define which attributes and records can indeed be said to be the same. Typos, prefixes, suffixes, blank values, extra characters, and abbreviations are some of the common variations that need to be accounted for.
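As an illustration of why plain equality is not enough, a normalized similarity score such as the one below (built on Python's standard library difflib) tolerates typos and abbreviations; the threshold at which a score counts as a match remains a design decision.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Jonathan Smith", "Jonathon Smith"))                             # ~0.93
print(similarity("Intl. Business Machines", "International Business Machines"))  # ~0.81
```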

Scale

Once the notion of similarity is established, we need to compare every record with every other record, and the number of computations grows quadratically with the number of records. The following table lists the comparisons needed for a single-attribute record.

Image By Author

As the data volume increases 10-fold, the number of comparisons increases 100-fold, making data matching a computationally challenging problem.
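The arithmetic is easy to verify: n records require n(n - 1)/2 pairwise comparisons, so every 10-fold increase in records means roughly a 100-fold increase in comparisons.

```python
# n records need n * (n - 1) / 2 pairwise comparisons.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} records -> {n * (n - 1) // 2:>15,} comparisons")

#    10,000 records ->      49,995,000 comparisons
#   100,000 records ->   4,999,950,000 comparisons
# 1,000,000 records -> 499,999,500,000 comparisons
```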

Variety

Enterprises deal with different kinds of data and the need for mastering applies to all. Customers, vendors, products, employees, and supplies all have different attributes and thus we need to take care of each domain accordingly.

Languages

Large multinational corporations save data in regional and local languages. Matching records based on sound or string similarity becomes challenging here as each language has its complexity. For example, Chinese names are short strings with many characters common even among completely different names.

Hierarchy Management

The world is built on relationships. Customers have families, employees, employers, and other relationships. They deal with specific products and services. Suppliers have local, regional and global offices. Organizations have multiple addresses — legal, billing, shipping addresses can all be different for them. All master data needs to be modeled effectively in the MDM system.

Critique of current master data management tools

Master data management is needed widely across multiple industries. Every organization, large or small, that deals with customers or suppliers and has a few different operational systems like an ERP, CRM, PoS, and application databases eventually has to put a master data management system in place to establish a single source of truth. Thus, most master data management software existing today provides generic tools and libraries to cater to the different industries it serves. This allows master data management tools to be deployed across different domains. However, a lot of conditional logic needs to be adapted for each domain and industry.

Enterprise software has traditionally been built around relational databases. With a 3-tier architecture consisting of a web-based or fat client, a middle tier with application logic, and data persistence in relational databases, most master data management systems are rule-based. A rule-based master data management system faces a big challenge of scale, both in the number of data sources and attributes and in the sheer number of comparisons needed for data matching. Imagine trying to master hundreds of source systems with tens of attributes each: schema mapping through manual actions alone would take a few months.

Similarly, imagine building rules for entity matching. Taking care of variations in the different attributes is a painfully tough exercise. Because matching is also computationally expensive, we want to avoid comparing everything with everything, so traditional rule-based master data management systems follow a tedious process of data profiling, standardization, normalization, fuzzy key definition, similarity rule definition, and database tuning.

To make sure we can match records through traditional matching, we try to bring them all to the same base. Data profiling is done to understand attribute properties and missing values. Attributes are standardized so that they represent information in the same way; for example, if 1 represents Female in one data source but F in another, both are changed to 1, F, or Female. After this, values are normalized by changing field values to the same case, removing special characters, and changing St. to Street, or vice versa. Profiling the data again gives us the fuzzy match keys, which are essentially a way to group records into buckets so that only records within a bucket are compared. Similarity rules are then defined on the much cleaner data. A good deal of database tuning is needed too, owing to the computations involved, even after a fuzzy match key is defined.
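A fuzzy match key, or blocking key, can be sketched as follows. The key recipe here (first letters of a normalized last name plus the zip code) is an assumption for illustration, since real keys come out of profiling the data.

```python
from collections import defaultdict
from itertools import combinations

def block_key(rec: dict) -> str:
    """Hypothetical fuzzy match key: first 3 letters of the normalized
    last name plus the zip code."""
    name = "".join(ch for ch in rec.get("lastName", "").lower() if ch.isalpha())
    return f"{name[:3]}|{rec.get('zip', '')}"

def candidate_pairs(records: list):
    """Bucket records by key and yield only within-bucket pairs,
    instead of comparing everything with everything."""
    buckets = defaultdict(list)
    for i, rec in enumerate(records):
        buckets[block_key(rec)].append(i)
    for ids in buckets.values():
        yield from combinations(ids, 2)
```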

Due to all this configuration and implementation workflow, the typical cost of a master data management deployment runs into the millions, with implementation costs around 4 times the cost of the base license. This leads to very long deployment cycles spanning a few years. Many master data projects overshoot their budgets and are unable to master every system. Adding new systems for data mastering is costly and time-consuming, a multi-month effort. On top of that, rule-based master data management systems suffer from performance and scalability issues, with long database tuning cycles and an inability to process millions of records.

AI in MDM

What is AI?

The term AI is used broadly for a wide variety of applications. There are discussions of human intelligence, perception, and judgment being imitated by machines. Many exciting breakthroughs in audio, visual, and text analysis through AI techniques open us to limitless possibilities. Machine learning, a set of tools and techniques within AI, is a promising area where algorithms derive patterns from data and make deductions. Machine learning can be supervised, where we already have many data points whose outcomes we know, or unsupervised, where we have no results, only input data points.

Broadly, during supervised machine learning, we learn from data whose results we already know. Which records belong to the same vendor? Which record is a stationery spend? This input dataset is known as the training dataset. Next, we build representations from that data; this is called feature engineering. Say we have a pair of records that we are matching: one feature can be the number of common characters. For spend classification, it can be the words describing the supplies.

The model and its algorithms are optimized to get as close to the known results as possible on the training dataset. Eventually, when we are happy with the model's performance, when we know the predictions are accurate and the error rate is low, we can move it to production. We can build a data pipeline that derives the features from the raw data, applies the model, and gets the prediction. Now the good part: if we have done a good job choosing our AI model and building our features, we will not need representative samples for each variation in the data. That is where rules fail and AI shines: it can generalize. Sometimes it overgeneralizes, but there are techniques to balance that.
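To make this concrete, here is a minimal, hypothetical sketch of supervised pair matching using scikit-learn. The features, the two training pairs, and the field names are all invented for illustration; a real system would use far more training data and much richer features.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(a: dict, b: dict) -> list:
    """Hand-built features for a record pair: name similarity and
    exact email equality (both invented for this sketch)."""
    return [
        SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        1.0 if a.get("email") and a.get("email") == b.get("email") else 0.0,
    ]

# A tiny labeled training set: 1 = same entity, 0 = different entities.
train_pairs = [
    (({"name": "Jon Smith", "email": "js@x.com"},
      {"name": "Jonathan Smith", "email": "js@x.com"}), 1),
    (({"name": "Jon Smith", "email": "js@x.com"},
      {"name": "Mary Jones", "email": "mj@x.com"}), 0),
]
X = [pair_features(a, b) for (a, b), _ in train_pairs]
y = [label for _, label in train_pairs]

model = LogisticRegression().fit(X, y)

# Score a new pair: probability that the two records are the same customer.
print(model.predict_proba([pair_features(
    {"name": "J. Smith", "email": "js@x.com"},
    {"name": "Jon Smith", "email": "js@x.com"},
)]))
```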

AI in MDM

Owing to the complexity and scale involved, using AI in MDM enables us to get results faster and at lower cost. Data mastering is a critical part of an enterprise data stack, and by leveraging AI, we can remove many of the painful aspects of older MDM technology and replace them with state-of-the-art learning systems that are fast to deploy and scale well.

Agile Data Mastering At Scale with AI

When we look at the core of a data mastering system, we realize that the technology landscape has altered dramatically since legacy master data management systems were built. ETL is a very mature field, and there is a proliferation of open-source technologies like Kafka, Change Data Capture, and Apache NiFi that can help us get data into or out of a system. Each cloud provider also offers its own suite of data ingestion and extraction tools. Hence, it is only the core data massaging, matching, and governance that we need to focus on.

Agile data mastering focuses on solving the core master data functions, letting us use state-of-the-art technologies we are already familiar with for data extraction and loading. By using AI, agile data mastering allows us to quickly build MDM systems that recommend actions and augment our master data workflows. AI also allows such tools to scale by automatically profiling and learning heuristics of the data, cutting through the noise of dirty data so that schema mappings and fuzzy match keys can be learned, with minimal data massaging and zero normalization needed.

Image By Author
