This is an excerpt from my book Data Engineering on Azure. The code examples in this article use Azure Data Explorer. If you want to run them, create an Azure Data Explorer cluster and database. Once this is set up, you can use the Data Explorer web UI to connect to your cluster and database and run the code samples. Azure Data Explorer uses KQL, the Kusto Query Language. If you are not familiar with it, you can check out the quick reference. The book covers Azure Data Explorer in more depth, but this article is meant as a stand-alone.
#
In a large enough enterprise, getting a broad view of how the systems are used isn’t an easy task. What usually happens is that different parts of the business generate and master their own identities.
For example, the Website team masters cookie IDs to identify users who aren’t signed in, and profile IDs for signed in users. The Payments team uses customer IDs to identify customers and subscription IDs to keep track of which subscriptions customers are paying for. The Customer Success team has a support customer ID to identify customers in their support tools, and email addresses of customers. An identity keyring pulls together all these identities across the different systems and allows us to quickly find all connections. Figure 1 shows the various identities used by the different teams and how a keyring groups them together.
The larger the business, the more identities we have and the harder it is to get the broader view of how users interact with it. Having a keyring allows us to correlate activity across the various systems: we can see, for example, how the time it takes to resolve a customer issue in the Customer Success team impacts user retention, tracked by the Payments team, or we can see how an A/B test run on the Website impacts the subscriptions a user signs up for.
Building an identity keyring
Various systems have some connections between identities. For example, the Website team might have a table which matches cookie IDs to profiles once users log in, and the user profile includes an email address. The Payments team keeps a mapping of customers to subscriptions, and also have the profile ID for a given customer ID. The Customer Success team connects each of their own IDs to an email address. Figure 2 shows these connections and how bringing them together allows us to group all identities in the system.
Let’s create the Azure Data Explorer tables containing these identities:
.set Profiles <|
datatable (ProfileId: int, Email: string, CookieId: guid) [
10002, '[email protected]', '657d31b9-0614-4df7-8be6-d576738a9661',
10003, '[email protected]', '0864c60d-cc36-4384-81a3-e4c1eee14fe7'
]
.set Customers <|
datatable (CustomerId: int, ProfileId: int) [
1001, 10002,
1005, 10003
]
.set Subscriptions <|
datatable (CustomerId: int, SubscriptionId: guid) [
1001, 'fd10b613-8378-4d37-b8e7-bb665999d122',
1005, '55979377-ed34-4911-badf-05e07755334c'
]
.set SupportCustomers <|
datatable (SupportCustomerId: int, Email: string) [
21, '[email protected]',
22, '[email protected]'
]
These tables come from different systems but end up ingested into our data platform. Once we have this raw data, we can build a keyring by grouping all related IDs. The schema of our keyring table consists of a GroupId
, uniquely identifying a group of related identities, a KeyType
, which specifies which identity we capture in each row, and a KeyValue
, which is the value of the identity.
Here is the first batch of ingestions:
.create table Keyring(GroupId: guid, KeyType: string, KeyValue: string)
.append Keyring <| Profiles
| project GroupId=new_guid(), KeyType='ProfileId', KeyValue=tostring(ProfileId)
.append Keyring <| Profiles
| join (Keyring | where KeyType == 'ProfileId'
| project GroupId, ProfileId=toint(KeyValue)) on ProfileId
| project GroupId, KeyType='Email', Email
.append Keyring <| Profiles
| join (Keyring | where KeyType == 'ProfileId'
| project GroupId, ProfileId=toint(KeyValue)) on ProfileId
| project GroupId, KeyType='CookieId', tostring(CookieId)
.append
is similar to .set
, but as .set
creates a new table, .append
expects an existing table to ingest into.
We first generated new GUIDs and we set the key type to be 'ProfileId'
and the key value the ProfileId
from the Profiles
table. Next, we joined the Profiles
table with the Keyring
table on ProfileId
, which gave us the GroupId
, and we added the emails. Third, we joined the Profiles
table with the Keyring
table on ProfileId
and added the CookieId
values.
At this point, we "unrolled" the Profiles
table into the keyring schema. Let’s add the Customers
and Subscriptions
IDs to it:
.append Keyring <| Customers
| join (Keyring | where KeyType == 'ProfileId'
| project GroupId, ProfileId=toint(KeyValue)) on ProfileId
| project GroupId, KeyType='CustomerId', tostring(CustomerId)
.append Keyring <| Subscriptions
| join (Keyring | where KeyType == 'CustomerId'
| project GroupId, CustomerId=toint(KeyValue)) on CustomerId
| project GroupId, KeyType='SubscriptionId', tostring(SubscriptionId)
This is similar with what we did before, except when we bring in SubscriptionId
we have to join on CustomerId
instead of ProfileId
. This is not a problem: we can join on any identity that we already have in the keyring to find the GroupId
and extend the group with other identities.
Finally, let’s also add the customer support IDs, joining on email:
.append Keyring <| SupportCustomers
| join (Keyring | where KeyType == 'Email'
| project GroupId, Email = KeyValue) on Email
| project GroupId, KeyType='SupportCustomerId', tostring(SupportCustomerId)
Understanding keyrings
We now pulled together the IDs from all these different tables into one and grouped them together. If we query the keyring table, we see something like the following table:
Now given any ID in the system, we can easily retrieve all connected IDs. For example, given a SupportCustomerId
(21), we can retrieve all related keys:
Keyring
| where KeyType == 'SupportCustomerId' and KeyValue == tostring(21)
| project GroupId
| join kind=inner Keyring on GroupId
A keyring enables us to correlate different datasets and get a holistic view of how our systems are used. We use a schema in which we can plug as many types of IDs as needed, with a KeyType
column which gives us the type of ID and KeyValue
column which stores the ID value.
The steps to build a keyring are:
- Generate a group ID and start by ingesting one identity (
ProfileId
in our example). - For each new identity type, join with the keyring on a known connection to get the
GroupId
, then add the new identities to their respective groups.
Below is a graph-oriented view on building a keyring:
Identity keyring as a graph
Another way to think about an identity keyring is as a graph problem. Each identity in the system represents a node in the graph, and each known connection represents an edge. For example,
ProfileId
andProfiles
table), we have edges between pairs of these nodes.Building a keyring means identifying all groups of connected identities. In graph terms, this means identifying all connected components of the graph and assigning a
GroupId
to each connected component. As a reminder, a connected component in a graph is a subgraph in which there is a path between any pair of nodes, and there are no other connections to the super-graph.An alternative way to build a keyring is using a graph database: we load all nodes and edges then we traverse to find connected components.
Keyrings give us a unified view over all identities in our systems. This enables us to correlate otherwise disjoint datasets and connect information across multiple teams and systems.