The world’s leading publication for data science, AI, and ML professionals.

Identity keyrings

Correlating the different identities used throughout your enterprise

This is an excerpt from my book Data Engineering on Azure. The code examples in this article use Azure Data Explorer. If you want to run them, create an Azure Data Explorer cluster and database. Once this is set up, you can use the Data Explorer web UI to connect to your cluster and database and run the code samples. Azure Data Explorer uses KQL, the Kusto Query Language. If you are not familiar with it, you can check out the quick reference. The book covers Azure Data Explorer in more depth, but this article is meant as a stand-alone.

#

In a large enough enterprise, getting a broad view of how the systems are used isn’t an easy task. What usually happens is that different parts of the business generate and master their own identities.

For example, the Website team masters cookie IDs to identify users who aren’t signed in, and profile IDs for signed in users. The Payments team uses customer IDs to identify customers and subscription IDs to keep track of which subscriptions customers are paying for. The Customer Success team has a support customer ID to identify customers in their support tools, and email addresses of customers. An identity keyring pulls together all these identities across the different systems and allows us to quickly find all connections. Figure 1 shows the various identities used by the different teams and how a keyring groups them together.

The larger the business, the more identities we have and the harder it is to get the broader view of how users interact with it. Having a keyring allows us to correlate activity across the various systems: we can see, for example, how the time it takes to resolve a customer issue in the Customer Success team impacts user retention, tracked by the Payments team, or we can see how an A/B test run on the Website impacts the subscriptions a user signs up for.

Building an identity keyring

Various systems have some connections between identities. For example, the Website team might have a table which matches cookie IDs to profiles once users log in, and the user profile includes an email address. The Payments team keeps a mapping of customers to subscriptions, and also have the profile ID for a given customer ID. The Customer Success team connects each of their own IDs to an email address. Figure 2 shows these connections and how bringing them together allows us to group all identities in the system.

Let’s create the Azure Data Explorer tables containing these identities:

.set Profiles <|
datatable (ProfileId: int, Email: string, CookieId: guid) [
    10002, '[email protected]', '657d31b9-0614-4df7-8be6-d576738a9661',
    10003, '[email protected]', '0864c60d-cc36-4384-81a3-e4c1eee14fe7'
]
.set Customers <|
datatable (CustomerId: int, ProfileId: int) [
    1001, 10002,
    1005, 10003
]
.set Subscriptions <|
datatable (CustomerId: int, SubscriptionId: guid) [
    1001, 'fd10b613-8378-4d37-b8e7-bb665999d122',
    1005, '55979377-ed34-4911-badf-05e07755334c'
]
.set SupportCustomers <|
datatable (SupportCustomerId: int, Email: string) [
    21, '[email protected]',
    22, '[email protected]'
]

These tables come from different systems but end up ingested into our data platform. Once we have this raw data, we can build a keyring by grouping all related IDs. The schema of our keyring table consists of a GroupId, uniquely identifying a group of related identities, a KeyType, which specifies which identity we capture in each row, and a KeyValue, which is the value of the identity.

Here is the first batch of ingestions:

.create table Keyring(GroupId: guid, KeyType: string, KeyValue: string)
.append Keyring <| Profiles
| project GroupId=new_guid(), KeyType='ProfileId', KeyValue=tostring(ProfileId)
.append Keyring <| Profiles
| join (Keyring | where KeyType == 'ProfileId' 
    | project GroupId, ProfileId=toint(KeyValue)) on ProfileId
| project GroupId, KeyType='Email', Email
.append Keyring <| Profiles
| join (Keyring | where KeyType == 'ProfileId'
    | project GroupId, ProfileId=toint(KeyValue)) on ProfileId
| project GroupId, KeyType='CookieId', tostring(CookieId)

.append is similar to .set, but as .set creates a new table, .append expects an existing table to ingest into.

We first generated new GUIDs and we set the key type to be 'ProfileId' and the key value the ProfileId from the Profiles table. Next, we joined the Profiles table with the Keyring table on ProfileId, which gave us the GroupId, and we added the emails. Third, we joined the Profiles table with the Keyring table on ProfileId and added the CookieId values.

At this point, we "unrolled" the Profiles table into the keyring schema. Let’s add the Customers and Subscriptions IDs to it:

.append Keyring <| Customers
| join (Keyring | where KeyType == 'ProfileId'
    | project GroupId, ProfileId=toint(KeyValue)) on ProfileId
| project GroupId, KeyType='CustomerId', tostring(CustomerId)
.append Keyring <| Subscriptions
| join (Keyring | where KeyType == 'CustomerId'
    | project GroupId, CustomerId=toint(KeyValue)) on CustomerId
| project GroupId, KeyType='SubscriptionId', tostring(SubscriptionId)

This is similar with what we did before, except when we bring in SubscriptionId we have to join on CustomerId instead of ProfileId. This is not a problem: we can join on any identity that we already have in the keyring to find the GroupId and extend the group with other identities.

Finally, let’s also add the customer support IDs, joining on email:

.append Keyring <| SupportCustomers
| join (Keyring | where KeyType == 'Email'
    | project GroupId, Email = KeyValue) on Email
| project GroupId, KeyType='SupportCustomerId', tostring(SupportCustomerId)

Understanding keyrings

We now pulled together the IDs from all these different tables into one and grouped them together. If we query the keyring table, we see something like the following table:

Now given any ID in the system, we can easily retrieve all connected IDs. For example, given a SupportCustomerId (21), we can retrieve all related keys:

Keyring
| where KeyType == 'SupportCustomerId' and KeyValue == tostring(21)
| project GroupId
| join kind=inner Keyring on GroupId

A keyring enables us to correlate different datasets and get a holistic view of how our systems are used. We use a schema in which we can plug as many types of IDs as needed, with a KeyType column which gives us the type of ID and KeyValue column which stores the ID value.

The steps to build a keyring are:

  • Generate a group ID and start by ingesting one identity (ProfileId in our example).
  • For each new identity type, join with the keyring on a known connection to get the GroupId, then add the new identities to their respective groups.

Below is a graph-oriented view on building a keyring:

Identity keyring as a graph

Another way to think about an identity keyring is as a graph problem. Each identity in the system represents a node in the graph, and each known connection represents an edge. For example, ProfileId and Email are nodes and because they’re connected (in the Profiles table), we have edges between pairs of these nodes.

Building a keyring means identifying all groups of connected identities. In graph terms, this means identifying all connected components of the graph and assigning a GroupId to each connected component. As a reminder, a connected component in a graph is a subgraph in which there is a path between any pair of nodes, and there are no other connections to the super-graph.

An alternative way to build a keyring is using a graph database: we load all nodes and edges then we traverse to find connected components.

Keyrings give us a unified view over all identities in our systems. This enables us to correlate otherwise disjoint datasets and connect information across multiple teams and systems.


Related Articles