
HDBSCAN Clustering with Neo4j

HDBSCAN is a clustering algorithm that identifies islands of closely related elements in a sea of noisy outliers.

Nathan Smith
Towards Data Science
17 min read · Jan 15, 2021


I recently came across the article “How HDBSCAN works” by Leland McInnes, and I was struck by the informative, accessible way he explained a complex machine learning algorithm.

Clusters identified with HDBSCAN

Unlike algorithms such as k-means, HDBSCAN doesn’t require you to decide in advance how many clusters you expect to find in the data. I wanted to learn more about HDBSCAN after I saw Maarten Grootendorst use it in a tutorial for extracting topics from texts.

McInnes’s description of the inner workings of HDBSCAN will resonate with Neo4j users. He talks about the graph of relationships among items and how the algorithm finds a minimum spanning tree as part of the process.

His “graphy” description made me wonder if I could implement HDBSCAN with Neo4j. When I use HDBSCAN for work, I use the Python package instead of implementing the algorithm in Neo4j. I see no reason to reinvent the wheel, especially when I can easily export artifacts from the hdbscan package to networkx and then import the GraphML into Neo4j.

However, trying to explain a concept to someone else is a great way to deepen your own understanding. “Explaining” a concept to a computer can be even tougher than explaining to a person, and might provide some interesting insights along the way.

By building HDBSCAN in Neo4j, we’ll also get practice with useful Neo4j Graph Data Science algorithms and APOC procedures. Let’s dive in!

McInnes breaks his description of HDBSCAN into 5 sections:

  1. Transform the space
  2. Build the minimum spanning tree
  3. Build the cluster hierarchy
  4. Condense the cluster tree
  5. Extract the clusters

It’s a somewhat lengthy process, but each section is straightforward. We’ll follow his outline as we retrace his steps in Neo4j.

Set up your environment

To walk through this exercise, I recommend using a free Neo4j Sandbox that comes preloaded with a Game of Thrones dataset. Sign in at Neo4j Sandbox, select the Graph Data Science project, and launch it.

Launch a Neo4j sandbox project

When your sandbox is ready, choose “Open with Browser.”

Transform the space

HDBSCAN requires a measure of how far apart the elements in our dataset are from each other in some metric space. The HDBSCAN Python package can use many distance metrics out of the box.

For our Game of Thrones dataset, we’ll create a custom distance metric using the node similarity algorithm in the Graph Data Science library. The node similarity algorithm often works best on a bipartite graph, but I think it still makes sense in this application.

We’ll say that two characters are similar if the sets of characters that they interact with overlap significantly. This should help our similarity score reflect the characters’ social standing. For example, a leader and a subordinate might have frequent interactions with each other, but few commonalities in their wider social circles.
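
To make the metric concrete, here is a minimal sketch (the character pair is just an illustration) that computes the overlap of two characters’ interaction sets by hand with APOC. Jaccard similarity, the size of the shared neighborhood divided by the size of the combined neighborhood, is essentially what the node similarity algorithm scores for every pair of nodes.

MATCH (a:Person {name:"Cersei Lannister"})-[:INTERACTS]-(n)
WITH collect(DISTINCT id(n)) AS aNeighbors
MATCH (b:Person {name:"Jaime Lannister"})-[:INTERACTS]-(n)
WITH aNeighbors, collect(DISTINCT id(n)) AS bNeighbors
RETURN size(apoc.coll.intersection(aNeighbors, bNeighbors)) * 1.0
  / size(apoc.coll.union(aNeighbors, bNeighbors)) AS jaccardSimilarity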

Before we run the node similarity algorithm, let’s create an in-memory graph projection containing the Person nodes and the :INTERACTS relationships. The weight property on the :INTERACTS relationships represents the number of interactions for the characters. The graph contains separate :INTERACTS relationships for each Game of Thrones book. We’ll sum the weights across all books in our in-memory graph.

CALL gds.graph.create(
  'gotGraph',
  'Person',
  {
    INTERACTS: {
      type: 'INTERACTS',
      orientation: 'UNDIRECTED',
      aggregation: 'SUM',
      properties: {
        weight: {property: 'weight'}
      }
    }
  }
) YIELD graphName, nodeCount, relationshipCount

Now we call the node similarity algorithm and write the results back to our main graph as :SIMILAR_TO relationships with a similarity score in the range 0 to 1. Because we want a densely connected graph, we set the topK parameter to write up to 50 :SIMILAR_TO relationships for each node.

CALL gds.nodeSimilarity.write('gotGraph',
  {
    writeRelationshipType: 'SIMILAR_TO',
    writeProperty: 'similarity',
    topK: 50
  })
YIELD nodesCompared, relationshipsWritten

HDBSCAN expects a measure of distance, where higher numbers correspond to items that are more distantly related. Our similarity score is the opposite, where higher numbers correspond to items that are more closely related. We can transform similarity to distance by subtracting from 1.

MATCH (:Person)-[rel:SIMILAR_TO]->(:Person)
SET rel.distance = 1 - rel.similarity

Let’s check out the distribution of distance scores that were created.

MATCH (p1:Person)-[s:SIMILAR_TO]-(p2:Person)
WHERE id(p1) < id(p2)
RETURN round(s.distance*10)/10.0 AS roundedDistance,
count(*) as pairs
ORDER BY roundedDistance

Here’s a plot of the results.

Histogram of distances

It looks like we have a few pairs of characters that are very close together, a potentially interesting group around 0.5, and a large number in the 0.8–0.9 range.

Now that we have a metric space established, we’re ready to transform it. We want to know whether each node lives in a densely or sparsely populated neighborhood. One way to estimate that is to find the distance to the kth nearest neighbor. We call this distance the core distance.

You can experiment with different values for k. Since many of our Person nodes have a fairly small number of neighbors, I set k to 3.

:param k => 3

Now we can calculate the core distance for each Person node. If a Person has fewer than k neighbors, we default their core distance to 2.

MATCH (p1:Person)-[s:SIMILAR_TO]-(p2:Person)
WITH p1, p2, min(s.distance) AS distance
ORDER BY distance
WITH p1, collect(distance) AS neighbors
SET p1.coreDist = coalesce(neighbors[$k-1], 2)
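
As a spot check (Cersei Lannister is an arbitrary choice), you can compare a character’s stored core distance with their three nearest distinct distances; the core distance should equal the last value in the collected list.

MATCH (p:Person {name:"Cersei Lannister"})-[s:SIMILAR_TO]-(p2)
WITH p, p2, min(s.distance) AS distance
ORDER BY distance
LIMIT 3
RETURN p.coreDist AS coreDist, collect(distance) AS threeNearestDistances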

The final step in our transformation of the space is to calculate the mutual reachability for each pair of connected Person nodes. Mutual reachability is the maximum of three values: the distance between the two nodes, and each node’s core distance.

MATCH (m1:Person)-[s:SIMILAR_TO]->(m2:Person) 
SET s.mutualReachability = apoc.coll.max([m1.coreDist, m2.coreDist, s.distance])

Transforming the space in this way emphasizes the isolation of nodes that live in sparsely populated areas of the graph.

Let’s see how our distribution of mutual reachability scores turned out for someone in a dense part of the graph, Cersei Lannister.

MATCH (p:Person {name:"Cersei Lannister"})-[s:SIMILAR_TO]-(p2)
RETURN DISTINCT p2.name as character,
round(p2.coreDist * 1000)/1000.0 as p2CoreDist,
round(s.distance*1000)/1000.0 as distance,
round(s.mutualReachability*1000)/1000.0 as reachability
ORDER BY round(s.distance*1000)/1000.0
LIMIT 5
╒═══════════════════╤════════════╤══════════╤══════════════╕
│"character"        │"p2CoreDist"│"distance"│"reachability"│
╞═══════════════════╪════════════╪══════════╪══════════════╡
│"Joffrey Baratheon"│0.636       │0.544     │0.64          │
├───────────────────┼────────────┼──────────┼──────────────┤
│"Sansa Stark"      │0.644       │0.613     │0.644         │
├───────────────────┼────────────┼──────────┼──────────────┤
│"Tyrion Lannister" │0.678       │0.64      │0.678         │
├───────────────────┼────────────┼──────────┼──────────────┤
│"Robert Baratheon" │0.683       │0.683     │0.683         │
├───────────────────┼────────────┼──────────┼──────────────┤
│"Jaime Lannister"  │0.692       │0.689     │0.692         │
└───────────────────┴────────────┴──────────┴──────────────┘

We can see that while her son Joffrey Baratheon has the smallest distance to Cersei, their mutual reachability score is bumped up to 0.64, which is Cersei’s core distance (the distance to her third-closest neighbor, Tyrion Lannister). In the relationships with Sansa Stark, Tyrion Lannister, Robert Baratheon, and Jaime Lannister, the mutual reachability is determined by the other person’s core distance, not Cersei’s.

MATCH (p:Person {name:"Cersei Lannister"})-[s:SIMILAR_TO]-(p2)
WITH DISTINCT p2, s.mutualReachability - s.distance as difference
RETURN avg(difference)

The mean difference between the distance and mutual reachability values for Cersei’s relationships is 0.01.

That’s a big contrast with a minor character like Thistle.

MATCH (p:Person {name:"Thistle"})-[s:SIMILAR_TO]-(p2)
RETURN DISTINCT p2.name as character,
round(p2.coreDist * 1000)/1000.0 as p2CoreDist,
round(s.distance*1000)/1000.0 as distance,
round(s.mutualReachability*1000)/1000.0 as reachability
ORDER BY round(s.distance*1000)/1000.0
LIMIT 5
╒══════════════╤════════════╤══════════╤══════════════╕
│"character"   │"p2CoreDist"│"distance"│"reachability"│
╞══════════════╪════════════╪══════════╪══════════════╡
│"Bump"        │0.875       │0.0       │0.875         │
├──────────────┼────────────┼──────────┼──────────────┤
│"Haggon"      │0.875       │0.0       │0.875         │
├──────────────┼────────────┼──────────┼──────────────┤
│"Harma"       │0.6         │0.875     │0.875         │
├──────────────┼────────────┼──────────┼──────────────┤
│"Mance Rayder"│0.765       │0.97      │0.97          │
├──────────────┼────────────┼──────────┼──────────────┤
│"Jon Snow"    │0.807       │0.991     │0.991         │
└──────────────┴────────────┴──────────┴──────────────┘

His few neighbors are either very close or very distant. A k value of 3 sets the mutual reachability for all of his neighbors to a high value.

MATCH (p:Person {name:"Thistle"})-[s:SIMILAR_TO]-(p2)
WITH DISTINCT p2, s.mutualReachability - s.distance as difference
RETURN avg(difference)

The mean difference between the distance and mutual reachability values for Thistle is 0.35.

MATCH (p1:Person)-[s:SIMILAR_TO]-(p2:Person)
WHERE id(p1) < id(p2)
RETURN round(s.mutualReachability*10)/10.0 AS roundedMutualReachability,
count(*) as pairs
ORDER BY roundedMutualReachability

Here’s a plot of the transformed data.

Histogram of mutual reachability

We see that as a result of the transformation, the distribution has shifted a bit towards the higher end of the range. Relationships to some minor characters have ended up at 2.0, the default core distance we assigned to under-connected nodes. A higher value for k would shift the distribution more.
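
If you want to test that, set :param k => 5 and rerun the two update statements from earlier, shown together in this sketch (run each statement separately in Neo4j Browser, and remember to set k back to 3 before continuing, since the rest of this walkthrough assumes k = 3):

MATCH (p1:Person)-[s:SIMILAR_TO]-(p2:Person)
WITH p1, p2, min(s.distance) AS distance
ORDER BY distance
WITH p1, collect(distance) AS neighbors
SET p1.coreDist = coalesce(neighbors[$k-1], 2)

MATCH (m1:Person)-[s:SIMILAR_TO]->(m2:Person)
SET s.mutualReachability = apoc.coll.max([m1.coreDist, m2.coreDist, s.distance])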

Build the minimum spanning tree

The next section in “How HDBSCAN works” describes building the minimum spanning tree for our transformed metric space. This is a set of relationships that covers every node, with no cycles, and the minimum total relationship weight. The minimum weight spanning tree algorithm from the Graph Data Science library will find this tree for us.

We load a new graph into the graph catalog that includes the Person nodes and the :SIMILAR_TO relationships with the mutualReachability property.

CALL gds.graph.create(
  'gotSimilarity',
  'Person',
  {
    SIMILAR_TO: {
      type: 'SIMILAR_TO',
      properties: 'mutualReachability',
      orientation: 'UNDIRECTED'
    }
  }) YIELD graphName, nodeCount, relationshipCount

We call the Minimum Spanning Tree algorithm and write the results back to our graph as :MST relationships. We have to specify a starting node for the algorithm. I arbitrarily chose Sansa Stark. You can try a different Person and generate a slightly different tree based on the execution of Prim’s algorithm. The choice of starting node shouldn’t impact the final clusters very much, because each node will still end up linked to one of its closest neighbors in the minimum spanning tree.

MATCH (m:Person {name:"Sansa Stark"})
CALL gds.alpha.spanningTree.minimum.write('gotSimilarity', {
  startNodeId: id(m),
  relationshipWeightProperty: 'mutualReachability',
  writeProperty: 'MST',
  weightWriteProperty: 'minTreeDistance'
})
YIELD effectiveNodeCount, createMillis, computeMillis, writeMillis
RETURN *
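
As a quick sanity check, a spanning tree that reaches n nodes contains exactly n - 1 relationships, so the :MST relationship count should come back one less than the effectiveNodeCount reported above.

MATCH ()-[m:MST]->()
RETURN count(m) AS mstRelationships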

At this point, Neo4j Bloom’s network visualization offers a nice way to look at our data. From the Neo4j Sandbox project page, choose “Open with Bloom.”

Click the key icon near the “Open with Bloom” button to reveal the sign-in credentials. After you sign in, choose the Game of Thrones perspective.

We’re interested in the :MST relationship. Click in the search box, type “MST,” and select the MST relationship.

Press Return, and Bloom will execute the query. Click the Activate Hierarchical Layout button in the bottom right of the visualization to get a nice view of our tree. You might need to use the rotate button next to the hierarchical layout button to get the root of your tree to appear at the top. The colors and sizes of the Person nodes have been set with rule-based formatting that isn’t needed for our exercise. You can remove that formatting if you wish.

Portion of minimum weight spanning tree visualized in Bloom

Build the cluster hierarchy

The next step in the algorithm is to create clusters based on the minimum spanning tree. HDBSCAN is a hierarchical (also known as agglomerative) clustering algorithm. Hierarchical clustering is used in several clustering algorithms, including the Louvain community detection algorithm in the Graph Data Science Library.

The basic hierarchical clustering algorithm goes like this:

  1. Think of each item in the data set as a cluster of size 1.
  2. Find the two clusters that are closest to each other, and merge them into a single cluster.
  3. Repeat step 2 until you have merged all the clusters into one big cluster.
  4. Decide which of the intermediate clusters between step 1 and step 3 are interesting for your analysis.

Here’s a visualization of the cluster relationship tree, called a dendrogram. Small clusters at the bottom merge into ever-larger clusters as we move towards the top.

Example of single linkage tree dendrogram generated with HDBSCAN

It’s important for us to define what we mean when we say clusters are “close to each other” in step 2 of the hierarchical clustering description above. In HDBSCAN, we use the relationships in the minimum spanning tree to define “close.”

We start with the relationship in the tree that has the lowest weight and merge the clusters at the ends of that relationship. We’ll move to the next lowest-weight relationship and continue merging in this way until we have exhausted all of the relationships in the tree.

HDBSCAN also helps us decide what clusters are “interesting for our analysis” in step 4 of the hierarchical clustering description. We’ll see how that happens in the final section of this exercise, Extract the Clusters.

Go back to Neo4j Browser to execute the following commands. First, we’ll create an APOC custom function to identify the cluster that is furthest along the hierarchy for any given node. Then we can use that function to build our cluster hierarchy.

CALL apoc.custom.asFunction('getLastCluster',
  "MATCH (n) WHERE id(n) = id($n)
   WITH n
   MATCH clustPath = (n)-[:IN_CLUSTER*0..]->(c)
   WITH clustPath ORDER BY length(clustPath) DESC
   LIMIT 1
   RETURN nodes(clustPath)[-1] AS lastCluster",
  "NODE",
  [["n","NODE"]],
  false,
  "Starting from a given node, get the node at the end of the longest path of IN_CLUSTER relationships")

The Cypher statement below:

  1. Finds the lowest-weight edge in the minimum spanning tree that has not yet been evaluated.
  2. Finds the deepest existing cluster for each of the nodes at the ends of that edge.
  3. Creates a new cluster and attaches both of those clusters to it.
  4. Calculates the size of the new cluster.
  5. Marks the spanning tree edge as evaluated.

MATCH (a)-[m:MST]->(b)
WHERE m.evaluated IS NULL
WITH a, b, m
ORDER BY m.minTreeDistance
LIMIT 1
WITH custom.getLastCluster(a).lastCluster AS cluster1,
     custom.getLastCluster(b).lastCluster AS cluster2,
     m
CREATE (cluster1)-[r1:IN_CLUSTER]->(c:Cluster)<-[r2:IN_CLUSTER]-(cluster2)
SET r1.splitDistance = m.minTreeDistance,
    r2.splitDistance = m.minTreeDistance,
    c.size = coalesce(cluster1.size, 1) + coalesce(cluster2.size, 1),
    m.evaluated = true
RETURN *

Gunthor son of Gurn and the Shrouded Lord were the first two characters to be joined in a cluster. If we hover over the :MST relationship between them, we see the minTreeDistance was 0.0. Run this query to see the set of people with whom Gunthor and the Shrouded Lord interact.

MATCH (p1:Person {name:"Shrouded Lord"})-[:INTERACTS]->(p2)
RETURN p1, p2
UNION ALL
MATCH (p1:Person {name:"Gunthor son of Gurn"})-[:INTERACTS]->(p2)
RETURN p1, p2

Interactions for Gunthor son of Gurn and Shrouded Lord

They both interacted with just one person, Tyrion Lannister. That makes 100% overlap between their sets of neighbors, so they are as close as they can be according to our distance measure. At least two other characters also interact only with Tyrion, so with k = 3, Gunthor and the Shrouded Lord’s core distances are also 0.0, and their mutual reachability is 0.0.
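
You can confirm that claim by listing every character whose entire interaction set is just Tyrion:

MATCH (p:Person)-[:INTERACTS]-(other:Person)
WITH p, collect(DISTINCT other.name) AS partners
WHERE partners = ["Tyrion Lannister"]
RETURN p.name AS character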

You can continue running the clustering query above to create new clusters one at a time. After you have done that a few times, run the statement below, which uses the apoc.periodic.commit procedure to finish creating the clusters. It processes one minimum-weight relationship at a time until the count it returns reaches zero.

CALL apoc.periodic.commit(
  "MATCH (a)-[m:MST]->(b)
   WHERE m.evaluated IS NULL
   WITH a, b, m
   ORDER BY m.minTreeDistance
   LIMIT 1
   WITH custom.getLastCluster(a).lastCluster AS cluster1,
        custom.getLastCluster(b).lastCluster AS cluster2,
        m
   CREATE (cluster1)-[r1:IN_CLUSTER]->(c:Cluster)<-[r2:IN_CLUSTER]-(cluster2)
   SET r1.splitDistance = m.minTreeDistance,
       r2.splitDistance = m.minTreeDistance,
       c.size = coalesce(cluster1.size, 1) + coalesce(cluster2.size, 1),
       m.evaluated = true
   RETURN count(*)", {})

When you are done, you can go back to Neo4j Bloom to look at our cluster tree. Click the sidebar icon in the top left to expand the perspective editor, and then click the Refresh Perspective button to pull in the latest data.

Scroll to the bottom of the category list, and add a category for “Cluster.”

Click the gear widget on the left to open the settings and increase the Node query limit to more than 2,000.

Node query limit setting

Click the gear widget again to close the settings sidebar. Right click the canvas and clear the scene. Now, if you search for the IN_CLUSTER relationship in the search box, you can get a nice view of our cluster tree as each Person is added in.

Segment of cluster tree in Bloom

Condense the cluster tree

The next step in the process is to condense the cluster tree. Minimum cluster size is one of the important parameters that you can control when you are running HDBSCAN. If we set a low minimum cluster size, we will get more, smaller clusters. If we set a higher minimum cluster size, we will get fewer, larger clusters.

Execute this command in Neo4j Browser to set the minClusterSize parameter to 20.

:param minClusterSize => 20

The next command will assign any Person nodes attached to a cluster of fewer than 20 nodes to the first ancestor cluster that meets the minimum size.

MATCH (c1:Cluster)-[r:IN_CLUSTER]->(c2:Cluster)
WHERE c1.size < $minClusterSize and c2.size >= $minClusterSize
WITH c1, r, c2
MATCH (m:Person)-[:IN_CLUSTER*]->(c1)
CREATE (m)-[:IN_CLUSTER {splitDistance:r.splitDistance}]->(c2)

Then, we can delete the undersized clusters.

MATCH (c:Cluster) WHERE c.size < $minClusterSize DETACH DELETE c

If we start from the top of the tree and work our way down, we see that many of the branches represent one or two nodes breaking away from the main cluster.

We can think of this as the main cluster continuing to exist, but shedding a few members. Other splits represent a larger cluster splitting into two viable subclusters. We’ll label the start of those viable subclusters as “cluster births”.

MATCH (c1:Cluster)-[r:IN_CLUSTER]->(c)<-[:IN_CLUSTER]-(c2:Cluster)
WHERE c1.size >= $minClusterSize AND c2.size >= $minClusterSize
AND id(c1) < id(c2)
WITH r.splitDistance AS splitDistance, [c1, c2] AS clusters
UNWIND clusters AS cluster
WITH cluster, 1/splitDistance AS lambda
SET cluster:ClusterBirth, cluster.birthLambda = lambda
RETURN *

We also want to count the root of the tree as a cluster birth.

MATCH (c:Cluster) WHERE NOT (c)-[:IN_CLUSTER]->()
SET c:ClusterBirth, c.birthLambda = 0
RETURN *

In the first ClusterBirth assignment query above, we set a value called lambda as 1/splitDistance.

Recall that the nodes near the root of our cluster tree correspond to the biggest distances in the minimum weight spanning tree, while the leaf nodes of the cluster tree correspond to the smallest distances in the minimum weight spanning tree.

As the inverse of the split distances, the lambda values start small at the root and increase towards the leaves. We’ll use the lambdas later to decide which of the clusters to keep as our final result.
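
A quick way to see the pattern is to list the cluster births in lambda order; the root sits at 0, and the values grow as the splits get tighter.

MATCH (c:ClusterBirth)
RETURN id(c) AS clusterId, c.size AS size,
  round(c.birthLambda * 1000)/1000.0 AS birthLambda
ORDER BY birthLambda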

Next, we’ll label the nodes just before a cluster splits into child clusters or drops below the minimum cluster size as “cluster deaths.”

MATCH (c:Cluster)
WHERE NOT (c)<-[:IN_CLUSTER]-(:Cluster)
OR (c)<-[:IN_CLUSTER]-(:ClusterBirth)
SET c:ClusterDeath
RETURN c

Our final task in this section is to merge all the nodes in the path from cluster birth to cluster death into a single cluster.

To do this, we’ll use two APOC procedures. The apoc.path.expand procedure allows us to search along the :IN_CLUSTER relationships until we reach a :ClusterDeath node.

The apoc.refactor.mergeNodes procedure merges multiple nodes into a single node while preserving the nodes’ relationships.

MATCH (cb:ClusterBirth)
CALL apoc.path.expand(cb, "<IN_CLUSTER", "/ClusterDeath", 1, 800) YIELD path
WITH nodes(path) AS mergeList
CALL apoc.refactor.mergeNodes(mergeList, {properties: "discard", mergeRels: true}) YIELD node
RETURN node

Since we merged more than two nodes to create some of our condensed clusters, we end up with some unnecessary self-relationships. Use this query to delete them.

MATCH (c:Cluster)-[r:IN_CLUSTER]->(c)
DELETE r

Extract the clusters

Now we have a condensed hierarchy of clusters. You can go back to Bloom, clear the scene, and search for the Cluster — Cluster pattern to see the structure.

Cluster hierarchy

Each of the remaining clusters is above the minimum size threshold, and we’d like to select the highest-quality clusters for our output. We use the lambda value, calculated as 1/splitDistance, to guide our choice.

For each node in a cluster, we calculate the difference between the lambda at cluster birth and the lambda at the point the node departs the cluster, either by falling out of the cluster or because the cluster has split into viable child clusters. The sum of these lambda values across all nodes in a cluster is called the cluster’s stability. Use this query to calculate the stability for our clusters.

MATCH (c:Cluster)<-[ic:IN_CLUSTER]-(n)
WITH c, sum((1/ic.splitDistance - c.birthLambda) * coalesce(n.size, 1)) AS stability
SET c.stability = stability
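
In symbols, what this query computes for each cluster C is the standard HDBSCAN stability,

stability(C) = Σ_{p ∈ C} (λ_p - λ_birth(C))

where λ_p is 1/splitDistance at the point p leaves C. The coalesce(n.size, 1) factor weights each attached node by the number of members it represents, since some of the nodes hanging off a condensed cluster are merged sub-clusters rather than single Persons.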

We temporarily mark all of the leaf clusters as “selected.”

MATCH (c:Cluster)
WHERE NOT (c)<-[:IN_CLUSTER]-(:Cluster)
SET c:Selected
RETURN c

Beginning with the splits farthest from the root, we now consider each cluster split to decide whether to select the children of the split or the parent node. We’ll number the cluster splits to keep track of the evaluation order.

MATCH p = (:Cluster)-[:IN_CLUSTER]->(c:Cluster)-[:IN_CLUSTER*]->(:Cluster)
WITH c, length(p) AS pathLength
WITH c, max(pathLength) AS distanceToRoot
ORDER BY distanceToRoot DESC
WITH collect(c) AS clusters
UNWIND range(1, size(clusters)) AS rank
WITH clusters[rank-1] AS cluster, rank
SET cluster.evaluationOrder = rank

At each cluster split, we compare the parent’s stability with the sum of the child stabilities. If the sum of the child stabilities is bigger, we set the parent’s stability to that sum. If the parent’s stability is greater, we mark the parent as selected and deselect the child clusters.

The apoc.periodic.iterate procedure lets us iterate through the splits in evaluation order. We use the apoc.do.when procedure to handle the conditional update logic.

CALL apoc.periodic.iterate(
  "MATCH (c:Cluster) WHERE c.evaluationOrder IS NOT NULL
   RETURN c.evaluationOrder AS evaluationOrder
   ORDER BY evaluationOrder",
  "MATCH (c:Cluster {evaluationOrder: evaluationOrder})<-[:IN_CLUSTER]-(child)
   WITH c.stability AS parentStability,
        sum(child.stability) AS childStability,
        c AS parent,
        collect(child) AS children
   CALL apoc.do.when(parentStability < childStability,
     'SET parent.stability = childStability',
     'SET parent:Selected FOREACH(child IN children | REMOVE child:Selected)',
     {parent: parent, childStability: childStability, children: children}) YIELD value
   RETURN value",
  {batchMode: "SINGLE"})

The clusters that retain the Selected label are the final output from HDBSCAN. You can return the cluster members with this query.

MATCH (s:Selected)<-[:IN_CLUSTER*]-(m:Person)
RETURN id(s), s.size, collect(m.name) AS names

I expected that characters would tend to interact with people in the same house, and so some houses would tend to dominate our output clusters. Based on this query, that seems to be the case for several of the clusters, while others are more of a mixed bag.

MATCH (c:Selected)<-[:IN_CLUSTER*]-(p:Person)-[:BELONGS_TO]->(h:House)
WITH c, h, count(*) as houseCount ORDER BY houseCount DESC
RETURN id(c), c.size,
collect({name: h.name, count:houseCount,
percent:houseCount*100/c.size})[..5] as houses
╒═══════╤════════╤══════════════════════════════════════════════════════════╕
│"id(c)"│"c.size"│"houses"                                                  │
╞═══════╪════════╪══════════════════════════════════════════════════════════╡
│3062   │438     │None 74 (16%), Stark 55 (12%), Night's Watch 45 (10%),    │
│       │        │Lannister 35 (7%), Baratheon 31 (7%)                      │
├───────┼────────┼──────────────────────────────────────────────────────────┤
│3137   │59      │Night's Watch 31 (52%), Wildling 10 (16%), None 3 (5%),   │
│       │        │Stark 2 (3%), Kingswood Brotherhood 1 (1%)                │
├───────┼────────┼──────────────────────────────────────────────────────────┤
│2892   │114     │None 24 (21%), Stark 22 (19%), Baratheon 16 (14%),        │
│       │        │Tully 9 (7%), Lannister 8 (7%)                            │
├───────┼────────┼──────────────────────────────────────────────────────────┤
│3008   │41      │Greyjoy 16 (39%), None 5 (12%), Bolton 5 (12%),           │
│       │        │Stark 5 (12%), Goodbrother 3 (7%)                         │
├───────┼────────┼──────────────────────────────────────────────────────────┤
│2923   │43      │Lannister 14 (32%), Tyrell 5 (11%), None 5 (11%),         │
│       │        │Baratheon 3 (6%), Redwyne 2 (4%)                          │
├───────┼────────┼──────────────────────────────────────────────────────────┤
│3063   │25      │Lannister 10 (40%), None 10 (40%), Clegane 6 (24%),       │
│       │        │Brotherhood without banners 4 (16%), Stark 2 (8%)         │
└───────┴────────┴──────────────────────────────────────────────────────────┘

Our algorithm run placed 463 characters in clusters. However, 332 characters with at least one :INTERACTS relationship remain outside the selected clusters, including some major characters.

MATCH (p:Person)
WHERE NOT (p)-[:IN_CLUSTER*]->(:Selected)
AND (p)-[:INTERACTS]-()
RETURN p.name
ORDER BY p.pageRank DESC LIMIT 10
╒════════════════════╕
│"p.name"            │
╞════════════════════╡
│"Jon Snow"          │
├────────────────────┤
│"Stannis Baratheon" │
├────────────────────┤
│"Daenerys Targaryen"│
├────────────────────┤
│"Arya Stark"        │
├────────────────────┤
│"Robb Stark"        │
├────────────────────┤
│"Eddard Stark"      │
├────────────────────┤
│"Catelyn Stark"     │
├────────────────────┤
│"Theon Greyjoy"     │
├────────────────────┤
│"Tywin Lannister"   │
├────────────────────┤
│"Samwell Tarly"     │
└────────────────────┘

This is a feature of HDBSCAN, not a bug. We have identified the dense clusters in our data, and we can decide separately how to handle the edge cases. Perhaps some of these unclustered characters just don’t fit a common interaction pattern.

The HDBSCAN documentation gives suggestions for adjusting parameters to decrease the number of unclustered data elements, and for assigning unclustered data elements to clusters based on soft clustering.

I hope that working through the HDBSCAN algorithm with Neo4j has given you deeper insight into how this algorithm works.

You can compare and contrast this approach to other community detection and clustering algorithms. I hope HDBSCAN proves to be a useful tool in your data science toolkit.
