The world’s leading publication for data science, AI, and ML professionals.

Create a similarity graph from node properties with Neo4j

Build a similarity graph based on node properties, even when no relationships are present in the data model.

The original version of this article and the accompanying video used a version of Neo4j Sandbox that is no longer current. I have updated the code to use GDS 2.0.1 and the version of Neo4j Sandbox which is current in July, 2022.

Cluster analysis helps us uncover structures hidden within data. We can use unsupervised machine learning algorithms to group data items into clusters so that items have more in common with other items inside their cluster than they do with items outside their cluster.

Tasks such as customer segmentation, recommendation, and anomaly detection become easier when we organize a dataset into subsets of similar items.

The concept of similarity lies at the heart of cluster analysis. The Neo4j Graph Data Science library gives us tools for discovering the similarity between nodes in a dataset based on the node’s properties, even if there aren’t preexisting relationships in our data.

The Bloom graph visualization application then allows users to interactively explore similarity in an accessible, explainable way. This can help move unsupervised machine learning away from the realm of black box magic, towards being an intuitive tool for business users to employ. The Graph Data Science library subsequently allows us to find clusters of closely related nodes based on similarity relationships.

For this example, I used data about metropolitan statistical areas (MSAs) from the 2019 American Community Survey five-year estimates provided by the United States Census Bureau.

I selected four demographic features to describe each MSA:

  • population,
  • median household income,
  • median home price, and
  • the percent of the population over age 25 with a bachelors degree.

In this tutorial, we will use a similarity algorithm to identify pairs of MSAs that are similar based on the four demographic features. Then, we will use the similarity relationships to identify clusters of MSAs.

Photo by Mackenzie Weber on Unsplash
Photo by Mackenzie Weber on Unsplash

You can follow along with me by creating a free Neo4j sandbox. Choose to create a "Blank Sandbox – Graph Data Science", because we will load our own data. Then click the "Open" button to launch Neo4j Browser to get started.

Before we load data, we’ll create a unique constraint for the name of the MSA.

CREATE CONSTRAINT msa_name ON (m:MSA) ASSERT m.name IS UNIQUE

Use this command to load the MSA data from a CSV file on GitHub.

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/smithna/datasets/main/CensusDemographicsByMetroArea.csv" AS row
WITH row WHERE row.name CONTAINS 'Metro'
MERGE (m:MSA {name:row.name})
SET m.population = toInteger(row.population),
m.medianHouseholdIncome = toInteger(row.medianHouseholdIncome),
m.medianHomePrice = toInteger(row.medianHomePrice),
m.percentOver25WithBachelors = toFloat(row.percentOver25WithBachelors)
RETURN m

Note that the LOAD CSV utility treats every field as a string, so we use toInteger() and toFloat() functions to convert the numeric values to the appropriate data type. After running the command, you should see that MSA nodes have been created.

Nodes representing metropolitan areas. Image by author.
Nodes representing metropolitan areas. Image by author.

Unlike most Neo4j projects, we don’t have any relationships to load at the start. Instead, we’re going to use the Graph Data Science library to discover similar nodes and create relationships between them.

We’ll define similarity based on Euclidean Distance. The Euclidean distance formula is an extension of the Pythagorean theorem, which says that the square of the length hypotenuse of a right triangle is equal to the sum of the squares of the other two sides.

Pythagorean theorem. Image by author.
Pythagorean theorem. Image by author.

In our case, we are comparing MSAs along four dimensions instead of two, but the concept is similar. The Euclidean distance between two MSAs is the square root of the sum of the squared differences along each of our demographic metrics.

Euclidean distance formula
Euclidean distance formula

Before we calculate distance, we’ll want each of our four properties to have roughly the same range from the smallest value to the largest value. If there is a big difference in the ranges, then the properties with a big range will dominate the distance calculation; the properties with a small range will hardly influence the distance calculation at all. So we need to normalize all of them onto the same range.

Run this query to see the range of values for each of the four properties that we loaded.

MATCH (n)
WITH n, 
["population", "medianHouseholdIncome", "medianHomePrice", "percentOver25WithBachelors" ] AS metrics
UNWIND metrics as metric
WITH metric, n[metric] AS value
RETURN metric, min(value) AS minValue,
percentileCont(value, 0.25) AS percentile25, 
percentileCont(value, 0.50) AS percentile50, 
percentileCont(value, 0.75) AS percentile75, 
max(value) AS maxValue
Table showing range of values for each property. Image by author.
Table showing range of values for each property. Image by author.

The range of populations goes from 54,773 to 19,294,236. That’s a huge difference from the range of percent of people over 25 with a college degree, which runs from 12.9 to 67.4.

We can also see that there is a much bigger difference between the 50th and 75th percentile for population (336,612 citizens) than there is between the 25th and 50th percentile (94,771.25 citizens). I downloaded the data and plotted histograms on both a standard scale and a logarithmic scale to examine the distributions.

Histograms of metrics on unit and logarithmic scales. Image by author.
Histograms of metrics on unit and logarithmic scales. Image by author.

All of our metrics look like they approach a more normal distribution if we apply a log transformation. This makes sense to me on an intuitive level when I consider that adding 10,000 residents to an MSA with a population of 60,000 would be a big change to the character of the metro area, but adding 10,000 residents to an MSA with a population of 19 million would make a much smaller impact. On the other hand, a 10% change in population would be noticeable in an MSA of any scale.

Run the command below to apply a log transformation to population, median home price, and percent of population over 25 with a bachelors degree.

MATCH (m:MSA)
SET 
m.logPopulation = log(m.population),
m.logMedianHouseholdIncome = log(m.medianHouseholdIncome),
m.logMedianHomePrice = log(m.medianHomePrice),
m.logPercentOver25WithBachelors = log(m.percentOver25WithBachelors)

The MinMax scaler in the Graph Data Science library will apply the formula below to rescale each property value. The minimum value is set to 0, the maximum value is set to 1, and all other values fall between 0 and 1 proportionally to their positions in the original data set.

MinMax scaler formula
MinMax scaler formula

We’ll create an in-memory graph called "metro-graph" to efficiently perform the scaling calculation on all the nodes at once. We will load all nodes with the label MSA into the in-memory graph. Since we don’t have any relationships in our graph yet, we will use a * wildcard as a placeholder.

CALL gds.graph.project('msa-graph', 
'MSA', 
'*', 
{nodeProperties:["logPopulation", "logMedianHouseholdIncome", "logMedianHomePrice", "logPercentOver25WithBachelors"]})

Run the command below to create a new property in the in-memory graph called scaledProperties. The new property will be a vector (an ordered list) of four values: population, medianHouseholdIncome, medianHomePrice, percentOver25WithBachelors. The MinMax scaler will be applied to each property before it is added to the vector, so that all the values are within the range [0, 1].

CALL gds.alpha.scaleProperties.mutate('msa-graph', 
{nodeProperties: ["logPopulation", "logMedianHouseholdIncome", "logMedianHomePrice", "logPercentOver25WithBachelors"], 
scaler:"MinMax",
mutateProperty: "scaledProperties"}) 
YIELD nodePropertiesWritten

We can use scaledProperties that we just created to calculate the distance between the nodes. The Euclidean distance formula tells us how far apart MSAs are in 4-dimensional space. The higher the value of the distance, the less similar the two MSAs are. We’re interested in similarity between MSAs, so we will use an algorithm that transforms distance to similarity.

Euclidean similarity formula
Euclidean similarity formula

The higher the value of thesimilarity property, the more similar two MSAs are.

The K-Nearest Neighbors algorithm will create relationships in our graph from each node to the its closest neighbors according to the Euclidean similarity formula. The topK parameter tells the algorithm to create a relationship called IS_SIMILAR from each node to its 15 nearest neighbors. Before running the algorithm in write mode, let’s run it in stats mode and examine the distribution of similarity scores.

CALL gds.knn.stats("msa-graph",
   {
      nodeProperties:{scaledProperties:"EUCLIDEAN"},
      topK:15,
      sampleRate:1,
      randomSeed:42,
      concurrency:1
   }
) YIELD similarityDistribution 
RETURN similarityDistribution
-----------------------------------
{
  "p1": 0.8350143432617188,
  "max": 0.9999847412109375,
  "p5": 0.9390525817871094,
  "p90": 0.996307373046875,
  "p50": 0.9895782470703125,
  "p95": 0.9975738525390625,
  "p10": 0.9621849060058594,
  "p75": 0.9938201904296875,
  "p99": 0.998992919921875,
  "p25": 0.9813117980957031,
  "p100": 0.9999847412109375,
  "min": 0.77587890625,
  "mean": 0.9814872709261316,
  "stdDev": 0.027092271447973864
}

We can see that 99% of relationships have a similarity score above 0.835, and the minimum similarity is around 0.776. When we update the in-memory graph with similarity relationships, we will use the similarityThreshold parameter to exclude relationships with similarity scores in the bottom 1%.

CALL gds.knn.mutate("msa-graph",
               {nodeProperties: {scaledProperties: "EUCLIDEAN"},
               topK: 15,
               mutateRelationshipType: "IS_SIMILAR",
               mutateProperty: "similarity",
               similarityCutoff: 0.8350143432617188,
               sampleRate:1,
               randomSeed:42,
               concurrency:1}
              )

Running KNN in mutate mode created the IS_SIMILAR relationships in our in-memory graph projection. Now write them to the Neo4j graph so that they can be queried with Cypher and visualized in Bloom.

CALL gds.graph.writeRelationship(
    "msa-graph",
    "IS_SIMILAR",
    "similarity"
)

Let’s add a rank property to the IS_SIMILAR relationships so that we can filter the top ranked relationships for each node.

MATCH (m:MSA)-[s:IS_SIMILAR]->()
WITH m, s ORDER BY s.similarity DESC
WITH m, collect(s) as similarities, range(0, 11) AS ranks
UNWIND ranks AS rank
WITH rank, similarities[rank] AS rel
SET rel.rank = rank + 1

Now let’s take a look at our graph in the Bloom graph visualization application. Go back to the Neo4j Sandbox home page and choose "Open with Bloom."

Launching Bloom. Image by author.
Launching Bloom. Image by author.

Sign in with the credentials for your Sandbox. Since this is our first time using Bloom with this dataset, click on "Create Perspective." Then, click "Generate Perspective to let Bloom set up a perspective for you. Finally, click on your newly generated perspective to enter the Bloom scene.

Click "Use Perspective" to use your new Bloom perspective. Image by author.
Click "Use Perspective" to use your new Bloom perspective. Image by author.

Let’s configure the display of the "IS_SIMILAR" relationships so that the line weight corresponds to the similarity of the nodes. Click Relationships on the panel at the right of the screen, and then click the line next to the "IS_SIMILAR" relationship.

Click the line next to IS_SIMILAR. Image by author.
Click the line next to IS_SIMILAR. Image by author.

In the fly-out menu that appears, choose "Rule-based." Then, click the plus sign to add a new rule-based style.

Choose "similarity" from the property key drop down. Select the radio button for "range." Click the "Size" button to create a rule that will control line weight. Toggle the button to apply the size rule.

In the Minpoint box, enter 0.83 and set the minimum size to 0.25x. In the Maxpoint box, enter 1 and set the maximum size to 4x. Your configuration should look like the picture below.

Rule-based styling configuration. Image by author
Rule-based styling configuration. Image by author

Now, let’s use Bloom to search for an MSA and its closest peers. If you start typing "New York" in the Bloom search box, Bloom’s auto-complete should suggest the "New York-Newark-Jersey City, NY-NJ-PA Metro Area." Press tab to accept the suggestion. Once you have a node for New York’s MSA, Bloom will auto-suggest patterns involving that node. Choose the pattern that extends from the New York node with an IS_SIMILAR relationship, and press enter.

Bloom search box. Image by author.
Bloom search box. Image by author.

Bloom will return a visualization showing the New York MSA and its 15 most similar MSAs based on the four demographic measures that we used.

Most similar MSAs to New York. Image by author.
Most similar MSAs to New York. Image by author.

Based on the thickness of the outbound arrows, we can see that Los Angeles is the most similar MSA to New York.

We might be surprised to see that there are relationships going both to and from New York and Los Angeles, San Diego, Boston, Washington, and Seattle, but not the other cities in the visualization. We told the Euclidean distance algorithm to create relationships for the top 15 closest MSAs for each node. If I make a list of the 15 closest MSAs to New York, Atlanta is included. However, if I make a list of the 12 closest MSAs to Atlanta, New York does not make the cut. That is why there is a relationship directed from New York to Atlanta, but not a relationship directed from Atlanta to New York.

Let’s filter the display to see the top 5 most similar cities to New York. Click the funnel icon on the left of the Bloom screen to expand the filter panel. Choose IS_SIMILAR in the dropdown. Select the "rank" property. Select the "less than or equal to" condition, and set the value to 5. Toggle "Apply Filter" to highlight the top 5 strongest similarity relationships for New York.

Filtered relationships. Image by author
Filtered relationships. Image by author

We’ve explored the neighbors of a single MSA. Let’s now look at the big picture. Right click the Bloom canvas and choose "Clear scene" from the context menu. In the search box, enter "MSA" and press tab. Select the (MSA)-IS_SIMILAR-() pattern, and press enter. We’ll see the whole MSA graph.

Complete MSA similarity graph. Image by author.
Complete MSA similarity graph. Image by author.

Open the filter panel again. Create a new filter on the IS_SIMILAR relationships for "similarity" greater than or equal to 0.96. Toggle "Apply Filter," and then click the "Dismiss Filtered elements" button at the bottom of the panel. We start to see some islands and peninsulas differentiate from the main group of nodes. There are three outlier nodes where none of their 15 most similar neighbors have similarity greater than 0.96. We’ll probe the structure of the graph further with community detection algorithms.

Filtered MSA similarity graph. Image by author
Filtered MSA similarity graph. Image by author

Go back to Neo4j Browser to run the following commands for community detection. We will execute the Louvain community detection algorithm against the in-memory graph and write the community ids back to the main Neo4j graph. We tell the algorithm to use the similarity property to weight the relationships between nodes. By setting the consecutiveIds parameter to true, we tell the algorithm that we want the values of the communityId property that are generated to be consecutive integers.

CALL gds.louvain.write('msa-graph',
{relationshipTypes: ["IS_SIMILAR"],
relationshipWeightProperty:"similarity",
 writeProperty:"communityId"})
YIELD communityCount, modularities
RETURN communityCount, modularities
Community detection results. Image by author.
Community detection results. Image by author.

The output shows us that 6 communities were detected. Successive iterations of the algorithm increased the modularity of the community partitions from 0.625 to 0.667.

Run the query below to gather summary statistics about the communities we have defined. In addition to statistics, we’re adding context by listing the three MSAs in each community with the highest similarity to other MSAs in the same community. These MSAs should be typical examples of their community, as opposed to other MSAs that might have fewer or weaker in-community relationships.

MATCH (m:MSA)
WITH m 
ORDER BY apoc.coll.sum([(m)-[s:IS_SIMILAR]->(m2) 
WHERE m.communityId = m2.communityId | s.similarity]) desc
RETURN m.communityId as communityId,
count(m) as msaCount, 
avg(m.population) as avgPopulation,
avg(m.medianHomePrice) as avgHomePrice,
avg(m.medianHouseholdIncome) as avgIncome,
avg(m.percentOver25WithBachelors) as avgPctBachelors,
collect(m.name)[..3] as exampleMSAs
ORDER BY avgPopulation DESC
Table summarizing community features. Image by author.
Table summarizing community features. Image by author.

The knn and Louvain algorithms are The community with id 81 has the largest population MSAs, with average population almost 2 million. Community 54 also has high population MSAs, but incomes and home prices are lower than community 81. Communities 234 and 330 are pretty similar, but 234 has higher populations and 330 has higher housing prices. Community 75 has highly educated citizens compared to communities with similar population sizes. Community 264 has small population MSAs. Community 385 has low-income MSAs.

Let’s give the communities names that are more human friendly than the auto-generated ids.

MATCH (m:MSA) 
  SET m.communityName = CASE m.communityId 
  WHEN 54 THEN "Large mid-cost metros"
  WHEN 75 THEN "College towns"
  WHEN 81 THEN "Large high-cost metros"
  WHEN 234 THEN "Mid-size metros"
  WHEN 264 THEN "Small metros"
  WHEN 330 THEN "Mid-price metros"
  WHEN 385 THEN "Low-income metros"
  END
return m.communityName, m.communityId, count(*)

To make the communityName property searchable in Bloom, create an index on that property.

CREATE INDEX msa_community_name IF NOT EXISTS
FOR (m:MSA)
ON (m.communityName)

Let’s take a look at the results in Bloom. We’ll configure the perspective to reflect the new communities we have added. Click the icon in the top left of the screen to open the perspectives panel. Then, click the database icon to refresh the perspective to include the new communityName property we added to the database.

Refresh the Bloom perspective to include the new property in the database. Image by author.
Refresh the Bloom perspective to include the new property in the database. Image by author.

In the panel on the right side of the screen, click on the circle next to MSA to style the MSA nodes. In the fly-out menu that appears, choose "Rule-based." Then, click the plus sign to add a new rule-based style. Choose "communityName" from the property key drop down. Select the radio button for "unique values." Toggle the button to apply the color rule.

Rule-based styling configuration for communityName. Image by author.
Rule-based styling configuration for communityName. Image by author.

Let’s add a second conditional style to set the size of the nodes based on population. Click the plus to add a new rule-based styling rule.

Choose "population" from the property-key drop down. Select the radio button for "range." Click the "size" button to create a rule that will control node size. Toggle the button to apply the size rule.

In the Minpoint box, enter 55000 and set the minimum size to 1x. In the Maxpoint box, enter 19000000 and set the maximum size to 4x. Your configuration should look like the box below.

Rule-based styling for population. Image by author.
Rule-based styling for population. Image by author.

Now right click and choose "Clear Scene" from the context menu. In the search box, type "College towns" and press enter. Among the results that are returned are the State College, PA Metro, the Ithaca, NY Metro, the Champaign-Urbana, IL Metro, and the Charlottesville, VA Metro. Those metro areas are home to Pennsylvania State University, Cornell University, the University of Illinois, and the University of Virginia. Based on what we know of those metro areas, it seems like the algorithm has picked up some clear commonalities.

Some of the MSAs returned by a search for "College towns." Image by author.
Some of the MSAs returned by a search for "College towns." Image by author.

Let’s see which other metros are most similar to the college towns. Click Command-A to select all the nodes. Then right click one of the nodes and chose Expand: <-IS_SIMILAR- from the context menu.

Cities most similar to college towns. Image by author.
Cities most similar to college towns. Image by author.

When I zoom out, I can see that we have added several large metro areas (purple), medium metro areas (orange), small middle-income areas (teal), and one medium high-cost metro (red). Double click on a node or zoom in to see the node details. Several MSAs that we have added, such as Madison, WI Metro Area and Durham-Chappel Hill, NC happen to be home to major universities.

Experiment with this dataset to explore similarity graphs further. Decreasing the number of similarity relationships created by decreasing the topK parameter or setting the similarityThreshold parameter when you run gds.alpha.similarity.euclidean will tend to result in more communities when you run gds.louvain. Try scalers other than MinMax when you run gds.alpha.scaleProperties.

The graph data science library supports several other similarity algorithms. The K-Nearest Neighbors algorithm provides an efficient way to generate links based on cosine similarity.

Louvain community detection is one option among several community detection algorithms. See how different your results are by running the Label Propagation algorithm.

The data CSV on GitHub includes micropolitan statistical areas in addition to metropolitan statistical areas. Remove WHERE row.name INCLUDES 'Metro' from the LOAD CSV script to load these nodes and explore their properties.

You’ll find that graphs really are everywhere, even when relationships are not apparent in the data model at first.


Related Articles