Hands-on Tutorials, Marvel network analysis

Exploratory network analysis of Marvel Universe

Introducing the new k-nearest neighbors algorithm in the Graph Data Science library

Tomaz Bratanic
Towards Data Science
15 min readOct 23, 2020

--

A wise man once said that the 2020–30 decade will be the decade of graph data science. Actually, that happened just a few days ago at the Nodes 2020 conference, and that wise man was Emil Eifrem presenting at the keynote of the Nodes 2020. In case you missed the conference, all the presentations are already available online.

Only fitting Emil’s statement, a pre-release of the 1.4 version of the Neo4j Graph Data Science library was published a couple of days ago. It is a significant milestone for the GDS library. A lot of new features were added in this release. If you are interested to learn more, you can inspect the release statement. In this blog post, we will take a look at the new k-Nearest Neighbours algorithm. But before we do that, we will do a proper graph exploratory analysis.

I’ve written so many blog posts it takes an effort to find some excellent datasets I haven’t explored yet. This time I saw a cool repository on Kaggle containing data about the Marvel Universe. Unfortunately, I later realized that only comics and characters files have matching ids. As a graph analyst, we want to connect all the relevant data, and without matching ids, that’s a bit hard. I then realized that those matching ids were scraped from the Marvel API. I fetched some additional data from the API to enrich our graph. The information about the characters is not available over the API but is available on their website. This made me put on my Spider-Man suit and test my web crawling abilities. I’m quite proud to say that I learned to use Selenium and efficiently scraped the information about the characters from the Marvel website.

Graph import

You can easily import this Marvel Universe graph by running the cypher statements from this gist. It contains ten simple LOAD CSV cypher statements and an apoc.schema.assert procedure to define unique constraints and indexes. If you use Neo4j Browser, make sure you have enabled the multi-statement query editor.

This way, you can copy the content of the whole gist and not worry about executing each statement separately.

Graph schema

Now that we have imported the graph, we can examine its schema with the following procedure:

CALL db.schema.visualization()

If you run this procedure in Neo4j Browser, you will get this nice visualization of the graph schema.

In the center of the graph, there are characters, also known as heroes. They can appear in multiple comics, are part of an event, and can belong to a group. For some of the characters, we also know their stats like speed and fighting skills. Finally, we have social ties between characters that represent relative, ally, or enemy relationships.

To get a feel for the size of the graph, we can run the following APOC procedure:

CALL apoc.meta.stats() YIELD labels
return labels

Results

There are 1105 characters that have appeared in 38875 comics.
We have stats for 470 of the characters. There are also 92 groups and 74 events stored in the graph.

Exploratory graph analysis

To get to know our graph, we will begin with a basic graph data exploration process. First, we will take a look at the characters that have most frequently appeared in comics.

MATCH (c:Character)
RETURN c.name as character,
size((c)-[:APPEARED_IN]->()) as comics
ORDER BY comics DESC
LIMIT 5

Results

You might be wondering where I got this bar chart. Apache Zeppelin notebooks have a built-in feature that lets you create various visualization. Luckily for us, Andrea Santurbano has developed the Neo4j Connector for Apache Zeppelin, making it easy to write cypher statements to fetch or write data to the Neo4j database.

The top five most frequent characters come as no surprise. Spiderman is the most frequent or popular character. It is no wonder that they created a younger version of Spiderman just recently, given his popularity. Tony Stark, also known as Iron Man, is in second place. It seems that Logan, also known as Wolverine, was quite popular throughout history, but I think that his popularity slowly faded away in recent times. Steve Rogers, who goes by the more popular name Captain America, is also quite famous. It would seem that the recent Marvel movies showcased the more popular characters from the comics.

You might be wondering what the events are in our graph, so let’s take a look. We will examine the events with the highest count of participating heroes.

MATCH (e:Event)
RETURN e.title as event,
size((e)<-[:PART_OF_EVENT]-()) as count_of_heroes,
e.start as start,
e.end as end,
e.description as description
ORDER BY count_of_heroes DESC
LIMIT 5

Results

I have little to no idea what these events represent, but it is interesting to see that many characters participate. Most of the events span over less than a year, while the Acts of Vengeance spans over two decades. After being notified by Michael Piper that this data is likely to be incorrect, I have cross-referenced it with other sources, and it seems that this is indeed incorrect data. The Acts of Vengeance event ran from December 1989 to February 1992. I have also notified Marvel to correct this information in their API as an apology for spreading misinformation and that others won’t run into the same issue. And judging by the description, Loki had something to do with it along with 92! other characters. Unfortunately, we don’t have the connection between comics and events stored in our graph to allow further analysis. If someone will scrape the Marvel API, I will gladly add it to the dataset.

Let’s also take a look at the biggest groups of characters.

MATCH (g:Group)
RETURN g.name as group,
size((g)<-[:PART_OF_GROUP]-()) as members
ORDER BY members DESC LIMIT 5

Results

There are 41 characters in X-Men, which makes sense as they had a whole academy. You might be surprised by 31 members of Avengers, but in the comics, there were many members of Avengers, although most are former members.

Just because we can, let’s inspect if some members of the same group are also enemies.

MATCH (c1:Character)-[:PART_OF_GROUP]->(g:Group)<-[:PART_OF_GROUP]-(c2:Character)
WHERE (c1)-[:ENEMY]-(c2) and id(c1) < id(c2)
RETURN c1.name as character1, c2.name as character2, g.name as group

Results

It seems that Logan does not get along with some of the other X-Men. For some of the characters, we also have the place of origin and education available, so let’s quickly look at that. During the scraping, I noticed a hero originated from Yugoslavia, so I wonder if there are more characters from Yugoslavia.

MATCH (c:Character)
WHERE c.place_of_origin contains "Yugoslavia"
RETURN c.name as character,
c.place_of_origin as place_of_origin,
c.aliases as aliases

Results

Two characters originated from today’s Croatia, which is less than two hours drive from where I live. Let’s also check out all the characters that completed their Ph.D. degree.

MATCH (c:Character)
WHERE c.education contains "Ph.D"
RETURN c.name as character, c.education as education
LIMIT 10

Results

It looks like a lot of these heroes are quite employable. Only Nightshade seems a bit dodgy. It feels like something one would put on their LinkedIn profile to get noticed when searching for Ph.D. profiles. By the way, did you know that Professor X has four Ph.D.s and is also MD in psychiatry? Quite the educated men.

Analyzing communities of allies and relatives

We have examined basic graph statistics, and now we will focus more on network analysis. We will investigate the social ties between characters.
To start, we will calculate the degree values for each relationship type between characters and display the heroes with the highest overall degree.

MATCH (c:Character)
RETURN c.name as name,
size((c)-[:ALLY]->()) as allies,
size((c)-[:ENEMY]->()) as enemies,
size((c)-[:RELATIVE]->()) as relative
ORDER BY allies + enemies + relative DESC
LIMIT 5

Results

Scarlet Witch and Thor seem to have the most direct enemies. Wolverine has the most allies but also many enemies. It looks like Triton has a big family with 17 direct relative relationships. We can use the apoc.path.subgraphAll procedure to examine the relatives' community of Triton. Check the documentation for more information.

MATCH p=(c:Character{name:"Triton"})
CALL apoc.path.subgraphAll(id(c), {relationshipFilter:"RELATIVE"})
YIELD nodes, relationships
RETURN nodes, relationships

Results

I never knew that some of the Marvel heroes have quite a big happy family. It wouldn’t be accurate if there weren’t a black sheep of the family present. Maximus looks like the family’s black sheep here as he has four enemies within the family. You might wonder why ally and enemy relationships are shown when we only traversed the relative ties. Neo4j Browser has a feature that displays all connections between nodes on the screen.

If you have a dense graph, you should probably tick off this setting as otherwise a lot of connections will show up in the Neo4j Browser.

Weakly Connected Components algorithm

The Weakly Connected Components is a part of almost every graph analysis workflow. It is used to find disconnected components or islands within the network.

Example results of WCC algorithm

In this example, the graph consists of two components. Michael, Mark, and Doug belong to the first component, while Bridget, Alice, and Charles belong to the second component. We will apply the Weakly Connected Components algorithm to find the largest component of allied characters. As we don’t plan to run any other algorithms on this network, we will use the anonymous graph projection.

CALL gds.wcc.stream({
nodeProjection:'Character',
relationshipProjection:'ALLY'})
YIELD nodeId, componentId
WITH componentId, count(*) as members
WHERE members > 1
RETURN componentId, members
ORDER BY members DESC
LIMIT 5

Results

The largest component of allies has 195 members. Then we have a couple of tiny allies islands with only a few members. If we visualize the largest component of allies in the Neo4j Browser and have the connect results nodes option selected, we get the following visualization.

Although we have found the largest allies component, we can observe that many of the characters in the component are actually enemies (red relationships). To better understand why this occurs, let’s look at the following example.

In this example, Thanos is an enemy with Gamora, Magus, and Captain Marvel but is still a member of the same allied component. They belong to the same component because one can traverse the graph from any of the members to all the other group members using only the ALLY relationships (if we treat them as undirected).

Custom ally component algorithm

Suppose we wanted to find communities of allies where there are no enemies within the given component. The algorithm implementation is relatively straightforward, and you could use Neo4j custom procedures, for example. Still, if you are like me and don’t speak Java, you can always resort to your favorite scripting language. I have developed the custom Ally component algorithm in Python. First, we define some helper functions for fetching allies and enemies of a single node.

My implementation is relatively simple. The input to the algorithm is the list of all node ids in the largest allied components. Start from a single node, load its enemies into the enemies list and load its allies into a queue that will be processed later. Then we iterate over the allied queue. If a node is not an enemy with any of the existing nodes in the component, add them to the community list and add their enemies to the community’s enemies list. I’ve added some minor performance tweaks like if we have traversed the node already in the allies queue, we can remove that node from the global list of starting nodes.

In this code, the algorithm only returns the ids of nodes that belong to the largest allied component where there are no enemies within. It shouldn’t be a problem to mark these nodes in Neo4j, as you can match them by their ids. The largest component of allies, where there are no enemies within, has 142 members. If we visualize it in Neo4j Browser, we can see that there are no enemy relationships visible.

Analyzing characters’ stats

In the last part of our analysis, we will examine the stats of the characters. We have the stats available for a total of 470 heroes. This information was scraped from Marvel’s website. Here is an example of what the stats look like for a single character on the Marvel website.

Can you guess to whom they belong? They are Tony Stark’s (Iron Man) stats. The scale for stats ranges from zero to seven, and Iron Man does not have a single seven. Probably not the strongest of the heroes, even though he is one of the more popular ones. Now we will explore the characters with the highest stats average. Whenever I need some help with my cypher queries, I turn to Neo4j Slack. Luckily for us, Andrew Bowman is always around with great advice on optimizing and prettifying our cypher queries. This time he showed me the apoc.map.values procedure. It can be used to fetch all properties of a single node without explicitly writing the property keys.

MATCH (c:Character)-[:HAS_STATS]->(stats)
RETURN c.name as character,
apoc.coll.avg(apoc.map.values(stats, keys(stats))) as average_stats
ORDER BY average_stats DESC
LIMIT 10

Results

It seems many characters have their stats maxed out. I am not sure exactly how this data collection process works, but I found a fascinating heroine by the name of Squirrel Girl that could probably kick Iron Man’s ass with one hand while making sourdough bread with the other. Or polish her nails, not exactly sure what type of girl she is. The only thing certain is that she is a badass.

k-Nearest Neighbours algorithm

The k-Nearest Neighbour is one of the more standard graph algorithms and was already implemented in the Graph Data Science library before in the form of Cosine, Euclidian, and Pearson similarity algorithms. Those were basic implementation where the algorithms compared a given vector for all node pairs in the network. Because comparing all node pairs does not scale well, another implementation of the kNN algorithm was added to the library. It is based on the Efficient k-nearest neighbor graph construction for generic similarity measures article. Instead of comparing every node pair, the algorithm selects possible neighbors based on the assumption that the neighbors-of-neighbors of a node are most likely already the nearest one. The algorithm scales quasi-linear with respect to the node count instead of being quadratic. The implementation uses the Cosine similarity to compare two vectors.

First, we need to create a vector (array of numbers) that will be compared between the pairs of heroes. We will use the characters’ stats as well as their ability to fly to populate the vector. Because all stats have the same range between zero and seven, there is no need for normalization. We only need to encode the flight feature to span between zero and seven as well. Those characters that can fly will have the value of flight feature seven, while those who can’t fly will have the value zero.

MATCH (c:Character)-[:HAS_STATS]->(s)
WITH c, [s.durability, s.energy, s.fighting_skills,
s.intelligence, s.speed, s.strength,
CASE WHEN c.flight = 'true' THEN 7 ELSE 0 END] as stats_vector
SET c.stats_vector = stats_vector

We will also tag the characters that have the stats vector with a second label. This way, we can easily filter heroes with a stats vector in our native projection of the named graph.

MATCH (c:Character)
WHERE exists (c.stats_vector)
SET c:CharacterStats

Now that everything is ready, we can go ahead and load our named graph. We will project all nodes with the CharacterStats label and their stats_vector properties in a named graph. If you need a quick refresher or introduction to how the GDS library works, I would suggest taking the Introduction to Graph Algorithms course.

CALL gds.graph.create('marvel', 'CharacterStats',
'*', {nodeProperties:'stats_vector'})

Now, we can go ahead and infer the similarity network with the new kNN algorithm. We will use the mutate mode of the algorithm. The mutate mode stores the results back to the projected graph instead of the Neo4j stored graph. This way, we can use the kNN algorithm results as the input for the community detection algorithms later in the workflow. The kNN algorithm has some parameters we can use to fine-tune the results:

  • topK: The number of neighbors to find for each node. The K-nearest neighbors are returned.
  • sampleRate: Sample rate to limit the number of comparisons per node.
  • deltaThreshold: Value as a percentage to determine when to stop early. If fewer updates than the configured value happen, the algorithm stops.
  • randomJoins: Between every iteration, how many attempts are being made to connect new node neighbors based on random selection.

We will define the topK value of 15 and sampleRate of 0.8, and leave the other parameters at default values.

CALL gds.beta.knn.mutate('marvel', {nodeWeightProperty:'stats_vector', sampleRate:0.8, topK:15, mutateProperty:'score', mutateRelationshipType:'SIMILAR'})

Louvain Modularity algorithm

The similarity network is inferred and stored in the named graph. We can examine the community structure of this new similarity network with the Louvain Modularity algorithm. As the similarity scores of relationships are available as their properties, we will use the weighted variant of the Louvain Modularity algorithm. Using the relationshipWeightProperty parameter, we let the algorithm know it should consider the relationships’ weight when calculating the network’s community structure. This time we will use the write mode of the algorithm to store the results back to the Neo4j stored graph.

CALL gds.louvain.write('marvel',
{relationshipTypes:['SIMILAR'],
relationshipWeightProperty:'score',
writeProperty:'louvain'});

We can examine the community structure results with the following cypher query.

MATCH (c:Character)-[:HAS_STATS]->(stats)
RETURN c.louvain as community, count(*) as members,
avg(stats.fighting_skills) as fighting_skills,
avg(stats.durability) as durability,
avg(stats.energy) as energy,
avg(stats.intelligence) as intelligence,
avg(stats.speed) as speed,
avg(stats.strength) as strength,
avg(CASE WHEN c.flight = 'true' THEN 7.0 ELSE 0.0 END) as flight

Results

It would make sense to add the standard deviation for each stat, but it wouldn’t be presentable for a blog post. The community with an id 68 has the most powerful members. The average for most stats is 6.5, which means that they are almost entirely maxed out. The average value of flight at 2 indicates that around 30% (2/7) of the members can fly. The largest community with 106 members has their stats averaged between 2 and 3, which would indicate that they might be support characters with lesser abilities. The characters with stronger abilities are usually the lead characters.

We can also visualize the community structure of the inferred similarity network with Neo4j Bloom. We color the nodes based on their community membership. First, we have to store the mutated relationships back to Neo4j with the gds.graph.writeRelationship procedure in order to visualize them with Neo4j Bloom.

The pink community contains the strongest heroes. We can observe that they have their own cluster in the top right and also span a little towards the center.

Label Propagation algorithm

Label Propagation algorithm can also be used to determine the community structure of a network. We will apply it to the inferred similarity network and compare the results with the Louvain Modularity algorithm results.

CALL gds.labelPropagation.write('marvel',
{relationshipTypes:['SIMILAR'],
relationshipWeightProperty:'score',
writeProperty:'labelPropagation'})

We investigate the results of the Label Propagation algorithm.

MATCH (c:Character)-[:HAS_STATS]->(stats)
RETURN c.labelPropagation as community, count(*) as members,
avg(stats.fighting_skills) as fighting_skills,
avg(stats.durability) as durability,
avg(stats.energy) as energy,
avg(stats.intelligence) as intelligence,
avg(stats.speed) as speed,
avg(stats.strength) as strength,
avg(CASE WHEN c.flight = 'true' THEN 7.0 ELSE 0.0 END) as flight

Results

We can notice that the Label Propagation algorithm found twice as many communities as the Louvain Modularity algorithm. Some of them are relatively tiny. For example, the community with an id 693 has only three members, and all their average stats are at 1.0 value. They are the heroes that go by the name of Maggott, Deathbird, and Slayback. Funky names. The most powerful community has an id of 137 and only 23 members. Remember, the most powerful community found by the Louvain Modularity algorithm had 46 members and a slightly lower value of average stats.

We can again visualize the results of the Label Propagation community structure with Neo4j Bloom.

Again, the members of the pink community are the most powerful. We can notice that this time the top right cluster does not span into the center at all. We can zoom in on that community to inspect its members visually.

It seems that all things are in balance as they should be because the Squirrel Girl is right in the center of the most powerful community.

Conclusion

I hope you have learned some tricks on performing network analysis in Neo4j with the help of APOC and GDS libraries. There are still many things we could do with this graph, so expect a new post shortly. Until then, try out Neo4j and play around with the pre-release of the 1.4 version of the GDS library. If you need help getting started, I recommend looking at the Neo4j Graph Academy.

As always, the code is available on GitHub.

Edit: I have removed the faulty historical data about the characters due to the flaw in the data collection process.

--

--

Data explorer. Turn everything into a graph. Author of Graph algorithms for Data Science at Manning publication. http://mng.bz/GGVN