Graph Analytics — Introduction and Concepts of Centrality

Published in

Towards Data Science

12 min read · Aug 13, 2019

The advent of social networks, big data and e-commerce has re-emphasized the importance of analyzing a unique type of data structure: one that depicts relationships among its entities, also known as a graph. It is worth briefly introducing the concept of a “graph” before venturing into graph analytics.

Let’s start by looking at a sample graph of friends presented below. I will be using the same graph in some of the following sections to further explain the concepts of graph analytics.

Figure 1 (this graph was designed in Gephi)

The above picture depicts a graph of friends in which each node (or entity), such as A or B, represents a particular individual, and a link (also known as an edge) between any two individuals represents a relationship (“friendship” in this case) between them.

Generalizing from the above example:

Graphs can be defined as a representation of relationships between “entities” or “things”, where these “entities” are the “nodes” (also known as “vertices”) of the graph and the relationships between them are represented by the “links” (also known as “edges”) of the graph. The study of graphs is known as “Graph Theory”.

Further, simply by looking at the graph, one can see that A and B have a common friend C, who is not friends with D. The branch of data science that deals with extracting information from graphs by performing analysis on them is known as “Graph Analytics”.
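As a minimal sketch of how such a graph might be held in code, the Python snippet below stores an undirected friendship graph as a dictionary of neighbor sets. The edge list is a small hypothetical subset chosen only to mirror the observation above (A and B share the friend C, and C is not friends with D); it is not the full graph from Figure 1, and the helper names are illustrative.

```python
from collections import defaultdict

# Hypothetical friendships, chosen only to mirror the observation in the text.
edges = [("A", "C"), ("B", "C"), ("A", "D")]

def build_graph(edge_list):
    """Return an undirected graph as {node: set of neighbors}."""
    graph = defaultdict(set)
    for u, v in edge_list:
        graph[u].add(v)
        graph[v].add(u)  # friendship is mutual, so store both directions
    return dict(graph)

def common_friends(graph, a, b):
    """Nodes that are linked to both a and b."""
    return graph[a] & graph[b]

friends = build_graph(edges)
print(common_friends(friends, "A", "B"))  # {'C'}
print("D" in friends["C"])                # False
```

Storing neighbors as sets keeps the "common friend" question down to a single set intersection.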

Moving on from the introduction, let’s venture into the world of graph analytics by exploring some fundamental concepts. In this article we will focus on the centrality-based concepts used in graph analytics. Don’t fret if that statement did not make sense yet, as I am going to cover everything from scratch as we move forward.

Centrality

In graph analytics, centrality is a key concept for identifying important nodes in a graph. It measures the importance (or “centrality”, as in how “central” a node is) of the various nodes in a graph. Each node can be important from a different angle, depending on how “importance” is defined. Centrality therefore comes in different flavors; each flavor, or metric, defines the importance of a node from a different perspective and provides different analytical information about the graph and its nodes.

Degree Centrality

The first flavor of centrality we are going to discuss is “Degree Centrality”. To understand it, let’s first explore the concept of the degree of a node in a graph.

In an undirected graph, the degree of a node is defined as the number of direct connections the node has with other nodes. Looking at the graph below:

In a directed graph (where each edge has a direction), the degree of a node is further divided into in-degree and out-degree. In-degree refers to the number of edges incident on the node, and out-degree refers to the number of edges going from the node to other nodes. Let’s look at a sample Twitter graph below, where nodes are individuals and edges with arrows indicate the “Follows” relationship:

We can see that nodes E, C, D and B each have an outgoing edge towards node A and hence follow node A. Thus, the in-degree of node A is 4, as it has 4 edges incident on it.

We can also see that node B follows both node D and node A, hence its out-degree is 2.

The Degree Centrality metric measures the importance of a node in a graph by its degree, i.e. the higher the degree of a node, the more important it is in the graph.
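The in-degree and out-degree counts above can be sketched in a few lines of Python. The edge list below contains only the “Follows” relationships named in the text (E, C, D and B follow A, and B follows D); the original figure may contain further edges, so treat this as an illustration rather than a full copy of the Twitter graph.

```python
from collections import Counter

# "Follows" edges taken from the description above: (follower, followed).
follows = [("E", "A"), ("C", "A"), ("D", "A"), ("B", "A"), ("B", "D")]

out_degree = Counter(src for src, _ in follows)  # edges leaving a node
in_degree = Counter(dst for _, dst in follows)   # edges arriving at a node

print(in_degree["A"])   # 4
print(out_degree["B"])  # 2
```

A `Counter` returns 0 for nodes it has never seen, so a node that follows nobody simply reports an out-degree of 0.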

Re-examining the aforementioned friends graph (Figure 1) below:

The degree centrality of node A is 7, node G is 5, node C is 4 and node L is 1.

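Mathematically, the Degree Centrality D(i) of a node i can be written as D(i) = Σ_j m(i, j), where the sum runs over all other nodes j, and m(i, j) = 1 if there is an edge between nodes i and j and 0 otherwise.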

Now let’s briefly discuss a sample application of degree centrality to the graph of friends shown above. Nodes A and G have a high degree centrality (7 and 5 respectively) and would be ideal candidates if we wanted to propagate information quickly to a large part of the network, compared with node L, which has a degree centrality of only 1. This information is very useful for creating a marketing or influencing strategy when a new product or idea has to be introduced in the network. Marketers can focus on nodes with high degree centrality, such as A and G, to market their products or ideas and so ensure higher reachability among nodes.

Similarly, keeping in mind the sample Twitter graph (Figure 2), if we examine a real social network such as Twitter, with millions of nodes, and calculate the in-degree centrality of its nodes, the nodes with high in-degree centrality (such as Kanye West, Lady Gaga and other celebrities) will be those with a huge number of followers, and they could be ideal candidates to influence the public or promote commercial products. Now you know why celebrities and other popular people are paid on social networks such as Instagram and Twitter to say certain things or promote certain products: companies are aware that these individuals have a very high in-degree and can influence or reach a large number of people quickly.

The usefulness of analyzing the importance of nodes based on degree centrality is vast and depends on the nature of the graph or network under consideration.
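To make the seeding idea concrete, here is a minimal Python sketch that counts degrees from an edge list and picks the best-connected nodes. The edges are hypothetical and much smaller than the friends graph in Figure 1 (whose full edge list is not reproduced here), and the function names are my own.

```python
# Hypothetical undirected friendship edges (not the graph from Figure 1).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "G"),
         ("G", "H"), ("G", "J"), ("C", "L")]

def degree_centrality(edge_list):
    """Count, for every node, how many edges touch it."""
    degree = {}
    for u, v in edge_list:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree

def top_k_by_degree(edge_list, k):
    """Return the k nodes with the highest degree, ties broken by name."""
    degree = degree_centrality(edge_list)
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    return ranked[:k]

print(top_k_by_degree(edges, 2))  # ['A', 'G']
```

In this toy graph a marketer with budget for two seed nodes would pick A and G, while a leaf such as L (degree 1) would be a poor choice.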

Closeness Centrality

The second flavor we are going to discuss is “Closeness Centrality”. To understand it, let’s first understand the concept of the “geodesic distance” between two nodes in a graph.

The geodesic distance d between two nodes a and b is defined as the number of edges on the shortest path (the path with the minimum number of edges) between them.

Let’s look at the graph below:

Let’s examine the geodesic distance between A and F to further clarify the concept. We can reach F from A by going through B and E, or by going through D. The shortest path between A and F is the one through D (2 edges), hence the geodesic distance d(A, F) is 2.

Mathematically, Geodesic distance can be defined as below:

d(a , b) = No. of edges between a and b on the shortest path from a to b, if a path exists from a to b

d(a , b) = 0, if a = b

d(a , b) = ∞ (Infinity) , if no path exists from a to b
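The three cases above map directly onto a breadth-first search. The sketch below is one possible implementation in Python; the graph contains only the edges named in the text (the paths A-B-E-F and A-D-F), plus a hypothetical isolated node Z that I added purely to exercise the “no path” case.

```python
import math
from collections import deque

# Only the edges named in the text are included (A-B-E-F and A-D-F);
# the original figure may contain more.
graph = {
    "A": {"B", "D"},
    "B": {"A", "E"},
    "D": {"A", "F"},
    "E": {"B", "F"},
    "F": {"D", "E"},
    "Z": set(),  # hypothetical isolated node, added to show the "no path" case
}

def geodesic_distance(graph, source, target):
    """Number of edges on a shortest path, or infinity if none exists."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return math.inf

print(geodesic_distance(graph, "A", "F"))  # 2
print(geodesic_distance(graph, "A", "A"))  # 0
print(geodesic_distance(graph, "A", "Z"))  # inf
```

Breadth-first search visits nodes in order of increasing distance, so the first time the target is reached is guaranteed to be along a shortest path.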

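Closeness centrality measures the importance of a node in a graph by how close it is to all other nodes in the graph. For a node, it is defined here as the sum of the geodesic distances between that node and all other nodes in the network, so a lower value means the node is closer to the rest of the network. (Many references and libraries report the reciprocal of this sum instead, so that a higher value means a more central node.)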

Again, looking at the previously introduced graph of friends in Figure 1 below, we can see that the Closeness centrality of node A is 17 while that of node L is 33.

Mathematically, the Closeness Centrality C(i) of a node i in a graph can be written as C(i) = Σ_j d(i, j), where d(i, j) is the geodesic distance between node i and node j and the sum runs over all other nodes j in the graph.

Let’s briefly describe a sample application of Closeness Centrality by examining the friends graph above (Figure 5). Suppose that each edge in the friends graph had a weight (attribute) of 1 minute associated with it, i.e. it takes 1 minute to transmit information from a node to a neighboring node, such as A to B or B to C. Now suppose we want to send a specific piece of information to each node of the graph (the information is different for each node), and we need to select a node that can transmit it quickly to all the nodes in the network.

To solve this problem, we can calculate the Closeness Centrality measure for all the nodes in the network. As calculated above, if we select node A, the information can reach all the nodes by traversing 17 edges (i.e. starting at A, the information can be transmitted to all nodes in 17 minutes in the worst case, assuming sequential sends from A), whereas from node L it would take 33 minutes. This clearly shows the difference in importance between nodes A and L in terms of the Closeness Centrality measure.
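The selection procedure just described can be sketched as follows. This is a minimal Python example on a hypothetical five-node graph (not the friends graph, whose edge list is not reproduced here); it uses the sum-of-distances definition from this article, so the best sender is the node with the smallest score. It also assumes the graph is connected, since unreachable nodes would otherwise have to count as infinitely far away.

```python
from collections import deque

# Hypothetical undirected graph: a path A - B - C - D with E hanging off B.
graph = {
    "A": {"B"},
    "B": {"A", "C", "E"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"B"},
}

def distances_from(graph, source):
    """Breadth-first search: geodesic distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closeness_sum(graph, node):
    """Closeness as defined in this article: sum of distances to all other nodes."""
    return sum(distances_from(graph, node).values())

scores = {node: closeness_sum(graph, node) for node in graph}
best = min(scores, key=scores.get)  # smallest total distance = best sender

print(scores)  # {'A': 8, 'B': 5, 'C': 6, 'D': 9, 'E': 8}
print(best)    # B
```

With one minute per edge, node B would need 5 minutes of sequential sends to reach everyone, against 9 minutes for the peripheral node D.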

Betweenness Centrality

The third flavor of centrality we are going to discuss is known as “Betweenness Centrality” (BC). This metric measures the importance of a node in a network by how often it lies on the shortest paths between all pairs of nodes in the graph. To elaborate on the metric, let’s again look at our friends graph below:

Mathematically, the Betweenness Centrality B(i) of a node i in a graph is, in its standard form, B(i) = Σ g_ab(i) / g_ab, summed over all pairs of nodes a and b other than i, where g_ab is the number of shortest paths between a and b and g_ab(i) is the number of those paths that pass through i.

Looking at node A, we can observe that it lies on the shortest path between the following pairs of nodes: (D,M), (D,E), (G,C), (G,B), (G,F), (G,I), (K,C), (D,C) etc., and thus has the highest BC of all the nodes in the graph. We can also observe that nodes G and C have high Betweenness Centralities (BCs) compared to the other nodes in the graph (except A).
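For readers who want to see the counting spelled out, below is a small brute-force sketch in Python. It follows the standard definition of betweenness (for every pair of nodes, the fraction of their shortest paths that pass through a given node) rather than any particular library’s implementation, and it runs on a hypothetical graph of two triangles joined through node C, not on the friends graph.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph: two triangles joined through node C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D", "E"},
    "D": {"C", "E"},
    "E": {"C", "D"},
}

def bfs_counts(graph, source):
    """Return (distance, number of shortest paths) from source to each node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                paths[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                paths[neighbor] += paths[node]
    return dist, paths

def betweenness(graph):
    """Fraction of shortest paths through each node, summed over node pairs."""
    info = {node: bfs_counts(graph, node) for node in graph}
    score = {node: 0.0 for node in graph}
    for s, t in combinations(graph, 2):
        dist_s, paths_s = info[s]
        if t not in dist_s:
            continue  # no path between s and t
        dist_t, paths_t = info[t]
        for v in graph:
            if v in (s, t):
                continue
            # v is on a shortest s-t path iff the distances add up exactly
            if v in dist_s and v in dist_t and dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score

print(betweenness(graph))
# {'A': 0.0, 'B': 0.0, 'C': 4.0, 'D': 0.0, 'E': 0.0}
```

Node C scores 4 because it sits on the only shortest path for each of the four cross-triangle pairs (A,D), (A,E), (B,D) and (B,E). This pairwise approach is easy to follow but slow on large graphs, where faster algorithms are normally used.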

As discussed, if we look at our friends graph above (Figure 6), node A has a very high BC. Removing it would cause a huge disruption in the network, as there would be no way for nodes {J,H,G,M,K,E,D} to communicate with nodes {F,B,C,I,L} and vice versa, and we would end up with two isolated subgraphs. This shows the importance of nodes with high BCs.

A sample application of BC is finding bridge nodes in graphs. Nodes with a high BC lie on the shortest paths between a large number of pairs of nodes and hence are crucial to communication in a graph, as they connect a large number of nodes with each other. Removing these nodes from the network would cause a huge disruption in its linkage or communication.

A real-life use case of the above application is analyzing global terrorism networks. For example, if we have a network of terrorists, terrorist groups and other related individuals represented as nodes of a graph, we can calculate the BC of each node and identify the nodes with high BCs. These nodes (terrorists, in this case) will be bridge nodes in the network. This information is very useful for defense agencies, since targeting such nodes can be highly effective in disrupting the whole network. Another use case of this metric is detecting and monitoring possible bottlenecks or hot spots in computer networks or flow networks.
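The effect of removing such a node can be checked by counting connected components before and after the removal. The Python sketch below does this on a small hypothetical graph in which node A is the only link between two groups; it borrows a few node letters from the friends graph, but the edges are my own assumption and not those of Figure 6.

```python
# Hypothetical undirected graph in which node A is the only link
# between the group {D, E, G} and the group {B, C, F}.
graph = {
    "A": {"G", "D", "B", "C"},
    "G": {"A", "D", "E"},
    "D": {"A", "G"},
    "E": {"G"},
    "B": {"A", "C", "F"},
    "C": {"A", "B"},
    "F": {"B"},
}

def connected_components(graph, removed=frozenset()):
    """Group nodes into connected components, ignoring any removed nodes."""
    seen = set(removed)
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components

print(len(connected_components(graph)))                 # 1
print(len(connected_components(graph, removed={"A"})))  # 2
```

With A in place the graph is a single component; without it the two groups can no longer reach each other.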

Eigen Vector Centrality

The last flavor of centrality that we will be exploring is known as Eigen Vector Centrality. This metric measures the importance of a node in a graph as a function of the importance of its neighbors. If a node is connected to highly important nodes, it will have a higher Eigen Vector Centrality score than a node that is connected to less important nodes.

Let’s look at the graph given below to further explain the concept:

The adjacency matrix A of the above graph will be as shown below:

Let’s assume that in the above graph, the importance of each node is measured by its degree, such that the higher the degree of a node, the more important it is in the graph. Degrees of various nodes are shown as below:

The above can also be represented as a matrix vector V as shown below:

Now, mathematically the Eigen Vector Centrality is calculated as below:

Figure 11 showing Eigen Vector Centrality Calculation — 1st Iteration

The resultant 1-D vector in the above equation gives the Eigen Vector Centrality (EVC) score for each of the nodes in the graph after one iteration. The effect of this first iteration of multiplication can be visualized as shown below:

Figure 12 showing EVC scores of each node after 1st iteration of multiplication

As you can see above, nodes A and B both have a high score of 8, since both of them are connected to multiple nodes with high degrees (importance), while node E has a score of 3, since it is only connected to a single node of degree 3. It is also important to observe that the EVC score for each node in the resultant vector is nothing but the sum of the degrees of its neighboring nodes. For example: EVC score for node A = degree(B) + degree(C) + degree(D) = 8.
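The first multiplication can be reproduced with a few lines of plain Python. Note that the edge list below is an assumption: it is one five-node graph that is consistent with the scores quoted above (A = 8, B = 8, E = 3), but the graph in the original figure may be wired differently.

```python
# Hypothetical edge list, chosen so that the first-iteration scores match the
# values quoted in the text (A = 8, B = 8, E = 3); the original figure may differ.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "E")]

index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Adjacency matrix: entry [i][j] is 1 when nodes i and j share an edge.
adjacency = [[0] * n for _ in range(n)]
for u, v in edges:
    adjacency[index[u]][index[v]] = 1
    adjacency[index[v]][index[u]] = 1

def multiply(matrix, vector):
    """Matrix-vector product using plain Python lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

degrees = [sum(row) for row in adjacency]  # starting vector V
first = multiply(adjacency, degrees)       # scores after one multiplication

print(dict(zip(nodes, degrees)))  # {'A': 3, 'B': 3, 'C': 2, 'D': 3, 'E': 1}
print(dict(zip(nodes, first)))    # {'A': 8, 'B': 8, 'C': 6, 'D': 7, 'E': 3}
```

Each row of the adjacency matrix picks out a node’s neighbors, so the product simply adds up the neighbors’ degrees, which is exactly the “sum of degrees of its neighboring nodes” described above.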

Now if the resultant EVC vector that we got above in the equation (Figure 11) is again multiplied by the adjacency matrix A, we will get bigger values for EVC score for each node in the graph, as shown below:

Figure 13 showing Eigen Vector Centrality Calculation — 2nd Iteration of multiplication

The effect of multiplying the resultant vector again (2nd iteration of multiplication) with the adjacency matrix can be visualized, as shown below:

Figure 14 showing EVC scores of each node after 2nd iteration of multiplication

Now, why did we multiply the resultant vector again by the adjacency matrix?

In short, multiplying the resultant vector again by the adjacency matrix of the graph helps the EVC score spread out across the graph, giving each node a more global score rather than a localized one. After the first iteration of multiplication, each node’s EVC score is a function of only its direct (1st degree) neighbors, so it is a localized score that might not be accurate at a global level in the graph.

Elaborating the above, if we visualize the above operations, we can observe the following:

After the first iteration of multiplication, each node gets its EVC score from its direct (1st degree) neighbors.
In the second iteration, when we multiply the resultant vector again by the adjacency matrix, each node again gets its EVC score from its direct neighbors. The difference is that this time the scores of those direct neighbors have already been influenced by their own direct (1st degree) neighbors in the first iteration, which makes the EVC score of any node a function of its 2nd degree neighbors as well.
In subsequent iterations of multiplication, the EVC scores keep getting updated by the EVC scores of neighbors of farther degree (3rd, 4th and so on).

Repeated multiplication eventually makes the EVC score of every node a function of several degrees of its neighboring nodes, thereby providing a globally accurate EVC score for each node. Usually the process of multiplying the EVC vector by the adjacency matrix is repeated until the EVC values of the nodes reach an equilibrium or stop showing appreciable change. Because the raw values keep growing with every multiplication, in practice the vector is rescaled after each step and it is the relative scores that settle.
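The repeat-until-stable loop can be sketched as below. This is a minimal Python illustration, not a reference implementation: the five-node edge list is hypothetical, the tolerance and iteration cap are arbitrary choices of mine, and the vector is rescaled to unit length after every step so that the values stay bounded and only the relative scores matter.

```python
import math

# Hypothetical five-node undirected graph; the edges are an assumption,
# not taken from the original figure.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "E")]
neighbors = {name: set() for name in nodes}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

def eigenvector_centrality(neighbors, tolerance=1e-9, max_iterations=1000):
    """Repeat the 'sum of neighbor scores' update until the scores settle."""
    # Start from the degrees, as in the walkthrough above.
    scores = {node: float(len(adj)) for node, adj in neighbors.items()}
    for _ in range(max_iterations):
        updated = {node: sum(scores[m] for m in adj)
                   for node, adj in neighbors.items()}
        # Rescale to unit length so the values stay bounded between iterations.
        norm = math.sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        change = max(abs(updated[node] - scores[node]) for node in neighbors)
        scores = updated
        if change < tolerance:
            break
    return scores

scores = eigenvector_centrality(neighbors)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

On this toy graph A and B end up tied at the top (they have the same neighbors apart from each other), followed by D, C and E. The loop is only guaranteed to settle for connected graphs that are not bipartite, which holds here because of the A-B-C triangle.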

The applications of Eigen Vector Centrality are vast and deserve a separate article. One sample application of EVC is PageRank, the algorithm used by Google and many other companies to rank web pages on the internet by relevance. PageRank is a variant of EVC. Web pages on the World Wide Web have links that point to and from other web pages. You can think of each web page as a node in a graph and each outgoing or incoming link as a directed edge leading to or from another web page, together making up the whole World Wide Web graph. The graph of web pages undergoes several iterations of EVC calculation so as to calculate globally accurate relevance rankings for each web page. The web pages with high EVC scores can then be targeted for marketing and other commercial purposes.

The field of graph analytics is vast and has immense practical applications. The scope of this article was to cover the fundamentals of centrality, and hopefully it gives the reader an insight into the fascinating world of graph analytics.

Below is a list of various Graph Analytics libraries and software that can be used for Graph Analytics:

Gephi (https://gephi.org/)
Cytoscape (https://github.com/cytoscape/cytoscape.js)
Neo4j (https://neo4j.com)
GraphAnalyticsLib (https://github.com/jb123/GraphAnalyticsLib)

Written by Jatin Bhasin