The worldโ€™s leading publication for data science, AI, and ML professionals.

Degrees of Separation among footballers

Constructing a football graph and analysing various centrality measures and average path distance among footballers

Photo by Md Mahdi on Unsplash
Photo by Md Mahdi on Unsplash

Why consider degrees of separation?

We are all connected to each other by six or fewer acquaintances – Frigyes Karinthy

What makes Graph Theory beautiful is the fact that it can be applied to almost any domain. According to studies over the years in any social network, any two nodes are connected by fewer than six hops. Degrees of separation are the crux behind social networks, LinkedIn for instance suggests names typed into the search box based on how far away another user is from you. A connection’s connection is 2 degrees away from you while that person’s connection would be 3 degrees away and so on. This greatly improves the search quality since many users with the same names can be filtered based on proximity. Moreover, for research purposes degrees of separation are always analysed to glean information about social networks. Over the years degrees of separation amongst users have been shrinking which shows that more and more users find connections on social networking sites like Facebook and Instagram.


Applying the thought to Footballers!

The Football community is a network too. To construct this graph I thought of representing footballers by nodes and establishing an edge between two footballers if they play for the same club or the same national team.

The dataset and implementation in Python

For those who just want to see the entire code here it is on my Github.

I have used the Fifa 21 dataset from Kaggle for this problem. Here is how we would construct the graph by checking for common clubs and nationalities. I have used the set operation since performing intersections is very fast.

Here is how we merge the club and nationality into a set and then later on construct a graph. Please note the graph is an adjacency list representation that means the nodes each have a list of edges they are connected to. I have also created two dictionaries to map names to ids and vice versa.

Next let us create a function to find the distance between any two nodes. I have used the BFS algorithm and modified it. For those who want to know more about it here is the link. I will describe what I have done in brief.

  • Create a boolean array to check visited nodes of size number_of_nodes and also create a distance list of the same size ( to store distances of a source node to all other nodes)
  • Create a queue and add the source node after which mark it as visited. Now while the queue is not empty fetch neighbours for each node that are not visited
  • For each neighbour fetch its current distance from the source from the distance array (initially all are 0) and add 1 to distance array at the neighbours index. Add this neighbour to the queue & mark it as visited. Do this for all neighbours.
  • Now check the queue and fetch a node (let’s call it temp) that was added and fetch its neighbours. Now for each of these new neighbours fetch the distance at the position of temp ( which will be 1 after the first iteration) and add 1 to it and store it in the distance array at the new neighbours’ index.
  • Keep doing this till the queue is empty and return the distance[destination] which is the distance from source to destination.

Here is the code for it


Let’s have fun and see the distance between two famous players Messi and Erling Haaland!

As we can see the distance is three hops. Wondering what those are? The code also contains a helper function to print paths I found from this site. I have modified it a bit to print only 3 paths since our graph has ~35,000 edges.

Now you see it don’t you! Messi and Ter Stegen are teammates for Barcelona & Hummels is Ter Stegen’s teammate in the German National Team, and finally Haaland is Hummels’ teammate at Dortmund. This is how we had the shortest distance between them as 3.


The climax! Finding the average shortest distance between all footballers!

Although I could have extended the modified BFS method I will use networkx here simply because they approximate this distance very well and calculate the result very fast. However we need to construct an Adjacency Graph to use networkx and I have shown that in the complete code. I will just show you all the final result for brevity.

As we can see the distance comes up to 2.5 which shows how close the football network really is and aligns well with the six degrees of separation hypothesis. Plus we calculated the distance in 4.4 seconds which is impressive for a graph with 35,000 edges. Using a GPU accelerator would further bring this down to 3 seconds whereas the BFS for each footballer and then averaging it traditionally took about 9 seconds.


Bonus! – Graph Centrality Measures!

Those of you who are interested in social information networks would like this section, the rest can skip to the Outro.

  1. Degree Centrality

Degree Centrality

The degree centrality for a node v is the fraction of nodes it is connected to, Easiest to compute, the higher the degree the more central the node usually is.

Caveats – We could be easily mislead by a certain cluster which is far away from the core and hence isn’t influential but has a high degree centrality

As we can see Thiago of Liverpool has the highest degree Centrality.

2. Betweenness Centrality

where V is the set of nodes, ๐œŽ(๐‘ ,๐‘ก) is the number of shortest (๐‘ ,๐‘ก) paths, and ๐œŽ(๐‘ ,๐‘ก|๐‘ฃ) is the number of those paths passing through some node ๐‘ฃ other than ๐‘ ,๐‘ก .

Betweenness centrality finds wide application in network theory; it represents the degree to which nodes stand between each other. For example, in a network, a node with higher betweenness centrality would have more control over the network, because more information will pass through that node or player in our case.

An interesting result, Palacios springs up.

  1. Closeness Centrality

Where n is the number of nodes, d(v,u) is the shortest distance between two nodes.

Closeness can be regarded as a measure of how fast it will take to spread information to all other nodes. If a node has strong closeness centrality, it is in a position, with its relationships, to spread information quickly. These people can be important influencers of the network.

Rodri of Manchester City & Spain seems to be an important influencer!

rajbsangani/six-degrees-medium – Jovian


Outro

This quick experiment shows how we can form graphs wherever we can establish logical relationships and analyse them using popular Graph techniques. The degrees of separation in networks in general are shrinking and will continue to do so!

Please let me know what you thought of the article and would love to discuss about anything related to it. Please leave some claps if you enjoyed it! Check out my github for some other projects. You can contact me here. Thank you for your time!


Related Articles