Computing Assortativity Coefficients on a Social Network Dataset

Assortativity helps analyse the patterns of connections in networks. Let’s use it to check whether people tend to connect to similar people.

Gabriele Albini
Towards Data Science


Photo by fabio on Unsplash

In this article we will use some Facebook data to explore the concept of network assortativity (also called homophily), which we define as the tendency of nodes to connect to similar nodes.

Networks, or graphs, are data representations consisting of nodes (vertices) and edges (links); in this article we will consider only undirected and unweighted edges. We will first present the dataset we intend to use, going through the data loading and wrangling steps and presenting the network.

Next, we will introduce the concept of network assortativity. The main theoretical framework is the article by Newman et al., 2003 [1], which defines and explains the concept.

We will then apply this metric to the dataset, in order to check whether, as stated in the article, people tend to connect to others who are like them.

1. The data

The data [2] used in the article can be downloaded from this page. We will need two sets of files:

  1. the file “facebook_combined.txt.gz” contains the edges between 4039 nodes from 10 networks. Edges are represented in an edge list format (i.e. [0,1] means there’s an edge between node 0 and node 1).
  2. the file “facebook.tar.gz” contains several other files. We will be using only the “.feat” and “.featnames” files, which correspond to the network attributes (and their names) for all the nodes.

1.1 Loading the network

We will be using the NetworkX library in Python.

Importing the main network file is very easy:
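A minimal sketch of this step, assuming the file has been downloaded into the working directory and that each row holds two whitespace-separated node ids. The helper name load_facebook_graph is ours, not part of NetworkX:

```python
import networkx as nx


def load_facebook_graph(path="facebook_combined.txt.gz"):
    """Read a whitespace-separated edge list into an undirected graph."""
    # nodetype=int converts the node labels "0", "1", ... to integers.
    # NetworkX opens paths ending in .gz as compressed files.
    return nx.read_edgelist(path, create_using=nx.Graph(), nodetype=int)
```

Calling G = load_facebook_graph() and then G.number_of_nodes() should report the 4039 nodes mentioned above.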

Let’s take a quick look at the network. The following representation displays two important features:

  1. The node colour will vary with the node degree, i.e. the number of connections that each node has.
  2. The node size will vary depending on the node betweenness centrality, a measure that quantifies how often a node lies on the shortest paths between other nodes. In other words, removing a node with high betweenness centrality is more likely to break the network apart. The formula for this indicator is the following:

x_i = (1/n²) · Σ_st ( n^i_st / g_st )

n^i_st : the number of shortest paths from “s” to “t” passing through node “i”;
g_st : the total number of shortest paths from “s” to “t”, not necessarily passing through node “i”;
n : the total number of nodes; the normalisation term 1/n² can be omitted.

Image by author
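The plot described above could be produced along these lines. This is a sketch under our own assumptions: the helper names, the size scaling (20 + 2000 × centrality), the colour map and the spring layout are illustrative choices, not necessarily those behind the figure. NetworkX also applies its own normalisation to betweenness rather than the 1/n² factor.

```python
import networkx as nx


def node_styles(G, k=None, seed=0):
    """Return (colours, sizes), both aligned with the order of G.nodes()."""
    degree = dict(G.degree())
    # k=None gives exact betweenness; an integer k estimates it from
    # k sampled source nodes, which is much faster on large graphs.
    betweenness = nx.betweenness_centrality(G, k=k, normalized=True, seed=seed)
    colours = [degree[n] for n in G.nodes()]
    sizes = [20 + 2000 * betweenness[n] for n in G.nodes()]
    return colours, sizes


def draw_network(G, k=None):
    """Draw G with colour = degree and size = betweenness centrality."""
    import matplotlib.pyplot as plt  # only needed for plotting

    colours, sizes = node_styles(G, k=k)
    pos = nx.spring_layout(G, seed=0)
    nx.draw_networkx(
        G, pos=pos, node_color=colours, node_size=sizes,
        cmap=plt.cm.viridis, with_labels=False,
        edge_color="lightgrey", width=0.2,
    )
    plt.axis("off")
    plt.show()
```

Exact betweenness on a graph of this size can take a while, so passing a sampled k (for example k=500) may be a reasonable trade-off for a quick look.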

1.2 Adding nodes attributes

To assign attributes to nodes, we’ll need to parse the following files:

  • “.feat” files contain a matrix of 0/1 entries. Each row represents a node and each column an attribute. The entry is 1 if node “i” has attribute “j”, and 0 otherwise.
  • each row in “.featnames” files contains the name of the corresponding column in the “.feat” files. In this way, we can recover the name of each attribute.

With the nx library, we can assign attributes to nodes in the following way:
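One possible way to do the parsing, as a sketch. It rests on assumptions about the file layout that should be checked against the downloaded files: each “.featnames” row is taken to look like "<column index> <attribute>;anonymized feature <n>", and each “.feat” row like "<node id> <0/1> <0/1> ...". The helper names are hypothetical. Nodes without a value get the string "None", and if a node has several columns of the same attribute set to 1, only the last one is kept.

```python
import networkx as nx


def read_featnames(path):
    """Return one (attribute, value) pair per feature column."""
    pairs = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # Assumed layout: "<column index> <attribute>;anonymized feature <n>"
            _, full_name = line.split(" ", 1)
            attribute, _, value = full_name.rpartition(";")
            pairs.append((attribute, value))
    return pairs


def read_node_attributes(feat_path, pairs):
    """Map node id -> {attribute: value} for every column flagged with 1."""
    attributes = {}
    with open(feat_path) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            # Assumed layout: "<node id> <0/1> <0/1> ..."
            node, flags = int(fields[0]), fields[1:]
            attributes[node] = {
                attribute: value
                for (attribute, value), flag in zip(pairs, flags)
                if flag == "1"
            }
    return attributes


def add_attributes(G, feat_path, featnames_path, missing="None"):
    """Attach the parsed attributes to the nodes of G and return G."""
    pairs = read_featnames(featnames_path)
    # Node ids that are not in G are ignored by set_node_attributes
    nx.set_node_attributes(G, read_node_attributes(feat_path, pairs))
    # Give every node a value for every attribute seen in this file
    for node in G.nodes():
        for attribute, _ in pairs:
            G.nodes[node].setdefault(attribute, missing)
    return G
```

The function would be called once per pair of “.feat” and “.featnames” files extracted from the archive.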

Finally, we can get all attribute information for each node, e.g. node id “200”:
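As a small illustration of the lookup, here is a toy graph with a single node; the attribute value is made up for the example and is not taken from the dataset:

```python
import networkx as nx

G = nx.Graph()
# Hypothetical attribute value, only to illustrate the lookup
G.add_node(200, gender="anonymized feature 78")

print(G.nodes[200])            # all attributes of node 200, as a dict
print(G.nodes[200]["gender"])  # a single attribute of node 200

# The same attribute for every node that has it, as {node: value}
gender_by_node = nx.get_node_attributes(G, "gender")
```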

Image by author

2. The attribute assortativity coefficient

In the study of social networks, analysing the pattern of connections between nodes plays an important role. Assortativity helps us understand whether people tend to connect to similar or dissimilar nodes, and this property may affect the network structure: strong assortativity on a discrete attribute may, for instance, break the network into subnetworks. Suppose that birthyear is a strongly assortative attribute: we can then expect subnetworks, each connecting people of similar age.

Newman et al. (2003) define in their article a way to measure assortativity in a network, the assortativity coefficient (also available in the NetworkX library: Link1, Link2):

  • Let’s consider an attribute A of a node. The attribute can take values: [A1, A2, …]
  • We can build a mixing matrix M where the entry e[i][j] represents the fraction of the total edges in the network (E) which connect nodes having attribute A = A[i] to nodes having attribute A = A[j]
Image by author
  • We then build the following quantities:
Image by author
  • The assortativity coefficient can be computed with the formulas below (the second formula uses matrix notation, with tr() denoting the trace; an example follows in section 2.1):
Image by author
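Written out in LaTeX, following the definitions in Newman's paper [1], the quantities and the coefficient read as below. The symbol names (a_i, b_j, r) are those of the paper and are assumed to match the figures; the entries e_ij sum to 1, and ‖e²‖ denotes the sum of all the elements of the matrix e².

```latex
a_i = \sum_j e_{ij}, \qquad b_j = \sum_i e_{ij}

r = \frac{\sum_i e_{ii} - \sum_i a_i b_i}{1 - \sum_i a_i b_i}
  = \frac{\operatorname{Tr}\,\mathbf{e} - \lVert \mathbf{e}^2 \rVert}
         {1 - \lVert \mathbf{e}^2 \rVert}
```

For an undirected network the mixing matrix is symmetric, so a_i = b_i.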

2.1 Computing the assortativity coefficient

Let’s compute the coefficient for one attribute of the available dataset: “gender”. This attribute can take 3 values in our dataset: “anonymized feature 77”, “anonymized feature 78”, and “None”.

First, the mixing matrix is obtained in this way:
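A sketch of this step with NetworkX's attribute_mixing_matrix. It assumes that every node carries the attribute as a string; the helper name and the explicit mapping, which only fixes the row and column order, are our additions:

```python
import networkx as nx


def mixing_matrix_with_labels(G, attribute="gender", normalized=True):
    """Return (labels, matrix); row/column k of the matrix refers to labels[k]."""
    labels = sorted(set(nx.get_node_attributes(G, attribute).values()))
    mapping = {label: index for index, label in enumerate(labels)}
    matrix = nx.attribute_mixing_matrix(
        G, attribute, mapping=mapping, normalized=normalized
    )
    return labels, matrix
```

On an undirected graph each edge is counted in both directions, so the matrix is symmetric and, with normalized=True, its entries sum to 1.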

Image by author

(Note: by setting normalized=False, the matrix would contain the actual edge counts instead of fractions.)

We will now compute the coefficient using the matrix notation formula:

  • The matrix trace (sum of the main diagonal elements) represents the share of connections where both connected nodes have the same gender. The trace in this case is: 53.65%
  • We then need the sum of all the elements of the squared mixing matrix (the matrix multiplied by itself)

The final result is: 0.0841
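The two steps can be sketched in a few lines of NumPy. The function name is ours, and the input is assumed to be a normalized mixing matrix such as the one returned by nx.attribute_mixing_matrix:

```python
import networkx as nx
import numpy as np


def assortativity_from_matrix(e):
    """Assortativity coefficient from a normalized mixing matrix e."""
    e = np.asarray(e, dtype=float)
    trace = np.trace(e)       # share of edges joining nodes with equal values
    sum_sq = (e @ e).sum()    # sum of all elements of the matrix product e.e
    return (trace - sum_sq) / (1.0 - sum_sq)
```

As a cross-check, the result for a given attribute should agree with nx.attribute_assortativity_coefficient(G, "gender") computed directly on the graph.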

This coefficient can vary from -1 (perfectly disassortative) to +1 (perfectly assortative). In this case we have a low but positive assortativity on this attribute.

3. Conclusion

By replicating the process above for each attribute, we can find the most assortative attributes in the network:
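One way to replicate the process, sketched with a hypothetical helper rank_attributes. It assumes that every node carries each listed attribute, and it skips attributes for which the coefficient is undefined (for example, when all nodes share a single value):

```python
import math

import networkx as nx


def rank_attributes(G, attributes):
    """Return (attribute, coefficient) pairs, most assortative first."""
    scores = []
    for attribute in attributes:
        r = nx.attribute_assortativity_coefficient(G, attribute)
        if not math.isnan(r):  # undefined coefficients are left out
            scores.append((attribute, float(r)))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

The list of attribute names to pass in would come from the “.featnames” files parsed earlier.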

Image by author

As expected, the network is assortative especially on attributes related to geolocation, birthyear, family name and school, which can determine the circumstances or reasons why people meet each other. The only slightly disassortative attribute is political orientation.

References

  • [1] Newman et al., “Mixing patterns in networks”, 2003 (available here)
  • [2] The data used in this post is from the SNAP Stanford Large Network Dataset Collection, http://snap.stanford.edu/data, by Jure Leskovec and Andrej Krevl
