Detecting communities in a language co-occurrence network

Implementing community detection algorithms in Igraph with Python

Harry Bitten
Towards Data Science



In this post, we are going to undertake community detection with the Python package igraph, attempting to detect communities within a language co-occurrence network. This will be implemented using two popular community detection algorithms: Walktrap and Label Propagation.

Background

Global language co-occurrence networks (GLCNs) link languages that are likely to be co-spoken. Representing language co-occurrence as a network allows inference about international information sharing and knowledge transfer. This includes the diversity of information received by speakers of the language, the speed at which the information will be received, and the ability of native speakers to globally disseminate information.

Let us begin with the formal foundation on which networks are built: the notion of a simple graph, which comprises a set of nodes (or vertices) together with a set of links (or edges) between them. A graph may be directed (each edge indicates a one-way relationship) or undirected (each edge indicates a two-way relationship and can be traversed in both directions). Moreover, a graph may be binary, in which case a connection between two nodes either exists or does not, or it may be weighted, in which case each connection carries a specific weight or value, commonly a positive number. The purpose here is to give a brief, introductory discussion of the formal idea of a network in the context of complex network theory (and social network analysis), and to describe, mostly by example, a couple of the many computational techniques commonly used to detect communities in a graph-theoretic setting. In particular, we explore language co-occurrence within Global language co-occurrence networks (GLCNs) by applying existing algorithms that discover community structure in graphs.

Walktrap

Walktrap, developed by Pascal Pons and Matthieu Latapy, is a graph-theoretic algorithm that identifies communities in large networks via random walks. These random walks are used to compute distances between nodes, which are then grouped via bottom-up hierarchical clustering so that intra-community distances are small and inter-community distances are large. It should be noted, of course, that this algorithm assigns each node to exactly one community, which in some cases can be an incorrect hypothesis.
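The intuition behind these random-walk distances can be illustrated with a short, library-free sketch: after a short walk, nodes in the same community tend to "see" the rest of the graph similarly. The toy graph and node names below are purely illustrative, and this is only the distribution computation, not the full Walktrap distance or clustering.

```python
# Toy undirected graph: two triangles {a, b, c} and {x, y, z}
# joined by the single bridge edge c-x.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
    "x": ["c", "y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}

def walk_probs(graph, start, t):
    """Distribution over nodes after a t-step uniform random walk from `start`."""
    probs = {start: 1.0}
    for _ in range(t):
        nxt = {}
        for node, p in probs.items():
            # Each neighbor is chosen with equal probability 1/degree.
            for nbr in graph[node]:
                nxt[nbr] = nxt.get(nbr, 0.0) + p / len(graph[node])
        probs = nxt
    return probs

p_a = walk_probs(graph, "a", 4)
p_b = walk_probs(graph, "b", 4)
p_y = walk_probs(graph, "y", 4)
# a and b, sitting in the same triangle, end up with far more similar
# 4-step distributions than a and y, which sit in different groups.
```

Walktrap builds node-to-node distances from exactly these t-step distributions (with an additional degree normalization) and then merges the closest groups hierarchically.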

Label Propagation

Label propagation, or LPA, proposed by U. N. Raghavan and colleagues, is a near-linear community detection method that benefits from a fairly simple implementation. The algorithm works as follows:

  1. A network is characterized as a graph G(V, E), where V is the full set of nodes and E is the full set of edges.
  2. For each node i (i ∈ V), let Li denote the label of i and N(i) the set of its neighbors.
  3. At the start of the process, each node is assigned a unique label, e.g. Li = i.
  4. These labels then propagate through the network, with each node updating its label at every iteration to the one shared by most of its neighbors.
  5. This process is repeated until every node has one of the most frequent labels among its neighbors, i.e. no node needs to change its label.
  6. Finally, communities are constructed from nodes that share the same label.
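The steps above can be sketched in plain Python on a toy adjacency list. This is a minimal illustration, not igraph's implementation: the graph, node names, and deterministic tie-breaking rule are assumptions made for the sketch.

```python
import random
from collections import Counter

# Toy undirected network: two obvious groups {a, b, c} and {x, y, z}
# joined by the single bridge edge c-x.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
    "x": ["c", "y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}

def label_propagation(graph, max_iters=100, seed=0):
    rng = random.Random(seed)
    labels = {v: v for v in graph}        # step 3: unique initial labels
    nodes = list(graph)
    for _ in range(max_iters):            # steps 4-5: iterate until stable
        rng.shuffle(nodes)                # asynchronous, random update order
        changed = False
        for v in nodes:
            counts = Counter(labels[u] for u in graph[v])
            best = max(counts.values())
            # Labels tied for most frequent; keep ours if it is among them,
            # otherwise switch (ties broken deterministically here).
            candidates = sorted(l for l, c in counts.items() if c == best)
            if labels[v] not in candidates:
                labels[v] = candidates[0]
                changed = True
        if not changed:
            break
    # Step 6: communities are groups of nodes sharing a label.
    communities = {}
    for v, l in labels.items():
        communities.setdefault(l, set()).add(v)
    return list(communities.values())

print(label_propagation(graph))
```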

Dataset

The book translations global languages network from the Massachusetts Institute of Technology (MIT) will be used for this post. The dataset is a directed network, derived from over 2.2 million book translations published worldwide from 1979 to 2011. The translation data was obtained from UNESCO’s Index Translationum — an international index of printed book translations. In the network, nodes represent individual languages and edges indicate a translation of a book from one language to another. The frequency of the translation in the dataset — the co-occurrence frequency — can be used to weight the edges.

Properties of the obtained graph

Implementation

Our chosen algorithms (Walktrap and Label Propagation) will be implemented in Python using the igraph package, and preparation for testing is fairly simple. igraph requires only SourceNode, DestinationNode, and Weight, and fortunately our dataset comes with these three columns by default. First, we can load the network into the workspace using the igraph function Read_Ncol, as shown below:

import cairocffi  # plotting backend used by igraph
import igraph

# Read the edge list in NCOL format; the third column becomes the edge weight
g = igraph.Graph.Read_Ncol('books.txt', directed=True)

We can then simply use

igraph.plot(g)

to get a graphical representation of the network:

Book Translation Network implemented in igraph

community_walktrap in igraph takes two optional parameters, weights and steps, where weights is a list containing edge weights and steps is the length of the random walks to perform. The weights are automatically assigned from the third column by Read_Ncol, so specifying weights in the function is as simple as

weights=g.es["weight"]

where ‘g’ is the graph that was read in. Generally speaking, 3, 4, or 5 steps are used in implementation. For this post, we’re going to use 4.

Running community_walktrap returns a VertexDendrogram object which is initially cut at the maximum modularity. Running as_clustering() on this object returns a VertexClustering object, which describes the clustering of the vertex set of a graph.

wtrap = g.community_walktrap(weights=g.es["weight"], steps = 4)
clust=wtrap.as_clustering()
A sample of the VertexClustering object from the walktrap algorithm

Although this is fine, it's not particularly interesting to look at! We can plot this result in igraph using plot(), with two further parameters, mark_groups and **visual_style, where mark_groups specifies whether to highlight some of the vertex groups with coloured polygons and **visual_style specifies visual properties. We can simply set up a Python dictionary containing the keyword arguments that we wish to pass to plot(), which in this case includes the vertex labels.

visual_style = dict()
visual_style["bbox"] = (700, 600)
visual_style["vertex_label"] = g.vs["name"]
igraph.plot(clust,mark_groups = True,**visual_style)
Plotting the output of the walktrap algorithm on our dataset

The output shows that two main communities are formed, containing 71 and 94 nodes respectively, along with a third, smaller community of only 3 members. The remaining 100 nodes (37.3%) were each assigned to their own cluster (i.e. n = 1), which we would not count in our definition of a community. The two main communities seem to revolve around English and Russian; we can verify this by looking back at the VertexClustering object and inspecting the members of each community.
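One way to inspect the members is to group the clustering's membership vector by community index. This is a minimal sketch: in igraph, `clust.membership` gives each vertex's community index and `g.vs["name"]` the vertex names, but here we mimic those two lists with hypothetical toy data so the grouping logic stands alone.

```python
# Toy stand-ins for clust.membership and g.vs["name"];
# the language codes and assignments are hypothetical.
membership = [0, 0, 1, 1, 0, 2]               # e.g. clust.membership
names = ["en", "ru", "fr", "de", "es", "eo"]  # e.g. g.vs["name"]

# Group vertex names by their community index.
communities = {}
for name, comm in zip(names, membership):
    communities.setdefault(comm, []).append(name)

for comm, members in sorted(communities.items()):
    print(f"community {comm} ({len(members)} members): {members}")
```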

Label propagation is a simpler approach requiring no parameters; however, there are a few optional arguments: weights, initial, and fixed, where weights is identical in structure to the walktrap approach. initial can be a list of the initial vertex labels, and fixed a list of Booleans for each vertex, with True corresponding to vertices whose labels should not change during the algorithm. These last two parameters were not required in our testing, with only weights being specified. Unlike walktrap, label propagation does not return a dendrogram object; it directly returns a VertexClustering object, which can then be plotted in an identical fashion to the walktrap approach.

labelProp = g.community_label_propagation(weights=g.es["weight"])
igraph.plot(labelProp,mark_groups = True,**visual_style)
Plotting the output of the Label Propagation algorithm on our dataset

As found with the Walktrap algorithm, running label propagation on the weighted dataset identified two major communities of similar size, with 70 and 102 members. There were, however, a further two minor communities, containing 3 and 8 members respectively. Again, approximately one third of the nodes (n = 85, 31.7% of the dataset) were left in singleton clusters rather than being classified into communities.

It is important to note that the high weights and dense nature of the graph have a considerable impact on how the algorithms behave in this analysis. For example, for walktrap in an unweighted graph, a node with 3 edges would have a 1/3 chance of traversing each edge in a random walk. However, when taking weights into account, the probability of choosing a highly weighted edge (mainly those incident to English and Russian) becomes extremely high, causing the walks to get stuck going back and forth on a single edge. The situation with label propagation differs slightly: after each iteration, a node's label updates to the one shared by most of its neighbours. As Russian and English have such high in- and out-degrees, the majority of nodes have one of these two languages in their neighborhood set, meaning there is a large possibility that these nodes will update their labels to match one of the two, even more so when edge weights are considered.
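The effect of weights on the walk can be made concrete with a tiny calculation. The neighbor names and edge weights below are hypothetical, chosen only to illustrate how one heavy edge dominates the transition probabilities.

```python
# Hypothetical weighted edges from one node: neighbor -> edge weight.
edges = {"en": 950.0, "fr": 30.0, "de": 20.0}

def transition_probs(weights):
    """Probability of traversing each edge on a single weighted walk step."""
    total = sum(weights.values())
    return {nbr: w / total for nbr, w in weights.items()}

# Unweighted walk: each of the three edges is equally likely (1/3 each).
unweighted = transition_probs({nbr: 1.0 for nbr in edges})
# Weighted walk: the heavy "en" edge is taken 95% of the time.
weighted = transition_probs(edges)

print(unweighted)
print(weighted)
```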

Rerunning without edge weights

Although removing the weights wouldn’t have much effect on the community structure with the label propagation, it should have a large impact on the outcome of the Walktrap algorithm. We can therefore re-run the algorithm, leaving weights unspecified:

wtrapnoweights = g.community_walktrap(steps = 4)
clustnoweights=wtrapnoweights.as_clustering()
VertexClustering object from Walktrap using an unweighted graph

We can already see that many more communities have been formed, and it is looking to be a much more interesting result. So let’s plot it!

igraph.plot(clustnoweights,mark_groups = True, **visual_style)
Output of the walktrap algorithm on our dataset with no weights

Repeating the test, this time giving all edges equal weight, yielded very different results. Ten communities were detected, with only a single node assigned to its own cluster. There was also less imbalance in the sizes of the identified communities (mean 22.4 members; minimum 3; maximum 54). This result is actually very interesting, even more so if you look at the VertexClustering object: community 9 shows that all the Scandinavian languages have formed a community, and community 7 has clustered many Asian languages together, such as Korean and Japanese.

Conclusion

In this post, we have compared two community detection algorithms using the book translations global languages network. We found that, when using the weighted network, neither approach generated many distinct communities; instead, two major communities generally formed. This suggests that there are not distinct groups of languages and that, in this network, information sharing occurs at an international level. This conclusion is supported by the similarities between the two major communities identified by the two approaches, in terms of their size and constituents, assessed by qualitatively considering the language families of the community members, suggesting that the findings were not an artefact of the method employed. By removing the edge weights, we found that this attribute of the underlying dataset was very important in community formation. However, it is not known whether the edge weights used in this work are truly representative of language co-occurrence frequencies. The weights were derived from data reported to UNESCO by many different international libraries; these data may be subject to the reporting biases of the various locations, with different locations following different practices and having differing levels of completeness, and they may also not be up to date. This warrants further investigation.

An interesting extension would perhaps be to re-run the algorithms on the graph but without English and Russian nodes. We would expect that this would return a more interesting set of communities, that aren’t quite as affected by the large bias towards highly connected nodes and heavily weighted edges.

Thanks for checking this article out! Although this isn't normally the sort of thing I would create, I found the idea of community detection fascinating. Full information and code can be found on my GitHub below.

See you next time!
