
Graph Sampling Strategies for Clustering

How to generate high-modularity clustering on sampled graphs

Photo by Nastya Dulhiier on Unsplash

Graphs are complex structures to analyze and can be challenging to deal with when they get too big. The analysis of enormous graphs has been a hot research area in recent years, generating initiatives like the Open Graph Benchmark (OGB) and its large-scale counterpart, OGB-LSC.

One way of working with these huge graphs is to sample them so that their original structure remains represented in the sample, and then run the analysis on that sampled graph.

However, graphs are not easy to sample. How do you account for the information lost when you remove nodes or edges? To start studying this theme, today we will review and implement the paper Clustering, Prominence and Social Network Analysis on Incomplete Networks [1], which proposes two algorithms for edge sampling on graphs.

We will start by understanding the ideas, limitations, and results the authors presented. Then, we will implement the methods, try to work around one of their limitations, and see if they actually work.

The notebook for the analysis is available on Kaggle and on my Github.

Initial Considerations

First of all, let’s introduce the basic assumptions of the paper. The sampling methodology was evaluated by applying a clustering method to the sampled graph and checking whether the resulting clustering, mapped back to the original graph, is similar to or better than the one obtained by applying the same algorithm to the full graph.

For this, they used an algorithm called SpeakEasy. Because I could not find a Python implementation for this algorithm, I used the Greedy Modularity Maximization algorithm available in NetworkX.
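As a minimal sketch, this is how the NetworkX routine can be invoked (the toy barbell graph here is my own choice for illustration):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two 5-node cliques joined by a single edge.
G = nx.barbell_graph(5, 0)

# Greedy modularity maximization (Clauset-Newman-Moore).
communities = greedy_modularity_communities(G)

# Each community is a frozenset of node ids.
print([sorted(c) for c in communities])
```

The routine returns a partition of the node set, which is what the modularity metric below is computed on.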

They also analyzed several metrics. For this analysis, I used only modularity, one of the many metrics considered in the paper.
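For context, modularity measures how much denser the intra-community edges are than expected at random, and NetworkX exposes it directly. A minimal sketch (the graph and the hand-made split are illustrative choices of mine):

```python
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.karate_club_graph()

# A hand-made two-way split of the 34 nodes, just for illustration.
partition = [set(range(17)), set(range(17, 34))]

# Modularity lies in [-1/2, 1]; higher means stronger community structure.
q = modularity(G, partition)
print(q)
```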

The paper states that this method applies only to undirected graphs. This is especially true for the first algorithm, which uses the symmetry of the adjacency matrix. In this analysis, however, I also considered a directed network to see how these methods hold up.
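A quick way to check that symmetry assumption, plus one ad-hoc workaround I am sketching here (symmetrizing the matrix, which is not something the paper proposes), could look like this:

```python
import numpy as np
import networkx as nx

# A small random directed graph; its adjacency matrix is generally
# not symmetric.
G = nx.gnp_random_graph(20, 0.2, seed=42, directed=True)
A = nx.to_numpy_array(G)

# The symmetry assumption the first algorithm relies on:
print(np.allclose(A, A.T))

# One possible workaround (not from the paper): keep an edge in both
# directions whenever it exists in at least one direction.
A_sym = np.maximum(A, A.T)
print(np.allclose(A_sym, A_sym.T))  # True
```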

Finally, one important thing to keep in mind is that both methods depend on constructing the adjacency matrix, which, for graphs with billions of nodes, can be unfeasible. The authors state that other representations are possible but do not go into further detail.
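To make this concrete, here is how the dense adjacency matrix is typically obtained in NetworkX, alongside a sparse matrix as one example of an alternative representation that scales further (a sketch, assuming SciPy is available):

```python
import networkx as nx

G = nx.karate_club_graph()  # 34 nodes, 78 edges

# Dense adjacency matrix: O(n^2) memory, unfeasible for huge graphs.
A = nx.to_numpy_array(G)
print(A.shape)  # (34, 34)

# Sparse alternative: memory proportional to the number of edges.
A_sparse = nx.to_scipy_sparse_array(G)
print(A_sparse.nnz)  # one stored entry per edge direction
```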

To analyze the results, I considered 4 networks, all freely available in the SNAP repository from Stanford. For this post, I will show you the results of two of them: the ego-Facebook network [2], an undirected network consisting of friends lists from Facebook, and the p2p-Gnutella08 network [3], a directed graph consisting of a peer-to-peer file-sharing network.

Each algorithm and sample size was applied several times to account for stochasticity in the results.

Sampling Strategy 1

The first strategy samples columns of the adjacency matrix and then uses a linear regression to estimate the links that were sampled out. Using the predictions from this regression, a new adjacency matrix is created and the clustering algorithm is applied to it.

The algorithm goes as follows:

  • Let A be the adjacency matrix of the graph G, with node set N
  • Randomly sample k columns from the adjacency matrix and save their indexes in a variable K
  • Build the sample X -> X = A(:, K)
  • Create a target Y partially filled using the symmetry of A -> Y(K, :) = X(N-K, :).T
  • Perform a linear regression using the Moore-Penrose pseudo-inverse of X -> Y_hat = X(X⁺Y)
  • Generate a new adjacency matrix with the sampled columns: A_hat(:, K) = X
  • Add the reconstructed columns: A_hat(:, N-K) = Y_hat

As you can see, the symmetric property of the matrix is directly used here to define Y. Let’s see what this implementation looks like in Python:

import numpy as np
import networkx as nx

def algorithm_1(A, sample_size, n, N):
    # Sample k column indexes from the node set N.
    k = sample_size
    K = randsample(N, k)
    X = A[:, K]

    # Indexes of the columns that were NOT sampled.
    rest = list(set(N) - set(K))

    # Target Y, partially filled using the symmetry of A.
    Y = np.zeros((n, n - k))
    Y[K, :] = X[rest, :].T

    # Linear regression through the Moore-Penrose pseudo-inverse.
    Y_hat = np.dot(X, np.dot(np.linalg.pinv(X), Y))

    # Reassemble the adjacency matrix from sampled and reconstructed columns.
    A_hat = np.zeros((n, n))
    A_hat[:, K] = X
    A_hat[:, rest] = Y_hat

    # from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array
    # is the current equivalent.
    G_algo1 = nx.from_numpy_array(A_hat)
    return G_algo1

Where the randsample method is implemented as:

def randsample(N, k):
    # Draw k distinct node indexes from N without replacement.
    K = np.random.choice(N, size=k, replace=False)
    # np.random.choice returns strings when N contains string labels,
    # so cast back to integers for array indexing.
    if isinstance(K[0], str):
        K = [int(i) for i in K]
    return K

As you can see, this is a direct implementation of the algorithm proposed in the paper. Notice that, in the paper, the dimension of the matrix Y_hat is stated incorrectly; it should be (n, n-k), as in the implementation above, otherwise the matrix operations do not make sense.

After implementing this method, I noticed that, instead of generating graphs smaller than the original, the linear regression was actually generating more edges. We will see in the analysis that this made the clustering algorithm take much longer on the sampled graph than on the original one.
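To see why this happens, here is a small self-contained sketch of the Algorithm 1 reconstruction on a toy graph (the graph and sample sizes are arbitrary choices of mine): the pseudo-inverse regression returns dense real-valued columns, so the reconstructed matrix typically has many more nonzero entries than the original.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Small undirected toy graph and its adjacency matrix.
G = nx.erdos_renyi_graph(50, 0.1, seed=0)
A = nx.to_numpy_array(G)
n = A.shape[0]
k = 20  # number of sampled columns

K = rng.choice(n, size=k, replace=False)
rest = np.setdiff1d(np.arange(n), K)

# Algorithm 1 reconstruction.
X = A[:, K]
Y = np.zeros((n, n - k))
Y[K, :] = X[rest, :].T
Y_hat = X @ (np.linalg.pinv(X) @ Y)

A_hat = np.zeros((n, n))
A_hat[:, K] = X
A_hat[:, rest] = Y_hat

# The regression output is dense and real-valued, so the reconstructed
# graph tends to carry far more (weighted) edges than the original.
print(np.count_nonzero(A), np.count_nonzero(A_hat))
```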

For the Facebook network, in terms of execution time, as I said, the sampled graph fares much worse:

The elapsed time of algorithm 1 in the Facebook network. Developed by the author.

In terms of modularity, beyond a sample size of about 40%, the clustering on the sampled graph was able to surpass the modularity obtained on the original graph, but at the expense of the longer elapsed time.

The modularity of the clustering on the Facebook network sampled by Algorithm 1. Developed by the author.

It is important to note, however, that for one of the networks not shown here, also undirected, this sampling method was not able to surpass the original modularity and also took much longer. Refer to the notebook for more details.

Now, for the p2p-Gnutella network, the same phenomenon occurred for the elapsed time, but this time the modularity did not come even close to the original, as the following graphs show:

The elapsed time of algorithm 1 in the p2p-gnu network. Developed by the author.
The modularity of the clustering on the p2p-gnu network sampled by Algorithm 1. Developed by the author.

Sampling Strategy 2

The second sampling strategy consists of sampling the original adjacency matrix several times and then aggregating the results to generate the new graph to be considered.

This methodology has the effect of changing the weights of the edges, compensating for the edges that were removed from the network. The algorithm goes as follows:

  • Let A be the adjacency matrix of the graph G
  • Randomly sample k columns from the adjacency matrix and save their indexes in a variable K
  • Get the K columns from A -> X1(:, K) = A(:, K)
  • Apply the symmetric property -> X1(K, :) = A(:, K).T
  • Scale X1 by a diagonal scaling matrix P1 whose entries are the inverse of the square root of the number of nodes -> X1 = diag(P1) * X1
  • Generate a new set of columns, excluding the previously sampled ones, and save it to X2
  • Generate the new adjacency matrix as a weighted combination of the two samples -> A_hat = alpha * X1 + (1 - alpha) * X2

Implementing it in Python, it becomes:

def algorithm_2(A, sample_size, n, N, symmetric=True):

    def construct_X(N, k):
        # Sample k columns and place them in an otherwise empty matrix.
        K = randsample(N, k)
        X1 = np.zeros((n, n))
        X1[:, K] = A[:, K]
        if symmetric:
            # Fill the matching rows using the symmetry of A.
            X1[K, :] = A[:, K].T
        # Uniform scaling by 1 / sqrt(n), equivalent to the diagonal
        # scaling matrix P1 with constant entries.
        X1 = np.sqrt(1 / n) * X1
        return X1, K

    # Each of the two samples takes half of the total sample size.
    k = int(np.floor(sample_size / 2))
    alpha = 0.5
    X1, K1 = construct_X(N, k)
    # The second sample is drawn only from the columns not sampled before.
    X2, K2 = construct_X(list(set(N) - set(K1)), k)
    # Weighted combination of the two scaled samples.
    A_algo2 = alpha * X1 + (1 - alpha) * X2
    # from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array
    # is the current equivalent.
    G_algo2 = nx.from_numpy_array(A_algo2)

    return G_algo2

Note that I implemented a simplified version of the algorithm proposed in the paper. The paper generates the second matrix with specific sampling mechanisms from linear algebra; to keep it simple, I just did another random sample. This may work against the performance of the method.
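For reference, one common linear-algebra column-sampling mechanism (not necessarily the exact one used in the paper) is length-squared sampling, where columns are drawn with probability proportional to their squared norm. A hedged sketch of what swapping it in could look like:

```python
import numpy as np

rng = np.random.default_rng(1)

def norm_sample_columns(A, k):
    """Sample k column indexes with probability proportional to each
    column's squared Euclidean norm (length-squared sampling)."""
    norms = np.linalg.norm(A, axis=0) ** 2
    p = norms / norms.sum()
    return rng.choice(A.shape[1], size=k, replace=False, p=p)

# Toy 0/1 "adjacency" matrix, just for illustration.
A = rng.integers(0, 2, size=(30, 30)).astype(float)
K = norm_sample_columns(A, 10)
print(sorted(int(i) for i in K))
```

High-degree (high-norm) columns are more likely to be kept, which is the intuition behind this family of sketching methods.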

Now, for the results:

The elapsed time of algorithm 2 in the Facebook network. Developed by the author.
The modularity of the clustering on the Facebook network sampled by Algorithm 2. Developed by the author.

As we can see, for the Facebook network the algorithm was able to generate faster clusterings that even surpassed the modularity obtained on the full network, which is a very good result given the simplicity of the method.

However, for the p2p-gnu network, we cannot say the same:

The elapsed time of algorithm 2 in the p2p-gnu network. Developed by the author.
The modularity of the clustering on the p2p-gnu network sampled by Algorithm 2. Developed by the author.

As we can see, the clustering runs faster; however, the sampled graph was never able to reach the modularity of the full graph, even with larger sample sizes.

In the notebooks linked above, the Ca-Gr network, which is undirected, achieved good modularity results. This is an indication that relaxing the symmetry condition of this algorithm, as with the directed p2p-Gnutella network, heavily hurts its performance.

Conclusion

In this post, we explored two mechanisms of sampling graphs for clustering purposes. We tried to verify if the symmetry assumption of the methods is in fact necessary and tried to reproduce the claims of the paper.

Notice, again, that some simplifications were made to the algorithm during development.

I hope this can be useful for you in your journey as a Data Scientist.

[1] Hegde, K., Magdon-Ismail, M., Szymanski, B.K., & Kuzmin, K. (2016). Clustering, Prominence and Social Network Analysis on Incomplete Networks. Complex Networks.

[2] J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.

[3] M. Ripeanu and I. Foster and A. Iamnitchi. Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Computing Journal, 2002.

