The third part of our journey into Graph Neural Networks. Today we cover the theoretical background of Perozzi's 2014 paper "DeepWalk: Online Learning of Social Representations".

In the very first post of this series, we learned how the Graph Neural Network (GNN) model works. We saw that the GNN returns node-based and graph-based predictions and is backed by a solid mathematical background; in particular, the transition and output functions satisfy Banach's fixed-point theorem. However, despite the successful GNN applications, there are some hurdles, as explained in [1]. The main idea of the GNN model is to build state transition and output functions, f𝓌 and g𝓌, and iterate until these functions converge within a threshold. This is a strong constraint that may limit the extendability and representation ability of the model. Secondly, the GNN cannot exploit representation learning, namely learning how to represent a graph with low-dimensional feature vectors. Third, the GNN is based on an iterative learning procedure in which labels and features are mixed; this mix could lead to cascading errors, as proved in [6].
To solve these problems, DeepWalk [2] emerged in 2014 as a possible enhancement over the GNN. DeepWalk is the first graph model to implement the embedding concept, and it is based on representation learning and language modelling, namely the SkipGram model [3].
DeepWalk model
Definitions
DeepWalk learns social representations of a graph's vertices by modelling a stream of short random walks. Social representations are hidden connections across the vertices of a graph, which can be decoded as latent features. Thus, DeepWalk reads an input graph and outputs its latent representation.
How do we get to a latent representation of the graph? Well, here lies the main difference from Scarselli's approach. Scarselli's GNN is a supervised approach, while DeepWalk is an unsupervised method: it has to learn from the graph's features alone to understand and capture the graph's structure, independently of node or graph labels. This mapping approach has proved more efficient than other social-dimension classifiers [4, 5], increasing classification performance by 5–10% in Micro-F1 score and outperforming "competitors" even when given 60% less training data. Finally, the DeepWalk framework scales easily, so it can cope with web-scale graphs such as those of Facebook, Twitter or YouTube.
Mathematical insights I: definitions
DeepWalk returns insightful latent representations of an input graph, based on the nodes' social attributes. The graph is defined as G = (V, E), made of vertices V and edges E. Vertices may be partially labelled with labels Y, and they are described by attributes/features X. Features are real-valued vectors of dimension S, and labels are real-valued vectors as well.
Given this real-valued input domain, DeepWalk translates the nodes' features X into a latent domain of dimension d. This mapping describes the social phenomena with a smaller set of features, which helps in better understanding the relationships in the underlying data and gives an immediate visual answer when the embeddings are fed into PCA or t-SNE.
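In the notation of the DeepWalk paper [2], this setup can be summarised as a mapping from a partially labelled graph to a low-dimensional embedding matrix (a LaTeX reconstruction, with the symbols defined above):

```latex
G_L = (V, E, X, Y), \quad X \in \mathbb{R}^{|V| \times S}, \quad Y \in \mathbb{R}^{|V| \times |\mathcal{Y}|}
\;\;\longmapsto\;\;
X_E \in \mathbb{R}^{|V| \times d}, \quad d \ll |V|
```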
To understand the theory and the process underneath DeepWalk, before dealing with the practical implementation, I provide here an example with the Karate Club graph. Fig. 1 shows the input graph G, which has 34 vertices and 78 edges.

The input feature matrix X depicts the relationships between nodes, and we want to use DeepWalk to find a latent representation of the graph that reveals the best division of the club into groups. For example, with a latent dimension of 10, DeepWalk will produce 10 new features with which to subdivide the graph into groups.
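To make the running example concrete, here is a minimal sketch that loads Zachary's Karate Club graph, which ships with networkx (the variable names are just illustrative):

```python
import networkx as nx

# Zachary's Karate Club graph, bundled with networkx
G = nx.karate_club_graph()

print(G.number_of_nodes(), G.number_of_edges())  # 34 78

d = 10  # target latent dimension for the DeepWalk embedding
```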
Mathematical insights II: generation of random walks
To achieve the latent representation, DeepWalk starts by creating γ (e.g. 100) random walks from the input graph. A random walk Wvᵢ starts at a vertex vᵢ and, at each step, moves to a randomly selected neighbour of the current vertex, up to a maximum walk length k (e.g. 10 vertices; for an example, see fig. 2).

Sampling random walks lets the algorithm extract information from a network while guaranteeing, on one side, easy parallelisation of the computation and, on the other side, a dynamic way of exploring the graph that can capture new information once the graph has changed. This latter aspect allows an iterative update of the model by training on a single graph subset, without recomputing over the whole graph.
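Here is a minimal sketch of the walk generation described above, assuming γ walks are started from every vertex (as in the original paper); the helper names random_walk and build_corpus are chosen for illustration:

```python
import random
import networkx as nx

def random_walk(G, start, walk_length=10, rng=random):
    """One truncated random walk: at each step, hop to a random neighbour."""
    walk = [start]
    for _ in range(walk_length - 1):
        neighbours = list(G.neighbors(walk[-1]))
        if not neighbours:            # isolated vertex / dead end
            break
        walk.append(rng.choice(neighbours))
    return [str(v) for v in walk]     # stringified so each walk reads like a sentence

def build_corpus(G, gamma=100, walk_length=10, seed=42):
    """Collect gamma truncated walks per vertex, shuffling the start order each pass."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(gamma):
        nodes = list(G.nodes())
        rng.shuffle(nodes)
        for v in nodes:
            corpus.append(random_walk(G, v, walk_length, rng))
    return corpus

G = nx.karate_club_graph()
walks = build_corpus(G, gamma=100, walk_length=10)
print(len(walks), walks[0])           # 3400 walks, e.g. ['0', '5', '16', ...]
```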
Mathematical Insights III: Language modelling
Once the γ (e.g. 100) "truncated" random walks have been collected, it is possible to apply language modelling to extract the latent description of the input social representation. In particular, the collection of random walks can be seen as a corpus of short sentences (fig. 3 shows 10 example walks), each made up of k words w: Wvᵢ = (w₀, w₁, …, wₖ). From here, DeepWalk computes the word-embedding, or rather node-embedding, representation of the whole input corpus G. This is a remarkable point, because it is the first time in the graph world that someone writes about the concept of node embeddings.
The node embedding can be described by a mapping Φ whose values live in the real-number domain. Φ can be seen as a matrix whose dimensions are the number of vertices |V| and the latent dimension d (in this case 34 × 10). The goal is to estimate the likelihood of observing a vertex vᵢ given the embeddings of all the previously visited vertices:
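A LaTeX reconstruction of eq. 1, following the DeepWalk paper: the probability of the next vertex in a walk, conditioned on the embeddings of all the vertices visited so far,

```latex
\Pr\bigl( v_i \mid \Phi(v_1), \Phi(v_2), \ldots, \Phi(v_{i-1}) \bigr) \tag{1}
```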

However, this calculation depends on the graph size and on the length of the random walks: as they grow, eq. 1 becomes computationally inefficient. A solution is proposed in [7, 8], where the problem is reversed: rather than predicting a missing word from its context, we compute the likelihood of a word appearing as a neighbour of a given word (i.e. we predict the context from the word), as described in fig. 4.

Such an approach is called SkipGram [7–11], and it aims to solve the following optimization problem:
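A LaTeX reconstruction of eq. 2: given the embedding Φ(vᵢ) of the centre vertex, SkipGram minimizes the negative log-likelihood of the vertices appearing within a window of size w around it in the walk,

```latex
\min_{\Phi} \; -\log \Pr\bigl( \{ v_{i-w}, \ldots, v_{i+w} \} \setminus v_i \;\mid\; \Phi(v_i) \bigr) \tag{2}
```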

From equation 2 it is possible to describe the vertices in the d-dimensional latent representation, so that similar vertices obtain similar representations.
Mathematical Insights IV: Hierarchical Softmax
Looking closely at the SkipGram algorithm, we can list the following "ingredients":
- a window of size w to slide across the input sequences
- the word embedding latent dimension d
- a 2-layer neural network, which receives the input corpus
- an activation layer to compute the probability distribution of words
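These ingredients map almost one-to-one onto an off-the-shelf word-embedding trainer such as gensim's Word2Vec. A minimal sketch, assuming gensim ≥ 4 and the `walks` corpus built earlier:

```python
from gensim.models import Word2Vec

# `walks` is the corpus of stringified random walks built earlier.
model = Word2Vec(
    sentences=walks,
    vector_size=10,   # latent dimension d (named `size` in gensim < 4)
    window=5,         # sliding window w over each walk
    sg=1,             # use the SkipGram architecture
    hs=1,             # hierarchical softmax (discussed in the next paragraphs)
    min_count=0,      # keep every vertex, even rarely visited ones
    workers=4,
    epochs=5,
)

print(model.wv["0"][:3])  # first 3 latent coordinates of vertex '0'
```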
For each vertex vᵢ in a given random walk, we have a representation vector Φ(vᵢ). Given this vertex representation, we would like to maximize the probability of its neighbours in the random walk (eq. 2). Such a posterior distribution can be learned through several algorithms, but computing it exactly can require a huge effort, since the normalization runs over all the vertices of the graph (up to billions in some cases). In DeepWalk, the model uses the Hierarchical Softmax approach [12, 13], which alleviates the computational cost and speeds up the training time by approximating the probability distribution.
Hierarchical Softmax is based on a binary tree built with Huffman coding, which assigns the graph's vertices to the leaves of the tree, giving shorter paths to the most frequent vertices. Predicting a vertex then amounts to walking from the root down to the leaf associated with that vertex, and the posterior probability is computed as the product of the probabilities of the binary decisions taken along that path, reducing the cost of one prediction from O(|V|) to O(log |V|).
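In formulas (a LaTeX reconstruction following the DeepWalk paper): if (b₀, b₁, …, b_⌈log|V|⌉) is the path from the root b₀ to the leaf associated with the target vertex uₖ, then

```latex
\Pr\bigl( u_k \mid \Phi(v_j) \bigr) \;=\; \prod_{l=1}^{\lceil \log |V| \rceil} \Pr\bigl( b_l \mid \Phi(v_j) \bigr)
```

where each factor is a simple binary (logistic) classifier attached to an internal node of the tree.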
Final remarks: visualise the embeddings
What will we have at the end? Once DeepWalk has run, the SkipGram/Word2Vec model will return the latent-space description of our input graph, namely a graph-embedding file.
For each vertex, the file reports the model's embedding, namely the position of that vertex in the latent representation space. How can we use this representation? Well, we can use at least two dimensionality-reduction algorithms to obtain a "human" view of such a space: the first is t-SNE [14] and the second is principal component analysis (PCA). Fig. 6 shows the PCA decomposition of the output embeddings for the Karate Club. As you can see, DeepWalk can retrieve at least 3 main groups into which the Karate Club can be subdivided, based on the input features/relationships across nodes. This allows people to analyse huge graphs, e.g. Facebook pages, to understand social relationships across pages, without the headache of dealing with raw representations and labelling.
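A minimal visualisation sketch with scikit-learn and matplotlib, assuming `model` and `G` are the Word2Vec model and the Karate Club graph from the previous sketches:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stack the learned embeddings into a (34, 10) matrix, one row per vertex.
nodes = [str(v) for v in G.nodes()]
embeddings = np.array([model.wv[v] for v in nodes])

# Project the 10-dimensional latent space down to 2 components for plotting.
coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for label, (x, y) in zip(nodes, coords):
    plt.annotate(label, (x, y))
plt.title("PCA of DeepWalk embeddings - Karate Club")
plt.show()
```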

Stay tuned for our next post: two practical applications of DeepWalk, to fully grasp the Python implementation 🙂
Please feel free to send me an email with questions or comments at [email protected], or reach out directly here on Medium.
BIBLIOGRAPHY
1. Zhou, Jie, et al. "Graph neural networks: A review of methods and applications." AI Open 1 (2020): 57–81.
2. Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014.
3. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
4. Tang, Lei, and Huan Liu. "Relational learning via latent social dimensions." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009.
5. Tang, Lei, and Huan Liu. "Leveraging social media networks for classification." Data Mining and Knowledge Discovery 23.3 (2011): 447–478.
6. Neville, Jennifer, and David Jensen. "A bias/variance decomposition for models using collective inference." Machine Learning 73.1 (2008): 87–106.
7. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
8. Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
9. Guthrie, David, et al. "A closer look at skip-gram modelling." LREC. Vol. 6. 2006.
10. Gittens, Alex, Dimitris Achlioptas, and Michael W. Mahoney. "Skip-Gram − Zipf + Uniform = Vector Additivity." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.
11. Mimno, David, and Laure Thompson. "The strange geometry of skip-gram with negative sampling." Empirical Methods in Natural Language Processing. 2017.
12. Mnih, Andriy, and Geoffrey E. Hinton. "A scalable hierarchical distributed language model." Advances in Neural Information Processing Systems 21 (2008): 1081–1088.
13. Morin, Frederic, and Yoshua Bengio. "Hierarchical probabilistic neural network language model." International Workshop on Artificial Intelligence and Statistics. PMLR, 2005.
14. Hinton, Geoffrey, and Sam T. Roweis. "Stochastic neighbor embedding." NIPS. Vol. 15. 2002.