Five challenges of graph processing

A summary of the most relevant problems graph computing practitioners have to overcome.

Juan M. Tirado
Towards Data Science


Photo by israel palacio on Unsplash

In a previous post, I shared my opinion about the lack of a clear platform/solution/framework/architecture for graph processing. But what are the main issues graph processing has to deal with? Let these lines serve as an appetizer for the curious minds out there.

(Images are property of the author unless mentioned otherwise.)

1. Graphs are unstructured

A graph is a collection of vertices V and edges E connecting these vertices. A graph G=(V,E) can be directed or undirected. In a directed graph, any edge from vertex u to vertex v has a direction (u → v). That is, we can go from u to v.

Image by Author

If the graph is undirected, the edge has no direction (u - v). That is, u is connected to v and v to u.

Image by Author

Additionally, we can have edges with weights W(e).

Image by Author

And vertices with weights W(u), W(v).

Image by Author
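To make these definitions concrete, here is a minimal Python sketch of how vertices, edges and weights can be represented. All the names below are my own illustrative choices, not part of any particular library:

# Minimal sketch of the definitions above. All names are illustrative.

# Directed graph G = (V, E): the edge (u, v) means we can go from u to v.
V = {"u", "v", "w"}
E_directed = {("u", "v"), ("v", "w")}

# Undirected graph: store every edge as an unordered pair.
E_undirected = {frozenset({"u", "v"}), frozenset({"v", "w"})}

# Edge weights W(e) and vertex weights W(u), W(v) as mappings.
W_edge = {("u", "v"): 2.5, ("v", "w"): 0.7}
W_vertex = {"u": 1.0, "v": 3.0, "w": 0.5}

print(("u", "v") in E_directed)               # True: edge u -> v exists
print(frozenset({"v", "u"}) in E_undirected)  # True: order does not matter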

2. Graph representation

Graphs are particularly suitable to represent absolutely anything, from social networks to interactions between atomic forces. This level of freedom is difficult to translate into a computer. A very computer-friendly representation of a graph is the adjacency matrix A. A is a square matrix of size |V| x |V| where the entry A(i,j) indicates whether there exists an edge between vertices i and j. For undirected graphs, A is a symmetric matrix. For example, for the graph

Image by Author

we can get its representation in an adjacency matrix:

Because the previous representation is a matrix, we can use all the available tools of algebraic operations. For example, by summing up the values of every column we get the number of edges targeting every vertex (its in-degree). Additionally, we can compute eigenvalues and eigenvectors, which can give us interesting information through spectral graph theory.
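As a quick sketch with NumPy, consider a hypothetical 4-vertex directed graph. The exact matrix from the figure is not reproduced here, so the entries below are only illustrative (they are merely consistent with the example discussed later):

import numpy as np

# A[i, j] = 1 means there is an edge from vertex i to vertex j.
# Hypothetical matrix for vertices a, b, c, d (illustrative entries).
A = np.array([
    [1, 1, 0, 1],   # a: self-loop, a -> b, a -> d
    [0, 0, 1, 0],   # b -> c
    [0, 0, 0, 1],   # c -> d
    [1, 0, 0, 0],   # d -> a
])

# Summing every column gives the number of edges targeting each vertex.
in_degree = A.sum(axis=0)
print(in_degree)  # [2 1 1 2]

# Eigenvalues/eigenvectors of A: the entry point of spectral graph theory.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)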

3. Memory

But there are other aspects to take into consideration. What about memory utilization? If you think about the adjacency matrix above, for a 4 x 4 matrix (16 cells) we have 10 empty cells (equal to 0). When allocating memory for this matrix, we are wasting 62% of the allocated memory on non-relevant information. For very large sparse matrices this cost is unacceptable, and sparse matrices appear in many scenarios; think about social networks. Facebook has 2.6 billion active users and there is a limit of 5000 friends per profile, so every row of the matrix would have at most 5000 non-zero entries out of 2.6 billion cells. Additionally, some operations become cumbersome. For example, finding the adjacent vertices of a single vertex involves scanning an entire row of the matrix.
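To get a feeling for the numbers, here is a rough sketch using SciPy's sparse matrices. The vertex count and average degree are made-up values for illustration, and the memory estimate is a back-of-the-envelope approximation of the CSR format:

import numpy as np
from scipy import sparse

n = 100_000          # hypothetical number of vertices
avg_degree = 50      # hypothetical average number of edges per vertex

# Dense: one byte per cell. Sparse (CSR): a 4-byte column index and an
# 8-byte value per non-zero, plus the row pointer array.
dense_bytes = n * n * 1
sparse_bytes = n * avg_degree * (4 + 8) + (n + 1) * 8
print(f"dense: ~{dense_bytes / 1e9:.0f} GB, sparse: ~{sparse_bytes / 1e6:.0f} MB")

# SciPy builds such matrices directly, e.g. from (row, col) coordinates:
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 2, 3])
A = sparse.csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))
print(A.nnz)  # 4 stored entries instead of 16 cells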

Fortunately, we have other solutions that are more “memory friendly”, or at least more suitable for certain scenarios. An adjacency list is a list where every item represents a vertex in the graph. For every vertex, we store pointers to its adjacent vertices (neighbours). Following the example above:

Image by Author

The adjacency list limits the information to be stored to the adjacent vertices. Vertex a has access to a list of references to its adjacent vertices. This facilitates the implementation of traversal operations that travel across the graph vertices, and makes it possible to store additional data to enhance data locality.
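In Python, a dictionary mapping every vertex to the list of its neighbours is enough to sketch the idea. Only the entry for a follows the example in the text; the remaining edges are illustrative, since the original figure is not reproduced here:

# Minimal adjacency-list sketch: every vertex maps to its neighbours.
adjacency = {
    "a": ["a", "b", "d"],   # a is adjacent to itself, b and d
    "b": ["c"],             # illustrative
    "c": ["d"],             # illustrative
    "d": ["a"],             # illustrative
}

# Visiting the neighbours of a vertex is a direct list lookup,
# with no need to touch the rest of the graph.
for neighbour in adjacency["a"]:
    print(neighbour)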

And here is where graphs become a really painful data structure to deal with from the point of view of performance. Let’s assume we want to visit all the neighbours of vertex a (Γ(a)). The adjacent vertices of a are itself, b and d. This means that we have to access the memory positions for b and d. What happens if these memory fragments are not already cached? That is a cache miss, which means that we have to fetch that piece of the graph from main memory. For vertices with a large number of neighbours, cache misses will repeat over and over. This directly impacts the performance of a graph traversal. The lack of locality in memory accesses is a performance limitation in most graph algorithms.

Another interesting problem arises when the graph does not fit into main memory. Cache misses are expensive in terms of performance, but accesses to secondary memory are orders of magnitude more expensive.

4. Distributed solutions

If we think of distributed solutions, the lack of locality can be a dramatic performance bottleneck. For a distributed memory solution with the graph split across nodes we may have something like:

Image by Author

For an operation running in nodeA, any access to vertices c or d has to retrieve information from another node. Complexity increases if edge weights can be modified: who is the owner of the edge? And of the vertices? Solutions such as Metis can compute a good partition of the graph across n bins or nodes (see the sketch below). However, adapting algorithms to work with distributed partitions is not easy at all, and computing a good partition is itself an expensive operation.
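As a sketch, the pymetis package (a Python wrapper around the Metis library; I am assuming it is installed, e.g. via pip) can partition a small illustrative graph like this:

import pymetis  # assumed available: Python wrapper around Metis

# Illustrative 6-vertex undirected graph: two triangles joined by one edge.
adjacency = [
    [1, 2],     # vertex 0 is connected to 1 and 2
    [0, 2],
    [0, 1, 3],  # vertex 2 bridges the two halves
    [2, 4, 5],
    [3, 5],
    [3, 4],
]

# part_graph returns the number of cut edges and, for every vertex, the
# index of the partition (bin/node) it was assigned to.
n_cuts, membership = pymetis.part_graph(2, adjacency=adjacency)
print(n_cuts)      # edges crossing the partition boundary (ideally 1 here)
print(membership)  # e.g. [0, 0, 0, 1, 1, 1]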

5. Parallelism

And what about parallelism? This depends on the problem or algorithm to be parallelized. However, there is a clear problem in the utilization of concurrent-access data structures such as queues or stacks in classic traversals such as Breadth First Search (BFS) or Depth First Search (DFS). For example, this is the pseudocode for BFS:

procedure BFS(G, root) is
    let Q be a queue
    label root as discovered
    Q.enqueue(root)
    while Q is not empty do
        v := Q.dequeue()
        if v is the goal then
            return v
        for all edges from v to w in G.adjacentEdges(v) do
            if w is not labeled as discovered then
                label w as discovered
                w.parent := v
                Q.enqueue(w)
If we consider running the pseudocode above in parallel, we have to guarantee that the queue Q and the label structure are accessed and modified atomically. This imposes a tremendous bottleneck for multiple parallel instances and can result in really poor performance.
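To illustrate the point, here is a hedged Python sketch of a lock-based parallel BFS (mine, not a reference implementation). Note how every worker has to take the same lock for the queue and the discovered set, serializing precisely the work we wanted to parallelize (and in CPython the GIL adds yet another layer of serialization):

import threading
from collections import deque

def parallel_bfs(adjacency, root, n_workers=4):
    queue = deque([root])
    discovered = {root}
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                      # contended on every iteration
                if not queue:
                    # Simplification: a worker exits when it sees an empty
                    # queue; the thread holding the last popped vertex
                    # finishes the remaining work alone.
                    return
                v = queue.popleft()
            neighbours = adjacency.get(v, [])
            with lock:                      # and again for every expansion
                for w in neighbours:
                    if w not in discovered:
                        discovered.add(w)
                        queue.append(w)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return discovered

adjacency = {"a": ["b", "d"], "b": ["c"], "c": ["d"], "d": ["a"]}
print(parallel_bfs(adjacency, "a"))  # {'a', 'b', 'c', 'd'}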

Summary

In this post, I have introduced some of the potential issues that graphs bring to the table when designing a graph-based solution. There are many other issues you may find, but I think these are the main ones. If you have found others particularly difficult to overcome, I would love to hear about them.

Thanks for reading!
