Large-scale Graph Mining with Spark: Part 2

Win Suen
Towards Data Science
Oct 9, 2018


A tutorial in 2 parts:

Part 1: Graphs for unsupervised learning.

Part 2 (you are here!): How you can wield magical graph powers. We’ll talk about label propagation, Spark GraphFrames, and results. Repo with sample graph and notebook here: https://github.com/wsuen/pygotham2018_graphmining

In Part 1, we saw how to solve unsupervised machine learning problems with graphs, because communities are clusters: we can leverage the edges between nodes as an indicator of similarity or relation, the same way distances in a feature space are used for other types of clustering.

Here, we dive into the hows of community detection. We build and mine a large web graph, learning how to implement a community detection method called Label Propagation Algorithm (LPA) in Spark.

Detect Communities with Label Propagation

Though there are many community detection techniques, I focus only on one: label propagation. For an overview of other methods, I recommend Santo Fortunato’s “Community Detection in Graphs”.

Graph with communities. From Girvan, Michelle, and Mark EJ Newman. “Community structure in social and biological networks.” Proceedings of the national academy of sciences 99.12 (2002): 7821–7826.

Enter the Label Propagation Algorithm (LPA) proposed by Raghavan, Albert, and Kumara (2007). LPA is an iterative community detection solution whereby information “flows” through the graph based on underlying edge structure. Here’s how LPA works:

Raghavan, Usha Nandini, Réka Albert, and Soundar Kumara. “Near linear time algorithm to detect community structures in large-scale networks.” Physical review E 76.3 (2007): 036106.
  1. At the beginning, each node starts in its own community.
  2. For each iteration, go through the nodes in random order. Update each node’s community label to the label held by the majority of its neighbors, breaking any ties randomly.
  3. If every node now holds the majority label of its neighbors, the algorithm has reached its stopping criterion. If not, repeat step 2.
Or just, you know, stay home?

Label propagation makes sense intuitively. Let’s say one day at work, someone gets a cold and “propagates” the illness until everyone in your workplace is as sick as their neighbors. Meanwhile, employees of FoobarCo across the street catch and spread the flu. There’s not much contact between you and FoobarCo, so “propagation” stops when members of each community have caught the ailment du jour. Convergence achieved! Too bad about the sniffling and headaches though.
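In code, the steps above boil down to something like this minimal pure-Python sketch. It is illustrative only: the graph is a plain adjacency dict, and this is not the GraphFrames implementation we use later.

import random
from collections import Counter

def label_propagation(graph, max_iterations=5):
    # graph: adjacency dict, e.g. {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
    labels = {node: node for node in graph}          # 1. each node starts in its own community
    for _ in range(max_iterations):
        changed = False
        nodes = list(graph)
        random.shuffle(nodes)                        # 2. visit nodes in random order
        for node in nodes:
            neighbor_labels = Counter(labels[n] for n in graph[node])
            if not neighbor_labels:
                continue                             # isolated node keeps its label
            top = max(neighbor_labels.values())
            modal = [lab for lab, count in neighbor_labels.items() if count == top]
            if labels[node] not in modal:            # adopt a majority label, breaking ties randomly
                labels[node] = random.choice(modal)
                changed = True
        if not changed:                              # 3. every node holds a majority label: stop
            break
    return labels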

Why use LPA?

  • Labelled data is nice, but not required. This makes LPA suitable for our unsupervised machine learning use case.
  • Parameter tuning is straightforward. LPA runs with a max_iterations parameter, and you can get some good results using the default value of 5. Raghavan and her coauthors tested LPA against several labelled networks. They discovered that at least 95% of nodes are correctly classified in 5 iterations.
  • No a priori number of clusters, cluster sizes, or other metrics required. This is crucial if you don’t want to assume your graph has a certain structure or hierarchy. I had no a priori assumptions about the network structure of my web graph, the number of communities I had data for, or the expected sizes of these communities.
  • Near linear runtime. Each iteration of LPA is O(m), linear in number of edges. The whole sequence of steps runs in near linear time, compared to O(n log n) or O(m+n) for some previous community detection solutions.
  • Interpretability. When someone asks, you can explain why a node was grouped into a certain community.
Language communities in Belgium mobile network (red = French, green = Dutch). Image from Blondel, Vincent D., et al. “Fast unfolding of communities in large networks.” Journal of statistical mechanics: theory and experiment 2008.10 (2008): P10008.

Tool Selection

First, a quick and non-exhaustive breakdown of the tools landscape. I divided the tools based on graph size, whether the library plays nicely with Python, and how easily I could generate simple visualizations.

Some common graph-mining tools.

A non-exhaustive menu of tools:

  • For data that fit onto a single machine, the networkx Python package is a good choice for easy-to-use graph exploration. It implements the most common algorithms (including label propagation, PageRank, maximum clique detection, and much more!). Simple visualizations are also possible. See the short sketch after this list.
  • Gephi is an open graph analysis tool. Gephi isn’t a Python package, but a standalone tool with a robust UI and impressive graph visualization capabilities. If you are working with smaller graphs, need strong visualizations, and prefer a UI to working in Python, give Gephi a try.
  • Spark has 2 graph libraries, GraphX and GraphFrames. Spark is a great solution when you have graph data too large to fit onto a single machine (limited to the amount of resources allocated to your Spark application), want to take advantage of parallel processing, and want some of Spark’s built-in fault tolerance features. Pyspark, Spark’s Python API, integrates nicely with other libraries like scikit-learn, matplotlib, or networkx.
  • Apache Giraph is the open-source implementation of Pregel, a graph processing architecture created by Google. Giraph had a higher barrier to entry compared to the previous solutions. While Giraph is very powerful for large scale graph analysis deployments, I chose something more lightweight that had both Scala and Python APIs.
  • Neo4j is a graph database system. It does have a Python client, though you have to install Neo4j separately. Since my analysis was only a POC, I wanted to avoid maintaining and deploying an entirely separate tool that did not integrate with my existing code.
  • Finally, you could in theory implement your own solution. For initial data science exploration, I would discourage this. Many bespoke graph mining algorithms are built for very specific use cases (for instance, being extremely efficient at graph clustering, and nothing else). If you do need to work with very, very large datasets, first consider sampling your graph, filtering subgraphs of interest, or inferring relationships from examples: basically anything that gets more mileage out of one of the existing tools.
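For the single-machine networkx route mentioned above, community detection really is just a few lines. Here is a rough sketch using networkx’s built-in (semi-synchronous) label propagation variant on its bundled karate club graph; swap in your own undirected graph.

import networkx as nx
from networkx.algorithms.community import label_propagation_communities

G = nx.karate_club_graph()                          # small built-in example graph
communities = list(label_propagation_communities(G))
print(len(communities), [len(c) for c in communities])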

Given the data size I was working with, I chose Spark GraphFrames.

Remember: the best graph library for your project depends on languages, graph size, how you store your graph data, and personal preference!

Building a Common Crawl Web Graph

Great! I’m fully convinced how awesome graphs are, and they’re the coolest things ever! How do I get started using community detection on real data?

-you

Steps

1. Obtain data: The Common Crawl dataset is an open web crawl corpus well-suited for web graph research. The crawl results are stored in WARC (Web Archive) format. In addition to page contents, the dataset contains crawl date, headers used, and other metadata.

I sampled 100 files from the September 2017 crawl. The file warc.paths.gz contains pathnames; using these pathnames, download the respective files from s3.
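If you want to script the download, something along these lines works. The base URL is an assumption (Common Crawl exposes its bucket over HTTP), so adjust it to however you access the data.

import gzip
import urllib.request

BASE = "https://data.commoncrawl.org/"              # assumed public HTTP endpoint for the bucket

with gzip.open("warc.paths.gz", "rt") as f:
    paths = [line.strip() for line in f][:100]      # sample the first 100 files

for path in paths:
    urllib.request.urlretrieve(BASE + path, path.split("/")[-1])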


2. Parse and clean data: As a first pass, we want the html content of each page. For each page, we collect the URL and the URLs of any links to create our graph.

To extract edges from raw WARC files, I wrote some data cleaning code that probably should never see the light of day. At least it got the job done so I can focus on more fun things! My parsing code was written in Scala, but my demo is in pyspark. I used WarcReaderFactory and Jericho parser. For python, a library like warc looks like it could cover your data munging needs.
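As a rough Python sketch of that step, here is one way to pull page URLs and their hrefs out of a WARC file. Note that this uses warcio rather than the warc package mentioned above, and a crude regex in place of a real HTML parser, so treat it as illustrative rather than production parsing code.

import re
from warcio.archiveiterator import ArchiveIterator

href_re = re.compile(rb'href="(https?://[^"]+)"')

def iter_links(warc_path):
    # yield (page_url, href) pairs from a gzipped WARC file
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue                             # skip request and metadata records
            page_url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            for href in href_re.findall(html):
                yield page_url, href.decode("utf-8", errors="ignore")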

After I had all the href links out of the html content,

  • I drew edges between domains rather than full URLs. Instead of medium.com/foo/bar and medium.com/foobar, I only created one node, medium.com, that captured link relationships to and from other domains.
  • I filtered out loops. Loops are edges that connect a node to itself, which aren’t useful for my purposes. If medium.com/foobar linked to a page on the same domain, say medium.com/placeholderpage, no edge was drawn.
  • I removed many of the most popular resource links, including popular CDNs, trackers, and assets. For my preliminary exploration, I only wanted to focus on webpages a human would likely visit.
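The domain-level cleanup above is easy to express once you have (page URL, href) pairs; here is a minimal sketch (the sample links are made up for illustration).

from urllib.parse import urlparse

def to_domain_edge(page_url, href_url):
    src, dst = urlparse(page_url).netloc, urlparse(href_url).netloc
    if not src or not dst or src == dst:             # drop malformed links and self-loops
        return None
    return (src, dst)

raw_links = [("https://medium.com/foo/bar", "https://medium.com/foobar"),
             ("https://medium.com/foo/bar", "https://bing.com/search")]
edges = [e for e in (to_domain_edge(s, d) for s, d in raw_links) if e]
# edges == [('medium.com', 'bing.com')]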

3. Initialize a Spark Context: For those following along at home, see demo at https://github.com/wsuen/pygotham2018_graphmining. This demo only runs on your local machine. You won’t get all the computational resources of a distributed cluster, but you will get an idea of how to get started with Spark GraphFrames.

I’ll be working with Spark 2.3. Import pyspark and other needed libraries, including graphframes. Then create a SparkContext, which will allow you to run a pyspark application.

# add GraphFrames package to spark-submit
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 pyspark-shell'

import pyspark
from pyspark.sql import SparkSession

# create SparkContext and SparkSession
sc = pyspark.SparkContext("local[*]")
spark = SparkSession.builder.appName('notebook').getOrCreate()

# import GraphFrames
from graphframes import *

4. Create a GraphFrame: Once you have your cleaned data, you can load your vertices and edges into Spark DataFrames.

  • vertices contains an id for each node, and the name of the node, which indicates the domain.
  • edges contains my directed edges, from source domain src to the domain the source links to, dst.
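One way to build those two DataFrames, assuming the cleaned data sits in CSV files (the paths here are placeholders):

vertices = (spark.read.csv("cleaned/vertices.csv", header=True)
            .select("id", "name"))
edges = (spark.read.csv("cleaned/edges.csv", header=True)
         .select("src", "dst"))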
# show 5 nodes
vertices.show(5)
+--------+----------------+
| id| name|
+--------+----------------+
|000db143| msn.com|
|51a48ea2|tradedoubler.com|
|31312317| microsoft.com|
|a45016f2| outlook.com|
|2f5bf4c8| bing.com|
+--------+----------------+
only showing top 5 rows
# show 5 edges
edges.show(5)
+--------+--------+
| src| dst|
+--------+--------+
|000db143|51a48ea2|
|000db143|31312317|
|000db143|a45016f2|
|000db143|31312317|
|000db143|51a48ea2|
+--------+--------+
only showing top 5 rows

You can then create a GraphFrame object with your vertices and edges. Voila! You have a graph!

# create GraphFrame
graph = GraphFrame(vertices, edges)

5. Running LPA: One line of code allows you to run LPA. Here, I run LPA with 5 iterations (maxIter).

# run LPA with 5 iterations
communities = graph.labelPropagation(maxIter=5)
communities.persist().show(10)

+--------+--------------------+-------------+
| id| name| label|
+--------+--------------------+-------------+
|407ae1cc| coop.no| 781684047881|
|1b0357be| buenacuerdo.com.ar|1245540515843|
|acc8136a| toptenreviews.com|1537598291986|
|abdd63cd| liberoquotidiano.it| 317827579915|
|db5c0434| meetme.com| 712964571162|
|0f8dff85| ameblo.jp| 171798691842|
|b6b04a58| tlnk.io|1632087572480|
|5bcfd421| wowhead.com| 429496729618|
|b4d4008d|investingcontrari...| 919123001350|
|ce7a3185| pokemoncentral.it|1511828488194|
+--------+--------------------+-------------+
only showing top 10 rows

Running graph.labelPropagation() returns a DataFrame with nodes and a label denoting which community that node belongs to. You can use label to understand the distribution of community sizes and zoom in on areas of interest. For example, to discover every other website in the same community as pokemoncentral.it (and honestly, who wouldn’t?), filter for all other nodes where label = 1511828488194.
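Concretely, both of those boil down to one-liners (a sketch using the column names above).

# distribution of community sizes
communities.groupBy("label").count().orderBy("count", ascending=False).show(10)

# every site in the same community as pokemoncentral.it
communities.filter(communities.label == 1511828488194).show()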

Results

What happened when I ran LPA on my sample Common Crawl web graph?

  • I began with over 15 million websites in my raw data. That’s a lot of nodes, many of which contained redundant information. The cleaning process I described condensed the graph into fewer, more meaningful nodes and edges.
  • LPA discovered over 4,700 communities. However, over half of these communities contained only one or two nodes.
  • On the other end of the size spectrum, the largest community was over 3,500 different websites! To give an idea of scope, this was around 5% of the nodes in my final graph post-filtering.

The extremes in community size illustrate one drawback of LPA. Too much convergence, and there could be clusters that are too large (caused by certain labels dominating densely connected networks). Too little convergence and you may get more, smaller communities that are less useful. I found that the most interesting clusters ended up being somewhere between the two extremes.

Convergence and the small-world network effect

In my dataset, LPA converged in around 5 iterations. You can see the number of communities level off at about 4,700. Raghavan and her colleagues showed this property with their labelled graphs as well.

One possible mechanism that explains this is the small-world network effect: the tendency for graphs to cluster, but also to have short path lengths compared to the number of nodes. In other words, although graphs have clusters, you would also expect to be able to travel from, say, one friend to another in your network within 5–6 jumps. Many real-world graphs, including the Internet and social networks, share this property. You may also know this as the six degrees of separation phenomenon.

Number of communities levels off after 5 iterations.
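If you want to reproduce that curve on your own graph, one blunt way is to re-run LPA with increasing iteration caps and count distinct labels each time. This re-runs the whole algorithm per setting, so it is slow on large graphs; treat it as a sketch.

# count communities after k iterations of LPA
for k in range(1, 8):
    n_communities = graph.labelPropagation(maxIter=k).select("label").distinct().count()
    print(k, n_communities)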

Sample clusters

At a coarse level, let’s look at some sample clusters. As with traditional unsupervised clustering, the communities can be a mix of different sites, although there are topics of interest that we would not have discovered without LPA! From left to right:

  • E-learning sites: Sites related to or linking to e-learning pages. Time for me to find some new data science MOOCs!
  • Bedbug sites: Sites related to real estate and bed bugs. All these sites used the same template/images, just with slightly different domain names. Don’t think I ever got to the bottom of this.
  • Star Wars community: Sites that talk about Star Wars movies, events, and memorabilia frequently link to each other.

It’s worth emphasizing that we obtained these clusters without text processing and feature selection, manual labelling, domain name features, or even knowing how many communities to find. We found communities of interest by leveraging the underlying network structure of the web graph!

Where to go from here

I’ve barely scratched the surface of web graph communities. There are so many directions future research could go. For example:

  • Layer in and propagate metadata: If we add information such as edge weights, link types, or external labels to the data, how well can we propagate this information through the graph?
  • Remove/add nodes and measure impact on communities: I’m curious how adding or removing nodes with high centrality changes the effectiveness of LPA and the quality of the resulting communities.
  • Observe web graph evolution over time: There is a new Common Crawl dataset each month! It would be interesting to see what clusters emerge over time. Conversely, what communities remain unchanged? As we know, the Internet isn’t static.

The Github repo for the demo contains a small sample web graph of 10k nodes. There are also instructions for getting set up with Docker and running the pyspark notebooks. I hope this will be helpful in jump-starting experimentation with web graph data, and in learning Spark GraphFrames for your own data science problems.

Happy exploring!

We’re pioneers! Graph pioneers. :)

Acknowledgements

Thanks to Dr. Yana Volkovich for deepening my learning of graph theory and being a great mentor. Thanks also to my other colleagues who gave feedback on my talk.

References

Adamic, Lada A., and Natalie Glance. “The political blogosphere and the 2004 US election: divided they blog.” Proceedings of the 3rd international workshop on Link discovery. ACM, 2005.

Common Crawl dataset (September 2017).

Farine, Damien R., et al. “Both nearest neighbours and long-term affiliates predict individual locations during collective movement in wild baboons.” Scientific reports 6 (2016): 27704.

Fortunato, Santo. “Community detection in graphs.” Physics reports 486.3–5 (2010): 75–174.

Girvan, Michelle, and Mark EJ Newman. “Community structure in social and biological networks.” Proceedings of the national academy of sciences 99.12 (2002): 7821–7826.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014.

Raghavan, Usha Nandini, Réka Albert, and Soundar Kumara. “Near linear time algorithm to detect community structures in large-scale networks.” Physical review E 76.3 (2007): 036106.

Zachary karate club network dataset — KONECT, April 2017.
