PageRank: Link Analysis Explanation and Python Implementation from Scratch

The Algorithm that Started Google

Chonyy
Towards Data Science


Photo by Christian Wiediger on Unsplash

Introduction

We introduced the HITS algorithm and pointed out its major shortcoming in the previous post. In this article, a more advanced method called the PageRank algorithm will be revealed. We will briefly explain the PageRank algorithm and walk through the whole Python implementation.

The best part of PageRank is that it’s query-independent. We don’t need a root set to start the algorithm.

The biggest differences between PageRank and HITS

  • HITS calculates the weights based on the hubness and authority values
  • PageRank calculates the ranks based on the proportional rank passed around between sites

According to Google,

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

Algorithm

Feel free to check out the well-commented source code. It could really help to understand the whole algorithm.

The algorithm steps are listed below

  • Initialize the PageRank of every node with a value of 1
  • For each iteration, update the PageRank of every node in the graph
  • The new PageRank is the sum of the proportional rank of all of its parents
  • Apply random walk to the new PageRank
  • PageRank value will converge after enough iterations

PageRank Equation

Image by Chonyy
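The equation image isn’t reproduced here, but in LaTeX, a formulation consistent with the steps above (reading the damping factor d as the random-jump probability, so d = 0.15 in the sketches below) is:

$$PR(u) = \frac{d}{N} + (1 - d) \sum_{v \in B_u} \frac{PR(v)}{|\mathrm{out}(v)|}$$

where $B_u$ is the set of in-neighbors (parents) of node $u$, $|\mathrm{out}(v)|$ is the number of out-links of $v$, $N$ is the total number of nodes, and $d$ is the damping factor.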

Python Implementation

Initialize PageRank value

We initialize the PageRank value in the node constructor.
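The original embedded snippet isn’t shown here, but a minimal sketch of such a constructor could look like the following (the attribute and method names are my assumptions, not necessarily the repo’s exact ones):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []   # out-neighbors: nodes this node links to
        self.parents = []    # in-neighbors: nodes that link to this node
        self.pagerank = 1.0  # every node starts with a PageRank of 1

    def link_child(self, child):
        if child not in self.children:
            self.children.append(child)

    def link_parent(self, parent):
        if parent not in self.parents:
            self.parents.append(parent)
```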

PageRank one iteration

This is the PageRank main function. Just like the algorithm explained above, we simply update PageRank for every node in each iteration. The key to this algorithm is how we update the PageRank.
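A sketch of that main loop, reusing the Node class above (the function name is an assumption):

```python
def pagerank_one_iter(nodes, d):
    # One full iteration: every node recomputes its PageRank from the
    # current ranks of its parents (update_pagerank is defined below).
    # Updates happen in place, so nodes later in the loop already see
    # their parents' new ranks; this sweep still converges.
    for node in nodes:
        node.update_pagerank(d, len(nodes))
```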

Calculate new PageRank

  • Specify the in-neighbors of the node, which are all of its parents
  • Sum up the proportional rank from all of its in-neighbors
  • Calculate the random-walk contribution, controlled by the damping factor d
  • Update the PageRank with the sum of the proportional rank and the random-walk term (see the sketch below)
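The embedded snippet isn’t shown here; a minimal sketch of the update, written as a method on the Node class above, might look like this. It treats damping_factor = 0.15 as the random-jump probability d, matching the equation earlier:

```python
def update_pagerank(self, d, n):
    # Sum the proportional rank passed in by each parent: a parent's
    # rank is split evenly among its out-links.
    pagerank_sum = sum(parent.pagerank / len(parent.children)
                       for parent in self.parents)
    # Random-walk term: with probability d the surfer jumps to a
    # uniformly random node, contributing d / n to every node.
    random_jumping = d / n
    self.pagerank = random_jumping + (1 - d) * pagerank_sum
```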

Result analysis

Let’s test our implementation on the dataset in the repo. We set damping_factor = 0.15 in all the results.
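For reference, a minimal driver could look like the sketch below. The comma-separated edge-list format (one directed edge per line, e.g. “1,2”) is an assumption based on the graphs shown, as is the file name:

```python
def build_graph(path):
    # Assumed format: one directed edge per line, e.g. "1,2" meaning
    # node1 links to node2.
    nodes = {}
    with open(path) as f:
        for line in f:
            src, dst = line.strip().split(',')
            for name in (src, dst):
                if name not in nodes:
                    nodes[name] = Node(name)
            nodes[src].link_child(nodes[dst])
            nodes[dst].link_parent(nodes[src])
    return list(nodes.values())

nodes = build_graph('graph_1.txt')
for _ in range(100):                  # iterate until (roughly) converged
    pagerank_one_iter(nodes, 0.15)
for node in sorted(nodes, key=lambda node: node.name):
    print(node.name, round(node.pagerank, 4))
```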

graph_1.txt

Image by Chonyy.

Result

The result follows the order of the node values 1, 2, 3, 4, 5, 6.

Let’s observe the result on this graph. The nodes form a one-directional chain, and we know that the PageRank algorithm sums up the proportional rank from the in-neighbors. This means that node2 will accumulate rank from node1, node3 will accumulate rank from node2, and so on and so forth.

In other words, node6 will accumulate the rank from node1 through node5. That’s why node6 has the highest rank.

graph_2.txt

Image by Chonyy.

Result

The nodes form a cycle, so the rank keeps passing around endlessly and finally converges to an equal value for every node.

graph_3.txt

Image by Chonyy.

Result

If we look at this graph from a physics perspective and assume that each link provides the same force, we can intuitively see that node2 and node3 at the center will be charged with more force than node1 and node4 at the sides.

graph_4.txt

Image by Chonyy.

Result

node1 and node5 both have four in-neighbors, so why does node1 have the highest PageRank? This is because two of node5’s in-neighbors have a really low rank, so they cannot provide enough proportional rank to node5.

node6 and node7 have a low PageRank because they are at the edge of the graph and each has only one in-neighbor. There’s just not enough rank flowing to them.

IBM.txt

Image by Chonyy.

Result

The result follows the node value order 2076, 2564, 4785, 5016, 5793, 6338, 6395, 9484, 9994.

Node9484 has the highest PageRank because it obtains a lot of proportional rank from its in-neighbors and has no out-neighbors to pass its rank to. From this observation, we could guess that nodes with many in-neighbors and no out-neighbors tend to have a higher PageRank.

Discussion

Let’s run an interesting experiment. Assume that we want to increase the PageRank of node1 in each graph. How can we do it?

graph_1.txt

Since the PageRank is calculated from the sum of the proportional rank of a node’s parents, we will be focusing on how rank flows around the graph. In order to increase node1’s PageRank, the intuitive approach is to add parents that pass rank into it.

Compared to the original graph, we add an extra edge (node6, node1) to form a cycle. This way, the PageRank of each node becomes equal, which is larger than node1’s original PageRank value.

graph_2.txt

Similarly, we would like to increase the number of node1’s parents. The more parents there are, the more rank is passed to node1.

Therefore, we add an extra edge (node4, node1). In the original graph, node1 could only get its rank from node5. But after adding this extra edge, node1 gets the rank provided by both node4 and node5.

graph_3.txt

We add a new edge (node4, node1). Just like what we explained for graph_2, node1 can get more rank from node4 this way. Please note that this rule may not always hold; it’s just an intuitive approach I figured out from my observations.

Computation Performance

Convergence

Now we know that after enough iterations, PageRank will always converge to a specific value. Why don’t we plot it out to check how fast it’s converging?

Testing the convergence on graph_4.txt

Image by Chonyy.

From the graph, we can see that the curve is a little bumpy at the beginning. The rank is passed around between the nodes and finally reaches a balance. The PageRank value of each node starts to converge at around iteration 5.
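A convergence plot like the one above could be produced with a sketch along these lines, reusing the helpers from the earlier sketches (the iteration count is arbitrary):

```python
import matplotlib.pyplot as plt

nodes = build_graph('graph_4.txt')
history = {node.name: [] for node in nodes}

for _ in range(30):
    pagerank_one_iter(nodes, 0.15)
    for node in nodes:
        history[node.name].append(node.pagerank)  # record rank per iteration

for name, values in history.items():
    plt.plot(range(1, len(values) + 1), values, label='node' + name)
plt.xlabel('iteration')
plt.ylabel('PageRank value')
plt.legend()
plt.show()
```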

Please note that it may not always take so few iterations to complete the calculation. For example, if we test this algorithm on graph_6 in the repo, which has 1228 nodes and 5220 edges, even 500 iterations are not enough for the PageRank to converge. And the computation takes very long due to the large number of edges.

Number of Edges

Image by Chonyy.

We run 100 iterations with different numbers of total edges in order to spot the relation between total edges and computation time. As you can see, the influence of the number of edges on the computation time is almost linear, which is pretty good, I’d say.

Please note that the reason it’s not completely linear is that the way the edges are linked to each other also affects the computation time a little.
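A rough version of this timing experiment, under the same assumptions as the sketches above (including the file names), could look like:

```python
import time

for path in ['graph_1.txt', 'graph_2.txt', 'graph_3.txt', 'graph_4.txt']:
    nodes = build_graph(path)
    num_edges = sum(len(node.children) for node in nodes)
    start = time.time()
    for _ in range(100):              # 100 iterations, as in the experiment
        pagerank_one_iter(nodes, 0.15)
    print(path, num_edges, 'edges,', round(time.time() - start, 4), 'seconds')
```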

Conclusion

It’s not surprising that PageRank is not the only algorithm implemented in the Google search engine. The problems in real-world scenarios are far more complicated than a single algorithm can handle. For example, Google could apply extra weight to each node to better reflect a site’s importance. So there are other algorithms combined with PageRank to calculate the importance of each site.

According to Google,

Google assesses the importance of every web page using a variety of techniques, including its patented PageRank™ algorithm.

Source Code
