
The Algorithm That Made Google Google

How PageRank transformed the way we search the internet, and why it still plays an important role in LLMs with Graph RAG.

Image generated by DALL-E

In the late 1990s, two Stanford University graduate students, Larry Page and Sergey Brin, were working on their PhD research when they came across an interesting idea.

Larry was particularly fascinated by the way web pages linked to one another. He saw the internet as a vast network of citations, similar to how academic papers reference each other. This sparked an idea: what if a web page’s importance could be measured by how many other pages linked to it? But not just that – what if the importance of those linking pages also mattered?

Intrigued by that idea, Larry started building an algorithm that would later be named "PageRank" (a clever play on his last name). The algorithm treated each link to a webpage as a vote of confidence, but with a twist – votes from more important pages carried more weight.

Sergey Brin, interested in Larry’s concept, joined him. Together, they worked from their dorm rooms, building a search engine that would use this ranking system. At first, they called their project "BackRub" due to its analysis of backward links.

As their project grew, they realized they had something special on their hands. Their search engine was producing significantly better results than existing search engines of the time. In 1998, they decided to take a leave of absence from Stanford to focus on their project full-time.

They renamed their search engine "Google" – again a play, this time on the mathematical term "googol" (a number represented by 1 followed by 100 zeros) – reflecting their ambition to organize the vast amount of information on the web. With some initial funding from investors, including a $100,000 check written in their garage office, Google was officially born. The rest is history.

Breaking down PageRank

Now, let’s imagine we have 4 high school students: Amy, Bob, Charlie, and Diana. They are trying to figure out how "cool" each of them is, and everyone’s "coolness" depends on how much support they get from the others. In particular, they share a pool of 100 "cool" points, split evenly to start, so 25 points each. Then, each student has to give all of their points to one or more of the other students.

Let’s say:

  • Amy vouches for Bob
  • Bob vouches for Amy
  • Charlie vouches for Amy and Diana
  • Diana vouches for everyone

After one round:

  • Amy gets Bob’s points (25), half of Charlie’s points (12.5), and a third of Diana’s points (~8.3), totaling ~45.8 points
  • Bob gets Amy’s points (25) and a third of Diana’s points (~8.3), totaling ~33.3 points
  • Charlie gets a third of Diana’s points (~8.3), totaling ~8.3 points
  • Diana gets half of Charlie’s points (12.5), totaling 12.5 points

Then, the points get redistributed again and again until they stabilize. With this mechanism, the more popular you are, the more valuable your vouch is.

This is basically how PageRank works, just with websites instead of students and links instead of vouches.

The Math Behind PageRank

At its heart, PageRank is based on a simple, but very powerful idea: the importance of a webpage can be determined by the quantity and quality of links pointing to it. This concept can be broken down into several key principles:

  1. Every link to a page is a vote of confidence for that page.
  2. The importance of the page casting the vote matters.
  3. The more links a page casts, the less weight each link carries.

First, we need to understand the concept of the "random surfer." Imagine you’re mindlessly clicking links on the internet (we’ve all been there, right?). You start on a random page, click a link, then another, and another. Occasionally, you get bored and just type in a new web address. This aimless journey is what we call the "random surfer model."

The PageRank of a webpage is basically the probability that this random surfer will end up on that page after clicking around for a really, really long time (theoretically, an infinite amount of time, but let’s not get too philosophical here).

Now, here’s where it gets interesting. The PageRank formula looks like this:

PageRank formula – Image by Author
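
Written out in plain text, the formula reads:

PR(A) = (1 - d)/N + d * ( PR(T_1)/C(T_1) + PR(T_2)/C(T_2) + ... + PR(T_n)/C(T_n) )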

Where:

  • PR(A) is the PageRank of page A
  • d is the damping factor (typically 0.85), which represents the probability that the random surfer will continue clicking links rather than jumping to a random page.
  • N is the total number of pages in the network
  • PR(T_i) is the PageRank of page T_i, one of the pages that link to page A
  • C(T_i) is the number of outbound links from page T_i
  • n is the number of pages that link to page A

Now, let’s look at the two parts of this formula:

Left component in PageRank Formula – Image by Author

This is the probability that our random surfer gets bored and jumps directly to page A. It’s like rolling a die with N sides (where N is the number of pages on the internet) and landing on page A.

Right component in PageRank Formula – Image by Author

This is the meaty part. It represents the probability of arriving at page A by following links. Each term in this sum is the PageRank of a page linking to A, divided by the number of links on that page.

PageRank is also recursive, which means that page A depends on the PageRank of the pages linking to it, which in turn depend on the pages linking to them, and so on. It’s like a never-ending chain of popularity contests!

So how do we actually calculate this? We use an iterative process:

  1. Start by giving every page an equal PageRank (let’s say 1/N).
  2. Use our formula to calculate new PageRank values for each page.
  3. Repeat step 2 until the values stop changing much between iterations.

Now, you might be thinking, "Wait a minute, what about pages with no outgoing links? Or pages that only link to each other?" Great questions! These are what we call "dangling nodes" and "rank sinks," and they can cause problems for our algorithm.

For dangling nodes (pages with no outgoing links), we usually pretend they link to all other pages equally. For rank sinks (groups of pages that only link to each other), the random jump factor (1-d) helps prevent them from hoarding all the PageRank.

Ok that seems easy when dealing with 4 pages, but how can we scale it to the whole internet? Here, we use a few tricks to make it manageable:

  1. Sparse Matrix Representation: Most pages only link to a few others, so we can save a lot of memory by only storing the non-zero elements of our link matrix (see the short sketch right after this list).
  2. Distributed Computing: We can split up the calculation across many computers, each working on a different part of the web.
  3. Approximation Methods: For really huge networks, we might use algorithms that give us a good estimate of PageRank without calculating it exactly for every single page.
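
To make the first trick concrete, here is a minimal sketch (assuming numpy and scipy are available) of one PageRank update done with a sparse link matrix, where only the non-zero entries are stored. The toy link structure here is purely illustrative:

import numpy as np
from scipy.sparse import csr_matrix

# Toy link matrix for 4 pages: entry [i, j] = 1 / (out-links of j) if page j links to page i.
# Page 0 links to 1 and 2, page 1 links to 2 and 3, page 2 links to 0, page 3 links to 2.
rows = [1, 2, 2, 3, 0, 2]               # target pages
cols = [0, 0, 1, 1, 2, 3]               # source pages
vals = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0]   # 1 / out-degree of the source page
M = csr_matrix((vals, (rows, cols)), shape=(4, 4))  # only the 6 non-zeros are stored

d, N = 0.85, 4
ranks = np.full(N, 1 / N)
ranks = d * M.dot(ranks) + (1 - d) / N  # one PageRank update using the sparse matrix
print(ranks)

At web scale the idea is the same, just with billions of rows and columns: the matrix stays mostly empty, so storing only the non-zero entries is what makes the computation feasible.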

And remember, the web is constantly changing. New pages are created, old ones disappear, links are added and removed. So Google (and other search engines) need to keep updating their PageRank calculations all the time.

Practical example

Let’s start with a small web of just four pages to illustrate how PageRank actually works in practice. Imagine we have pages A, B, C, and D with the following link structure:

  • Page A links to B and C
  • Page B links to C and D
  • Page C links to A
  • Page D links to C

4 pages example visualized in a Knowledge Graph – Image by Author

Now, let’s calculate the PageRank for these pages step by step. We’ll use a damping factor (d) of 0.85, which is the typical value used in practice.

Step 1: Initialize PageRank

We start by giving each page an equal PageRank. Since we have 4 pages, each page starts with a PageRank of 1/4 = 0.25.

Equal probabilities among the 4 pages – Image by Author

Step 2: Calculate new PageRank values

Now we’ll use our PageRank formula for each page. Remember, the formula is:

Calculate new page rank values using PageRank formula – Image by Author

Where N is the total number of pages (4 in our case), Y are the pages linking to X, and C(Y) is the number of outbound links from Y.

First iteration of calculation for 4 pages – Image by Author
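
Working through the formula with d = 0.85 and every page starting at 0.25, the first iteration works out to:

PR(A) = 0.15/4 + 0.85 * (0.25/1) = 0.25
PR(B) = 0.15/4 + 0.85 * (0.25/2) ≈ 0.144
PR(C) = 0.15/4 + 0.85 * (0.25/2 + 0.25/2 + 0.25/1) ≈ 0.463
PR(D) = 0.15/4 + 0.85 * (0.25/2) ≈ 0.144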

Step 3: Iterate

Now we use these new PageRank values and repeat the calculation. Let’s do one more iteration:

Second iteration of calculation for 4 pages – Image by Author

We would continue this process until the values converge (change very little between iterations). Indeed, the PageRank algorithm is guaranteed to converge because of the properties of the Google matrix (the matrix representing the web’s link structure plus the damping factor). This matrix is stochastic (each column sums to 1) and irreducible (there’s a path between any two pages), which ensures convergence to a unique solution.
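
To check this convergence claim numerically, here is a small numpy sketch (an illustrative toy, not production search-engine code) that builds the Google matrix for this four-page example and iterates until the values settle:

import numpy as np

# Column-stochastic link matrix: entry [i, j] = 1 / (out-links of j) if page j links to page i.
# Pages in order A, B, C, D with A->B,C  B->C,D  C->A  D->C
M = np.array([
    [0.0, 0.0, 1.0, 0.0],   # A is linked to by C
    [0.5, 0.0, 0.0, 0.0],   # B is linked to by A
    [0.5, 0.5, 0.0, 1.0],   # C is linked to by A, B, D
    [0.0, 0.5, 0.0, 0.0],   # D is linked to by B
])

d, N = 0.85, 4
G = d * M + (1 - d) / N          # the Google matrix: follow links with probability d, jump anywhere with 1-d
ranks = np.full(N, 1 / N)        # start every page at 1/N = 0.25

for _ in range(100):
    new_ranks = G @ ranks
    if np.abs(new_ranks - ranks).sum() < 1e-8:  # stop once the values barely change
        break
    ranks = new_ranks

print(dict(zip("ABCD", ranks.round(3))))

If you run it, C ends up with the highest score (it receives links from the three other pages), closely followed by A (which receives all of C’s rank).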

From Math to Code

Now, it’s time to switch from theory to practice with a Python implementation of PageRank on a Knowledge Graph. Here, we will:

  1. Connect to a Neo4j graph database.
  2. Create a small sample graph structure to understand the basics.
  3. Crawl a subset of Wikipedia pages to build a larger, more realistic network of links.
  4. Load this network into Neo4j, creating nodes for pages and relationships for links.
  5. Implement PageRank in Python from scratch and run it on the graph data stored in Neo4j.

You can find all the code we will cover in this section in this Jupyter notebook:

rag-knowledge-graph/pagerank/demo.ipynb at main · cristianleoo/rag-knowledge-graph

Setting up the Neo4j Connection

First, we need to set up a connection to our Neo4j database. Neo4j is a graph database that stores data in nodes (think webpages) and relationships (think hyperlinks). Connecting to it allows us to store our crawled data and run queries to build the link structure we need for PageRank. If you read my previous Graph RAG article, you’ll already know how to set it up, so you can jump straight to the code. If not, let’s first go over the Neo4j installation process. We will be using Neo4j simply because it’s easy to install and offers a free option for local hosting (I don’t have any sort of affiliation with it).

To install Neo4j simply visit the Neo4j Desktop Download Page and click Download. Open Neo4j Desktop after installation. Sign in or create a Neo4j account (required to activate the software).

Once logged in, create a New Project:

  • Click on the + button in the top-left corner.
  • Name your project (e.g., "PageRank").

Inside your project, click on Add Database. Select Local DBMS and click Create a Local Graph.

Configure your database:

  • Name: Enter a name (e.g., neo4j).
  • Password: Set a password (e.g., pagerank). Remember this password for later.

Click Create to initialize the database.

Now, let’s move on to the code:

from neo4j import GraphDatabase
import json
import wikipediaapi
import random
from tqdm import tqdm
from collections import deque
import numpy as np

We import the GraphDatabase class to connect to a Neo4j database and run queries against it. This will be our main class for handling operations against our Knowledge Graph.

Next, we import a series of Python libraries: json for reading and writing data in JSON format, wikipediaapi for accessing Wikipedia pages, random for random selection of links, tqdm for progress bars, deque (a double-ended queue) for efficiently handling lists we will treat like queues, and numpy for mathematical operations.

# Connect to Neo4j
uri = "bolt://localhost:7687"  # Change this to your Neo4j URI
username = "neo4j"  # Change this to your Neo4j username
password = "pagerank"  # Change this to your Neo4j password

driver = GraphDatabase.driver(uri, auth=(username, password))

Let’s define the connection for the Neo4j instance. In a production environment, you’d make sure these credentials are secure, but for our demonstration, we’ll keep it simple.

Next, we create a driver object that lets us open sessions to the database and run queries. Think of this as "plugging in" your Python code to the Neo4j database engine.
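
Before going further, it can help to confirm that the connection actually works. Recent versions of the neo4j Python driver expose a verify_connectivity() method for exactly this (a small optional check):

# Optional sanity check: raises an exception if the URI or credentials are wrong
driver.verify_connectivity()
print("Successfully connected to Neo4j")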

Crawling Wikipedia

For our example, we will crawl 100 pages from Wikipedia, starting from a single article. We’ll follow links to other articles, pick a few at random, and build a larger link network. This mimics the random surfer model we discussed, except we’re not just simulating a random surfer’s journey – we’re actually collecting the links themselves.

def crawl_wiki_network(start_title, max_articles=100, max_links_per_article=5):
    # Initialize Wikipedia API
    wiki = wikipediaapi.Wikipedia(
        'RAG Knowledge Graph ([email protected])',
        'en'
    )

This crawl_wiki_network function starts at a given Wikipedia article and crawls through linked pages. Inside it, we create a Wikipedia API object. The arguments identify our application and language (English in this case).

    # Track visited pages and links
    visited_pages = set()
    links_data = []
    pages_to_visit = deque([start_title])

Let’s initialize visited_pages = set() to keep track of which pages we’ve already processed so we don’t repeat work. Next, links_data = [] will store information about the links we discover: from which page the link came, to which page it goes, and some context.

Also, we start with a queue that initially contains just one page: the page we start from (e.g., "Artificial Intelligence"). We’ll pull pages off this queue and explore them.

    with tqdm(total=max_articles) as pbar:
        while pages_to_visit and len(visited_pages) < max_articles:
            current_title = pages_to_visit.popleft()

            if current_title in visited_pages:
                continue

We use tqdm for a nice progress bar so we know how many articles we’ve processed out of max_articles. The while loop keeps crawling as long as there are pages to visit and we haven’t hit our article limit.

popleft() removes the next page title from our queue, and in case we’ve already seen this page, skip it to avoid loops.

            page = wiki.page(current_title)
            if not page.exists():
                continue

Next, we fetch the page from Wikipedia, and in case of broken links we skip the page.

            visited_pages.add(current_title)
            pbar.update(1)

Once we visit the new page, we add it to the visited_pages set and update our progress bar, showing that we’ve successfully processed another article.

            # Get all links from the page
            page_links = list(page.links.items())

            # Randomly select a subset of links
            selected_links = random.sample(
                page_links, 
                min(max_links_per_article, len(page_links))
            )

Next, we get all the links found on the current page, and pick a few links at random, up to max_links_per_article. This mimics the idea that our random surfer won’t follow every single link, just a handful.

            # Process selected links
            for link_title, link_page in selected_links:
                if link_page.exists():
                    # Add link to queue for future processing
                    pages_to_visit.append(link_title)

                    # Get the first paragraph as context
                    context = link_page.summary.split('\n')[0] if link_page.summary else ""

                    links_data.append({
                        "from": current_title,
                        "to": link_title,
                        "context": context
                    })

Next, we iterate over the chosen links, and for each valid link (skipping broken ones, as before):

  • We add it to our queue, so we can explore it in a future iteration.
  • Extract a bit of context (like a snippet of text) from the linked page, giving us more info if needed.
  • Store the "from → to" link in links_data. This data will later help us build a graph in Neo4j and eventually run PageRank.
    return links_data

After we’ve reached our maximum number of articles or run out of links, we return all the link data we collected.

Saving the Crawled Network

Now we have a crawler that returns link data. Let’s save this data as a JSON file for future use.

def save_network(start_title, output_file="wikipedia_network.json"):
    links_data = crawl_wiki_network(start_title)
    if links_data:
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(links_data, f, indent=2, ensure_ascii=False)
        print(f"Saved {len(links_data)} links to {output_file}")

This function runs the crawler, takes the returned links_data, and writes it out to a JSON file.

# You can start with any article
start_article = "Artificial intelligence"
save_network(start_article)

We choose a starting point for our crawl – here it’s "Artificial intelligence." We could pick any article to begin the chain, and run the whole process and save the results.

(As the crawler runs, we’ll see a progress bar. Eventually, it prints something like: "Saved 495 links to wikipedia_network.json".)

Loading the Network into Neo4j

Now that we have a JSON file full of link data, let’s load it into Neo4j so we can store and query the network there. This will let us handle larger datasets more easily and perform efficient lookups when we implement PageRank.

def load_wiki_network(file_path="wikipedia_network.json"):
    # Read the JSON file
    with open(file_path, 'r', encoding='utf-8') as f:
        links_data = json.load(f)

    # Connect to Neo4j and load the data
    with driver.session() as session:
        # Create unique constraints if they don't exist
        session.run("CREATE CONSTRAINT page_name IF NOT EXISTS FOR (p:Page) REQUIRE p.name IS UNIQUE")

        # Create nodes and relationships
        for link in links_data:
            # Create or merge nodes and relationship
            session.run(
                """
                MERGE (from:Page {name: $from})
                MERGE (to:Page {name: $to})
                CREATE (from)-[:REFERENCES {context: $context}]->(to)
                """,
                parameters={"from": link["from"], "to": link["to"], "context": link["context"]}
            )

    print(f"Successfully loaded {len(links_data)} relationships into Neo4j")

load_wiki_network takes the JSON file path and loads the link data into Neo4j. It first reads the JSON file containing all the link relationships, then opens a Neo4j session, allowing us to run queries. Next, we run a few Cypher queries (Cypher is to graph databases what SQL is to relational databases) for each page:

  • CREATE CONSTRAINT page_name IF NOT EXISTS: Enforce that each Page node must have a unique name, preventing duplicates.
  • MERGE (from:Page {name: $from}) ...: MERGE either finds or creates a node. This ensures we don’t create the same page node multiple times.
  • CREATE (from)-[:REFERENCES {context: $context}]->(to): Here we create a REFERENCES relationship between two pages. We store context as a property. In a real scenario, this might help with downstream tasks like summarization or topic analysis.
load_wiki_network()

Finally, we load all the links we discovered into Neo4j, resulting in a rich network we can run PageRank on.

(After running, we might see: "Successfully loaded 495 relationships into Neo4j")
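
If you want a quick sanity check before visualizing anything, you can count what was loaded with the same driver (a small optional check):

with driver.session() as session:
    # Count the Page nodes and the REFERENCES relationships we just created
    pages = session.run("MATCH (p:Page) RETURN count(p) AS c").single()["c"]
    links = session.run("MATCH ()-[r:REFERENCES]->() RETURN count(r) AS c").single()["c"]
    print(f"{pages} pages and {links} links in the graph")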

Once we have our graph, let’s visualize it. There are a few ways to visualize our newly created Knowledge Graph, but probably the easiest is using the Neo4j Browser. Switch back to Neo4j Desktop, click on the arrow pointing down next to "Open", and click on "Neo4j Browser":

Screenshot of Neo4j desktop – Image by Author

Once you are in, run the following Cypher query:

MATCH (n) RETURN n

This will fetch all the nodes, and plot our Knowledge Graph. In my case, mine looks like this:

Knowledge Graph containing random Wikipedia topics – Image by Author

Resetting the Graph (Optional)

If at any point you want to start fresh, you can clear out all the data from Neo4j:

with driver.session() as session:
    session.run("MATCH (n) DETACH DELETE n")
    print("Graph has been reset - all nodes and relationships deleted.")

MATCH (n) DETACH DELETE n finds all nodes n and deletes them along with their relationships. This is a complete reset of the graph.

Implementing the PageRank Algorithm in Python

Now that we have our graph of pages and links stored in Neo4j, we can implement PageRank from scratch. Remember how the math worked? We’ll apply the iterative method we described, but this time reading real data from our Neo4j database.

def get_pages(tx):
    result = tx.run("MATCH (p:Page) RETURN p.name AS name")
    return [record["name"] for record in result]

def get_links(tx):
    result = tx.run("MATCH (p1:Page)-[:LINKS_TO]->(p2:Page) RETURN p1.name AS from, p2.name AS to")
    links = {}
    for record in result:
        if record["from"] not in links:
            links[record["from"]] = []
        links[record["from"]].append(record["to"])
    return links

get_pages(tx) runs a Cypher query to get all page names from the database. We return a list of these names.

get_links(tx) queries all the REFERENCES relationships we created when loading the data. We build a dictionary where each key is a page and each value is a list of pages it links to. This gives us the outbound link structure we need for PageRank.

def page_rank(pages, links, damping_factor=0.85, max_iterations=100, tol=1.0e-6):
    n = len(pages)
    ranks = np.ones(n) / n
    page_index = {page: i for i, page in enumerate(pages)}

It’s finally time to create the function that will handle the PageRank algorithm. Let’s start by setting:

  • n = len(pages): The total number of pages.
  • ranks = np.ones(n) / n: Initialize all PageRank values equally, just like our earlier examples, giving each page 1/N initially.
  • page_index = {page: i ...}: Create an index mapping from page names to their array index. This helps us store and access ranks efficiently.
    for _ in range(max_iterations):
        new_ranks = np.zeros(n)
        for page, out_links in links.items():
            if out_links:
                share = ranks[page_index[page]] / len(out_links)
                for linked_page in out_links:
                    new_ranks[page_index[linked_page]] += share
            else:
                # If a page has no out-links, it is a "dangling node"
                # We distribute its rank evenly across all pages
                new_ranks += ranks[page_index[page]] / n

Next, we iterate multiple times until PageRank converges or we hit the limit. Each iteration starts with a blank slate. If the page has outbound links, we split its current rank among those links (like distributing "votes").

For each linked page, add the "share" of PageRank. If a page has no outbound links, it’s a dangling node. We distribute its rank among all pages equally (this matches the theoretical fix for dangling nodes we discussed).

        # Apply damping
        new_ranks = damping_factor * new_ranks + (1 - damping_factor) / n

Remember our formula:

PageRank formula – Image by Author

The first part (1-d)/N is the "random jump" probability. We multiply our accumulated ranks by d and add the (1-d)/N term to every page.

        if np.linalg.norm(new_ranks - ranks, 1) < tol:
            break
        ranks = new_ranks

Next, we compute the difference between the new ranks and the old ones. If this difference (the "update") is very small, it means we’ve converged. If the change is less than the tolerance tol, we stop early. Our PageRank values have stabilized. Otherwise, we continue another iteration using the new ranks.

    return {page: ranks[page_index[page]] for page in pages}

Finally, we return a dictionary of {page_name: page_rank_value}, making it easy to see which pages have the highest PageRank.

with driver.session() as session:
    pages = session.execute_read(get_pages)
    links = session.execute_read(get_links)
    ranks = page_rank(pages, links)
    print(ranks)

Now, let’s get all the pages and links, run our PageRank algorithm on the retrieved network, and print out the PageRank scores for each page.
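
Since the raw dictionary can be long, a quick way to see which pages matter most is to sort it (just a convenience on top of the result, not part of the algorithm):

# Show the ten highest-ranked pages
top_pages = sorted(ranks.items(), key=lambda item: item[1], reverse=True)[:10]
for page, score in top_pages:
    print(f"{page}: {score:.4f}")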

Connecting Back to the Math

This Python code reflects the math we discussed earlier. The page_rank function is essentially performing the iterative calculation of the eigenvector solution. Each iteration corresponds to one application of the formula:

R(t+1) = d*M*R(t) + (1-d)*1/N

Where M is our link structure and 1/N is the uniform distribution for the random jump. If you look closely at the code, every iteration updates the new_ranks in a way that mirrors applying M and then adding (1-d)/N. Once the difference between new_ranks and ranks is very small, we’ve found our stable vector: the PageRank values.

Conclusion

Now, before we get too carried away, let’s remember that PageRank isn’t perfect, and has a few weak spots:

  • Link Spam: Just like email spam, some folks try to game the system by creating fake links.
  • The "Rich Get Richer" Problem: New websites start with low PageRank, making it harder for them to climb the rankings.

Search engines like Google now use incredibly complex systems that go way beyond just PageRank. But the core idea of PageRank – that the web’s structure itself contains valuable information – is still at the heart of how we organize and find information online. Especially now, in the AI world, PageRank is coming back as a hot topic. For example, Graph Neural Networks use PageRank-like methods to spread information through complex networks, and in Graph RAG (Retrieval-Augmented Generation) language models can use PageRank-style algorithms to navigate vast knowledge graphs.

