Getting Started

In graph theory, a path is a sequence of distinct vertices and edges connecting two nodes. There can be a plethora of paths that lead from one source node to a destination node. Consider the following example:

Consider the component (0, 1, 2, 3): we have two possible ways of getting from 0 to 3. Add in (3, 4, 5, 6) and we have 2 × 2 possible ways. Add in another 8 similar components, and we will have 2¹⁰ = 1024 possible paths.
The number of paths can grow exponentially in terms of the input, but only a subset of those paths minimizes the sum of edge weights, and those are called shortest paths.
Generating a set of all paths to find the shortest ones is, as we have demonstrated, very wasteful. In this article, I will expound on the most fundamental shortest path algorithms. The aim is to ensure an understanding not only of how such algorithms work, but why they do, while avoiding convoluted mathematical proofs and emphasizing intuition instead, all of which will be bolstered with Python code.
Prior knowledge of basic graph algorithms such as BFS and DFS is a bonus, but not requisite. If you know what an edge and a vertex are, you probably know enough.
Throughout this article, a graph G(V, E), with V representing the set of vertices in the graph and E representing the set of edges in the graph, will be represented as an adjacency list.

For example, the adjacency list for the above graph is represented as below:
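As a minimal illustration of the format (the nodes and edges below are made up for the sake of the example, not the ones from the figure), an adjacency list in Python is just a mapping from each node to its neighbors:

# A made-up unweighted graph: each node maps to the list of its neighbors.
adj = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2],
}

# For the weighted graphs later in the article, each neighbor is stored
# together with the weight of the connecting edge, as (neighbor, weight) pairs:
weighted_adj = {
    0: [(1, 4), (2, 1)],
    1: [(3, 1)],
    2: [(1, 2), (3, 5)],
    3: [],
}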
Breadth First Search
Breadth First Search (BFS) is a fundamental graph traversal algorithm. The way it works is, for every node, we scan all of its adjacent nodes, and store them so that we can scan each of them in turn in the next iteration.
Implementation
This implementation in Python, from MIT 6.006, while not the one you are likely to use in practice, will help us better understand what BFS is actually doing, and how we can leverage it to find shortest paths.
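A minimal sketch of that level-by-level BFS, in the spirit of the 6.006 version (assuming adj maps each node to a plain list of neighbors, as in the unweighted example above):

def bfs(adj, source):
    # dist[u]: number of edges on the path from source to u
    # parent[u]: the node u was discovered from; it doubles as a
    # "visited" marker (None means undiscovered, except for the source)
    dist = {u: float('inf') for u in adj}
    parent = {u: None for u in adj}
    dist[source] = 0
    frontier = [source]              # level 0 is just the source
    level = 1
    while frontier:
        next_frontier = []           # nodes that are exactly `level` edges away
        for u in frontier:
            for v in adj[u]:
                if parent[v] is None and v != source:   # not discovered yet
                    dist[v] = level
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return dist, parent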
In every iteration of the while loop, we explore a new frontier of nodes by looping through the neighbors of each node in the last frontier.
The result is that we end up dividing the graph into levels, as shown in the figure above, where the first level consists of nodes that are exactly one edge away from the source, the second of nodes that are exactly two edges away from the source, and so on and so forth. In the code above, the distance array dist holds this value for every node in the graph.
We can then surmise that after running BFS on a graph, we can figure out the way to reach any node from the source using the least number of edges. Well, that to me sounds like the shortest path to any vertex in the graph! (not quite…)
Along with the distance array, we are also maintaining an array (or hash table if you prefer) of parent pointers, conveniently named parent, in which we specify, for every discovered node v, the node u we discovered v from, i.e. its parent.
We are using that array to keep track of whether a node has been visited before, but that alone doesn’t require specifying the parent of every node. The real reason we are doing it is to be able to backtrack to the source and construct the shortest path by following those parent pointers.
A more practical implementation of BFS would use a queue to store the nodes to explore next. But it abstracts away the notion of levels, making it harder to understand how BFS gives us the shortest path.
Running Time
- V iterations are required to initialize parent and distance arrays
- The while loop will run at most V times. Because we keep track of every node we see in the parent array, we only add unexplored nodes to be scanned in the next iteration.
- For every node we scan, we go through all of the nodes adjacent to it by an edge, which costs as many operations as that node has outgoing edges. Summing that over every node in the graph gives O(total number of edges) = O(E).
Thus, the running time of BFS is O(V + E).
A Caveat
The caveat is, as stated before, that this is only the shortest path in terms of the number of edges, i.e. this would only qualify as a "real" shortest path if the graph is either unweighted or all the weights are the same. Consider the following example where the shortest path from 0 to 2 is not the one with the least number of edges:

So how can we fix this?
Well, one way to do it is to transform the initial weighted graph into an unweighted one whilst keeping the specifications of the problem statement intact, by breaking the weighted edges into edges of weight 1 and linking them together with fake nodes.
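As a rough sketch of that transformation (assuming positive integer weights and the (neighbor, weight) adjacency list format from earlier; the dummy-node labelling scheme is just one arbitrary choice):

def split_edges(adj):
    # Turn a weighted graph into an unweighted one by replacing every edge
    # of weight w with a chain of w unit edges through w - 1 fake nodes.
    # Fake nodes are labelled with tuples so they cannot collide with the
    # original node names.
    new_adj = {u: [] for u in adj}
    for u in adj:
        for v, w in adj[u]:
            prev = u
            for i in range(1, w):
                dummy = (u, v, i)          # i-th fake node on the edge (u, v)
                new_adj.setdefault(dummy, [])
                new_adj[prev].append(dummy)
                prev = dummy
            new_adj[prev].append(v)
    return new_adj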

What is the running time of this algorithm?
This is still BFS, so the running time should still be linear in terms of the edges and vertices. But how many of those do we have in our new graph?
- For every edge of weight w, we replace it by w edges of weight 1: E′ = O(W·E), W being the maximum edge weight.
- For every edge of weight w, we create w − 1 new vertices, and the old vertices have yet to go anywhere: V′ = O(W·E + V).
Thus, the running time is O(V′ + E′) = O(V + W·E), and it’s here that we begin to see the caveat with this approach: it is dependent on how large the weights are. This may work well for graphs with small weights, but it’s seldom useful in practice. We would rather have an algorithm that can quickly find the shortest paths regardless of the value of the weights.
To that end, we need to introduce a vital technique that is the key to finding the shortest paths in a graph.
Edge Relaxation

Relaxing an edge (u, v) consists in simply checking whether we can find, from a source node s, a shorter path to v through (u, v).
We do so by comparing the distance to v through the old path we know of, with that of the path formed by the shortest path to u, and the edge (u, v).
Relax(u, v, weight):
    if d[v] > d[u] + weight(u, v):
        d[v] = d[u] + weight(u, v)
        parent[v] = u
If indeed we found that we could reach v faster through the edge (u, v), then we update the following values:
- d[v]: the distance to the node v from a source s. We update this with the new value that we just compared it to.
- parent[v], the direct parent of v, which is now u. Since, like in BFS, we find the path by following parent pointers, this is akin to updating the shortest path to v.
Template for shortest path algorithms
Using the technique we learned above, we can write a simple skeleton algorithm that computes shortest paths in a weighted graph, the running time of which does not depend on the values of the weights.
1. Select an edge (u, v) from the graph.
2. Relax the edge (u, v).
3. Repeat 1 and 2, selecting edges in some order, until no edge can be relaxed (d[v] ≤ d[u] + weight(u, v) for all edges (u, v)).
If no edges in the graph can be relaxed, that means there is no better way to reach any node through any of its adjacent edges, i.e. there is no shorter path to get to any node than the path we have at the moment.
Well that’s simple enough! However, as is often the case, the devil is in the details, and in this particular case, the devil insidiously lurks within the words ‘some order’. An unwise relaxation order can result in exponential time. A good order may even result in linear time. The efficiency of this algorithm hinges on the order in which you relax the edges, and each of the subsequent algorithms differs in that regard.
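To make the skeleton concrete, here is a deliberately naive sketch of it in Python: it just keeps sweeping over all the edges in an arbitrary order until a full sweep relaxes nothing. It is only an illustration of the template (assuming the (neighbor, weight) adjacency lists from earlier), not an algorithm you would actually use.

def relax_until_done(adj, source):
    dist = {u: float('inf') for u in adj}
    parent = {u: None for u in adj}
    dist[source] = 0
    changed = True
    while changed:                            # stop when no edge can be relaxed
        changed = False
        for u in adj:
            for v, w in adj[u]:
                if dist[v] > dist[u] + w:     # the edge (u, v) can be relaxed
                    dist[v] = dist[u] + w
                    parent[v] = u
                    changed = True
    # Note: this loops forever if the graph contains a negative weight cycle
    # (more on that below).
    return dist, parent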
Bellman Ford
Since we already know how to explore the edges and vertices of a graph in a certain order, namely BFS, why don’t we try that? Let us explore the graph in breadth-first order, scanning vertices and relaxing edges along the way.
This time, however, we can’t just be content with scanning every vertex once. Unlike when edges were unweighted, we can’t possibly know whether there could be a better path to reach a node, one that goes through nodes lying multiple frontiers ahead that are yet to be explored, like in the example below.

Instead, if, when scanning a node u, an edge (u, v) can be relaxed, we relax it and then add v to the queue. But we don’t want to add a node that is already in the queue and scheduled to be scanned later; that would be wasteful. Therefore, for each node, we have to keep track of whether it is in the queue at any given moment.
The implementation would go something like below (ignore the negative-cycle counting for now, we will come back to it):
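A minimal sketch of such an implementation (assuming adj maps each node to (neighbor, weight) pairs; the push counter used for the negative-cycle check is included here and explained further down):

from collections import deque

def bellman_ford(adj, source):
    dist = {u: float('inf') for u in adj}
    parent = {u: None for u in adj}
    in_queue = {u: False for u in adj}    # is the node currently in the queue?
    pushes = {u: 0 for u in adj}          # how many times each node was queued
    dist[source] = 0
    queue = deque([source])
    in_queue[source] = True
    while queue:
        u = queue.popleft()
        in_queue[u] = False
        for v, w in adj[u]:
            if dist[v] > dist[u] + w:     # the edge (u, v) can be relaxed
                dist[v] = dist[u] + w
                parent[v] = u
                if not in_queue[v]:       # don't queue a node that is already queued
                    queue.append(v)
                    in_queue[v] = True
                    pushes[v] += 1        # used to detect negative cycles
                    if pushes[v] > len(adj) - 1:
                        raise ValueError("negative weight cycle detected")
    return dist, parent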
But why is this algorithm able to find the shortest path?
Intuition
Similarly to how we did in BFS, we will try to analyze this algorithm by looking at the levels in the queue.
Level 0 consists only of the source node s.
Level 1, same as BFS, consists of all the nodes adjacent to the source node. Those are nodes within an edge of the source, i.e. through a path of length 1.
Level 2 will now consist of all the nodes adjacent to the nodes at level 1, whose edges can be relaxed. So these are all the nodes in the graph that can be reached through a path of length 2, faster than a path of length 1.
At level 1, all the shortest paths (from source s) of length 1 are computed correctly.
At level 2, all the shortest paths of length 2 are computed correctly.
At level V-1, all the shortest paths of length V-1 are computed correctly.
A path can only have V nodes at most, since all of the nodes in a path have to be distinct from one another, whence the maximum length of a path is V-1 edges. Thus, after V-1 levels, the algorithm finds all the shortest paths and terminates.
Negative weight cycles
Up until this point I have purposefully neglected an important detail about weighted graphs. Consider the following graph:

Nodes 1, 2 and 3 form a cycle. What is particular about this cycle is that the sum of the weights of its edges is negative. Finding the shortest (simple) path in a graph that contains a negative cycle is NP-hard, a problem for which no efficient algorithm is known, and it’s easy to see why such a cycle causes trouble.
We have seen that the key to finding shortest paths is the order in which edges are relaxed, but no matter the order in which you relax edges 1–2, 2–3 and 3–1, you will never reach a point where no edges can be relaxed, for you can always decrease the distance by going around the negative cycle one more time.
Wherefore it behooves us to detect the presence of these cycles within our algorithm to keep it from running ad infinitum. But how?
- We have already made arrangements to ensure a node does not appear more than once during a level or phase.
- We also know that under regular circumstances there should be no more than V-1 levels.
Therefore, a node can only be pushed into the queue at most V-1 times during the entirety of the execution. We can then keep track of how many times each node has been added to the queue (the push counter in the implementation above), and if that number ever exceeds V-1, we have detected a negative cycle.
Note: a consequence of this is that we cannot use this algorithm on undirected graphs with negative edges, because a single negative undirected edge would count as a negative cycle (since it’s equivalent to 2 directed edges, (u, v) and (v, u)).
Running time
- We know that the algorithm has V-1 phases, each corresponding to the V-1 levels we just spoke of. (For example, in phase 1, I pop all the nodes in level 0 (which is just the source), scan all their adjacent nodes and build the next level of nodes to be scanned.)
- Since we ensured no vertex appears in the queue more than once in the same level, there will be no more than V vertices in each level.
- In each phase, the worst case would be if we had all the vertices in a single level and had to scan all the edges in the graph, which translates to O(E).
Therefore, the total running time of the algorithm is O(V.E).
This is quadratic time in case the graph is sparse (E = O(V), not very bushy) and cubic time in case the graph is dense (E = O(V²), very bushy graph, lots of edges coming out of each node). That can be pretty slow; is there no way to speed this up?
Well, it turns out we can’t do any better when it comes to graphs that contain negative weight edges. But who needs them anyway? It’s not like they appear frequently in practice. If we had a graph with non-negative edge weights, is there a way to leverage that to come up with a faster algorithm?
Dijkstra
Consider the following example:

If we posit that there are no negative edges in the graph, can we make a guess as to which one of these edges forms a shortest path?
It’s the edge with the minimum weight, the one that links node C to the source.
Why is that the case? Well, if there existed a shorter path to node C, then it would have to go through one of the nodes A to E, all of which are reached through edges of higher weight than the one we picked, which had the minimum weight. Because we have no negative weight edges, the cost of a path can never get smaller as you add in more edges, so there is no hope of decreasing the total path weight as we progress further in the graph.
Thus, we can tell with certainty that (S, C) is the shortest path to node C, and as a result, we don’t have to go back to it anymore, and therein lies the essence of Dijkstra’s algorithm.
Algorithm
Dijkstra adopts a greedy approach to edge relaxation. At each iteration, it selects the node closest to the source, relaxes all of its outgoing edges, and then never checks it again, because it can be sure that the current value is the shortest path to that node.
It keeps repeating the process until it has exhausted every node on the graph, or, alternatively, if you have a destination node in mind, until it has reached that particular node.
Pseudo-code
Dijkstra(adj, source):
    visited = {}
    Q = priority_queue()
    initialize Q with (node, distance) values, distance being 0 for the source
        and infinity for every other node
    while Q is not empty:
        u = extract_min(Q)    # deletes u from Q
        visited = visited ∪ {u}
        for each vertex v, weight in Adj[u]:
            if d[v] > d[u] + weight(u, v):
                d[v] = d[u] + weight(u, v)
                parent[v] = u
                Q.decrease_key(v, d[v])
The algorithm leverages a priority queue data structure instead of a regular queue like the previous ones in this article, since we need to pop the closest node at every iteration.
You can find here a visualization of Dijkstra that you can play with, created by David Galles at the University of San Francisco.
Intuition
We have already explored part of the intuition as to why this algorithm manages to compute shortest paths. We can extend that reasoning beyond the first iteration:

For this particular state, the priority queue will return A. There exists no shorter path to node A, since any other path to A would have to go through one of the nodes that were already in the priority queue at the time (the way we explore nodes and relax all their adjacent edges makes it so), all of which are farther than A from the source.
As for the other nodes, while their distance values can decrease in future iterations, they can never become smaller than A’s distance, because their best bet of decreasing their distance is to find paths that go through A.
Running Time
We have the following operations:
- O(V) inserts into priority queue (initialization step)
- O(V) extract min operations (queue starts with V nodes and keeps popping until it runs out)
- For every one of the extracted vertices, we do as many decrease key operations as there are outgoing edges (in the worst case). Summing that over all nodes amounts to O(E) decrease key operations.
The actual time complexity hinges on the data structure used to implement the priority queue.
Using an unsorted array, extracting the minimum would require a full pass through the vertices, incurring a cost of O(V). Decreasing a key’s value can be done in constant time. The result is O(V² + E) = O(V²).
Using a binary heap, both operations would cost O(log(V)). The total running time is O(V.log(V) + E.log(V)) = O(E.log(V)).
Using a Fibonacci heap, extract min costs O(log(V)) amortized and decrease key costs O(1) amortized, bringing the total down to O(E + V.log(V)), but that’s too fancy a data structure for practical use.
As to which one is the better approach, it (clearly) depends on the value of E. The smallest value E can have is V-1* (when the graph is just barely connected). For sparse graphs, E = O(V), making the binary heap the better option.
The worst case happens when the graph is dense and we have edges from every node to almost every other node. In that case, E = O(V²), and you’re better off using the array implementation to save cost over those decrease key operations.
*I’m ignoring the fact that E can be less than V-1 when we have multiple disconnected components in the graph, since those won’t be reachable from the source anyway. You can either run BFS from the start node to determine which nodes are connected to it and only put those in the priority queue, or populate the latter as you go in the for loop rather than initializing it from the start (which is what the Python implementation below does).
Implementation
This is an implementation that uses a heap as a priority queue, and is the most useful one in practice. But it slightly differs from the pseudocode:
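A minimal sketch of that heap-based version (assuming adj maps each node to (neighbor, weight) pairs; heap entries are (distance, node) tuples so that heapq orders them by distance):

import heapq

def dijkstra(adj, source):
    dist = {source: 0}
    parent = {source: None}
    visited = set()
    heap = [(0, source)]                    # (distance, node) pairs
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:                    # stale duplicate: u is already finalized
            continue
        visited.add(u)
        for v, w in adj[u]:
            if dist.get(v, float('inf')) > d + w:      # relax the edge (u, v)
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (dist[v], v))      # push instead of decrease-key
    return dist, parent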
Notice that instead of decreasing the key of a node in place, we just push a new (distance, node) pair onto the heap, trusting that the heap will return the pair with the smallest distance. As for the stale, larger duplicates that remain in the queue, they are filtered out when popped by checking whether the node has already been visited.
The reason I opted for this is that, for decrease key to run in O(log(V)), we need to be able, given a node u, to quickly locate the (u, d[u]) pair inside the queue. Python’s heapq implementation does not offer such functionality. You can either implement it by hand, use some external module, or get creative.
Since the number of entries in the queue is no longer capped at V, we can have at most O(E) entries in it. The asymptotic running time is O(E.log(E)) = O(E.log(V²)) = O(E.log(V)), so it basically remains unaffected.
Dijkstra’s running time is pretty good, and for a general graph there is no better asymptotic running time known, though algorithms that run faster in practice do exist.
However, for the particular case of Directed Acyclic Graphs (DAGs), there is one last algorithm that is faster than Dijkstra’s, and that can even work with negative weight edges!
DAG shortest path
The creative name in the title is courtesy of the fact that this algorithm lacks one, since no one really knows who first invented it.
Now, the primary instinct one should develop upon encountering a Directed Acyclic Graph (DAG) in the wild is to topologically sort it. This gives us a linear ordering such that if (u, v) is a directed edge from u to v, then u, the predecessor, will certainly appear before v, the successor.
In other words, no node can come before its predecessor on any path. All paths lead forward.

Therefore, by iterating through the nodes in topological order and relaxing all of each node’s outgoing edges, we can rest assured that once a node has been scanned, it never has to be scanned again, because its distance value can never be updated afterwards: there is literally no way to get back to it from its successor nodes, as is the property of topological sort.
This is basically the same reasoning used to prove the correctness of Dijkstra’s greedy approach, but without the complicated explanations (the actual formal proof of Dijkstra is pretty gnarly…) or the fancy data structures, and the total running time is the same as BFS, O(V + E).
Implementation
We first topologically sort the graph. The code below uses Depth First Search (DFS) for that purpose. I will not get into the details of this since it is not a shortest path algorithm.
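A minimal sketch of such a DFS-based topological sort (recursive for brevity; adj maps each node to (neighbor, weight) pairs):

def topological_sort(adj):
    # A node is appended to `order` only after all of its descendants have
    # been fully explored, so reversing `order` at the end yields a valid
    # topological ordering.
    visited = set()
    order = []

    def dfs(u):
        visited.add(u)
        for v, _ in adj[u]:
            if v not in visited:
                dfs(v)
        order.append(u)

    for u in adj:
        if u not in visited:
            dfs(u)
    return list(reversed(order))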
And now we can implement the shortest path algorithm:
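A sketch of it, reusing the topological_sort above and the same (neighbor, weight) adjacency lists:

def dag_shortest_paths(adj, source):
    order = topological_sort(adj)
    dist = {u: float('inf') for u in adj}
    parent = {u: None for u in adj}
    dist[source] = 0
    for u in order:                       # process nodes in topological order
        if dist[u] == float('inf'):       # u is not reachable from the source
            continue
        for v, w in adj[u]:
            if dist[v] > dist[u] + w:     # relax the edge (u, v)
                dist[v] = dist[u] + w
                parent[v] = u
    return dist, parent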
Conclusion
Phew, well, it’s about damn time. This took longer than I expected. We have explored the most fundamental single source shortest path algorithms. Here is a summary of the algorithms we have seen so far:

And below is a decision tree to help with choosing the best option for your use case:

A couple of notes to finish:
- All of the algorithms described in this article are designed to find the shortest paths from one particular source node, hence why they are called single source shortest path algorithms. There are other algorithms that can find all the shortest paths from all nodes in the graph, and that are more efficient than running one of the algorithms we have seen V times (once per source node). I might write an article on those next time.
- I actually lied about the Bellman-Ford algorithm. The algorithm presented in this article is actually called the Shortest Path Faster Algorithm (SPFA), an improved version of Bellman-Ford that has the same asymptotic complexity but works faster in practice. They also share the same proof of correctness, but I find it easier to see that proof using the Shortest Path Faster Algorithm. That, coupled with the practical efficiency, is what made me choose the variation over the original.