A DAG expresses a set of interconnected nodes and puts a few hard limits on how those nodes can be connected. A DAG differs from a regular graph in two ways: each connection between nodes represents a one-way relationship, and relationships between nodes cannot result in an infinite loop.
The diagram below shows a directed graph. However, it is not acyclic: if you were to step around the graph's nodes following the edges' directions, you could pass the starting point an infinite number of times and never reach a final node to stop on.

Instead, a DAG enforces that you cannot loop like this, hence the Acyclic part of its name. Because of this inherent property, DAGs make great candidates for expressing concepts like dependencies in processes. Below is a modified version of the graph above, which is indeed a valid DAG.

Using DAGs to express dependencies in a process, let's imagine that the DAG edges shown above represent steps in a processing pipeline: A feeds into B, and B feeds into C. In this setup, we can resolve dependencies by walking back along the edges. Each node we encounter like this is a dependency of the initial node we started on. Another way to think about this is shown below: we can reverse the relationship expressed by each edge and walk along them. If a forward edge represents the flow of data, then a reverse edge represents a data dependency.
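As a tiny illustration of that idea (with A, B and C as placeholder names), reversing an edge list flips each relationship from "feeds into" to "depends on":

```python
# Forward edges describe the flow of data: A feeds into B, and B feeds into C.
pipeline_edges = [("A", "B"), ("B", "C")]

# Reversing each edge turns "flow of data" into "data dependency":
# B depends on A, and C depends on B.
dependency_edges = [(dst, src) for src, dst in pipeline_edges]

print(dependency_edges)  # [('B', 'A'), ('C', 'B')]
```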

Now that we understand the concept, we can start exploring the algorithms we might use to resolve our data dependencies dynamically. As part of my job, I have been working on a processing engine that turns sets of rules into SQL and runs them. This SQL operates on raw data to produce something akin to a presentation layer in a data warehouse. The issue here is that sometimes the SQL can depend on tables and columns created by other SQL scripts, and I need a nice way to handle these dependencies. What if I only want to run one rule? I need a way to find all of the rules that this rule depends on, then run them in an appropriate order so that all of the dependencies are created before each SQL script tries to access them.
Let’s build a toy example of what I’m talking about. Since I work in the health care realm, I will use that as the context, but the concepts apply to any process. Imagine we have some raw data describing clients’ interactions with a set of health services. We first start by cleaning the demographic data and the health service locality data. This clean demographic and locality data is then used as part of the PCD Risk data processing stage. In turn, the PCD Risk data is used by two more stages: the Care Plan stage and the Event information stage. Finally, some of the data produced in the Event information stage is used as part of the Billing stage. This is laid out in the diagram below. In this case, the arrows’ directions in the acyclic graph show the flow of data, not the data dependencies.

Since I am a Python programmer, I chose to use the NetworkX library for prototyping some algorithms for walking these dependency graphs. NetworkX treats every edge in the initial edge list as directed, and it lets me verify that the graph I attempt to initialise is acyclic so I can raise an exception if it is not. Using the situation described above, let’s set up an edge list describing this graph and build the DAG using NetworkX.
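A minimal version of that setup could look like the sketch below, using nx.DiGraph and NetworkX's is_directed_acyclic_graph check, with node names matching the stages in the diagram:

```python
import networkx as nx

# Edges point in the direction of data flow, exactly as drawn in the diagram.
edges = [
    ("Demographics", "PCD"),
    ("Locality", "PCD"),
    ("PCD", "Care Plan"),
    ("PCD", "Event"),
    ("Event", "Billing"),
]

graph = nx.DiGraph(edges)

# Refuse to continue if the edge list contains a cycle.
if not nx.is_directed_acyclic_graph(graph):
    raise nx.NetworkXError("The supplied edge list contains a cycle")
```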
Notice in the code above that we defined each node’s relationships with the arrow facing the data flow direction, not the data dependencies’ direction. To walk the graph collecting the nodes we are dependent on, we will be walking "backwards", towards each node’s "predecessors".
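Continuing with the graph sketched above, NetworkX exposes this notion directly through DiGraph.predecessors (and its mirror, successors):

```python
# The nodes PCD depends on are its predecessors along the data-flow edges.
print(list(graph.predecessors("PCD")))  # ['Demographics', 'Locality']

# Its successors, by contrast, are the stages that consume PCD's output.
print(list(graph.successors("PCD")))    # ['Care Plan', 'Event']
```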
Next, we need to define a function that will take the graph and a starting node, walk the graph visiting each node the starting node depends on, then return a list representing the execution order. To allow parallelisation, we will allow numbers in the execution order to be used more than once if there are nodes not dependent on each other. In the diagram above, I actually encoded this using the colours of the nodes. The yellow nodes need to be run first, but as they don’t depend on each other, there is no reason why one has to be run before the other; or, if the database supports it, there is no reason we cannot run these processing stages in parallel.
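A minimal version of get_node_dependancies could look like the sketch below, recording every node reached together with a negative depth measured backwards from the start nodes:

```python
def get_node_dependancies(graph, start_nodes):
    """Walk backwards from start_nodes, recording every node reached and the
    (negative) depth at which it was encountered."""
    visited = []  # (node, depth) pairs; a node may appear more than once

    def walk(node, depth):
        visited.append((node, depth))
        for predecessor in graph.predecessors(node):
            walk(predecessor, depth - 1)

    for start_node in start_nodes:
        walk(start_node, 0)

    return visited
```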
Using get_node_dependancies(graph, start_nodes), we can walk backwards along the graph, building a list of the nodes we encounter and the depth from the start node at which we encounter them. We can calculate the execution order from these depths, as nodes at the same depth can be run together! There is an issue with this code, though. If we use multiple start nodes, then as the flow of data branches and merges, nodes can appear in this list twice and even at different depths! We need a way to clean this up and return a list of nodes and their execution order. I opted for a model where we will run nodes as late as possible in the processing engine’s execution flow. This means that if a node appears more than once in the list, the minimum depth will be used (as these depths are negative, and at execution time, the "start nodes" are actually the last things to run since we are walking backwards).
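A small helper along those lines could collapse duplicates by keeping the minimum depth for each node, then shift everything so the earliest stage is numbered 0 (get_execution_order here is just an illustrative name):

```python
def get_execution_order(dependencies):
    """Collapse duplicate (node, depth) entries, keeping the minimum depth for
    each node, then shift depths so the first stage to run is numbered 0."""
    min_depths = {}
    for node, depth in dependencies:
        min_depths[node] = min(depth, min_depths.get(node, depth))

    earliest = min(min_depths.values())
    return {node: depth - earliest for node, depth in min_depths.items()}
```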
Now that we have all the pieces required to implement our dependency resolution algorithm, let’s test it on the graph we created. If we were to run the PCD job, we would first need to run the Demographics and Locality jobs. We would expect to see something like "Demographics (0), Locality (0), PCD (1)", where the number in brackets is the execution order. Because the Demographics and Locality jobs don’t depend on each other, they should get the same "execution order" value. The PCD job should be run next and therefore have an execution order one higher.
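Putting the sketches above together for the PCD job gives something along those lines:

```python
dependencies = get_node_dependancies(graph, ["PCD"])
execution_order = get_execution_order(dependencies)

for node, order in sorted(execution_order.items(), key=lambda item: item[1]):
    print(f"{node} ({order})")

# Demographics (0)
# Locality (0)
# PCD (1)
```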
There is one thing to note here. I made the decision to run jobs as late as possible; this means assigning them the highest valid number for the execution order. You might notice in the diagram above that this would mean the Care Plan node is actually blue when using this dependency walker. The idea was to do as little work as possible early on in case something fails. This isn’t necessarily appropriate all the time, and I would like to look into spreading execution evenly across stages. So if there were 4 things that needed to run at stage two or three, and 4 things that needed to run at stage three or four, I would load balance them so that there are 2 in stage two, 3 in stage three, and 3 in stage four. However, at this stage, that’s conveniently left as an exercise for the reader. If you do implement it, please let me know!