Akkordeon: Actor model of a neural network

Koen Dejonghe
Towards Data Science
11 min read · Oct 22, 2018


Intro

I’ve been pondering this for a long time: is it possible to implement a neural network as an actor model? I have finally developed an implementation, and I felt compelled to write an article about it.

The focus of this article is on exploring an idea and how to implement it, rather than on how this implementation compares to other frameworks and implementations. I’m not even sure if it’s a good idea. But it can be done, as I hope to show below.

Source code is on github: https://github.com/botkop/akkordeon/tree/medium

TLDR;

I have developed 2 actor models for concurrent training of a neural network. The first, which I’ve dubbed ‘flip-flop’, swaps state for every message type, where the message type is one of forward pass, backward pass, and validation (or test), and can only handle messages that conform to the state it is in. The second, nicknamed ‘wheels’, can handle all message types at any given moment. It trains the same network used for testing the flip-flop scenario twice as fast, while using twice as many resources.

The actor model

The actor model is a system for concurrent computation.

Actors in the actor model are independent units that can have state.

The state of an actor is private: the actor does not share memory with other actors.

Communication between actors is through messaging. Messages are sent asynchronously and arrive in the mailbox of the receiving actor, from where they are processed in sequence, first-in first-out.

Although messages in an actor’s mailbox are processed one after the other, messages between actors are sent asynchronously. This means that computations in different actors run concurrently.
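These properties can be illustrated with a minimal Akka-style actor (a sketch assuming Akka’s classic actor API; the `Counter` actor and its string messages are hypothetical, purely for illustration):

```scala
import akka.actor.{Actor, ActorSystem, Props}

// An actor with private state: no other actor can touch `count`
// directly; it changes only by processing mailbox messages,
// which are handled one at a time, first-in first-out.
class Counter extends Actor {
  private var count = 0
  def receive: Receive = {
    case "inc"    => count += 1       // mutate private state
    case "report" => sender() ! count // reply with a message
  }
}

object Demo extends App {
  val system  = ActorSystem("demo")
  val counter = system.actorOf(Props[Counter], "counter")
  counter ! "inc" // fire-and-forget: the send returns immediately
}
```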

Actor systems can be deployed on a single machine or in a cluster. This allows for horizontal and vertical scaling, allowing optimal usage of computation resources.

Neural networks

In plain speak:

A neural net is a computation that tries to approximate an unknown function. We have an input x and a known output y, but we do not know how to get from x to y. Let’s try something: we take x, do some computation (activation) on it using a bunch of random numbers (weights), and see how the result compares to the wanted output y. This is known as the forward pass. Obviously, the outcome is not going to be very good.

Next we introduce a method for measuring how wrong our result is (loss function) and adjust the weights a little bit (derivatives) in the direction of the expected output. This is the backward pass, and adjusting the weights is known as optimization.

We do this over and over again, until we find the result satisfactory. This is called ‘training’.

We can try to improve the network by adding more weights and forward functions. These are stacked in layers, one after the other. Forward results pass from one layer to the next in the forward pass, and adjustments pass from each layer to the one before it in the backward pass.

Networks with more than one layer are called deep networks.

Although we may never find the exact function we were looking for, given enough data and training, we can compute something that almost always produces a good result. And hopefully, when we apply this to an x never seen before, it gives us a good prediction of the unknown y.

In summary:

Neural networks are composed of layers, which, during training, are traversed in the forward pass by activations on an input variable and in the backward pass by the derivatives calculated by a loss function. This is followed by an update of the parameters of the layers with the derivatives through an optimizer function, such as stochastic gradient descent.
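In code, one training iteration follows the same sequence. A schematic sketch (PyTorch-style; `net`, `lossFn`, and `optimizer` are placeholder names, not identifiers from any particular library):

```scala
// One training iteration, schematically:
val yHat = net(x)          // forward pass: activations, layer by layer
val loss = lossFn(yHat, y) // measure how wrong the output is
optimizer.zeroGrad()       // clear gradients from the previous step
loss.backward()            // backward pass: derivatives for every weight
optimizer.step()           // nudge the weights, e.g. with gradient descent
```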

Neural nets and the actor model

A neural net has many moving parts, and we can probably find some computations that can run concurrently.

Let’s start by defining an actor model, where each layer runs as an independent actor.

Let’s also call this layer-as-an-actor a gate. It better reflects the independence of an actor, and I like the term because it was coined by Andrej Karpathy in his magnificent lecture on backpropagation for CS231n.

In the forward pass, activations are sent as messages from one gate to the next. In the backward pass the derivatives are sent as messages to the previous gate. So far, nothing new. However, the fact that each gate runs independently from the rest of the network, allows for some novelties.

First, optimization (updating a gate’s parameters with the backward gradients of the next gate or the loss function) can be done asynchronously, i.e. while the previous gate is processing the newly arrived gradients of this gate.

Second, there is no need to halt the training process in order to execute the validation or test process. All can run at the same time.

Third, there is no need to wait for the gradients calculated by the rest of the network, and ultimately the loss function, to start processing the forward pass of the next batch of input data, if available.

Another thing that comes to mind is that gates can be arbitrarily complex, each composing a complete neural net in itself. Since gates are actors, they can be scaled both vertically and horizontally. Within one machine, gates will execute on all CPUs/GPUs available. Gates can easily be deployed on different machines for horizontal scaling. Optimal usage of all resources on a single machine and/or cluster could result in avoiding the cost of expensive GPUs.

Implementation

How can we implement this?

I will use Scorch, a neural net framework written in Scala. It has a programming interface very much like PyTorch’s. Since it’s written in Scala, it allows for integration with Akka, an actor-model toolkit.

Let’s see how it all works together, then I’ll explain the details.

Architecture

This only shows the training phase; other phases, like validation and testing, are similar and simpler, because there is no backpropagation.

There are 3 components.

  • Gates: actors, each consisting of a Module and an Optimizer. These are similar to the layers of traditional nets.
  • A Sentinel: also an actor, responsible for both data provision to the gates (input) and loss calculation/evaluation of the output.
  • A main program, to define the actor system, network, data loaders, and to start the training.

The sentinel, of which there is only one, reads a batch of data from the training data set, and forwards this as a message to the first gate in the network. This gate executes its forward function and sends the result to the next gate. And so on, until the last gate, which forwards the result back to the sentinel.

The sentinel receives the forward message, which is now the final result of the network, and compares it with the expected outcome. It calculates the loss and sends the backward message with the derivatives to the last gate. This gate calculates the local gradients of its function, and sends them to the gate before it, as a backward message. Then it updates its weights (optimization). And so until the backward message reaches the sentinel again.
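The round trip described above amounts to a small message protocol. Sketched in Scala (hypothetical case classes; the actual definitions live in the repository and may differ):

```scala
import scorch.autograd.Variable

// Messages circulating between the sentinel and the gates.
case class Forward(activation: Variable) // sentinel -> gate -> ... -> sentinel
case class Backward(gradients: Variable) // sentinel <- gate <- ... <- sentinel
case object Start                        // kicks off a training epoch
```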

Then the next batch of data is read by the sentinel. And so on.

Two Scenarios

I have developed 2 scenarios, which I’ve called the flip-flop and wheels scenarios.

In the flip-flop scenario, the state of all the actors (both sentinel and gates) swaps between a forward and a backward processing state. When a forward message reaches an actor, the actor processes it, and flips to the backward state. Vice versa, when a backward message reaches the actor, it flops to the forward state. The drawback of this scenario is that an actor can only be in one state at a time, and thus can only process messages that the current state allows.

The wheels scenario resolves this by having just one state, which handles all types of messages. This allows for a highly concurrent model, where forward propagation, backpropagation, and validation all run simultaneously. The scenario relies on the actor-model guarantee that messages in an actor’s mailbox are processed in sequence. When an actor needs to keep track of forward-pass activations, in order to calculate the gradients in the backward pass, a simple queue data structure suffices.

Gates

A gate is similar to a layer. The difference is that every gate is an actor, and that it has its own optimizer, whereas a traditional network has one optimizer for the complete network. There is, however, no difference in functionality, since optimizers do not share data between layers/gates.

Thus, a gate is composed of a module and its own optimizer.

‘Module’ is Scorch (and PyTorch) terminology for a container with data, or parameters, also known as weights, and a forward function. The forward function takes an input, performs a calculation using the weights, and produces a differentiable output.

Modules can also contain other modules, allowing them to be nested in a tree structure.

For example:
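The original listing sits in the repository; approximately, it looks like this (a sketch of Scorch’s API, names not guaranteed exact):

```scala
import scorch._
import scorch.autograd.Variable
import scorch.nn.{Linear, Module}

// A module nesting another module: a fully connected layer
// whose output is passed through a relu non-linearity.
case class Net(inFeatures: Int, outFeatures: Int) extends Module {
  val fc = Linear(inFeatures, outFeatures) // nested module: the weights live here
  override def forward(x: Variable): Variable = relu(fc(x))
}
```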

This module consists of another module (Linear, a fully connected layer). It passes its input through the fully connected layer, then through a relu function. Note that a module can be made as complex as you want: you can put convolutional, pooling, batchnorm, dropout modules, and so on in there.

Gates and the Flip-Flop Scenario

In the flip-flop scenario, a gate has 2 states: the forward state and the backward state.

In the forward state, the gate accepts a forward message from the previous gate or the sentinel, runs the message content through the forward function and passes the result on to the next gate or sentinel.

Then it swaps to the backward state.

In the backward state it accepts a backward message from the next gate or sentinel.

This message contains the gradients on the activation described above. The gate calculates the local gradients and in turn passes these on as a backward message to the previous layer.

Once this backward message is sent, the optimizer updates the weights of the gate with the gradients using a function such as gradient descent or Adam.

Then it swaps back to the forward state, and waits for the next forward message to arrive.
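Putting the two states together, a flip-flop gate can be sketched like this (assuming Akka classic actors with `context.become`; the message, wire, and gate definitions here approximate the repository’s code and are not exact):

```scala
import akka.actor.{Actor, ActorRef}
import scorch.autograd.Variable
import scorch.nn.Module
import scorch.optim.Optimizer

// Hypothetical message and wiring types.
case class Forward(v: Variable)
case class Backward(g: Variable)
case class Wire(prev: ActorRef, next: ActorRef)

class Gate(module: Module, optimizer: Optimizer, wire: Wire) extends Actor {

  override def receive: Receive = forwardState

  def forwardState: Receive = {
    case Forward(x) =>
      val y = module(x)                  // run the forward function
      wire.next ! Forward(y)             // pass the activation on
      context become backwardState(x, y) // flip
  }

  def backwardState(x: Variable, y: Variable): Receive = {
    case Backward(g) =>
      y.backward(g)                // compute local gradients
      wire.prev ! Backward(x.grad) // send them to the previous gate first...
      optimizer.step()             // ...then update this gate's weights
      context become forwardState  // flop
  }
}
```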

The wire object in the above code holds pointers to the next and previous actors, either another gate or the sentinel.

Gates and the Wheels Scenario

In the Wheels implementation, a gate has only one state which handles all message types.

Since the goal of this scenario is to execute the forward and backward passes simultaneously, we need to keep track of the activations in the forward pass, so that we can use them in the backward pass for the gradient calculation. I do this using a list, which acts as a queue. In the forward pass, the input variable together with the result is appended to the list.

In the backward pass, the first element of the list is popped and used for gradient calculation. The gradient of the backward message and the first element of the list of activations are guaranteed to belong together, since actors process messages sequentially, and our network consists of a single chain (each node has exactly one parent and exactly one child), not a tree or a graph.
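The FIFO pairing can be illustrated with plain Scala (a toy sketch, not the repository’s code, with Ints standing in for activation Variables):

```scala
import scala.collection.immutable.Queue

// Forward passes enqueue (input, output) pairs; backward passes
// dequeue them. Because mailbox messages are processed in order,
// the oldest pair always matches the next incoming gradient.
object QueueDemo extends App {
  var activations = Queue.empty[(Int, Int)]

  // two forward passes arrive before any backward message
  activations = activations.enqueue((1, 10))
  activations = activations.enqueue((2, 20))

  // first backward message: pop the oldest activation pair
  val ((x, y), rest) = activations.dequeue
  activations = rest

  println(s"matched input $x with output $y")
}
```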

Sentinel

The sentinel does a couple of things:

  • provide data for training and validation
  • calculate and report loss and accuracy during training and validation
  • trigger the forward pass for training and validation
  • trigger the backward pass when training

Sentinel and the Flip-Flop Scenario

The sentinel has 3 states: Startpoint, Endpoint, and Validation.

In Startpoint state, it accepts Start and Backward messages.

A Start message indicates the beginning of a new training epoch.

A Backward message means the latest Forward message has come full circle, and the system is ready to accept a new batch of training data. The Backward message is otherwise ignored.

In both cases the Sentinel then forwards the next training batch to the first gate.

Then state changes to Endpoint.

In Endpoint state, it accepts Forward messages. These contain the result of the network for the latest training batch, so the loss function is executed, and the gradients are backpropagated with Backward messages to the last gate of the network.

The sentinel tries to retrieve the next batch of training data from the data loader. If this succeeds, the state changes to Startpoint, which is provided with the new batch.

If the end of the training dataset is reached and no next batch can be retrieved, it does some end-of-epoch administration, and the state changes to Validation.

In Validation state, it sends the validation data in batches to the first gate, which executes the forward pass and forwards the result to the next gate, until the Validation messages arrive back at the sentinel. The sentinel accumulates the loss and accuracy and reports the averages.

Then it sends a Start message to itself, and swaps state to Startpoint.

Sentinel and the Wheels Scenario

In the wheels scenario, we do not have state other than some variables to keep track of the validation score and training and validation loss.

When the Start message is received, it requests the data provider to send a number of training batches to the first gate, one after the other. These will be processed one after the other by the forward pass, without waiting for the backward pass of each message. The backward messages will arrive in the order of the forward messages. It is possible that a backward message is processed before the next forward message, but that is not an issue, since the gates keep track of the messages using a queue.

The Start message also triggers sending the first validation batches through the network. As a consequence, the validation results of the first epoch will be those of an almost untrained network.

But it is important to understand that this allows training and validation to run concurrently, and that forward, backward, and validation messages are all processed simultaneously by the network.

Results on MNIST

Number of training samples: 60000

Number of test samples: 10000

Running a simple network with 3 fully connected layers of sizes 784x50, 50x20, and 20x10 respectively, each followed by a relu non-linearity, with stochastic gradient descent optimization.
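For reference, the equivalent network written as a single Scorch-style module (a sketch with approximate names; in Akkordeon each Linear + relu pair actually runs as its own gate):

```scala
import scorch._
import scorch.autograd.Variable
import scorch.nn.{Linear, Module}

// Three fully connected layers, each followed by a relu,
// as used in both the flip-flop and wheels experiments.
case class Net() extends Module {
  val fc1 = Linear(784, 50)
  val fc2 = Linear(50, 20)
  val fc3 = Linear(20, 10)
  override def forward(x: Variable): Variable =
    relu(fc3(relu(fc2(relu(fc1(x))))))
}
```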

Hardware: 2.3 GHz Intel Core i7 (quad-core, with hyperthreading). No GPU.

Flip-Flop

Gives an accuracy of over 96% after 10 epochs. Average CPU usage is around 150%. Average epoch duration (including validation): 18 seconds.

Wheels

Concurrency level for training is set to 4, and for validation to 1. Average CPU usage is around 300%. Accuracy comparable to flip-flop. Average epoch duration (including validation): 8 seconds.

Conclusion

In this article I demonstrated how a neural net can be implemented in an actor model and how the different phases of neural net training can run concurrently.

I have discussed 2 scenarios: one that switches between states, allowing only one message type to be processed at a given moment, and another that processes all message types simultaneously. The latter is more than twice as fast, with comparable accuracy.

Resources

Source Code

Actor model

Neural networks
