
Back Propagation, the Easy Way (Part 2)

Practical implementation of Back Propagation

Update: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com

In the first part, we saw how back propagation is derived so as to minimize the cost function. In this article, we look at the implementation and at some best practices for avoiding common pitfalls.

We are still in the simple mode, where inputs are handled one sample at a time.

Layer Class

Consider a fully connected neural network such as in the figure below.

Each layer is modelled by a Layer object containing the weights, the activation values (the output of the layer), the gradient dZ (not represented in the image), the cumulative error delta (𝚫), as well as the activation function f(x) and its derivative f'(x). These intermediate values are stored so that they do not have to be recomputed each time they are needed.

Advice: It is better to organize the code around a few classes and avoid cramming everything into arrays, as it is very easy to get lost.

Note that the input layer won’t be represented by a Layer object since it consists only of a vector.

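import numpy as np

class Layer:

    def __init__(self, dim, id, act, act_prime,
                 isoutputLayer=False):
        # weights drawn uniformly from [-1, 1)
        self.weight = 2 * np.random.random(dim) - 1
        self.delta = None   # cumulative error, set by backward
        self.A = None       # activation values, set by forward
        self.dZ = None      # activation derivative at z, set by forward
        self.activation = act
        self.activation_prime = act_prime
        self.isoutputLayer = isoutputLayer
        self.id = id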

The constructor of the Layer class takes as parameters:

  • dim: dimensions of the weight matrix,
  • id: integer as id of the layer,
  • act, act_prime: the activation function and its derivative,
  • isoutputLayer: True if this layer is the output, False otherwise.

It initializes the weights randomly to numbers between -1 and +1, and sets the different variables to be used inside the object.

The layer object has three methods:

  • forward, to compute the layer output.
  • backward, to propagate the error between the target and the output back through the network.
  • update, to update the weights by gradient descent.

def forward(self, x):
    z = np.dot(x, self.weight)
    self.A = self.activation(z)
    self.dZ = self.activation_prime(z)
    return self.A

The forward function takes the input x, computes and stores the output A = activation(x·W), and returns it. It also computes and stores dZ, which is the derivative of the output with respect to the pre-activation z.
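
As a minimal sketch of what forward computes, the snippet below runs one layer by hand on a three-element input (bias unit first). The weight values, the input, and the choice of tanh are illustrative assumptions, not values from the article's code.

```python
import numpy as np

# Illustrative values only: 3 inputs (bias unit first) feeding 2 neurons.
x = np.array([1.0, 0.5, -0.5])
W = np.array([[0.1, -0.2],
              [0.4,  0.3],
              [-0.5, 0.2]])

z = np.dot(x, W)             # pre-activation, shape (2,)
A = np.tanh(z)               # layer output, passed on to the next layer
dZ = 1.0 - np.tanh(z) ** 2   # tanh'(z), evaluated at z and kept for backward

print(z, A, dZ)
```

Note that dZ is evaluated at z, the same quantity that is fed to the activation, and has the same shape as A.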

The backward function takes two parameters: the target y (only used by the output layer) and rightLayer, which is the layer to the right of the current one, that is layer (𝓁+1) if the current one is 𝓁.

It computes the cumulative error delta that propagates from the output leftward to the beginning of the network.

IMPORTANT: a common mistake is to think that backward propagation is some kind of loopback in which the output is injected back into the network. So instead of using dZ = self.activation_prime(z), some people use self.activation_prime(A). This is wrong, simply because what we are trying to do is figure out how the output A varies relative to the input z. This means computing the derivative ∂a/∂z = ∂g(z)/∂z = g'(z), as required by the chain rule. This error might be due to the fact that, for the sigmoid activation function a = 𝜎(z), the derivative is 𝜎'(z) = 𝜎(z)(1-𝜎(z)) = a(1-a), which gives the illusion that the output is injected back into the network, while in truth we are computing 𝜎'(z).
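
The difference can be checked numerically. The sketch below assumes a standard sigmoid and a sigmoid_prime that takes z as its argument (both defined here for illustration), and compares the correct derivative with the mistaken call on the output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative with respect to the pre-activation z
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 3.0])
a = sigmoid(z)

correct = sigmoid_prime(z)   # what forward should store in dZ
shortcut = a * (1.0 - a)     # same numbers, written in terms of a
wrong = sigmoid_prime(a)     # treats the output as if it were z

print(np.allclose(correct, shortcut))  # True
print(np.allclose(correct, wrong))     # False
```

Writing a(1-a) is fine as an optimization for sigmoid, but passing A to a function that expects z silently computes a different quantity.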

def backward(self, y, rightLayer):
    if self.isoutputLayer:
        error =  self.A - y
        self.delta = np.atleast_2d(error * self.dZ)
    else:
        self.delta = np.atleast_2d(
            rightLayer.delta.dot(rightLayer.weight.T)
            * self.dZ)
    return self.delta

What the backward function does is to compute and return the delta, based on the formula:
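
Read off the backward code above, in row-vector convention with ⊙ as the element-wise product, the two cases can be written as follows. The superscript L for the output layer and 𝓁+1 for the layer to the right are notational assumptions that follow the indexing used in the code.

```latex
\delta^{(L)} = \left(a^{(L)} - y\right) \odot f'\!\left(z^{(L)}\right)
\qquad
\delta^{(\ell)} = \left(\delta^{(\ell+1)} \, {W^{(\ell+1)}}^{\top}\right) \odot f'\!\left(z^{(\ell)}\right)
```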

Finally, the update function uses gradient descent to update the weights of the current layer. Its left_a parameter is the output of the layer to the left, or the input sample for the first layer.

def update(self, learning_rate, left_a):
    a = np.atleast_2d(left_a)
    d = np.atleast_2d(self.delta)
    ad = a.T.dot(d)
    self.weight -= learning_rate * ad
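
One way to gain confidence in backward and update is a numerical gradient check. The sketch below is an assumption-laden illustration: it uses a single sigmoid output layer and a half squared-error cost, and compares the analytic gradient a.T.dot(delta) with central finite differences. The helper name cost and the sample values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, x, y):
    # assumed cost: half the squared error of a single sigmoid layer
    a = sigmoid(np.dot(x, W))
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])     # input sample, bias unit first
y = np.array([1.0])               # target
W = 2 * rng.random((3, 1)) - 1    # same init range as Layer

# analytic gradient, following backward (output layer) and update
a = sigmoid(np.dot(x, W))
delta = np.atleast_2d((a - y) * a * (1.0 - a))
grad = np.atleast_2d(x).T.dot(delta)   # shape (3, 1), same as W

# numerical gradient by central differences
eps = 1e-6
num = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    num[idx] = (cost(Wp, x, y) - cost(Wm, x, y)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-7))
```

If the two gradients disagree, the usual suspects are the derivative being evaluated at A instead of z, or a transposed operand in the dot product.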

NeuralNetwork class

As one might guess, layers form a network, so the NeuralNetwork class is used to organize and coordinate the layers. Its constructor takes the configuration of the layers, an array whose length determines the number of layers in the network and whose elements give the number of nodes in the corresponding layers. For example, [2, 4, 5, 1] means that the network has 4 layers: an input layer with 2 nodes, two hidden layers with 4 and 5 nodes respectively, and an output layer with 1 node. The second parameter is the type of activation function to use for all layers.
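
To make the configuration concrete, the sketch below works out the weight-matrix shapes for a given layer list. layer_dims is a hypothetical helper, not part of the article's code; it only mirrors the dimension arithmetic of the NeuralNetwork constructor, which adds 1 for the bias unit.

```python
def layer_dims(layersDim):
    # hidden layers: the constructor adds 1 to both dimensions (bias handling)
    dims = [(layersDim[i - 1] + 1, layersDim[i] + 1)
            for i in range(1, len(layersDim) - 1)]
    # output layer: extra row for the bias input, no extra output column
    dims.append((layersDim[-2] + 1, layersDim[-1]))
    return dims

print(layer_dims([2, 4, 5, 1]))  # [(3, 5), (5, 6), (6, 1)]
```

A 4-entry configuration yields 3 weight matrices, since the input layer is not represented by a Layer object.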

The fit function is where all the training happens. It starts by selecting one input sample at random and runs the forward pass over all the layers. It then computes the error between the output of the network and the target value, and propagates this error through the network by calling the backward function of each layer in reverse order, from the last layer to the first. Finally, the update function is called for each layer to update the weights.

These steps are repeated a number of times determined by the parameter epochs.

After the training is complete, the predict function can be called to test an input. The predict function is simply a forward pass through the whole network.

class NeuralNetwork:

    def __init__(self, layersDim, activation='tanh'):
        if activation == 'sigmoid':
            self.activation = sigmoid
            self.activation_prime = sigmoid_prime
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_prime = tanh_prime
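        elif activation == 'relu':
            self.activation = relu
            self.activation_prime = relu_prime
        else:
            # fail early instead of leaving self.activation undefined
            raise ValueError("unknown activation: " + str(activation))
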
        self.layers = []
        for i in range(1, len(layersDim) - 1):
            dim = (layersDim[i - 1] + 1, layersDim[i] + 1)
            self.layers.append(Layer(dim, i, self.activation, self.activation_prime))

        # output layer; indexing from the end also works when there is no hidden layer
        dim = (layersDim[-2] + 1, layersDim[-1])
        self.layers.append(Layer(dim, len(layersDim) - 1, self.activation, self.activation_prime, True))

    # train the network
    def fit(self, X, y, learning_rate=0.1, epochs=10000):
        # Add column of ones to X
        # This is to add the bias unit to the input layer
        ones = np.atleast_2d(np.ones(X.shape[0]))
        X = np.concatenate((ones.T, X), axis=1)

        for k in range(epochs):

            i = np.random.randint(X.shape[0])
            a = X[i]

            # compute the feed forward
            for l in range(len(self.layers)):
                a = self.layers[l].forward(a)

            # compute the backward propagation
            delta = self.layers[-1].backward(y[i], None)

            for l in range(len(self.layers) - 2, -1, -1):
                delta = self.layers[l].backward(delta, self.layers[l+1])

            # update weights
            a = X[i]
            for layer in self.layers:
                layer.update(learning_rate, a)
                a = layer.A

    # predict input
    def predict(self, x):
        a = np.concatenate((np.ones(1).T, np.array(x)), axis=0)
        for l in range(0, len(self.layers)):
            a = self.layers[l].forward(a)
        return a
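
The class above refers to sigmoid, tanh, relu and their derivatives, which are not listed in this article. A minimal sketch of what they might look like is given below; these are the standard definitions, written as functions of the pre-activation z, and are an assumption rather than the author's exact code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # convention chosen here: the derivative is taken as 0 at z == 0
    return (np.asarray(z) > 0).astype(float)
```

Every derivative takes z, not the activation output, which matches the way forward calls self.activation_prime(z).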

Running The Network

To run the network, we take as an example the approximation of the XOR function.
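
As a sketch of the training data, the XOR truth table can be written as two NumPy arrays. The snippet also reproduces the bias-column step from the top of fit, to show what a sample looks like once it enters the first layer; the variable name X_bias is only for illustration.

```python
import numpy as np

# XOR truth table: inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# same bias-column step as at the top of fit
ones = np.atleast_2d(np.ones(X.shape[0]))
X_bias = np.concatenate((ones.T, X), axis=1)

print(X_bias.shape)  # (4, 3)
print(X_bias[1])     # [1. 0. 1.]
```

Since fit and predict add the bias unit themselves, the raw X and y are what get passed to them.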

We try several network configurations, using different learning rates and numbers of epochs. The results are listed below:


Result with tanh
[0 0] [-0.00011187]
[0 1] [ 0.98090146]
[1 0] [ 0.97569382]
[1 1] [ 0.00128179]
Result with sigmoid
[0 0] [ 0.01958287]
[0 1] [ 0.96476513]
[1 0] [ 0.97699611]
[1 1] [ 0.05132127]
Result with relu
[0 0] [ 0.]
[0 1] [ 1.]
[1 0] [ 1.]
[1 1] [ 4.23272528e-16]

It is advisable that you try different configurations and see for yourself which one gives the best and most stable results.

The Source Code

The full code can be downloaded here.

Conclusion

Back propagation can be confusing and tricky to implement. You might have the illusion that you have grasped it through the theory, but the truth is that, when implementing it, it is easy to fall into many traps. You should be patient and persistent, as back propagation is a cornerstone of neural networks.

Related Articles

  • Part 1: Simple detailed explanation of the back propagation
  • Part 3: How to handle dimensions of matrices

