Multiclass Classification Neural Network using Adam Optimizer

This post is a continuation of my blog series where I use numpy's einsum to implement the complete model.

ManiShankar Singh
Towards Data Science


I wanted to see the difference between the Adam optimizer and the gradient descent optimizer in a more hands-on way, so I decided to implement Adam myself.

In this post, I have taken the iris dataset and implemented a two-layer multiclass classification neural network, as done in my previous blog. The only difference this time is that I have used the Adam optimizer instead of gradient descent.

import numpy as np
import matplotlib.pyplot as plt
from numpy.lib.stride_tricks import as_strided
from sklearn.datasets import load_iris
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

def leaky_relu(z):
    return np.maximum(0.01*z, z)

def leaky_relu_prime(z):
    # derivative of leaky ReLU: 1 where z > 0, 0.01 elsewhere
    return np.where(z > 0, 1.0, 0.01)

def softmax(z, _axis=0):
    # shift by the max for numerical stability before exponentiating
    stable_z = z - np.max(z)
    e_z = np.exp(stable_z)
    return e_z/np.sum(e_z, axis=_axis, keepdims=True)

def binarize(z):
    # one-hot encode integer class labels
    return (z[:, None] == np.arange(z.max()+1)).astype(int)
def pasapali_batch(units_count, x, y, lr, epochs, bias=False, _seed=42):
    beta1 = 0.9
    beta2 = 0.999
    eps = np.nextafter(0, 1)
    batch_size, ni = x.shape[-2:]
    units_count.insert(0, ni)
    units_count_arr = np.array(units_count)
    L, = units_count_arr.shape  # number of layers + 1
    # RED ALERT - `as_strided` is like a LAND-MINE ready to explode in wrong hands!
    # Overlapping windows of size 2: row i holds (inputs, outputs) of layer i.
    stride = units_count_arr.strides[0]
    arr_view = as_strided(units_count_arr, shape=(L-1, 2), strides=(stride, stride))
    # print(arr_view)
    rng = np.random.default_rng(seed=_seed)
    wghts = [None]*(L-1)
    intercepts = [None]*(L-1)
    M_W = [None]*(L-1)
    M_B = [None]*(L-1)
    V_W = [None]*(L-1)
    V_B = [None]*(L-1)
    # WEIGHTS & MOMENTS INITIALIZATION
    for i in range(L-1):
        w_cols, w_rows = arr_view[i, :]
        wghts[i] = rng.random((w_rows, w_cols))
        M_W[i] = np.zeros((epochs+1, w_rows, w_cols))
        V_W[i] = np.zeros((epochs+1, w_rows, w_cols))
        if bias:
            intercepts[i] = rng.random((w_rows,))
            M_B[i] = np.zeros((epochs+1, w_rows))
            V_B[i] = np.zeros((epochs+1, w_rows))

    # COSTS INITIALIZATION
    costs = np.zeros(epochs)

    # TRAINING LOOP (forward pass, backpropagation, Adam updates)
    for epoch in range(epochs):
        # FORWARD PROPAGATION
        # hidden layer 1, leaky ReLU activation
        h1a = np.einsum('hi,Bi -> Bh', wghts[0], x)
        if bias:
            h1a = h1a + intercepts[0]
        h1 = leaky_relu(h1a)
        # layer 2 (output), softmax activation
        h2a = np.einsum('ho,Bo -> Bh', wghts[1], h1)
        if bias:
            h2a = h2a + intercepts[1]
        hyp = softmax(h2a, _axis=1)
        current_epoch_cost = -np.einsum('Bi,Bi', y, np.log(hyp))/batch_size
        # print(current_epoch_cost)
        costs[epoch] = current_epoch_cost
        # BACKWARD PROPAGATION
        # layer 2
        dJ_dH2a = hyp - y
        dJ_dW1 = np.einsum('Bi,Bj -> ij', dJ_dH2a, h1)/batch_size
        # layer 1
        dJ_dH1 = np.einsum('Bi,ij -> Bj', dJ_dH2a, wghts[1])
        dJ_dH1a = dJ_dH1*leaky_relu_prime(h1a)
        dJ_dW0 = np.einsum('Bi,Bj -> ij', dJ_dH1a, x)/batch_size
        # ADAM OPTIMIZATION
        beta1_denom = (1.0 - beta1**(epoch+1))
        beta2_denom = (1.0 - beta2**(epoch+1))
        if bias:
            dJ_dB1 = np.einsum('Bi -> i', dJ_dH2a)/batch_size
            dJ_dB0 = np.einsum('Bi -> i', dJ_dH1a)/batch_size
            # MOMENTS ADJUSTMENT
            M_B[0][epoch+1,:] = beta1 * M_B[0][epoch,:] + (1.0 - beta1)*dJ_dB0
            M_B[1][epoch+1,:] = beta1 * M_B[1][epoch,:] + (1.0 - beta1)*dJ_dB1

            V_B[0][epoch+1,:] = beta2 * V_B[0][epoch,:] + (1.0 - beta2)*dJ_dB0**2
            V_B[1][epoch+1,:] = beta2 * V_B[1][epoch,:] + (1.0 - beta2)*dJ_dB1**2
            # BIAS CORRECTION
            mhat_b0 = M_B[0][epoch+1,:] / beta1_denom
            vhat_b0 = V_B[0][epoch+1,:] / beta2_denom

            mhat_b1 = M_B[1][epoch+1,:] / beta1_denom
            vhat_b1 = V_B[1][epoch+1,:] / beta2_denom
            # INTERCEPTS ADJUSTMENT with numerical stability
            intercepts[1] = intercepts[1] - lr*mhat_b1/(np.sqrt(vhat_b1) + eps)
            intercepts[0] = intercepts[0] - lr*mhat_b0/(np.sqrt(vhat_b0) + eps)

        # MOMENTS ADJUSTMENT
        M_W[0][epoch+1,:] = beta1 * M_W[0][epoch,:] + (1.0 - beta1)*dJ_dW0
        M_W[1][epoch+1,:] = beta1 * M_W[1][epoch,:] + (1.0 - beta1)*dJ_dW1

        V_W[0][epoch+1,:] = beta2 * V_W[0][epoch,:] + (1.0 - beta2)*dJ_dW0**2
        V_W[1][epoch+1,:] = beta2 * V_W[1][epoch,:] + (1.0 - beta2)*dJ_dW1**2
        # BIAS CORRECTION
        mhat_w0 = M_W[0][epoch+1,:] / beta1_denom
        vhat_w0 = V_W[0][epoch+1,:] / beta2_denom

        mhat_w1 = M_W[1][epoch+1,:] / beta1_denom
        vhat_w1 = V_W[1][epoch+1,:] / beta2_denom
        # WEIGHTS ADJUSTMENT with numerical stability
        wghts[1] = wghts[1] - lr*mhat_w1/(np.sqrt(vhat_w1) + eps)
        wghts[0] = wghts[0] - lr*mhat_w0/(np.sqrt(vhat_w0) + eps)
    if bias:
        return (costs, wghts, intercepts)
    else:
        return (costs, wghts)
iris = load_iris()
x = iris.data
y = iris.target
# NORMALIZE
x_norm = normalize(x)
x_train, x_test, y_train, y_test = train_test_split(x_norm, y, test_size=0.33, shuffle=True, random_state=42)
# BINARIZE (one-hot encode the labels)
y_train = binarize(y_train)
y_test = binarize(y_test)
unit_per_layer_counts = [10, 3]
costs, fw, fb = pasapali_batch(unit_per_layer_counts, x_train, y_train, lr=0.01, epochs=200, bias=True)
plt.plot(costs)

def predict(x, fw, fb):
    h1a = np.einsum('hi,Bi -> Bh', fw[0], x) + fb[0]
    h1 = leaky_relu(h1a)
    h2a = np.einsum('ho,Bo -> Bh', fw[1], h1) + fb[1]
    return softmax(h2a, _axis=1)
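
As a quick check, here is a minimal usage sketch for predict (my own addition, assuming the fw, fb, x_test and y_test defined above): take the class with the highest softmax probability and compare it with the one-hot test labels.

probs = predict(x_test, fw, fb)            # softmax probabilities, shape (n_test, 3)
pred_labels = np.argmax(probs, axis=1)     # predicted class per sample
true_labels = np.argmax(y_test, axis=1)    # undo the one-hot encoding
print("Test accuracy:", np.mean(pred_labels == true_labels))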

In the code, the difference is that I have initialized two moment arrays for each layer's weights and biases, and updated (or should I write adapted…) these moments as per the Adam optimization algorithm.

In a normal gradient descent optimizer, the weights are adjusted based on the gradient calculated in the same epoch.

Weights adjustment in gradient descent depends on the current gradient only. — Image owned by Author
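
Written out, the plain gradient descent step is simply

w := w - lr * dJ/dw

which, in the code above, would amount to replacing the whole Adam update block with single lines like wghts[0] = wghts[0] - lr*dJ_dW0.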

In the Adam optimizer, the weights are adjusted based on moving averages of the gradients from the current and previous epochs. The Adam algorithm maintains these moving averages, called moments, and then uses them, after a bias correction, to update the weights.

Weights adjustment in Adam optimizer depends on current as well as previous gradients. — Image owned by Author
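
To make this concrete, here is a minimal, self-contained sketch of a single Adam step for one scalar parameter (my own illustration; it mirrors what the batch function above does element-wise for every weight and intercept):

import numpy as np

def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for parameter w with gradient g at step t (t starts at 1).
    m = beta1 * m + (1.0 - beta1) * g        # first moment: moving average of the gradients
    v = beta2 * v + (1.0 - beta2) * g**2     # second moment: moving average of the squared gradients
    m_hat = m / (1.0 - beta1**t)             # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v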

In the paper, beta1 = 0.9 and the first moment m is updated based on the following formula:

Adaptive first moment formula according to the paper — Image owned by Author
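
In plain notation, with g_t denoting the gradient at epoch t and m_0 = 0:

m_t = beta1 · m_(t-1) + (1 - beta1) · g_t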

Let us expand the above formula step by step for each epoch.

Expansion of the formula for first moment m in three epochs. — Image owned by Author
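
Written out for the first three epochs:

m_1 = (1 - beta1) · g_1
m_2 = beta1·(1 - beta1)·g_1 + (1 - beta1)·g_2
m_3 = beta1^2·(1 - beta1)·g_1 + beta1·(1 - beta1)·g_2 + (1 - beta1)·g_3

With beta1 = 0.9 this gives m_3 = 0.081·g_1 + 0.09·g_2 + 0.1·g_3, so older gradients still contribute, but with geometrically shrinking weights.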

We see that in each epoch the previous gradients are included in the update, but the weights assigned to gradients that are further away from the current epoch's gradient get smaller and smaller. This helps in moving towards the minimum while simultaneously damping the oscillations of the gradient during the search, and it gives us the speed to cross saddle points.

Let us talk a bit about the second moment v. During the weights adjustment, the step is divided by the square root of the (bias-corrected) v, which is essentially the root mean square of the recent gradients. This adapts the learning rate for each individual weight w: the weights whose gradients have relatively large magnitude get a larger value of v and hence a smaller learning step in that direction. This helps us slow down so that we do not overshoot the minimum.
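
The second moment follows the same moving-average pattern as m, just with the squared gradient and beta2 = 0.999:

v_t = beta2 · v_(t-1) + (1 - beta2) · g_t^2

As a rough numeric example, a weight whose gradients hover around 10 accumulates a v about 100 times larger than a weight whose gradients hover around 1, so its step, divided by the square root of v, is about 10 times smaller.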

Finally, let us talk about the bias correction. The original paper gives the mathematical derivation and explanation. In layman's terms, it is sufficient to know that both moments start at zero and are therefore biased towards zero in the early epochs; the bias correction compensates for this, which matters especially when gradients are sparse, since leaving it out would lead to overly large steps.
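
Concretely, the corrected moments are

m_hat_t = m_t / (1 - beta1^t)
v_hat_t = v_t / (1 - beta2^t)

At t = 1, for example, m_1 = (1 - beta1)·g_1 = 0.1·g_1, and dividing by (1 - beta1^1) = 0.1 recovers g_1 exactly, so the zero initialization does not distort the very first steps.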

Let us compare the convergence of the cost function for the gradient descent optimizer and the Adam optimizer.

Decay in cost function using gradient descent — Image owned by Author
Decay in cost function using Adam optimizer — Image owned by Author

The Adam optimizer took just 250 epochs to reach the optimal cost value, while gradient descent took 19,000 epochs. Reminds me of the superhero Flash!!

Time taken by the gradient descent optimizer to complete 19,000 epochs is 2.98 s — Image owned by Author
Time taken by the Adam optimizer to complete 250 epochs is roughly 87.3 ms — Image owned by Author

This is a tremendous improvement in the speed of convergence. And Adam is not only about speed; it also works well with sparse and very noisy gradients.

Please let me know your views on this blog post and the code implementation in the comments.

I am a machine learning engineer at TCS and my (Digital Software & Solutions) team is building amazing products.


I am currently working as a machine learning engineer, with a focus on machine learning using numpy, scikit-learn and TensorFlow.