Multiclass Classification Neural Network using Adam Optimizer

This post is a continuation of my blog series where I use numpy's einsum to implement the complete model.

ManiShankar Singh
Towards Data Science


I wanted to see the difference between the Adam optimizer and the gradient descent optimizer in a more hands-on way, so I decided to implement Adam myself.

In this post, I have taken the iris dataset and implemented a two-layer multiclass classification neural network, as done in my previous blog. The only difference this time is that I have used the Adam optimizer instead of gradient descent.

import numpy as np
import matplotlib.pyplot as plt
from numpy.lib.stride_tricks import as_strided
from sklearn.datasets import load_iris
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

def leaky_relu(z):
    return np.maximum(0.01*z, z)

def leaky_relu_prime(z):
    # derivative of leaky ReLU: 1 where z > 0, 0.01 elsewhere
    return np.where(z > 0, 1.0, 0.01)

def softmax(z, _axis=0):
    # shift by the max for numerical stability before exponentiating
    stable_z = z - np.max(z)
    e_z = np.exp(stable_z)
    return e_z/np.sum(e_z, axis=_axis, keepdims=True)

def binarize(z):
    # one-hot encode integer class labels
    return (z[:, None] == np.arange(z.max()+1)).astype(int)
def pasapali_batch(units_count, x, y, lr, epochs, bias=False, _seed=42):
    beta1 = 0.9
    beta2 = 0.999
    eps = np.nextafter(0, 1)
    batch_size, ni = x.shape[-2:]
    units_count.insert(0, ni)
    units_count_arr = np.array(units_count)
    L, = units_count_arr.shape  # number of layers + 1
    # RED ALERT - `as_strided` is like a LAND-MINE ready to explode in wrong hands!
    # Overlapping windows of size 2: row i holds (inputs, outputs) of layer i.
    stride = units_count_arr.strides[0]
    arr_view = as_strided(units_count_arr, shape=(L-1, 2), strides=(stride, stride))
    # print(arr_view)
    rng = np.random.default_rng(seed=_seed)
    wghts = [None]*(L-1)
    intercepts = [None]*(L-1)
    M_W = [None]*(L-1)
    M_B = [None]*(L-1)
    V_W = [None]*(L-1)
    V_B = [None]*(L-1)
    # WEIGHTS & MOMENTS INITIALIZATION
    for i in range(L-1):
        w_cols, w_rows = arr_view[i, :]
        wghts[i] = rng.random((w_rows, w_cols))
        M_W[i] = np.zeros((epochs+1, w_rows, w_cols))
        V_W[i] = np.zeros((epochs+1, w_rows, w_cols))
        if bias:
            intercepts[i] = rng.random((w_rows,))
            M_B[i] = np.zeros((epochs+1, w_rows))
            V_B[i] = np.zeros((epochs+1, w_rows))

    # COSTS INITIALIZATION
    costs = np.zeros(epochs)

    # TRAINING LOOP (forward pass, backpropagation, Adam updates)
    for epoch in range(epochs):
        # FORWARD PROPAGATION
        # hidden layer 1, leaky ReLU activation
        h1a = np.einsum('hi,Bi -> Bh', wghts[0], x)
        if bias:
            h1a = h1a + intercepts[0]
        h1 = leaky_relu(h1a)
        # layer 2 (output), softmax activation
        h2a = np.einsum('ho,Bo -> Bh', wghts[1], h1)
        if bias:
            h2a = h2a + intercepts[1]
        hyp = softmax(h2a, _axis=1)
        current_epoch_cost = -np.einsum('Bi,Bi', y, np.log(hyp))/batch_size
        # print(current_epoch_cost)
        costs[epoch] = current_epoch_cost
        # BACKWARD PROPAGATION
        # layer 2
        dJ_dH2a = hyp - y
        dJ_dW1 = np.einsum('Bi,Bj -> ij', dJ_dH2a, h1)/batch_size
        # layer 1
        dJ_dH1 = np.einsum('Bi,ij -> Bj', dJ_dH2a, wghts[1])
        dJ_dH1a = dJ_dH1*leaky_relu_prime(h1a)
        dJ_dW0 = np.einsum('Bi,Bj -> ij', dJ_dH1a, x)/batch_size
        # ADAM OPTIMIZATION
        beta1_denom = (1.0 - beta1**(epoch+1))
        beta2_denom = (1.0 - beta2**(epoch+1))
        if bias:
            dJ_dB1 = np.einsum('Bi -> i', dJ_dH2a)/batch_size
            dJ_dB0 = np.einsum('Bi -> i', dJ_dH1a)/batch_size
            # MOMENTS ADJUSTMENT
            M_B[0][epoch+1,:] = beta1 * M_B[0][epoch,:] + (1.0 - beta1)*dJ_dB0
            M_B[1][epoch+1,:] = beta1 * M_B[1][epoch,:] + (1.0 - beta1)*dJ_dB1

            V_B[0][epoch+1,:] = beta2 * V_B[0][epoch,:] + (1.0 - beta2)*dJ_dB0**2
            V_B[1][epoch+1,:] = beta2 * V_B[1][epoch,:] + (1.0 - beta2)*dJ_dB1**2
            # BIAS CORRECTION
            mhat_b0 = M_B[0][epoch+1,:] / beta1_denom
            vhat_b0 = V_B[0][epoch+1,:] / beta2_denom

            mhat_b1 = M_B[1][epoch+1,:] / beta1_denom
            vhat_b1 = V_B[1][epoch+1,:] / beta2_denom
            # INTERCEPTS ADJUSTMENT with numerical stability
            intercepts[1] = intercepts[1] - lr*mhat_b1/(np.sqrt(vhat_b1) + eps)
            intercepts[0] = intercepts[0] - lr*mhat_b0/(np.sqrt(vhat_b0) + eps)

        # MOMENTS ADJUSTMENT
        M_W[0][epoch+1,:] = beta1 * M_W[0][epoch,:] + (1.0 - beta1)*dJ_dW0
        M_W[1][epoch+1,:] = beta1 * M_W[1][epoch,:] + (1.0 - beta1)*dJ_dW1

        V_W[0][epoch+1,:] = beta2 * V_W[0][epoch,:] + (1.0 - beta2)*dJ_dW0**2
        V_W[1][epoch+1,:] = beta2 * V_W[1][epoch,:] + (1.0 - beta2)*dJ_dW1**2
        # BIAS CORRECTION
        mhat_w0 = M_W[0][epoch+1,:] / beta1_denom
        vhat_w0 = V_W[0][epoch+1,:] / beta2_denom

        mhat_w1 = M_W[1][epoch+1,:] / beta1_denom
        vhat_w1 = V_W[1][epoch+1,:] / beta2_denom
        # WEIGHTS ADJUSTMENT with numerical stability
        wghts[1] = wghts[1] - lr*mhat_w1/(np.sqrt(vhat_w1) + eps)
        wghts[0] = wghts[0] - lr*mhat_w0/(np.sqrt(vhat_w0) + eps)
    if bias:
        return (costs, wghts, intercepts)
    else:
        return (costs, wghts)
iris = load_iris()
x = iris.data
y = iris.target
# NORMALIZE
x_norm = normalize(x)
x_train, x_test, y_train, y_test = train_test_split(x_norm, y, test_size=0.33, shuffle=True, random_state=42)
# BINARIZE (one-hot encode the labels)
y_train = binarize(y_train)
y_test = binarize(y_test)
unit_per_layer_counts = [10, 3]
costs, fw, fb = pasapali_batch(unit_per_layer_counts, x_train, y_train, lr=0.01, epochs=200, bias=True)
plt.plot(costs)

def predict(x, fw, fb):
    h1a = np.einsum('hi,Bi -> Bh', fw[0], x) + fb[0]
    h1 = leaky_relu(h1a)
    h2a = np.einsum('ho,Bo -> Bh', fw[1], h1) + fb[1]
    return softmax(h2a, _axis=1)
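
As a quick check, here is a minimal usage sketch for predict (my own addition, assuming the fw, fb, x_test and y_test defined above): take the class with the highest softmax probability and compare it with the one-hot test labels.

probs = predict(x_test, fw, fb)            # softmax probabilities, shape (n_test, 3)
pred_labels = np.argmax(probs, axis=1)     # predicted class per sample
true_labels = np.argmax(y_test, axis=1)    # undo the one-hot encoding
print("Test accuracy:", np.mean(pred_labels == true_labels))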

In the code, the difference is that I have initialized two moment arrays for each layer's weights and biases, and updated (or should I write adapted…) these moments as per the Adam optimization algorithm.

In a normal gradient descent optimizer, the weights are adjusted based on the gradient calculated in the same epoch.

Weights adjustment in gradient descent depends on the current gradient only. — Image owned by Author
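
Written out, the plain gradient descent step is simply

w := w - lr * dJ/dw

which, in the code above, would amount to replacing the whole Adam update block with single lines like wghts[0] = wghts[0] - lr*dJ_dW0.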

In the Adam optimizer, the weights are adjusted based on moving averages of the gradients from the current and previous epochs. The Adam algorithm maintains these moving averages, called moments, and then uses them, after a bias correction, to update the weights.

Weights adjustment in Adam optimizer depends on current as well as previous gradients. — Image owned by Author
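
To make this concrete, here is a minimal, self-contained sketch of a single Adam step for one scalar parameter (my own illustration; it mirrors what the batch function above does element-wise for every weight and intercept):

import numpy as np

def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for parameter w with gradient g at step t (t starts at 1).
    m = beta1 * m + (1.0 - beta1) * g        # first moment: moving average of the gradients
    v = beta2 * v + (1.0 - beta2) * g**2     # second moment: moving average of the squared gradients
    m_hat = m / (1.0 - beta1**t)             # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v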

In the paper, beta1 = 0.9 and the first moment m is updated based on the following formula:

Adaptive first moment formula according to the paper — Image owned by Author
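
In plain notation, with g_t denoting the gradient at epoch t and m_0 = 0:

m_t = beta1 · m_(t-1) + (1 - beta1) · g_t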

Let us expand the above formula step by step for each epoch.

Expansion of the formula for first moment m in three epochs. — Image owned by Author
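
Written out for the first three epochs:

m_1 = (1 - beta1) · g_1
m_2 = beta1·(1 - beta1)·g_1 + (1 - beta1)·g_2
m_3 = beta1^2·(1 - beta1)·g_1 + beta1·(1 - beta1)·g_2 + (1 - beta1)·g_3

With beta1 = 0.9 this gives m_3 = 0.081·g_1 + 0.09·g_2 + 0.1·g_3, so older gradients still contribute, but with geometrically shrinking weights.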

We see that in each epoch the previous gradients are included in the update, but the weights assigned to gradients that are further away from the current epoch's gradient get smaller and smaller. This helps in moving towards the minimum while simultaneously damping the oscillations of the gradient during the search, and it gives us the speed to cross saddle points.

Let us talk a bit about the second moment v. During the weights adjustment, the step is divided by the square root of the (bias-corrected) v, which is essentially the root mean square of the recent gradients. This adapts the learning rate for each individual weight w: the weights whose gradients have relatively large magnitude get a larger value of v and hence a smaller learning step in that direction. This helps us slow down so that we do not overshoot the minimum.
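
The second moment follows the same moving-average pattern as m, just with the squared gradient and beta2 = 0.999:

v_t = beta2 · v_(t-1) + (1 - beta2) · g_t^2

As a rough numeric example, a weight whose gradients hover around 10 accumulates a v about 100 times larger than a weight whose gradients hover around 1, so its step, divided by the square root of v, is about 10 times smaller.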

Finally, let us talk about the bias correction. The original paper gives the mathematical derivation and explanation. In layman's terms, it is sufficient to know that both moments start at zero and are therefore biased towards zero in the early epochs; the bias correction compensates for this, which matters especially when gradients are sparse, since leaving it out would lead to overly large steps.
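
Concretely, the corrected moments are

m_hat_t = m_t / (1 - beta1^t)
v_hat_t = v_t / (1 - beta2^t)

At t = 1, for example, m_1 = (1 - beta1)·g_1 = 0.1·g_1, and dividing by (1 - beta1^1) = 0.1 recovers g_1 exactly, so the zero initialization does not distort the very first steps.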

Let us compare the convergence of the cost function for the gradient descent optimizer and the Adam optimizer.

Decay in cost function using gradient descent — Image owned by Author
Decay in cost function using Adam optimizer — Image owned by Author

The Adam optimizer took just 250 epochs to reach the optimal cost value, while gradient descent took 19,000 epochs. Reminds me of the superhero Flash!!

Time taken by the gradient descent optimizer to complete 19,000 epochs is 2.98 s — Image owned by Author
Time taken by the Adam optimizer to complete 250 epochs is roughly 87.3 ms — Image owned by Author

This is a tremendous improvement in the speed of convergence. And Adam is not only about speed; it also works well with sparse and very noisy gradients.

Please let me know your views on this blog post and the code implementation in the comments.

I am a machine learning engineer at TCS and my (Digital Software & Solutions) team is building amazing products.


I am currently working as a machine learning engineer, with a focus on machine learning using numpy, scikit-learn and TensorFlow.