
I wanted to see the difference between the Adam optimizer and the gradient descent optimizer in a more hands-on way, so I decided to implement it myself.
In this post, I have taken the Iris dataset and implemented a two-layer neural network for multiclass classification, as done in my previous blog. The only difference this time is that I have used the Adam optimizer instead of gradient descent.
import numpy as np
import matplotlib.pyplot as plt
from numpy.lib.stride_tricks import as_strided
from sklearn.datasets import load_iris
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

def leaky_relu(z):
    return np.maximum(0.01*z, z)

def leaky_relu_prime(z):
    # np.where keeps the 0.01 and 1 branches separate and leaves the input untouched
    return np.where(z > 0, 1.0, 0.01)

def softmax(z, _axis=0):
    stable_z = z - np.max(z, axis=_axis, keepdims=True)  # subtract the max for numerical stability
    e_z = np.exp(stable_z)
    return e_z/np.sum(e_z, axis=_axis, keepdims=True)

def binarize(z):
    # one-hot encode integer class labels
    return (z[:, None] == np.arange(z.max()+1)).astype(int)
def pasapali_batch(units_count, x, y, lr, epochs, bias=False, _seed=42):
    beta1 = 0.9
    beta2 = 0.999
    eps = np.nextafter(0, 1)   # smallest positive float, used for numerical stability
    batch_size, ni = x.shape[-2:]
    units_count.insert(0, ni)
    units_count_arr = np.array(units_count)
    L, = units_count_arr.shape  # Number of layers + 1
    # RED ALERT - `as_strided` function is like a LAND-MINE ready to explode in wrong hands!
    # Use the array's own itemsize rather than hard-coding 4-byte strides.
    itemsize = units_count_arr.itemsize
    arr_view = as_strided(units_count_arr, shape=(L-1, 2), strides=(itemsize, itemsize))
    # print(arr_view)
    rng = np.random.default_rng(seed=_seed)
    wghts = [None]*(L-1)
    intercepts = [None]*(L-1)
    M_W = [None]*(L-1)
    M_B = [None]*(L-1)
    V_W = [None]*(L-1)
    V_B = [None]*(L-1)
    # WEIGHTS & MOMENTS INITIALIZATION
    for i in range(L-1):
        w_cols, w_rows = arr_view[i, :]
        wghts[i] = rng.random((w_rows, w_cols))
        M_W[i] = np.zeros((epochs+1, w_rows, w_cols))
        V_W[i] = np.zeros((epochs+1, w_rows, w_cols))
        if bias:
            intercepts[i] = rng.random((w_rows,))
            M_B[i] = np.zeros((epochs+1, w_rows))
            V_B[i] = np.zeros((epochs+1, w_rows))
    # COSTS INITIALIZATION
    costs = np.zeros(epochs)
    # Gradient descent loop (with Adam updates)
    for epoch in range(epochs):
        # FORWARD PROPAGATION
        # hidden layer 1, leaky ReLU activation
        h1a = np.einsum('hi,Bi -> Bh', wghts[0], x)
        if bias:
            h1a = h1a + intercepts[0]
        h1 = leaky_relu(h1a)
        # layer 2 (output), softmax activation
        h2a = np.einsum('ho,Bo -> Bh', wghts[1], h1)
        if bias:
            h2a = h2a + intercepts[1]
        hyp = softmax(h2a, _axis=1)
        # cross-entropy cost for this epoch
        current_epoch_cost = -np.einsum('Bi,Bi', y, np.log(hyp))/batch_size
        # print(current_epoch_cost)
        costs[epoch] = current_epoch_cost
        # BACKWARD PROPAGATION
        # layer 2
        dJ_dH2a = hyp - y
        dJ_dW1 = np.einsum('Bi,Bj -> ij', dJ_dH2a, h1)/batch_size
        # layer 1
        dJ_dH1 = np.einsum('Bi,ij -> Bj', dJ_dH2a, wghts[1])
        dJ_dH1a = dJ_dH1*leaky_relu_prime(h1a)
        dJ_dW0 = np.einsum('Bi,Bj -> ij', dJ_dH1a, x)/batch_size
        # numerical optimization
        beta1_denom = (1.0 - beta1**(epoch+1))
        beta2_denom = (1.0 - beta2**(epoch+1))
        if bias:
            dJ_dB1 = np.einsum("Bi -> i", dJ_dH2a)/batch_size
            dJ_dB0 = np.einsum("Bi -> i", dJ_dH1a)/batch_size
            # MOMENTS ADJUSTMENT
            M_B[0][epoch+1,:] = beta1 * M_B[0][epoch,:] + (1.0 - beta1)*dJ_dB0
            M_B[1][epoch+1,:] = beta1 * M_B[1][epoch,:] + (1.0 - beta1)*dJ_dB1
            V_B[0][epoch+1,:] = beta2 * V_B[0][epoch,:] + (1.0 - beta2)*dJ_dB0**2
            V_B[1][epoch+1,:] = beta2 * V_B[1][epoch,:] + (1.0 - beta2)*dJ_dB1**2
            # BIAS CORRECTION
            mhat_b0 = M_B[0][epoch+1,:] / beta1_denom
            vhat_b0 = V_B[0][epoch+1,:] / beta2_denom
            mhat_b1 = M_B[1][epoch+1,:] / beta1_denom
            vhat_b1 = V_B[1][epoch+1,:] / beta2_denom
            # BIAS ADJUSTMENT with numerical stability
            intercepts[1] = intercepts[1] - lr*mhat_b1/(np.sqrt(vhat_b1) + eps)
            intercepts[0] = intercepts[0] - lr*mhat_b0/(np.sqrt(vhat_b0) + eps)
        # MOMENTS ADJUSTMENT
        M_W[0][epoch+1,:] = beta1 * M_W[0][epoch,:] + (1.0 - beta1)*dJ_dW0
        M_W[1][epoch+1,:] = beta1 * M_W[1][epoch,:] + (1.0 - beta1)*dJ_dW1
        V_W[0][epoch+1,:] = beta2 * V_W[0][epoch,:] + (1.0 - beta2)*dJ_dW0**2
        V_W[1][epoch+1,:] = beta2 * V_W[1][epoch,:] + (1.0 - beta2)*dJ_dW1**2
        # BIAS CORRECTION
        mhat_w0 = M_W[0][epoch+1,:] / beta1_denom
        vhat_w0 = V_W[0][epoch+1,:] / beta2_denom
        mhat_w1 = M_W[1][epoch+1,:] / beta1_denom
        vhat_w1 = V_W[1][epoch+1,:] / beta2_denom
        # WEIGHTS ADJUSTMENT with numerical stability
        wghts[1] = wghts[1] - lr*mhat_w1/(np.sqrt(vhat_w1) + eps)
        wghts[0] = wghts[0] - lr*mhat_w0/(np.sqrt(vhat_w0) + eps)
    if bias:
        return (costs, wghts, intercepts)
    else:
        return (costs, wghts)
iris = load_iris()
x = iris.data
y = iris.target
#NORMALIZE
x_norm = normalize(x)
x_train, x_test, y_train, y_test = train_test_split(x_norm, y, test_size=0.33, shuffle=True, random_state=42)
#BINARIZE
y_train = binarize(y_train)
y_test = binarize(y_test)
unit_per_layer_counts = [10,3]
costs, fw, fb = pasapali_batch(unit_per_layer_counts, x_train, y_train, lr=0.01, epochs=200, bias=True)
plt.plot(costs)
def predict(x, fw, fb):
    h1a = np.einsum('hi,Bi -> Bh', fw[0], x) + fb[0]
    h1 = leaky_relu(h1a)                 # same leaky ReLU used during training
    h2a = np.einsum('ho,Bo -> Bh', fw[1], h1) + fb[1]
    return softmax(h2a, _axis=1)         # softmax over the class axis
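As a quick sanity check, the trained weights can be evaluated on the held-out split with something like this (a small sketch; the variable names predicted_classes and true_classes are just for illustration):
# compare the predicted class with the one-hot test labels
probs = predict(x_test, fw, fb)
predicted_classes = np.argmax(probs, axis=1)
true_classes = np.argmax(y_test, axis=1)
print("test accuracy:", np.mean(predicted_classes == true_classes))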
In the code, the difference is that I have initialized two moment arrays for each layer and updated (or should I write adapted…) these moments as per the Adam optimization algorithm.
In a normal gradient descent optimizer, the weights are adjusted based only on the gradient calculated in the current epoch.
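For comparison, a plain gradient descent step for the two weight matrices above would collapse to a single update per layer, using only the current gradients (a sketch reusing the variable names from the code above):
# plain gradient descent: the step uses only this epoch's gradient, no moments
wghts[1] = wghts[1] - lr*dJ_dW1
wghts[0] = wghts[0] - lr*dJ_dW0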

In the Adam optimizer, the weights are adjusted based on moving averages of the gradients from the current and previous epochs. These moving averages (the moments) are updated in every epoch as per the Adam algorithm and are then used to update the weights, as sketched below.
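Condensed to a single parameter w with gradient g at epoch t, one Adam step looks roughly like this (same beta1, beta2, eps and lr as in the code above; m and v start at zero):
m = beta1*m + (1 - beta1)*g              # first moment: moving average of gradients
v = beta2*v + (1 - beta2)*g**2           # second moment: moving average of squared gradients
m_hat = m / (1 - beta1**t)               # bias correction (discussed below)
v_hat = v / (1 - beta2**t)
w = w - lr*m_hat/(np.sqrt(v_hat) + eps)  # adaptive step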

In the paper, beta1 = 0.9 and m is updated in each epoch t based on the formula:

m_t = beta1 * m_(t-1) + (1 - beta1) * g_t,   with m_0 = 0

where g_t is the gradient computed in epoch t.
Let us expand the above formula step by step for each epoch (starting from m_0 = 0):

m_1 = (1 - beta1) * g_1
m_2 = beta1*(1 - beta1)*g_1 + (1 - beta1)*g_2
m_3 = beta1^2*(1 - beta1)*g_1 + beta1*(1 - beta1)*g_2 + (1 - beta1)*g_3
We see that in each epoch the previous gradients are included in the update, but the weight assigned to a gradient shrinks geometrically (by a factor of beta1) the further it lies from the current epoch. This helps the update keep moving towards the minimum while dampening the oscillations of the gradient along the way, and that momentum gives us the speed to cross saddle points.
Let us talk a bit about the second-order moment v. During the weight adjustment, the learning step is divided by the square root of v, which is roughly the root mean square of the recent gradients. This adapts the learning rate for each individual weight: a weight whose gradients have relatively large magnitude gets a large value of v and hence a smaller learning step in that direction. This slows us down so that we do not overshoot the minimum.
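A tiny, made-up example shows the effect. With two parameters whose gradients differ by a factor of 100, plain gradient descent would take steps that also differ by a factor of 100, whereas Adam's division by the square root of the second moment brings both steps back to roughly the size of the learning rate, so the large-gradient direction is slowed down the most:
g = np.array([0.05, 5.0])               # one small-gradient and one large-gradient direction
m_hat, v_hat = g, g**2                  # pretend the moments have settled at these values
print(0.01 * g)                              # plain GD steps: [0.0005, 0.05]
print(0.01 * m_hat / (np.sqrt(v_hat) + 1e-8))  # Adam steps: both roughly 0.01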
Finally, let us talk about the bias-correction section. The original paper gives the mathematical derivation and explanation. For the layman, it is sufficient to know that because the moments are initialized to zero, the raw estimates of m and v are biased towards zero in the early epochs; the bias correction compensates for this, which matters especially when gradients are sparse, since without the correction the early steps would be far too large.
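Here is a concrete illustration for the very first epoch, using beta2 = 0.999 and an arbitrary gradient of 0.5:
beta2, g1 = 0.999, 0.5
v1 = (1 - beta2)*g1**2        # raw estimate: 0.00025, heavily biased towards zero
v1_hat = v1 / (1 - beta2**1)  # corrected estimate: 0.25, i.e. g1**2
Without the correction, dividing by the square root of the tiny raw estimate would make the very first steps far larger than intended.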
Let's compare the convergence of the cost function for the gradient descent optimizer and the Adam optimizer.


The Adam optimizer took just 250 epochs to reach the optimal cost value, while gradient descent took 19,000 epochs. It reminds me of the superhero Flash!


This is a tremendous improvement in the speed of convergence. And Adam is not only about speed: it also copes well with sparse and very noisy gradients.
Please let me know your views about this blog post and the code implementation in the comments.
I am a machine learning engineer at TCS and my (Digital Software & Solutions) team is building amazing products.