Fathoming ‘the Deep’ in Deep Learning — A Practical Approach

Nachi Nachiappan
Towards Data Science
44 min read · Apr 29, 2019


The ‘deep’ in ‘deep learning’ is elusive, yet approachable with a bit of mathematics. That begs a practical question: is elementary calculus sufficient to unravel deep learning? The answer is yes indeed. Armed with an unbounded curiosity to learn and re-learn, new and old alike, and a willingness to follow the sections below methodically, I reckon you’ll cross the chasm to intuitively understand and apply every concept, calculus included, and de-clutter all the intricacies of deep learning.

Learning ‘deep learning’ is fun yet fraught. Mathematics is fundamental: it helps you understand clearly and etches the concepts indelibly. When I started on Michael Nielsen’s book ‘Neural Networks and Deep Learning’, I stumbled at the first go; many of the basic concepts involved needed a refresher. If sharing my journey simplifies someone else’s learning, then I’ve done my little part in facilitating deep learning.

In this post I cover the steps I took and what I researched, read and understood, captured to reveal each concept as intuitively as possible, along with additional topics that may pique your interest. Each step is detailed under its own heading, so that together they build up the whole edifice of deep learning, the cornerstone of AI, which is still evolving. Feel free to skip the sections you already know and dive deep into the ones that interest you most.

Steps to fathom the depth:

  1. The Beginnings — Modelling Decisions with Perceptrons
  2. Workhorses inside Nodes — Activation Functions
  3. A Gentle Detour on Basics — Differential Calculus
  4. The Underpinnings — Essential Statistics and Loss Reduction
  5. The Grand Optimization — Gradient Descent
  6. Intuitive Examples to the Rescue — Descent Demystified
  7. Ensemble directed Back & Forth — Feed Forward & Back Propagation
  8. Inner Workings of Bare NeuralNet — Matrices matched to Code
  9. Learning Curve Retraced — References & Acknowledgements

The Beginnings — Modelling Decisions with Perceptrons


Biological neuron models explain the underlying mechanism by which the nervous system functions to facilitate perception, be it sensory or its allied activities. From this came the ‘perceptron’, an analogous mathematical unit or model. Simply put, a perceptron (1) takes multiple inputs, (2) weighs them (multiplies each by a set weight), (3) sums them and adds a constant called the bias, and finally (4) sends the result to an activation function to (5) provide an output.

Typically you could visualize a perceptron as a mathematical model that decides whether I can go to a movie today, given a set of inputs and a threshold value. Let’s say the inputs are cloud cover, holiday occurrence and time of day, each a scaled real number. Each input is multiplied by its preset weight, the products are summed, and a threshold value (the bias factor) is added. This combined number is then stepped or smoothed: a stepper function outputs exactly 1 or 0, whereas a smoother function outputs a value ranging between 0 and 1, which can be interpreted as the probability of the result happening.

To make it clear, below is an illustration of the above example:

Mathematically, if we have j perceptron inputs, the output is stated as:

output = 0 if (w1·x1 + w2·x2 + … + wj·xj + b) ≤ 0, and output = 1 otherwise

To simplify further, we can think of the above equation as a dot product or scalar product, an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number. The dot product of two vectors w = [w1, w2, …, wn] and x = [x1, x2, …, xn] is defined as:

w · x = w1·x1 + w2·x2 + … + wn·xn

Expressing the above example in this way, a 1 × 3 matrix (row vector) is multiplied by a 3 × 1 matrix (column vector) to get a 1 × 1 matrix that is identified with its unique entry:
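As a rough sketch (with made-up weights, inputs and bias for the movie example above), the weighted sum, bias and step can be written in a few lines of NumPy:

```python
import numpy as np

# Hypothetical scaled inputs: cloud cover, holiday occurrence, time of day
x = np.array([0.2, 1.0, 0.7])      # inputs as a flat array
w = np.array([-2.0, 3.0, 1.0])     # assumed weights, one per input
b = -1.5                           # assumed bias (negative threshold)

z = np.dot(w, x) + b               # weighted sum: w1*x1 + w2*x2 + w3*x3 + b
output = 1 if z > 0 else 0         # step activation: go to the movie or not
print(z, output)                   # 1.8 1 for these made-up numbers
```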

From the above diagram, we can intuitively visualize a network of perceptrons as one which filters layer after layer, i.e. weighs and decides on the previous layer’s decisions, and so on, until we reach the final layer, which is the output. Sometimes an input is represented as a perceptron with just an inherent, constant bias and no weights or inputs pointing at it. When perceptron models were first created, it was thought that simple gates such as XOR could not be simulated, and this kept the pace of innovation dormant; later it was shown that networks of perceptrons can simulate every bitwise operation (NAND alone is universal) and, for that matter, a function exhibiting any pattern. As networks with multiple layers became practical, deep learning finally smashed its dormancy and is fast picking up pace.

Workhorses inside Nodes — Activation Functions


Modelling any process is fundamentally an activity of measuring output against given input, to gain understanding and, more importantly, to guarantee a repeatable outcome for a given input once the model is established. In simplified terms, to me this modelling is always ‘curve fitting’. With an input and a model, you can always determine the output. Activation functions are ‘curve fitters’: given an input, they provide an output based on the model defined. Variations in the model give rise to different activation functions.

Stepper Function

A basic activation function is a stepper function that’s shown in the graph below:

Equation : f(x) = 1 if x ≥ 0, and f(x) = 0 if x < 0

A stepper or binary function is one whose output is either 1 or 0 (yes or no). It is also called the Heaviside step function or unit step function. This function emits 0 for all negative inputs and 1 for all positive ones. Perceptrons use this kind of activation to signal whether to proceed or not based on the input.

Linear Function

A linear function, as the name suggests, ramps up the given input by a multiple. Mathematically, a linear function is one whose graph is a straight line, that is, a polynomial function of degree one or zero.

Equation : f(x) = 4x

The graph represents the linear function f(x) = 4x, which essentially multiplies the input by 4. This is useful in neurons whenever we want to scale the input proportionally, but its use case is rare.

Sigmoid Function

The sigmoid is an S-shaped curve whose output is above 0.5 for positive inputs, below 0.5 for negative inputs, and never exceeds 1. For large values along the x and -x axes, the y value approaches an asymptote. Of the ‘S’ shape, the top portion is concave and the bottom is convex. The definition of sigmoidal functions is very broad, but characteristically a sigmoid has a real-valued output, a single inflection point and a bell-shaped first derivative.

Equation : σ(x) = 1 / (1 + e^(−x))

In general, the sigmoid function is obtained from a logistic function, whose common shape is an ‘S’, with the equation

f(x) = L / (1 + e^(−k(x − x0)))

where

  • e = the natural logarithm base (also known as Euler’s number),
  • x0 = the x-value of the sigmoid’s midpoint,
  • L = the curve’s maximum value, and
  • k = the logistic growth rate or steepness of the curve

In the case of the standard logistic sigmoid function the values are as follows: L = 1, k = 1, x0 = 0, giving the sigmoid equation above.

One great advantage of this function is that it is non-linear: it is roughly linear within the range -3 to 3 and tapers off towards an asymptote over the rest of the range. It can be used to transform a value from a continuous space into a value within the binary range, and such a squashing method is very useful to dampen the extremes, characterized by a limiting value of zero at low activation and one at high activation. For now this function suffices to start our journey into neural networks, but we’ll look comprehensively into other functions and their suitability later.
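For reference, here is a quick sketch of the three activation functions discussed so far (the 4x multiplier in the linear case is just the example above; the inputs are arbitrary):

```python
import numpy as np

def step(z):
    """Heaviside / unit step: 1 for non-negative inputs, 0 otherwise."""
    return np.where(z >= 0, 1.0, 0.0)

def linear(z, m=4.0):
    """Linear activation: scales the input by a constant multiple."""
    return m * z

def sigmoid(z):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(step(z))     # [0. 0. 1. 1. 1.]
print(linear(z))   # [-20. -4. 0. 4. 20.]
print(sigmoid(z))  # values between 0 and 1, exactly 0.5 at z = 0
```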

Idea Behind Sigmoid Optimization

The interest in the sigmoid is that it can manage linear and non-linear behaviour in the same function. In the above diagram the inflection point is at 0.5, indicated by z. The sigmoid curve models a typical economy-of-scale problem. For example, when a manufacturer starts making a toy, the initial investment is heavy and the rate of return is minimal or near zero. Once the public starts buying the toy, sales increase gradually and take a linear path as demand sets in and supply cannot fully meet it; profit soars up to the inflection point, where profit is ideal. Beyond that point supply outstrips demand, and profitability simply drops off and cannot increase beyond a maximum no matter how much influence is exerted.

This suggests that any activity dependent on a threshold is an ideal candidate for modelling with a sigmoid function. Let’s look at a couple of such candidates and relate them to deep learning use cases. Election campaign spending is one: if a campaign manager tries to optimize advertising spending, the objective function will model an outcome where money is spent on states that have a feasible chance of voting in favour of their candidate, whereas money spent on states eventually lost is completely wasted. Sigmoidal functions behave the same way, only recommending spending money in a state if it can pass the threshold of winning a majority of voters. They will predict that it is better to spend a lot of money on one state that has a good chance of being influenced than to spend half each on two states that may both end up lost. In the deep learning use case of predicting a digit from a given image, it is useful that each pixel or group of pixels collectively crosses a threshold; if these are consecutively cross-validated through the layered sigmoid functions, the model can predict the outcome with few errors, and in the process it will have learned well.

A Gentle Detour on Basics — Differential Calculus


Differentiation

Before we delve further, we have to recollect what we studied about calculus in high school mathematics, the dreaded subject to our dilettante brains back then. It’s a perfect time to revisit it with fun and a utilitarian mindset, rather than the rote learning we did in our school days.

Derivatives are all about rate of change. How fast do you travel to reach your destination from home? How quickly did the population grow over the years? How wealthy are people? All of these talk about a rate of change, one quantity against another: in the above examples, distance versus time, count versus years, dollars versus number of people.

Incidentally, rate of change is also slope. If the data from one of the above use cases is plotted on x-y axes, with distance on the y axis and time on the x axis, the average speed between two points in time is given by the distance travelled over the time taken, which is essentially the slope. If you want to find the speed at a particular instant, it’s difficult to measure the distance traversed in an infinitesimal slice of time, so there appears to be no way to compute the rate of change. Differentiation helps here.

Inspired by lucid and simple explanations of complex mathematical concepts, let’s borrow the learning material (thanks to the MathIsFun site, ed. Rod Pierce) to explain derivatives in a clear, fun and concise manner. Please ensure you go through each one of them to refresh what you studied back then and internalize differential calculus, which is paramount to understanding the workings of a neural net.

  1. Introduction to Derivatives — talks about rate of change, its relation to slope, how we arrive at slope for a given function
  2. Derivative Rules — covers rules that govern the derivatives of various types of functions.
  3. Power Rule — the easiest way to de-clutter power functions into their derivatives
  4. Second Derivative — fantastic way to understand 2nd derivative with speed, acceleration and jerk
  5. Partial Derivatives — practical use case to understand real world usage of partial derivatives using our familiar cylinder
  6. Differentiable — this test lets you check differentiability of a curve to determine the presence of a slope
  7. Finding Maxima and Minima using Derivatives — fundamental in learning Gradient Descent and Back Propagation

Differentiation Ready Reckoner

Image Credit: MathIsFun

How To Understand Derivatives: The Product, Power & Chain Rules — provides an intuitive way to understand the product, power and chain rules we memorized in high school. With a firm understanding of differential calculus, we take our next step!
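If the rules feel abstract, a tiny numerical check helps; the function and step size below are chosen arbitrarily just to confirm the power rule:

```python
def f(x):
    return x ** 3                      # example function; the power rule says f'(x) = 3x^2

def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) as the slope over a very small interval h."""
    return (f(x + h) - f(x)) / h

x = 2.0
print(numerical_derivative(f, x))      # ~12.000006, the finite-difference estimate
print(3 * x ** 2)                      # 12.0, the power-rule answer
```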

The Underpinnings — Essential Statistics and Loss Reduction


Statistics

In statistics, you should be familiar with estimates of location and variability, which are central to how data is represented and how it varies, and with how data can be scaled to normalize it and speed up analysis. A quick summary of the essential concepts connected to this ‘deep learning’ journey:

Estimates of Location

Variables with measured data might have thousands of distinct values. A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data are located (i.e. their central tendency).

  • Mean
    The sum of all values divided by the number of values
    Alternative: average
  • Weighted Mean
    The sum of all values times a weight divided by the sum of the weights.
    Alternative: weighted average
  • Median
    The value such that one-half of the data lies above and below.
    Alternative: 50th percentile
  • Weighted Median
    The value such that one-half of the sum of the weights lies above and below the sorted data
  • Robust
    Not sensitive to extreme values.
    Alternative: resistant
  • Outlier
    A data value that is very different from most of the data.
    Alternative: extreme value

Estimates of Variability

Location is just one dimension in summarizing a feature. Another dimension, variability (also referred to as dispersion), measures whether data values are tightly clustered or spread out. At the center of statistics lies variability: measuring it, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it.

  • Variance
    The sum of squared deviations from the mean divided by N-1 where N is the number of data values.
  • Mean Absolute Deviation
    The mean of the absolute value of the deviations from the mean.
  • Range
    The difference between the largest and the smallest value in a data set.
  • Percentile
    The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more.
  • Inter-quartile Range
    The difference between the 75th percentile and the 25th percentile

Data Scaling

From Wikipedia: Since the range of values of raw data varies widely, in some machine learning algorithms, objective/loss functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

Another reason why feature scaling is applied is that gradient descent (we’re going to see this in more detail later) converges much faster with feature scaling than without it.

  • Min-Max Scaling
    Formula for min-max: x′ = (x − min(x)) / (max(x) − min(x))

This method re-scales individual data points into a range, say [0,1]. Say we have a dataset with 100 data points and x is an individual data point, with min(x) the minimum among the 100 points and max(x) the largest; min-max scaling squeezes (or stretches) every x value to lie within [0,1]. Sometimes the range is set to [-1,1] instead. The target range depends on the nature of the data and the algorithm employed in the analysis. The figure below gives an intuitive illustration.

Image Credit : Feature Engineering by Alice Zheng

Unit Scaling or L2 Normalization
Formula for unit scaling: x′ = x / ‖x‖, where ‖x‖ = √(x1² + x2² + … + xn²)

This is obtained by dividing each data point by the Euclidean length of the feature vector. The L2 norm measures the length of the vector in coordinate space; the Pythagorean theorem gives this distance.

Due to the division, data normalizes within -1 to 1 range.

Image Credit : Feature Engineering by Alice Zheng

Variance Scaling or Standardization
Formula for standardization: x′ = (x − mean(x)) / std(x)

The average or mean intuitively signifies the centre of the data, whereas the standard deviation (SD) gives an idea of its spread. A measure of spread is most helpful when the distribution of your data is symmetric around the mean and has a variance relatively close to that of the normal distribution (i.e. it is approximately normal). In that case the standard deviation has a canonical interpretation: about 68% of the data falls within 1 SD of the mean, 95% within 2 SDs and 99.7% within 3 SDs. The question is whether the distribution is wide or tight; this gives an idea of the data and of possible outliers.
Here, for every data point, the mean is subtracted and the result is divided by the standard deviation; essentially every data point is centred on the mean and scaled by the overall spread.
The resulting scaled data is standardized to have a mean of 0 and a variance of 1. If the original feature has a Gaussian distribution, then the scaled feature is a standard Gaussian, a.k.a. standard normal. The figure below intuitively illustrates this notion.

Image Credit : Mastering Feature Engineering by Alice Zheng
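All three scalings can be expressed directly in NumPy; the sample values below are made up purely for illustration:

```python
import numpy as np

x = np.array([2.0, 4.0, 10.0, 15.0, 100.0])     # hypothetical raw feature values

# Min-max scaling: squeeze every value into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# L2 normalization: divide by the Euclidean length of the vector
l2_scaled = x / np.linalg.norm(x)

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

print(min_max)        # values between 0 and 1
print(l2_scaled)      # vector now has unit Euclidean length
print(standardized)   # mean 0, variance 1
```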

Loss Functions

Variability inspires how deviation in prediction is measured: just as with standard deviation, loss can be measured, scaled and normalized. At its core, a loss function measures how well your prediction performed compared to the original value. This gives you a tool to track model performance and, at the same time, to tune the model so that loss is reduced to a minimum. A very basic loss function is the absolute difference between the original and predicted values. This gives a good judgement on an individual prediction (Absolute Error, AE); averaging over all predictions gives the Mean Absolute Error (MAE / L1 loss), and averaging the squared errors gives the Mean Squared Error (MSE / L2 loss / quadratic loss).

MSE & MAE — An Analysis

Let’s take a simple function y = f(x) = 4x, for which a graph is drawn below. The corresponding predicted values y_p are also drawn, to show the deviation from the actual values along with the loss, absolute loss and squared loss. Also given are the statistical values and the MAE and MSE.

The formulae are:

MSE = (1/n) Σ (y − y_p)²

MAE = (1/n) Σ |y − y_p|

where y_p are the forecasts (predicted values) and y the observed values (actuals).

If formulas are not your staple, you can find the MSE and RMSE by:
MSE

  • Squaring the residuals/loss.
  • Finding the average of the residuals/loss.

RMSE

  • Taking the square root of the result.

Both MAE and MSE signify average model prediction error, expressed in slightly different units of interest. Both metrics can range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means lower values are better. Squaring the errors before they are averaged has an interesting implication: MSE/RMSE give relatively high weight to large errors. This means MSE/RMSE should be most useful when large errors are particularly undesirable.

Looking at the graph above, we can see the MAE and MSE (RMSE is just the square root of MSE) values for the actual and predicted values. Most of the predictions are in line, except for 2 outliers where the error is very high. Compared to MAE, MSE/RMSE weight the outliers heavily, as you can notice from the squared loss sitting way above its nearby loss counterparts. Intuitively, we can interpret this from a loss-function perspective: if we were restricted to a single constant prediction and tried to minimize MSE, that prediction would be the mean of all target values, whereas minimizing MAE would give the median of all observations. The median is more robust to outliers than the mean, which consequently makes MAE more robust to outliers than MSE.

One big problem in using MAE loss (for neural nets especially) is that its gradient is the same throughout, which means the gradient will be large even for small loss values. This isn’t good for learning. To fix this, we can use a dynamic learning rate that decreases as we move closer to the minima. MSE behaves nicely in this case and will converge even with a fixed learning rate: the gradient of MSE loss is high for larger loss values and decreases as the loss approaches 0, making it more precise at the end of training (see the figures below).

MAE LOSS

Image Credit: Heartbeat — MSE & MAE Loss

MSE LOSS

Above are plots of the MAE and MSE losses where the true target value is 100 and the predicted values range between -10,000 and 10,000. Each loss (y-axis) reaches its minimum value at prediction (x-axis) = 100, and its range is 0 to ∞.
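The same effect is easy to reproduce with a few lines of NumPy; the actual and predicted arrays below are made up, with one deliberate outlier:

```python
import numpy as np

y_actual = np.array([4.0, 8.0, 12.0, 16.0, 20.0])
y_pred   = np.array([4.2, 7.9, 12.1, 16.3, 35.0])   # last prediction is an outlier

errors = y_actual - y_pred
mae  = np.mean(np.abs(errors))     # Mean Absolute Error (L1 loss)
mse  = np.mean(errors ** 2)        # Mean Squared Error (L2 loss)
rmse = np.sqrt(mse)                # Root Mean Squared Error

print(mae, mse, rmse)              # the single outlier inflates MSE/RMSE far more than MAE
```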

Some Loss Functions in Neural Networks

Loss functions play a vital role in stabilizing and improving the prediction accuracy of a given model, be it ML or DL. They are used to minimize the difference between actual and predicted values. A summary of loss functions used in deep learning is listed below: each name, its equation and a gist of what it does.

Note that y is the actual point or node data and yhat is the predicted or functional output of y. C is the overall error across all data points.

Mean Squared Error (MSE): widely used in regression, where the fitted line for the data points should be the one that minimizes the sum of squared distances of each point to the regression line. The difference between actual and predicted is called the ‘residual’, and the target of the loss function is to minimize the residual sum of squares.

Half Mean Squared Error (HMSE): as the name implies, it’s half the value of MSE, kept that way for calculus convenience, which we’ll come across later.

Sum of Squared Errors (SSE): simply the sum of squared errors, without the averaging done in MSE. Also called L2 loss.

Mean Squared Logarithmic Error (MSLE): measures the squared difference between the logs of the actual and predicted values. It is useful when predicted and actual values are huge, using the power of the log to reduce exponential magnitudes to progressive values. MSLE penalizes under-estimates more than over-estimates.

Mean Absolute Error (MAE): the mean absolute difference between actual and predicted values. MSE yields a nicely behaved gradient, whereas the gradient of MAE is only a constant.

Mean Absolute Percentage Error (MAPE): the percentage error of actual versus predicted, relative to the actual. It cannot be used when the actual value is zero, and when the difference between actual and predicted is wide it can exceed 100%. This function has a tendency to favour models whose predictions are too low.

Sum of Absolute Errors: MAE without the mean, i.e. not divided by n.

Binary Cross-Entropy (Log Loss): typically in binary classification and neural networks, the output is a real value signifying the probability of the expected result rather than a hard 0 or 1 for the given event. Given a set of data points, log loss measures the divergence between the probability distributions of the actual and predicted values (we can regard the actual and predicted data points as representing probability distributions). From a binary classification perspective, y is the either/or label for event A or B (1 if A happens, 0 if it doesn’t, i.e. B happens) and p(y) is its predicted probability. The left-hand term is the label times the log-probability of the label being A, whereas the right-hand term is (1 − label) times the log-probability of it being B. A further intuitive explanation is here, and refer to this for a nice derivation from the softmax derivative.
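A minimal sketch of binary log loss, assuming y holds the 0/1 labels and p the predicted probabilities:

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over samples."""
    p = np.clip(p, eps, 1 - eps)            # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])                  # actual labels
p = np.array([0.9, 0.1, 0.8, 0.3])          # predicted probabilities
print(log_loss(y, p))                       # confident wrong predictions are penalized heavily
```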

Negative Log-Likelihood (NLL): used when the model output is a probability for each class rather than the single most likely class, i.e. in multi-class problems where each output signifies, for example, whether the image is a cat, a dog or a rat. A great read on softmax and NLL is ‘Understanding softmax and the negative log-likelihood’ on the Lj Miranda blog. I’m totally smitten by the fantastic visual from that article.

Image Credit: Lj Miranda Blog

Cross-Entropy Loss with Softmax: links the above two; head to an intuitive article for a neat explanation.

Cosine Proximity: from high school we know cos 0° is 1 and cos 90° is 0, which indicates that a pair of vectors whose angle is near zero (cosine near 1) are similar, while vectors nearing 90° apart (cosine near 0) are divergent. This measure ignores magnitude differences and only takes the angle between the vectors into account.

Losses, their impact on error reduction and how they relate to gradient descent are discussed in detail here. Next we are going to tackle gradient descent more intuitively.

The Grand Optimization — Gradient Descent


Gradient Descent

The algorithm most often referred to in any mention of ML/DL is gradient descent. It is so fundamental that tomes have been written illustrating, explaining and demystifying it, but here we are going to understand it more intuitively with two examples, rather than through the fancy 3D graphs and animations you usually come across.

Before we delve properly into an example: gradient descent is literally ‘a way to traverse’ such that we reach the bottom of a valley. Figuratively, we step down, check whether we are at the lowest point, and if not, look for the next best direction of descent and repeat iteratively. One caveat: if we are stuck in one of the low points that is not the lowest (assume we have laser vision that can see through rocks), we retrace and start off in a direction that leads to the lowest minimum, the global minimum, the nadir deep down in the valley.

Assuming my valley is without contortions, is one-dimensional and looks like the curve y = x², we can draw that valley in the graph below. Recall finding maxima and minima using first and second derivatives earlier; here the slope increases to the right and decreases to the left. Refer to the two curves: the first is y = x² and the second is y = 1 + 2x − x². The diagram below illustrates the idea.

Understanding Slope & Descend on Basic Curves

As noticed in the left image above, the slope is positive and increases as we move to the right, and reverses as we move left. Finding the global minimum is a simple operation: visually, look for where the slope reaches zero in the slope graph, or mathematically, equate the slope equation to zero; we get x = 0, the point where the slope is zero and the function is at its minimum. Applying the second derivative test at this point (x = 0), the second derivative is 2; since it is greater than zero this is a local minimum, and since the function is one-dimensional and has a single dip, it is also the global minimum. Also note there’s no iterative process, so one-dimensional functions like this don’t need gradient descent per se. Similarly, the right image gives an idea of what happens with a parabola when we venture to find a global maximum. Here the slope is 2 − 2x, which is zero when x = 1; applying the second derivative test, the second derivative is −2 at this point, so it is a local maximum and, for this one-dimensional function, also the global maximum.

For these curves the second derivative is never zero. When it is zero, for instance for y = x (as shown above) where the slope is constant, there is neither a minimum nor a maximum point, but technically an undecided point. The analogous undecided points, called saddle points, happen in multi-dimensional functions.

Another important intuition: in one-dimensional functions we can only move left or right, but the useful point of view is to move against the slope. When the slope is positive, moving in the opposite direction takes you towards the minimum, i.e. when the slope is steep at a point, move the opposite way and the steepness reduces until it reaches zero. This comes in handy for gradient descent in multi-dimensional functions. Why is it important? We can simply multiply the slope at the current point by a small factor and subtract it, to move the point in the opposite direction, re-evaluate the function’s slope, see that it has become less steep, and repeat this process until we reach the minimum.

Intuitive Examples to the Rescue — Descent Demystified


Gradient Descent on a simple Function f(x,y) = x + y

Time to recall the partial derivatives we worked up earlier. If f(x, y, z) = x⁴ − 3xyz, then using curly-dee notation:

∂f/∂x = 4x³ − 3yz,  ∂f/∂y = −3xz,  ∂f/∂z = −3xy

Now let’s consider the function f(x, y) = x + y. Initialize the values of x and y as 2 and 4 respectively; the output will be 6.

The question here is: how do we adjust the values of x and y to bring the function’s output to its minimum? The formula is simple:

  • new x position = old x position − (small constant) * slope of f w.r.t. x
  • new y position = old y position − (small constant) * slope of f w.r.t. y

Let’s differentiate f(x, y) with respect to its components: ∂f/∂x = 1 and ∂f/∂y = 1.

These partial derivatives can be thought of as forces acting on the x and y inputs that produce the output. The objective is to drive the output to the minimum, here zero (the output and the loss function are the same in this toy example; recall our earlier minima/maxima exploration). Let’s see how to apply the above formula, assuming the small constant delta = 0.01.

These new positions for x and y lead to f(x, y) = x + y being valued at 1.99 + 3.99 = 5.98. Compared to our starting value of 6, it’s slightly less. In order to reach a value of 0, we utilize the important characteristic of gradient descent, which is to iterate the steps until we reach convergence. The process is to repeat

until x + y reaches zero.
If we repeat the above steps 300 times, we get x = -1 and y = 1, a final output of x + y = 0, and we have ultimately achieved the values of x and y that give the minimum value of 0.
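The 300-step iteration above takes only a few lines of Python; the starting values and the 0.01 delta are the ones used in the example:

```python
def gradient_descent_sum(x=2.0, y=4.0, delta=0.01, steps=300):
    """Nudge x and y against the gradient of f(x, y) = x + y, step by step."""
    for _ in range(steps):
        # partial derivatives of f(x, y) = x + y: df/dx = 1, df/dy = 1
        x = x - delta * 1.0
        y = y - delta * 1.0
    return x, y

x, y = gradient_descent_sum()
print(round(x, 6), round(y, 6), round(x + y, 6))   # -1.0 1.0 0.0, matching the result above
```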

Gradient Descent on Function f(x) = mx + C

Let’s take a simple example to immerse ourselves in the inner workings, so we can approach gradient descent with clarity.

From a survey conducted in a city’s Central Business District (CBD), office floor space rental prices were collected and tabulated: column A identifies the floor space area (X) in sq. ft. and column B the rental yield per month (Y), in thousands of dollars, that it fetches.

Problem Statement : For a new office space in CBD, given the floor area, what’ll be the rental yield per month in dollar amount?

The green line plots the current yield price versus floor cover. We’ll use a simple linear model to predict the new rental yield (Y pred) given the floor area (X). The amber line depicts the model, which predicts the yield given the area:

Y pred = mx + C

Green line gives the actual rental yield (Y actual) based on the current survey.

The green dots and amber dots, with the interconnecting lines between them, show the difference between Y actual and Y pred, which is the prediction loss or error (E).
The onus is on us to find optimal values for C and m (called weights) so that the prediction line governed by Y pred = mX + C best fits the data, reducing the prediction error and improving accuracy. We’ll employ a quadratic loss function, HMSE (half mean squared error), which needs to be minimized taking into account the actual and predicted values.

Simply put, HMSE = (1/2n) Σ (Y actual − Y pred)².

Let’s utilize the gradient descent algorithm to find optimal values for the weights C and m that minimize HMSE.

The steps involved are enumerated here:

  1. Normalize the data in Columns A and C using min-max normalization into Columns D and F respectively; these will be used as X and Y
  2. Initialize C and m to random values and compute the Squared Error (SE) per data point
  3. Compute the ‘change in HMSE’ per line/data point when C and m are each changed by a very small value from their original random values; the overall change is the sum of these across all data points.
    After summing up, we could divide the summed value by the number of data points to average it, but this isn’t necessary: it simply scales down the total change, and the learning rate we introduce subsequently absorbs that scaling (at most it may slow down convergence), so we can stop at summing and skip the averaging. In dynamic cases, where the number of input data points is not known up front, we can’t average anyway, so it’s best left summed up
  4. Adjust the corresponding weights C and m by subtracting the total ‘change in HMSE’ w.r.t. C and m, multiplied by a learning rate
  5. Use the new weights to compute the total SE and check that it is reducing
  6. Repeat steps 3 and 4 until there are no more significant reductions in the total SE

Steps in Detail:

  1. Refer to Columns D, E and F, which are the min-max normalized rental data corresponding to Columns A, B and C
  2. To fit the line Y pred = C + mX, start off with random seed values and compute the prediction loss HMSE. HMSE is defined over all data points; when applied to a single data point, the n in the formula is dropped: just work out the half squared error for each data point and sum them all. SE and HMSE refer to the same thing in this context.

Download the Source Excel

3. Compute the error gradients w.r.t. the weights, in our case C and m

We’ll work out the calculus for the above equations separately later. The above two equations give the changes in C and m; they are simply gradients that point in the direction of movement w.r.t. SE. Refer to the last two columns of the computation.

4. Change the weights C and m using the gradients to reach the optimal SE value, which is the minimum.

We’ll adjust the initial random values of C and m, so that we move in the direction of optimal C and m.

The update rules are:

New C = C − lr * (change in HMSE w.r.t. C)
New m = m − lr * (change in HMSE w.r.t. m)

where lr is the learning rate, hence:
New C = 0.52 − 0.01 * 5.040 = 0.47
New m = 0.81 − 0.01 * 2.279 = 0.79

5. Use the new weights to compute the total SE and check that it is reducing; check that the summed HMSE value is decreasing in the graphic above

6. Once steps 3 and 4 have been repeated until there are no more significant consecutive reductions in the total SE, we have reached the optimal C and m with the best prediction rate
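A minimal sketch of the whole loop, assuming the half squared error summed over data points, the seed values (C = 0.52, m = 0.81) and the learning rate 0.01 from above; the x and y arrays here are made-up normalized values, not the survey data:

```python
import numpy as np

def fit_line(x, y, lr=0.01, epochs=1000, C=0.52, m=0.81):
    """Batch gradient descent on Y_pred = m*x + C with half squared error,
    gradients summed over all data points as in the steps above."""
    for _ in range(epochs):
        error = (m * x + C) - y          # per-point prediction error
        grad_C = np.sum(error)           # d(sum of half squared errors)/dC
        grad_m = np.sum(error * x)       # d(sum of half squared errors)/dm
        C -= lr * grad_C
        m -= lr * grad_m
    return C, m

# hypothetical min-max normalized floor area (x) and rental yield (y)
x = np.array([0.0, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85, 1.0])
y = np.array([0.05, 0.12, 0.3, 0.42, 0.5, 0.68, 0.84, 0.97])
print(fit_line(x, y))                    # C near the intercept, m near the slope of the data
```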

Calculus behind Gradient Descent

Squared Loss Function

The HMSE applied per data point ends up being ½ (Y pred − Y actual)².

Its first partial derivative, w.r.t. C, is (Y pred − Y actual).

Let’s work out the partial derivatives now:

Applying a Different Loss Function

If you decide to apply a different loss function in the above endeavour, you can try the Log-Cosh function, which has usable first and second derivatives as well. If you applied MAE or RMSE, the gradient works out to be essentially constant, which is not so conducive from a learning and convergence perspective in the neural network gradient descent process!

The second partial derivative, w.r.t. m, simply multiplies the above term by x.

Gradient Descent Variants

Batch — Vanilla Gradient Descent

As in the immersive example before, there were 12 data points, and the model parameters m and C were updated only at the end, i.e. after computing all derivatives and loss values for every data point in the entire record set of 12. Essentially this is vanilla batch GD, where model parameters are updated only after all data point computations are complete. Processing all data points once is referred to as an epoch.

Mini Batch — Gradient Descent

In the above example, among the 12 data points, we could process the first 4 and update the model parameters, then the next 4 data points and update the parameters again, and so on until all data points are exhausted. Essentially, after every mini-batch (4 data points in our case) is processed the model parameters are updated, and hence within an epoch the parameters are updated n times, where n = total data points / mini-batch size.

Stochastic Gradient Descent — SGD

Generally SGD computes the gradient using a single data point, but most applications of SGD actually use a mini-batch of several samples. Prior to the start of every epoch, all data points are randomly shuffled and the mini-batch process is followed. SGD works well (compared to batch gradient descent, though there are more advanced methods now) for error manifolds that have lots of local maxima/minima. The somewhat noisier gradient calculated over the reduced number of data points tends to jerk the model out of local minima into a region that is hopefully more optimal. Single data points are really noisy, while mini-batches average a little of the noise out, so the amount of jerk is reduced when using mini-batches. A good balance is reached when the mini-batch is small enough that its noise helps avoid some of the poor local minima, yet large enough to head towards the global minimum or a better-performing local minimum, ideally letting us fall into the larger and deeper basins of attraction where the best minima sit.

SGD is akin to sampling a cross-section of voters to get a forecast rather than running an actual election. The learning speed is spectacular compared to batch GD: say our total number of data points is a million (1M) and the mini-batch size is 1K, then each update is roughly 1M/1K = 1000 times cheaper. The speed-up factor may not be exact and includes statistical fluctuations, but it is good enough to lead us down a path of gradual reduction in the cost function. Another benefit of SGD is that it is computationally faster and can be parallelized.

When building the gradient descent algorithm, a maximum epoch count (iteration count) can be set along with a threshold on the loss difference; the two work in tandem to declare convergence or to stop endless processing, whichever happens first. Explore the variants further here.
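A rough sketch of the mini-batch SGD loop described above, applied to the earlier Y pred = m*x + C model; the data, learning rate, batch size and epoch count below are placeholders:

```python
import numpy as np

def sgd(x, y, lr=0.05, epochs=200, batch_size=4, C=0.0, m=0.0):
    """Mini-batch SGD on Y_pred = m*x + C: shuffle every epoch, update per batch."""
    n = len(x)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle before every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # indices of one mini-batch
            error = (m * x[idx] + C) - y[idx]
            C -= lr * np.sum(error)                 # update after every mini-batch
            m -= lr * np.sum(error * x[idx])
    return C, m

x = np.linspace(0.0, 1.0, 12)                       # 12 made-up normalized data points
y = 0.8 * x + 0.1 + np.random.normal(0, 0.02, 12)   # noisy line to recover
print(sgd(x, y))                                    # should land close to m = 0.8, C = 0.1
```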

Ensemble directed Back & Forth — Feed Forward & Back Propagation


Chain Rule

In our earlier gradient descent example, we were looking at the simple function y = mx + C. This can be imagined as a single-node activation function which simply takes an input and provides an output: it scales the input by m and adds a bias C.
Now let’s imagine a bunch of such activation functions interconnected, one to another, such that the output of one is the input to the next. This is a simple linear network with a few nodes; in our example we’ll consider a 4-node linear network as shown below. The same idea applies here as well: we need to find the right weights and biases such that the overall function’s output is optimized, so that any error between the original output and the output of these ‘interconnected functions’ is at a minimum.
To demonstrate the idea, look at the diagram below, a neural network with two hidden layers plus an input node and an output node.

Each node is a function of the previous one connected to it. If we were to change either the input or the weight value at the input node, this change would cascade down, as each node changes the input it receives, and would finally reflect in the output. The output change may seem complex, but intuitively every node just passes along and scales the cascading change. Since each node is interdependent, and with this notion of functional dependencies, we can mathematically formulate a composite function to represent the behaviour of this network:

Consolidating all equations into one:

Since the intermediary nodes are not working solo but depend on input from the previous nodes, if we would like to understand the overall change in output for a small change in w1, we have to apply the derivatives in tandem, from the last node back to the first, for the output w.r.t. w1. The resultant composite derivative will be:

Let’s pause and look into a detailed example of how the chain rule can be leveraged here.

Chain Rule in Action — Understanding With an Example

The chain rule of differentiation is fundamental to back propagation, and getting an intuitive understanding of it goes a long way towards comprehending neural networks in their entirety. The example below lets us decipher how the chain rule works and see how changes in individual factors contribute to the final output change. In the diagram below, we have 3 activation functions connected in a linear manner, and essentially any input cascades through them to give an output, following the rules stated. It’s very easy to follow along:

  • i/p signifies the input to this overall linear network
  • any i/p gets weighted by w1 and supplemented with a bias b1, collectively notated as z1, is fed to activation function #1
  • this function scales down any input by 1/8 times to give h1 as output
  • same pattern repeats until output h3
  • finally we have an error function wherein if we know the actual h3 and predicted h3, error/cost can be computed but for now we ignore this part

Let’s enumerate the equations for the above linear network from output all the way to input

  • h1 = f1(w1 * x + b1)
    = (w1 * x + b1)/8
  • h2 = f2(w2 * h1 + b2)
    = (w2 * h1 + b2)/7
  • O = h3 = f3(w3 * h2 + b3)
    = (w3 * h2 + b3)/9

As shown above, essentially any wiggle you make in x gets scaled down in each layer as per our example, after a slight bump up due to the weights and bias. One more thing to note: ‘h’ doesn’t know what happened to x, it only responds to the change coming from ‘g’; likewise ‘g’ has no idea of the change in x, it only responds to the wiggle in ‘f’, and so on. Any change in x gets scaled by delta x times the rate of change of f, followed by its next layer, and so on.
O = h(g(f(x))) and the change in h w.r.t. x will be

The chain rule isn’t just unit cancellation of denominators and numerators; it’s the cascading of a wiggle, which gets scaled up or down or adjusted at each step. The chain rule works for any number of variables (f depends on g, which depends on h, and so on): it just cascades the wiggle. Think of it as zooming into each variable’s point of view: beginning from dx and gazing up, you can visualize the entire chain of adjustments the wiggle goes through before it reaches h.

Any change in O w.r.t. w1 (again, w1 scales the x above) can be written using the chain rule as follows:

Above tabulation sums up the idea behind the chain rule with numerical data

  1. Column A : consider few inputs to this linear network
  2. Column B : computes h1
  3. Column C : computes h2
  4. Column D : computes h3 — final output
  5. Column E : computes new h3 based on a change in w1 — just by calculating rate of change of h3 w.r.t w1 and multiplying with change in w1 and finally adding it to h3 to get the new values
  6. the repeated rows compute h1, h2 & h3 by incorporating the new w1 value and doing all end to end calculation to arrive at h3 — note column E data and data under heading ‘new h3’ in column D (on modified w1) are the same
  7. Column F : computes derivative using approximation — as given by the formula:

Here O is nothing but h3.
The values computed by the approximation and by the chain rule are the same, as shown.
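The same comparison (chain rule versus small-change approximation) can be reproduced in a few lines of Python; the weights, biases and input below are arbitrary placeholders, only the 1/8, 1/7 and 1/9 scalings come from the example:

```python
def forward(x, w1=2.0, b1=1.0, w2=3.0, b2=1.0, w3=4.0, b3=1.0):
    """The three-node linear network above: each node weighs, biases and scales its input."""
    h1 = (w1 * x + b1) / 8.0
    h2 = (w2 * h1 + b2) / 7.0
    h3 = (w3 * h2 + b3) / 9.0
    return h3

x, w1, w2, w3 = 5.0, 2.0, 3.0, 4.0

# Chain rule: dh3/dw1 = dh3/dh2 * dh2/dh1 * dh1/dw1
chain = (w3 / 9.0) * (w2 / 7.0) * (x / 8.0)

# Approximation: wiggle w1 a little and see how much h3 moves
dw = 1e-6
approx = (forward(x, w1=w1 + dw) - forward(x, w1=w1)) / dw

print(chain, approx)   # the two values agree
```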

Back Propagation — Chain Rule in Action in Reverse

In our previous linear network of nodes, let’s add a final function, the cost function. This takes the final output (the computed y value, a.k.a. the value predicted by the given model, our linear network) and calculates the difference with respect to the given original output (the original y values). Since this is another functional dependency, our cost is now also affected by the input, weights and biases, and hence is a function of them.

Now let’s attach a ‘cost function’ at the end of this network as shown above. It’s obvious that this function’s output is also dependent on the input and the intermediary weights, hence we can extend the same derivative formula to our cost function as well:

In the chain rule example above, we were able to find the composite derivative by approximation, i.e. by changing w1 by a small value, computing the new output and hence the derivative’s value.
Now a question arises: how about applying the same approximation principle to compute the changes in cost w.r.t. the weights in a real neural network? It’s quite feasible to take this path, but it leads to a total slowdown in computation. Assume a large network of weights: in order to find the change in cost w.r.t. w1, we need to compute C and C (with delta w1), which means a full pass through the network, and then repeat this for every single weight before applying the approximation formula. If the weights number in the millions, the approximation algorithm screeches to a halt, whereas the chain rule method comes in handy, propagating the derivative back through the network to find the weight updates quickly by gradient descent to reduce the overall cost, as depicted below.

Neural Network

Network

Time to delve into the real neural network, work out the equations based on the understanding so far and apply them to Python code. The notation and code are based on the seminal book by Michael Nielsen. I had to spend quite a lot of time understanding the Python code with respect to the notation, equations and deductions. What I’ve done here is to simplify it further with better visualizations, so that it can be assimilated quickly and the book can be used as a reference resource.

Imagine a network of 4 layers (as above):

  1. Layer 0 — Input Layer of 3 nodes
    No transformation of inputs and they are just passed on
  2. Layer 1 — comprising 4 neurons
    For each neuron in this layer, layer 0 inputs are weighted and finally a bias is added and fed to activation function
  3. Layer 2 — comprising 3 neurons
    For each neuron in this layer, layer 1 inputs are weighted and finally a bias is added and fed to activation function
  4. Layer 3 — comprising 2 neurons
    For each neuron in this layer, layer 2 inputs are weighted and finally a bias is added and fed to activation function

Let’s assign proper notation to what we just said and express it mathematically; it’s very easy to follow. It also helps to consider an input use case for this network; it need not exactly match these inputs and outputs, but it provides context and an intuitive understanding of how the network is going to handle data overall.
MNIST is the standard torch-bearer dataset for deep learning: it contains 60K images of handwritten numbers, with the annotated correct numeric value for each image.

  1. Input is an image of 28×28 pixels
    translates to an array of 784 inputs storing the grey-scale value of each pixel, between 0 and 1
  2. Output indicates a numeral between 0 and 9
    translates to an array of 10 outputs, each a value between 0 and 1, where a value greater than 0.5 indicates that numeral, in order. In a way the output can also be considered as a probability of that numeral occurring in the input image

Mathematically expressed Network

A full network with complete notation is given here; to get a full-scale view, click on the image.

Basic Definitions

Now let’s define all the basics (inputs, weights, biases) step by step; just refer to the diagram below.

  • Input (at 1st Layer)
    input = x

    typically numerical data, raw or normalized, represented by x in this layer. Also note that in this layer there is no activation, and no weights or biases, hence all of these are zero; we could also say that in this layer:

We’ll see what z and a are subsequently.

  • Rows and Layers (Columns)
    Rows = j and/or k
    Layers = l or L
    j is always associated with the current layer and k refers to the (l-1)th layer. ‘L’ denotes either the last layer or a layer taken collectively.
    Also note that in the above diagram, neurons are not equally spaced from the top but are placed in their corresponding row and layer to match the matrix representation, which we’ll see in a while.
  • Weights

Every neuron gets its multitude of inputs from the previous layer’s neuron outputs, each ‘weighed in’ before a bias is combined, and takes that in as input. The j and k notation is as before and as illustrated in the diagram below.

  • Bias

All the weighed-in inputs from the previous layer are summed up, added to the bias of the node at layer l, row j, and fed to the neuron at layer l, row j.

Other Definitions

Now let’s define the next level (combined inputs, activations, outputs, cost function) step by step; just refer to the diagram below.

  • Input at a Neuron
  • Activation
  • Cost
  • number of inputs, from an MNIST dataset perspective: 50K training numeral images, each represented individually as a 784×1 matrix of real numbers between 0 and 1, the grey-scale value of each pixel
  • output, from an MNIST dataset perspective: after going through our neural net, the numeral output is an array of 10, representing the digits 0 to 9, whose values are again between 0 and 1, a probability value for that particular numeral
    Number of outputs (for a given image input) = 10 outputs
  • original output, from an MNIST dataset perspective: the annotated and verified numeric value of a given image, which we can convert into an array of 10 values, each representing a numeral 0 to 9, where the corresponding numeral is 1 and the rest are zeros
    Number of original outputs (for a given image input, represented as an array of 10, one per digit) = 10 outputs
  • input/sample count, from an MNIST dataset perspective: 50K training samples

As cost is a function of the output activations, we can write:

Backprop Equations Derivation

Following the chain rule example, we can say that any wiggle in weight/bias will affect combined input

which in turn affects output

and finally affecting Cost/Error

As a corollary, any change in the input will affect the output and the cost, and any change in the output will affect the cost, apart from changes in the weights/biases; i.e. every intermediate component affects its neighbour and everything else down the line. Conversely, any change in the final cost depends on the output, the input and the weights and biases, in that order. Enumerating w.r.t. the notation described above, at the lth layer, node j:

Possible Combinations of Changes & Effects

Finding the change in cost w.r.t. changes in weight/bias yields 4 important equations that comprise the backprop algorithm: BP1…BP4.

Change in Cost w.r.t Weight

Referring to the above diagram, let’s work out the first and foremost equation, the change in cost w.r.t. a change in weight.

Change in Cost w.r.t Bias

Change in Cost w.r.t Input in Last Layer

Change in Cost w.r.t Input in mid Layer

We could use the total derivative, but it’s tedious and cumbersome. We’ll see how to derive this by applying the chain rule on a single-row multi-layered network, then pattern it onto a multi-row multi-layered network and understand the intuition behind it; for now the equation is as below:

  • vectored form:
  • on a per row basis in the network:

BP2 — Intuitive Deduction — Single Row Network

Let’s consider a linear network below with four neurons each with weights, biases, inputs/activations and output as shown. Superscripts represent the layer they are in.

Forward Pass : Activations

Reverse Pass : optimize Weights for Cost Reduction

From gradient descent and the chain rule, we know that we need to compute derivatives, and hence the changes to be effected in ‘w’ and ‘b’, so that the cost is at its minimum. Recall that any wiggle in w affects all the intermediaries down the line, and we originally ventured to find the change in cost C w.r.t. w for the last layer. With the single-row multi-layer network, it’s easier to compute the intermediaries. By the chain rule:

Now we can summarize the ‘backward pass’ for the weights; you can similarly work it out for the biases (the reader is encouraged to work this out in detail; it is left out here for brevity):

BP2 — Intuitive Deduction — 2-Row 4-layer Network

Let’s focus on layer 4. From all our understanding so far, we can work out the activations as given below:

Working out the backward pass from layer 4 to layer 3, summarized below:

Backprop — A Different Take

As in the above diagram, the activation w.r.t. layer 3 (from 3 to 4) is nothing but the layer 4 weights multiplied by the layer 3 outputs, summed, added to the layer 4 bias and fed to the activation function, which gives the output for layer 4, i.e.

whereas back propagation w.r.t. layer 3 (from 4 to 3) is the transpose of the layer 4 weights times the layer 4 change in cost, taken as a Hadamard product with the derivative of the layer 3 activation. This is illustrated in the diagram below.

Inner Workings of Bare NeuralNet — Matrices matched to Code


MNIST Dataset on Our Neural Net — Nodes and Layers

Just as the ‘hello world’ program is the de facto starter for any language, the MNIST dataset is the de facto starter for neural net learning. It contains 60K training images and 10K testing images. These images are normalized to fit a 28×28 pixel resolution and anti-aliased to give greyscale levels. We’ll go ahead and construct a neural net whose layers are:

Generated with NN-SVG Tool
  1. First Layer — Input Layer
    Will be 784 nodes, presenting the 28×28 pixel bounding-box image as a linear 784×1 matrix
  2. 2nd Layer — Hidden Layer
    Will be a hidden layer of 100 nodes; the weight matrix will be (100×784) and the bias matrix (100×1)
  3. 3rd Layer — Hidden Layer
    Will be a hidden layer of 40 nodes; the weight matrix will be (40×100) and the bias matrix (40×1)
  4. 4th/Last Layer — Output Layer
    Will be an output layer of 10 nodes; the weight matrix will be (10×40) and the bias matrix (10×1). The last layer is chosen this way so that each node represents one digit (one of the numerals 0…9). Since the value is real, the output signifies a probability of that particular numeric value: say, above 0.5 is a sure thing and less than 0.5 is a non-occurrence.

We will use the above neural net architecture to understand the Python code given in Michael Nielsen’s book, which provides a compact, simple, bare-minimum neural net implementation in Python.

  1. Clone/copy below repo to a local directory
    https://github.com/mnielsen/neural-networks-and-deep-learning.git
  2. Move into the local directory where you cloned it and launch Python 2.6 or 2.7 from there
  3. Run this snippet to get going:
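A sketch of the usual launch snippet, adapted to the layer sizes above; the repo keeps its modules under src/, and the epoch count, mini-batch size and learning rate here are illustrative, not prescriptive:

```python
# run from the repo's src/ directory in a Python 2.x interpreter
import mnist_loader
import network

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network.Network([784, 100, 40, 10])   # input, two hidden layers, output
net.SGD(training_data, 30, 30, 3.0, test_data=test_data)   # epochs, mini-batch size, eta
```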

Overall NeuralNet in Python

Peek into the source of network.py and follow along with the functions mentioned below to get a better sense of what the Python logic does!

  1. SGD Function
    (loop around number of epochs)
    a. Parameters:
    training_data — training data matrices formatted accordingly
    epochs — number of iterations on the entire dataset to go through
    mini_batch_size — obvious as the name indicates
    eta — learning rate multiplier
    test_data — test data matrices formatted accordingly — used for evaluation of the learned weights and biases
    b. Shuffles all 50K training records (each containing 784×1 greyscale data) into mini-batches of size 30 records
    c. Performs mini-batch operation (which computes delta weights and biases, updates original weights and biases)
  2. update_mini_batch Function
    (loop around for each mini-batch count) where mini-batch count = total training samples divided by mini-batch size
    a. Initialize weight and biases matrices to zero
    b. Perform backprop (includes feed forward and back propagation). Loop around for each training record + its output record in the mini-batch
    c. Get the delta matrices and add to previous delta matrices (for both weights and biases)
    d. Compute GD and subtract from weights and biases
  3. backprop Function
    a. Perform feed-forward — loop around each layer of weights/biases computing the input & activation matrices for each layer
    b. Perform backward pass
    c. first compute last layer delta
    d. loop around from last-but-one layer till 1st layer & compute delta matrices at each layer for weights and biases

Feed Forward & BackProp — Matrices Computed in Python

4. evaluate Function
a. With the weights and biases computed by SGD, perform feed-forward on the test input
b. Get the final output and compare it to the actual output
c. Report the count of correctly compared results
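To make the matrix shapes concrete, here is a minimal NumPy sketch of one feed-forward and backward pass for the [784, 100, 40, 10] architecture above; it is a simplification in the spirit of network.py, using fake data, not a copy of the book's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

sizes = [784, 100, 40, 10]                               # layer sizes from the diagram above
biases  = [np.random.randn(n, 1) for n in sizes[1:]]
weights = [np.random.randn(n, m) for m, n in zip(sizes[:-1], sizes[1:])]

x = np.random.rand(784, 1)                               # one fake grey-scale image
y = np.zeros((10, 1)); y[3] = 1.0                        # fake one-hot label for the digit 3

# Feed-forward: store weighted inputs (z) and activations (a) layer by layer
activation, activations, zs = x, [x], []
for w, b in zip(weights, biases):
    z = np.dot(w, activation) + b
    zs.append(z)
    activation = sigmoid(z)
    activations.append(activation)

# Backward pass: last-layer delta (BP1 for the quadratic cost), then propagate back (BP2)
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
nabla_b = [delta]
nabla_w = [np.dot(delta, activations[-2].T)]
for l in range(2, len(sizes)):
    delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
    nabla_b.insert(0, delta)
    nabla_w.insert(0, np.dot(delta, activations[-l - 1].T))

print([w.shape for w in nabla_w])   # [(100, 784), (40, 100), (10, 40)]
```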

I hope this helps a bit in your deep learning journey; if you’d like to reach out, drop me an email.

Learning Curve Retraced — References & Acknowledgements


These are the books, blogs and code I read, pondered and assimilated as I tried to understand deep learning. I hope that listing them here helps you refer to them in your own journey, and at the same time acknowledges all who have similarly helped forge deep learning.

Books

  1. Neural Networks and Deep Learning Book
  2. Practical Stats for Data Scientists
  3. Mastering Feature Engineering

Code

  1. Simple GD code
  2. Neural Net Code from Nielson’s Book

Activation Functions

  1. Fundamentals of Deep Learning — Activation Functions
  2. Cube Function

Loss Functions

  1. 5 Regression Loss Functions All Machine Learners Should Know
  2. Log Loss
  3. Loss Functions in Neural Networks
  4. Cosine Similarity — Understanding the math
  5. Binary Cross Entropy
  6. Cross Entropy Loss and Softmax

Gradient Descent

  1. Machine Learning 101: An Intuitive Introduction to Gradient Descent
  2. Gradient Descent: All You Need to Know
  3. Gradient Descent — A Beginners Guide
  4. Keep it simple! How to understand Gradient Descent algorithm
  5. Linear Regression

Back Propagation

  1. Neural networks and back-propagation explained in a simple way
  2. Neural Networks & The Backpropagation Algorithm, Explained
  3. My attempt to understand the Backpropagation

Originally published at http://avantlive.wordpress.com on April 29, 2019.
