ANN Training consists of an iterative Forward (Inference) and Backward (Backpropagation) process aimed at adjusting the Network Parameters so as to reduce a similarity distance with respect to an implicitly (i.e. sampled + noise) defined Target Function
Long Version
Let’s consider an ANN Learned Transfer Function
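A minimal notation sketch, assuming a generic parametric model (the symbols x, \hat{y} and \theta are illustrative choices, not given in the original):

```latex
% Learned Transfer Function: a parametric map from an input x to a prediction \hat{y},
% where \theta collects all the Network Parameters (weights and biases).
\hat{y} = f_{\theta}(x), \qquad \theta \in \mathbb{R}^{p}
```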

In Supervised Learning, Training an ANN is aimed at making the Network Transfer Function approximate a Target Transfer Function which is not provided in an explicit closed form but in an implicit sampled form, as a Set of Training Data Points.
Furthermore, the samples unfortunately do not capture the Target Transfer Function alone but also include some additional noise. We want the ANN to learn as little noise as possible in order to achieve better generalization performance.
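A sketch of this implicit, noisy representation, assuming a Target Transfer Function f^* and an additive noise term \varepsilon_i (the additive-noise model is an assumption used here for illustration):

```latex
% Training Data Points: N samples of the Target Transfer Function f^*, corrupted by noise.
\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^{N}, \qquad y_i = f^{*}(x_i) + \varepsilon_i
```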

This kind of Target Function representation poses 2 major issues
- Discrete vs Continuous Information: the ANN Transfer Function is continuous and differentiable (a requirement for using Gradient-based Methods) while the Target Function representation is Discrete, hence the ANN essentially has to learn to interpolate properly between Data Points
- Noise: the more the ANN learns the noise, the less it will be able to generalize properly
ANN Training typically means solving a Minimization Problem in the Network Parameters State Space, involving a Similarity Measure between the ANN Transfer Function for a given Network Parametrization and the Target, i.e. our noisy Target Function sampling.
Let’s consider this Similarity Measure as an Objective Function, defined with respect to a subset of Training Data Points which we call a Batch, as a sum of single contributions
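A minimal sketch of such a Batch Objective Function, assuming a generic per-sample loss \ell (e.g. a squared error); the symbol choices are illustrative assumptions:

```latex
% Batch Objective: sum of single per-sample contributions over the Batch B
% (often normalized by the Batch Size |B| to obtain an average).
L_{\mathcal{B}}(\theta) = \sum_{i \in \mathcal{B}} \ell\big( f_{\theta}(x_i),\, y_i \big)
```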

Training typically consists of 2 types of information propagation across the Network:
- Forward Information propagation, aimed at performing an Inference starting from some Input (i.e. Training Data)
- Backward Information propagation, aimed at performing Parameters Fitting based on the Error (defined as a function of the difference between the Inferred Output and the Training Data)
The Forward Information Propagation happens in the Data Space, while the Parameters Space is fixed

The Backward Information Propagation, called Backpropagation, happens in the Parameters Space, while the Data Space is kept fixed
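To make the two propagation directions concrete, here is a minimal NumPy sketch assuming a one-hidden-layer network with tanh activation and a mean squared Error; the architecture, the toy data and all names are illustrative assumptions, not taken from the original:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative one-hidden-layer network; sizes and initialization are assumptions.
W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def forward(x):
    """Forward propagation: moves through the Data Space while the Parameters stay fixed."""
    h = np.tanh(x @ W1 + b1)      # hidden activations
    y_hat = h @ W2 + b2           # network output (Inference)
    return y_hat, h

def backward(x, y, y_hat, h):
    """Backpropagation: gradients of the mean squared Error with respect to the Parameters."""
    n = x.shape[0]
    d_out = 2.0 * (y_hat - y) / n            # dError/dy_hat for the MSE
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)    # backprop through the tanh activation
    dW1 = x.T @ d_h
    db1 = d_h.sum(axis=0)
    return dW1, db1, dW2, db2

# One Forward + Backward Step on a synthetic Batch (noisy samples of a toy target).
x = rng.uniform(-1, 1, size=(32, 1))
y = np.sin(3 * x) + 0.1 * rng.normal(size=(32, 1))
y_hat, h = forward(x)
dW1, db1, dW2, db2 = backward(x, y, y_hat, h)
```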

Training is a 2-Level Iterative Process: one Iteration consists of a Forward Step, getting from Batch Data to Error, and a Backward Step, getting from Error to Parameters Update
- the first set of iterations is aimed at covering all the Training Data, batch by batch, so as to complete an Epoch
- the second set of iterations repeats this cycle Epoch by Epoch (see the nested-loop sketch below)
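Continuing the NumPy sketch above (the dataset size, number of Epochs, Batch Size and learning rate are illustrative assumptions), the two-level iteration can be written as a nested loop:

```python
# Two-level iterative Training: the inner loop covers the Training Data batch by batch
# (one Epoch), the outer loop repeats the cycle Epoch by Epoch.
x_train = rng.uniform(-1, 1, size=(512, 1))
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=(512, 1))

num_epochs, batch_size, lr = 100, 32, 1e-2
for epoch in range(num_epochs):                              # Epoch-by-Epoch cycle
    perm = rng.permutation(len(x_train))                     # reshuffle the Training Data
    for start in range(0, len(x_train), batch_size):         # Batch-by-Batch iterations
        idx = perm[start:start + batch_size]
        xb, yb = x_train[idx], y_train[idx]
        y_hat, h = forward(xb)                               # Forward Step: Batch Data -> Error
        dW1, db1, dW2, db2 = backward(xb, yb, y_hat, h)      # Backward Step: Error -> gradients
        W1 -= lr * dW1; b1 -= lr * db1                       # Parameters Update
        W2 -= lr * dW2; b2 -= lr * db2
```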
When the Batch Size is equal to the Training Set Size, then the Standard Gradient Descent Method is used and the Computed Gradient is the True Gradient: the Gradient Computed with respect to all the Information available. In this case 1 Iteration = 1 Epoch.
This algorithm allows for the best convergence behaviour; however, it is significantly expensive from both a memory and a computational point of view, as all the available information must be processed before performing a Parameters Update and hence a possible performance improvement.
The cost of an Update Step grows with the amount of information used to compute it.
A strategy could then be to use a reduced set of information for each Update Step, i.e. to reduce the Batch Size, thereby computing an Approximation of the True Gradient.
Compared to the True Gradient, the Approximate Gradient has a higher Noise / Signal Ratio, since the averaging effect over fewer samples is weaker. Furthermore, with this method the number of Iterations needed to complete an Epoch increases, but each Iteration becomes cheaper from a computational and memory point of view.
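As a sketch of this trade-off, assuming the Batch \mathcal{B} is sampled uniformly from the N Training Data Points (the averaged form and the variance scaling are standard assumptions, not taken from the original):

```latex
% Approximate (mini-batch) gradient vs True (full-data) gradient: unbiased on average,
% but with a variance that grows as the Batch Size |B| shrinks
% (weaker averaging => higher noise / signal ratio).
g_{\mathcal{B}}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\theta}\, \ell_i(\theta),
\qquad
\mathbb{E}_{\mathcal{B}}\big[ g_{\mathcal{B}}(\theta) \big]
  = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta}\, \ell_i(\theta),
\qquad
\operatorname{Var}\big[ g_{\mathcal{B}}(\theta) \big] \propto \frac{1}{|\mathcal{B}|}
```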