Stochastic, Batch, and Mini-Batch Gradient Descent

Why do we need Stochastic, Batch, and Mini-Batch Gradient Descent when implementing Deep Neural Networks?

Artem Oppermann
Towards Data Science
13 min read · Apr 26, 2020


Source: www.unsplash.com

This is a detailed guide that should answer the questions of why and when we need Stochastic, Batch, and Mini-Batch Gradient Descent when implementing Deep Neural Networks.

In Short: We need these different ways of implementing gradient descent to address several issues that we will almost certainly encounter when training Neural Networks, namely local minima and saddle points of the loss function, and noisy gradients.

More on that will be explained in the following article — nice ;)

Table of Contents

  1. Introduction: Let’s recap Gradient Descent
  2. Common Problems when Training Neural Networks (local minima, saddle points, noisy gradients)
  3. Batch Gradient Descent
  4. Stochastic Gradient Descent
  5. Mini-Batch Gradient Descent
  6. Take-Home Message

1. Introduction: Let’s recap Gradient Descent

Before we address the different approaches to implementing gradient descent, I think it would be a good idea to refresh your memory on what gradient descent actually is.

When we train a neural network, we want it to learn to perform a specific task. This task can be as simple as predicting the expected demand for a product in a particular market, or as complex as classifying skin cancer.

Regardless of the task, our only goal during training is to minimize the objective/loss function. For predicting the expected demand, which is a regression task, this loss function would be the Mean Squared Error (MSE) loss function:

MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2

Eq. 1: Mean Squared Error Loss Function

Here, y_i is the true target value and \hat{y}_i is the network’s prediction for the i-th of N training samples.
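To make the formula concrete, here is a minimal NumPy sketch of the MSE loss; the function name and the demand numbers are just made up for illustration:

```python
import numpy as np

def mse_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Squared Error: the average squared difference between
    the true targets and the model's predictions."""
    return float(np.mean((y_true - y_pred) ** 2))

# Tiny made-up example: observed vs. predicted product demand
y_true = np.array([120.0, 95.0, 130.0])   # observed demand
y_pred = np.array([110.0, 100.0, 128.0])  # model predictions
print(mse_loss(y_true, y_pred))           # -> 43.0
```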

For classification tasks, we want to minimize the Cross-Entropy loss function:

CE = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log\left( \hat{y}_{i,k} \right)

Eq. 2: Cross-Entropy Loss Function

Here, y_{i,k} is 1 if class k is the true class of the i-th sample (one-hot encoding) and 0 otherwise, and \hat{y}_{i,k} is the predicted probability of class k for that sample.
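Again, purely for illustration, here is a minimal NumPy sketch of the categorical cross-entropy loss; it assumes one-hot encoded labels and predicted class probabilities, and the example values are made up:

```python
import numpy as np

def cross_entropy_loss(y_true: np.ndarray, y_pred: np.ndarray,
                       eps: float = 1e-12) -> float:
    """Categorical cross-entropy for one-hot labels `y_true` and
    predicted class probabilities `y_pred` (one row per sample)."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Tiny made-up example: two samples, three classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
print(cross_entropy_loss(y_true, y_pred))  # -> ~0.29
```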