
Stochastic-, Batch-, and Mini-Batch Gradient Descent

Why do we need Stochastic, Batch, and Mini-Batch Gradient Descent when implementing Deep Neural Networks?

Artem Oppermann
Towards Data Science
13 min read · Apr 26, 2020


This is a detailed guide that should answer the questions of why and when we need Stochastic-, Batch-, and Mini-Batch Gradient Descent when implementing Deep Neural Networks.

In Short: We need these different ways of implementing gradient descent to address several issues we will almost certainly encounter when training Neural Networks: local minima and saddle points of the loss function, and noisy gradients.

More on that will be explained in the rest of this article.
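To make the distinction concrete before the detailed discussion, the sketch below shows how the three variants differ in practice on a toy linear-regression problem. The setup and the values of `lr`, `n_epochs`, and `batch_size` are illustrative choices for this sketch, not code or numbers from the article:

```python
import numpy as np

# Toy data: 1000 examples with 5 features and noisy linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def gradient(w, X_part, y_part):
    """Gradient of the mean squared error with respect to the weights."""
    error = X_part @ w - y_part
    return 2.0 * X_part.T @ error / len(y_part)

lr, n_epochs, batch_size = 0.05, 20, 32
w = np.zeros(5)

for epoch in range(n_epochs):
    # Batch gradient descent: one update per epoch, computed on the full dataset.
    # w -= lr * gradient(w, X, y)

    # Stochastic gradient descent: one update per single (shuffled) example.
    # for i in rng.permutation(len(y)):
    #     w -= lr * gradient(w, X[i:i + 1], y[i:i + 1])

    # Mini-batch gradient descent: one update per small batch of examples.
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        w -= lr * gradient(w, X[batch], y[batch])

print("learned weights:", np.round(w, 3))
print("true weights:   ", np.round(true_w, 3))
```

The trade-off between the variants is already visible here: batch gradient descent computes the most accurate gradient but performs only one update per epoch, stochastic gradient descent updates after every single example but with very noisy gradients, and mini-batch gradient descent sits in between.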

Table of Contents

1. Introduction: Let’s recap Gradient Descent
2. Common Problems when Training Neural Networks (local minima, saddle points, noisy gradients)
3. Batch Gradient Descent
4. Stochastic Gradient Descent
5. Mini-Batch Gradient Descent
6. Take-Home Message

1. Introduction: Let’s recap Gradient Descent
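As a quick reminder before we go through the problems and the variants: gradient descent updates the model parameters by taking a small step against the gradient of the loss. In a common notation:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta)$$

where $\theta$ denotes the network’s parameters, $\eta$ the learning rate, and $\nabla_{\theta} J(\theta)$ the gradient of the loss function $J$ with respect to the parameters. All three variants discussed below use exactly this update; they differ only in how much training data is used to estimate the gradient at each step.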

