Minimizing the error between a model and a training data set is a fundamental practice in data science. After all, if a model doesn’t match the training data well, then it clearly won’t have any predictive power.
For examples of this concept, see Understanding the Fundamentals of Linear Regression and Understanding Multiple Regression. In both cases a model of a specific type (either linear regression or multiple regression) was fit to a data set by minimizing the error.
In order to minimize the error in the model, it’s also necessary to be able to calculate the error, and to do so in a way that captures the complexity of the situation while staying sensitive to the big-picture concerns of the data scientist. This requires a well-written cost function, as was described in Cost Functions: The Underpinnings of Machine Learning.
Once the model type is selected and the cost function is written, the final piece is implementing a tool that minimizes the cost function, thus fitting the model to the data. That tool varies the parameters of the model as needed to find the best fit.
Gradient descent is a powerful tool for this role.
What is gradient descent?
Let’s start explaining gradient descent by thinking through an example. Imagine a parabola. Parabolas have both constantly changing slopes and a minimum point. The minimum point occurs where the slope is zero, since the slope is negative on one side of the minimum and positive on the other.
For those who are familiar with calculus, this is equivalent to saying that the minimum occurs at the point where the derivative of the equation is equal to zero.
Now imagine that we are starting at a certain point and want to get to the minimum. We don’t need to specify that point right now, as this is all conceptual. We can calculate the slope in either direction from our starting point, determine which slope is positive and which is negative, and move in the direction of the negative slope. That will automatically bring us closer to the minimum.
Gradually and repeatedly moving in that direction will eventually find the minimum point.
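That walk-downhill idea can be sketched in a few lines of Python. The particular parabola, starting point, and step size below are illustrative assumptions, not part of the original example.

```python
def slope(x):
    # Derivative of the example parabola f(x) = (x - 3) ** 2.
    # Negative to the left of the minimum, positive to the right.
    return 2 * (x - 3)

x = 10.0         # arbitrary starting point
step_size = 0.1  # how far to move each iteration

# Gradually and repeatedly move against the slope.
for _ in range(200):
    x = x - step_size * slope(x)

print(round(x, 4))  # → 3.0, the minimum of the parabola
```

Each pass through the loop nudges `x` a little further downhill, so the updates naturally shrink as the slope flattens near the minimum.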
This is the process of gradient descent. Gradient descent is an algorithm that 1) automatically identifies the slope in all directions at any given point, and 2) adjusts the parameters of the equation to move in the direction of the negative slope. This gradually brings you to a minimum point.
Using gradient descent to find the minimum point of a parabola is a little silly; it’s a wildly overpowered tool for that simple example. But what happens when there are many variables, with many different slopes in many different directions? This is where the power of automating the process becomes clear.
How does gradient descent work?
Gradient descent automatically performs all of the steps described in our previous parabola example. Specifically it:
- Calculates the partial derivative of each variable at the point in question¹,
- Combines the partial derivatives with respect to each variable to identify the direction of most negative slope² across the vector space,
- Steps in the direction of the most negative slope³,
- Repeats the process until it finds the minimum point.
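The four steps above can be sketched in Python using finite differences to estimate each partial derivative. The two-variable cost surface and all of the numbers here are made up for illustration.

```python
def cost(params):
    x, y = params
    # Example bowl-shaped surface with its minimum at (2, -1).
    return (x - 2) ** 2 + (y + 1) ** 2

def gradient(params, h=1e-6):
    # Step 1: estimate the partial derivative for each variable.
    # Step 2: collected together, they point in the uphill direction.
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += h
        grads.append((cost(bumped) - cost(params)) / h)
    return grads

params = [5.0, 5.0]
step_size = 0.1

for _ in range(500):  # Step 4: repeat until (near) the minimum
    g = gradient(params)
    # Step 3: step against the gradient, toward the most negative slope.
    params = [p - step_size * gi for p, gi in zip(params, g)]

print([round(p, 3) for p in params])  # → [2.0, -1.0]
```

Note that the gradient is recomputed from scratch at every step; the algorithm never knows where the minimum is, only which way is downhill from where it currently stands.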
This process doesn’t instantly find the minimum point. The partial derivatives don’t contain enough information to tell the gradient descent algorithm where the minimum point is; they only tell it which direction to move. As a result, gradient descent must estimate the best step size, move in that direction, recalculate the partial derivatives, estimate the best step size, move in that direction…and so on, until it eventually does find the minimum.
This raises the question of step size. How does the gradient descent algorithm choose a step size?
How do I choose a step size?
Really, this can be a tricky choice. It’s commonly described as more of an art than a science.
Small steps are the least likely to overshoot and miss the minimum point, which makes them the most accurate. On the other hand, they can lead to incredibly time-consuming searches.
Imagine running gradient descent, and wanting to be so sure that you accurately found the minimum that you use a minuscule step size and the algorithm needs to iterate a billion times. That’s not good.
But, at the same time, you do want to keep the step size small enough to make sure you don’t overshoot.
Some possible solutions to this predicament include:
- You could gradually reduce the step size over time. This works because there’s very little risk of overshooting the minimum point in the beginning. You can use a large step size to do a coarse search of the entire parameter space. Then, as the algorithm becomes more confident about the general location of the minimum, it starts using smaller step sizes to narrow in on the specific point.
- You can also calculate the step size that results in the minimum point in that direction, and make that step. In this way the model identifies the direction to move, then identifies the optimal distance to move in that direction, and does so. Once there, it searches for the new optimal direction and distance. This process dramatically reduces the number of iterations necessary to find the minimum.
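The first of those strategies, shrinking the step size over time, might be sketched like this. The 1/i decay schedule and the example parabola are common illustrative choices, not something prescribed above.

```python
def slope(x):
    # Derivative of the example parabola f(x) = (x - 4) ** 2.
    return 2 * (x - 4)

x = 20.0
initial_step = 1.0

for i in range(1, 501):
    # Large, coarse steps early on; small, careful steps later.
    step_size = initial_step / i
    x = x - step_size * slope(x)

print(round(x, 3))  # → 4.0
```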
Unfortunately, neither of these approaches is a clear winner.
Gradually reducing the step size doesn’t work as well as desired, because the reduction may happen either too quickly or too slowly. Reducing the step size too quickly results in the very problem this approach was trying to avoid: tiny step sizes and many iterations. Reducing it too slowly leads to the other problem, overshooting the minimum point.
Calculating the optimal step size works fantastically, but is very computationally expensive. It does reduce the number of iterations, but each iteration takes so long it doesn’t significantly reduce computation time.
One great solution is to use a pseudo-version of calculating the optimal step size. This can be done by providing the algorithm with an array of possible step sizes. The algorithm then iterates through those possible step sizes, finds the one with the best result, and makes that step. It’s more precise than automatically having the step size change over time and more time efficient than calculating the optimal step size at each point.
It provides a flexible solution that works well enough at each point without being too computationally expensive.
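That “array of candidate step sizes” approach might be sketched like this; the candidate list and the one-dimensional cost function are illustrative assumptions.

```python
def cost(x):
    # Example cost curve with its minimum at x = 5.
    return (x - 5) ** 2

def slope(x):
    return 2 * (x - 5)

candidate_steps = [0.8, 0.3, 0.05]  # supplied to the algorithm up front
x = 0.0

for _ in range(50):
    g = slope(x)
    # Try each candidate step size and keep whichever lands at the lowest cost.
    trials = [x - s * g for s in candidate_steps]
    x = min(trials, key=cost)

print(round(x, 4))  # → 5.0
```

Evaluating three candidate points per iteration is far cheaper than solving for the exact optimal step, yet it still adapts the step size to wherever the search currently is.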
What are the limitations of gradient descent?
The biggest limitation of gradient descent is computation time. Performing this process on complex models and large data sets can take a very long time. This is partly because the gradient must be calculated for the entire data set at each step.
The most common solution to this problem is stochastic gradient descent.
What is stochastic gradient descent?
Stochastic gradient descent is based on the assumption that the errors at each point in the parameter space are additive. The error at point one can be added to the error at point two which can be added to the error at point three, and so on for all of the points.
This means that the model doesn’t need to calculate the error at all points simultaneously (a computationally expensive process). Instead, it can calculate the error at each point individually, and sum them together.
By the way, this assumption is almost always true. It’s best practice to think for a moment to ensure that it’s true in your problem rather than simply assuming it is, but you’ll find that it’s true the vast majority of the time.
By using that assumption, stochastic gradient descent is able to calculate the error at each point sequentially, which dramatically increases the speed of each step in the gradient descent process. It is still an iterative process, identifying a direction and moving towards the minimum one step at a time.
The downside of stochastic gradient descent is that it doesn’t find the true minimum as reliably. Instead it has a tendency to get extremely close, then circle the minimum forever. To avoid this, it’s important to set an improvement threshold. If repeated iterations through the gradient descent function don’t yield significant reductions in the cost function, then you’re merely circling the minimum point. It’s time to quit the gradient descent process, and accept the available results.
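Here is a minimal stochastic-gradient-descent sketch that fits a line one data point at a time and uses the improvement threshold described above as its stopping rule. The synthetic data, learning rate, and threshold value are all illustrative assumptions.

```python
import random

random.seed(0)
# Synthetic training data drawn from the line y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(-10, 11)]

slope_p, intercept = 0.0, 0.0  # model parameters to fit
step_size = 0.005
threshold = 1e-9               # "significant reduction" cutoff

def total_cost():
    # The errors are additive: sum the squared error at each point.
    return sum((slope_p * x + intercept - y) ** 2 for x, y in data)

previous = total_cost()
for epoch in range(1000):
    random.shuffle(data)
    for x, y in data:  # update from one point at a time, not the whole set
        error = slope_p * x + intercept - y
        slope_p -= step_size * 2 * error * x
        intercept -= step_size * 2 * error
    current = total_cost()
    if abs(previous - current) < threshold:  # merely circling the minimum
        break
    previous = current

print(round(slope_p, 2), round(intercept, 2))  # → 2.0 1.0
```

Without the threshold check, the loop would keep shuffling and stepping long after the parameters stopped meaningfully improving.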
Bringing It All Together
Gradient descent is an important tool in generating predictive models for data science. It is a common optimization approach used to minimize a cost function, thereby finding the optimal parameters in your models. For example, it’s commonly used to fit linear regression or multiple regression models to the training data set.
This algorithm works by calculating the slope of the curve in every direction at its current point in the parameter space and adjusting parameters to move in the direction of the most negative slope. For those familiar with calculus, it does this with partial derivatives.
Determining the step size is a fundamental challenge of using gradient descent. Small step sizes ensure finding the true minimum but also yield many, many iterations and long computation times. Large step sizes run faster, but are more likely to overshoot the minimum point. A common solution is to provide the model with a few different step sizes to choose from. It then calculates the improvement with each step size and moves that far. This increases accuracy beyond using a large step size while simultaneously avoiding the computation time challenges of small steps.
Optimizing parameter sets using gradient descent can be a very time consuming process. This is because it has to calculate the error on the entire data set, all at the same time. Stochastic gradient descent avoids this problem by assuming that the errors are additive. In this way it can calculate the error at each point sequentially, reducing the computational demands. It yields significantly faster results.
[1] For those who haven’t studied calculus, a partial derivative is the slope of the equation with respect to a single variable, holding the others constant. For instance, the partial derivative with respect to B of A = 0.25 + 0.9 B + B C + 1.2 C is equal to 0.9 + C.
[2] "Most negative slope" refers to the fact that there will probably be many directions with negative slope. In order to find the minimum as fast as possible, gradient descent uses the direction with the highest absolute value of the negative slope.
[3] How far does it step? How does it know the step size to use? Good questions. Keep reading, they will be addressed soon.