When taking your first steps in Machine Learning, there is a lot to study and understand. Some ML methods can also seem very similar to each other, and it can be hard to grasp what really distinguishes them. This was my case when I first approached Regularization and Gradient Descent: they seemed too similar for me to tell them apart. The fact is that both Regularization and Gradient Descent involve cost functions; the differences lie in the kind of cost function involved and in the purpose for which we use each method.
In this article, I’ll try to clarify the differences, with the aim of helping you better understand these methods.
1. Regularization
In Machine Learning, the regularization methods we typically apply are Lasso and Ridge (also called L1 and L2, respectively), and we use them when the "standard" model overfits.
Consider, for example, a Simple Linear Regression model: if you get a good value of the coefficient of determination (e.g., near 1) on the train set but a bad one (e.g., near 0) on the test set, you are facing overfitting, and applying a Lasso or Ridge Regression model can lead to good results even on the test set. If you don’t know whether to use the Lasso or Ridge method, you can read my article here:
In the context of Machine Learning, regularization is the process that shrinks the coefficients toward zero, "discouraging" the model from learning an overly complex solution; this is achieved through the so-called cost functions, which include a penalty term on the coefficients.
Lasso and Ridge have "pre-defined" cost functions; that is, they are "built" with the penalty "inside." Lasso minimizes the equation of the linear regression model plus a penalty equal to the absolute value of the magnitude of the coefficients, while Ridge minimizes the equation of the linear regression model plus a penalty equal to the squared magnitude of the coefficients.
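Written out explicitly (a standard formulation, not spelled out in the text above: λ is the regularization strength, βⱼ are the coefficients, and the first term is the usual least-squares loss), the two cost functions look like this:

J_Lasso = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|

J_Ridge = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²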
Since these are supervised learning techniques (if you want to know more about supervised and unsupervised learning, you can read my article here), our goal is to find the value of the regularization hyperparameter that best avoids overfitting. This value can range from 0 (in which case we "return" to the simple linear regression model) toward infinity (in which case the coefficients are heavily penalized by the algorithm).
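As a quick illustration, here is a minimal sketch with scikit-learn: the data are synthetic and the alpha values (scikit-learn’s name for the regularization hyperparameter, the λ above) are arbitrary placeholders you would tune on your own dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split

# Toy data: y depends on a couple of features plus noise (placeholder for a real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [
    ("Linear", LinearRegression()),
    ("Lasso", Lasso(alpha=0.1)),   # alpha is the regularization hyperparameter
    ("Ridge", Ridge(alpha=1.0)),
]

for name, model in models:
    model.fit(X_train, y_train)
    # Compare R² on train vs. test: a large gap suggests overfitting
    print(f"{name}: train R² = {model.score(X_train, y_train):.2f}, "
          f"test R² = {model.score(X_test, y_test):.2f}")
```

With alpha set to 0 we are back to plain linear regression; increasing alpha penalizes the coefficients more and more strongly.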
2. Gradient Descent
Gradient Descent is a learning algorithm that works by minimizing a given cost function; so the main differences between Gradient Descent and regularization are:
- regularization methods have a "pre-defined" cost function, unlike Gradient Descent, which works with whatever cost function it is given (we’ll see later how it works)
- regularization is used when a model overfits, unlike Gradient Descent
There is an important concept to have in mind: models learn by minimizing a cost function, and this is why Gradient Descent is useful.
As you may know, in mathematics, when we have to find the minimum of a function, we use the derivative. When the problem has multiple variables, we have to take the partial derivative with respect to each variable: simplifying a lot, the vector of these partial derivatives is the gradient.
So, the Gradient Descent algorithm uses the gradient (that is, the derivatives in multiple dimensions) to search for the minimum of a function, moving step by step over multiple iterations until it reaches it.
For example, suppose we have a function f(x), where x is a tuple of several variables: x = (x1, x2, …, xn). Suppose that the gradient of f(x) is given by ∇f(x) (the inverted triangle is the symbol that identifies the gradient). At any iteration t, we’ll denote the value of the tuple x by x[t]. At every iteration t, the Gradient Descent update works like this:
x[t] = x[t-1] – 𝜂∇f(x[t-1])
where 𝜂 is called the learning rate (and it’s a hyperparameter). At the end of this process, the algorithm will have found the values of the variables (x1, x2, …, xn) that minimize the function f(x) (see also here for more details).
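Here is a minimal sketch of this update rule, assuming a simple two-variable function f(x1, x2) = x1² + x2² whose gradient we can write by hand; the learning rate and the number of iterations are arbitrary choices for the example.

```python
import numpy as np

def f(x):
    """Toy cost function: f(x1, x2) = x1^2 + x2^2 (its minimum is at (0, 0))."""
    return x[0] ** 2 + x[1] ** 2

def grad_f(x):
    """Gradient of f: the vector of partial derivatives (2*x1, 2*x2)."""
    return np.array([2 * x[0], 2 * x[1]])

eta = 0.1                      # learning rate (a hyperparameter)
x = np.array([3.0, -4.0])      # arbitrary starting point x[0]

for t in range(100):           # iterations t = 1, 2, ...
    x = x - eta * grad_f(x)    # x[t] = x[t-1] - 𝜂 * ∇f(x[t-1])

print(x, f(x))                 # x approaches (0, 0), where f is minimized
```

At each iteration, the point moves a small step in the direction opposite to the gradient, which is the direction in which the function decreases most steeply.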
But, in the end, what have we found? As usual in Machine Learning, we found the function that best fits the given data, and we found it by minimizing a cost function.
Conclusions
The aim of this article was to clarify the difference between Gradient Descent and Regularization; summarizing, these differences are:
- Regularization methods have a "pre-defined" cost function, unlike Gradient Descent, which minimizes whatever cost function it is given by following its gradient.
- Regularization is used when a model overfits, unlike Gradient Descent, which is used regardless of overfitting (we may find that the model overfits after validating it, but we do not use Gradient Descent to prevent overfitting).
Let’s connect!
LINKEDIN (send me a connection request)
If you want, you can subscribe to my mailing list so you can always stay updated!
Consider becoming a member: you could support me and other writers like me with no additional fee. Click here to become a member.